atextcrawler.utils package

Submodules

atextcrawler.utils.annotation module

Convert html to plain text with annotations over character ranges.

class atextcrawler.utils.annotation.AnnotatingParser(*args, **kwargs)

Bases: html.parser.HTMLParser

Parse tagged text resulting in pure text and annotations.

The text is available in self.text and the annotations in self.annotations, which is a dict with these keys:


  • semantic_breaks: a mapping of offset positions where a new section begins to the nesting level of that sections; a section is whereever an (opening or closing) separating tag is placed in the raw html; for the separating flag of tags see tag.py

  • links: a mapping of hrefs to link texts obtained from anchor (a) tags; hyperlinks with nofollow rels are skipped

  • section_ids: map an offset position to the first id attribute (of any tag) at the beginning of a semantic section; this can later be used in a URL fragment for linking directly into this section

Internally, we put opening tags on self.stack and pop them when the first matching closing tag is encountered. We assume balanced tags (tidy html).

NB: all tags with semantic breaks have sep=True, i.e., they will have spaces around them so that the semantic breaks always sit on a space; the semantic break position p is the end of the last section, and the next section begins at p + 1.

The text always begins with a space (added if not present in the original), which is assigned a semantic break with default level 80 (if there is no semantic break tag at the beginning).

add_semantic_break(pos, lvl)

Add a semantic break of level lvl at position pos.

add_tag_id(pos)

Add and clear an id if the section just being closed has none yet.

pos is the start position of the current section, and the position where the id will be added.

Add an id only if we are not too far in the section’s text already.

close()

Finish by collecting results in dict self.annotations.

Add a link covering character range (i, self.pos).

From html attrs extract href and rel.

forget_tag_id()

Reset a tag id if it is too far behind in the text stream.

handle_data(text)

Called for each non-tag content between tags.

handle_endtag(tag)

Called for each closing tag.

handle_starttag(tag, attrs)

Called for each opening tag.

atextcrawler.utils.annotation.MAX_HREF_LENGTH = 200

Maximum length of an href. Longer links are discarded.

atextcrawler.utils.annotation.annotate(html)

Split html text into plain text with annotations (from AnnotatingParser).
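
A minimal usage sketch; the return shape (plain text plus the annotations dict described above) is an assumption, not documented in this section:

    from atextcrawler.utils.annotation import annotate

    html = '<h1 id="intro">Title</h1><p>First paragraph.</p>'
    text, annotations = annotate(html)  # return shape assumed
    annotations['tags']             # offset ranges -> tags
    annotations['semantic_breaks']  # positions -> section levels
    annotations['links']            # hrefs -> link texts
    annotations['section_ids']      # positions -> section id attributes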

atextcrawler.utils.annotation.annotations_remove_section(annotations, i, f)

Remove section (i, f) from annotations and return result.

atextcrawler.utils.annotation.clean_annotations(annotations: dict) → None

Remove void stuff from annotations.

atextcrawler.utils.annotation.cut_range(i, f, d, t_i, t_f)

Return the new coordinates of a text range (t_i,t_f) after cutting (i,f).

If (t_i,t_f) is fully within (i,f), return None, None.

atextcrawler.utils.annotation.get_tag_counts(tag_names, i, f, tags, text) → tuple[int, float, float]

Return the info on the share of characters covered with one of the tags.

Only consider the characters between i and f of string text.

Return the number of tags that overlap the specified region, the tag density in the region (the fraction of characters covered by any of the tags), and the average number of covered characters per tag.

NB: If more than one tag name is given, then the fractional share may exceed 1.

atextcrawler.utils.annotation.headline_probability(text, tags, lvl) → float

Estimate the probability that the text with tags is a headline.

The context is not considered: The question is not whether the text is a headline for the following text.

atextcrawler.utils.annotation.pack_annotations(annotations)

Pack annotations to a special JSON string, reducing their volume a little.

atextcrawler.utils.annotation.range_overlap(i1, f1, i2, f2)

Return the overlap of both ranges (None if there is none).
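
The intended behavior can be pictured with an equivalent sketch (an illustration, not the module's code):

    from typing import Optional

    def range_overlap_sketch(i1, f1, i2, f2) -> Optional[tuple[int, int]]:
        # The overlap is (maximum of the starts, minimum of the ends),
        # provided that interval is non-empty.
        i, f = max(i1, i2), min(f1, f2)
        return (i, f) if i < f else None

    range_overlap_sketch(0, 10, 5, 15)  # (5, 10)
    range_overlap_sketch(0, 4, 6, 9)    # None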

atextcrawler.utils.annotation.text_blacklist = ['previous', 'next', 'back', '↩︎']

Texts to ignore.

atextcrawler.utils.annotation.unpack_annotations(json_text: str) → dict

Unpack tag information from a string.
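
A round-trip sketch, assuming the two functions are inverses of each other:

    from atextcrawler.utils.annotation import pack_annotations, unpack_annotations

    packed = pack_annotations(annotations)  # annotations: dict as produced by annotate
    assert isinstance(packed, str)
    restored = unpack_annotations(packed)   # a dict again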

atextcrawler.utils.date_finder module

Find date expressions in a string.

atextcrawler.utils.date_finder.extract_dates(text: str, lang: Optional[str] = None) → list[datetime.datetime]

Extract dates from a string, optionally limiting formats to a language.

atextcrawler.utils.date_finder.extract_latest_date(text: str, lang: Optional[str] = None) → Optional[datetime.datetime]

Extract the latest date compatible with the lang from text.

Only consider dates in the past.
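
A usage sketch; lang restricts the accepted date formats:

    from atextcrawler.utils.date_finder import extract_dates, extract_latest_date

    text = 'Posted on 2021-03-05, last updated on 7 March 2021.'
    extract_dates(text, lang='en')        # all dates found in the string
    extract_latest_date(text, lang='en')  # the latest past date, or None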

atextcrawler.utils.durl module

Hyperlink parsing.

class atextcrawler.utils.durl.Durl(url: str, base: Optional[atextcrawler.utils.durl.Durl] = None, match_base: bool = False)

Bases: object

Decomposed URL, contains urllib.parse.SplitResult.

Instances of this class must be awaited during construction, e.g.:

my_durl = await Durl('http://www.example.com/whatever')

The given URL will be decomposed, validated and normalized. If the URL is invalid, we return None instead of an instance.

If the given base is None, the URL must be absolute and the hostname must be valid (DNS lookup).

If the given URL is not absolute, an already decomposed (and thus valid) base Durl must be given; otherwise the URL is invalid.

The base Durl can contain a path (but no arguments or fragments), in which case the URL - if not absolute - must begin with this path.

The scheme must be http or https. If the URL begins with ‘//’, ‘http:’ is prepended.

If the hostname is longer than 90 characters, the URL is invalid.

Default port numbers (80 for http, 443 for https) are removed.

The hostname is changed to lower case. Spaces in the hostname make the URL invalid.

URL fragments are removed.
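
A usage sketch of construction and normalization (the example URL is illustrative):

    from atextcrawler.utils.durl import Durl

    async def example():  # run with asyncio.run(example())
        durl = await Durl('http://www.Example.COM:80/path?q=1#frag')
        if durl:  # None is returned for invalid URLs
            durl.url()     # normalized: lowercased host, default port and fragment removed
            durl.site()    # the site (base URL)
            durl.domain()  # the domain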

domain() → str

Return the domain of the Durl (wrong in case of second-level domains).

has_path() → bool

Return whether the Durl has a non-trivial path.

pwa() → str

Return the (base-relative) path with args of the Durl.

replace_scheme(scheme: str) → None

Replace the scheme (must be ‘http’ or ‘https’).

site() → str

Return the site (base_url).

url() → str

Return the URL as string.

Sort links into a cleaned, an internal and an external dict.

The cleaned dict maps absolute URLs to char ranges and relations. The internal and external dicts map absolute URLs to relations and the linked text. The relations are link relations, e.g. rel="canonical".

If the base_url is set, it is used to distinguish internal and external links; if it is not set, the base_url is obtained from durl.

atextcrawler.utils.durl.get_ips(hostname: str) → set[str]

Return IPv4 and IPv6 addresses of the given hostname.

atextcrawler.utils.durl.get_url_variants(url: str) → list[str]

Return variants of the URL.

Replace http with https and vice versa; prepend or remove ‘www.’ to or from the beginning of the hostname.
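
For example, a sketch of the expected output (the order of the variants is not specified here):

    from atextcrawler.utils.durl import get_url_variants

    get_url_variants('http://www.example.com/a')
    # plausibly: ['http://www.example.com/a', 'https://www.example.com/a',
    #             'http://example.com/a', 'https://example.com/a']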

atextcrawler.utils.html module

Utilities for extracting information from html.

atextcrawler.utils.html.clean_body(body)

Clean an html body.

Remove unwanted tags (keeping their content); remove empty tags; remove and replace whitespaces in several ways.

In the end, the only whitespace character is the space, and there are no consecutive spaces.

atextcrawler.utils.html.clean_html(s: Optional[str]) → Optional[str]

Clean an html string.

Unescape HTML entities and replace whitespace characters with ' ' (ASCII char 0x20).

See also: https://www.lesinskis.com/python-unicode-whitespace.html

atextcrawler.utils.html.clean_page(html)

Remove unwanted tags including their content from html.

Drop tags in drop_tags as well as tags with a role in drop_roles. Also drop tags with attribute aria-hidden=true.

Return a BeautifulSoup object.

atextcrawler.utils.html.extract_title(html: str) → Optional[str]

Extract title tags from html returning their content as a string.

atextcrawler.utils.html.get_html_lang(html: str) → Optional[str]

Return the language, if any, found in the lang attribute of the html tag.
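
A usage sketch, following the docstring:

    from atextcrawler.utils.html import get_html_lang

    get_html_lang('<html lang="de"><body></body></html>')        # presumably 'de'
    get_html_lang('<html><body>no lang attribute</body></html>')  # None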

atextcrawler.utils.html.get_html_redirect(html: str) → Optional[str]

Return an html redirect in an http-equiv meta tag.

If none is found, return None.

atextcrawler.utils.html.whitespace_tag_tag(match_obj)

Helper function for removing whitespace between tags.

atextcrawler.utils.http module

Utility functions related to http.

Extract canonical and shortlink links from http headers.

durl must be the Durl of the fetched page, and site - if not None - must be the Site to which the page belongs.

Return a (default)dict with ‘canonical’ and ‘shortlink’ as keys. The values default to None.

atextcrawler.utils.json module

Custom JSON encoder.

class atextcrawler.utils.json.JSONEncoderExt(*, skipkeys=False, ensure_ascii=True, check_circular=True, allow_nan=True, sort_keys=False, indent=None, separators=None, default=None)

Bases: json.encoder.JSONEncoder

Extended JSON encoder with encoding of sets as lists.

default(obj)

Encode sets as lists, and everything else as JSONEncoder does by default.

atextcrawler.utils.json.json_dumps(obj)

Encode an object to a JSON string using JSONEncoderExt.

atextcrawler.utils.json.json_loads(s, *, cls=None, object_hook=None, parse_float=None, parse_int=None, parse_constant=None, object_pairs_hook=None, **kw)

Decode JSON strings, just as json.loads does by default.
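
A usage sketch; sets survive encoding as JSON lists:

    from atextcrawler.utils.json import json_dumps, json_loads

    s = json_dumps({'ids': {1, 2, 3}})  # the set is encoded as a JSON list
    data = json_loads(s)                # standard decoding; the list stays a list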

atextcrawler.utils.lang module

Utility functions related to languages.

atextcrawler.utils.lang.clean_lang(lang: Optional[str]) → Optional[str]

Clean a language code string: it must be an ISO 639-1 code or None.

atextcrawler.utils.lang.extract_content_language(text: str) → Optional[str]

Extract the language from a text.

atextcrawler.utils.muse module

Parse muse-formatted plaintext (delivered by amusewiki).

atextcrawler.utils.muse.amusewiki_fields = ['author', 'title', 'lang', 'LISTtitle', 'subtitle', 'SORTauthors', 'SORTtopics', 'date', 'pubdate', 'notes', 'source', 'publisher', 'isbn', 'seriesname', 'seriesnumber']

Amusewiki fields (cf. https://amusewiki.org/library/manual).

atextcrawler.utils.muse.extract_muse_meta(meta, body) → dict

Extract meta information from muse header and muse body.

atextcrawler.utils.muse.parse_head(text: str) → dict

Parse a MUSE head and return a dict mapping field names to values.

atextcrawler.utils.muse.parse_muse(text: str) → Optional[tuple[dict, str]]

Parse a MUSE string returning meta information and the text body.

atextcrawler.utils.muse.split_head_body(text: str) → tuple[str, str]

Split a MUSE string into head and body and return both.
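
A usage sketch; parse_muse returns None if the text cannot be parsed:

    from atextcrawler.utils.muse import parse_muse

    result = parse_muse(muse_text)  # muse_text: a muse-formatted string (assumed given)
    if result:
        meta, body = result
        meta.get('title'), meta.get('author')  # fields from the muse head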

atextcrawler.utils.probe module

Utility functions for probing / sampling.

atextcrawler.utils.probe.extract_samples(items, n=5)

Extract up to n sample elements from the given dict or list.

If items is a dict, return sample elements from its list of keys.
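
For example (which elements are chosen is not specified here):

    from atextcrawler.utils.probe import extract_samples

    extract_samples([10, 20, 30, 40, 50, 60, 70], n=3)  # up to 3 of the list elements
    extract_samples({'a': 1, 'b': 2, 'c': 3}, n=2)      # samples taken from the keys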

atextcrawler.utils.section module

Operations on text sections.

Semantic breaks are character positions within a text (0-offset) where a new section begins. More precisely, the character position contains a space, and the semantically breaking tag (e.g., an h1 or a br) begins only at the next position.

Each semantic break has a level, which means breaking strength. The lower the level (e.g., h1 has a lower level than h2), the stronger the break.

Implicitly, if position 0 has no semantic break, a semantic break at position 0 with level 80 is added.

Semantic breaks can be used to split a text into sections. The lower the maximum level of the semantic breaks taken into account, the coarser the segmentation and the fewer the sections. Each section is given the level of the semantic break at its beginning.

From another point of view, sections have levels indicating the segmentation depth.

The levels for html tags are defined in tag.py.

The semantic_breaks argument in the functions below is a dict mapping the character position of the semantic break to the level of a section beginning at this position (if segmentation is done at this or a higher level).

atextcrawler.utils.section.concat_section_texts(text, semantic_breaks, min_len=2000)

Try to concatenate consecutive sections into chunks with a minimum length.

Yield (section_ids, combined_text).

atextcrawler.utils.section.iter_sections(text, semantic_breaks, max_level=59)

Iterate over sections, limiting to those with a maximum level.

Yield (start_pos, end_pos, level, text). The text is assumed to have its first semantic break at position 0.
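
A usage sketch with an invented semantic_breaks dict (positions and levels are illustrative only):

    from atextcrawler.utils.section import concat_section_texts, iter_sections

    text = ' Intro text here. More text in this part. Next section text.'
    semantic_breaks = {0: 80, 17: 32, 42: 32}  # invented positions and levels

    for start, end, level, sect_text in iter_sections(text, semantic_breaks):
        print(start, end, level, sect_text)

    for section_ids, chunk in concat_section_texts(text, semantic_breaks, min_len=40):
        print(section_ids, len(chunk))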

atextcrawler.utils.similarity module

Text similarity with simhash.

atextcrawler.utils.similarity.create_simhash(index: simhash.SimhashIndex, resource_id: int, simhash_instance: simhash.Simhash) → int

Add a resource with given id and simhash to a simhash index.

Return the simhash value shifted into PostgreSQL’s bigint range.

(The simhash field of the resource’s database entry is not updated.)

atextcrawler.utils.similarity.get_features(txt: str) → list[str]

Extract features from string for use with Simhash.

atextcrawler.utils.similarity.get_simhash(text: str) → simhash.Simhash

Return the Simhash of the given text.

async atextcrawler.utils.similarity.get_simhash_index(conn: asyncpg.connection.Connection, site_id: int) → simhash.SimhashIndex

Return a simhash index with hashes of all stored resources of the site.

atextcrawler.utils.similarity.postgresql_bigint_offset = 9223372036854775808

Subtract this number to get a PostgreSQL bigint from a 64-bit int.

atextcrawler.utils.similarity.search_simhash(index: simhash.SimhashIndex, simhash_inst: simhash.Simhash) → list[int]

Return the ids of similar resources from the index.

atextcrawler.utils.similarity.simhash_from_bigint(bigint: int) → simhash.Simhash

Convert a simhash from PostgreSQL’s bigint to a Simhash instance.

atextcrawler.utils.similarity.simhash_to_bigint(simhash: simhash.Simhash) → int

Convert a simhash to PostgreSQL’s bigint value range.
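
The two conversions are inverses; a round-trip sketch (simhash values are unsigned 64-bit integers, PostgreSQL bigint is signed):

    from atextcrawler.utils.similarity import (
        get_simhash, simhash_from_bigint, simhash_to_bigint,
    )

    sh = get_simhash('some text to fingerprint')
    bigint = simhash_to_bigint(sh)     # shifted into the signed 64-bit range
    sh2 = simhash_from_bigint(bigint)  # back to a Simhash instance
    assert sh2.value == sh.value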

atextcrawler.utils.tag module

Information collections related to html tags.

atextcrawler.utils.tag.all_self_closing_tags = ('area', 'base', 'br', 'col', 'embed', 'hr', 'img', 'input', 'link', 'meta', 'param', 'source', 'track', 'wbr')

All self-closing tags of the html standard.

atextcrawler.utils.tag.drop_roles = ('banner', 'complementary', 'contentinfo', 'dialog', 'figure', 'form', 'img', 'search', 'switch')

Drop tags with these aria roles.

atextcrawler.utils.tag.drop_tags = ['applet', 'area', 'audio', 'base', 'basefont', 'bdi', 'bdo', 'button', 'canvas', 'code', 'command', 'data', 'datalist', 'dir', 'embed', 'fieldset', 'figure', 'form', 'frame', 'frameset', 'iframe', 'img', 'input', 'label', 'legend', 'map', 'menuitem', 'meter', 'noframes', 'noscript', 'object', 'optgroup', 'option', 'param', 'picture', 'progress', 'rp', 'rt', 'ruby', 'samp', 'script', 'select', 'source', 'style', 'svg', 'template', 'textarea', 'track', 'var', 'video']

Tags to drop, including their content.

atextcrawler.utils.tag.keep_tags = {'a': (0, 0, ''), 'abbr': (0, 0, 'st'), 'acronym': (0, 0, 'st'), 'address': (1, 0, 'm'), 'article': (1, 15, ''), 'aside': (1, 0, 'd'), 'b': (0, 0, 'st'), 'blockquote': (1, 65, 'q'), 'br': (1, 80, ''), 'caption': (1, 68, ''), 'center': (1, 50, ''), 'cite': (1, 0, 'd'), 'col': (1, 75, ''), 'colgroup': (1, 73, ''), 'dd': (1, 70, 'li'), 'del': (0, 0, 'se'), 'details': (1, 0, 'd'), 'dfn': (0, 0, 'st'), 'div': (1, 60, ''), 'dl': (1, 70, 'l'), 'dt': (1, 70, 'li'), 'em': (0, 0, 'st'), 'figcaption': (1, 0, ''), 'font': (0, 0, 's'), 'footer': (1, 15, ''), 'h1': (1, 30, ''), 'h2': (1, 32, ''), 'h3': (1, 34, ''), 'h4': (1, 36, ''), 'h5': (1, 38, ''), 'h6': (1, 40, ''), 'header': (1, 15, ''), 'hr': (1, 30, ''), 'i': (0, 0, 'st'), 'ins': (0, 0, 'se'), 'li': (1, 75, 'li'), 'main': (1, 10, ''), 'mark': (0, 0, 's'), 'nav': (1, 0, ''), 'ol': (1, 70, 'l'), 'p': (1, 60, ''), 'pre': (1, 65, 'q'), 'q': (1, 0, 'q'), 's': (0, 0, ''), 'section': (1, 24, ''), 'small': (0, 0, 'd'), 'span': (0, 0, 's'), 'strike': (0, 0, 'se'), 'strong': (0, 0, 'st'), 'sub': (0, 0, ''), 'summary': (1, 20, 'm'), 'sup': (0, 0, ''), 'table': (1, 65, ''), 'tbody': (1, 70, ''), 'td': (1, 78, ''), 'tfoot': (1, 70, ''), 'th': (1, 75, ''), 'thead': (1, 70, ''), 'time': (0, 0, 'm'), 'tr': (1, 75, ''), 'u': (0, 0, 's'), 'ul': (1, 70, 'l')}

Tags to keep for annotation, and their properties.

The properties are:

  • sep: whether to separate text at both sides of the tag with a space

  • lvl: structural depth level of content of this tag;

    the paragraph level is 60; headings are below 60, listings above; a div below the tag will usually have the tag’s depth + 1

  • sem: semantic categories, zero or more of: s=span, l=listing, i=list_item, t=term, e=edit, d=details, q=quote, m=meta, x=exclude (see the example below)
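
For example, reading one entry (values taken from the mapping above):

    from atextcrawler.utils.tag import keep_tags

    sep, lvl, sem = keep_tags['h1']  # (1, 30, '')
    # sep=1: separate the tag's text with spaces on both sides
    # lvl=30: heading depth, a stronger break than a paragraph (lvl 60)
    # sem='': no semantic category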

atextcrawler.utils.tag.self_closing_tags = ('br', 'hr')

Those among keep_tags which are self-closing.

Module contents