atextcrawler.utils package

Submodules

atextcrawler.utils.annotation module

Convert html to plain text with annotations over character ranges.

class atextcrawler.utils.annotation.AnnotatingParser(*args, **kwargs)

Bases: html.parser.HTMLParser

Parse tagged text resulting in pure text and annotations.

The text is available in self.text and the annotations in self.annotations, which is a dict with these keys:


  • semantic_breaks: a mapping of offset positions where a new section begins to the nesting level of that sections; a section is whereever an (opening or closing) separating tag is placed in the raw html; for the separating flag of tags see tag.py

  • links: a mapping of hrefs to link texts obtained from anchor (a) tags; hyperlinks with nofollow rels are skipped

  • section_ids: map an offset position to the first id attribute (of any tag) at the beginning of a semantic section; this can later be used in a URL fragment for linking directly into this section

Internally, we put opening tags on self.stack and pop them when the first matching closing tag is encountered. We assume balanced tags (tidy html).

NB: all tags with semantic breaks have sep=True, i.e., they will have spaces around them so that the semantic breaks always sit on a space; the semantic break position p is the end of the last section, and the next section begins at p + 1.

The text always begins with a space (added if not present in the original), which is assigned a semantic break with default level 80 (if there is no semantic break tag at the beginning).

add_semantic_break(pos, lvl)

Add a semantic break of level lvl at position pos.

add_tag_id(pos)

Add and clear an id if the section just being closed has none yet.

pos is the start position of the current section, and the position where the id will be added.

Add an id only if we are not too far in the section’s text already.

close()

Finish by collecting results in dict self.annotations.

Add a link covering character range (i, self.pos).

From html attrs extract href and rel.

forget_tag_id()

Reset a tag id if it is too far behind in the text stream.

handle_data(text)

Called for each non-tag content between tags.

handle_endtag(tag)

Called for each closing tag.

handle_starttag(tag, attrs)

Called for each opening tag.

atextcrawler.utils.annotation.MAX_HREF_LENGTH = 200

Maximum length of an href. Longer links are discarded.

atextcrawler.utils.annotation.annotate(html)

Split html text into plain text with annotations (from AnnotatingParser).
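
A minimal usage sketch; the return shape (plain text plus the annotations dict described above) is an assumption, not documented in this section:

    from atextcrawler.utils.annotation import annotate

    html = '<h1 id="intro">Title</h1><p>First paragraph.</p>'
    text, annotations = annotate(html)  # return shape assumed
    annotations['tags']             # offset ranges -> tags
    annotations['semantic_breaks']  # positions -> section levels
    annotations['links']            # hrefs -> link texts
    annotations['section_ids']      # positions -> section id attributes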

atextcrawler.utils.annotation.annotations_remove_section(annotations, i, f)

Remove section (i, f) from annotations and return result.

atextcrawler.utils.annotation.clean_annotations(annotations: dict) → None

Remove void stuff from annotations.

atextcrawler.utils.annotation.cut_range(i, f, d, t_i, t_f)

Return the new coordinates of a text range (t_i,t_f) after cutting (i,f).

If (t_i,t_f) is fully within (i,f), return None, None.

atextcrawler.utils.annotation.get_tag_counts(tag_names, i, f, tags, text) → tuple[int, float, float]

Return the info on the share of characters covered with one of the tags.

Only consider the characters between i and f of string text.

Return the number of tags that overlap the specified region, the tag density in the region (the fraction of characters covered by any of the tags), and the average number of covered characters per tag.

NB: If more than one tag name is given, then the fractional share may exceed 1.

atextcrawler.utils.annotation.headline_probability(text, tags, lvl) → float

Estimate the probability that the text with tags is a headline.

The context is not considered: The question is not whether the text is a headline for the following text.

atextcrawler.utils.annotation.pack_annotations(annotations)

Pack annotations to a special JSON string, reducing their volume a little.

atextcrawler.utils.annotation.range_overlap(i1, f1, i2, f2)

Return the overlap of both ranges (None if there is none).
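
The intended behavior can be pictured with an equivalent sketch (an illustration, not the module's code):

    from typing import Optional

    def range_overlap_sketch(i1, f1, i2, f2) -> Optional[tuple[int, int]]:
        # The overlap is (maximum of the starts, minimum of the ends),
        # provided that interval is non-empty.
        i, f = max(i1, i2), min(f1, f2)
        return (i, f) if i < f else None

    range_overlap_sketch(0, 10, 5, 15)  # (5, 10)
    range_overlap_sketch(0, 4, 6, 9)    # None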

atextcrawler.utils.annotation.text_blacklist = ['previous', 'next', 'back', '↩︎']

Texts to ignore.

atextcrawler.utils.annotation.unpack_annotations(json_text: str) → dict

Unpack tag information from a string.
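
A round-trip sketch, assuming the two functions are inverses of each other:

    from atextcrawler.utils.annotation import pack_annotations, unpack_annotations

    packed = pack_annotations(annotations)  # annotations: dict as produced by annotate
    assert isinstance(packed, str)
    restored = unpack_annotations(packed)   # a dict again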

atextcrawler.utils.date_finder module

Find date expressions in a string.

atextcrawler.utils.date_finder.extract_dates(text: str, lang: Optional[str] = None) → list[datetime.datetime]

Extract dates from a string, optionally limiting formats to a language.

atextcrawler.utils.date_finder.extract_latest_date(text: str, lang: Optional[str] = None) → Optional[datetime.datetime]

Extract the latest date compatible with the lang from text.

Only consider dates in the past.
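
A usage sketch; lang restricts the accepted date formats:

    from atextcrawler.utils.date_finder import extract_dates, extract_latest_date

    text = 'Posted on 2021-03-05, last updated on 7 March 2021.'
    extract_dates(text, lang='en')        # all dates found in the string
    extract_latest_date(text, lang='en')  # the latest past date, or None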

atextcrawler.utils.durl module

Hyperlink parsing.

class atextcrawler.utils.durl.Durl(url: str, base: Optional[atextcrawler.utils.durl.Durl] = None, match_base: bool = False)

Bases: object

Decomposed URL, contains urllib.parse.SplitResult.

Instances of this class must be awaited during construction, e.g.:

my_durl = await Durl('http://www.example.com/whatever')

The given URL will be decomposed, validated and normalized. If the URL is invalid, we return None instead of an instance.

If the given base is None, the URL must be absolute and the hostname must be valid (DNS lookup).

If the given URL is not absolute, an already decomposed (and thus valid) base Durl must be given; otherwise the URL is invalid.

The base Durl can contain a path (but no arguments or fragments), in which case the URL - if not absolute - must begin with this path.

The scheme must be http or https. If the URL begins with ‘//’, ‘http:’ is prepended.

If the hostname is longer than 90 characters, the URL is invalid.

Default port numbers (80 for http, 443 for https) are removed.

The hostname is changed to lower case. Spaces in the hostname make the URL invalid.

URL fragments are removed.
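
A usage sketch of construction and normalization (the example URL is illustrative):

    from atextcrawler.utils.durl import Durl

    async def example():  # run with asyncio.run(example())
        durl = await Durl('http://www.Example.COM:80/path?q=1#frag')
        if durl:  # None is returned for invalid URLs
            durl.url()     # normalized: lowercased host, default port and fragment removed
            durl.site()    # the site (base URL)
            durl.domain()  # the domain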

domain() → str

Return the domain of the Durl (wrong in case of second-level domains).

has_path() → bool

Return whether the Durl has a non-trivial path.

pwa() → str

Return the (base-relative) path with args of the Durl.

replace_scheme(scheme: str) → None

Replace the scheme (must be ‘http’ or ‘https’).

site() → str

Return the site (base_url).

url() → str

Return the URL as string.

Sort links into a cleaned, an internal and an external dict.

The cleaned dict maps absolute URLs to char ranges and relations. The internal and external dicts map absolute URLs to relations and the linked text. The relations are link relations, e.g. rel="canonical".

If the base_url is set, it is used to distinguish internal and external links; if it is not set, the base_url is obtained from durl.

atextcrawler.utils.durl.get_ips(hostname: str) → set[str]

Return IPv4 and IPv6 addresses of the given hostname.

atextcrawler.utils.durl.get_url_variants(url: str) → list[str]

Return variants of the URL.

Replace http with https and vice versa; prepend or remove ‘www.’ to or from the beginning of the hostname.
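
For example, a sketch of the expected output (the order of the variants is not specified here):

    from atextcrawler.utils.durl import get_url_variants

    get_url_variants('http://www.example.com/a')
    # plausibly: ['http://www.example.com/a', 'https://www.example.com/a',
    #             'http://example.com/a', 'https://example.com/a']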

atextcrawler.utils.html module

Utilities for extracting information from html.

atextcrawler.utils.html.clean_body(body)

Clean an html body.

Remove unwanted tags (keeping their content); remove empty tags; remove and replace whitespaces in several ways.

In the end, the only whitespace character is the space, and there are no consecutive spaces.

atextcrawler.utils.html.clean_html(s: Optional[str]) → Optional[str]

Clean an html string.

Unescape HTML entities and replace whitespace characters with ' ' (ASCII char 0x20).

See also: https://www.lesinskis.com/python-unicode-whitespace.html

atextcrawler.utils.html.clean_page(html)

Remove unwanted tags including their content from html.

Drop tags in drop_tags as well as tags with a role in drop_roles. Also drop tags with attribute aria-hidden=true.

Return a BeautifulSoup object.

atextcrawler.utils.html.extract_title(html: str) → Optional[str]

Extract title tags from html returning their content as a string.

atextcrawler.utils.html.get_html_lang(html: str) → Optional[str]

Return the language, if any, found in the lang attribute of the html tag.
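
A usage sketch, following the docstring:

    from atextcrawler.utils.html import get_html_lang

    get_html_lang('<html lang="de"><body></body></html>')        # presumably 'de'
    get_html_lang('<html><body>no lang attribute</body></html>')  # None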

atextcrawler.utils.html.get_html_redirect(html: str) → Optional[str]

Return an html redirect in an http-equiv meta tag.

If none is found, return None.

atextcrawler.utils.html.whitespace_tag_tag(match_obj)

Helper function for removing whitespace between tags.

atextcrawler.utils.http module

Utility functions related to http.

Extract canonical and shortlink links from http headers.

durl must be the Durl of the fetched page, and site - if not None - must be the Site to which the page belongs.

Return a (default)dict with ‘canonical’ and ‘shortlink’ as keys. The values default to None.

atextcrawler.utils.json module

Custom JSON encoder.

class atextcrawler.utils.json.JSONEncoderExt(*, skipkeys=False, ensure_ascii=True, check_circular=True, allow_nan=True, sort_keys=False, indent=None, separators=None, default=None)

Bases: json.encoder.JSONEncoder

Extended JSON encoder with encoding of sets as lists.

default(obj)

Encode sets as lists, and everything else as JSONEncoder does by default.

atextcrawler.utils.json.json_dumps(obj)

Encode an object to a JSON string using JSONEncoderExt.

atextcrawler.utils.json.json_loads(s, *, cls=None, object_hook=None, parse_float=None, parse_int=None, parse_constant=None, object_pairs_hook=None, **kw)

Decode JSON strings, just as json.loads does by default.
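
A usage sketch; sets survive encoding as JSON lists:

    from atextcrawler.utils.json import json_dumps, json_loads

    s = json_dumps({'ids': {1, 2, 3}})  # the set is encoded as a JSON list
    data = json_loads(s)                # standard decoding; the list stays a list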

atextcrawler.utils.lang module

Utility functions related to languages.

atextcrawler.utils.lang.clean_lang(lang: Optional[str]) → Optional[str]

Clean a language code string: it must be an ISO 639-1 code or None.

atextcrawler.utils.lang.extract_content_language(text: str) → Optional[str]

Extract the language from a text.

atextcrawler.utils.muse module

Parse muse-formatted plaintext (delivered by amusewiki).

atextcrawler.utils.muse.amusewiki_fields = ['author', 'title', 'lang', 'LISTtitle', 'subtitle', 'SORTauthors', 'SORTtopics', 'date', 'pubdate', 'notes', 'source', 'publisher', 'isbn', 'seriesname', 'seriesnumber']

Amusewiki fields (cf. https://amusewiki.org/library/manual).

atextcrawler.utils.muse.extract_muse_meta(meta, body) → dict

Extract meta information from muse header and muse body.

atextcrawler.utils.muse.parse_head(text: str) → dict

Parse a MUSE head and return a dict mapping field names to values.

atextcrawler.utils.muse.parse_muse(text: str) → Optional[tuple[dict, str]]

Parse a MUSE string returning meta information and the text body.

atextcrawler.utils.muse.split_head_body(text: str) → tuple[str, str]

Split a MUSE string into head and body and return both.
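
A usage sketch; parse_muse returns None if the text cannot be parsed:

    from atextcrawler.utils.muse import parse_muse

    result = parse_muse(muse_text)  # muse_text: a muse-formatted string (assumed given)
    if result:
        meta, body = result
        meta.get('title'), meta.get('author')  # fields from the muse head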

atextcrawler.utils.probe module

Utility functions for probing / sampling.

atextcrawler.utils.probe.extract_samples(items, n=5)

Extract up to n sample elements from the given dict or list.

If items is a dict, return sample elements from its list of keys.
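
For example (which elements are chosen is not specified here):

    from atextcrawler.utils.probe import extract_samples

    extract_samples([10, 20, 30, 40, 50, 60, 70], n=3)  # up to 3 of the list elements
    extract_samples({'a': 1, 'b': 2, 'c': 3}, n=2)      # samples taken from the keys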

atextcrawler.utils.section module

Operations on text sections.

Semantic breaks are character positions within a text (0-offset) where a new section begins. More precisely, the character position contains a space, and the semantically breaking tag (e.g., an h1 or a br) begins only at the next position.

Each semantic break has a level, which means breaking strength. The lower the level (e.g., h1 has a lower level than h2), the stronger the break.

Implicitly, if position 0 has no semantic break, a semantic break at position 0 with level 80 is added.

Semantic breaks can be used to split a text into sections. The lower the maximum level of the semantic breaks taken into account, the coarser the segmentation and the fewer the sections. Each section is given the level of the semantic break at its beginning.

From another point of view, sections have levels indicating the segmentation depth.

The levels for html tags are defined in tag.py.

The semantic_breaks argument in the functions below is a dict mapping the character position of the semantic break to the level of a section beginning at this position (if segmentation is done at this or a higher level).

atextcrawler.utils.section.concat_section_texts(text, semantic_breaks, min_len=2000)

Try to concatenate consecutive sections into chunks with a minimum length.

Yield (section_ids, combined_text).

atextcrawler.utils.section.iter_sections(text, semantic_breaks, max_level=59)

Iterate over sections, limiting to those with a maximum level.

Yield (start_pos, end_pos, level, text). The text is assumed to have its first semantic break at position 0.
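
A usage sketch with an invented semantic_breaks dict (positions and levels are illustrative only):

    from atextcrawler.utils.section import concat_section_texts, iter_sections

    text = ' Intro text here. More text in this part. Next section text.'
    semantic_breaks = {0: 80, 17: 32, 42: 32}  # invented positions and levels

    for start, end, level, sect_text in iter_sections(text, semantic_breaks):
        print(start, end, level, sect_text)

    for section_ids, chunk in concat_section_texts(text, semantic_breaks, min_len=40):
        print(section_ids, len(chunk))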

atextcrawler.utils.similarity module

Text similarity with simhash.

atextcrawler.utils.similarity.create_simhash(index: simhash.SimhashIndex, resource_id: int, simhash_instance: simhash.Simhash) → int

Add a resource with given id and simhash to a simhash index.

Return the simhash value shifted into PostgreSQL’s bigint range.

(The simhash field of the resource’s database entry is not updated.)

atextcrawler.utils.similarity.get_features(txt: str) → list[str]

Extract features from string for use with Simhash.

atextcrawler.utils.similarity.get_simhash(text: str) → simhash.Simhash

Return the Simhash of the given text.

async atextcrawler.utils.similarity.get_simhash_index(conn: asyncpg.connection.Connection, site_id: int) → simhash.SimhashIndex

Return a simhash index with hashes of all stored resources of the site.

atextcrawler.utils.similarity.postgresql_bigint_offset = 9223372036854775808

Subtract this number to get a PostgreSQL bigint from a 64-bit int.

atextcrawler.utils.similarity.search_simhash(index: simhash.SimhashIndex, simhash_inst: simhash.Simhash) → list[int]

Return the ids of similar resources from the index.

atextcrawler.utils.similarity.simhash_from_bigint(bigint: int) → simhash.Simhash

Convert a simhash from PostgreSQL’s bigint to a Simhash instance.

atextcrawler.utils.similarity.simhash_to_bigint(simhash: simhash.Simhash) → int

Convert a simhash to PostgreSQL’s bigint value range.
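
The two conversions are inverses; a round-trip sketch (simhash values are unsigned 64-bit integers, PostgreSQL bigint is signed):

    from atextcrawler.utils.similarity import (
        get_simhash, simhash_from_bigint, simhash_to_bigint,
    )

    sh = get_simhash('some text to fingerprint')
    bigint = simhash_to_bigint(sh)     # shifted into the signed 64-bit range
    sh2 = simhash_from_bigint(bigint)  # back to a Simhash instance
    assert sh2.value == sh.value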

atextcrawler.utils.tag module

Information collections related to html tags.

atextcrawler.utils.tag.all_self_closing_tags = ('area', 'base', 'br', 'col', 'embed', 'hr', 'img', 'input', 'link', 'meta', 'param', 'source', 'track', 'wbr')

All self-closing tags of the html standard.

atextcrawler.utils.tag.drop_roles = ('banner', 'complementary', 'contentinfo', 'dialog', 'figure', 'form', 'img', 'search', 'switch')

Drop tags with these aria roles.

atextcrawler.utils.tag.drop_tags = ['applet', 'area', 'audio', 'base', 'basefont', 'bdi', 'bdo', 'button', 'canvas', 'code', 'command', 'data', 'datalist', 'dir', 'embed', 'fieldset', 'figure', 'form', 'frame', 'frameset', 'iframe', 'img', 'input', 'label', 'legend', 'map', 'menuitem', 'meter', 'noframes', 'noscript', 'object', 'optgroup', 'option', 'param', 'picture', 'progress', 'rp', 'rt', 'ruby', 'samp', 'script', 'select', 'source', 'style', 'svg', 'template', 'textarea', 'track', 'var', 'video']

Tags to drop, including their content.

atextcrawler.utils.tag.keep_tags = {'a': (0, 0, ''), 'abbr': (0, 0, 'st'), 'acronym': (0, 0, 'st'), 'address': (1, 0, 'm'), 'article': (1, 15, ''), 'aside': (1, 0, 'd'), 'b': (0, 0, 'st'), 'blockquote': (1, 65, 'q'), 'br': (1, 80, ''), 'caption': (1, 68, ''), 'center': (1, 50, ''), 'cite': (1, 0, 'd'), 'col': (1, 75, ''), 'colgroup': (1, 73, ''), 'dd': (1, 70, 'li'), 'del': (0, 0, 'se'), 'details': (1, 0, 'd'), 'dfn': (0, 0, 'st'), 'div': (1, 60, ''), 'dl': (1, 70, 'l'), 'dt': (1, 70, 'li'), 'em': (0, 0, 'st'), 'figcaption': (1, 0, ''), 'font': (0, 0, 's'), 'footer': (1, 15, ''), 'h1': (1, 30, ''), 'h2': (1, 32, ''), 'h3': (1, 34, ''), 'h4': (1, 36, ''), 'h5': (1, 38, ''), 'h6': (1, 40, ''), 'header': (1, 15, ''), 'hr': (1, 30, ''), 'i': (0, 0, 'st'), 'ins': (0, 0, 'se'), 'li': (1, 75, 'li'), 'main': (1, 10, ''), 'mark': (0, 0, 's'), 'nav': (1, 0, ''), 'ol': (1, 70, 'l'), 'p': (1, 60, ''), 'pre': (1, 65, 'q'), 'q': (1, 0, 'q'), 's': (0, 0, ''), 'section': (1, 24, ''), 'small': (0, 0, 'd'), 'span': (0, 0, 's'), 'strike': (0, 0, 'se'), 'strong': (0, 0, 'st'), 'sub': (0, 0, ''), 'summary': (1, 20, 'm'), 'sup': (0, 0, ''), 'table': (1, 65, ''), 'tbody': (1, 70, ''), 'td': (1, 78, ''), 'tfoot': (1, 70, ''), 'th': (1, 75, ''), 'thead': (1, 70, ''), 'time': (0, 0, 'm'), 'tr': (1, 75, ''), 'u': (0, 0, 's'), 'ul': (1, 70, 'l')}

Tags to keep for annotation, and their properties.

The properties are:

  • sep: whether to separate text at both sides of the tag with a space

  • lvl: structural depth level of content of this tag;

    the paragraph level is 60; headings are below 60, listings above; a div below the tag will usually have the tag’s depth + 1

  • sem: semantic categories, zero or more of: s=span, l=listing, i=list_item, t=term, e=edit, d=details, q=quote, m=meta, x=exclude (see the example below)
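
For example, reading one entry (values taken from the mapping above):

    from atextcrawler.utils.tag import keep_tags

    sep, lvl, sem = keep_tags['h1']  # (1, 30, '')
    # sep=1: separate the tag's text with spaces on both sides
    # lvl=30: heading depth, a stronger break than a paragraph (lvl 60)
    # sem='': no semantic category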

atextcrawler.utils.tag.self_closing_tags = ('br', 'hr')

Those among keep_tags which are self-closing.

Module contents