atextcrawler.resource package
Submodules
atextcrawler.resource.dedup module
Find boilerplate texts.
- async atextcrawler.resource.dedup.store_boilerplate_texts(fetcher, conn, site)
Find and store boilerplate texts of a site.
Fetch the start page and a sample of internal links obtained from it. Text sections that appear sufficiently frequently across these pages are considered boilerplate texts.
If boilerplate_texts were found, update the given site instance.
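A minimal usage sketch; fetcher (a ResourceFetcher), conn (a database connection) and site (a Site instance) are assumed to come from the surrounding crawler setup:

    from atextcrawler.resource.dedup import store_boilerplate_texts

    async def detect_boilerplate(fetcher, conn, site):
        # Updates the given site instance in place if boilerplate
        # texts were found.
        await store_boilerplate_texts(fetcher, conn, site)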
atextcrawler.resource.document module
Parse documents (often application/pdf).
- atextcrawler.resource.document.concat(s: Optional[Union[str, list]]) Optional[str]
Helper function for joining strings.
- atextcrawler.resource.document.extract_latest(s: Optional[Union[str, list]]) Optional[datetime.datetime]
Extract the latest date (if any) from a string or list of strings.
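An illustrative sketch of both helpers; the join separator and the accepted date formats are implementation details not specified here:

    from atextcrawler.resource.document import concat, extract_latest

    # Both helpers accept None, a string or a list of strings.
    joined = concat(['Alice', 'Bob'])                      # one joined string, or None
    latest = extract_latest(['2021-01-01', '2021-06-15'])  # latest datetime, or None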
- async atextcrawler.resource.document.parse_document(durl: atextcrawler.utils.durl.Durl, resp: dict, site: Optional[atextcrawler.models.Site]) Optional[Union[atextcrawler.models.TextResource, atextcrawler.models.ResourceError, atextcrawler.models.ResourceRedirect]]
Extract plain text from documents in various formats.
atextcrawler.resource.feed module
Stuff related to feeds.
Higher-level stuff is in site.feeds.
- atextcrawler.resource.feed.convert_feed_entries(base_url: Optional[str], entries: list[dict]) tuple[list[tuple[str, bool]], dict[str, tuple[typing.Optional[str], typing.Optional[str], typing.Optional[str]]]]
Extract paths and resource meta information from a feed’s entries.
Return paths in a structure wanted by add_site_paths() and resource meta information in a structure wanted by update_resource_meta().
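A sketch of the two return structures, using hypothetical feed entries; the exact entry keys consumed are not documented here:

    from atextcrawler.resource.feed import convert_feed_entries

    def convert(entries: list[dict]):
        # entries: hypothetical feed entries as parsed from a feed
        paths, meta = convert_feed_entries('https://example.org', entries)
        # paths: list[tuple[str, bool]], e.g. [('/blog/post-1', True)],
        #   the structure wanted by add_site_paths()
        # meta: dict mapping each path to a tuple of three optional strings,
        #   the structure wanted by update_resource_meta()
        return paths, meta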
- atextcrawler.resource.feed.parse_json_feed(resp, data: dict) atextcrawler.models.Feed
Parse a JSON response for JSON Feed information.
TODO: handle ‘next_url’ (see https://jsonfeed.org/version/1.1)
- atextcrawler.resource.feed.parse_xml_feed(resp) Union[atextcrawler.models.Feed, atextcrawler.models.ResourceError]
Parse a response from Fetcher.get_resp() for XML feed information.
- async atextcrawler.resource.feed.update_feed(fetcher, feed, conn) Optional[list[dict]]
Fetch, parse and return a given feed’s content; also update the feed.
If the server replies with HTTP 410, delete the feed. If there is no new information (the server replies with HTTP 304), return None. For other errors, also return None and increase the feed’s fail_count.
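A sketch of the documented contract, combining update_feed() with store_feed_entries() from atextcrawler.resource.operations:

    from atextcrawler.resource.feed import update_feed
    from atextcrawler.resource.operations import store_feed_entries

    async def refresh_feed(fetcher, feed, conn, site):
        entries = await update_feed(fetcher, feed, conn)
        if entries is not None:  # None on HTTP 304 and on errors
            await store_feed_entries(conn, site, entries)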
atextcrawler.resource.fetch module
Access to a resource specified by a URL.
- atextcrawler.resource.fetch.MAX_REDIRECTS = 10
Maximum number of redirects to follow.
- class atextcrawler.resource.fetch.ResourceFetcher(session: aiohttp.client.ClientSession, timeout_sock_connect: Union[int, float] = 8, timeout_sock_read: Union[int, float] = 30)
Bases: object
Fetch a resource specified by a URL (fetch()).
The timeout is the same for all requests.
- async fetch(url: str, site: Optional[atextcrawler.models.Site] = None, redirect_history: Optional[list[str]] = None, headers: Optional[dict] = None) Union[None, atextcrawler.models.MetaResource, atextcrawler.models.TextResource, atextcrawler.models.ResourceError, atextcrawler.models.ResourceRedirect]
Try to fetch a resource and return an instance or error or redirect.
If an error was encountered, return a ResourceError. If the resource has an irrelevant content type, return None. Otherwise return a specific content instance.
Argument redirect_history contains the redirect history; if one of the redirects is encountered again, return None.
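A minimal end-to-end sketch using only the documented constructor and fetch() signature:

    import asyncio
    import aiohttp
    from atextcrawler.models import ResourceError, TextResource
    from atextcrawler.resource.fetch import ResourceFetcher

    async def main():
        async with aiohttp.ClientSession() as session:
            fetcher = ResourceFetcher(session)
            result = await fetcher.fetch('https://example.org/')
            if isinstance(result, ResourceError):
                print('failed:', result)
            elif isinstance(result, TextResource):
                print('got a text resource')
            # result may also be None (irrelevant content type),
            # a MetaResource or a ResourceRedirect

    asyncio.run(main())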
- async get_resp(durl: atextcrawler.utils.durl.Durl, headers: Optional[dict] = None, redirect_history: Optional[list[str]] = None) Optional[Union[atextcrawler.models.ResourceError, dict]]
Try to fetch a URL, returning a ResourceError or a dict with content.
Optional headers will overwrite the default_headers.
If the response status is not 200, always return a ResourceError.
If the content-type is not relevant (see blacklist_content_types), return None.
The dict contains these keys and values:
- ‘parser’: a hint on the parser to use for analyzing the content; one of ‘html’, ‘plain’, ‘feed’, ‘xml’, ‘application’
- ‘content’: bytes for type application, otherwise str
- ‘redirects’: a list of URLs visited during HTTP redirection; the last item is the final URL
- ‘headers’: response headers
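A sketch of consuming the returned dict; constructing the Durl instance is not shown, since its API is not documented in this section:

    from atextcrawler.models import ResourceError

    async def show_resp(fetcher, durl):
        # durl: an atextcrawler.utils.durl.Durl for the target URL
        resp = await fetcher.get_resp(durl)
        if resp is None or isinstance(resp, ResourceError):
            return
        parser = resp['parser']            # 'html', 'plain', 'feed', 'xml' or 'application'
        content = resp['content']          # bytes for 'application', str otherwise
        final_url = resp['redirects'][-1]  # the last redirect is the final URL
        print(parser, final_url, len(content))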
- atextcrawler.resource.fetch.blacklist_content_types = ['', 'application/ogg']
Blacklist for content-types.
- atextcrawler.resource.fetch.default_headers = {'Accept-Language': 'en-US,en;q=0.5, *;q=0.5', 'DNT': '1', 'Upgrade-Insecure-Requests': '1', 'User-Agent': 'Mozilla/5.0 (X11; Linux aarch64; rv:78.0) Gecko/20100101 Firefox/78.0'}
Default HTTP client headers, overwriting those of aiohttp.ClientSession.
- async atextcrawler.resource.fetch.parse_json(durl: atextcrawler.utils.durl.Durl, response: dict) Optional[Union[atextcrawler.models.Feed, atextcrawler.models.ResourceError]]
Parse the content of JSON feeds.
- async atextcrawler.resource.fetch.parse_xml(durl: atextcrawler.utils.durl.Durl, response: dict, rss=False, atom=False) Optional[Union[atextcrawler.models.MetaResource, atextcrawler.models.ResourceError]]
Parse XML content.
In particular, parse sitemapindex, sitemap, RSS and Atom feeds.
- atextcrawler.resource.fetch.text_content_types = {'application/atom+xml': 'feed-atom', 'application/feed+json': 'feed-json', 'application/json': 'json', 'application/rss+xml': 'feed-rss', 'application/xml': 'xml', 'text/html': 'html', 'text/plain': 'plain', 'text/xml': 'xml'}
Map content-types to parsers.
atextcrawler.resource.operations module
Operations on resources.
- async atextcrawler.resource.operations.add_site_paths(conn: asyncpg.connection.Connection, site_id: int, paths: Sequence[tuple[str, typing.Optional[bool]]]) None
Add site paths. If resource infos are given, also create resources.
The paths must be given as relative paths, each together with a boolean telling whether the link is canonical.
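A sketch of the expected structure with hypothetical paths; None means it is unknown whether the link is canonical:

    from atextcrawler.resource.operations import add_site_paths

    async def add_example_paths(conn, site_id):
        # Hypothetical relative paths, each with an Optional[bool]
        # telling whether the link is canonical.
        paths = [('/blog/post-1', True), ('/imprint', None)]
        await add_site_paths(conn, site_id, paths)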
- async atextcrawler.resource.operations.get_site_path(conn: asyncpg.connection.Connection, site: atextcrawler.models.Site, before: datetime.datetime, only_new=False) Optional[atextcrawler.models.SitePath]
Return the next path of a given site that needs to be processed.
If none needs to be processed, return None.
Only return paths that have last been visited before the given before datetime, or that have not been processed at all. Paths with an ok_count of -3 or lower are dropped.
If only_new, limit to paths that have not been processed at all, irrespective of the value of before.
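A sketch of a worker loop combining get_site_path() with process_site_path() (below); the one-week revisit cutoff is an arbitrary assumption:

    from datetime import datetime, timedelta, timezone

    from atextcrawler.resource.operations import get_site_path, process_site_path

    async def work_site(app, worker_number, conn, fetcher, tf, site):
        # Arbitrary revisit policy: reprocess paths last visited
        # more than a week ago.
        before = datetime.now(timezone.utc) - timedelta(days=7)
        while (site_path := await get_site_path(conn, site, before)) is not None:
            await process_site_path(
                app, worker_number, conn, fetcher, tf, site, site_path
            )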
- async atextcrawler.resource.operations.process_site_path(app, worker_number: int, conn: asyncpg.connection.Connection, fetcher: atextcrawler.resource.fetch.ResourceFetcher, tf: atextcrawler.tensorflow.TensorFlow, site: atextcrawler.models.Site, site_path: atextcrawler.models.SitePath) bool
Fetch a path, deduplicate and if canonical, update and index the resource.
Return whether a new resource was handled that should contribute to statistics.
- async atextcrawler.resource.operations.store_feed_entries(conn: asyncpg.connection.Connection, site: atextcrawler.models.Site, entries: list[dict]) None
Add missing resources of a site from given feed entries.
- async atextcrawler.resource.operations.update_resource_meta(conn: asyncpg.connection.Connection, site_id: int, resource_meta: dict) None
Update meta information of existing resources using path to find them.
atextcrawler.resource.page module
Parse HTML pages.
- atextcrawler.resource.page.filter_sections(text, annotations, boilerplate_texts)
Filter out irrelevant sections using scores, factoring in neighboring sections.
- async atextcrawler.resource.page.parse_html(durl: atextcrawler.utils.durl.Durl, resp: dict, site: Optional[atextcrawler.models.Site]) Optional[Union[atextcrawler.models.TextResource, atextcrawler.models.ResourceError, atextcrawler.models.ResourceRedirect]]
Extract relevant data from a response returning a TextResource instance.
The given URL must be the full URL (incl. scheme and netloc) of the page.
atextcrawler.resource.plaintext module
Parse plaintext pages.
- atextcrawler.resource.plaintext.MAX_LINK_TEXT_LENGTH = 100
Maximum length of a link’s text to be kept.
Cf. table site_link, column link_text.
- atextcrawler.resource.plaintext.annotate_text(text)
Return annotations as utils.annotation.annotate() does.
Here we only have information on semantic breaks (in plaintext, they occur at empty lines).
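A small sketch; the empty line is the only source of a semantic break:

    from atextcrawler.resource.plaintext import annotate_text

    text = 'First paragraph.\n\nSecond paragraph.'
    annotations = annotate_text(text)  # semantic break at the empty line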
- async atextcrawler.resource.plaintext.parse_plaintext(durl: atextcrawler.utils.durl.Durl, resp: dict, site: Optional[atextcrawler.models.Site]) Optional[Union[atextcrawler.models.ResourceRedirect, atextcrawler.models.TextResource]]
Extract relevant data from a response returning a TextResource instance.
The given URL must be the full URL (incl. scheme and netloc) of the page.
atextcrawler.resource.sitemap module
Sitemap and SitemapIndex and related operations.
- atextcrawler.resource.sitemap.extract_sitemap_paths(base_url: Optional[str], urls: list[dict]) tuple[list[tuple[str, bool]], typing.Optional[datetime.datetime]]
Extract essential information from sitemap URLs.
Return a list of relative paths of the site’s resources (in a form to be easily fed into add_site_paths) and the datetime of the latest change.
Relative paths are computed using base_url.
- async atextcrawler.resource.sitemap.get_sitemap_urls(fetcher, base_url: Optional[str], sitemaps=None) list[dict]
Try to find sitemaps and fetch and return their URL content.
Each sitemapped URL is a dict with key ‘loc’ and optional key ‘lastmod’.
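A sketch combining get_sitemap_urls() with extract_sitemap_paths() (above) and add_site_paths():

    from atextcrawler.resource.operations import add_site_paths
    from atextcrawler.resource.sitemap import (
        extract_sitemap_paths,
        get_sitemap_urls,
    )

    async def import_sitemaps(fetcher, conn, site_id, base_url):
        urls = await get_sitemap_urls(fetcher, base_url)
        # urls: [{'loc': ..., 'lastmod': ...}, ...]
        paths, latest_change = extract_sitemap_paths(base_url, urls)
        await add_site_paths(conn, site_id, paths)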
- atextcrawler.resource.sitemap.parse_sitemap(urlset) atextcrawler.models.Sitemap
Return a list of sitemap URLs.
Each URL is a dict with these keys and values:
- loc: the full URL of a mapped resource
- lastmod: optional datetime of its last modification
- changefreq: optional info on the expected change frequency
- priority: optional info on its priority relative to other resources
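For illustration, one such URL dict might look like this (hypothetical values; the types of changefreq and priority are as found in the sitemap):

    from datetime import datetime, timezone

    url = {
        'loc': 'https://example.org/blog/post-1',
        'lastmod': datetime(2021, 5, 1, tzinfo=timezone.utc),
        'changefreq': 'weekly',   # hypothetical value
        'priority': '0.8',        # hypothetical value
    }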
- atextcrawler.resource.sitemap.parse_sitemapindex(sitemapindex)
Parse a sitemap index returning a SitemapIndex with found sitemaps.