atextcrawler.site package

Submodules

atextcrawler.site.feeds module

High-level feed handling for sites.

See resource.feed for low-level feed functionality not primarily related to sites.

async atextcrawler.site.feeds.fetch_feeds(fetcher, conn, site) → Optional[datetime.datetime]

Fetch feeds, add new resources and return the latest content update time.

async atextcrawler.site.feeds.get_feeds(conn, site_id) → list[atextcrawler.models.Feed]

Return stored feeds for the given site.

async atextcrawler.site.feeds.store_new_feeds(conn, site_id, feeds: dict)

Store new feeds in table site_feed.
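
Taken together, these functions suggest a flow like the following sketch; fetcher, conn and site are assumed to come from the application context, and the attribute name site.id_ is an assumption.

    import asyncpg

    from atextcrawler.site.feeds import fetch_feeds, get_feeds

    async def refresh_feeds(fetcher, conn: asyncpg.Connection, site) -> None:
        # Fetch the site's feeds; new resources found in them are added and
        # the latest content update time is returned (or None).
        latest = await fetch_feeds(fetcher, conn, site)
        if latest:
            print(f'latest feed update: {latest.isoformat()}')
        # Inspect the feeds stored for this site (table site_feed).
        for feed in await get_feeds(conn, site.id_):
            print(feed)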

atextcrawler.site.operations module

Operations on sites.

async atextcrawler.site.operations.checkin_site(app, conn: asyncpg.connection.Connection, site: atextcrawler.models.Site, crawl: atextcrawler.models.Crawl)

Unlock the site and schedule next crawl.

crawl is the crawl that has just finished (regularly or stopped).

If the crawl was stopped (t_end is None), just unlock the site.

Otherwise schedule a crawl of the same type. After a full crawl, a feed crawl is also scheduled if none was scheduled yet.
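
The decision can be pictured roughly as follows; this sketch only illustrates the described behaviour and is not the actual implementation. The attributes crawl.t_end, crawl.is_full and site.id_, the column next_feed_crawl, the table name site and the concrete intervals are assumptions (crawl_active and next_full_crawl do appear elsewhere in this documentation).

    from datetime import datetime, timedelta

    async def sketch_checkin(conn, site, crawl):
        if crawl.t_end is None:
            # The crawl was stopped: just unlock the site.
            await conn.execute(
                'UPDATE site SET crawl_active=false WHERE id=$1', site.id_)
            return
        # Schedule the next crawl of the same type ...
        interval = timedelta(days=30) if crawl.is_full else timedelta(days=1)
        column = 'next_full_crawl' if crawl.is_full else 'next_feed_crawl'
        await conn.execute(
            f'UPDATE site SET crawl_active=false, {column}=$1 WHERE id=$2',
            datetime.utcnow() + interval, site.id_)
        # ... and after a full crawl make sure a feed crawl is scheduled, too.
        if crawl.is_full:
            await conn.execute(
                'UPDATE site SET next_feed_crawl=coalesce(next_feed_crawl, $1)'
                ' WHERE id=$2', datetime.utcnow() + timedelta(days=1), site.id_)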

async atextcrawler.site.operations.checkout_site(app, conn: asyncpg.connection.Connection) → tuple[typing.Optional[int], bool, bool]

Get the id of a site to be crawled and mark it with crawl_active=true.

Also return whether the site shall be fully crawled; if not, only the resources from the site's feeds shall be crawled.

Also return whether more sites might be available.
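
A crawl worker built on checkout_site() and checkin_site() could look roughly like this; crawl_one_site() is a hypothetical helper standing in for the actual crawling, which is outside the scope of this module.

    import asyncio

    from atextcrawler.site.operations import checkin_site, checkout_site

    async def crawl_worker(app, conn):
        while True:
            site_id, full_crawl, more = await checkout_site(app, conn)
            if site_id is None:
                if not more:
                    await asyncio.sleep(60)   # nothing to crawl, back off
                continue
            # crawl_one_site() is a hypothetical helper returning the Site
            # and the finished Crawl for the checked-out site.
            site, crawl = await crawl_one_site(app, conn, site_id, full_crawl)
            await checkin_site(app, conn, site, crawl)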

async atextcrawler.site.operations.is_site_allowed(conn: asyncpg.connection.Connection, site_id: Optional[int], base_url: str) → Optional[bool]

Return True if the site is whitelisted, False if blacklisted, else None.

Also add missing site_ids to the annotations.
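
A minimal usage sketch (the URL is purely illustrative):

    from atextcrawler.site.operations import is_site_allowed

    async def may_crawl(conn, site_id, base_url='https://example.org'):
        allowed = await is_site_allowed(conn, site_id, base_url)
        # True: whitelisted, False: blacklisted, None: on neither list.
        return allowed is not False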

async atextcrawler.site.operations.process_site(fetcher, conn: asyncpg.connection.Connection, site: atextcrawler.models.Site)

Process a site: fetch and store more information.

Store external and internal links; find boilerplate texts; fetch sitemaps; fetch feeds; update date of last publication.

async atextcrawler.site.operations.update_site(app, fetcher, conn: asyncpg.connection.Connection, base_url, site: Optional[atextcrawler.models.Site] = None) → tuple[typing.Optional[atextcrawler.models.Site], bool]

Try to fetch base_url and return a site and whether a new one was created.

This function is run for all sites (including blacklisted and irrelevant ones). It determines whether the site shall be crawled.

If an error occurs, return (None, False); if a site was given, also set it to crawl_enabled=False and remove its crawling schedules.

If base_url could be fetched, update the site, possibly creating a new one.

If the site has crawl_enabled, and no full crawl is scheduled, schedule one (by updating column next_full_crawl).
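
A usage sketch; whether process_site() is called at exactly this point in the real crawler is an assumption.

    from atextcrawler.site.operations import process_site, update_site

    async def handle_base_url(app, fetcher, conn, base_url):
        site, created = await update_site(app, fetcher, conn, base_url)
        if site is None:
            return          # fetching base_url failed, crawling was disabled
        if created:
            print(f'created new site for {base_url}')
        await process_site(fetcher, conn, site)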

atextcrawler.site.parse module

Parsing of a site’s startpage.

Return external links (mapping from URL to link text) from startpage.

Also add links to alternate language variants of the site.

Collect link tags with site scope (feeds, linkbacks, canonical, …).

atextcrawler.site.parse.collect_meta_tags(soup)

Collect selected meta tags (meta_names and meta_props) with their values.

atextcrawler.site.parse.cut_str(s: Optional[str], l: int) → Optional[str]

Cut a string s to a maximal length l from the left.
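
A minimal sketch of what this helper presumably does, assuming the leftmost l characters are kept:

    from typing import Optional

    def cut_str(s: Optional[str], l: int) -> Optional[str]:
        # Keep at most the first l characters; pass None through unchanged.
        return s if s is None else s[:l]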

atextcrawler.site.parse.extract_languages(page, meta, meta_links) → set[str]

Extract languages from a page’s html tag, meta tags and HTTP headers.

Also add the language detected in the text content of the page.

Return a set of ISO 639-1 language codes.

See also https://www.w3.org/International/questions/qa-http-and-lang and https://www.w3.org/International/questions/qa-html-language-declarations
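
A simplified sketch of such declaration-based language extraction (not the actual implementation; language detection from the text content is omitted):

    from bs4 import BeautifulSoup

    def declared_languages(html: str, http_headers: dict) -> set[str]:
        soup = BeautifulSoup(html, 'html.parser')
        langs = set()
        html_tag = soup.find('html')
        if html_tag and html_tag.get('lang'):
            langs.add(html_tag['lang'][:2].lower())   # e.g. 'en' from 'en-US'
        meta = soup.find('meta', attrs={'http-equiv': 'content-language'})
        if meta and meta.get('content'):
            langs.add(meta['content'][:2].lower())
        header = http_headers.get('Content-Language', '')
        langs.update(v.strip()[:2].lower() for v in header.split(',') if v.strip())
        return langs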

atextcrawler.site.parse.extract_meta_texts(page, meta) → tuple[str, typing.Optional[str], list[str]]

Extract and return title, description, keywords from a page and meta tags.

async atextcrawler.site.parse.parse_startpage(startpage: atextcrawler.models.TextResource, app=None, site=None) → atextcrawler.models.Site

Parse a site’s startpage and return a Site instance.

If a site instance is given, update it.
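
A minimal usage sketch; how the startpage TextResource is obtained is outside the scope of this module:

    from atextcrawler.site.parse import parse_startpage

    async def site_from_startpage(startpage, app=None, known_site=None):
        # If known_site is given it is updated, otherwise a new Site is built.
        return await parse_startpage(startpage, app=app, site=known_site)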

atextcrawler.site.queue module

Queue of sites.

When processing a resource, its external links are put into database table site_queue. The items in site_queue are processed in process_site_queue(), one base URL at a time (see iter_site_queue()). While doing this, cross-site links are put into table site_link.

async atextcrawler.site.queue.iter_site_queue(app, conn: asyncpg.connection.Connection) → AsyncIterator[tuple[str, dict[int, str]]]

Yield URLs with aggregated link information from site_queue.

Yield a URL and a dict mapping ids of linking sites to link texts.
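
Consuming the iterator could look roughly like this; how the incoming links are then stored is only hinted at:

    from atextcrawler.site.operations import update_site
    from atextcrawler.site.queue import iter_site_queue

    async def work_site_queue(app, fetcher, conn):
        # Each iteration yields one base URL together with a dict that maps
        # the ids of the linking sites to the corresponding link texts.
        async for base_url, links_from in iter_site_queue(app, conn):
            site, _created = await update_site(app, fetcher, conn, base_url)
            if site is not None:
                ...   # store the links in links_from as rows of table site_link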

async atextcrawler.site.queue.process_site_queue(app, pool)

Loop over queued sites, creating new sites and adding cross-site links.

async atextcrawler.site.queue.site_recently_updated(conn: asyncpg.connection.Connection, base_url: str, site_revisit_interval: float) → Optional[int]

Return the id of the site with given base_url if it was updated recently.

Store incoming site-site links (irrespective of crawl_enabled).

site_id is the id of the site to which the links in links_from point.

atextcrawler.site.robots module

Fetch and evaluate a website’s robots.txt.

class atextcrawler.site.robots.RobotsInfo(site_url: str, user_agent: str = '*', session: Optional[aiohttp.client.ClientSession] = None)

Bases: urllib.robotparser.RobotFileParser

Obtain information from a site’s robots.txt.

After instantiation you must await startup(); a usage sketch follows the member list below.

can_fetch_url(url: str) → bool

Return whether fetching of the given url is allowed.

property delay: Optional[Union[int, float]]

The delay to be used between requests.

property site_maps: list[str]

The list of sitemaps of the site.

property user_agent: str

The user agent being used.
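
A usage sketch for RobotsInfo (the site URL is purely illustrative):

    import aiohttp

    from atextcrawler.site.robots import RobotsInfo

    async def show_robots(site_url='https://example.org'):
        async with aiohttp.ClientSession() as session:
            robots = RobotsInfo(site_url, session=session)
            await robots.startup()             # required before any other use
            print(robots.user_agent)           # user agent in effect
            print(robots.delay)                # crawl delay, if declared
            print(robots.site_maps)            # sitemap URLs from robots.txt
            print(robots.can_fetch_url(site_url + '/'))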

atextcrawler.site.seed module

Seeding of new installations with URLs from blacklists and whitelists.

async atextcrawler.site.seed.load_seeds(config: dict, pool: asyncpg.pool.Pool) → None

Add seed file contents (site blacklist and whitelist).

If there are sites already, do nothing.
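
Seeding typically happens once per installation, roughly like this; the database DSN is purely illustrative and obtaining the config dict is application-specific:

    import asyncpg

    from atextcrawler.site.seed import load_seeds

    async def seed_installation(config: dict) -> None:
        # The DSN is purely illustrative; load_seeds() does nothing if
        # sites already exist.
        pool = await asyncpg.create_pool(dsn='postgresql:///atextcrawler')
        try:
            await load_seeds(config, pool)
        finally:
            await pool.close()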

Module contents

Websites.