atextcrawler.plugin_defaults package

Submodules

atextcrawler.plugin_defaults.filter_resource_path module

Filter paths found in a resource.

This plugin implements rp_filter().

atextcrawler.plugin_defaults.filter_resource_path.rp_filter(site, durl) → Optional[str]

Adjust or filter paths found in a resource; the result may depend on the site.

Return the (possibly adjusted) path to keep it. To filter out a path (i.e., to keep it out of the site_path table), return None.
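
As an illustration of this contract, here is a minimal sketch of a custom rp_filter(). It is not the package's default implementation. The FakeDurl class and its path attribute are hypothetical stand-ins for whatever URL object the crawler passes as durl, and both rules (dropping image links, stripping a trailing slash) are invented for the example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FakeDurl:
    """Hypothetical stand-in for the crawler's URL object (assumption)."""

    path: str


# Illustrative rule only: suffixes of resources we do not want to store.
IMAGE_SUFFIXES = ('.jpg', '.jpeg', '.png', '.gif')


def rp_filter(site, durl) -> Optional[str]:
    """Return the path to keep (possibly adjusted), or None to drop it."""
    path = durl.path
    # Filter: returning None keeps the path out of the site_path table.
    if path.lower().endswith(IMAGE_SUFFIXES):
        return None
    # Adjust: strip a trailing slash, but keep the root path intact.
    return path.rstrip('/') or '/'
```

The site argument is unused here; a real plugin could branch on it to apply per-site rules.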

atextcrawler.plugin_defaults.filter_site module

Relevance estimation of sites.

This plugin implements site_filter().

async atextcrawler.plugin_defaults.filter_site.site_filter(site: atextcrawler.models.Site) → bool

Assess relevance of the site (using language-dependent criteria).

Return True if the site shall be crawled, otherwise False.
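
A rough sketch of what a language-dependent site_filter() could look like follows. FakeSite, its langs and title attributes, and the two rule sets are assumptions made for the example; the real atextcrawler.models.Site may expose different fields.

```python
import asyncio
from dataclasses import dataclass


@dataclass
class FakeSite:
    """Hypothetical stand-in for atextcrawler.models.Site (assumption)."""

    langs: tuple = ()
    title: str = ''


# Illustrative criteria only; not taken from the package.
ACCEPTED_LANGS = {'en', 'de'}
BLOCKED_WORDS = {'casino', 'lottery'}


async def site_filter(site) -> bool:
    """Return True if the site shall be crawled, otherwise False."""
    # Language criterion: at least one site language must be accepted.
    if not ACCEPTED_LANGS.intersection(site.langs):
        return False
    # Content criterion: reject sites whose title contains a blocked word.
    title = site.title.lower()
    return not any(word in title for word in BLOCKED_WORDS)
```

Because the function is a coroutine, it has to be awaited, for example with asyncio.run(site_filter(FakeSite(langs=('en',), title='Example blog'))) outside of a running event loop.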

atextcrawler.plugin_defaults.filter_site_path module

Plugin for filtering paths of a site to be retrieved.

This plugin implements sp_filter().

atextcrawler.plugin_defaults.filter_site_path.sp_filter(site, path, robots) → bool

Per-site path filter. Return whether the path shall be retrieved.
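
The following sketch shows one possible sp_filter(). It assumes that robots behaves like the standard library's urllib.robotparser.RobotFileParser, which may not match the object the crawler actually passes. USER_AGENT, make_robots() and the /login rule are hypothetical and exist only for the example.

```python
from urllib.robotparser import RobotFileParser

USER_AGENT = 'example-crawler'  # hypothetical user agent name


def make_robots(lines):
    """Build a robots.txt parser from lines (helper for this example only)."""
    robots = RobotFileParser()
    robots.parse(lines)
    return robots


def sp_filter(site, path, robots) -> bool:
    """Return whether the path shall be retrieved."""
    # Illustrative per-site rule: never fetch login pages.
    if path.startswith('/login'):
        return False
    # Assumption: robots offers can_fetch() like RobotFileParser does.
    return robots.can_fetch(USER_AGENT, path)
```

With robots = make_robots(['User-agent: *', 'Disallow: /private/']), a call such as sp_filter(None, '/private/data', robots) is rejected by the robots rules, while the /login check is independent of robots.txt.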

Module contents