atextcrawler.plugin_defaults package
Submodules
atextcrawler.plugin_defaults.filter_resource_path module
Filter paths found in a resource.
This plugin implements rp_filter().
- atextcrawler.plugin_defaults.filter_resource_path.rp_filter(site, durl) → Optional[str]
Adjust or filter found paths (may depend on site).
To filter out a path (i.e., to keep it from being added to table site_path), return None.
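A minimal sketch of a custom rp_filter() may help illustrate the return contract described above. It is not the default implementation. The `StubDurl` class and its `path` attribute are hypothetical stand-ins for whatever URL object the crawler passes in, and the image-suffix and trailing-slash rules are chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class StubDurl:
    """Hypothetical stand-in for the URL object passed as `durl`."""
    path: str


# Illustrative rule: paths with these suffixes are dropped.
SKIPPED_SUFFIXES = (".jpg", ".jpeg", ".png", ".gif")


def rp_filter(site, durl) -> Optional[str]:
    """Return the (possibly adjusted) path, or None to filter it out."""
    path = durl.path  # assumption: the object exposes the path like this
    if path.lower().endswith(SKIPPED_SUFFIXES):
        return None  # filtered: the caller does not store this path
    if len(path) > 1 and path.endswith("/"):
        # Adjust: drop trailing slashes, but keep the root path "/".
        path = path.rstrip("/") or "/"
    return path
```

Returning a string keeps the path (in adjusted form); returning None drops it. The `site` argument is unused here, but a real plugin could branch on it to apply site-specific rules.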
atextcrawler.plugin_defaults.filter_site module
Relevance estimation of sites.
This plugin implements site_filter().
- async atextcrawler.plugin_defaults.filter_site.site_filter(site: atextcrawler.models.Site) → bool
Assess relevance of the site (using language-dependent criteria).
Return True if the site should be crawled, otherwise False.
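As a rough illustration of a language-dependent criterion, the following sketch accepts a site only if it declares one of a fixed set of languages. `StubSite`, its `langs` attribute, and `ACCEPTED_LANGS` are assumptions made for this example. The real atextcrawler.models.Site may expose language information differently.

```python
import asyncio
from dataclasses import dataclass, field
from typing import List


@dataclass
class StubSite:
    """Hypothetical stand-in for atextcrawler.models.Site."""
    langs: List[str] = field(default_factory=list)


# Illustrative choice of languages considered relevant.
ACCEPTED_LANGS = {"de", "en"}


async def site_filter(site) -> bool:
    """Return True if the site declares at least one accepted language."""
    return any(lang in ACCEPTED_LANGS for lang in site.langs)


if __name__ == "__main__":
    print(asyncio.run(site_filter(StubSite(langs=["de", "fr"]))))
```

Because the function is a coroutine, a real plugin could also await database lookups or network requests before deciding. This sketch does neither.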
atextcrawler.plugin_defaults.filter_site_path module
Plugin for filtering paths of a site to be retrieved.
This plugin implements sp_filter().
- atextcrawler.plugin_defaults.filter_site_path.sp_filter(site, path, robots) → bool
Per-site path filter. Return whether the path shall be retrieved.
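One possible shape for a custom sp_filter() is sketched below. It assumes that `robots` offers a `can_fetch(useragent, url)` method like the standard library's `urllib.robotparser.RobotFileParser`. The type actually passed by the crawler is not documented here, so treat this as an assumption. `USER_AGENT` and `EXCLUDED_PREFIXES` are hypothetical names introduced for the example.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical values chosen for illustration only.
USER_AGENT = "*"
EXCLUDED_PREFIXES = ("/login", "/cart")


def sp_filter(site, path, robots) -> bool:
    """Return whether `path` of `site` shall be retrieved."""
    # Assumption: `robots` behaves like RobotFileParser (or is None).
    if robots is not None and not robots.can_fetch(USER_AGENT, path):
        return False
    # Illustrative extra rule: skip a few uninteresting path prefixes.
    return not path.startswith(EXCLUDED_PREFIXES)


# Build a robots object from in-memory rules instead of fetching a file.
example_robots = RobotFileParser()
example_robots.parse(["User-agent: *", "Disallow: /private/"])
```

With `example_robots`, a call such as `sp_filter(None, "/private/notes", example_robots)` is rejected by the robots rule, while `sp_filter(None, "/blog/post-1", example_robots)` passes both checks.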