atextcrawler


Related work

  • collection of crawlers

  • collection of webscrapers

crawlers

  • acrawler

  • trafilatura

    • repo

    • intro

  • aiohttp_spider

  • scrapy

  • heritrix3

  • YaCy

  • searchmysite

  • spiderling

  • https://github.com/riteshnaik/Crawling-and-Deduplication-of-Polar-Datasets-Using-Nutch-and-Tika

  • edge search engine

general

  • elastic enterprise search

sitemap parsers

  • ultimate-sitemap-parser

url handling

  • courlan

language detection

  • overview

  • guess_language-spirit

  • guess_language

  • cld3

text extraction

  • JusText demo

deduplication

  • PostgreSQL extension smlar

  • use smlar

  • remove paragraphs in which more than 50% of the word 7-tuples (sequences of 7 consecutive words) have been encountered previously
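
The word-7-tuple idea above could be sketched roughly as follows. This is only an illustration, not atextcrawler's implementation: the function names `shingles` and `dedup_paragraphs`, the regex-based tokenization, the lowercasing, and the choice to record tuples only from kept paragraphs are all assumptions made for the example.

```python
import re
from typing import Iterable, List, Optional, Set, Tuple

Shingle = Tuple[str, ...]


def shingles(text: str, n: int = 7) -> List[Shingle]:
    """Return all tuples of n consecutive words in text (hypothetical helper)."""
    # Assumption: words are runs of word characters, compared case-insensitively.
    words = re.findall(r"\w+", text.lower())
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]


def dedup_paragraphs(
    paragraphs: Iterable[str],
    seen: Optional[Set[Shingle]] = None,
    n: int = 7,
    threshold: float = 0.5,
) -> List[str]:
    """Drop paragraphs whose share of already-seen word n-tuples exceeds threshold."""
    if seen is None:
        seen = set()
    kept: List[str] = []
    for paragraph in paragraphs:
        tuples = shingles(paragraph, n)
        if tuples:
            duplicates = sum(1 for t in tuples if t in seen)
            if duplicates / len(tuples) > threshold:
                # More than 50% of the tuples were seen before: skip it.
                continue
        # Paragraphs shorter than n words have no tuples and are always kept.
        kept.append(paragraph)
        # Assumption: only tuples from kept paragraphs are remembered.
        seen.update(tuples)
    return kept
```

Passing the same `seen` set across several calls would extend the comparison across documents; whether that set lives in memory or in a database (for example alongside smlar) is left open here.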

Extract more meta tags

  • https://github.com/shareaholic/shareaholic-api-docs/blob/master/shareaholic_meta_tags.md

  • https://support.shareaholic.com/hc/en-us/articles/115003085186

Date parsing dependent on language

  • https://en.wikipedia.org/wiki/Date_format_by_country

  • https://en.wikipedia.org/wiki/Common_Locale_Data_Repository

  • https://pypi.org/project/dateparser/

  • https://github.com/ovalhub/pyicu

  • https://github.com/night-crawler/cldr-language-helpers

  • https://stackoverflow.com/questions/19927654/using-dateutil-parser-to-parse-a-date-in-another-language

ICU

  • https://unicode-org.github.io/icu/userguide/format_parse/datetime/examples.html#parse

  • https://gist.github.com/dpk/8325992

  • https://unicode-org.github.io/icu-docs/apidoc/released/icu4c/classicu_1_1DateFormat.html

  • https://unicode-org.github.io/icu/userguide/

  • https://unicode-org.github.io/icu-docs/#/icu4c/

  • https://github.com/ovalhub/pyicu/blob/master/samples/break.py

  • https://www.unicode.org/reports/tr35/tr35-dates.html#Date_Field_Symbol_Table

  • https://www.unicode.org/reports/tr35/tr35-dates.html#months_days_quarters_eras

  • https://unicode-org.github.io/icu/userguide/format_parse/datetime/#formatting-dates-and-times-overview

© Copyright 2021, ibu radempa.