Related work
crawlers
https://github.com/riteshnaik/Crawling-and-Deduplication-of-Polar-Datasets-Using-Nutch-and-Tika
general
sitemap parsers
url handling
language detection
text extraction
deduplication
remove paragraphs in which more than 50% of the word 7-tuples (sequences of 7 consecutive words) were encountered previously
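The paragraph-level rule above could be sketched roughly as follows. This is only a minimal illustration, not the implementation of any of the projects listed here. The function names, the tokenization by `\w+`, the lowercasing, and the choice to keep paragraphs shorter than 7 words are all assumptions made for the example.

```python
import re
from typing import Iterable, List, Set, Tuple

Shingle = Tuple[str, ...]


def word_shingles(text: str, n: int = 7) -> List[Shingle]:
    """Return all tuples of n consecutive words (assumed tokenization: \\w+, lowercased)."""
    words = re.findall(r"\w+", text.lower())
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]


def dedup_paragraphs(paragraphs: Iterable[str], n: int = 7,
                     threshold: float = 0.5) -> List[str]:
    """Drop paragraphs whose share of already-seen word n-tuples exceeds threshold."""
    seen: Set[Shingle] = set()
    kept: List[str] = []
    for paragraph in paragraphs:
        shingles = word_shingles(paragraph, n)
        # Assumption: paragraphs with fewer than n words have no shingles and are kept.
        ratio = 0.0
        if shingles:
            ratio = sum(1 for s in shingles if s in seen) / len(shingles)
        if ratio <= threshold:
            kept.append(paragraph)
        # Assumption: shingles of dropped paragraphs are remembered as well.
        seen.update(shingles)
    return kept
```

Keeping the `seen` set across documents rather than resetting it per page would extend the same rule to corpus-wide deduplication, at the cost of memory that grows with the number of distinct 7-tuples.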
Extract more meta tags
https://github.com/shareaholic/shareaholic-api-docs/blob/master/shareaholic_meta_tags.md
https://support.shareaholic.com/hc/en-us/articles/115003085186
Language-dependent date parsing
https://en.wikipedia.org/wiki/Date_format_by_country
https://en.wikipedia.org/wiki/Common_Locale_Data_Repository
https://pypi.org/project/dateparser/
https://github.com/ovalhub/pyicu
https://github.com/night-crawler/cldr-language-helpers
https://stackoverflow.com/questions/19927654/using-dateutil-parser-to-parse-a-date-in-another-language
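To illustrate why date parsing depends on the language, here is a small standard-library sketch. It does not use any of the libraries linked above. The month tables and the day-first flags are hand-written, hypothetical stand-ins for data that would normally come from CLDR, and `parse_date` is an illustrative name, not an existing API.

```python
import re
from datetime import date
from typing import Optional

# Hypothetical, hand-written tables for illustration only.
# A real system would load month names and field order from CLDR data.
MONTH_NAMES = {
    "en-US": ["january", "february", "march", "april", "may", "june", "july",
              "august", "september", "october", "november", "december"],
    "de": ["januar", "februar", "märz", "april", "mai", "juni", "juli",
           "august", "september", "oktober", "november", "dezember"],
    "fr": ["janvier", "février", "mars", "avril", "mai", "juin", "juillet",
           "août", "septembre", "octobre", "novembre", "décembre"],
}
# Whether all-numeric dates put the day before the month.
DAY_FIRST = {"en-US": False, "de": True, "fr": True}


def parse_date(text: str, lang: str) -> Optional[date]:
    """Parse a few common date shapes for the given language, or return None."""
    months = MONTH_NAMES.get(lang)
    if months is None:
        return None
    text = text.strip().lower()
    try:
        # All-numeric form, e.g. 03/04/2021 or 03.04.2021: field order is locale-dependent.
        m = re.fullmatch(r"(\d{1,2})[./-](\d{1,2})[./-](\d{4})", text)
        if m:
            a, b, year = (int(g) for g in m.groups())
            day, month = (a, b) if DAY_FIRST[lang] else (b, a)
            return date(year, month, day)
        # Day, month name, year, e.g. "3. März 2021" or "4 août 2021".
        m = re.fullmatch(r"(\d{1,2})\.?\s+(\w+)\s+(\d{4})", text)
        if m and m.group(2) in months:
            return date(int(m.group(3)), months.index(m.group(2)) + 1, int(m.group(1)))
        # Month name, day, year, e.g. "March 3, 2021".
        m = re.fullmatch(r"(\w+)\s+(\d{1,2}),?\s+(\d{4})", text)
        if m and m.group(1) in months:
            return date(int(m.group(3)), months.index(m.group(1)) + 1, int(m.group(2)))
    except ValueError:
        # Out-of-range day or month, e.g. 13/25/2021.
        return None
    return None
```

The same numeric string yields different dates depending on the language passed in, which is the main reason the page language (or a per-country convention) has to be known before dates are extracted.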
ICU
https://unicode-org.github.io/icu/userguide/format_parse/datetime/examples.html#parse
https://gist.github.com/dpk/8325992
https://unicode-org.github.io/icu-docs/apidoc/released/icu4c/classicu_1_1DateFormat.html
https://unicode-org.github.io/icu/userguide/
https://unicode-org.github.io/icu-docs/#/icu4c/
https://github.com/ovalhub/pyicu/blob/master/samples/break.py
https://www.unicode.org/reports/tr35/tr35-dates.html#Date_Field_Symbol_Table
https://www.unicode.org/reports/tr35/tr35-dates.html#months_days_quarters_eras
https://unicode-org.github.io/icu/userguide/format_parse/datetime/#formatting-dates-and-times-overview