TODO

  • parse HTML <time> tags

  • site annotations:

    • categories

      • historical (no changes for n months)

      • news

    • local focus - geonames: http://download.geonames.org/export/dump/cities15000.zip

  • allow for TLS in Elasticsearch config

  • replace dashes, dots and quotes: https://github.com/kovidgoyal/calibre/blob/3dd95981398777f3c958e733209f3583e783b98c/src/calibre/utils/unsmarten.py

        '–': '--',
        '—': '---',
        '…': '...',
        '“': '"',
        '”': '"',
        '„': '"',
        '″': '"',
        '‘': "'",
        '’': "'",
        '′': "'",

  • normalize quotation marks and punctuation in general

    • https://unicode-table.com/en/sets/quotation-marks/

    • https://github.com/avian2/unidecode/blob/master/unidecode/x020.py

    • https://www.fileformat.info/info/unicode/category/Po/list.htm

    • https://www.gaijin.at/en/infos/unicode-character-table-punctuation

  • cancel crawls that take too long

  • search for “TODO” in code

  • feedparser supports JSON feeds since commit a5939702b1fd0ec75d2b586255ff0e29e5a8a6fc (as of 2020-10-26 only in the “develop” branch, not part of a release); the version names are ‘json1’ and ‘json11’

  • allow site URLs with path, e.g. https://web.archive.org/web/20090320055457/http://www.geocities.com/kk_abacus/

  • add more languages
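
Sketch for the “replace dashes, dots and quotes” item above, assuming a plain character-to-string substitution is enough. The names UNSMARTEN and unsmarten are made up for this note and are not taken from calibre or from existing code; the table holds only the characters listed in that item, so the general punctuation normalization would need more entries.

```python
# Hypothetical helper, not existing project code.
# The table mirrors the characters listed in the TODO item.
UNSMARTEN = {
    "\u2013": "--",   # en dash
    "\u2014": "---",  # em dash
    "\u2026": "...",  # horizontal ellipsis
    "\u201c": '"',    # left double quotation mark
    "\u201d": '"',    # right double quotation mark
    "\u201e": '"',    # double low-9 quotation mark
    "\u2033": '"',    # double prime
    "\u2018": "'",    # left single quotation mark
    "\u2019": "'",    # right single quotation mark
    "\u2032": "'",    # prime
}

# str.maketrans accepts single-character keys mapped to strings of any length.
_TABLE = str.maketrans(UNSMARTEN)


def unsmarten(text: str) -> str:
    """Replace typographic punctuation with plain ASCII equivalents."""
    return text.translate(_TABLE)
```

Building the translation table once and calling str.translate keeps the replacement to a single pass over the text, which should matter when it runs on every crawled document.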

Ideas