Parse Wiktionary's HTML directly instead of using the API #38

johnfactotum · 2022-11-15T18:30:57Z

Wiktionary's definition API is buggy and lacking in features. Parsing the HTML seems like a better option. AFAICT the API itself is also just parsing the HTML.

The main problem is how to identify the language for each section. The API actually does a poor job at this. It apparently relies on a small, hardcoded name -> language code table, which would explain why uncommon languages (such as Old Norse) are not given the proper code but instead get dumped in a field called other (see for example https://en.wiktionary.org/api/rest_v1/page/definition/rannsaka). Since we already know the target language code, it would be better to get a display name for it using the Intl API and match that against the headings. Another option would be to look for the .headword class with the desired lang attribute, but this won't work in cases where the headword template is absent (e.g. the example in #14 (comment), which only has a single {{see-ja}} template and nothing else).
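A minimal sketch of the display-name idea, in TypeScript. The names `languageName` and `findLanguageHeading` are hypothetical, and it assumes the heading texts have already been extracted from the page as plain strings. `Intl.DisplayNames` is a standard API, but the names it returns depend on the runtime's locale data, so they may not always match Wiktionary's heading text exactly.

```typescript
// Get the display name of a language code in the given UI locale.
// Returns undefined if the runtime has no name for it.
function languageName(code: string, locale = "en"): string | undefined {
    try {
        return new Intl.DisplayNames([locale], {
            type: "language",
            fallback: "none", // undefined instead of echoing the code back
        }).of(code);
    } catch {
        return undefined; // structurally invalid code or locale
    }
}

// Normalize for comparison: unify Unicode form, trim, and lowercase.
const normalize = (s: string): string => s.normalize("NFC").trim().toLowerCase();

// Return the index of the heading whose text matches the language's
// display name, or -1 if there is no match.
// `headings` is assumed to be the text of each language-level heading.
function findLanguageHeading(headings: string[], code: string, locale = "en"): number {
    const name = languageName(code, locale);
    if (name === undefined) return -1;
    const target = normalize(name);
    return headings.findIndex(h => normalize(h) === target);
}
```

Returning -1 when there is no match leaves room for a fallback, such as the .headword lookup mentioned above.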

The above only applies to the English instance of Wiktionary. It would be good to support other Wiktionaries as well. The good news is that French and Spanish Wiktionary both use templates (https://fr.wiktionary.org/wiki/Mod%C3%A8le:langue and https://es.wiktionary.org/wiki/Plantilla:lengua) for the language headings, and they have ID attributes set to the language code.

Russian, Japanese, and Esperanto Wiktionary also use templates for the language headings, but the rendered HTML does not contain the language code, so the templates don't help here. One needs to either copy the language names from the template's source code or use the display name API.

After mapping each heading to a language, one only has to associate all other sibling elements with the nearest preceding heading, and the rest should be relatively straightforward.
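The grouping step could look roughly like the sketch below. It is not tied to any real page structure: the `Block` type and `groupByLanguage` are hypothetical, and it assumes the page body has already been flattened into a list in which each language heading is tagged with its code.

```typescript
// A flattened view of the page body: each item is either a language
// heading (already mapped to a language code) or some other element.
type Block =
    | { kind: "heading"; lang: string }
    | { kind: "content"; html: string };

// Attach every non-heading block to the nearest preceding heading.
// Blocks that appear before the first heading are dropped.
function groupByLanguage(blocks: Block[]): Map<string, string[]> {
    const sections = new Map<string, string[]>();
    let current: string[] | undefined;
    for (const block of blocks) {
        if (block.kind === "heading") {
            // Reuse the existing list if the same language appears twice
            current = sections.get(block.lang) ?? [];
            sections.set(block.lang, current);
        } else if (current) {
            current.push(block.html);
        }
    }
    return sections;
}
```

With this shape, looking up the target language is a single `sections.get(code)` call, and a missing key means the page has no section for that language.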


johnfactotum added the enhancement label Nov 15, 2022
