Parse Wiktionary's HTML directly instead of using the API #38

Open
johnfactotum opened this issue Nov 15, 2022 · 0 comments
Labels
enhancement New feature or request

Comments

Owner

johnfactotum commented Nov 15, 2022

Wiktionary's definition API is buggy and lacking in features. Parsing the HTML seems like a better option, and AFAICT the API itself is also just parsing the HTML.

The main problem is how to identify the language of each section. The API does a poor job of this. It appears to rely on a small, hardcoded name -> language code table, which would explain why uncommon languages (such as Old Norse) don't get the proper code and are instead dumped into a field called other (see for example https://en.wiktionary.org/api/rest_v1/page/definition/rannsaka). Since we already know the target language code, it would be better to get a display name from the Intl API (Intl.DisplayNames) and match that against the headings. Another option would be to look for the .headword class with the desired lang attribute, but this won't work when the headword template is absent (e.g. the example in #14 (comment), which only has a single {{see-ja}} template and nothing else).
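
A rough sketch of both ideas for English Wiktionary. The helper names (`languageName`, `findLanguageHeading`, `headwordSelector`) are made up for illustration, and it assumes the heading text is exactly the English display name that `Intl.DisplayNames` returns, which may not hold for every language:

```javascript
// Display name of a language code as written in `locale`,
// e.g. languageName('fr') gives the English name of French.
function languageName(code, locale = 'en') {
    return new Intl.DisplayNames([locale], { type: 'language' }).of(code);
}

// `headings` is an array of heading texts in document order. In a browser it
// could be collected with something like:
//   Array.from(doc.querySelectorAll('h2'), el => el.textContent.trim())
// (using h2 for language headings is an assumption about the page markup).
// Returns the index of the matching heading, or -1 if none matches.
function findLanguageHeading(headings, code) {
    const name = languageName(code);
    return headings.findIndex(text => text === name);
}

// Fallback: CSS selector for headword elements tagged with the language.
// The code is validated so it can be embedded in the selector safely.
function headwordSelector(code) {
    if (!/^[A-Za-z0-9-]+$/.test(code)) {
        throw new Error(`unexpected language code: ${code}`);
    }
    return `.headword[lang="${code}"]`;
}
```

The selector would be used as `doc.querySelector(headwordSelector(code))`, and only helps on pages where a headword template is actually present.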

The above only applies to the English instance of Wiktionary. It would be good to support other Wiktionaries as well. The good news is that French and Spanish Wiktionary both use templates (https://fr.wiktionary.org/wiki/Mod%C3%A8le:langue and https://es.wiktionary.org/wiki/Plantilla:lengua) for the language headings, and they have ID attributes set to the language code.
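
For these two, the lookup could be as small as the sketch below. `findHeadingById` is a hypothetical helper, and it assumes the ID is exactly the language code, as described above:

```javascript
// Sketch: locate the language heading on French or Spanish Wiktionary.
// `doc` is a Document, e.g. from
//   new DOMParser().parseFromString(html, 'text/html')
// Returns the element whose id equals the language code, or null.
function findHeadingById(doc, code) {
    return doc.getElementById(code);
}
```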

Russian, Japanese, and Esperanto Wiktionary also use templates for the language headings, but the rendered HTML does not contain the language code, so the templates don't help with matching. One needs to either copy the name mapping from the template's source code or use the Intl display name API.
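
A sketch of the display name route for these editions. It asks `Intl.DisplayNames` for the name in the Wiktionary's own language and compares case-insensitively, since the headings may be capitalized differently from what `Intl` returns. `findLocalizedHeading` and `normalizeName` are hypothetical names, and whether the names from `Intl` actually match the ones each Wiktionary uses is an untested assumption:

```javascript
// Normalize a name for comparison: Unicode NFC, trimmed, lowercased
// according to the rules of `locale`.
function normalizeName(text, locale) {
    return text.normalize('NFC').trim().toLocaleLowerCase(locale);
}

// `headings` is an array of heading texts in document order.
// `code` is the target language, `wiktionaryLocale` the language of the
// Wiktionary edition (e.g. 'ru', 'ja', 'eo').
// Returns the index of the matching heading, or -1 if none matches.
function findLocalizedHeading(headings, code, wiktionaryLocale) {
    const names = new Intl.DisplayNames([wiktionaryLocale], { type: 'language' });
    const target = normalizeName(names.of(code), wiktionaryLocale);
    return headings.findIndex(
        text => normalizeName(text, wiktionaryLocale) === target);
}
```

If a runtime lacks data for the requested locale, `Intl.DisplayNames` falls back to another locale, so a miss here would still need the copied-from-template table as a backup.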

After mapping each heading to a language, one only has to associate each of the other sibling elements with the nearest preceding heading, and the rest should be relatively straightforward.
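
The grouping step might look roughly like this. `groupByHeading` is a hypothetical helper; what counts as a language heading is left to a caller-supplied predicate, since that differs between Wiktionary editions:

```javascript
// Sketch: attach every non-heading node to the nearest preceding heading.
// `nodes` is an array of siblings in document order, e.g.
//   Array.from(container.children)
// `isHeading` decides which nodes start a new section, e.g.
//   el => el.tagName === 'H2'   (an assumption about the markup)
function groupByHeading(nodes, isHeading) {
    const sections = [];
    let current = null;
    for (const node of nodes) {
        if (isHeading(node)) {
            // Start a new section at each heading
            current = { heading: node, content: [] };
            sections.push(current);
        } else if (current) {
            current.content.push(node);
        }
        // Nodes before the first heading belong to no section and are skipped
    }
    return sections;
}
```

The section for the target language is then the one whose `heading` was matched earlier, and its `content` holds the elements to parse for definitions.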

@johnfactotum johnfactotum added the enhancement New feature or request label Nov 15, 2022