This repository was archived by the owner on Oct 3, 2022. It is now read-only.

adopt IETF language tags (BCP 47) #33

Open
jwilk opened this issue Feb 6, 2019 · 2 comments

Comments

@jwilk
Member

jwilk commented Feb 6, 2019

We should use IETF language tags (BCP 47) instead of ISO 639-2 codes or the non-standard names that Tesseract uses.

@jwilk
Member Author

jwilk commented Oct 3, 2022

Here's a (partial) mapping between Tesseract script names and ISO 15924 script codes:

| Tesseract | ISO 15924 |
|---|---|
| Arabic | Arab |
| Armenian | Armn |
| Bengali | Beng |
| Canadian_Aboriginal | Cans |
| Cherokee | Cher |
| Cyrillic | Cyrl |
| Devanagari | Deva |
| Ethiopic | Ethi |
| Fraktur | Latf |
| Georgian | Geor |
| Greek | Grek |
| Gujarati | Gujr |
| Gurmukhi | Guru |
| HanS | Hans |
| HanT | Hant |
| Hangul | Hang |
| Hebrew | Hebr |
| Japanese | Jpan |
| Kannada | Knda |
| Khmer | Khmr |
| Lao | Laoo |
| Latin | Latn |
| Malayalam | Mlym |
| Myanmar | Mymr |
| Oriya | Orya |
| Sinhala | Sinh |
| Syriac | Syrc |
| Tamil | Taml |
| Telugu | Telu |
| Thaana | Thaa |
| Thai | Thai |
| Tibetan | Tibt |

The table doesn't cover:

  • vertical writing: HanS_vert, HanT_vert, Hangul_vert, Japanese_vert
  • Vietnamese
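
For illustration, the script table above could be carried as a plain lookup. This is only a sketch: the names `TESSERACT_SCRIPT_TO_ISO15924` and `script_to_iso15924` are made up here, and names the table doesn't cover (the `_vert` variants, `Vietnamese`) deliberately come back as `None` rather than being guessed.

```python
# Partial mapping copied from the table above:
# Tesseract script name -> ISO 15924 script code.
TESSERACT_SCRIPT_TO_ISO15924 = {
    "Arabic": "Arab", "Armenian": "Armn", "Bengali": "Beng",
    "Canadian_Aboriginal": "Cans", "Cherokee": "Cher", "Cyrillic": "Cyrl",
    "Devanagari": "Deva", "Ethiopic": "Ethi", "Fraktur": "Latf",
    "Georgian": "Geor", "Greek": "Grek", "Gujarati": "Gujr",
    "Gurmukhi": "Guru", "HanS": "Hans", "HanT": "Hant", "Hangul": "Hang",
    "Hebrew": "Hebr", "Japanese": "Jpan", "Kannada": "Knda", "Khmer": "Khmr",
    "Lao": "Laoo", "Latin": "Latn", "Malayalam": "Mlym", "Myanmar": "Mymr",
    "Oriya": "Orya", "Sinhala": "Sinh", "Syriac": "Syrc", "Tamil": "Taml",
    "Telugu": "Telu", "Thaana": "Thaa", "Thai": "Thai", "Tibetan": "Tibt",
}


def script_to_iso15924(name):
    """Return the ISO 15924 code for a Tesseract script name.

    Returns None for names the table does not cover (hypothetical helper).
    """
    return TESSERACT_SCRIPT_TO_ISO15924.get(name)


print(script_to_iso15924("Fraktur"))     # Latf
print(script_to_iso15924("HanS_vert"))   # None: vertical writing is not covered
print(script_to_iso15924("Vietnamese"))  # None: not covered
```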

@jwilk
Member Author

jwilk commented Oct 3, 2022

Most Tesseract language codes are either ISO 639-2 or ISO 639-3 codes, possibly with some non-standard suffixes.

Here's a mapping between Tesseract suffixes and ISO 15924 script codes:

| Tesseract | ISO 15924 |
|---|---|
| ara | Arab |
| cyrl | Cyrl |
| frak | Latf |
| latn | Latn |
| sim | Hans |
| tra | Hant |

It's not clear how to map these:

  • vertical writing: chi_sim_vert, chi_tra_vert, jpn_vert, kor_vert
  • "old" variants: spa_old, kat_old, ita_old
  • equ ("Math / equation detection module")
  • osd ("Orientation and script detection module")
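
A rough sketch of how the suffix table could be applied when splitting a Tesseract language code. The function name `split_tesseract_code` is hypothetical, and the sketch assumes suffixes are separated by `_`. Suffixes with no clear mapping (`vert`, `old`) are returned untouched instead of being mapped.

```python
# Suffix table from this comment: Tesseract suffix -> ISO 15924 script code.
SUFFIX_TO_SCRIPT = {
    "ara": "Arab",
    "cyrl": "Cyrl",
    "frak": "Latf",
    "latn": "Latn",
    "sim": "Hans",
    "tra": "Hant",
}


def split_tesseract_code(code):
    """Split a Tesseract language code into (base, script, unmapped suffixes).

    Hypothetical helper: `script` is None when no suffix names a script, and
    suffixes without a known mapping are passed through in `unmapped`.
    """
    base, *suffixes = code.split("_")
    script = None
    unmapped = []
    for suffix in suffixes:
        if script is None and suffix in SUFFIX_TO_SCRIPT:
            script = SUFFIX_TO_SCRIPT[suffix]
        else:
            unmapped.append(suffix)
    return base, script, unmapped


print(split_tesseract_code("chi_sim"))       # ('chi', 'Hans', [])
print(split_tesseract_code("chi_sim_vert"))  # ('chi', 'Hans', ['vert'])
print(split_tesseract_code("spa_old"))       # ('spa', None, ['old'])
print(split_tesseract_code("osd"))           # ('osd', None, [])
```

This only separates the parts. Turning `base` into a BCP 47 language subtag would still need its own lookup, since BCP 47 uses the two-letter ISO 639-1 code where one exists.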
