Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Reorder reverse caps characters table for string lower case conversion #26760

Merged
merged 1 commit into from
Apr 8, 2019
Merged

Reorder reverse caps characters table for string lower case conversion #26760

merged 1 commit into from
Apr 8, 2019

Conversation

Xrayez
Copy link
Contributor

@Xrayez Xrayez commented Mar 7, 2019

The binary search algorithm used to lookup character codes in the table relies that the data must be ordered. This fixes to_lower() string method to convert upper case to lower case properly, so that the
algorithm doesn't terminate prematurely.

Fixes #26744 and potentially more languages which are not Latin.

@Chaosus Chaosus added this to the 3.1 milestone Mar 7, 2019
@bruvzg
Copy link
Member

bruvzg commented Mar 7, 2019

Case mapping in core/ucaps.h is definitely incomplete, quick Unicode Character Database gives 1407 lower->upper and 1390 upper->lower one-to-one records, plus there are one-to-many special mappings.

Unicode case mapping FAQ: https://unicode.org/faq/casemap_charprop.html

@akien-mga akien-mga modified the milestones: 3.1, 3.2 Mar 7, 2019
@Xrayez
Copy link
Contributor Author

Xrayez commented Mar 7, 2019

@bruvzg so you say that we should create a script to generate those mappings from that database? That's a ton of mappings and might decrease performance as well, hmm.

I managed to finish creating a unit test for this case, for some reason a simple == operator wouldn't do it. Probably because it converts both strings with c_str which is not unicode? (which is yet another bug to fix?) found solution.

@Xrayez Xrayez marked this pull request as ready for review March 7, 2019 17:04
@Xrayez Xrayez requested a review from reduz as a code owner March 7, 2019 17:04
@breakmt
Copy link

breakmt commented Mar 7, 2019

@Xrayez I'm sorry, but I think unit test is a little bit incorrect - doesn't your cycle check just 1 letter?

@bruvzg
Copy link
Member

bruvzg commented Mar 7, 2019

@bruvzg so you say that we should create a script to generate those mappings from that database? That's a ton of mappings and might decrease performance as well, hmm.

Probably. Mappings should have many long runs, we could compact each run into a single record.
For example something like Go unicode package do (same binary search but with a bit more complex table):

Remap function:
https://github.com/golang/go/blob/50bd1c4d4eb4fac8ddeb5f063c099daccfb71b26/src/unicode/letter.go#L210

Table:
https://github.com/golang/go/blob/master/src/unicode/tables.go#L6670

@Xrayez
Copy link
Contributor Author

Xrayez commented Mar 7, 2019

@breakmt thanks for pointing out, a mistake from copy pasting... And surely the test doesn't pass again!

It does work in GDScript, can't understand why it doesn't work in C++, should I use unicode literals like \u0444 perhaps?

EDIT: no, using Cyrillic unicode literals doesn't work either.
EDIT2: interestingly, with one Cyrillic character it worked fine.

@Xrayez
Copy link
Contributor Author

Xrayez commented Mar 7, 2019

Ah yes, forgot about different types of string literals in C++:

L"АБВГДЕЁЖЗИЙКЛМНОПРСТУФХЦЧШЩЪЫЬЭЮЯ"

would be the correct unicode one that handles wchar_t type, the test passes!

@reduz
Copy link
Member

reduz commented Apr 4, 2019

is this the right PR, or the other one? :)

The binary search algorithm used to lookup character codes in the table
relies that the data must be ordered. This fixes `to_lower()` string
method to convert upper case to lower case properly, so that the
algorithm doesn't terminate prematurely.

Co-authored-by: AndreevAndrei (avandrei) <avandrei@MacBookAAV.local>
@Xrayez
Copy link
Contributor Author

Xrayez commented Apr 5, 2019

@reduz @akien-mga oh so my PR has unit test covered for this bug which #27598 doesn't:
https://github.com/godotengine/godot/blob/4858d2cd9b14249cbbdbc9fcf62c8e6561d8f292/main/tests/test_string.cpp#L1056-L1067

(also reordered commented out "dotless I" the same way)

Either way I added @AndreevAndrei as a co-author, the only problem is that he isn't detected by GitHub (git credentials are not matching) so his contribution will not be registered...

If @AndreevAndrei wants his contribution registered, he could add me as a co-author instead with these changes, or fix his git credentials first to match GitHub (email and username). 😃

@akien-mga akien-mga merged commit 1ca555f into godotengine:master Apr 8, 2019
@akien-mga
Copy link
Member

Thanks!

@Xrayez Xrayez deleted the 26744-fix-string-to-lower branch April 8, 2019 08:59
@akien-mga
Copy link
Member

Cherry-picked for 3.1.1.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

to_lower() method bug in Russian (cyrillic)
6 participants