Core: Automate generation of the `char_range.inc` file #101878

Chubercik · 2025-01-21T15:30:04Z

This PR adds a Python script that regenerates the `char_range.inc` file, so it can be updated whenever a new version of the Unicode Character Database (UCD) is released.

dalexeev

Your script does not merge adjacent ranges, and it also produces an unnecessary diff because of differences in literal formatting. I modified your script to reduce the diff:

Script patch

@@ -1,16 +1,35 @@
 """
-Script used to dump char ranges
-for specific properties from
-the Unicode Character Database
-to the `char_range.inc` file.
+Script used to dump char ranges for specific properties from
+the Unicode Character Database to the `char_range.inc` file.
 """
 
 import os
 from typing import List, Tuple
 from urllib.request import urlopen
 
-URL = "https://www.unicode.org/Public/16.0.0/ucd/DerivedCoreProperties.txt"
 
+def merge_ranges(ranges: List[Tuple[str, str]]) -> List[Tuple[str, str]]:
+    if len(ranges) < 2:
+        return ranges
+
+    result: List[Tuple[str, str]] = []
+    last_start: int = int(ranges[0][0], 16)
+    last_end: int = int(ranges[0][1], 16)
+    for i in range(1, len(ranges)):
+        curr: Tuple[str, str] = ranges[i]
+        curr_start: int = int(curr[0], 16)
+        curr_end: int = int(curr[1], 16)
+        if last_end + 1 == curr_start:
+            last_end = curr_end
+        else:
+            result.append(("0x%x" % last_start, "0x%x" % last_end))
+            last_start = curr_start
+            last_end = curr_end
+    result.append(("0x%x" % last_start, "0x%x" % last_end))
+    return result
+
+
+URL = "https://www.unicode.org/Public/16.0.0/ucd/DerivedCoreProperties.txt"
 
 lines = [line.decode("utf-8") for line in urlopen(URL)]
 
@@ -20,9 +39,8 @@ uppercase_letter: List[Tuple[str, str]] = []
 lowercase_letter: List[Tuple[str, str]] = []
 unicode_letter: List[Tuple[str, str]] = []
 
-# Underscore technically isn't in XID_Start,
-# but for our purposes it's included.
-xid_start.append(("0x005F", "0x005F"))
+# Underscore technically isn't in XID_Start, but for our purposes it's included.
+xid_start.append(("0x005f", "0x005f"))
 
 for line in lines:
     if line.startswith("#") or not line.strip():
@@ -37,6 +55,8 @@ for line in lines:
     range_end = char_range
     if ".." in char_range:
         range_start, range_end = char_range.split("..")
+    range_start = range_start.lower()
+    range_end = range_end.lower()
 
     if char_property == "XID_Start":
         xid_start.append((f"0x{range_start}", f"0x{range_end}"))
@@ -51,6 +71,11 @@ for line in lines:
 
 xid_start.sort(key=lambda x: int(x[0], 16))
 
+xid_start = merge_ranges(xid_start)
+xid_continue = merge_ranges(xid_continue)
+uppercase_letter = merge_ranges(uppercase_letter)
+lowercase_letter = merge_ranges(lowercase_letter)
+unicode_letter = merge_ranges(unicode_letter)
 
 char_range_str = f"""/**************************************************************************/
 /*  char_range.inc                                                        */
@@ -81,16 +106,22 @@ char_range_str = f"""/**********************************************************
 /* TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE      */
 /* SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.                 */
 /**************************************************************************/
+
+// This file was generated using the `char_range_fetch.py` script.
+
 #ifndef CHAR_RANGE_INC
 #define CHAR_RANGE_INC
+
 #include "core/typedefs.h"
+
 // Unicode Derived Core Properties
 // Source: {URL}
-// This file was generated using the `char_range_fetch.py` script.
+
 struct CharRange {{
 \tchar32_t start;
 \tchar32_t end;
 }};
+
 constexpr inline CharRange xid_start[] = {{
 \t"""
 
@@ -99,6 +130,7 @@ for start, end in xid_start:
 
 char_range_str = char_range_str[:-1]  # Remove trailing tab.
 char_range_str += """};
+
 constexpr inline CharRange xid_continue[] = {
 \t"""
 
@@ -107,6 +139,7 @@ for start, end in xid_continue:
 
 char_range_str = char_range_str[:-1]  # Remove trailing tab.
 char_range_str += """};
+
 constexpr inline CharRange uppercase_letter[] = {
 \t"""
 
@@ -115,6 +148,7 @@ for start, end in uppercase_letter:
 
 char_range_str = char_range_str[:-1]  # Remove trailing tab.
 char_range_str += """};
+
 constexpr inline CharRange lowercase_letter[] = {
 \t"""
 
@@ -123,6 +157,7 @@ for start, end in lowercase_letter:
 
 char_range_str = char_range_str[:-1]  # Remove trailing tab.
 char_range_str += """};
+
 constexpr inline CharRange unicode_letter[] = {
 \t"""
 
@@ -131,6 +166,7 @@ for start, end in unicode_letter:
 
 char_range_str = char_range_str[:-1]  # Remove trailing tab.
 char_range_str += """};
+
 #endif // CHAR_RANGE_INC
 """

After that I got this:

Source diff

@@ -28,6 +28,8 @@
 /* SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.                 */
 /**************************************************************************/
 
+// This file was generated using the `char_range_fetch.py` script.
+
 #ifndef CHAR_RANGE_INC
 #define CHAR_RANGE_INC
 
@@ -43,7 +45,7 @@ struct CharRange {
 
 constexpr inline CharRange xid_start[] = {
 	{ 0x41, 0x5a },
-	{ 0x5f, 0x5f }, // Underscore technically isn't in XID_Start, but for our purposes it's included.
+	{ 0x5f, 0x5f },
 	{ 0x61, 0x7a },
 	{ 0xaa, 0xaa },
 	{ 0xb5, 0xb5 },

Feel free to modify the script further, as I made minimal changes and the current version is probably not optimal.
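For reference, the merging step can be looked at in isolation. The following is a minimal sketch, not the script's actual code: it assumes the ranges are already sorted by start and do not overlap, and the function name `merge_adjacent` and the sample values are made up for illustration.

```python
from typing import List, Tuple


def merge_adjacent(ranges: List[Tuple[int, int]]) -> List[Tuple[int, int]]:
    """Merge ranges where one ends exactly one code point before the next starts.

    Assumes `ranges` is sorted by start and contains no overlapping entries.
    """
    merged: List[Tuple[int, int]] = []
    for start, end in ranges:
        if merged and merged[-1][1] + 1 == start:
            # Adjacent to the previous range: extend it instead of appending.
            merged[-1] = (merged[-1][0], end)
        else:
            merged.append((start, end))
    return merged


# Illustrative input, not copied from the real data file.
sample = [(0xF8, 0x1BA), (0x1BB, 0x1BB), (0x1BC, 0x1BF), (0x2C6, 0x2D1)]
print([(hex(s), hex(e)) for s, e in merge_adjacent(sample)])
# prints [('0xf8', '0x1bf'), ('0x2c6', '0x2d1')]
```

The first three sample ranges collapse into one because each starts right after the previous one ends; the last range is kept separate because there is a gap before it.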

Chubercik · 2025-01-21T18:07:14Z

Thanks, I didn't take into consideration that ranges can be adjacent 😅

As for the formatting differences, I'd like to stick to the way code points are written in the UCD documents (and in two other places in the Godot codebase, see #90726 and #101880). That can be handled in a separate PR, though, since this one is far easier to validate if the diff isn't all over the place.
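To make the two literal formats concrete, here is a small sketch. The helper names `ucd_style` and `compact_style` are hypothetical; the first mimics how UCD files write code points (uppercase hex, padded to at least four digits), the second mimics the lowercase, unpadded literals visible in the existing `char_range.inc` diff above.

```python
def ucd_style(code_point: int) -> str:
    # Uppercase hex, zero-padded to at least four digits.
    return f"0x{code_point:04X}"


def compact_style(code_point: int) -> str:
    # Lowercase hex with no padding.
    return f"0x{code_point:x}"


for cp in (0x5F, 0x1BB, 0x1F600):
    print(ucd_style(cp), compact_style(cp))
# prints:
# 0x005F 0x5f
# 0x01BB 0x1bb
# 0x1F600 0x1f600
```

Both forms denote the same values in C++, so the choice only affects how large the diff of the generated file is.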

dalexeev

Patch

@@ -1,170 +1,137 @@
-"""
-Script used to dump char ranges
-for specific properties from
-the Unicode Character Database
-to the `char_range.inc` file.
-"""
-
-import os
+#!/usr/bin/env python3
+
+# Script used to dump char ranges for specific properties from
+# the Unicode Character Database to the `char_range.inc` file.
+# NOTE: This script is deliberately not integrated into the build system;
+# you should run it manually whenever you want to update data.
+
+import os, sys
 from typing import List, Tuple
 from urllib.request import urlopen
 
-URL = "https://www.unicode.org/Public/16.0.0/ucd/DerivedCoreProperties.txt"
+if __name__ == "__main__":
+    sys.path.insert(1, os.path.abspath("../../"))
+
+from methods import generate_copyright_header
 
 
-def int_as_hex(i: int) -> str:
-    return f"0x{i:x}"
+URL: str = "https://www.unicode.org/Public/16.0.0/ucd/DerivedCoreProperties.txt"
 
+xid_start: List[Tuple[int, int]] = []
+xid_continue: List[Tuple[int, int]] = []
+uppercase_letter: List[Tuple[int, int]] = []
+lowercase_letter: List[Tuple[int, int]] = []
+unicode_letter: List[Tuple[int, int]] = []
 
-def merge_ranges(ranges: List[Tuple[str, str]]) -> List[Tuple[str, str]]:
+
+def merge_ranges(ranges: List[Tuple[int, int]]) -> None:
     if len(ranges) < 2:
-        return ranges
+        return
+
+    last_start: int = ranges[0][0]
+    last_end: int = ranges[0][1]
+    original_ranges: List[Tuple[int, int]] = ranges[1:]
 
-    result: List[Tuple[str, str]] = []
-    last_start = int(ranges[0][0], 16)
-    last_end = int(ranges[0][1], 16)
+    ranges.clear()
 
-    for curr_range in ranges[1:]:
-        curr_start = int(curr_range[0], 16)
-        curr_end = int(curr_range[1], 16)
+    for curr_range in original_ranges:
+        curr_start: int = curr_range[0]
+        curr_end: int = curr_range[1]
         if last_end + 1 != curr_start:
-            result.append((int_as_hex(last_start), int_as_hex(last_end)))
+            ranges.append((last_start, last_end))
             last_start = curr_start
         last_end = curr_end
-    result.append((int_as_hex(last_start), int_as_hex(last_end)))
+
+    ranges.append((last_start, last_end))
+
+
+def parse_unicode_data() -> None:
+    lines: List[str] = [line.decode("utf-8") for line in urlopen(URL)]
+
+    for line in lines:
+        if line.startswith("#") or not line.strip():
+            continue
+
+        split_line: list[str] = line.split(";")
+
+        char_range: str = split_line[0].strip()
+        char_property: str = split_line[1].strip().split("#")[0].strip()
+
+        range_start: str = char_range
+        range_end: str = char_range
+        if ".." in char_range:
+            range_start, range_end = char_range.split("..")
+
+        range_tuple: Tuple[int, int] = (int(range_start, 16), int(range_end, 16))
+
+        if char_property == "XID_Start":
+            xid_start.append(range_tuple)
+        elif char_property == "XID_Continue":
+            xid_continue.append(range_tuple)
+        elif char_property == "Uppercase":
+            uppercase_letter.append(range_tuple)
+        elif char_property == "Lowercase":
+            lowercase_letter.append(range_tuple)
+        elif char_property == "Alphabetic":
+            unicode_letter.append(range_tuple)
+
+    # Underscore technically isn't in XID_Start, but for our purposes it's included.
+    xid_start.append((0x005F, 0x005F))
+    xid_start.sort(key=lambda x: x[0])
+
+    merge_ranges(xid_start)
+    merge_ranges(xid_continue)
+    merge_ranges(uppercase_letter)
+    merge_ranges(lowercase_letter)
+    merge_ranges(unicode_letter)
+
+
+def make_range(range_name: str, range_list: List[Tuple[int, int]]) -> str:
+    result: str = f"constexpr inline CharRange {range_name}[] = {{\n"
+
+    for start, end in range_list:
+        result += f"\t{{ 0x{start:x}, 0x{end:x} }},\n"
+
+    result += "};\n\n"
+
     return result
 
 
-lines = [line.decode("utf-8") for line in urlopen(URL)]
-
-xid_start: List[Tuple[str, str]] = []
-xid_continue: List[Tuple[str, str]] = []
-uppercase_letter: List[Tuple[str, str]] = []
-lowercase_letter: List[Tuple[str, str]] = []
-unicode_letter: List[Tuple[str, str]] = []
-
-# Underscore technically isn't in XID_Start,
-# but for our purposes it's included.
-xid_start.append(("0x005F", "0x005F"))
-
-for line in lines:
-    if line.startswith("#") or not line.strip():
-        continue
-
-    split_line = line.split(";")
-
-    char_range = split_line[0].strip()
-    char_property = split_line[1].strip().split("#")[0].strip()
-
-    range_start = char_range
-    range_end = char_range
-    if ".." in char_range:
-        range_start, range_end = char_range.split("..")
-
-    if char_property == "XID_Start":
-        xid_start.append((f"0x{range_start}", f"0x{range_end}"))
-    elif char_property == "XID_Continue":
-        xid_continue.append((f"0x{range_start}", f"0x{range_end}"))
-    elif char_property == "Uppercase":
-        uppercase_letter.append((f"0x{range_start}", f"0x{range_end}"))
-    elif char_property == "Lowercase":
-        lowercase_letter.append((f"0x{range_start}", f"0x{range_end}"))
-    elif char_property == "Alphabetic":
-        unicode_letter.append((f"0x{range_start}", f"0x{range_end}"))
-
-xid_start.sort(key=lambda x: int(x[0], 16))
-
-xid_start = merge_ranges(xid_start)
-xid_continue = merge_ranges(xid_continue)
-uppercase_letter = merge_ranges(uppercase_letter)
-lowercase_letter = merge_ranges(lowercase_letter)
-unicode_letter = merge_ranges(unicode_letter)
-
-
-char_range_str = f"""/**************************************************************************/
-/*  char_range.inc                                                        */
-/**************************************************************************/
-/*                         This file is part of:                          */
-/*                             GODOT ENGINE                               */
-/*                        https://godotengine.org                         */
-/**************************************************************************/
-/* Copyright (c) 2014-present Godot Engine contributors (see AUTHORS.md). */
-/* Copyright (c) 2007-2014 Juan Linietsky, Ariel Manzur.                  */
-/*                                                                        */
-/* Permission is hereby granted, free of charge, to any person obtaining  */
-/* a copy of this software and associated documentation files (the        */
-/* "Software"), to deal in the Software without restriction, including    */
-/* without limitation the rights to use, copy, modify, merge, publish,    */
-/* distribute, sublicense, and/or sell copies of the Software, and to     */
-/* permit persons to whom the Software is furnished to do so, subject to  */
-/* the following conditions:                                              */
-/*                                                                        */
-/* The above copyright notice and this permission notice shall be         */
-/* included in all copies or substantial portions of the Software.        */
-/*                                                                        */
-/* THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,        */
-/* EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF     */
-/* MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. */
-/* IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY   */
-/* CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT,   */
-/* TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE      */
-/* SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.                 */
-/**************************************************************************/
+def generate_char_range_inc() -> None:
+    parse_unicode_data()
+
+    source: str = generate_copyright_header("char_range.inc")
+
+    source += f"""
+// This file was generated using the `char_range_fetch.py` script.
+
 #ifndef CHAR_RANGE_INC
 #define CHAR_RANGE_INC
+
 #include "core/typedefs.h"
+
 // Unicode Derived Core Properties
 // Source: {URL}
-// This file was generated using the `char_range_fetch.py` script.
+
 struct CharRange {{
 \tchar32_t start;
 \tchar32_t end;
-}};
-constexpr inline CharRange xid_start[] = {{
-\t"""
-
-for start, end in xid_start:
-    char_range_str += f"{{ {start}, {end} }},\n\t"
-
-char_range_str = char_range_str[:-1]  # Remove trailing tab.
-char_range_str += """};
-constexpr inline CharRange xid_continue[] = {
-\t"""
-
-for start, end in xid_continue:
-    char_range_str += f"{{ {start}, {end} }},\n\t"
-
-char_range_str = char_range_str[:-1]  # Remove trailing tab.
-char_range_str += """};
-constexpr inline CharRange uppercase_letter[] = {
-\t"""
-
-for start, end in uppercase_letter:
-    char_range_str += f"{{ {start}, {end} }},\n\t"
-
-char_range_str = char_range_str[:-1]  # Remove trailing tab.
-char_range_str += """};
-constexpr inline CharRange lowercase_letter[] = {
-\t"""
+}};\n\n"""
 
-for start, end in lowercase_letter:
-    char_range_str += f"{{ {start}, {end} }},\n\t"
+    source += make_range("xid_start", xid_start)
+    source += make_range("xid_continue", xid_continue)
+    source += make_range("uppercase_letter", uppercase_letter)
+    source += make_range("lowercase_letter", lowercase_letter)
+    source += make_range("unicode_letter", unicode_letter)
 
-char_range_str = char_range_str[:-1]  # Remove trailing tab.
-char_range_str += """};
-constexpr inline CharRange unicode_letter[] = {
-\t"""
+    source += "#endif // CHAR_RANGE_INC\n"
 
-for start, end in unicode_letter:
-    char_range_str += f"{{ {start}, {end} }},\n\t"
+    char_range_path = os.path.join(os.path.dirname(__file__), "char_range.inc")
+    with open(char_range_path, "w", newline="\n") as f:
+        f.write(source)
 
-char_range_str = char_range_str[:-1]  # Remove trailing tab.
-char_range_str += """};
-#endif // CHAR_RANGE_INC
-"""
+    print("`char_range.inc` generated successfully.")
 
-char_range_path = os.path.join(os.path.dirname(__file__), "char_range.inc")
-with open(char_range_path, "w", newline="\n") as f:
-    f.write(char_range_str)
 
-print("`char_range.inc` generated successfully.")
+if __name__ == "__main__":
+    generate_char_range_inc()
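
The per-line parsing in the patch above can be tried without network access. This is a rough sketch under a few assumptions: the helper `parse_line` is hypothetical, it strips the trailing comment before splitting on `;`, and the sample lines are written by hand in the layout of `DerivedCoreProperties.txt` rather than fetched from the URL.

```python
from typing import Optional, Tuple


def parse_line(line: str) -> Optional[Tuple[int, int, str]]:
    """Return (start, end, property) for a data line, or None for comments and blanks."""
    data = line.split("#", 1)[0].strip()  # Drop the trailing comment, if any.
    if not data:
        return None
    fields = [field.strip() for field in data.split(";")]
    # Only the first two fields are used; extra fields are ignored.
    char_range, char_property = fields[0], fields[1]
    # A single code point has no "..", so it serves as both start and end.
    start, _, end = char_range.partition("..")
    return int(start, 16), int(end or start, 16), char_property


# Sample lines written in the file's layout, for illustration only.
sample_lines = [
    "# Derived Property: XID_Start",
    "",
    "0041..005A    ; XID_Start # L&  [26] LATIN CAPITAL LETTER A..LATIN CAPITAL LETTER Z",
    "00AA          ; XID_Start # Lo       FEMININE ORDINAL INDICATOR",
]
parsed = [entry for entry in map(parse_line, sample_lines) if entry is not None]
print(parsed)
```

Converting to `int` at parse time, as the patch does, means sorting and merging no longer need to round-trip through hex strings.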

Chubercik · 2025-01-22T23:25:51Z

Updated, though I omitted type hints wherever the type can be reasonably deduced from the expression (not only by the user; an IDE should have no problem with this).

One argument against this is that a variable could hold a union type, but the absence of an explicit annotation should itself indicate that this is not the case.

bruvzg

It might be better to move the script to `misc/scripts/` to make it easier to find.

Note: It seems the script is missing the executable flag. Not a big deal, but the flag can be set even on an OS without POSIX permission support, such as Windows, with this command:

git update-index --chmod=+x <file>
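
One way to confirm the result is to look at the mode Git has recorded in the index, which `git ls-files -s <file>` prints as the first field of each line (`100755` for executable regular files, `100644` otherwise). The snippet below is only a sketch: the helper name `index_entry_is_executable` is made up, and the sample lines use a placeholder object ID rather than the real one for the script.

```python
def index_entry_is_executable(stage_line: str) -> bool:
    """Check the mode field of one `git ls-files -s` output line."""
    mode = stage_line.split(maxsplit=1)[0]
    return mode == "100755"


# Placeholder object ID; only the leading mode field matters here.
before = "100644 e69de29bb2d1d6434b8b29ae775ad8c2e48c5391 0\tmisc/scripts/char_range_fetch.py"
after = "100755 e69de29bb2d1d6434b8b29ae775ad8c2e48c5391 0\tmisc/scripts/char_range_fetch.py"

print(index_entry_is_executable(before), index_entry_is_executable(after))
# prints False True
```

Checking the index rather than the file system is deliberate here, since the working-tree permissions are not meaningful on systems without POSIX permission support.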

dalexeev

> Updated, though I omitted type hints wherever the type can be reasonably deduced from the expression (not only by the user; an IDE should have no problem with this).

I'm not insisting, but I think it's more obvious, similar to how we don't allow auto in our C++ codebase (with a few exceptions). I also don't like that Python doesn't use explicit variable definitions. Type hints somewhat compensate for this and make it easier to find where a variable is first defined.
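
The two annotation styles under discussion can be compared side by side. This is an illustrative sketch only; the variable names are borrowed loosely from the patch, and neither form changes runtime behavior.

```python
from typing import List, Tuple

# The element type of an empty list cannot be deduced from the expression
# itself, so the annotation carries real information here.
ranges: List[Tuple[int, int]] = []
ranges.append((0x41, 0x5A))

# Here the type follows from the right-hand side, so the hint is optional.
last_end = ranges[0][1]

# The explicit form marks the line as the variable's first definition.
last_end_explicit: int = ranges[0][1]

print(last_end == last_end_explicit)
# prints True
```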

misc/scripts/char_range_fetch.py

Co-authored-by: Danil Alexeev <dalexeev12@yandex.ru>

Chubercik · 2025-01-23T18:28:47Z

> I'm not insisting, but I think it's more obvious, similar to how we don't allow auto in our C++ codebase (with a few exceptions). I also don't like that Python doesn't use explicit variable definitions. Type hints somewhat compensate for this and make it easier to find where a variable is first defined.

Not how I'd do it, but I can see merit in this approach; new version of the script is up :)

Repiteo · 2025-03-07T21:20:19Z

Thanks!

Chubercik requested review from a team as code owners January 21, 2025 15:30

Chubercik mentioned this pull request Jan 21, 2025

Update ucaps.h to contain proper case matchings #90726

Merged

bruvzg self-requested a review January 21, 2025 15:36

bruvzg added enhancement topic:core labels Jan 21, 2025

bruvzg added this to the 4.4 milestone Jan 21, 2025

dalexeev reviewed Jan 21, 2025

Chubercik force-pushed the automate_char_range branch from 1048576 to 1048576 Compare January 21, 2025 18:02

dalexeev reviewed Jan 21, 2025

Repiteo modified the milestones: 4.4, 4.5 Jan 22, 2025

Chubercik force-pushed the automate_char_range branch from 1048576 to 1048576 Compare January 22, 2025 23:20

bruvzg approved these changes Jan 23, 2025

Chubercik force-pushed the automate_char_range branch from 1048576 to 1048576 Compare January 23, 2025 08:32

dalexeev approved these changes Jan 23, 2025

Automate generation of the char_range.inc file

1048576

Co-authored-by: Danil Alexeev <dalexeev12@yandex.ru>

Chubercik force-pushed the automate_char_range branch from 1048576 to 1048576 Compare January 23, 2025 18:27

Chubercik mentioned this pull request Jan 31, 2025

Optimize String _find_upper and _find_lower by handling low-bit characters (including normal latin) explicitly. #99971

Open

Repiteo merged commit bb8ef4e into godotengine:master Mar 7, 2025
19 checks passed

Chubercik deleted the automate_char_range branch March 7, 2025 22:39
