Core: Automate generation of the char_range.inc file #101878

Merged: 1 commit, Mar 7, 2025

Contributor

@Chubercik Chubercik commented Jan 21, 2025

This PR automates (via a Python script) the update of the char_range.inc file when a new UCD version appears.

Member

@dalexeev dalexeev left a comment

Your script does not merge adjacent ranges, and it also creates an unnecessary diff due to differences in literal formatting. I tried modifying your script to reduce the diff:

Script patch
@@ -1,16 +1,35 @@
 """
-Script used to dump char ranges
-for specific properties from
-the Unicode Character Database
-to the `char_range.inc` file.
+Script used to dump char ranges for specific properties from
+the Unicode Character Database to the `char_range.inc` file.
 """
 
 import os
 from typing import List, Tuple
 from urllib.request import urlopen
 
-URL = "https://www.unicode.org/Public/16.0.0/ucd/DerivedCoreProperties.txt"
 
+def merge_ranges(ranges: List[Tuple[str, str]]) -> List[Tuple[str, str]]:
+    if len(ranges) < 2:
+        return ranges
+
+    result: List[Tuple[str, str]] = []
+    last_start: int = int(ranges[0][0], 16)
+    last_end: int = int(ranges[0][1], 16)
+    for i in range(1, len(ranges)):
+        curr: Tuple[str, str] = ranges[i]
+        curr_start: int = int(curr[0], 16)
+        curr_end: int = int(curr[1], 16)
+        if last_end + 1 == curr_start:
+            last_end = curr_end
+        else:
+            result.append(("0x%x" % last_start, "0x%x" % last_end))
+            last_start = curr_start
+            last_end = curr_end
+    result.append(("0x%x" % last_start, "0x%x" % last_end))
+    return result
+
+
+URL = "https://www.unicode.org/Public/16.0.0/ucd/DerivedCoreProperties.txt"
 
 lines = [line.decode("utf-8") for line in urlopen(URL)]
 
@@ -20,9 +39,8 @@ uppercase_letter: List[Tuple[str, str]] = []
 lowercase_letter: List[Tuple[str, str]] = []
 unicode_letter: List[Tuple[str, str]] = []
 
-# Underscore technically isn't in XID_Start,
-# but for our purposes it's included.
-xid_start.append(("0x005F", "0x005F"))
+# Underscore technically isn't in XID_Start, but for our purposes it's included.
+xid_start.append(("0x005f", "0x005f"))
 
 for line in lines:
     if line.startswith("#") or not line.strip():
@@ -37,6 +55,8 @@ for line in lines:
     range_end = char_range
     if ".." in char_range:
         range_start, range_end = char_range.split("..")
+    range_start = range_start.lower()
+    range_end = range_end.lower()
 
     if char_property == "XID_Start":
         xid_start.append((f"0x{range_start}", f"0x{range_end}"))
@@ -51,6 +71,11 @@ for line in lines:
 
 xid_start.sort(key=lambda x: int(x[0], 16))
 
+xid_start = merge_ranges(xid_start)
+xid_continue = merge_ranges(xid_continue)
+uppercase_letter = merge_ranges(uppercase_letter)
+lowercase_letter = merge_ranges(lowercase_letter)
+unicode_letter = merge_ranges(unicode_letter)
 
 char_range_str = f"""/**************************************************************************/
 /*  char_range.inc                                                        */
@@ -81,16 +106,22 @@ char_range_str = f"""/**********************************************************
 /* TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE      */
 /* SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.                 */
 /**************************************************************************/
+
+// This file was generated using the `char_range_fetch.py` script.
+
 #ifndef CHAR_RANGE_INC
 #define CHAR_RANGE_INC
+
 #include "core/typedefs.h"
+
 // Unicode Derived Core Properties
 // Source: {URL}
-// This file was generated using the `char_range_fetch.py` script.
+
 struct CharRange {{
 \tchar32_t start;
 \tchar32_t end;
 }};
+
 constexpr inline CharRange xid_start[] = {{
 \t"""
 
@@ -99,6 +130,7 @@ for start, end in xid_start:
 
 char_range_str = char_range_str[:-1]  # Remove trailing tab.
 char_range_str += """};
+
 constexpr inline CharRange xid_continue[] = {
 \t"""
 
@@ -107,6 +139,7 @@ for start, end in xid_continue:
 
 char_range_str = char_range_str[:-1]  # Remove trailing tab.
 char_range_str += """};
+
 constexpr inline CharRange uppercase_letter[] = {
 \t"""
 
@@ -115,6 +148,7 @@ for start, end in uppercase_letter:
 
 char_range_str = char_range_str[:-1]  # Remove trailing tab.
 char_range_str += """};
+
 constexpr inline CharRange lowercase_letter[] = {
 \t"""
 
@@ -123,6 +157,7 @@ for start, end in lowercase_letter:
 
 char_range_str = char_range_str[:-1]  # Remove trailing tab.
 char_range_str += """};
+
 constexpr inline CharRange unicode_letter[] = {
 \t"""
 
@@ -131,6 +166,7 @@ for start, end in unicode_letter:
 
 char_range_str = char_range_str[:-1]  # Remove trailing tab.
 char_range_str += """};
+
 #endif // CHAR_RANGE_INC
 """
 

After that I got this:

Source diff
@@ -28,6 +28,8 @@
 /* SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.                 */
 /**************************************************************************/
 
+// This file was generated using the `char_range_fetch.py` script.
+
 #ifndef CHAR_RANGE_INC
 #define CHAR_RANGE_INC
 
@@ -43,7 +45,7 @@ struct CharRange {
 
 constexpr inline CharRange xid_start[] = {
 	{ 0x41, 0x5a },
-	{ 0x5f, 0x5f }, // Underscore technically isn't in XID_Start, but for our purposes it's included.
+	{ 0x5f, 0x5f },
 	{ 0x61, 0x7a },
 	{ 0xaa, 0xaa },
 	{ 0xb5, 0xb5 },

Feel free to modify the script further, as I made minimal changes and the current version is probably not optimal.
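
For illustration, here is a minimal sketch of the merging step on its own. It is not the code from this PR: it works on integer tuples rather than hex strings, sorts the input itself, and also folds overlapping ranges, which the patch above does not attempt. The name `merge_adjacent` is hypothetical.

```python
from typing import List, Tuple


def merge_adjacent(ranges: List[Tuple[int, int]]) -> List[Tuple[int, int]]:
    """Merge ranges that touch or overlap; the input need not be sorted."""
    merged: List[Tuple[int, int]] = []
    for start, end in sorted(ranges):
        if merged and start <= merged[-1][1] + 1:
            # Extend the previous range instead of starting a new one.
            prev_start, prev_end = merged[-1]
            merged[-1] = (prev_start, max(prev_end, end))
        else:
            merged.append((start, end))
    return merged


# 0x41..0x5a and 0x5b..0x5f are adjacent, so they collapse into one range;
# 0x61..0x7a stays separate because 0x60 is not covered.
for start, end in merge_adjacent([(0x61, 0x7A), (0x41, 0x5A), (0x5B, 0x5F)]):
    print(f"0x{start:x}..0x{end:x}")
```

Sorting inside the function removes the need for the caller to sort each list first, at the cost of an extra pass over data that is already ordered in the UCD file.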

@Chubercik Chubercik force-pushed the automate_char_range branch from 1048576 to 1048576 Compare January 21, 2025 18:02
@Chubercik
Contributor Author

Thanks, I didn't take into consideration that ranges can be adjacent 😅

As for the formatting differences, I'd prefer to stick to the way literals are written in the UCD documents (and in two other places in the Godot codebase; see #90726 and #101880). That can be handled in a separate PR, though, since this one is far easier to validate if the diff isn't all over the place.
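
To make the two literal styles concrete, a small sketch follows. The helper names `ucd_style` and `compact_style` are hypothetical and do not appear in the script; the first mimics the uppercase, at-least-four-digit form used in UCD files, the second the lowercase unpadded form seen in the current `char_range.inc`.

```python
def ucd_style(value: int) -> str:
    """Uppercase hex padded to at least four digits, as in UCD data files."""
    return f"0x{value:04X}"


def compact_style(value: int) -> str:
    """Lowercase hex with no padding, as in the existing generated file."""
    return f"0x{value:x}"


for code_point in (0x5F, 0x1F600):
    print(ucd_style(code_point), compact_style(code_point))
```

Both forms denote the same values, so switching between them changes only the textual diff, not the generated tables.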

Member

@dalexeev dalexeev left a comment

Patch
@@ -1,170 +1,137 @@
-"""
-Script used to dump char ranges
-for specific properties from
-the Unicode Character Database
-to the `char_range.inc` file.
-"""
-
-import os
+#!/usr/bin/env python3
+
+# Script used to dump char ranges for specific properties from
+# the Unicode Character Database to the `char_range.inc` file.
+# NOTE: This script is deliberately not integrated into the build system;
+# you should run it manually whenever you want to update data.
+
+import os, sys
 from typing import List, Tuple
 from urllib.request import urlopen
 
-URL = "https://www.unicode.org/Public/16.0.0/ucd/DerivedCoreProperties.txt"
+if __name__ == "__main__":
+    sys.path.insert(1, os.path.abspath("../../"))
+
+from methods import generate_copyright_header
 
 
-def int_as_hex(i: int) -> str:
-    return f"0x{i:x}"
+URL: str = "https://www.unicode.org/Public/16.0.0/ucd/DerivedCoreProperties.txt"
 
+xid_start: List[Tuple[int, int]] = []
+xid_continue: List[Tuple[int, int]] = []
+uppercase_letter: List[Tuple[int, int]] = []
+lowercase_letter: List[Tuple[int, int]] = []
+unicode_letter: List[Tuple[int, int]] = []
 
-def merge_ranges(ranges: List[Tuple[str, str]]) -> List[Tuple[str, str]]:
+
+def merge_ranges(ranges: List[Tuple[int, int]]) -> None:
     if len(ranges) < 2:
-        return ranges
+        return
+
+    last_start: int = ranges[0][0]
+    last_end: int = ranges[0][1]
+    original_ranges: List[Tuple[int, int]] = ranges[1:]
 
-    result: List[Tuple[str, str]] = []
-    last_start = int(ranges[0][0], 16)
-    last_end = int(ranges[0][1], 16)
+    ranges.clear()
 
-    for curr_range in ranges[1:]:
-        curr_start = int(curr_range[0], 16)
-        curr_end = int(curr_range[1], 16)
+    for curr_range in original_ranges:
+        curr_start: int = curr_range[0]
+        curr_end: int = curr_range[1]
         if last_end + 1 != curr_start:
-            result.append((int_as_hex(last_start), int_as_hex(last_end)))
+            ranges.append((last_start, last_end))
             last_start = curr_start
         last_end = curr_end
-    result.append((int_as_hex(last_start), int_as_hex(last_end)))
+
+    ranges.append((last_start, last_end))
+
+
+def parse_unicode_data() -> None:
+    lines: List[str] = [line.decode("utf-8") for line in urlopen(URL)]
+
+    for line in lines:
+        if line.startswith("#") or not line.strip():
+            continue
+
+        split_line: List[str] = line.split(";")
+
+        char_range: str = split_line[0].strip()
+        char_property: str = split_line[1].strip().split("#")[0].strip()
+
+        range_start: str = char_range
+        range_end: str = char_range
+        if ".." in char_range:
+            range_start, range_end = char_range.split("..")
+
+        range_tuple: Tuple[int, int] = (int(range_start, 16), int(range_end, 16))
+
+        if char_property == "XID_Start":
+            xid_start.append(range_tuple)
+        elif char_property == "XID_Continue":
+            xid_continue.append(range_tuple)
+        elif char_property == "Uppercase":
+            uppercase_letter.append(range_tuple)
+        elif char_property == "Lowercase":
+            lowercase_letter.append(range_tuple)
+        elif char_property == "Alphabetic":
+            unicode_letter.append(range_tuple)
+
+    # Underscore technically isn't in XID_Start, but for our purposes it's included.
+    xid_start.append((0x005F, 0x005F))
+    xid_start.sort(key=lambda x: x[0])
+
+    merge_ranges(xid_start)
+    merge_ranges(xid_continue)
+    merge_ranges(uppercase_letter)
+    merge_ranges(lowercase_letter)
+    merge_ranges(unicode_letter)
+
+
+def make_range(range_name: str, range_list: List[Tuple[int, int]]) -> str:
+    result: str = f"constexpr inline CharRange {range_name}[] = {{\n"
+
+    for start, end in range_list:
+        result += f"\t{{ 0x{start:x}, 0x{end:x} }},\n"
+
+    result += "};\n\n"
+
     return result
 
 
-lines = [line.decode("utf-8") for line in urlopen(URL)]
-
-xid_start: List[Tuple[str, str]] = []
-xid_continue: List[Tuple[str, str]] = []
-uppercase_letter: List[Tuple[str, str]] = []
-lowercase_letter: List[Tuple[str, str]] = []
-unicode_letter: List[Tuple[str, str]] = []
-
-# Underscore technically isn't in XID_Start,
-# but for our purposes it's included.
-xid_start.append(("0x005F", "0x005F"))
-
-for line in lines:
-    if line.startswith("#") or not line.strip():
-        continue
-
-    split_line = line.split(";")
-
-    char_range = split_line[0].strip()
-    char_property = split_line[1].strip().split("#")[0].strip()
-
-    range_start = char_range
-    range_end = char_range
-    if ".." in char_range:
-        range_start, range_end = char_range.split("..")
-
-    if char_property == "XID_Start":
-        xid_start.append((f"0x{range_start}", f"0x{range_end}"))
-    elif char_property == "XID_Continue":
-        xid_continue.append((f"0x{range_start}", f"0x{range_end}"))
-    elif char_property == "Uppercase":
-        uppercase_letter.append((f"0x{range_start}", f"0x{range_end}"))
-    elif char_property == "Lowercase":
-        lowercase_letter.append((f"0x{range_start}", f"0x{range_end}"))
-    elif char_property == "Alphabetic":
-        unicode_letter.append((f"0x{range_start}", f"0x{range_end}"))
-
-xid_start.sort(key=lambda x: int(x[0], 16))
-
-xid_start = merge_ranges(xid_start)
-xid_continue = merge_ranges(xid_continue)
-uppercase_letter = merge_ranges(uppercase_letter)
-lowercase_letter = merge_ranges(lowercase_letter)
-unicode_letter = merge_ranges(unicode_letter)
-
-
-char_range_str = f"""/**************************************************************************/
-/*  char_range.inc                                                        */
-/**************************************************************************/
-/*                         This file is part of:                          */
-/*                             GODOT ENGINE                               */
-/*                        https://godotengine.org                         */
-/**************************************************************************/
-/* Copyright (c) 2014-present Godot Engine contributors (see AUTHORS.md). */
-/* Copyright (c) 2007-2014 Juan Linietsky, Ariel Manzur.                  */
-/*                                                                        */
-/* Permission is hereby granted, free of charge, to any person obtaining  */
-/* a copy of this software and associated documentation files (the        */
-/* "Software"), to deal in the Software without restriction, including    */
-/* without limitation the rights to use, copy, modify, merge, publish,    */
-/* distribute, sublicense, and/or sell copies of the Software, and to     */
-/* permit persons to whom the Software is furnished to do so, subject to  */
-/* the following conditions:                                              */
-/*                                                                        */
-/* The above copyright notice and this permission notice shall be         */
-/* included in all copies or substantial portions of the Software.        */
-/*                                                                        */
-/* THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,        */
-/* EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF     */
-/* MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. */
-/* IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY   */
-/* CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT,   */
-/* TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE      */
-/* SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.                 */
-/**************************************************************************/
+def generate_char_range_inc() -> None:
+    parse_unicode_data()
+
+    source: str = generate_copyright_header("char_range.inc")
+
+    source += f"""
+// This file was generated using the `char_range_fetch.py` script.
+
 #ifndef CHAR_RANGE_INC
 #define CHAR_RANGE_INC
+
 #include "core/typedefs.h"
+
 // Unicode Derived Core Properties
 // Source: {URL}
-// This file was generated using the `char_range_fetch.py` script.
+
 struct CharRange {{
 \tchar32_t start;
 \tchar32_t end;
-}};
-constexpr inline CharRange xid_start[] = {{
-\t"""
-
-for start, end in xid_start:
-    char_range_str += f"{{ {start}, {end} }},\n\t"
-
-char_range_str = char_range_str[:-1]  # Remove trailing tab.
-char_range_str += """};
-constexpr inline CharRange xid_continue[] = {
-\t"""
-
-for start, end in xid_continue:
-    char_range_str += f"{{ {start}, {end} }},\n\t"
-
-char_range_str = char_range_str[:-1]  # Remove trailing tab.
-char_range_str += """};
-constexpr inline CharRange uppercase_letter[] = {
-\t"""
-
-for start, end in uppercase_letter:
-    char_range_str += f"{{ {start}, {end} }},\n\t"
-
-char_range_str = char_range_str[:-1]  # Remove trailing tab.
-char_range_str += """};
-constexpr inline CharRange lowercase_letter[] = {
-\t"""
+}};\n\n"""
 
-for start, end in lowercase_letter:
-    char_range_str += f"{{ {start}, {end} }},\n\t"
+    source += make_range("xid_start", xid_start)
+    source += make_range("xid_continue", xid_continue)
+    source += make_range("uppercase_letter", uppercase_letter)
+    source += make_range("lowercase_letter", lowercase_letter)
+    source += make_range("unicode_letter", unicode_letter)
 
-char_range_str = char_range_str[:-1]  # Remove trailing tab.
-char_range_str += """};
-constexpr inline CharRange unicode_letter[] = {
-\t"""
+    source += "#endif // CHAR_RANGE_INC\n"
 
-for start, end in unicode_letter:
-    char_range_str += f"{{ {start}, {end} }},\n\t"
+    char_range_path = os.path.join(os.path.dirname(__file__), "char_range.inc")
+    with open(char_range_path, "w", newline="\n") as f:
+        f.write(source)
 
-char_range_str = char_range_str[:-1]  # Remove trailing tab.
-char_range_str += """};
-#endif // CHAR_RANGE_INC
-"""
+    print("`char_range.inc` generated successfully.")
 
-char_range_path = os.path.join(os.path.dirname(__file__), "char_range.inc")
-with open(char_range_path, "w", newline="\n") as f:
-    f.write(char_range_str)
 
-print("`char_range.inc` generated successfully.")
+if __name__ == "__main__":
+    generate_char_range_inc()
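
The per-line parsing in `parse_unicode_data()` can also be sketched in isolation. This is an assumption-laden illustration rather than the script's code: `parse_line` is a hypothetical helper, and the sample lines only imitate the layout of `DerivedCoreProperties.txt` (code point or range, a semicolon, the property name, then a `#` comment).

```python
from typing import Optional, Tuple


def parse_line(line: str) -> Optional[Tuple[int, int, str]]:
    """Return (start, end, property) for a data line, or None for comment/blank lines."""
    # Everything after '#' is a comment in UCD data files.
    data = line.split("#", 1)[0].strip()
    if not data:
        return None
    fields = [field.strip() for field in data.split(";")]
    range_field, property_field = fields[0], fields[1]
    # A single code point has no "..", so it is both start and end.
    start, _, end = range_field.partition("..")
    return int(start, 16), int(end or start, 16), property_field


print(parse_line("0041..005A    ; XID_Start # L&  [26] LATIN CAPITAL LETTER A..LATIN CAPITAL LETTER Z"))
print(parse_line("00AA          ; XID_Start # Lo       FEMININE ORDINAL INDICATOR"))
print(parse_line("# Derived Property: XID_Start"))
```

Returning integers straight away, as the patch above does, keeps formatting decisions in one place at output time.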

@Repiteo Repiteo modified the milestones: 4.4, 4.5 Jan 22, 2025
@Chubercik Chubercik force-pushed the automate_char_range branch from 1048576 to 1048576 Compare January 22, 2025 23:20
@Chubercik
Contributor Author

Updated, though I omitted type hints wherever the type can be reasonably deduced from the expression (not only by the user; an IDE should have no problem with this).

One argument against this is that a variable could hold a union type, but the absence of an explicit annotation should itself indicate that this isn't the case.

Member

@bruvzg bruvzg left a comment

It might be better to move the script to misc/scripts/ to make it easier to find.

Note: It seems the script is missing the executable flag. Not a big deal, but it can be set even on an OS without POSIX permissions support, such as Windows, using the command:

git update-index --chmod=+x <file>

@Chubercik Chubercik force-pushed the automate_char_range branch from 1048576 to 1048576 Compare January 23, 2025 08:32
Member

@dalexeev dalexeev left a comment

> Updated, though I omitted type hints wherever the type can be reasonably deduced from the expression (not only by the user; an IDE should have no problem with this).

I'm not insisting, but I think it's more obvious, similar to how we don't allow auto in our C++ codebase (with a few exceptions). I also don't like that Python doesn't use explicit variable definitions. Type hints somewhat compensate for this and make it easier to find where a variable is first defined.
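
A tiny sketch of the trade-off being discussed, with hypothetical variable names that are not taken from the script: both lists end up holding the same data, but only the annotated one states its element type at the point of first definition, which is also what a static checker such as mypy would rely on.

```python
from typing import List, Tuple

# Without an annotation, the element type has to be inferred from later usage.
ranges_inferred = []
ranges_inferred.append((0x41, 0x5A))

# With an annotation, the first definition documents the intended type.
ranges_annotated: List[Tuple[int, int]] = []
ranges_annotated.append((0x41, 0x5A))

print(ranges_inferred == ranges_annotated)
```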

Co-authored-by: Danil Alexeev <dalexeev12@yandex.ru>
@Chubercik Chubercik force-pushed the automate_char_range branch from 1048576 to 1048576 Compare January 23, 2025 18:27
@Chubercik
Contributor Author

> I'm not insisting, but I think it's more obvious, similar to how we don't allow auto in our C++ codebase (with a few exceptions). I also don't like that Python doesn't use explicit variable definitions. Type hints somewhat compensate for this and make it easier to find where a variable is first defined.

Not how I'd do it, but I can see merit in this approach; new version of the script is up :)

@Repiteo Repiteo merged commit bb8ef4e into godotengine:master Mar 7, 2025
19 checks passed
@Repiteo
Contributor

Repiteo commented Mar 7, 2025

Thanks!

@Chubercik Chubercik deleted the automate_char_range branch March 7, 2025 22:39