Core: Automate generation of the `char_range.inc` file #101878
Your script does not merge adjacent ranges, and it also creates an unnecessary diff because the literals are formatted differently. I tried modifying your script to reduce the diff:
Script patch
@@ -1,16 +1,35 @@
"""
-Script used to dump char ranges
-for specific properties from
-the Unicode Character Database
-to the `char_range.inc` file.
+Script used to dump char ranges for specific properties from
+the Unicode Character Database to the `char_range.inc` file.
"""
import os
from typing import List, Tuple
from urllib.request import urlopen
-URL = "https://www.unicode.org/Public/16.0.0/ucd/DerivedCoreProperties.txt"
+def merge_ranges(ranges: List[Tuple[str, str]]) -> List[Tuple[str, str]]:
+ if len(ranges) < 2:
+ return ranges
+
+ result: List[Tuple[str, str]] = []
+ last_start: int = int(ranges[0][0], 16)
+ last_end: int = int(ranges[0][1], 16)
+ for i in range(1, len(ranges)):
+ curr: Tuple[str, str] = ranges[i]
+ curr_start: int = int(curr[0], 16)
+ curr_end: int = int(curr[1], 16)
+ if last_end + 1 == curr_start:
+ last_end = curr_end
+ else:
+ result.append(("0x%x" % last_start, "0x%x" % last_end))
+ last_start = curr_start
+ last_end = curr_end
+ result.append(("0x%x" % last_start, "0x%x" % last_end))
+ return result
+
+
+URL = "https://www.unicode.org/Public/16.0.0/ucd/DerivedCoreProperties.txt"
lines = [line.decode("utf-8") for line in urlopen(URL)]
@@ -20,9 +39,8 @@ uppercase_letter: List[Tuple[str, str]] = []
lowercase_letter: List[Tuple[str, str]] = []
unicode_letter: List[Tuple[str, str]] = []
-# Underscore technically isn't in XID_Start,
-# but for our purposes it's included.
-xid_start.append(("0x005F", "0x005F"))
+# Underscore technically isn't in XID_Start, but for our purposes it's included.
+xid_start.append(("0x005f", "0x005f"))
for line in lines:
if line.startswith("#") or not line.strip():
@@ -37,6 +55,8 @@ for line in lines:
range_end = char_range
if ".." in char_range:
range_start, range_end = char_range.split("..")
+ range_start = range_start.lower()
+ range_end = range_end.lower()
if char_property == "XID_Start":
xid_start.append((f"0x{range_start}", f"0x{range_end}"))
@@ -51,6 +71,11 @@ for line in lines:
xid_start.sort(key=lambda x: int(x[0], 16))
+xid_start = merge_ranges(xid_start)
+xid_continue = merge_ranges(xid_continue)
+uppercase_letter = merge_ranges(uppercase_letter)
+lowercase_letter = merge_ranges(lowercase_letter)
+unicode_letter = merge_ranges(unicode_letter)
char_range_str = f"""/**************************************************************************/
/* char_range.inc */
@@ -81,16 +106,22 @@ char_range_str = f"""/**********************************************************
/* TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE */
/* SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE. */
/**************************************************************************/
+
+// This file was generated using the `char_range_fetch.py` script.
+
#ifndef CHAR_RANGE_INC
#define CHAR_RANGE_INC
+
#include "core/typedefs.h"
+
// Unicode Derived Core Properties
// Source: {URL}
-// This file was generated using the `char_range_fetch.py` script.
+
struct CharRange {{
\tchar32_t start;
\tchar32_t end;
}};
+
constexpr inline CharRange xid_start[] = {{
\t"""
@@ -99,6 +130,7 @@ for start, end in xid_start:
char_range_str = char_range_str[:-1] # Remove trailing tab.
char_range_str += """};
+
constexpr inline CharRange xid_continue[] = {
\t"""
@@ -107,6 +139,7 @@ for start, end in xid_continue:
char_range_str = char_range_str[:-1] # Remove trailing tab.
char_range_str += """};
+
constexpr inline CharRange uppercase_letter[] = {
\t"""
@@ -115,6 +148,7 @@ for start, end in uppercase_letter:
char_range_str = char_range_str[:-1] # Remove trailing tab.
char_range_str += """};
+
constexpr inline CharRange lowercase_letter[] = {
\t"""
@@ -123,6 +157,7 @@ for start, end in lowercase_letter:
char_range_str = char_range_str[:-1] # Remove trailing tab.
char_range_str += """};
+
constexpr inline CharRange unicode_letter[] = {
\t"""
@@ -131,6 +166,7 @@ for start, end in unicode_letter:
char_range_str = char_range_str[:-1] # Remove trailing tab.
char_range_str += """};
+
#endif // CHAR_RANGE_INC
"""
After that I got this:
Source diff
@@ -28,6 +28,8 @@
/* SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE. */
/**************************************************************************/
+// This file was generated using the `char_range_fetch.py` script.
+
#ifndef CHAR_RANGE_INC
#define CHAR_RANGE_INC
@@ -43,7 +45,7 @@ struct CharRange {
constexpr inline CharRange xid_start[] = {
{ 0x41, 0x5a },
- { 0x5f, 0x5f }, // Underscore technically isn't in XID_Start, but for our purposes it's included.
+ { 0x5f, 0x5f },
{ 0x61, 0x7a },
{ 0xaa, 0xaa },
{ 0xb5, 0xb5 },
Feel free to modify the script further, as I made minimal changes and the current version is probably not optimal.
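For readers skimming the patch, the merging step can be sketched on its own. This is a minimal illustration rather than the script's actual code: it works on integer tuples, sorts its input, and also folds overlapping ranges, which the patch above does not attempt. The name `merge_adjacent` and the sample data are made up for the example and are not real UCD ranges.

```python
from typing import List, Tuple


def merge_adjacent(ranges: List[Tuple[int, int]]) -> List[Tuple[int, int]]:
    """Merge inclusive (start, end) ranges that touch or overlap."""
    merged: List[Tuple[int, int]] = []
    for start, end in sorted(ranges):
        if merged and start <= merged[-1][1] + 1:
            # The range begins right after (or inside) the previous one,
            # so extend the previous range instead of starting a new one.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


# Hypothetical sample: three touching ranges followed by a separate one.
sample = [(0x41, 0x5A), (0x5B, 0x5E), (0x5F, 0x5F), (0x61, 0x7A)]
print([(hex(s), hex(e)) for s, e in merge_adjacent(sample)])
```

Returning a new list keeps the helper free of side effects; an in-place variant would work just as well if the caller prefers to mutate the list.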
Thanks, I didn't take into consideration that ranges can be adjacent 😅 As for the formatting differences, I'd like to stick to the way literals are written in the UCD documents (and in two other places in the Godot codebase; see #90726 and #101880). That can be handled in a separate PR, as it's far easier to validate this one if the diff isn't all over the place.
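To make the two literal styles under discussion concrete, here is a small sketch. The helper names `ucd_style` and `compact_style` are hypothetical; the first mimics the uppercase, zero-padded form seen in the original script (`0x005F`), the second the lowercase form used in the existing `char_range.inc` (`0x5f`).

```python
def ucd_style(code_point: int) -> str:
    # Uppercase hex, zero-padded to at least four digits.
    return f"0x{code_point:04X}"


def compact_style(code_point: int) -> str:
    # Lowercase hex with no padding.
    return f"0x{code_point:x}"


for cp in (0x5F, 0xAA, 0x1F600):
    print(ucd_style(cp), compact_style(cp))
```

Both forms parse back to the same integer with `int(text, 16)`, so the choice only affects how the generated file diffs against the existing one.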
Patch
@@ -1,170 +1,137 @@
-"""
-Script used to dump char ranges
-for specific properties from
-the Unicode Character Database
-to the `char_range.inc` file.
-"""
-
-import os
+#!/usr/bin/env python3
+
+# Script used to dump char ranges for specific properties from
+# the Unicode Character Database to the `char_range.inc` file.
+# NOTE: This script is deliberately not integrated into the build system;
+# you should run it manually whenever you want to update data.
+
+import os, sys
from typing import List, Tuple
from urllib.request import urlopen
-URL = "https://www.unicode.org/Public/16.0.0/ucd/DerivedCoreProperties.txt"
+if __name__ == "__main__":
+ sys.path.insert(1, os.path.abspath("../../"))
+
+from methods import generate_copyright_header
-def int_as_hex(i: int) -> str:
- return f"0x{i:x}"
+URL: str = "https://www.unicode.org/Public/16.0.0/ucd/DerivedCoreProperties.txt"
+xid_start: List[Tuple[int, int]] = []
+xid_continue: List[Tuple[int, int]] = []
+uppercase_letter: List[Tuple[int, int]] = []
+lowercase_letter: List[Tuple[int, int]] = []
+unicode_letter: List[Tuple[int, int]] = []
-def merge_ranges(ranges: List[Tuple[str, str]]) -> List[Tuple[str, str]]:
+
+def merge_ranges(ranges: List[Tuple[int, int]]) -> None:
if len(ranges) < 2:
- return ranges
+ return
+
+ last_start: int = ranges[0][0]
+ last_end: int = ranges[0][1]
+ original_ranges: List[Tuple[int, int]] = ranges[1:]
- result: List[Tuple[str, str]] = []
- last_start = int(ranges[0][0], 16)
- last_end = int(ranges[0][1], 16)
+ ranges.clear()
- for curr_range in ranges[1:]:
- curr_start = int(curr_range[0], 16)
- curr_end = int(curr_range[1], 16)
+ for curr_range in original_ranges:
+ curr_start: int = curr_range[0]
+ curr_end: int = curr_range[1]
if last_end + 1 != curr_start:
- result.append((int_as_hex(last_start), int_as_hex(last_end)))
+ ranges.append((last_start, last_end))
last_start = curr_start
last_end = curr_end
- result.append((int_as_hex(last_start), int_as_hex(last_end)))
+
+ ranges.append((last_start, last_end))
+
+
+def parse_unicode_data() -> None:
+ lines: List[str] = [line.decode("utf-8") for line in urlopen(URL)]
+
+ for line in lines:
+ if line.startswith("#") or not line.strip():
+ continue
+
+ split_line: list[str] = line.split(";")
+
+ char_range: str = split_line[0].strip()
+ char_property: str = split_line[1].strip().split("#")[0].strip()
+
+ range_start: str = char_range
+ range_end: str = char_range
+ if ".." in char_range:
+ range_start, range_end = char_range.split("..")
+
+ range_tuple: Tuple[int, int] = (int(range_start, 16), int(range_end, 16))
+
+ if char_property == "XID_Start":
+ xid_start.append(range_tuple)
+ elif char_property == "XID_Continue":
+ xid_continue.append(range_tuple)
+ elif char_property == "Uppercase":
+ uppercase_letter.append(range_tuple)
+ elif char_property == "Lowercase":
+ lowercase_letter.append(range_tuple)
+ elif char_property == "Alphabetic":
+ unicode_letter.append(range_tuple)
+
+ # Underscore technically isn't in XID_Start, but for our purposes it's included.
+ xid_start.append((0x005F, 0x005F))
+ xid_start.sort(key=lambda x: x[0])
+
+ merge_ranges(xid_start)
+ merge_ranges(xid_continue)
+ merge_ranges(uppercase_letter)
+ merge_ranges(lowercase_letter)
+ merge_ranges(unicode_letter)
+
+
+def make_range(range_name: str, range_list: List[Tuple[int, int]]) -> str:
+ result: str = f"constexpr inline CharRange {range_name}[] = {{\n"
+
+ for start, end in range_list:
+ result += f"\t{{ 0x{start:x}, 0x{end:x} }},\n"
+
+ result += "};\n\n"
+
return result
-lines = [line.decode("utf-8") for line in urlopen(URL)]
-
-xid_start: List[Tuple[str, str]] = []
-xid_continue: List[Tuple[str, str]] = []
-uppercase_letter: List[Tuple[str, str]] = []
-lowercase_letter: List[Tuple[str, str]] = []
-unicode_letter: List[Tuple[str, str]] = []
-
-# Underscore technically isn't in XID_Start,
-# but for our purposes it's included.
-xid_start.append(("0x005F", "0x005F"))
-
-for line in lines:
- if line.startswith("#") or not line.strip():
- continue
-
- split_line = line.split(";")
-
- char_range = split_line[0].strip()
- char_property = split_line[1].strip().split("#")[0].strip()
-
- range_start = char_range
- range_end = char_range
- if ".." in char_range:
- range_start, range_end = char_range.split("..")
-
- if char_property == "XID_Start":
- xid_start.append((f"0x{range_start}", f"0x{range_end}"))
- elif char_property == "XID_Continue":
- xid_continue.append((f"0x{range_start}", f"0x{range_end}"))
- elif char_property == "Uppercase":
- uppercase_letter.append((f"0x{range_start}", f"0x{range_end}"))
- elif char_property == "Lowercase":
- lowercase_letter.append((f"0x{range_start}", f"0x{range_end}"))
- elif char_property == "Alphabetic":
- unicode_letter.append((f"0x{range_start}", f"0x{range_end}"))
-
-xid_start.sort(key=lambda x: int(x[0], 16))
-
-xid_start = merge_ranges(xid_start)
-xid_continue = merge_ranges(xid_continue)
-uppercase_letter = merge_ranges(uppercase_letter)
-lowercase_letter = merge_ranges(lowercase_letter)
-unicode_letter = merge_ranges(unicode_letter)
-
-
-char_range_str = f"""/**************************************************************************/
-/* char_range.inc */
-/**************************************************************************/
-/* This file is part of: */
-/* GODOT ENGINE */
-/* https://godotengine.org */
-/**************************************************************************/
-/* Copyright (c) 2014-present Godot Engine contributors (see AUTHORS.md). */
-/* Copyright (c) 2007-2014 Juan Linietsky, Ariel Manzur. */
-/* */
-/* Permission is hereby granted, free of charge, to any person obtaining */
-/* a copy of this software and associated documentation files (the */
-/* "Software"), to deal in the Software without restriction, including */
-/* without limitation the rights to use, copy, modify, merge, publish, */
-/* distribute, sublicense, and/or sell copies of the Software, and to */
-/* permit persons to whom the Software is furnished to do so, subject to */
-/* the following conditions: */
-/* */
-/* The above copyright notice and this permission notice shall be */
-/* included in all copies or substantial portions of the Software. */
-/* */
-/* THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, */
-/* EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF */
-/* MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. */
-/* IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY */
-/* CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, */
-/* TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE */
-/* SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE. */
-/**************************************************************************/
+def generate_char_range_inc() -> None:
+ parse_unicode_data()
+
+ source: str = generate_copyright_header("char_range.inc")
+
+ source += f"""
+// This file was generated using the `char_range_fetch.py` script.
+
#ifndef CHAR_RANGE_INC
#define CHAR_RANGE_INC
+
#include "core/typedefs.h"
+
// Unicode Derived Core Properties
// Source: {URL}
-// This file was generated using the `char_range_fetch.py` script.
+
struct CharRange {{
\tchar32_t start;
\tchar32_t end;
-}};
-constexpr inline CharRange xid_start[] = {{
-\t"""
-
-for start, end in xid_start:
- char_range_str += f"{{ {start}, {end} }},\n\t"
-
-char_range_str = char_range_str[:-1] # Remove trailing tab.
-char_range_str += """};
-constexpr inline CharRange xid_continue[] = {
-\t"""
-
-for start, end in xid_continue:
- char_range_str += f"{{ {start}, {end} }},\n\t"
-
-char_range_str = char_range_str[:-1] # Remove trailing tab.
-char_range_str += """};
-constexpr inline CharRange uppercase_letter[] = {
-\t"""
-
-for start, end in uppercase_letter:
- char_range_str += f"{{ {start}, {end} }},\n\t"
-
-char_range_str = char_range_str[:-1] # Remove trailing tab.
-char_range_str += """};
-constexpr inline CharRange lowercase_letter[] = {
-\t"""
+}};\n\n"""
-for start, end in lowercase_letter:
- char_range_str += f"{{ {start}, {end} }},\n\t"
+ source += make_range("xid_start", xid_start)
+ source += make_range("xid_continue", xid_continue)
+ source += make_range("uppercase_letter", uppercase_letter)
+ source += make_range("lowercase_letter", lowercase_letter)
+ source += make_range("unicode_letter", unicode_letter)
-char_range_str = char_range_str[:-1] # Remove trailing tab.
-char_range_str += """};
-constexpr inline CharRange unicode_letter[] = {
-\t"""
+ source += "#endif // CHAR_RANGE_INC\n"
-for start, end in unicode_letter:
- char_range_str += f"{{ {start}, {end} }},\n\t"
+ char_range_path = os.path.join(os.path.dirname(__file__), "char_range.inc")
+ with open(char_range_path, "w", newline="\n") as f:
+ f.write(source)
-char_range_str = char_range_str[:-1] # Remove trailing tab.
-char_range_str += """};
-#endif // CHAR_RANGE_INC
-"""
+ print("`char_range.inc` generated successfully.")
-char_range_path = os.path.join(os.path.dirname(__file__), "char_range.inc")
-with open(char_range_path, "w", newline="\n") as f:
- f.write(char_range_str)
-print("`char_range.inc` generated successfully.")
+if __name__ == "__main__":
+ generate_char_range_inc()
Updated, though I omitted type hints wherever the type can be reasonably deduced from the expression (not only by the reader; an IDE should have no problem with this). One argument against this is that a variable could hold a union type, but the absence of an explicit annotation should itself signal that this isn't happening.
It might be better to move the script to `misc\scripts\` to make it easier to find.
Note: It seems the script is missing the executable flag. Not a big deal, but it can be set even on an OS without POSIX permission support, such as Windows, with this command:
git update-index --chmod=+x <file>
> Updated, though I omitted type hints wherever the type can be reasonably deduced from the expression (not only by the user; an IDE should have no problem with this).

I'm not insisting, but I think it's more obvious, similar to how we don't allow `auto` in our C++ codebase (with a few exceptions). I also don't like that Python doesn't use explicit variable definitions. Type hints somewhat compensate for this and make it easier to find where a variable is first defined.
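The two positions can be seen side by side in a tiny sketch. The variable names are invented for illustration and do not come from the script.

```python
from typing import List, Tuple

# Explicit style: every first assignment carries an annotation, which
# also marks the spot where the variable is introduced.
ranges: List[Tuple[int, int]] = [(0x41, 0x5A)]
first_start: int = ranges[0][0]

# Inferred style: the type follows from the expression, so a reader or
# an IDE can deduce `int` without an annotation.
first_end = ranges[0][1]

print(first_start, first_end)
```

Both lines behave identically at runtime; the annotations only affect readability and static analysis.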
Co-authored-by: Danil Alexeev <dalexeev12@yandex.ru>
Not how I'd do it, but I can see merit in this approach; a new version of the script is up :)
Thanks!
This PR automates (via a Python script) the update of the `char_range.inc` file when a new UCD version appears.
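As a rough illustration of what the script does with each line of `DerivedCoreProperties.txt`, the sketch below parses a single data line into an integer range and a property name. It is a simplified stand-in, not the PR's code: the function name `parse_line` is hypothetical, and the sample lines are written in the file's format for the example rather than fetched from unicode.org.

```python
from typing import Optional, Tuple


def parse_line(line: str) -> Optional[Tuple[int, int, str]]:
    """Return (start, end, property), or None for comments and blank lines."""
    # Everything after '#' is a comment.
    data = line.split("#", 1)[0].strip()
    if not data:
        return None
    fields = [field.strip() for field in data.split(";")]
    char_range, char_property = fields[0], fields[1]
    # A single code point has no "..", so the end defaults to the start.
    start, _, end = char_range.partition("..")
    return int(start, 16), int(end or start, 16), char_property


samples = [
    "# Derived Property: XID_Start",
    "0041..005A    ; XID_Start # L&  [26] LATIN CAPITAL LETTER A..LATIN CAPITAL LETTER Z",
    "00AA          ; XID_Start # Lo       FEMININE ORDINAL INDICATOR",
]
for sample in samples:
    print(parse_line(sample))
```

Keeping the code points as integers until the output is written avoids repeated `int(..., 16)` conversions when sorting and merging.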