Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

RegEx: Fix handling of unset/unknown capture groups #73973

Merged
merged 1 commit into from
Dec 2, 2024

Conversation

dalexeev
Copy link
Member

@dalexeev dalexeev commented Feb 26, 2023

Add the PCRE2_SUBSTITUTE_UNSET_EMPTY flag so that unset capture groups do not clear the line.

https://www.pcre.org/current/doc/html/pcre2api.html

PCRE2_SUBSTITUTE_UNKNOWN_UNSET causes references to capture groups that do not appear in the pattern to be treated as unset groups. This option should be used with care, because it means that a typo in a group name or number no longer causes the PCRE2_ERROR_NOSUBSTRING error.

PCRE2_SUBSTITUTE_UNSET_EMPTY causes unset capture groups (including unknown groups when PCRE2_SUBSTITUTE_UNKNOWN_UNSET is set) to be treated as empty strings when inserted as described above. If this option is not set, an attempt to insert an unset group causes the PCRE2_ERROR_UNSET error. This option does not influence the extended substitution syntax described below.

  • Add error message output.

Example:

var line := "property get_property set_property"

var re1 := RegEx.new()
re1.compile("((get_|set_)?)property")

var re2 := RegEx.new()
re2.compile("(get_|set_)?property")

print("|", re1.sub(line, "$1new_property", true), "|")
print("|", re2.sub(line, "$1new_property", true), "|")
print("|", re2.sub(line, "$5new_property", true), "|")

Before:

|new_property get_new_property set_new_property|
||
||

After:

|new_property get_new_property set_new_property|
|new_property get_new_property set_new_property|
  modules/regex/regex.cpp:332 - PCRE2 Error: unknown substring
|new_property new_property new_property|

@dalexeev dalexeev requested a review from a team as a code owner February 26, 2023 12:16
@akien-mga akien-mga added this to the 4.1 milestone Feb 26, 2023
@akien-mga akien-mga modified the milestones: 4.1, 4.2 Jun 19, 2023
@YuriSizov YuriSizov modified the milestones: 4.2, 4.3 Nov 13, 2023
@dalexeev dalexeev modified the milestones: 4.3, 4.4 Jul 24, 2024
@akien-mga
Copy link
Member

akien-mga commented Nov 29, 2024

This makes sense to me. It makes the behavior consistent with Python 3's re.sub().

$ python3
Python 3.13.0 (main, Oct  8 2024, 00:00:00) [GCC 14.2.1 20240912 (Red Hat 14.2.1-3)] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import re
>>> line = "property get_property set_property"
>>> res1 = re.sub(r"((get_|set_)?)property", r"\1new_property", line)
>>> res1
'new_property get_new_property set_new_property'
>>> res2 = re.sub(r"(get_|set_)?property", r"\1new_property", line)
>>> res2
'new_property get_new_property set_new_property'
>>> res3 = re.sub(r"(get_|set_)?property", r"\5new_property", line)
Traceback (most recent call last):
  File "<python-input-8>", line 1, in <module>
    res3 = re.sub(r"(get_|set_)?property", r"\5new_property", line)
  File "/usr/lib64/python3.13/re/__init__.py", line 208, in sub
    return _compile(pattern, flags).sub(repl, string, count)
           ~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib64/python3.13/re/__init__.py", line 377, in _compile_template
    return _sre.template(pattern, _parser.parse_template(repl, pattern))
                                  ~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^
  File "/usr/lib64/python3.13/re/_parser.py", line 1070, in parse_template
    addgroup(int(this[1:]), len(this) - 1)
    ~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib64/python3.13/re/_parser.py", line 1015, in addgroup
    raise s.error("invalid group reference %d" % index, pos)
re.PatternError: invalid group reference 5 at position 1
>>> res3
Traceback (most recent call last):
  File "<python-input-9>", line 1, in <module>
    res3
NameError: name 'res3' is not defined. Did you mean: 'res1'?

Also checked sed:

$ echo "property get_property set_property" | sed 's/\(\(get_\|set_\)?\)property/\1new_property/g'
property get_property set_property
$ echo "property get_property set_property" | sed 's/\(get_\|set_\)?property/\1new_property/g'
property get_property set_property
$ echo "property get_property set_property" | sed 's/\(get_\|set_\)?property/\5new_property/g'
sed: -e expression #1, char 42: invalid reference \5 on `s' command's RHS

And perl:

$ perl -pe 's/((get_|set_)?)property/\1new_property/g' <<< "property get_property set_property"
new_property get_new_property set_new_property
$ perl -pe 's/(get_|set_)?property/\1new_property/g' <<< "property get_property set_property"
new_property get_new_property set_new_property
$ perl -pe 's/(get_|set_)?property/\5new_property/g' <<< "property get_property set_property"
new_property new_property new_property

Copy link
Member

@akien-mga akien-mga left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this could be merged as is, but indeed there's a case to be made about what the return value should be in case of failure.

@akien-mga
Copy link
Member

I've just been thinking that it would be good to add a unit test for those changes.

@dalexeev dalexeev changed the title RegEx: Add PCRE2_SUBSTITUTE_UNSET_EMPTY flag and output error messages RegEx: Fix handling of unset/unknown capture groups Nov 30, 2024
@akien-mga
Copy link
Member

Thanks!

@akien-mga akien-mga merged commit 8678069 into godotengine:master Dec 2, 2024
20 checks passed
@dalexeev dalexeev deleted the fix-regex-sub branch December 2, 2024 18:40
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add option to allow RegEx substitution empty replacement groups
3 participants