Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Incorrect behavior for Pattern::split #131

Closed
mahmoudimus opened this issue Feb 26, 2021 · 2 comments · Fixed by #138
Closed

Incorrect behavior for Pattern::split #131

mahmoudimus opened this issue Feb 26, 2021 · 2 comments · Fixed by #138
Assignees

Comments

@mahmoudimus
Copy link

How to reproduce

jshell> com.google.re2j.Pattern.compile("x*").split("foo")
$1 ==> String[4] { "", "f", "o", "o" }

jshell> java.util.regex.Pattern.compile("x*").split("foo")
$2 ==> String[3] { "f", "o", "o" }

Versions

  • java 11.0.10
  • re2j 1.5

Should I try for a PR?

@sjamesr
Copy link
Contributor

sjamesr commented Feb 26, 2021

Interesting, RE2J and java.util.regex evidently differ on how to handle leading zero-length matches. I guess we should bring RE2J into line. I'd appreciate a pull request if it isn't too much trouble for you.

Thanks for reporting this.

sjamesr added a commit to sjamesr/re2j that referenced this issue May 31, 2021
Fixes google#131.

This change modifies Pattern.split to omit a leading empty match. This
behavior was specified in JDK8 and brings RE2/J split into line with
more recent JDK implementations.

Furthermore, the split function no longer needs determine the number of
matches before assembling the result. The upshot is that the number of
find() calls is halved in many cases. The benchmark in the previous
change shows a significant improvement.

Reference impl (JDK):
BenchmarkSplit.benchmarkSplit     JDK  avgt    5  14.217 ± 0.410  us/op

RE2J (before):
BenchmarkSplit.benchmarkSplit    RE2J  avgt    5  95.807 ± 6.737  us/op

RE2J (after):
BenchmarkSplit.benchmarkSplit    RE2J  avgt    5  49.092 ± 0.717  us/op
sjamesr added a commit to sjamesr/re2j that referenced this issue May 31, 2021
Fixes google#131.

This change modifies Pattern.split to omit a leading empty match. This
behavior was specified in JDK8 and brings RE2/J split into line with
more recent JDK implementations.

Furthermore, the split function no longer needs determine the number of
matches before assembling the result. The upshot is that the number of
find() calls is halved in many cases. The benchmark in the previous
change shows a significant improvement.

Reference impl (JDK):
BenchmarkSplit.benchmarkSplit     JDK  avgt    5  14.217 ± 0.410  us/op

RE2J (before):
BenchmarkSplit.benchmarkSplit    RE2J  avgt    5  95.807 ± 6.737  us/op

RE2J (after):
BenchmarkSplit.benchmarkSplit    RE2J  avgt    5  49.092 ± 0.717  us/op
sjamesr added a commit to sjamesr/re2j that referenced this issue May 31, 2021
Fixes google#131.

This change modifies Pattern.split to omit a leading empty match. This
behavior was specified in JDK8 and brings RE2/J split into line with
more recent JDK implementations.

Furthermore, the split function no longer needs determine the number of
matches before assembling the result. The upshot is that the number of
find() calls is halved in many cases. The benchmark in the previous
change shows a significant improvement.

Reference impl (JDK):
BenchmarkSplit.benchmarkSplit     JDK  avgt    5  14.217 ± 0.410  us/op

RE2J (before):
BenchmarkSplit.benchmarkSplit    RE2J  avgt    5  95.807 ± 6.737  us/op

RE2J (after):
BenchmarkSplit.benchmarkSplit    RE2J  avgt    5  49.092 ± 0.717  us/op
sjamesr added a commit to sjamesr/re2j that referenced this issue May 31, 2021
Fixes google#131.

This change modifies Pattern.split to omit a leading empty match. This
behavior was specified in JDK8 and brings RE2/J split into line with
more recent JDK implementations.

Furthermore, the split function no longer needs determine the number of
matches before assembling the result. The upshot is that the number of
find() calls is halved in many cases. The benchmark in the previous
change shows a significant improvement.

Reference impl (JDK):
BenchmarkSplit.benchmarkSplit     JDK  avgt    5  14.217 ± 0.410  us/op

RE2J (before):
BenchmarkSplit.benchmarkSplit    RE2J  avgt    5  95.807 ± 6.737  us/op

RE2J (after):
BenchmarkSplit.benchmarkSplit    RE2J  avgt    5  49.092 ± 0.717  us/op
@sjamesr
Copy link
Contributor

sjamesr commented May 31, 2021

@adonovan would you be able to take a look at #138? Unfortunately this change causes RE2J's Pattern.split method to be textually quite different from Go's. Also, it changes the behavior of split to omit empty matches, which brings it into line with java.util.regex but probably not with RE2 (I didn't test the behavior of other regex impls).

@sjamesr sjamesr mentioned this issue May 31, 2021
sjamesr added a commit to sjamesr/re2j that referenced this issue May 31, 2021
Fixes google#131.

This change modifies Pattern.split to omit a leading empty match. This
behavior was specified in JDK8 and brings RE2/J split into line with
more recent JDK implementations.

Furthermore, the split function no longer needs determine the number of
matches before assembling the result. The upshot is that the number of
find() calls is halved in many cases. The benchmark in the previous
change shows a significant improvement.

Reference impl (JDK):
BenchmarkSplit.benchmarkSplit     JDK  avgt    5  14.217 ± 0.410  us/op

RE2J (before):
BenchmarkSplit.benchmarkSplit    RE2J  avgt    5  95.807 ± 6.737  us/op

RE2J (after):
BenchmarkSplit.benchmarkSplit    RE2J  avgt    5  49.092 ± 0.717  us/op
sjamesr added a commit to sjamesr/re2j that referenced this issue May 31, 2021
Fixes google#131.

This change modifies Pattern.split to omit a leading empty match. This
behavior was specified in JDK8 and brings RE2/J split into line with
more recent JDK implementations.

Furthermore, the split function no longer needs determine the number of
matches before assembling the result. The upshot is that the number of
find() calls is halved in many cases. The benchmark in the previous
change shows a significant improvement.

Reference impl (JDK):
BenchmarkSplit.benchmarkSplit     JDK  avgt    5  14.217 ± 0.410  us/op

RE2J (before):
BenchmarkSplit.benchmarkSplit    RE2J  avgt    5  95.807 ± 6.737  us/op

RE2J (after):
BenchmarkSplit.benchmarkSplit    RE2J  avgt    5  49.092 ± 0.717  us/op
sjamesr added a commit to sjamesr/re2j that referenced this issue May 31, 2021
Fixes google#131.

This change modifies Pattern.split to omit a leading empty match. This
behavior was specified in JDK8 and brings RE2/J split into line with
more recent JDK implementations.

Furthermore, the split function no longer needs determine the number of
matches before assembling the result. The upshot is that the number of
find() calls is halved in many cases. The benchmark in the previous
change shows a significant improvement.

Reference impl (JDK):
BenchmarkSplit.benchmarkSplit     JDK  avgt    5  14.217 ± 0.410  us/op

RE2J (before):
BenchmarkSplit.benchmarkSplit    RE2J  avgt    5  95.807 ± 6.737  us/op

RE2J (after):
BenchmarkSplit.benchmarkSplit    RE2J  avgt    5  49.092 ± 0.717  us/op
sjamesr added a commit to sjamesr/re2j that referenced this issue Jun 2, 2021
Fixes google#131.

This change modifies Pattern.split to omit a leading empty match. This
behavior was specified in JDK8 and brings RE2/J split into line with
more recent JDK implementations.

Furthermore, the split function no longer needs determine the number of
matches before assembling the result. The upshot is that the number of
find() calls is halved in many cases. The benchmark in the previous
change shows a significant improvement.

Reference impl (JDK):
BenchmarkSplit.benchmarkSplit     JDK  avgt    5  14.217 ± 0.410  us/op

RE2J (before):
BenchmarkSplit.benchmarkSplit    RE2J  avgt    5  95.807 ± 6.737  us/op

RE2J (after):
BenchmarkSplit.benchmarkSplit    RE2J  avgt    5  49.092 ± 0.717  us/op
sjamesr added a commit to sjamesr/re2j that referenced this issue Jun 2, 2021
Fixes google#131.

This change modifies Pattern.split to omit a leading empty match. This
behavior was specified in JDK8 and brings RE2/J split into line with
more recent JDK implementations.

Furthermore, the split function no longer needs determine the number of
matches before assembling the result. The upshot is that the number of
find() calls is halved in many cases. The benchmark in the previous
change shows a significant improvement.

Reference impl (JDK):
BenchmarkSplit.benchmarkSplit     JDK  avgt    5  14.217 ± 0.410  us/op

RE2J (before):
BenchmarkSplit.benchmarkSplit    RE2J  avgt    5  95.807 ± 6.737  us/op

RE2J (after):
BenchmarkSplit.benchmarkSplit    RE2J  avgt    5  49.092 ± 0.717  us/op
@sjamesr sjamesr self-assigned this Jun 2, 2021
sjamesr added a commit to sjamesr/re2j that referenced this issue Jun 3, 2021
Fixes google#131.

This change modifies Pattern.split to omit a leading empty match. This
behavior was specified in JDK8 and brings RE2/J split into line with
more recent JDK implementations.

Furthermore, the split function no longer needs determine the number of
matches before assembling the result. The upshot is that the number of
find() calls is halved in many cases. The benchmark in the previous
change shows a significant improvement.

Reference impl (JDK):
BenchmarkSplit.benchmarkSplit     JDK  avgt    5  14.217 ± 0.410  us/op

RE2J (before):
BenchmarkSplit.benchmarkSplit    RE2J  avgt    5  95.807 ± 6.737  us/op

RE2J (after):
BenchmarkSplit.benchmarkSplit    RE2J  avgt    5  49.092 ± 0.717  us/op
sjamesr added a commit to sjamesr/re2j that referenced this issue Jun 6, 2021
Fixes google#131.

This change modifies Pattern.split to omit a leading empty match. This
behavior was specified in JDK8 and brings RE2/J split into line with
more recent JDK implementations.

Furthermore, the split function no longer needs determine the number of
matches before assembling the result. The upshot is that the number of
find() calls is halved in many cases. The benchmark in the previous
change shows a significant improvement.

Reference impl (JDK):
BenchmarkSplit.benchmarkSplit     JDK  avgt    5  14.217 ± 0.410  us/op

RE2J (before):
BenchmarkSplit.benchmarkSplit    RE2J  avgt    5  95.807 ± 6.737  us/op

RE2J (after):
BenchmarkSplit.benchmarkSplit    RE2J  avgt    5  49.092 ± 0.717  us/op
sjamesr added a commit to sjamesr/re2j that referenced this issue Jul 12, 2021
Fixes google#131.

This change modifies Pattern.split to omit a leading empty match. This
behavior was specified in JDK8 and brings RE2/J split into line with
more recent JDK implementations.

Furthermore, the split function no longer needs determine the number of
matches before assembling the result. The upshot is that the number of
find() calls is halved in many cases. The benchmark in the previous
change shows a significant improvement.

Reference impl (JDK):
BenchmarkSplit.benchmarkSplit     JDK  avgt    5  14.217 ± 0.410  us/op

RE2J (before):
BenchmarkSplit.benchmarkSplit    RE2J  avgt    5  95.807 ± 6.737  us/op

RE2J (after):
BenchmarkSplit.benchmarkSplit    RE2J  avgt    5  49.092 ± 0.717  us/op
sjamesr added a commit to sjamesr/re2j that referenced this issue Jun 27, 2022
Fixes google#131.

This change modifies Pattern.split to omit a leading empty match. This
behavior was specified in JDK8 and brings RE2/J split into line with
more recent JDK implementations.

Furthermore, the split function no longer needs determine the number of
matches before assembling the result. The upshot is that the number of
find() calls is halved in many cases. The benchmark in the previous
change shows a significant improvement.

Reference impl (JDK):
BenchmarkSplit.benchmarkSplit     JDK  avgt    5  14.217 ± 0.410  us/op

RE2J (before):
BenchmarkSplit.benchmarkSplit    RE2J  avgt    5  95.807 ± 6.737  us/op

RE2J (after):
BenchmarkSplit.benchmarkSplit    RE2J  avgt    5  49.092 ± 0.717  us/op
sjamesr added a commit to sjamesr/re2j that referenced this issue Jun 27, 2022
Fixes google#131.

This change modifies Pattern.split to omit a leading empty match. This
behavior was specified in JDK8 and brings RE2/J split into line with
more recent JDK implementations.

Furthermore, the split function no longer needs determine the number of
matches before assembling the result. The upshot is that the number of
find() calls is halved in many cases. The benchmark in the previous
change shows a significant improvement.

Reference impl (JDK):
BenchmarkSplit.benchmarkSplit     JDK  avgt    5  14.217 ± 0.410  us/op

RE2J (before):
BenchmarkSplit.benchmarkSplit    RE2J  avgt    5  95.807 ± 6.737  us/op

RE2J (after):
BenchmarkSplit.benchmarkSplit    RE2J  avgt    5  49.092 ± 0.717  us/op
sjamesr added a commit that referenced this issue Jun 27, 2022
Fixes #131.

This change modifies Pattern.split to omit a leading empty match. This
behavior was specified in JDK8 and brings RE2/J split into line with
more recent JDK implementations.

Furthermore, the split function no longer needs determine the number of
matches before assembling the result. The upshot is that the number of
find() calls is halved in many cases. The benchmark in the previous
change shows a significant improvement.

Reference impl (JDK):
BenchmarkSplit.benchmarkSplit     JDK  avgt    5  14.217 ± 0.410  us/op

RE2J (before):
BenchmarkSplit.benchmarkSplit    RE2J  avgt    5  95.807 ± 6.737  us/op

RE2J (after):
BenchmarkSplit.benchmarkSplit    RE2J  avgt    5  49.092 ± 0.717  us/op
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants