Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

PEG look rule sometimes produces captures, despite documentation otherwise #1554

Closed
edsrzf opened this issue Feb 2, 2025 · 5 comments
Closed
Labels
bug This is not expected behavior, and needs fixing

Comments

@edsrzf
Copy link

edsrzf commented Feb 2, 2025

The PEG look rule documentation states "patt will not produce captures". However it does sometimes produce captures. For example:

(peg/match '(* "a" (> '1)) "abc")

Result: @["b"]

However sometimes it doesn't:

(peg/match '(* "a" (? (> '1))) "abc")

Result: @[]

I think it would be pretty useful if look consistently produced captures, but either way I think the behavior and documentation should be consistent. (If we do decide to produce captures, please also consider #1553.)

@sogaiu
Copy link
Contributor

sogaiu commented Feb 2, 2025

look's documentation is currently (with slight emphasis added):

(look offset ?patt)

If patt is provided, matches only if patt matches at a fixed offset. offset can be any integer. patt will not produce captures and the peg will not advance any characters. If patt is not provided, matching occurs as if the call had been (look 0 offset).

I interpret that to mean that the first argument offset is required and that it should [1] be an integer. There can also be a second argument patt which can be something other than an integer, but the argument is optional.

The example posted uses (> '1), which rewritten I think is equivalent to (look (capture 1)). That seems to me to correspond to the pattern (look offset) which has only one argument. So the argument should be an integer, but in (> '1), it is not.

What do you think? Is that interpretation off?

Update: I think it is off (^^;


[1] It is true that the word "can" is used in the description:

offset can be any integer.

but I think the intent was to exclude possibilities other than integers.

@edsrzf
Copy link
Author

edsrzf commented Feb 2, 2025

Sorry, I think I've confused things by using an integer and single quote in my example. This is equivalent and also produces a capture:

(peg/match '(* "a" (> (capture "b"))) "abc")

Produces: @["b"]

If I specify a non-zero offset, it also produces a capture:

(peg/match '(* "a" (> 1 (capture "c"))) "abc")

Produces: @["c"]

These modifications still match, but don't produce any captures:

(peg/match '(* "a" (? (> (capture "b")))) "abc")
(peg/match '(* "a" (? (> 1 (capture "c")))) "abc")

@bakpakin bakpakin added the bug This is not expected behavior, and needs fixing label Feb 3, 2025
@sogaiu
Copy link
Contributor

sogaiu commented Feb 3, 2025

I agree there is a discrepancy between the documentation and the behavior, at least for:

(peg/match '(* "a" (> 1 (capture "c"))) "abc")

If I had to guess, I would say that one intent for stating that patt does not produce captures is so that patt can be used in the process of doing the "looking" in a convenient way that is less disruptive / more convenient to the rest of the overall processing [1]. May be that's off and/or there are other reasons.


[1] May be part of the idea was to not have to use drop?

@bakpakin
Copy link
Member

bakpakin commented Feb 3, 2025

If patt is provided, matches only if patt matches at a fixed offset. offset can be any integer. patt will not produce captures and the peg will not advance any characters. If patt is not provided, matching occurs as if the call had been (look 0 offset).

I don't know when it was decided that lookahead should not produce captures, but in most cases they do. We should fix the docs here.

The bug here is actually in the behavior of ?, which is implemented as (between 0 1 rule). There is code in the implementation of between to detect infinite loops where rule is matching indefinitely while making no forward progress (not advancing). Since the look ahead rule technically makes no forward progress, it's counts as a termination rule inside of a loop.

Consider the code:

(peg/match '(any (> '1)) "abc")

without the termination check, this peg would capture an infinite number of "a"s. The current implementation will detect this and bail after the first match with no progress, though.

However, I would argue that there is no need for this termination check at all for the ? rule.

There are a number of fixes to this, but this is a strange interaction between ? and >.

bakpakin added a commit that referenced this issue Feb 3, 2025
optionals.

This fixes some undesirable interplay between lookaheads and looping
(repeating) rules where lookaheads caused the loop to immediately
terminate _and_ discard any captures. This change adds an exception if
there is a 0-width match at the first iteration of the loop, in which
case captures will be kept, although the loop will still be terminated
(any further iteration of the loop would possibly cause an infinite
 loop).

This allows optionals to work as expected when combined with lookahead.
There is also a question of whether or not optional should be
implemented as separate rule rather than as `(between 0 1 subrule)`.
@sogaiu
Copy link
Contributor

sogaiu commented Feb 3, 2025

@edsrzf Thanks for making this issue. In addition to its value related to helping to improve Janet, it surfaced some bugs in my understanding and own code (^^;


I have an idea for modifying the documentation of look here, would you mind taking a look?

In addition to trying to address:

I don't know when it was decided that lookahead should not produce captures, but in most cases they do. We should fix the docs here.

I've tried to make some changes that might reduce confusion (though may be it was just me!).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug This is not expected behavior, and needs fixing
Projects
None yet
Development

No branches or pull requests

3 participants