`<regex>`: Process greedy simple loops non-recursively #5790
Thanks as always for the exceptionally clear writeup, careful changes, and added test coverage! 😻 I really appreciate the attention to avoiding disruption here:
I pushed a comment typo fix and minor test fixes. Also I noticed how you predict what the PR number is going to be, so your test mentions it correctly on the first try 😹
Towards #997 and #1528.
This implements processing of greedy simple loops in a non-recursive way. After this PR, we actually do greedy matching when greedy quantifiers are used. In exchange, we have to store more frames in the `_Frames` vector. But I think we should now take the chance to actually process the regex as the drafter of the regex requested. (Before this PR, "more memory" would have meant "more stack", so the correct processing order would have resulted in many more stack overflows.)

The greedy processing is implemented as follows:

- For `_N_rep`, we first try to match the repeated pattern if the maximum number of reps is greater than zero. In this case, we also push a frame with opcode `_Loop_simple_greedy` if the minimum number of reps is zero. If the maximum number of reps is zero, we instead try to match the remainder of the regex.
- For `_N_end_rep`, we also first try to match the repeated pattern if the maximum number of reps hasn't been reached yet, and push a frame with opcode `_Loop_simple_greedy` if the minimum number of reps has been reached. If the maximum number of reps has been reached, we instead try to match the remainder of the regex. We also avoid UB now by checking for potential overflow of `_Loop_idx`.
- A frame with opcode `_Loop_simple_greedy` is processed if matching has failed, resetting the position in the input and the state of capturing groups appropriately. (We do not have to handle `_Longest` here because we always perform non-greedy matching when `_Longest` is true.) As already argued in `<regex>`: Process minimum number of reps in simple loops non-recursively #5762, we don't have to restore loop state here because simple loops are branchless and non-reentrant.

The increase in size of the `_Frames` vector is no longer accurately reflected by the preexisting stack usage counter. But this is a deliberate choice to preserve backwards compatibility: if the stack usage count were increased as well, we might throw a `regex_error(error_stack)` on inputs that were previously accepted.

It is possible to avoid this increase in size of the `_Frames` vector. But this would majorly complicate this PR and a few more changes I consider more relevant, so I would like to defer this to a later PR.

I felt again that the existing test coverage was a bit insufficient, so I added a few more tests. These include tests that greedy quantifiers and non-greedy quantifiers (and leftmost-longest mode) can yield matches of different length when searching.
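
To illustrate the idea of backtracking through pushed frames instead of recursion, here is a minimal toy sketch. It is not the STL's matcher: `Frame`, `match_greedy_simple_loop`, and the restriction to a single repeated character followed by a literal remainder are all assumptions made for illustration. It only shows why greedy processing stores one fallback frame per repetition once the minimum number of reps has been reached.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical toy model (not the STL's actual data structures): a fallback
// frame only remembers where to resume matching the remainder.
struct Frame {
    std::size_t pos;
};

// Matches ch{min_reps,max_reps} followed by the literal `rest`, anchored at
// position 0. Returns the length of the match, or -1 if there is none.
long long match_greedy_simple_loop(const std::string& input, char ch,
                                   std::size_t min_reps, std::size_t max_reps,
                                   const std::string& rest) {
    std::vector<Frame> frames; // explicit backtracking stack instead of recursion
    std::size_t pos = 0;
    std::size_t reps = 0;

    // Greedy phase: prefer another repetition. Whenever the minimum is already
    // satisfied, push a frame so that we can later fall back to "stop here".
    while (reps < max_reps && pos < input.size() && input[pos] == ch) {
        if (reps >= min_reps) {
            frames.push_back({pos});
        }
        ++pos;
        ++reps;
    }
    if (reps < min_reps) {
        return -1; // the mandatory repetitions could not be matched
    }

    // Try the remainder; on failure, pop a frame and reset the input position.
    for (;;) {
        if (input.compare(pos, rest.size(), rest) == 0) {
            return static_cast<long long>(pos + rest.size());
        }
        if (frames.empty()) {
            return -1;
        }
        pos = frames.back().pos;
        frames.pop_back();
    }
}
```

For example, `match_greedy_simple_loop("aaab", 'a', 0, 10, "ab")` first consumes all three `a` characters, fails to match `ab` at the end, and then succeeds after popping one frame. The number of frames grows with the number of repetitions, which is the memory trade-off described above.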
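
The difference in match length between greedy and non-greedy quantifiers when searching can be observed with a small helper like the one below. This is only a sketch using the standard `std::regex_search` interface; the helper name `first_match` is hypothetical and is not taken from the PR's tests.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: returns the text of the first match found by
// std::regex_search (default ECMAScript grammar), or an empty string
// if nothing matches.
std::string first_match(const std::string& text, const std::string& pattern) {
    const std::regex re(pattern);
    std::smatch m;
    if (std::regex_search(text, m, re)) {
        return m.str(0);
    }
    return std::string{};
}
```

With the input `"xaaay"`, the greedy pattern `a+` yields `"aaa"`, while the non-greedy pattern `a+?` yields `"a"`.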