Skip to content

Conversation

muellerj2
Copy link
Contributor

Towards #997 and #1528.

This implements processing of greedy simple loops in a non-recursive way. After this PR, we actually do greedy matching when greedy quantifiers are used. But in exchange, we have to store more frames in the _Frames vector. But I think we should now take the chance to actually process the regex as the drafter of the regex requested. (Before this PR, "more memory" would have meant "more stack", so the correct processing order would have resulted in many more stack overflows.)

The greedy processing is implemented as follows:

  • In _N_rep, we first try to match the repeated pattern, if the maximum number of reps is greater than zero. We also push a frame with opcode _Loop_simple_greedy in this case if the minimum number of reps is zero. If the maximum number of reps is zero, we instead try to match the remainder of the regex.
  • In _N_end_rep, we also first try to match the repeated pattern, if the maximum number of reps hasn't been reached yet, and push a frame with opcode _Loop_simple_greedy if the minimum number of reps has been reached. If the maximum number of reps has been reached, we instead try to match the remainder of the regex. We also avoid UB now by checking for potential overflow of _Loop_idx.
  • During backtracking, we try to match the tail when processing _Loop_simple_greedy if matching has failed, resetting the position in the input and the state of capturing groups appropriately. (We do not have to handle _Longest here because we always perform non-greedy matching when _Longest is true.) As already argued in <regex>: Process minimum number of reps in simple loops non-recursively #5762, we don't have to restore loop state here because simple loops are branchless and non-reentrant.

The increase in size of the _Frames vector is no longer accurately reflected by the preexisting stack usage counter. But this is deliberate choice to preserve backwards compatibility: If the stack usage count were increased as well, we might throw a regex_error(error_stack) on inputs that were previously accepted.

It is possible to avoid this increase in size of the _Frames vector. But this would majorly complicate this PR and a few more changes I consider more relevant, so I would like to defer this to a later PR.

I felt again that the existing test coverage was a bit insufficient, so I added a few more tests. This includes tests that greedy quantifiers and non-greedy quantifiers (and leftmost-longest mode) can yield matches of different length when searching.

@muellerj2 muellerj2 requested a review from a team as a code owner October 18, 2025 19:58
@github-project-automation github-project-automation bot moved this to Initial Review in STL Code Reviews Oct 18, 2025
@StephanTLavavej StephanTLavavej added enhancement Something can be improved regex meow is a substring of homeowner labels Oct 20, 2025
@StephanTLavavej StephanTLavavej self-assigned this Oct 20, 2025
@StephanTLavavej
Copy link
Member

Thanks as always for the exceptionally clear writeup, careful changes, and added test coverage! 😻

I really appreciate the attention to avoiding disruption here:

The increase in size of the _Frames vector is no longer accurately reflected by the preexisting stack usage counter. But this is deliberate choice to preserve backwards compatibility: If the stack usage count were increased as well, we might throw a regex_error(error_stack) on inputs that were previously accepted.

I pushed a comment typo fix and minor test fixes. Also I noticed how you predict what the PR number is going to be, so your test mentions it correctly on the first try 😹

@StephanTLavavej StephanTLavavej removed their assignment Oct 20, 2025
@StephanTLavavej StephanTLavavej moved this from Initial Review to Ready To Merge in STL Code Reviews Oct 20, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

enhancement Something can be improved regex meow is a substring of homeowner

Projects

Status: Ready To Merge

Development

Successfully merging this pull request may close these issues.

2 participants