Matching on WTF-8 strings and ECMAScript RegExp simulation #1279

aapoalas · 2025-08-21T16:27:47Z

Hey, I'm using the regex crate to implement the ECMAScript RegExp type in Nova JavaScript engine; the engine uses WTF-8 as the internal representation for strings and of course tries to get the ECMAScript RegExp specification implemented as best as possible.

I am of course aware that regex does not aim to match any specific language's regular expression syntax, and I'm not exactly looking to change that. Rather, I am trying to work out what it would mean to simulate WTF-8 matching by rewriting patterns, or what it would mean to support it fundamentally in the regex crate or in a fork of it.

What I mean by WTF-8 matching is, first and foremost, to allow matching on "unmatched surrogates" in a WTF-8 byte sequence. For example:

use regex::bytes::Regex;

fn main() {
    let haystack = [237, 161, 130];
    let re = Regex::new(r".").unwrap();
    assert_eq!(re.find(&haystack).map(|r| r.range()), Some(0..3));
}

Here we attempt to match an unmatched surrogate (\ud842 to be specific) in the default Unicode mode; this of course does not match and the assertion fails. This is absolutely understandable, but in the ECMAScript RegExp this does match even in the equivalent "Unicode" mode:

#!/usr/bin/env node
console.log(/./v.test("\ud842")); // true
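For reference, the haystack bytes [237, 161, 130] in the Rust example above are the WTF-8 encoding of U+D842. The following is a minimal sketch of that encoding, assuming the usual "generalized UTF-8" definition; the function name encode_wtf8 is illustrative and not part of the regex crate or of Nova.

```rust
/// Encode a code point as WTF-8 ("generalized UTF-8"). This is the same bit
/// layout as UTF-8, except that surrogates U+D800..=U+DFFF are accepted and
/// encoded as three bytes instead of being rejected.
fn encode_wtf8(cp: u32) -> Vec<u8> {
    assert!(cp <= 0x10FFFF, "not a Unicode code point");
    match cp {
        0..=0x7F => vec![cp as u8],
        0x80..=0x7FF => vec![0xC0 | (cp >> 6) as u8, 0x80 | (cp & 0x3F) as u8],
        // Surrogates fall in this arm, which is why they start with 0xED.
        0x800..=0xFFFF => vec![
            0xE0 | (cp >> 12) as u8,
            0x80 | ((cp >> 6) & 0x3F) as u8,
            0x80 | (cp & 0x3F) as u8,
        ],
        _ => vec![
            0xF0 | (cp >> 18) as u8,
            0x80 | ((cp >> 12) & 0x3F) as u8,
            0x80 | ((cp >> 6) & 0x3F) as u8,
            0x80 | (cp & 0x3F) as u8,
        ],
    }
}
```

A Unicode-mode class in regex only ever matches encodings of scalar values, so the 0xED 0xA0..=0xBF 0x80..=0xBF sequences produced here for surrogates are never matched by it.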

To "correctly" match these, I'd need to rewrite the . pattern to a combined character class that also matches any unmatched surrogate byte sequence. The same would go for a Unicode range pattern, for example:

use regex::bytes::Regex;

fn main() {
    let haystack = [237, 161, 130];
    let re = Regex::new(r"[\u0128-\uffff]").unwrap();
    assert_eq!(re.find(&haystack).map(|r| r.range()), Some(0..3));
}

#!/usr/bin/env node
console.log(/[\u0128-\uffff]/v.test("\ud842")); // true

Now I would need to rewrite the range to a character class matching any UTF-8 character in this range, or a WTF-8 lone surrogate in this range. Again, complicated but possible. It's worth noting that the same patterns work in JavaScript regardless of the v ("Unicode Sets mode") flag: the RegExp can be in the normal "UCS-2" mode and \uffff-style Unicode escapes are still allowed in the pattern. The same does not, of course, apply to the regex crate, so if I want to fully support the ECMAScript spec's definition of RegExp I will actually have to turn off Unicode mode and basically rewrite all character classes manually from bytes. Not impossible, but pretty complicated indeed.
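As a rough sketch of the first step of such a rewrite: a code point range can be split into the parts that contain only scalar values (which a Unicode-mode class can still express) and the part that contains only surrogates (which needs a byte-level alternative). The names below are hypothetical helpers, not regex APIs.

```rust
/// A closed range of code points, `(start, end)`.
type CpRange = (u32, u32);

const SURROGATE_START: u32 = 0xD800;
const SURROGATE_END: u32 = 0xDFFF;

/// Split `lo..=hi` into the sub-ranges that contain only Unicode scalar
/// values and the sub-range (if any) that contains only surrogates.
fn split_at_surrogates(lo: u32, hi: u32) -> (Vec<CpRange>, Option<CpRange>) {
    assert!(lo <= hi && hi <= 0x10FFFF, "invalid code point range");
    let mut scalars = Vec::new();
    // Part below the surrogate block.
    if lo < SURROGATE_START {
        scalars.push((lo, hi.min(SURROGATE_START - 1)));
    }
    // Part above the surrogate block.
    if hi > SURROGATE_END {
        scalars.push((lo.max(SURROGATE_END + 1), hi));
    }
    // Overlap with the surrogate block itself.
    let surrogates = if lo <= SURROGATE_END && hi >= SURROGATE_START {
        Some((lo.max(SURROGATE_START), hi.min(SURROGATE_END)))
    } else {
        None
    };
    (scalars, surrogates)
}
```

For the range U+0128 to U+FFFF used above, this yields the scalar ranges U+0128..=U+D7FF and U+E000..=U+FFFF plus the whole surrogate block.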

ECMAScript RegExp of course also allows unmatched surrogates to appear in Unicode escapes in the pattern itself. With the regex crate, by contrast, this pattern fails to compile:

use regex::bytes::Regex;

fn main() {
    let haystack = "𠮷".as_bytes();
    let re = Regex::new(r"\udfb7").unwrap();
    assert_eq!(re.find(&haystack).map(|r| r.range()), Some(0..4));
}

But in JS it's considered fine:

#!/usr/bin/env node
console.log(/\udfb7/v.test("𠮷")); // false
console.log(/\udfb7/u.test("𠮷")); // false
console.log(/\udfb7/.test("𠮷"));  // true

Note that in the ECMAScript RegExp Unicode modes the lone surrogate in the pattern does not match that same surrogate as part of a surrogate pair, but in the old "UCS-2" mode it does match. Lovely, innit? This same complication with the non-Unicode mode extends to the . character class as well: if we perform "𠮷".match(/./g) then it will actually produce two matches, one for each surrogate of the surrogate pair.
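To make the arithmetic behind this concrete, here is a small sketch of how a supplementary code point splits into the two UTF-16 code units that the non-Unicode mode sees. The helper name to_surrogate_pair is illustrative only.

```rust
/// Split a supplementary code point (U+10000..=U+10FFFF) into its UTF-16
/// surrogate pair `(high, low)`.
fn to_surrogate_pair(cp: u32) -> (u16, u16) {
    assert!((0x10000..=0x10FFFF).contains(&cp), "not a supplementary code point");
    let v = cp - 0x10000; // 20 significant bits
    // Top 10 bits go into the high surrogate, bottom 10 into the low one.
    (0xD800 + (v >> 10) as u16, 0xDC00 + (v & 0x3FF) as u16)
}
```

For U+20BB7 (𠮷) this gives 0xD842 and 0xDFB7, which is why /\udfb7/ without the u or v flag finds a match inside that character.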

A final complication is that at its core, I do want to use regex's Unicode mode always (as I am matching on nearly-UTF-8 strings), but unfortunately it also changes the meaning of e.g. . or \w and so on; in the ECMAScript RegExp specification those would (I believe) depend on the presence of the u or v flags. It is mildly unfortunate that I cannot control those character classes separately.

Now, I will reiterate that I do not expect regex to suddenly jump to my aid and implement any or all of this: it's not even exactly clear to me what that would mean! (Matching WTF-8 encoded unpaired surrogates and allowing them in Unicode escape input I would assume isn't too big of a difference, but "splitting" of Unicode characters into two unpaired WTF-8 surrogates is not possible with the Match type returning a reference into the source text slice. The best possible API I can imagine here would be to return effectively the same character Match twice, but with some extra field that tells if the result refers to the lower or upper "half" of the surrogate pair.) I'm more... fishing for thoughts.
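A rough sketch of what the "same match twice, plus a half marker" idea could look like follows. Everything here is hypothetical: CodeUnitMatch and Half are made-up types, not part of regex, and the scan assumes the haystack is well-formed WTF-8 and simply reports every UTF-16 code unit rather than running a real pattern.

```rust
use std::ops::Range;

/// Which UTF-16 code unit of a supplementary character a match refers to.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Half {
    High,
    Low,
}

/// Hypothetical match type: a byte range into the haystack, plus a marker
/// when the match covers only one half of a 4-byte (surrogate pair) character.
#[derive(Debug, Clone, PartialEq, Eq)]
struct CodeUnitMatch {
    range: Range<usize>,
    half: Option<Half>,
}

/// Report one match per UTF-16 code unit of a well-formed WTF-8 haystack.
fn code_unit_matches(haystack: &[u8]) -> Vec<CodeUnitMatch> {
    let mut out = Vec::new();
    let mut i = 0;
    while i < haystack.len() {
        // Sequence length from the lead byte (assumes well-formed input).
        let len = match haystack[i] {
            0x00..=0x7F => 1,
            0xC0..=0xDF => 2,
            0xE0..=0xEF => 3,
            _ => 4,
        };
        let range = i..i + len;
        if len == 4 {
            // A supplementary character is two code units: same bytes, twice.
            out.push(CodeUnitMatch { range: range.clone(), half: Some(Half::High) });
            out.push(CodeUnitMatch { range, half: Some(Half::Low) });
        } else {
            out.push(CodeUnitMatch { range, half: None });
        }
        i += len;
    }
    out
}
```

The byte range alone cannot distinguish the two halves, which is the limitation of the existing Match type described above.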

Sorry to be a bother, and cheers <3

Originally posted by @aapoalas in #1253 (comment)

Answered by BurntSushi

BurntSushi · 2025-08-21T17:11:09Z

BurntSushi
Aug 21, 2025
Maintainer

> I am rather trying to ponder on what it would mean to try simulate (by changing patterns) WTF-8 matching, or what it would mean to do that fundamentally in the regex crate or in a fork of it.

Shooting from the hip here, my best guess is that you might need something like regex_syntax::utf8, but for WTF-8. Specifically, that module provides APIs for taking sequences of Unicode scalar values to a corresponding byte-based automaton (see the doc examples in that module). Notice that it takes scalar values. Since WTF-8 is specifically designed to encode unpaired surrogates, you'd need something that takes in all possible Unicode codepoints. Arguably, you could do this by copying that module and then simplifying the code.
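To give a feel for what the surrogate-specific part of such a WTF-8 module might have to produce, here is a sketch that lowers a range of surrogates into sequences of byte ranges. It is written from scratch for illustration; the names are hypothetical and it does not reuse or mirror the actual regex_syntax::utf8 implementation.

```rust
/// An inclusive range of byte values.
type ByteRange = (u8, u8);

/// Second and third WTF-8 bytes of a surrogate (the first is always 0xED).
fn trailing_bytes(surrogate: u32) -> (u8, u8) {
    (0x80 | ((surrogate >> 6) & 0x3F) as u8, 0x80 | (surrogate & 0x3F) as u8)
}

/// Lower the surrogate range `lo..=hi` into byte-range sequences whose union
/// matches exactly the WTF-8 encodings of the surrogates in that range.
fn surrogate_sequences(lo: u32, hi: u32) -> Vec<[ByteRange; 3]> {
    assert!(0xD800 <= lo && lo <= hi && hi <= 0xDFFF, "not a surrogate range");
    let lead: ByteRange = (0xED, 0xED);
    let (lo2, lo3) = trailing_bytes(lo);
    let (hi2, hi3) = trailing_bytes(hi);
    if lo2 == hi2 {
        // Same second byte: only the third byte varies.
        return vec![[lead, (lo2, lo2), (lo3, hi3)]];
    }
    let mut seqs = Vec::new();
    let (mut mid_lo, mut mid_hi) = (lo2, hi2);
    if lo3 != 0x80 {
        // Partial first block: from lo3 up to the last continuation byte.
        seqs.push([lead, (lo2, lo2), (lo3, 0xBF)]);
        mid_lo += 1;
    }
    let tail = if hi3 != 0xBF {
        // Partial last block: from the first continuation byte up to hi3.
        mid_hi -= 1;
        Some([lead, (hi2, hi2), (0x80, hi3)])
    } else {
        None
    };
    if mid_lo <= mid_hi {
        // Blocks in between are fully covered.
        seqs.push([lead, (mid_lo, mid_hi), (0x80, 0xBF)]);
    }
    seqs.extend(tail);
    seqs
}
```

The full surrogate block collapses to the single sequence ED, A0..=BF, 80..=BF, which is the "any lone surrogate" byte pattern for well-formed WTF-8.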

For reference, here is where the regex crate uses regex_syntax::utf8 to "lower" the corresponding Unicode scalar value ranges into a byte based automaton:

regex/regex-automata/src/nfa/thompson/compiler.rs

Lines 1360 to 1447 in 01e2330

    
/// Compile the given Unicode character class.
///
/// This routine specifically tries to use various types of compression,
/// since UTF-8 automata of large classes can get quite large. The specific
/// type of compression used depends on forward vs reverse compilation, and
/// whether NFA shrinking is enabled or not.
///
/// Aside from repetitions causing lots of repeat group, this is like the
/// single most expensive part of regex compilation. Therefore, a large part
/// of the expense of compilation may be reduce by disabling Unicode in the
/// pattern.
///
/// This routine compiles an empty character class into a "fail" state.
fn c_unicode_class(
    &self,
    cls: &hir::ClassUnicode,
) -> Result<ThompsonRef, BuildError> {
    // If all we have are ASCII ranges wrapped in a Unicode package, then
    // there is zero reason to bring out the big guns. We can fit all ASCII
    // ranges within a single sparse state.
    if cls.is_ascii() {
        let end = self.add_empty()?;
        let mut trans = Vec::with_capacity(cls.ranges().len());
        for r in cls.iter() {
            // The unwraps below are OK because we've verified that this
            // class only contains ASCII codepoints.
            trans.push(Transition {
                // FIXME(1.59): use the 'TryFrom<char> for u8' impl.
                start: u8::try_from(u32::from(r.start())).unwrap(),
                end: u8::try_from(u32::from(r.end())).unwrap(),
                next: end,
            });
        }
        Ok(ThompsonRef { start: self.add_sparse(trans)?, end })
    } else if self.is_reverse() {
        if !self.config.get_shrink() {
            // When we don't want to spend the extra time shrinking, we
            // compile the UTF-8 automaton in reverse using something like
            // the "naive" approach, but will attempt to re-use common
            // suffixes.
            self.c_unicode_class_reverse_with_suffix(cls)
        } else {
            // When we want to shrink our NFA for reverse UTF-8 automata,
            // we cannot feed UTF-8 sequences directly to the UTF-8
            // compiler, since the UTF-8 compiler requires all sequences
            // to be lexicographically sorted. Instead, we organize our
            // sequences into a range trie, which can then output our
            // sequences in the correct order. Unfortunately, building the
            // range trie is fairly expensive (but not nearly as expensive
            // as building a DFA). Hence the reason why the 'shrink' option
            // exists, so that this path can be toggled off. For example,
            // we might want to turn this off if we know we won't be
            // compiling a DFA.
            let mut trie = self.trie_state.borrow_mut();
            trie.clear();
            for rng in cls.iter() {
                for mut seq in Utf8Sequences::new(rng.start(), rng.end()) {
                    seq.reverse();
                    trie.insert(seq.as_slice());
                }
            }
            let mut builder = self.builder.borrow_mut();
            let mut utf8_state = self.utf8_state.borrow_mut();
            let mut utf8c =
                Utf8Compiler::new(&mut *builder, &mut *utf8_state)?;
            trie.iter(|seq| {
                utf8c.add(&seq)?;
                Ok(())
            })?;
            utf8c.finish()
        }
    } else {
        // In the forward direction, we always shrink our UTF-8 automata
        // because we can stream it right into the UTF-8 compiler. There
        // is almost no downside (in either memory or time) to using this
        // approach.
        let mut builder = self.builder.borrow_mut();
        let mut utf8_state = self.utf8_state.borrow_mut();
        let mut utf8c =
            Utf8Compiler::new(&mut *builder, &mut *utf8_state)?;
        for rng in cls.iter() {
            for seq in Utf8Sequences::new(rng.start(), rng.end()) {
                utf8c.add(seq.as_slice())?;
            }
        }
        utf8c.finish()
    }
}

> A final complication is that at its core, I do want to use regex's Unicode mode always (as I am matching on nearly-UTF-8 strings), but unfortunately it also changes the meaning of eg. . or \w and so on; in the ECMAScript RegExp specification those would (I believe) depend on the presence of the u or v flags. It is mildly unfortunate that I cannot control those character classes separately.

You can! e.g., (?-u:.)(?u:.) will match any byte (sans \n) followed by any UTF-8 encoding of a Unicode scalar value.

Now, you can't control this within a character class. That is, you can't do (?-u)[\w(?u:\u0000-\uFFFF)] or something. In that case, you'd have to express it as (?-u)(?:\w|(?u:[\u0000-\uFFFF])). You'd just need to be careful that using an alternation doesn't change the semantics versus a character class. As long as you stick to matching single codepoints, I think you should be fine here.
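Combining this alternation trick with the lone-surrogate byte sequences discussed earlier in the thread, a pattern rewriter could look roughly like the sketch below. It only builds pattern strings; the result is meant for regex::bytes::Regex::new (the byte-oriented API, since the surrogate branch matches invalid UTF-8), which is not exercised here. The helper names are hypothetical, and the haystack is assumed to be well-formed WTF-8.

```rust
/// Byte-oriented pattern for any lone surrogate in well-formed WTF-8:
/// 0xED, then 0xA0..=0xBF, then 0x80..=0xBF, with Unicode mode switched off.
const LONE_SURROGATE: &str = r"(?-u:\xED[\xA0-\xBF][\x80-\xBF])";

/// Wrap a Unicode-mode class so that it also accepts lone surrogates.
/// Whether the original class was meant to include surrogates at all is the
/// caller's decision; this helper does not inspect `unicode_class`.
fn with_lone_surrogates(unicode_class: &str) -> String {
    format!("(?:(?u:{})|{})", unicode_class, LONE_SURROGATE)
}

/// A possible stand-in for ECMAScript's `.` (without the s flag), which
/// excludes the line terminators \n, \r, U+2028 and U+2029.
fn ecmascript_dot() -> String {
    with_lone_surrogates(r"[^\n\r\u2028\u2029]")
}
```

Because both branches match exactly one code point, the alternation should behave like a single character class, in line with the caveat above.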

So if I were you and I wanted to implement this in the quickest, cheapest way possible... what would I do? The requirement for supporting \ud842 unfortunately imposes a pretty clear implementation path. You have to use your own regex parser because regex-syntax will simply reject such things and there are no knobs to let you work around that. Moreover, the regex_syntax::hir::ClassUnicodeRange values that Unicode scalar value ranges get translated to are strictly confined to scalar values. So, e.g., \uD842 would be illegal there. That also suggests an implementation path.

I think what I'd do is something like this:

1. Fork regex-syntax and regex-automata and modify the parser (the thing that generates an AST) to allow parsing escape sequences for surrogate codepoints.
2. Change regex_syntax::utf8 to regex_syntax::wtf8 and make the necessary changes to support all codepoints.
3. Modify the HIR in regex-syntax to use u32 instead of char for representing codepoints. (This is likely to be a thorny change.)
4. In regex-automata, use your new regex_syntax::wtf8 in place of wherever regex_syntax::utf8 is used.
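The reason for moving the HIR from char to u32 is that Rust's char cannot hold a surrogate at all. A minimal sketch of what a code-point-based range type in such a fork might look like follows; CodepointRange is a hypothetical name, not an existing regex-syntax type.

```rust
/// Hypothetical class range whose endpoints are arbitrary code points
/// (0..=0x10FFFF), so surrogates such as U+D842 are representable.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct CodepointRange {
    start: u32,
    end: u32,
}

impl CodepointRange {
    /// Build a range with `start <= end`; returns `None` if either endpoint
    /// is beyond the Unicode code space.
    fn new(a: u32, b: u32) -> Option<Self> {
        if a.max(b) > 0x10FFFF {
            return None;
        }
        Some(Self { start: a.min(b), end: a.max(b) })
    }

    /// True if the range overlaps the surrogate block U+D800..=U+DFFF.
    fn contains_surrogate(&self) -> bool {
        self.start <= 0xDFFF && self.end >= 0xD800
    }
}
```

By comparison, char::from_u32(0xD842) returns None, so a char-based range cannot even name the endpoint.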

A possible alternative I considered was to make the changes to the AST, but before translating to HIR, expand all of the Unicode codepoint ranges into an equivalent HIR (an alternation of concatenations of regex_syntax::hir::ClassBytes). But this is quite tricky because you'll lose all of the HIR's infrastructure for dealing with Unicode character classes (of which there is a considerable amount). The upside of this is that you would get an HIR out of it, without any changes to the HIR's definition, and then everything downstream should "just work." But I think my suggested route above is probably easier.

It's quite possible that this is a lot more work than I let on.

It seems like you might already know this, but there are considerable differences between ECMAScript regexes and this crate. The Unicode surrogate codepoint handling is just one of them. Before spending a bunch of time trying to paper over this particular incompatibility with regex, make sure you can stomach the rest of them.

2 replies

aapoalas Aug 21, 2025
Author

Thank you; this is great insight and I really appreciate it! Regarding the differences mentioned:

> It seems like you might already know this, but there are considerable differences between ECMAScript regexes and this crate. The Unicode surrogate codepoint handling is just one of them. Before spending a bunch of time trying to paper over this particular incompatibility with regex, make sure you can stomach the rest of them.

I am aware that it is different, but do you happen to have a clearer view of the differences beyond codepoint handling and lookbehind/lookahead/lookaround support? I know you don't support back-references and are unlikely to do so. I've also noted that the Unicode group support is a little different; some groups in regex have wider support than the ECMAScript tests expect. (I think so anyway; this might also just be a u/v mode difference that of course doesn't have an equivalent in regex.) Beyond that, you support some features that ECMAScript doesn't, like [[:foo:]] groups and set operations on character classes, but otherwise it all seems pretty equivalent.

BurntSushi Aug 21, 2025
Maintainer

It would be expensive for me to carefully enumerate all of the differences, particularly since I am not an ECMAScript expert.

I would look into handling of empty matches, particularly in an iterative context.
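On the ECMAScript side, the step taken after an empty match during iteration is governed by the spec's AdvanceStringIndex operation. The following is a sketch of it over UTF-16 code units, written from the spec's description; how it would map onto WTF-8 byte offsets in an engine is left open here.

```rust
/// Sketch of ECMAScript's AdvanceStringIndex: the next index to try after an
/// empty match at `index`. Without the u or v flag this is always one code
/// unit; with it, a well-formed surrogate pair is skipped as a whole.
fn advance_string_index(units: &[u16], index: usize, unicode: bool) -> usize {
    if !unicode || index + 1 >= units.len() {
        return index + 1;
    }
    let is_pair = (0xD800..=0xDBFF).contains(&units[index])
        && (0xDC00..=0xDFFF).contains(&units[index + 1]);
    if is_pair {
        index + 2
    } else {
        index + 1
    }
}
```

In non-Unicode mode this means an empty match can land between the two halves of a surrogate pair, a position that has no byte offset of its own in a WTF-8 haystack.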

Also the definition of \w, even when both are Unicode aware, may be different.

I would also look at how capture groups are handled, particularly when inside a repetition, and at groups with duplicate names.

And I wouldn't be surprised if there were a whole mess of differences in how Unicode groups are handled. IIRC, this crate allows permissive matching whereas ECMAScript does not. There are probably also differences in which groups are supported.

There are probably more differences.
