
Matching on WTF-8 strings and ECMAScript RegExp simulation #1279

Answered by BurntSushi
aapoalas asked this question in Q&A
Rather, I am trying to work out what it would mean to simulate WTF-8 matching (by rewriting patterns), or what it would take to support it fundamentally in the regex crate or in a fork of it.

Shooting from the hip here, my best guess is that you might need something like regex_syntax::utf8, but for WTF-8. Specifically, that module provides APIs for taking sequences of Unicode scalar values to a corresponding byte-based automaton (see the doc examples in that module). Notice that it takes scalar values. Since WTF-8 is specifically designed to encode unpaired surrogates, you'd need something that takes in all possible Unicode codepoints. Arguably, you could do this by copying that module …
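To make the scalar-value versus code-point distinction concrete, here is a minimal std-only sketch. It does not use regex_syntax at all; the names encode_wtf8 and is_surrogate_sequence are hypothetical helpers written for illustration. It assumes WTF-8 encodes a lone surrogate with the ordinary three-byte UTF-8 bit layout, which yields byte sequences of the shape [ED][A0-BF][80-BF] that strict UTF-8 rejects. It deliberately ignores the WTF-8 rule that a paired high and low surrogate must be stored as one four-byte sequence.

```rust
/// Hypothetical helper: encode any code point in 0..=0x10FFFF with the
/// UTF-8 bit layout, without rejecting surrogates (U+D800..=U+DFFF).
/// Returns None for values above U+10FFFF.
fn encode_wtf8(cp: u32) -> Option<Vec<u8>> {
    match cp {
        0x0000..=0x007F => Some(vec![cp as u8]),
        0x0080..=0x07FF => Some(vec![
            0xC0 | (cp >> 6) as u8,
            0x80 | (cp & 0x3F) as u8,
        ]),
        // This arm covers the surrogate block too; a strict UTF-8
        // encoder would refuse U+D800..=U+DFFF here.
        0x0800..=0xFFFF => Some(vec![
            0xE0 | (cp >> 12) as u8,
            0x80 | ((cp >> 6) & 0x3F) as u8,
            0x80 | (cp & 0x3F) as u8,
        ]),
        0x10000..=0x10FFFF => Some(vec![
            0xF0 | (cp >> 18) as u8,
            0x80 | ((cp >> 12) & 0x3F) as u8,
            0x80 | ((cp >> 6) & 0x3F) as u8,
            0x80 | (cp & 0x3F) as u8,
        ]),
        _ => None,
    }
}

/// Hypothetical helper: true if `bytes` is exactly one encoded surrogate,
/// i.e. it matches the byte-range sequence [ED][A0-BF][80-BF].
fn is_surrogate_sequence(bytes: &[u8]) -> bool {
    matches!(bytes, [0xED, 0xA0..=0xBF, 0x80..=0xBF])
}

fn main() {
    // A lone high surrogate is encodable here, but it is not valid UTF-8.
    let lone = encode_wtf8(0xD800).unwrap();
    assert_eq!(lone, [0xED, 0xA0, 0x80]);
    assert!(is_surrogate_sequence(&lone));
    assert!(std::str::from_utf8(&lone).is_err());

    // For a real scalar value the result agrees with std's UTF-8 encoding.
    let e_acute = encode_wtf8('é' as u32).unwrap();
    assert_eq!(e_acute, "é".as_bytes());
    assert!(!is_surrogate_sequence(&e_acute));
}
```

Under these assumptions, a WTF-8 variant of the range-to-byte-sequence conversion would need to emit the extra [ED][A0-BF][80-BF] alternative whenever the requested code point range overlaps the surrogate block.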

Answer selected by BurntSushi