Is the index in Quickwit language independent? #1388
HeenaBansal2009 asked this question in Q&A · Answered by fmassot
Hi @fmassot,
Thanks.
Answered by fmassot on May 9, 2022
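Currently, you can specify 2 tokenizers:

- `raw` tokenizer that does nothing
- `default` tokenizer that does the following: split on whitespace and punctuation (everything that is not alphanumeric), remove long tokens (> 40 bytes), lowercase each token.

The code of the `SimpleTokenizer` used:

```rust
impl<'a> SimpleTokenStream<'a> {
    // search for the end of the current token.
    fn search_token_end(&mut self) -> usize {
        (&mut self.chars)
            .filter(|&(_, ref c)| !c.is_alphanumeric())
            .map(|(offset, _)| offset)
            .next()
            .unwrap_or(self.text.len())
    }
}
```

In tantivy you have access to more tokenizers: ngram, stemming for Latin languages, third-party support for Japanese, Chinese...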
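To make the language question more concrete, here is a rough sketch of the three steps of the `default` tokenizer (split on anything that is not alphanumeric, drop tokens over 40 bytes, lowercase) using only the Rust standard library. `default_like_tokens` and `MAX_TOKEN_BYTES` are hypothetical names for illustration, not Quickwit or tantivy APIs, and the exact behaviour at the 40-byte boundary is an assumption based on the "> 40 bytes" wording.

```rust
/// Hypothetical helper, NOT part of Quickwit or tantivy: it only mimics the
/// steps described for the `default` tokenizer, using the standard library.
fn default_like_tokens(text: &str) -> Vec<String> {
    // Assumption: tokens strictly longer than 40 bytes are dropped.
    const MAX_TOKEN_BYTES: usize = 40;

    text
        // Split on every char that is not alphanumeric. `char::is_alphanumeric`
        // follows Unicode, so accented letters stay inside a token.
        .split(|c: char| !c.is_alphanumeric())
        // Consecutive separators produce empty pieces; skip them.
        .filter(|tok| !tok.is_empty())
        // `str::len` counts bytes, not characters.
        .filter(|tok| tok.len() <= MAX_TOKEN_BYTES)
        // Lowercase each remaining token.
        .map(|tok| tok.to_lowercase())
        .collect()
}
```

With this sketch, `default_like_tokens("Hello, World! Très bien.")` gives `["hello", "world", "très", "bien"]`, so text in scripts that separate words with spaces or punctuation is split as expected. A Japanese or Chinese sentence written without spaces comes back as a single token (or is dropped entirely once it exceeds 40 bytes), which is where a dedicated tokenizer becomes useful.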
I'm not sure I understand the query you want to run. Can you give a concrete example?
Answer selected by fmassot