Description
Currently a couple of the APIs talk in tokens, which is inconvenient. It would be nice if you could translate text into tokens and vice versa easily.
The rust_tokenizers crate has a function called from_file that allows instantiating the GPT-2 tokenizer given a couple of pretrained tokenizer files (a minimal sketch follows the list below). These files are available from Hugging Face's website here:
- vocab: https://huggingface.co/gpt2/resolve/main/vocab.json
- merges: https://huggingface.co/gpt2/resolve/main/merges.txt
- tokenizer: https://huggingface.co/gpt2/resolve/main/tokenizer.json
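
A rough sketch of what the round trip could look like with rust_tokenizers, assuming the vocab and merges files above have already been downloaded; the file paths and the 128-token cap are placeholders, not part of any proposed API:

```rust
use rust_tokenizers::tokenizer::{Gpt2Tokenizer, Tokenizer, TruncationStrategy};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Paths are assumed to point at the downloaded vocab.json and merges.txt.
    let tokenizer = Gpt2Tokenizer::from_file("vocab.json", "merges.txt", false)?;

    // Text -> token ids.
    let encoded = tokenizer.encode(
        "Hello world",
        None,
        128, // max length; arbitrary for this sketch
        &TruncationStrategy::LongestFirst,
        0,
    );
    println!("token ids: {:?}", encoded.token_ids);

    // Token ids -> text.
    let text = tokenizer.decode(&encoded.token_ids, true, true);
    println!("round-tripped: {text}");
    Ok(())
}
```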
There is also an example in rust_bert of constructing a GPT-2 tokenizer. Ideally the tokenizer would be built lazily so users of the library don't need to pay for it unless they need the feature.
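
One way the lazy construction could look, using std::sync::OnceLock from the standard library so the files are only read on first use (paths are again placeholders, and error handling is reduced to a panic for brevity):

```rust
use std::sync::OnceLock;
use rust_tokenizers::tokenizer::Gpt2Tokenizer;

// Built on first call, then shared; callers that never touch token
// translation never pay the cost of loading the vocab/merges files.
fn gpt2_tokenizer() -> &'static Gpt2Tokenizer {
    static TOKENIZER: OnceLock<Gpt2Tokenizer> = OnceLock::new();
    TOKENIZER.get_or_init(|| {
        Gpt2Tokenizer::from_file("vocab.json", "merges.txt", false)
            .expect("failed to load pretrained tokenizer files")
    })
}
```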
Where to use it
It looks like this will be most useful with the logit_bias feature, since the API requires you to send token numbers rather than actual strings. Since the example code is in Python, this is a bit of a barrier for Rust users.
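
For illustration, a hypothetical helper that turns a string into the string-keyed {token id -> bias} map the OpenAI logit_bias parameter expects; the function name and the 128-token cap are made up for this sketch:

```rust
use std::collections::HashMap;
use rust_tokenizers::tokenizer::{Gpt2Tokenizer, Tokenizer, TruncationStrategy};

/// Hypothetical helper: map every token id in `text` to the same bias,
/// in the string-keyed form the logit_bias parameter expects.
fn logit_bias_for(tokenizer: &Gpt2Tokenizer, text: &str, bias: i32) -> HashMap<String, i32> {
    let encoded = tokenizer.encode(text, None, 128, &TruncationStrategy::LongestFirst, 0);
    encoded
        .token_ids
        .iter()
        .map(|id| (id.to_string(), bias))
        .collect()
}
```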