Fix convert_tekken_tokenizer #42592
Conversation
```python
if special_token in all_special:
    tokenizer.add_special_tokens({special_key: special_token})
```
This is for backward and forward compatibility.
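The guarded registration in the diff above can be illustrated with a minimal, self-contained sketch. Note this is a stand-in, not the real transformers API: `FakeTokenizer`, `register_special_tokens`, and the `candidates` mapping are hypothetical names; only the `if special_token in all_special:` guard mirrors the actual change.

```python
class FakeTokenizer:
    """Stand-in that records which special tokens get registered."""

    def __init__(self):
        self.registered = {}

    def add_special_tokens(self, mapping):
        self.registered.update(mapping)


def register_special_tokens(tokenizer, candidates, all_special):
    # Only register tokens the converted vocabulary actually contains,
    # so conversion keeps working across older and newer tokenizer files.
    for special_key, special_token in candidates.items():
        if special_token in all_special:
            tokenizer.add_special_tokens({special_key: special_token})


tok = FakeTokenizer()
register_special_tokens(
    tok,
    {"bos_token": "<s>", "eos_token": "</s>", "pad_token": "<pad>"},
    all_special={"<s>", "</s>"},  # "<pad>" is absent, so it is skipped
)
print(tok.registered)  # {'bos_token': '<s>', 'eos_token': '</s>'}
```

The membership check is what makes this safe both ways: tokens missing from an older vocabulary are silently skipped instead of raising, and new tokens are picked up automatically once present.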
cc @itazap
[For maintainers] Suggested jobs to run (before merge): run-slow: ministral3, mistral3

run-slow: ministral3, mistral3

This comment contains models: ["models/ministral3", "models/mistral3"]

CI Results: ✅ No failing test specific to this PR 🎉
Hey! Thanks for the PR! Can you please share a short reproducer of the problem you mentioned (in chat templates)? Perhaps we'll need to add a test!
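A real reproducer would need a converted tekken tokenizer file, but the failure mode can be sketched in isolation: a chat template that references `eos_token` breaks when the converted tokenizer never registered it. Everything here (`apply_chat_template` as a `str.format` wrapper, the `special_tokens_map` dicts) is an illustrative stand-in, not the transformers API.

```python
def apply_chat_template(template, message, special_tokens_map):
    # Mimics template rendering: every placeholder in the template must
    # resolve to a token registered in the special-tokens mapping.
    return template.format(msg=message, **special_tokens_map)


template = "{bos_token}{msg}{eos_token}"

# After the fix: bos/eos were explicitly added, rendering succeeds.
fixed = {"bos_token": "<s>", "eos_token": "</s>"}
print(apply_chat_template(template, "hi", fixed))  # <s>hi</s>

# Before the fix: eos_token was never registered, rendering fails.
broken = {"bos_token": "<s>"}
try:
    apply_chat_template(template, "hi", broken)
except KeyError as err:
    print(f"template failed: missing {err}")  # template failed: missing 'eos_token'
```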
What does this PR do?

Right now `convert_tekken_tokenizer` does not add `bos_token` and `eos_token` to the special tokens via the `add_special_tokens` method. This prevents chat templates that expect `eos_token` and `bos_token` from working properly.

Previously this worked because saving the tokenizer created a `special_tokens_map.json`, which is no longer the case. I don't know why, but I'd assume this is due to the V5 refactoring?

This PR fixes that by explicitly adding these tokens to the tokenizer; when saving, they are now stored in `tokenizer_config.json`.

Before submitting
- Did you read the contributor guideline, Pull Request section?
- Was this discussed/approved via a GitHub issue or the forum? Please add a link to it if that's the case.
- Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
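The observable outcome described in the PR body can be sketched with a toy "save" step. This is illustrative only: transformers' real serialization is much more involved, and `save_tokenizer_config` plus its arguments are hypothetical names; the point is just that explicitly added special tokens end up inside `tokenizer_config.json` rather than a separate `special_tokens_map.json`.

```python
import json
import tempfile
from pathlib import Path


def save_tokenizer_config(directory, added_special_tokens):
    # With the fix, tokens registered via add_special_tokens are part of
    # the tokenizer state and get serialized into tokenizer_config.json.
    config = {"tokenizer_class": "PreTrainedTokenizerFast"}
    config.update(added_special_tokens)
    path = Path(directory) / "tokenizer_config.json"
    path.write_text(json.dumps(config, indent=2))
    return path


with tempfile.TemporaryDirectory() as tmp:
    path = save_tokenizer_config(tmp, {"bos_token": "<s>", "eos_token": "</s>"})
    saved = json.loads(path.read_text())
    print(saved["bos_token"], saved["eos_token"])  # <s> </s>
```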
Who can review?
@ArthurZucker