fix tokenizer for JetBrain Mellum #15045
Conversation
Did you also follow the instructions from convert_hf_to_gguf_update.py and run test-tokenizer-0 on the generated vocab file?
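For reference, the verification flow being asked about usually looks something like the sketch below. The directory and vocab filenames are illustrative guesses for this model; the authoritative steps are documented in the header of convert_hf_to_gguf_update.py itself.

```shell
# 1. Add the new model to the models list in convert_hf_to_gguf_update.py,
#    then re-run it so it downloads the tokenizer data and updates the
#    pre-tokenizer handling in convert_hf_to_gguf.py.
python convert_hf_to_gguf_update.py <huggingface_token>

# 2. Generate a vocab-only GGUF for the new tokenizer
#    (paths and output filename here are illustrative).
python convert_hf_to_gguf.py models/tokenizers/mellum/ \
    --outfile models/ggml-vocab-mellum.gguf --vocab-only

# 3. Run the tokenizer test against the generated vocab file.
./build/bin/test-tokenizer-0 models/ggml-vocab-mellum.gguf
```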
Nope, I still have to do that! Thanks for the reminder!
@CISC
Wrong file. :)
When you ran
I know it's already too many questions on this PR, sorry about that. Is there a readme on how to introduce a "new" tokenizer?
No problem. :) So, as mentioned, the first time you ran
Sorry if I wasn't clear, delete those files again, you should not commit them, it's just so you can test locally.
@CISC thanks for all the help and support, I've finally figured it out.
I might have made a mistake in the order of steps, but this seems to me the right order. Anyway, the tokenizer model and inp/out files are generated and the test passed. Thanks for the patience :) Re-converted JetBrains.Mellum-4b-base.f16.gguf - passes & generates. Is there anything else?
Sorry about that, files removed
It doesn't really matter as long as you don't accidentally delete the files or miss the instruction about generating the vocab file. :) The normal order is as follows:
That's it, will merge when CIs are done. :)
The llama-based (LlamaForCausalLM) JetBrains/Mellum-4b-base uses
"tokenizer_class": "GPT2Tokenizer"
so I've added support for it. Conversion originally failed on the master branch with:
The model successfully converted, quantized, and produces reasonable output.
This is my first time doing this kind of change, so let me know if there's anything else I have to do.
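For context on what `"tokenizer_class": "GPT2Tokenizer"` implies: GPT-2-style tokenizers are byte-level BPE, so every raw byte is first remapped to a printable unicode symbol before merges are applied. The helper below reproduces the well-known GPT-2 byte-to-unicode table for illustration; it is not code from this PR.

```python
# Standard GPT-2 byte-level BPE byte-to-unicode table (the classic GPT-2
# encoder helper), shown here only to illustrate what GPT2Tokenizer support
# has to handle.
def bytes_to_unicode() -> dict[int, str]:
    # Printable bytes map to themselves; all remaining bytes are shifted
    # into an unused unicode range so every byte gets a visible,
    # reversible symbol.
    bs = (list(range(ord("!"), ord("~") + 1))
          + list(range(ord("\u00a1"), ord("\u00ac") + 1))
          + list(range(ord("\u00ae"), ord("\u00ff") + 1)))
    cs = bs[:]
    n = 0
    for b in range(256):
        if b not in bs:
            bs.append(b)
            cs.append(256 + n)
            n += 1
    return dict(zip(bs, (chr(c) for c in cs)))

# Example: the space byte (0x20) becomes "Ġ", which is why GPT-2-style
# vocab files are full of "Ġ"-prefixed tokens.
mapping = bytes_to_unicode()
print(mapping[ord(" ")])  # → Ġ
```

This mapping covers all 256 byte values, which is why byte-level BPE never produces out-of-vocabulary failures on arbitrary input.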