
fix tokenizer for JetBrains Mellum #15045


Merged
merged 4 commits into ggml-org:master on Aug 3, 2025

Conversation

@csabakecskemeti (Contributor) commented Aug 2, 2025

The llama-based (LlamaForCausalLM) JetBrains/Mellum-4b-base uses "tokenizer_class": "GPT2Tokenizer", so I've added support for it.

Conversion originally failed on the master branch with:

WARNING:hf-to-gguf:**************************************************************************************
WARNING:hf-to-gguf:** WARNING: The BPE pre-tokenizer was not recognized!
WARNING:hf-to-gguf:**          There are 2 possible reasons for this:
WARNING:hf-to-gguf:**          - the model has not been added to convert_hf_to_gguf_update.py yet
WARNING:hf-to-gguf:**          - the pre-tokenization config has changed upstream
WARNING:hf-to-gguf:**          Check your model files and convert_hf_to_gguf_update.py and update them accordingly.
WARNING:hf-to-gguf:** ref:     https://github.com/ggml-org/llama.cpp/pull/6920
WARNING:hf-to-gguf:**
WARNING:hf-to-gguf:** chkhsh:  a1e163ecab2e718a4c829d1148b6e86824ec36163bb71941c3dca9cd5ac25756
WARNING:hf-to-gguf:**************************************************************************************
WARNING:hf-to-gguf:

Traceback (most recent call last):
  File "/Users/csabakecskemeti/Documents/workspace/llama.cpp/convert_hf_to_gguf.py", line 1928, in set_vocab
    self._set_vocab_sentencepiece()
  File "/Users/csabakecskemeti/Documents/workspace/llama.cpp/convert_hf_to_gguf.py", line 945, in _set_vocab_sentencepiece
    tokens, scores, toktypes = self._create_vocab_sentencepiece()
  File "/Users/csabakecskemeti/Documents/workspace/llama.cpp/convert_hf_to_gguf.py", line 962, in _create_vocab_sentencepiece
    raise FileNotFoundError(f"File not found: {tokenizer_path}")
FileNotFoundError: File not found: /Users/csabakecskemeti/Documents/local_share/JetBrains.Mellum-4b-base/tokenizer.model
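
For anyone landing here with the same warning: the fix itself is small. convert_hf_to_gguf_update.py inserts a branch keyed on that chkhsh into get_vocab_base_pre() in convert_hf_to_gguf.py (the hash is a SHA-256 over the tokenization of a fixed test string, so it identifies the pre-tokenizer behavior). A sketch of what the generated entry looks like, using the chkhsh from the warning above; treat the exact form as illustrative:

```python
# Sketch of the entry convert_hf_to_gguf_update.py generates inside
# get_vocab_base_pre() in convert_hf_to_gguf.py. The chkhsh below is the
# one printed in the warning; "mellum" is the pre-tokenizer name used by
# the ggml-vocab-mellum.gguf vocab file later in this thread.
if chkhsh == "a1e163ecab2e718a4c829d1148b6e86824ec36163bb71941c3dca9cd5ac25756":
    # ref: https://huggingface.co/JetBrains/Mellum-4b-base
    res = "mellum"
```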

With that in place, the model converted and quantized successfully, and produces reasonable output:

./build/bin/llama-simple -m /Users/csabakecskemeti/Documents/local_share/JetBrains.Mellum-4b-base-GGUF/JetBrains.Mellum-4b-base.Q4_K_M.gguf -n 2048 "write a python hello world"
...
llama_context: graph splits = 573 (with bs=64), 363 (with bs=1)
write a python hello world program

```python
#!/usr/bin/env python

print("Hello World")
```
...

This is my first time doing this kind of change; let me know if there's anything else I have to do.

@github-actions bot added the python (python script changes) label on Aug 2, 2025
@CISC (Collaborator) left a comment


Did you also follow the instructions from convert_hf_to_gguf_update.py and run test-tokenizer-0 on the generated vocab file?

@csabakecskemeti (Contributor, Author)

> Did you also follow the instructions from convert_hf_to_gguf_update.py and run test-tokenizer-0 on the generated vocab file?

Nope, I have to do that! Thanks for the reminder!

@csabakecskemeti (Contributor, Author)

@CISC
Done
./build/bin/test-tokenizer-0 ./models/ggml-vocab-llama-bpe.gguf
...
Tests passed

@CISC (Collaborator) commented Aug 3, 2025

> @CISC Done ./build/bin/test-tokenizer-0 ./models/ggml-vocab-llama-bpe.gguf ... Tests passed

Wrong file. :)

@CISC (Collaborator) commented Aug 3, 2025

When you ran convert_hf_to_gguf_update.py for the first time it updated convert_hf_to_gguf.py and told you how to create the vocab files for mellum (it won't tell you any more unless you revert convert_hf_to_gguf.py).

@csabakecskemeti (Contributor, Author)

I know it's already too many questions on this PR, sorry about that.
So I've generated the model: ./build/bin/test-tokenizer-0 ./models/ggml-vocab-mellum.gguf
but how do we get the .inp and .out files for the test? The input seems the same, so I've just copied it from ggml-vocab-llama-bpe.gguf.inp, but what's the way to get the output? Should I use (copy) ggml-vocab-gpt-2.gguf.out, since the model uses GPT2Tokenizer?

Is there a readme on how to introduce a "new" tokenizer?

@CISC (Collaborator) commented Aug 3, 2025

> I know it's already too many questions on this PR, sorry about that. So I've generated the model: ./build/bin/test-tokenizer-0 ./models/ggml-vocab-mellum.gguf but how do we get the .inp and .out files for the test? The input seems the same, so I've just copied it from ggml-vocab-llama-bpe.gguf.inp, but what's the way to get the output? Should I use (copy) ggml-vocab-gpt-2.gguf.out, since the model uses GPT2Tokenizer?

No problem. :)

So, as mentioned, the first time you ran convert_hf_to_gguf_update.py it did a few things, one of them was creating the .inp/.out files. If you don't have them any more you need to revert convert_hf_to_gguf.py and rerun it.
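
For context on what those files contain, a sketch, assuming the layout the update script writes (the "__ggml_vocab_test__" separator and the space-separated token IDs are my reading of that script, so verify against your generated files): the .inp file holds the test prompts separated by a marker line, and each line of the .out file holds the expected token IDs for the corresponding prompt, which is what test-tokenizer-0 compares against.

```python
# Minimal sketch: pair each test prompt in the .inp file with its expected
# token IDs in the .out file. Assumes prompts are separated by a
# "__ggml_vocab_test__" marker line and that each .out line is the
# space-separated token IDs for one prompt.
with open("models/ggml-vocab-mellum.gguf.inp", encoding="utf-8") as f:
    prompts = f.read().split("\n__ggml_vocab_test__\n")
with open("models/ggml-vocab-mellum.gguf.out", encoding="utf-8") as f:
    expected = [line.split() for line in f]  # token IDs, one line per prompt

for prompt, ids in zip(prompts, expected):
    print(repr(prompt[:40]), "->", ids[:8])
```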

@CISC (Collaborator) commented Aug 3, 2025

Sorry if I wasn't clear: delete those files again. You should not commit them; they're just so you can test locally.

@csabakecskemeti (Contributor, Author)

@CISC thanks for all the help and support, finally I've figured it out.
It was a little bit of a catch-22:

  • to be able to generate the test inputs and outputs for the model I needed the tokenizer model gguf.
  • to have the tokenizer model I had to run convert_hf_to_gguf_update.py; this generated the model but also added the entry to convert_hf_to_gguf.py -> because of this the update won't generate the test data :)
  • after this I had to remove the entry from convert_hf_to_gguf.py and rerun convert_hf_to_gguf_update.py
    -> ./build/bin/test-tokenizer-0 ./models/ggml-vocab-mellum.gguf Passed

I might have made a mistake in the order of steps, but this seems to me the right order.

Anyway, the tokenizer model and .inp/.out files are generated and the test passed. Thanks for the patience :)

Re-converted JetBrains.Mellum-4b-base.f16.gguf - passes & generates
Quantization works - generates

Is there anything else?

@csabakecskemeti (Contributor, Author)

Sorry about that, files removed

@CISC (Collaborator) commented Aug 3, 2025

> I might have made a mistake in the order of steps, but this seems to me the right order.

It doesn't really matter as long as you don't accidentally delete the files or miss the instruction about generating the vocab file. :)

The normal order is as follows:

  • Add entry in convert_hf_to_gguf_update.py (sketched below)
  • Run convert_hf_to_gguf_update.py
  • Create ggml-vocab-*.gguf
  • Run test-tokenizer-0 on it
  • Profit :)
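
Concretely, step 1 is one line in the models list of convert_hf_to_gguf_update.py, and step 3 is a --vocab-only conversion. A sketch under those assumptions; the field names follow the existing entries in that script, and the paths are illustrative:

```python
# Step 1 (sketch): entry added to the `models` list in
# convert_hf_to_gguf_update.py, following the pattern of the existing entries.
models = [
    # ... existing entries ...
    {"name": "mellum", "tokt": TOKENIZER_TYPE.BPE,
     "repo": "https://huggingface.co/JetBrains/Mellum-4b-base"},
]

# Step 3 (sketch): create the vocab-only GGUF that test-tokenizer-0 consumes,
# e.g.:
#   python convert_hf_to_gguf.py <path-to-Mellum-4b-base> \
#       --outfile models/ggml-vocab-mellum.gguf --vocab-only
```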

> Is there anything else?

That's it, will merge when CIs are done. :)

@CISC CISC merged commit 97366dc into ggml-org:master Aug 3, 2025
49 of 51 checks passed
Nexesenex pushed a commit to Nexesenex/croco.cpp that referenced this pull request Aug 5, 2025