
fix tokenizer for JetBrains Mellum #15045


Merged
merged 4 commits into ggml-org:master on Aug 3, 2025

Conversation

@csabakecskemeti (Contributor) commented Aug 2, 2025

The llama-based (LlamaForCausalLM) JetBrains/Mellum-4b-base uses "tokenizer_class": "GPT2Tokenizer", so I've added support for it.

Conversion originally failed on the master branch with:

WARNING:hf-to-gguf:**************************************************************************************
WARNING:hf-to-gguf:** WARNING: The BPE pre-tokenizer was not recognized!
WARNING:hf-to-gguf:**          There are 2 possible reasons for this:
WARNING:hf-to-gguf:**          - the model has not been added to convert_hf_to_gguf_update.py yet
WARNING:hf-to-gguf:**          - the pre-tokenization config has changed upstream
WARNING:hf-to-gguf:**          Check your model files and convert_hf_to_gguf_update.py and update them accordingly.
WARNING:hf-to-gguf:** ref:     https://github.com/ggml-org/llama.cpp/pull/6920
WARNING:hf-to-gguf:**
WARNING:hf-to-gguf:** chkhsh:  a1e163ecab2e718a4c829d1148b6e86824ec36163bb71941c3dca9cd5ac25756
WARNING:hf-to-gguf:**************************************************************************************
WARNING:hf-to-gguf:

Traceback (most recent call last):
  File "/Users/csabakecskemeti/Documents/workspace/llama.cpp/convert_hf_to_gguf.py", line 1928, in set_vocab
    self._set_vocab_sentencepiece()
  File "/Users/csabakecskemeti/Documents/workspace/llama.cpp/convert_hf_to_gguf.py", line 945, in _set_vocab_sentencepiece
    tokens, scores, toktypes = self._create_vocab_sentencepiece()
  File "/Users/csabakecskemeti/Documents/workspace/llama.cpp/convert_hf_to_gguf.py", line 962, in _create_vocab_sentencepiece
    raise FileNotFoundError(f"File not found: {tokenizer_path}")
FileNotFoundError: File not found: /Users/csabakecskemeti/Documents/local_share/JetBrains.Mellum-4b-base/tokenizer.model
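
For anyone landing here with the same warning: the fix itself is small. convert_hf_to_gguf_update.py inserts a branch keyed on that chkhsh into get_vocab_base_pre() in convert_hf_to_gguf.py (the hash is a SHA-256 over the tokenization of a fixed test string, so it identifies the pre-tokenizer behavior). A sketch of what the generated entry looks like, using the chkhsh from the warning above; treat the exact form as illustrative:

```python
# Sketch of the entry convert_hf_to_gguf_update.py generates inside
# get_vocab_base_pre() in convert_hf_to_gguf.py. The chkhsh below is the
# one printed in the warning; "mellum" is the pre-tokenizer name used by
# the ggml-vocab-mellum.gguf vocab file later in this thread.
if chkhsh == "a1e163ecab2e718a4c829d1148b6e86824ec36163bb71941c3dca9cd5ac25756":
    # ref: https://huggingface.co/JetBrains/Mellum-4b-base
    res = "mellum"
```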

With that in place, the model converted and quantized successfully, and produces reasonable output:

./build/bin/llama-simple -m /Users/csabakecskemeti/Documents/local_share/JetBrains.Mellum-4b-base-GGUF/JetBrains.Mellum-4b-base.Q4_K_M.gguf -n 2048 "write a python hello world"
...
llama_context: graph splits = 573 (with bs=64), 363 (with bs=1)
write a python hello world program

```python
#!/usr/bin/env python

print("Hello World")
```
...

This is my first time doing this kind of change; let me know if there's anything else I have to do.

@github-actions bot added the python (python script changes) label on Aug 2, 2025
@CISC (Collaborator) left a comment


Did you also follow the instructions from convert_hf_to_gguf_update.py and run test-tokenizer-0 on the generated vocab file?

@csabakecskemeti (Contributor, Author)

> Did you also follow the instructions from convert_hf_to_gguf_update.py and run test-tokenizer-0 on the generated vocab file?

Nope, I have to do that! Thanks for the reminder!

@csabakecskemeti (Contributor, Author)

@CISC
Done
./build/bin/test-tokenizer-0 ./models/ggml-vocab-llama-bpe.gguf
...
Tests passed

@CISC (Collaborator) commented Aug 3, 2025

> @CISC Done ./build/bin/test-tokenizer-0 ./models/ggml-vocab-llama-bpe.gguf ... Tests passed

Wrong file. :)

@CISC (Collaborator) commented Aug 3, 2025

When you ran convert_hf_to_gguf_update.py for the first time it updated convert_hf_to_gguf.py and told you how to create the vocab files for mellum (it won't tell you any more unless you revert convert_hf_to_gguf.py).

@csabakecskemeti (Contributor, Author)

I know it's already too many questions on this PR, sorry about that.
So I've generated the model: ./build/bin/test-tokenizer-0 ./models/ggml-vocab-mellum.gguf
but how do we get the .inp and .out files for the test? The input seems the same, so I've just copied it from ggml-vocab-llama-bpe.gguf.inp, but what's the way to get the output? Should I use (copy) ggml-vocab-gpt-2.gguf.out, since the model uses GPT2Tokenizer?

Is there a readme on how to introduce a "new" tokenizer?

@CISC (Collaborator) commented Aug 3, 2025

> I know it's already too many questions on this PR, sorry about that. So I've generated the model: ./build/bin/test-tokenizer-0 ./models/ggml-vocab-mellum.gguf but how do we get the .inp and .out files for the test? The input seems the same, so I've just copied it from ggml-vocab-llama-bpe.gguf.inp, but what's the way to get the output? Should I use (copy) ggml-vocab-gpt-2.gguf.out, since the model uses GPT2Tokenizer?

No problem. :)

So, as mentioned, the first time you ran convert_hf_to_gguf_update.py it did a few things, one of them was creating the .inp/.out files. If you don't have them any more you need to revert convert_hf_to_gguf.py and rerun it.
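
For context on what those files contain, a sketch, assuming the layout the update script writes (the "__ggml_vocab_test__" separator and the space-separated token IDs are my reading of that script, so verify against your generated files): the .inp file holds the test prompts separated by a marker line, and each line of the .out file holds the expected token IDs for the corresponding prompt, which is what test-tokenizer-0 compares against.

```python
# Minimal sketch: pair each test prompt in the .inp file with its expected
# token IDs in the .out file. Assumes prompts are separated by a
# "__ggml_vocab_test__" marker line and that each .out line is the
# space-separated token IDs for one prompt.
with open("models/ggml-vocab-mellum.gguf.inp", encoding="utf-8") as f:
    prompts = f.read().split("\n__ggml_vocab_test__\n")
with open("models/ggml-vocab-mellum.gguf.out", encoding="utf-8") as f:
    expected = [line.split() for line in f]  # token IDs, one line per prompt

for prompt, ids in zip(prompts, expected):
    print(repr(prompt[:40]), "->", ids[:8])
```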

@CISC (Collaborator) commented Aug 3, 2025

Sorry if I wasn't clear: delete those files again. You should not commit them; they're just so you can test locally.

@csabakecskemeti (Contributor, Author)

@CISC thanks for all the help and support, finally I've figured it out.
It was a little bit of a catch-22:

  • to be able to generate the test inputs and outputs for the model I needed the tokenizer model gguf.
  • to have the tokenizer model I had to run convert_hf_to_gguf_update.py; this generated the model but also added the entry to convert_hf_to_gguf.py -> because of this the update won't generate the test data :)
  • after this I had to remove the entry from convert_hf_to_gguf.py and rerun convert_hf_to_gguf_update.py
    -> ./build/bin/test-tokenizer-0 ./models/ggml-vocab-mellum.gguf Passed

I might have made a mistake in the order of steps, but this seems to me the right order.

Anyway, the tokenizer model and .inp/.out files are generated and the test passed. Thanks for the patience :)

Re-converted JetBrains.Mellum-4b-base.f16.gguf - passes & generates
Quantization works - generates

Is there anything else?

@csabakecskemeti (Contributor, Author)

Sorry about that, files removed

@CISC (Collaborator) commented Aug 3, 2025

> I might have made a mistake in the order of steps, but this seems to me the right order.

It doesn't really matter as long as you don't accidentally delete the files or miss the instruction about generating the vocab file. :)

The normal order is as follows:

  • Add entry in convert_hf_to_gguf_update.py (sketched below)
  • Run convert_hf_to_gguf_update.py
  • Create ggml-vocab-*.gguf
  • Run test-tokenizer-0 on it
  • Profit :)
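
Concretely, step 1 is one line in the models list of convert_hf_to_gguf_update.py, and step 3 is a --vocab-only conversion. A sketch under those assumptions; the field names follow the existing entries in that script, and the paths are illustrative:

```python
# Step 1 (sketch): entry added to the `models` list in
# convert_hf_to_gguf_update.py, following the pattern of the existing entries.
models = [
    # ... existing entries ...
    {"name": "mellum", "tokt": TOKENIZER_TYPE.BPE,
     "repo": "https://huggingface.co/JetBrains/Mellum-4b-base"},
]

# Step 3 (sketch): create the vocab-only GGUF that test-tokenizer-0 consumes,
# e.g.:
#   python convert_hf_to_gguf.py <path-to-Mellum-4b-base> \
#       --outfile models/ggml-vocab-mellum.gguf --vocab-only
```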

> Is there anything else?

That's it, will merge when CIs are done. :)

@CISC CISC merged commit 97366dc into ggml-org:master Aug 3, 2025
49 of 51 checks passed
Nexesenex pushed a commit to Nexesenex/croco.cpp that referenced this pull request Aug 5, 2025