Conversation

@omkar-334 commented Dec 9, 2025

Description of the change

Added presets for Devstral Small 1.1

Reference

GitHub issue: #2333
HF model: https://huggingface.co/mistralai/Devstral-Small-2507
The Devstral HF repo contains only tekken.json, but we need the tokenizer.json format. There are two solutions for this:

  1. Use the conversion script from huggingface/transformers - this converts the tekken.json and loads it with AutoTokenizer.
  2. Use the tokenizer from mistralai/Mistral-Small-24B-Base-2501, referencing this issue from Unsloth. Since Devstral is just a finetune, we can use the earlier model to obtain the tokenizer (yes, that model ships a tokenizer.json).

I've gone ahead with Option 2 and implemented it in the code, updating the presets in mistral_presets.py, convert_mistral.py, and convert_mistral_checkpoints.py. A minimal sketch of the substitution follows.
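For reviewers, here is a self-contained sketch of the Option 2 substitution. The `TOKENIZER_FALLBACKS` dict and `load_hf_tokenizer` helper are illustrative names, not the exact code in the PR:

```python
from transformers import AutoTokenizer

# Devstral-Small-2507 ships only tekken.json, which AutoTokenizer cannot load
# directly, so we borrow the tokenizer from the base model it was finetuned from.
TOKENIZER_FALLBACKS = {
    "devstral_small_1_1": "mistralai/Mistral-Small-24B-Base-2501",
}


def load_hf_tokenizer(preset, hf_preset):
    # Fall back to the compatible base model's repo when the preset's own
    # repo lacks a tokenizer.json.
    repo = TOKENIZER_FALLBACKS.get(preset, hf_preset)
    return AutoTokenizer.from_pretrained(repo)
```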

Colab Notebook

I could not load it in Colab since the model is 24B and the runtime keeps crashing, but I will try it on Modal/Lambda and attach the results here.

Doubts

  1. This is the Kaggle link for the model - https://www.kaggle.com/models/mistral-ai/devstral-small-2507 - but I'm unsure of the kaggle_handle format.
  2. cc @sachinprasadhs, let me know if there are any changes to make.

@gemini-code-assist (Contributor) commented

Summary of Changes

Hello @omkar-334, I'm Gemini Code Assist[^1]! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request integrates the Devstral Small 1.1 model into the system by adding its configuration and adapting the tokenizer conversion process. The primary challenge addressed was the absence of a tokenizer.json file for the Devstral model, which was resolved by leveraging the tokenizer from a compatible Mistral base model. This ensures the new model can be properly utilized within the existing framework.

Highlights

  • New Model Integration: Presets for the Devstral Small 1.1 model have been added; it is a 24B-parameter model finetuned from Mistral-Small-3.1.
  • Tokenizer Resolution: A specific solution was implemented to use the tokenizer from "mistralai/Mistral-Small-24B-Base-2501" for Devstral Small 1.1, addressing the issue where the original Devstral HF model lacked a tokenizer.json file.
  • Configuration Updates: Relevant configuration files, including mistral_presets.py, convert_mistral.py, and convert_mistral_checkpoints.py, were updated to support the new model and its unique tokenizer loading requirements.

Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page; here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

| Feature | Command | Description |
| --- | --- | --- |
| Code Review | `/gemini review` | Performs a code review for the current pull request in its current state. |
| Pull Request Summary | `/gemini summary` | Provides a summary of the current pull request in its current state. |
| Comment | `@gemini-code-assist` | Responds in comments when explicitly tagged, both in pull request comments and review comments. |
| Help | `/gemini help` | Displays a list of available commands. |

Customization

To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here.

You can also get AI-powered code generation, chat, and code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

[^1]: Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double-check it and use code with caution.

@gemini-code-assist (Contributor) left a comment

Code Review

This pull request adds presets for the Devstral Small 1.1 model. The changes correctly handle the tokenizer issue by using the tokenizer from a compatible base model. However, there's a minor typo in the model description within the preset file. Additionally, the code in convert_mistral.py and convert_mistral_checkpoints.py for handling the special case of the 'devstral' model can be improved by using a more robust check and avoiding hardcoded strings to enhance maintainability and readability. I've provided suggestions to address these points.

omkar-334 and others added 4 commits December 9, 2025 17:54
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
@sachinprasadhs sachinprasadhs self-requested a review December 9, 2025 19:04
@sachinprasadhs (Collaborator) left a comment

Thanks for the PR. Please attach screenshots showing numerics matching, parameter count, tokenizer matching, and output matching.

Comment on lines 224 to 230

```python
if preset == "devstral_small_1_1":
    hf_tokenizer = AutoTokenizer.from_pretrained(
        "mistralai/Mistral-Small-24B-Base-2501"
    )
else:
    hf_tokenizer = AutoTokenizer.from_pretrained(hf_preset)
```
@sachinprasadhs (Collaborator):

Can't we use tekken.json, since they have mentioned "Tokenizer: Utilizes a Tekken tokenizer with a 131k vocabulary size"?

@omkar-334 (Author):

We would need to add a dependency on https://github.com/mistralai/mistral-common, since transformers' AutoTokenizer does not support tekken.json.
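A hedged sketch of what that dependency would look like, assuming mistral-common's `MistralTokenizer.from_file` API (the path is illustrative, and the inner attribute chain is my reading of the library, so it may differ between versions):

```python
from mistral_common.tokens.tokenizers.mistral import MistralTokenizer

# Load the Tekken tokenizer directly from the file Devstral ships.
tok = MistralTokenizer.from_file("tekken.json")

# The inner Tekken tokenizer exposes plain-text encoding.
ids = tok.instruct_tokenizer.tokenizer.encode(
    "def hello():", bos=True, eos=False
)
```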

@sachinprasadhs (Collaborator):

Got it. They have mentioned that going forward they will only use tekken.json. What is the difference between the base model's tokenizer.json and Devstral's tekken.json?

As I observed, they also included tokenizer.json in today's release of the Devstral 2 model.

@sachinprasadhs (Collaborator):

If, looking at Mistral's other models, they also only ship tekken.json like this model, then we can think about adding the dependency.

@omkar-334 (Author) commented Dec 10, 2025:

> Got it. They have mentioned that going forward they will only use tekken.json. What is the difference between the base model's tokenizer.json and Devstral's tekken.json?
>
> As I observed, they also included tokenizer.json in today's release of the Devstral 2 model.

I think they are including tokenizer.json so that people can continue using it until frameworks support tekken.json.

This is the current state of their tokenizer formats for newer models:

  1. mistralai/Devstral-Small-2507 - tekken.json (Add Devstral Small 1.1 #2333)
  2. mistralai/Devstral-Small-2-24B-Instruct-2512 - tekken.json, tokenizer.json
  3. mistralai/Mistral-Small-24B-Base-2501 - tekken.json, tokenizer.json
  4. mistralai/Mistral-Small-3.1-24B-Base-2503 - tekken.json, tokenizer.json (Add Mistral-Small-3.1 #2334)
  5. mistralai/Ministral-3-8B-Base-2512 - tekken.json, tokenizer.json
  6. mistralai/Magistral-Small-2509 - tekken.json (Add Magistral to Keras-Hub #2314)
  7. mistralai/Voxtral-Mini-3B-2507 - tekken.json (Add Voxtral #2349)

Older models:

  1. All of the Mistral and Mixtral models implemented in keras-hub include tokenizer.model and tokenizer.json.
  2. Hence, the keras-hub implementation loads the tokenizer using the tokenizer.model file format.

My earlier changes do not work, since we don't use the tokenizer.json format. Going forward, we need to use tekken.json.

transformers has started supporting the Tekken tokenizer, using mistral-common as the backend for Mistral models (https://github.com/huggingface/transformers/blob/471d7ce9abbb3bc1b3bab673367378f9dbc3caac/src/transformers/tokenization_mistral_common.py).
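If that file is what it looks like, loading Devstral's tekken.json through transformers would reduce to something like the following. The `tokenizer_type="mistral"` kwarg is my understanding of how the mistral-common backend is selected, so treat it as an assumption:

```python
from transformers import AutoTokenizer

# Routes to transformers' mistral-common-backed tokenizer, which reads
# tekken.json directly (assumed kwarg; check the transformers docs).
tokenizer = AutoTokenizer.from_pretrained(
    "mistralai/Devstral-Small-2507", tokenizer_type="mistral"
)
ids = tokenizer("def hello():").input_ids
```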

@sachinprasadhs (Collaborator):

Great, thanks for putting all the details here. If AutoTokenizer supports tekken.json loading, we can handle it with an if/else condition.

@omkar-334 (Author):

Yes, we can do that for hf_tokenizer, but how do we support it in Keras-Hub? Should we write a NewMistralTokenizer for tekken.json-based models?

@sachinprasadhs (Collaborator):

I think we can add the underlying TikTokenizer implementation to Keras Hub here (https://github.com/keras-team/keras-hub/tree/master/keras_hub/src/tokenizers), unless NewMistralTokenizer is different from the TikTokenizer.
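For concreteness, here is a sketch of what normalizing a tekken.json into a `tiktoken.Encoding` could look like, based on my reading of the tekken.json layout in mistral-common. The field names and the special-token placeholders are assumptions, not a verified keras-hub implementation:

```python
import base64
import json

import tiktoken


def tekken_to_encoding(path):
    # Sketch only: field names follow the tekken.json layout as I understand
    # it from mistral-common; real files may also carry more vocab entries
    # than the configured vocab size, which a real implementation would trim.
    with open(path) as f:
        data = json.load(f)
    config = data["config"]
    num_special = config["default_num_special_tokens"]
    # Tekken stores byte sequences base64-encoded; token ids are offset by
    # the block of reserved special-token ids at the front of the vocab.
    ranks = {
        base64.b64decode(item["token_bytes"]): item["rank"] + num_special
        for item in data["vocab"]
    }
    # Placeholder names; a real implementation would read the special-token
    # strings from the file instead.
    specials = {f"<special_{i}>": i for i in range(num_special)}
    return tiktoken.Encoding(
        name="tekken",
        pat_str=config["pattern"],
        mergeable_ranks=ranks,
        special_tokens=specials,
    )
```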

@sachinprasadhs (Collaborator):

And an option to use the base TikTokenizer in the Mistral model here: https://github.com/keras-team/keras-hub/blob/master/keras_hub/src/models/mistral/mistral_tokenizer.py.
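That is, something along the lines of this dispatch. This is purely hypothetical: `TiktokenTokenizer` is the proposed new class (not an existing import), and keras-hub's actual preset-loading hooks will differ:

```python
from keras_hub.models import MistralTokenizer


def load_mistral_tokenizer(asset_paths):
    # Hypothetical helper: prefer tekken.json when the preset ships one,
    # else fall back to the SentencePiece tokenizer.model that older
    # Mistral presets use today.
    if "tekken.json" in asset_paths:
        return TiktokenTokenizer(proto=asset_paths["tekken.json"])
    return MistralTokenizer(proto=asset_paths["tokenizer.model"])
```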

@omkar-334 (Author):

Okay, I'll work on that and update this. Until then, I'll mark this PR as a draft. Thanks!

@omkar-334 omkar-334 marked this pull request as draft December 11, 2025 05:03
@omkar-334 (Author):

/gemini review

@gemini-code-assist (Contributor) left a comment

Code Review

This pull request adds presets for the Devstral Small 1.1 model. A significant part of the change is the introduction of a new TiktokenTokenizer to handle tekken.json based tokenizers. However, there seems to be a contradiction between the implementation and the stated goal. The checkpoint conversion script is configured to use a tokenizer.model from a different model (mistralai/Mistral-Small-24B-Base-2501) for devstral_small_1_1, which means the newly added TiktokenTokenizer and related logic for handling tekken.json will not be used for this preset. Could you please clarify if the TiktokenTokenizer is intended for future use, or if the conversion script should be updated to use it?

Beyond this high-level point, I've found a few issues, including a critical bug in the tokenizer conversion logic and several style guide violations in the new tiktoken_tokenizer.py file. Please see the detailed comments below.

Comment on lines +277 to +278

```python
# undo this in final commit
if tf is None:
```
@gemini-code-assist (Contributor):

critical

This change appears to be a temporary workaround, as indicated by the comment # undo this in final commit. This must be reverted before merging to ensure all required dependencies are checked correctly.

Suggested change:

```diff
-# undo this in final commit
-if tf is None:
+if tf_text is None or tf is None:
```

Comment on lines 64 to 71

```python
    """
    Tekken-based tokenizer for Mistral models.

    Responsibilities:
    • Add required Mistral special tokens (<s>, </s>, pad)
    • Delegate tekken.json parsing to TiktokenTokenizer
    • Use Tiktoken backend via TiktokenTokenizer normalisation
    """
```
@gemini-code-assist (Contributor):

medium

The docstring for NewMistralTokenizer is missing the Args and Examples sections, which are required by the style guide (lines 168-184, 367-371). Please add them to improve documentation and usability.

References
  1. Docstrings should include comprehensive examples and document all parameters. (link)

Comment on lines +135 to +150

```python
    """
    Format-agnostic tiktoken tokenizer with Tekken support.

    The tokenizer can consume:
    • A normalized proto dict with keys:
        {
            "pattern": str,
            "mergeable_ranks": Dict[base64|bytes → rank],
            "special_tokens": list[{token_str, rank}] (optional),
            "special_lookup": Dict[token_str → rank] (optional),
            "num_special_tokens": int,
            "vocab_size": int,
            "raw_json": original JSON (optional),
        }
    • A Tekken JSON (path/bytes/dict). It will be normalized internally.
    """
```
@gemini-code-assist (Contributor):

medium

The docstring for TiktokenTokenizer is missing the Args and Examples sections, which are required by the style guide (lines 529-530). Please add them to document the initializer parameters and show usage.

References
  1. Document thoroughly including call behavior through a class level docstring. Include usage examples. (link)
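For instance, an Examples section along these lines would satisfy the guide. The usage is inferred from the proto dict documented above and not verified against the PR code:

```python
# Hypothetical usage, inferred from the docstring's proto description.
tokenizer = TiktokenTokenizer(proto="tekken.json")
token_ids = tokenizer("def add(a, b): return a + b")
text = tokenizer.detokenize(token_ids)
```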
