Improve Mistral models integration with llama.cpp #14737
Conversation
Thanks for the contribution. From a developer perspective, it looks like a good approach to avoid any potential tokenization / formatting problems. In general, for all models, using a reference tokenizer instead of relying on chat templates seems preferable. My understanding is that most chat template problems occur during the early days of a model release and tend to get polished and fixed over time, so this approach would be a stable alternative during such periods of instability.
IIRC Mistral's architecture also makes use of sliding window attention (SWA), defaulting to a window size of 4096 tokens, though I don't know all the details (like which layers, if any, are full layers). It would be great if the window size could be stored in the GGUF file as well (e.g. as a dedicated metadata key).
Hey guys, many apologies for the delayed answer, and thanks a lot for your feedback.
Exactly. What's cool with llama.cpp is that you support passing Jinja templates when serving, so people can use them once they are correct and drop the mistral-common server if they want. Very nice feature.
This is actually for super old (for deep learning ^^) models, so we didn't add support for that. Could it be a subsequent PR? Regarding the PR:
Happy to answer more questions :)
Partially, there's also a
@juliendenize Please undo all the formatting/style changes, they are not relevant and add too much noise to the PR, will review afterwards. :)
Arf, would it be OK to bump the Pydantic requirement on your side, or is that a no? Was there a particular reason to stay at 2.6?
Done, sorry about that, my own formatter was on. Is there a formatter or linter available for Python? I didn't find one in the contributing guidelines.
Yes, I think it's ok, it's probably just the version that was available at the time.
We don't use a Python formatter, only
Pillow conflict, should be fine to update:
Right, now we are getting somewhere. :) Edit: The unbound errors are clearly handled at init and can be silenced by
@juliendenize do you also plan to make changes to
Tried to make things cleaner, sorry for the back and forth.
That would be cool indeed. I didn't personally work on Voxtral (the model), so I might need some assistance as I lack experience with audio models. Is Voxtral already supported by llama.cpp? I assumed not, for now.
Yeah not for now, but I was trying to add support and ran into issues converting to GGUF. But that should be easy to add after this PR is merged, so don't worry about it for now :) |
OK, so this: https://github.com/ggml-org/llama.cpp/actions/runs/16500995835/job/46660394829?pr=14737 is actually expected, because we didn't merge the corresponding PR in mistral-common yet. We're in the process of merging it. I'm just adding a final feature, which is being able to call
Ok, ping me when you're ready.
I've just had a deeper look into this PR. One concern though: most of the code inside the new conversion script overlaps with what is already in convert_hf_to_gguf.py. Just thinking, maybe it's better to bring those parts right into convert_hf_to_gguf.py. Btw, I'm also working on converting Voxtral to GGUF. I thought that would be simple, but I'm currently stuck at the tokenizer. Trying a quick hack to copy some code from this PR.. will see if it works.
Ok, so as demoed in #14862, I think it might be better to merge everything into convert_hf_to_gguf.py.
Sounds good to me. |
Description
This PR aims to enhance the integration of Mistral models with llama.cpp by addressing several key issues and introducing new features. Here are the details:
Context
Using mistral-common with llama.cpp
We recommend that users only use the `llama-server` tool with the `/completions` route of the server for now, as it is the only one that supports token input. We also advise users to set `return_tokens=True` in their requests to let `mistral-common` handle detokenization.
Added features
We have added a script to convert Mistral models to GGUF directly from Hugging Face. This script is located at `convert_mistral_to_gguf.py` and can be used to convert Mistral models to GGUF format.
We registered the Mistral architecture in llama.cpp to support Mistral models natively. This allows users to use Mistral models with llama.cpp without having to convert them to the Hugging Face format first.
Known Limitations:
Our approach does not support multimodality.
Also, this approach requires users to only use the llama.cpp server with the `/completions` route.
Example Code
To get started, install mistral-common using the following command:
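A minimal sketch of the install step; the exact version to pin is left open, but a recent `mistral-common` release is assumed:

```sh
# Install mistral-common from PyPI (a recent release is assumed to provide the
# tokenization/detokenization features used below).
pip install --upgrade mistral-common
```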
(Optional) Convert the model
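As an illustration, assuming the new `convert_mistral_to_gguf.py` script mirrors the flags of the existing `convert_hf_to_gguf.py` (`--outfile`, `--outtype`); the model path and output names are placeholders:

```sh
# Convert a locally downloaded Mistral checkpoint to GGUF.
# The positional argument and flags are assumed to match convert_hf_to_gguf.py.
python convert_mistral_to_gguf.py /path/to/mistral-model \
    --outfile /path/to/mistral-model.gguf \
    --outtype bf16
```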
Launch the mistral-common and llama.cpp servers
Launch the mistral-common server:
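The exact entry point depends on how `mistral-common` exposes its tokenization server; the command below is purely illustrative, and the `serve` subcommand, model path, and port are hypothetical placeholders:

```sh
# Hypothetical invocation of the mistral-common tokenization server; check the
# mistral-common documentation for the actual entry point and arguments.
mistral_common serve /path/to/mistral-model --port 6000
```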
Launch the llama.cpp server:
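Then start `llama-server` on the converted GGUF; the model path and port are placeholders:

```sh
# Serve the converted model. Only the /completions route is meant to be used,
# since tokenization and detokenization are delegated to mistral-common.
llama-server -m /path/to/mistral-model.gguf --port 8080
```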
Use the servers
Here is a code snippet demonstrating how to use the new features:
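A minimal sketch of the token round trip, using the `mistral-common` Python API for tokenization and the llama.cpp `/completions` route for generation; the tokenizer version, host/port, and sampling parameters are assumptions to adapt to your setup:

```python
import requests

from mistral_common.protocol.instruct.messages import UserMessage
from mistral_common.protocol.instruct.request import ChatCompletionRequest
from mistral_common.tokens.tokenizers.mistral import MistralTokenizer

# Tokenize the chat request with the reference tokenizer (v3 is an assumption;
# use the tokenizer matching your model).
tokenizer = MistralTokenizer.v3()
tokenized = tokenizer.encode_chat_completion(
    ChatCompletionRequest(messages=[UserMessage(content="Explain GGUF in one sentence.")])
)

# Send raw token IDs to the llama.cpp server's /completions route and request
# tokens back so that mistral-common can handle detokenization.
response = requests.post(
    "http://localhost:8080/completions",
    json={
        "prompt": tokenized.tokens,  # token IDs, not text
        "n_predict": 128,
        "return_tokens": True,
    },
)
response.raise_for_status()
result = response.json()

# Detokenize with mistral-common; fall back to the server-rendered text if no
# tokens were returned.
generated_tokens = result.get("tokens")
text = tokenizer.decode(generated_tokens) if generated_tokens else result["content"]
print(text)
```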
Feedback and Contributions
We believe these changes will significantly improve the integration of Mistral models with llama.cpp and provide a better experience for our users. We welcome any feedback or suggestions to further enhance this integration. Also, as we have little experience with the llama.cpp codebase, we welcome any help to improve the integration and make sure we respect the codebase and the community.