Fix correct handling of chat multimodal inputs in TransformersMM class #1728
base: main
Conversation
Force-pushed from `5a69228` to `26c2bd9`
TL;DR: Added support for multimodal model chat templates.
Major Changes
Feedback
Thanks @RobinPicard @rlouf for the support!
Thanks a lot for the PR! On top of the tiny comments, something I think we should consider is accepting a list of strings and assets as the value of the `content` key, in addition to a list of dicts. The keys expected in those dicts seem mostly standardized, so we could create the dicts for the user and hand them to the tokenizer's chat templating method.
So the user would provide, for instance:

```python
{
    "role": "user",
    "content": ["What's on this image?", Image],
},
```

And we would turn it for them into:

```python
{
    "role": "user",
    "content": [
        {"type": "text", "text": "What's on this image?"},
        {"type": "image", "image": Image},
    ],
},
```

What do you think? (Providing the expected list of dicts directly would still be supported.)
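To make the suggestion concrete, here is a rough sketch of the normalization step described above. The `Image` and `Audio` classes below are stand-ins for outlines' actual asset types, and `normalize_content` is a hypothetical helper name; this illustrates the idea, not the library's implementation.

```python
class Image:
    """Stand-in for an image asset type (assumption, not the real API)."""

class Audio:
    """Stand-in for an audio asset type (assumption, not the real API)."""

# Map asset classes to the "type" value used in chat template dicts.
ASSET_TYPES = {Image: "image", Audio: "audio"}

def normalize_content(content):
    """Expand strings and assets into the typed dicts chat templates expect."""
    if isinstance(content, str):
        return [{"type": "text", "text": content}]
    normalized = []
    for item in content:
        if isinstance(item, dict):
            # Already in chat template format; pass through unchanged.
            normalized.append(item)
        elif isinstance(item, str):
            normalized.append({"type": "text", "text": item})
        else:
            for cls, name in ASSET_TYPES.items():
                if isinstance(item, cls):
                    normalized.append({"type": name, name: item})
                    break
            else:
                raise TypeError(f"Unsupported content item: {item!r}")
    return normalized
```

With this in place, `["What's on this image?", Image]` and the explicit list of dicts both normalize to the same structure before being handed to the tokenizer.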
PS: you need to rebase your branch
Thanks for the feedback! You are right, I missed the input option for a list with a text prompt and assets. Following your suggestion, I have changed the processing of the Chat object based on content type and made some other improvements spotted during tests.
Changes
What do you think?
Force-pushed from `3808408` to `5be207b`
…dal model and update docs for the usage of Chat in TransformersMultiModal
… version is now > 1.2.0
…pdate tests to verify new functionality for multimodal chat
…types and raise errors for unsupported content types in TransformersMultiModalTypeAdapter
…and Batching examples
…sformers multimodal type adapter
…ssets and the mm chat template format
…ing asset and type keys
…cenario, and added Audio and Video tests
Force-pushed from `5be207b` to `11336dd`
Description
This PR improves the handling of multimodal chat inputs for TransformersMultiModal models. It addresses the challenge of correctly applying chat templates to text while integrating diverse media assets (images, audio, video).
Changes
- `outlines/models/transformers.py`: updated `format_chat_input` to explicitly separate multimodal assets (images, audio, video) from text messages within chat inputs; `Chat` messages are passed to `self.tokenizer.apply_chat_template` to generate text prompts with the correct placeholders for multimodal content
- `docs/features/models/transformers_multimodal.md`: updated the documentation for `Chat` usage with `TransformersMultiModal`

Call for Feedback
Looking for feedback on the docs, and for testers to try the multimodal Chat input handling with different models and media!
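The asset/text separation described in the Changes section can be sketched roughly as follows. This is a simplified illustration of the approach, not the actual `format_chat_input` code: `split_assets` is a hypothetical helper, and the message format follows the typed dicts shown earlier in the thread.

```python
def split_assets(messages):
    """Return (template_messages, assets) for a multimodal chat.

    Media items are collected per type for the processor, while the
    messages keep a bare {"type": ...} marker so the chat template can
    insert the model-specific placeholder token at the right position.
    """
    assets = {"image": [], "audio": [], "video": []}
    template_messages = []
    for message in messages:
        content = message["content"]
        if isinstance(content, str):
            # Plain-text message: nothing to extract.
            template_messages.append(message)
            continue
        parts = []
        for item in content:
            kind = item["type"]
            if kind == "text":
                parts.append(item)
            elif kind in assets:
                assets[kind].append(item[kind])
                parts.append({"type": kind})  # placeholder marker only
            else:
                raise ValueError(f"Unsupported content type: {kind}")
        template_messages.append({"role": message["role"], "content": parts})
    return template_messages, assets
```

The stripped messages would then go through `tokenizer.apply_chat_template` to produce the text prompt, while the collected assets are handed to the processor alongside it.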