
Conversation

laitifranz
Contributor

Description

This PR improves the handling of multimodal chat inputs for TransformersMultiModal models. It addresses the challenge of correctly applying chat templates to text while integrating diverse media assets (images, audio, video).

Changes

  • outlines/models/transformers.py:
    • Refactored format_chat_input to explicitly separate multimodal assets (images, audio, video) from text messages within chat inputs;
    • Ensured that the complete Chat messages are passed to self.tokenizer.apply_chat_template to generate text prompts with correct placeholders for multimodal content (a sketch of the idea follows this list);
  • docs/features/models/transformers_multimodal.md:
    • Updated documentation to reflect the new multimodal chat input handling capabilities;
    • Added a new usage example demonstrating how to utilize these features.
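As background for reviewers, here is a minimal sketch of the approach described above, assuming HF's `apply_chat_template` keyword arguments and the standard `{"type": ..., "<type>": ...}` content-dict layout. This is an illustration of the idea, not the PR's actual code, and the helper name is made up:

```python
# Hedged sketch: pull media assets out of a chat input, then let the
# processor's chat template insert the model-specific placeholders.
def split_chat_input(messages, processor):
    assets = []
    for message in messages:
        content = message["content"]
        if not isinstance(content, list):
            continue
        for item in content:
            # Asset items look like {"type": "image", "image": <PIL.Image>}.
            if isinstance(item, dict) and item.get("type") != "text":
                assets.append(item[item["type"]])
    # The full message structure goes through the chat template so that
    # placeholders (e.g. <image>) land in the right spots in the prompt.
    prompt = processor.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True
    )
    return prompt, assets
```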

Call for Feedback

Looking for feedback on the docs, and for testers to try the Chat multimodal input handling with different models and media!

@laitifranz laitifranz force-pushed the multimodal-input-handling-via-chat-interface branch from 5a69228 to 26c2bd9 Compare August 23, 2025 16:11
@laitifranz
Contributor Author

TL;DR

Added support for multimodal chat templates to the @format_input.register(Chat) function in HF Transformers MultiModal.

Major Changes

  • outlines/models/transformers.py:
    • Removed the deprecated @format_input.register(dict) function and updated the related test, since Outlines is now past version 1.2.0;
    • In @format_input.register(Chat), the input is now processed by collecting assets and applying the processor's chat template, which sets the placeholders for assets correctly instead of doing it manually.
  • outlines/inputs.py:
    • Updated the Chat class docstring to include information about the new multimodal chat format, currently supported only by HF Transformers multimodal models.
  • docs/features/models/transformers_multimodal.md:
    • Updated the guide with Chat and Batching examples (a usage sketch in the spirit of those docs follows this list).
  • tests/models/test_transformers_multimodal* and tests/test_inputs.py:
    • Updated the chat prompt to the new format;
    • Added tests for unsupported content and invalid input types.
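To make the TL;DR concrete, here is a hedged usage sketch in the spirit of the updated docs. The import paths and the from_transformers pairing follow the Outlines v1 API as I understand it from this PR, and the model is the one the author reports testing with; treat the details as assumptions rather than the merged code:

```python
import outlines
from outlines.inputs import Chat, Image  # import path assumed from this PR
from PIL import Image as PILImage
from transformers import AutoModelForImageTextToText, AutoProcessor

MODEL_ID = "Qwen/Qwen2.5-VL-7B-Instruct"  # the model the author tested with

model = outlines.from_transformers(
    AutoModelForImageTextToText.from_pretrained(MODEL_ID),
    AutoProcessor.from_pretrained(MODEL_ID),
)

# Content can now be a list of typed dicts; the chat template places the
# image placeholder for us instead of requiring manual tags in the text.
chat = Chat([
    {"role": "system", "content": "You are a helpful assistant."},
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "What's on this image?"},
            {"type": "image", "image": Image(PILImage.open("photo.png"))},
        ],
    },
])

print(model(chat, max_new_tokens=64))
```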

Feedback

  • All checks have successfully passed;
  • I have personally tested this in my projects for a week using Qwen/Qwen2.5-VL-7B-Instruct with complex chat messages;
  • Let me know if any changes should be made or if I am missing something in my implementation. I tried to keep the implementation as clean as possible to avoid backward incompatibility. I know that multimodal chat templates are currently exclusive to HF Transformers multimodal models, but they greatly simplify the management of asset tags and of multiple assets concatenated within a complex chat structure.

Thanks @RobinPicard @rlouf for the support!

@laitifranz laitifranz marked this pull request as ready for review August 24, 2025 08:31
Contributor

@RobinPicard RobinPicard left a comment


Thanks a lot for the PR! On top of the tiny comments, something I think we should consider is accepting a list of strings + assets as the value of the content key, in addition to a list of dicts. The keys expected in those dicts seem mostly standardized, so we could create the dicts for the user and hand them to the tokenizer's chat templating method.

So the user would provide for instance:

```python
{
    "role": "user",
    "content": ["What's on this image?", Image],
},
```

And we would turn it for them into:

```python
{
    "role": "user",
    "content": [
        {"type": "text", "text": "What's on this image?"},
        {"type": "image", "image": Image},
    ],
},
```

What do you think? (providing directly the expected list of dicts would still be supported)
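A minimal sketch of that normalization, assuming the asset classes can be matched by type (the class names come from outlines/inputs.py as discussed in this thread; the import path and the helper name are assumptions):

```python
from outlines.inputs import Audio, Image, Video  # import path assumed

def normalize_content(content):
    """Turn ["What's on this image?", Image(...)] into typed content dicts."""
    normalized = []
    for item in content:
        if isinstance(item, dict):
            normalized.append(item)  # already in the expected dict format
        elif isinstance(item, str):
            normalized.append({"type": "text", "text": item})
        elif isinstance(item, Image):
            normalized.append({"type": "image", "image": item})
        elif isinstance(item, Audio):
            normalized.append({"type": "audio", "audio": item})
        elif isinstance(item, Video):
            normalized.append({"type": "video", "video": item})
        else:
            raise TypeError(f"Unsupported content item: {type(item)}")
    return normalized
```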

PS: you need to rebase your branch

@laitifranz
Contributor Author

laitifranz commented Aug 27, 2025

Thanks for the feedback!

You are right. I missed the input option for a list with text prompt and assets. Following your suggestion, I have changed the processing of the Chat object based on content type and made some other improvements spotted during tests.

Changes

  • Put back the tests for list+assets, while adding tests for the multimodal chat template formats and for raised errors;
  • Migrated from AutoTokenizer to AutoProcessor in test_transformers_multimodal_type_adapter.py to match the MM scenario and to test Image, Audio, and Video assets;
  • Added an example for list+assets in the docs;
  • Addressed your comments;
  • test_transformers_multimodal.py: needed to switch from urllib.request to requests with a User-Agent header to fetch remote images (a sketch of the workaround follows this list), because of this error:
    • ERROR test_transformers_multimodal.py::test_transformers_multimodal_simple - urllib.error.HTTPError: HTTP Error 403: Forbidden. I think it is something related to bot-prevention mechanisms.
  • A problem I found, and about which I think we cannot do much because it is inherent to Python dicts, is when a user inputs a content item like {"type": "image", "image": Image(get_image(IMAGE_URLS[0])), "image": Image(get_image(IMAGE_URLS[1]))}: the repeated image key is silently overwritten, which cannot be caught by the tests (see the demonstration after this list);
  • I should have successfully rebased the branch;
  • This PR should close the issue Improve the handling of inputs for TransformersMultiModal #1688.
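On the 403 point, here is a sketch of the kind of workaround described; the header value is a placeholder, not necessarily what the test suite uses (the name get_image mirrors the helper mentioned above):

```python
from io import BytesIO

import requests
from PIL import Image as PILImage

def get_image(url):
    # urllib's default client was rejected with HTTP 403; sending an explicit
    # User-Agent header usually gets past such bot-prevention checks.
    headers = {"User-Agent": "Mozilla/5.0"}
    response = requests.get(url, headers=headers, timeout=30)
    response.raise_for_status()
    return PILImage.open(BytesIO(response.content))
```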
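And a plain-Python demonstration of the duplicate-key pitfall from the list above: a dict literal silently keeps only the last value for a repeated key, so no downstream validation can detect the dropped asset.

```python
content_item = {
    "type": "image",
    "image": "first.png",   # silently discarded...
    "image": "second.png",  # ...because the last occurrence of the key wins
}
print(content_item)  # {'type': 'image', 'image': 'second.png'}
```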

What do you think?

@laitifranz laitifranz force-pushed the multimodal-input-handling-via-chat-interface branch from 3808408 to 5be207b Compare August 30, 2025 09:31
…dal model and update docs for the usage of Chat in TransformersMultiModal
…pdate tests to verify new functionality for multimodal chat
…types and raise errors for unsupported content types in TransformersMultiModalTypeAdapter
@laitifranz laitifranz force-pushed the multimodal-input-handling-via-chat-interface branch from 5be207b to 11336dd Compare September 12, 2025 16:22