
Conversation

laitifranz
Contributor

Description

This PR improves the handling of multimodal chat inputs for TransformersMultiModal models. It addresses the challenge of correctly applying chat templates to text while integrating diverse media assets (images, audio, video).

Changes

  • outlines/models/transformers.py:
    • Refactored format_chat_input to explicitly separate multimodal assets (images, audio, video) from text messages within chat inputs;
    • Ensured that the complete Chat messages are passed to self.tokenizer.apply_chat_template to generate text prompts with correct placeholders for multimodal content (a sketch of the idea follows this list);
  • docs/features/models/transformers_multimodal.md:
    • Updated documentation to reflect the new multimodal chat input handling capabilities;
    • Added a new usage example demonstrating how to utilize these features.
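As background for reviewers, here is a minimal sketch of the approach described above, assuming HF's `apply_chat_template` keyword arguments and the standard `{"type": ..., "<type>": ...}` content-dict layout. This is an illustration of the idea, not the PR's actual code, and the helper name is made up:

```python
# Hedged sketch: pull media assets out of a chat input, then let the
# processor's chat template insert the model-specific placeholders.
def split_chat_input(messages, processor):
    assets = []
    for message in messages:
        content = message["content"]
        if not isinstance(content, list):
            continue
        for item in content:
            # Asset items look like {"type": "image", "image": <PIL.Image>}.
            if isinstance(item, dict) and item.get("type") != "text":
                assets.append(item[item["type"]])
    # The full message structure goes through the chat template so that
    # placeholders (e.g. <image>) land in the right spots in the prompt.
    prompt = processor.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True
    )
    return prompt, assets
```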

Call for Feedback

Looking for feedback on the docs, and for testers to try the Chat multimodal input handling with different models and media!

@laitifranz laitifranz force-pushed the multimodal-input-handling-via-chat-interface branch from 5a69228 to 26c2bd9 Compare August 23, 2025 16:11
@laitifranz
Contributor Author

TL;DR

Added support for multimodal chat templates to the @format_input.register(Chat) function in HF Transformers MultiModal.

Major Changes

  • outlines/models/transformers.py:
    • Removed the deprecated @format_input.register(dict) function and updated the related test, since Outlines is now past version 1.2.0;
    • In @format_input.register(Chat), the input is now processed by collecting assets and applying the processor's chat template, which sets the placeholders for assets correctly instead of doing it manually.
  • outlines/inputs.py:
    • Updated the Chat class docstring to include information about the new multimodal chat format, currently supported only by HF Transformers multimodal models.
  • docs/features/models/transformers_multimodal.md:
    • Updated the guide with Chat and Batching examples (a usage sketch in the spirit of those docs follows this list).
  • tests/models/test_transformers_multimodal* and tests/test_inputs.py:
    • Updated the chat prompt to the new format;
    • Added tests for unsupported content and invalid input types.
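To make the TL;DR concrete, here is a hedged usage sketch in the spirit of the updated docs. The import paths and the from_transformers pairing follow the Outlines v1 API as I understand it from this PR, and the model is the one the author reports testing with; treat the details as assumptions rather than the merged code:

```python
import outlines
from outlines.inputs import Chat, Image  # import path assumed from this PR
from PIL import Image as PILImage
from transformers import AutoModelForImageTextToText, AutoProcessor

MODEL_ID = "Qwen/Qwen2.5-VL-7B-Instruct"  # the model the author tested with

model = outlines.from_transformers(
    AutoModelForImageTextToText.from_pretrained(MODEL_ID),
    AutoProcessor.from_pretrained(MODEL_ID),
)

# Content can now be a list of typed dicts; the chat template places the
# image placeholder for us instead of requiring manual tags in the text.
chat = Chat([
    {"role": "system", "content": "You are a helpful assistant."},
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "What's on this image?"},
            {"type": "image", "image": Image(PILImage.open("photo.png"))},
        ],
    },
])

print(model(chat, max_new_tokens=64))
```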

Feedback

  • All checks have successfully passed;
  • I have personally tested this in my projects for a week using Qwen/Qwen2.5-VL-7B-Instruct with complex chat messages;
  • Let me know if any changes should be made or if I am missing something in my implementation. I tried to keep the implementation as clean as possible to avoid backward incompatibility. I know that multimodal chat templates are currently exclusive to HF Transformers multimodal models, but they greatly simplify the management of asset tags and of multiple assets concatenated within a complex chat structure.

Thanks @RobinPicard @rlouf for the support!

@laitifranz laitifranz marked this pull request as ready for review August 24, 2025 08:31
Contributor

@RobinPicard RobinPicard left a comment


Thanks a lot for the PR! On top of the tiny comments, something I think we should consider is accepting a list of strings + assets as the value of the content key, in addition to a list of dicts. The keys expected in those dicts seem mostly standardized, so we could create the dicts for the user and hand them to the tokenizer's chat templating method.

So the user would provide for instance:

```python
{
    "role": "user",
    "content": ["What's on this image?", Image],
},
```

And we would turn it for them into:

```python
{
    "role": "user",
    "content": [
        {"type": "text", "text": "What's on this image?"},
        {"type": "image", "image": Image},
    ],
},
```

What do you think? (providing directly the expected list of dicts would still be supported)
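A minimal sketch of that normalization, assuming the asset classes can be matched by type (the class names come from outlines/inputs.py as discussed in this thread; the import path and the helper name are assumptions):

```python
from outlines.inputs import Audio, Image, Video  # import path assumed

def normalize_content(content):
    """Turn ["What's on this image?", Image(...)] into typed content dicts."""
    normalized = []
    for item in content:
        if isinstance(item, dict):
            normalized.append(item)  # already in the expected dict format
        elif isinstance(item, str):
            normalized.append({"type": "text", "text": item})
        elif isinstance(item, Image):
            normalized.append({"type": "image", "image": item})
        elif isinstance(item, Audio):
            normalized.append({"type": "audio", "audio": item})
        elif isinstance(item, Video):
            normalized.append({"type": "video", "video": item})
        else:
            raise TypeError(f"Unsupported content item: {type(item)}")
    return normalized
```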

PS: you need to rebase your branch

@laitifranz
Contributor Author

laitifranz commented Aug 27, 2025

Thanks for the feedback!

You are right. I missed the input option for a list with text prompt and assets. Following your suggestion, I have changed the processing of the Chat object based on content type and made some other improvements spotted during tests.

Changes

  • Put back the tests for list+assets, while adding tests for the multimodal chat template formats and for raised errors;
  • Migrated from AutoTokenizer to AutoProcessor in test_transformers_multimodal_type_adapter.py to match the MM scenario and to test Image, Audio, and Video assets;
  • Added an example for list+assets in the docs;
  • Addressed your comments;
  • test_transformers_multimodal.py: needed to switch from urllib.request to requests with a User-Agent header to fetch remote images (a sketch of the workaround follows this list), because of this error:
    • ERROR test_transformers_multimodal.py::test_transformers_multimodal_simple - urllib.error.HTTPError: HTTP Error 403: Forbidden. I think it is something related to bot-prevention mechanisms.
  • A problem I found, and about which I think we cannot do much because it is inherent to Python dicts, is when a user inputs a content item like {"type": "image", "image": Image(get_image(IMAGE_URLS[0])), "image": Image(get_image(IMAGE_URLS[1]))}: the repeated image key is silently overwritten, which cannot be caught by the tests (see the demonstration after this list);
  • I should have successfully rebased the branch;
  • This PR should close the issue Improve the handling of inputs for TransformersMultiModal #1688.
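On the 403 point, here is a sketch of the kind of workaround described; the header value is a placeholder, not necessarily what the test suite uses (the name get_image mirrors the helper mentioned above):

```python
from io import BytesIO

import requests
from PIL import Image as PILImage

def get_image(url):
    # urllib's default client was rejected with HTTP 403; sending an explicit
    # User-Agent header usually gets past such bot-prevention checks.
    headers = {"User-Agent": "Mozilla/5.0"}
    response = requests.get(url, headers=headers, timeout=30)
    response.raise_for_status()
    return PILImage.open(BytesIO(response.content))
```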
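And a plain-Python demonstration of the duplicate-key pitfall from the list above: a dict literal silently keeps only the last value for a repeated key, so no downstream validation can detect the dropped asset.

```python
content_item = {
    "type": "image",
    "image": "first.png",   # silently discarded...
    "image": "second.png",  # ...because the last occurrence of the key wins
}
print(content_item)  # {'type': 'image', 'image': 'second.png'}
```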

What do you think?

@laitifranz laitifranz force-pushed the multimodal-input-handling-via-chat-interface branch from 3808408 to 5be207b Compare August 30, 2025 09:31
…dal model and update docs for the usage of Chat in TransformersMultiModal
…pdate tests to verify new functionality for multimodal chat
…types and raise errors for unsupported content types in TransformersMultiModalTypeAdapter
@laitifranz laitifranz force-pushed the multimodal-input-handling-via-chat-interface branch from 5be207b to 11336dd Compare September 12, 2025 16:22