Merged

22 commits
f97f8d9
Fix correct handling of chat multimodal inputs in TransformersMultiMo…
laitifranz Aug 14, 2025
37cd780
Handle empty assets cases where onyl text is provided to MLLMs
laitifranz Aug 16, 2025
842031d
Remove deprecated function and update test accordingly since Outlines…
laitifranz Aug 22, 2025
afb6de8
Update docstring with OpenAI’s multi-modal chat format support, and u…
laitifranz Aug 23, 2025
dbce9d5
Clarify docstring for Chat class
laitifranz Aug 23, 2025
93caf32
Enhance asset handling in format_chat_input method to validate asset …
laitifranz Aug 23, 2025
2a0c663
Update tests for transformers MultiModal to the multimodal chat format
laitifranz Aug 23, 2025
8a7741f
Update transformers multimodal docs to reflect new changes with Chat …
laitifranz Aug 23, 2025
0b8e8e8
Use consistent string quoting for SUPPORTED_ASSETS to avoid Python f-…
laitifranz Aug 23, 2025
a3b873b
Add tests for handling empty assets and invalid content types in tran…
laitifranz Aug 23, 2025
a76b3ae
Removing SUPPORTED_ASSETS constant for clarity and to avoid pre-commi…
laitifranz Aug 23, 2025
4b2c82e
Apply pre-commit suggestions: trim trailing whitespace
laitifranz Aug 23, 2025
70667a2
Remove or from Chat docstring for better readability
laitifranz Aug 26, 2025
99a9547
Update format_chat_input for Chat class to support list of string + a…
laitifranz Aug 27, 2025
b10fa47
Reintroduce test for list of string+assets
laitifranz Aug 27, 2025
30ab2ea
Reintroduce test for list of string+assets and add new tests for miss…
laitifranz Aug 27, 2025
2854765
Add example for using list of string + assets with Chat class
laitifranz Aug 27, 2025
8c8c6d2
Add tests for invalid content types and for unsupported asset types
laitifranz Aug 27, 2025
d5488f0
Migrate multimodal type adapter tests to use AutoProcessor for a MM s…
laitifranz Aug 27, 2025
df1c9c0
Added validation for items with multiple keys
laitifranz Aug 27, 2025
b604ccc
Add tests for mutiple assets usage in multimodal chat template format
laitifranz Aug 27, 2025
989bee0
Update text input prompt to align with author's preference
laitifranz Sep 12, 2025
189 changes: 180 additions & 9 deletions docs/features/models/transformers_multimodal.md
@@ -10,18 +10,18 @@ The Outlines `TransformersMultiModal` model inherits from `Transformers` and sha

To load the model, you can use the `from_transformers` function. It takes 2 arguments:

- `model`: a `transformers` model (created with `AutoModelForCausalLM` for instance)
- `model`: a `transformers` model (created with `AutoModelForImageTextToText` for instance)
- `tokenizer_or_processor`: a `transformers` processor (created with `AutoProcessor` for instance, it must be an instance of `ProcessorMixin`)

For instance:

```python
import outlines
from transformers import AutoModelForCausalLM, AutoProcessor
from transformers import AutoModelForImageTextToText, AutoProcessor

# Create the transformers model and processor
hf_model = AutoModelForCausalLM.from_pretrained("microsoft/Phi-3-mini-4k-instruct")
hf_processor = AutoProcessor.from_pretrained("microsoft/Phi-3-mini-4k-instruct")
hf_model = AutoModelForImageTextToText.from_pretrained("Qwen/Qwen2.5-VL-3B-Instruct")
hf_processor = AutoProcessor.from_pretrained("Qwen/Qwen2.5-VL-3B-Instruct")

# Create the Outlines model
model = outlines.from_transformers(hf_model, hf_processor)
@@ -76,11 +76,186 @@ result = model(
print(result) # '{"specie": "cat", "color": "white", "weight": 4}'
print(Animal.model_validate_json(result)) # specie=cat, color=white, weight=4
```
!!! Warning

    Make sure your prompt contains the tags expected by your processor so that the assets are correctly injected into the prompt. For some vision multimodal models, for instance, you need to add as many `<image>` tags to your prompt as there are image assets included in your model input, as illustrated in the sketch below. The `Chat` interface, by contrast, does not require this step.


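For illustration, here is a minimal sketch of a raw (non-`Chat`) call with two image assets. The variable names `image_1` and `image_2` are hypothetical PIL images, and `model` is assumed to be the multimodal model created above:

```python
from outlines.inputs import Image

# Hypothetical sketch: one <image> tag per image asset passed alongside the text.
# `model`, `image_1` and `image_2` are assumed to be defined as in the examples above.
result = model(
    ["Compare these two pictures. <image><image>", Image(image_1), Image(image_2)],
    max_new_tokens=50,
)
print(result)
```
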
The `TransformersMultiModal` model supports batch generation. To use it, invoke the `batch` method with a list of lists. You will receive as a result a list of completions.
### Chat

The `Chat` interface offers a more convenient way to work with multimodal inputs: you don't need to manually add asset tags such as `<image>`, because the model's HF processor handles the chat templating and asset placement for you automatically.

To use it, call the model with a `Chat` instance using a multimodal chat format. Assets must be wrapped in the `outlines.inputs.{Image, Audio, Video}` types, and only the `image`, `video`, and `audio` asset types are supported.

For instance:

```python
import outlines
from outlines.inputs import Chat, Image
from transformers import AutoModelForImageTextToText, AutoProcessor
from PIL import Image as PILImage
from io import BytesIO
from urllib.request import urlopen
import torch

model_kwargs = {
    "torch_dtype": torch.bfloat16,
    "attn_implementation": "flash_attention_2",
    "device_map": "auto",
}

def get_image_from_url(image_url):
    img_byte_stream = BytesIO(urlopen(image_url).read())
    image = PILImage.open(img_byte_stream).convert("RGB")
    image.format = "PNG"
    return image

# Create the model
model = outlines.from_transformers(
    AutoModelForImageTextToText.from_pretrained("Qwen/Qwen2.5-VL-3B-Instruct", **model_kwargs),
    AutoProcessor.from_pretrained("Qwen/Qwen2.5-VL-3B-Instruct", **model_kwargs)
)

IMAGE_URL = "https://upload.wikimedia.org/wikipedia/commons/2/25/Siam_lilacpoint.jpg"

# Create the chat multimodal input
prompt = Chat([
    {
        "role": "user",
        "content": [
            {"type": "image", "image": Image(get_image_from_url(IMAGE_URL))},
            {"type": "text", "text": "Describe the image in few words."}
        ],
    }
])

# Call the model to generate a response
response = model(prompt, max_new_tokens=50)
print(response) # 'A Siamese cat with blue eyes is sitting on a cat tree, looking alert and curious.'
```

Or using a list containing text and assets:

```python
import outlines
from outlines.inputs import Chat, Image
from transformers import AutoModelForImageTextToText, AutoProcessor
from PIL import Image as PILImage
from io import BytesIO
import requests
import torch


TEST_MODEL = "Qwen/Qwen2.5-VL-7B-Instruct"

# Function to get an image
def get_image(url):
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'
    }
    r = requests.get(url, headers=headers)
    image = PILImage.open(BytesIO(r.content)).convert("RGB")
    image.format = "PNG"
    return image

model_kwargs = {
    "torch_dtype": torch.bfloat16,
    # "attn_implementation": "flash_attention_2",
    "device_map": "auto",
}

# Create a model
model = outlines.from_transformers(
    AutoModelForImageTextToText.from_pretrained(TEST_MODEL, **model_kwargs),
    AutoProcessor.from_pretrained(TEST_MODEL, **model_kwargs),
)

# Create the chat input
prompt = Chat([
    {"role": "user", "content": "You are a helpful assistant that helps me describe pictures."},
    {"role": "assistant", "content": "I'd be happy to help you describe pictures! Please go ahead and share an image."},
    {
        "role": "user",
        "content": ["Describe briefly the image", Image(get_image("https://upload.wikimedia.org/wikipedia/commons/2/25/Siam_lilacpoint.jpg"))]
    },
])

# Call the model to generate a response
response = model(prompt, max_new_tokens=50)
print(response) # 'The image shows a light-colored cat with a white chest...'
```


### Batching

The `TransformersMultiModal` model supports batching through the `batch` method. To use it, provide a list of prompts (using the formats described above); you will receive a list of completions in return.

An example using the Chat format:

```python
import outlines
from outlines.inputs import Chat, Image
from transformers import AutoModelForImageTextToText, AutoProcessor
from PIL import Image as PILImage
from io import BytesIO
from urllib.request import urlopen
import torch
from pydantic import BaseModel

model_kwargs = {
    "torch_dtype": torch.bfloat16,
    "attn_implementation": "flash_attention_2",
    "device_map": "auto",
}

class Animal(BaseModel):
    animal: str
    color: str

def get_image_from_url(image_url):
    img_byte_stream = BytesIO(urlopen(image_url).read())
    image = PILImage.open(img_byte_stream).convert("RGB")
    image.format = "PNG"
    return image

# Create the model
model = outlines.from_transformers(
    AutoModelForImageTextToText.from_pretrained("Qwen/Qwen2.5-VL-3B-Instruct", **model_kwargs),
    AutoProcessor.from_pretrained("Qwen/Qwen2.5-VL-3B-Instruct", **model_kwargs)
)

IMAGE_URL_1 = "https://upload.wikimedia.org/wikipedia/commons/2/25/Siam_lilacpoint.jpg"
IMAGE_URL_2 = "https://upload.wikimedia.org/wikipedia/commons/a/af/Golden_retriever_eating_pigs_foot.jpg"

# Create the chat multimodal messages
messages = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "Describe the image in few words."},
            {"type": "image", "image": Image(get_image_from_url(IMAGE_URL_1))},
        ],
    },
]

messages_2 = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "Describe the image in few words."},
            {"type": "image", "image": Image(get_image_from_url(IMAGE_URL_2))},
        ],
    },
]

prompts = [Chat(messages), Chat(messages_2)]

# Call the model to generate a response
responses = model.batch(prompts, output_type=Animal, max_new_tokens=100)
print(responses) # ['{ "animal": "cat", "color": "white and gray" }', '{ "animal": "dog", "color": "white" }']
print([Animal.model_validate_json(i) for i in responses]) # [Animal(animal='cat', color='white and gray'), Animal(animal='dog', color='white')]
```


An example using a list of lists with asset tags:

```python
from io import BytesIO
from urllib.request import urlopen
@@ -119,7 +294,3 @@ result = model.batch(
)
print(result) # ['The image shows a cat', 'The image shows an astronaut']
```

!!! Warning

Make sure your prompt contains the tags expected by your processor to correctly inject the assets in the prompt. For some vision multimodal models for instance, you need to add as many `<image>` tags in your prompt as there are image assets included in your model input.
9 changes: 6 additions & 3 deletions outlines/inputs.py
@@ -80,8 +80,11 @@ class Chat:

Each message contained in the messages list must be a dict with 'role' and
'content' keys. The role can be 'user', 'assistant', or 'system'. The content
can be a string or a list containing a str and assets (images, videos,
audios, etc.) in the case of multimodal models.
supports either:
- a text string,
- a list containing text and assets (e.g., ["Describe...", Image(...)]),
- a list of dict items with explicit types, only for HuggingFace transformers models (e.g.,
  [{"type": "text", "text": "Describe..."}, {"type": "image", "image": Image(...)}])

Examples
--------
@@ -95,7 +98,7 @@ class Chat:
chat_prompt.add_user_message(["Describe the image below", Image(image)])

# Add as an assistant message the response from the model.
chat_prompt.add_assistant_message("The is a black cat sitting on a couch.")
chat_prompt.add_assistant_message("There is a black cat sitting on a couch.")
```

Parameters