Redirect max-serve-multimodal-structured-output recipe to docs site #74

Draft · wants to merge 1 commit into `main`
191 changes: 2 additions & 189 deletions max-serve-multimodal-structured-output/README.md
@@ -1,190 +1,3 @@
# Image to JSON: Multimodal Structured Output with Llama 3.2 Vision, MAX and Pydantic
# Image to JSON: Multimodal Structured Output with Llama 3.2 Vision, MAX and Pydantic (MOVED)

In this recipe, we will cover:

* How to run a multimodal vision model using Llama 3.2 Vision and MAX
* Implementing structured output parsing with Pydantic models
* Converting image analysis into strongly-typed JSON

We'll walk through building a solution that showcases:

* MAX capabilities with multimodal models
* Structured output parsing with type safety
* Simple deployment using pixi CLI

Let's get started.

## Requirements

Please make sure your system meets our [system requirements](https://docs.modular.com/max/faq/#system-requirements).

To proceed, ensure you have the `pixi` CLI installed:

```bash
curl -fsSL https://pixi.sh/install.sh | sh
```

...and updated to the latest version:

```bash
pixi self-update
```

For this recipe, you will need:

* Access to [meta-llama/Llama-3.2-11B-Vision-Instruct](https://huggingface.co/meta-llama/Llama-3.2-11B-Vision-Instruct) model
* A valid [Hugging Face token](https://huggingface.co/settings/tokens) for accessing Llama 3.2 Vision

Set up your environment variables:

```bash
export HUGGING_FACE_HUB_TOKEN=<your-token-here>
```

### GPU requirements

**Structured output with MAX requires GPU access.** To run the app on a GPU, ensure your system meets these requirements:

* Supported GPUs: NVIDIA H100 / H200, A100, A10G, L4, or L40
* NVIDIA Drivers: [Installation guide here](https://www.nvidia.com/download/index.aspx)
* NVIDIA Container Toolkit: [Installation guide here](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/install-guide.html)
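
To confirm that a supported GPU and its driver are visible to your system before starting the server, you can run NVIDIA's standard utility:

```bash
# Lists detected GPUs along with the installed driver and CUDA versions.
nvidia-smi
```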

## Quick start

1. Download the code for this recipe:

```bash
git clone https://github.com/modularml/max-recipes.git
cd max-recipes/max-serve-multimodal-structured-output
```

2. Run the server with the vision model:

**Make sure port `8010` is available. You can adjust the port settings in [pyproject.toml](./pyproject.toml).**

```bash
pixi run server
```

This will start MAX with Llama 3.2 Vision and expose an OpenAI-compatible endpoint on port `8010`.

3. Run the example code, which extracts player information from a basketball image:

```bash
pixi run main
```

<img src="image.jpg" alt="Players" width="100%" style="max-width: 1024px;">

The output will look like:

```json
{
  "players": [
    {
      "name": "Klay Thompson",
      "number": 11
    },
    {
      "name": "Stephen Curry",
      "number": 30
    },
    {
      "name": "Kevin Durant",
      "number": 35
    }
  ]
}
```

## Features of Llama 3.2 Vision structured output

* **Multimodal processing**: Handles both image and text inputs seamlessly
* **OpenAI-compatible API**: Works with standard OpenAI client libraries
* **Type-safe output**: Uses Pydantic models to ensure structured and validated JSON output

## Structured output with Pydantic

The core of our structured output system uses Pydantic models to define the expected data structure:

```python
from typing import List

from pydantic import BaseModel, Field

class Player(BaseModel):
    name: str = Field(description="Player name on jersey")
    number: int = Field(description="Player number on jersey")

class Players(BaseModel):
    players: List[Player] = Field(description="List of players visible in the image")
```

How it works:

* **Type validation**: Pydantic ensures all fields match their expected types
* **Field descriptions**: Help guide the model in extracting the right information
* **Nested structures**: Support complex data hierarchies for detailed information extraction
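
Because these models compile down to JSON Schema, you can inspect exactly what shape (and which field descriptions) will steer the extraction. A minimal sketch, assuming Pydantic v2 and the `Players` model defined above:

```python
import json

# Inspect the JSON Schema generated from the Pydantic models above.
# The Field descriptions appear as "description" entries, which is
# what guides the model toward extracting the right information.
print(json.dumps(Players.model_json_schema(), indent=2))
```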

### Vision model integration

The application uses the OpenAI client to communicate with MAX:

```python
from openai import OpenAI

# Point the client at the local MAX server (port 8010, per pyproject.toml).
client = OpenAI(api_key="local", base_url="http://localhost:8010/v1")

completion = client.beta.chat.completions.parse(
    model="meta-llama/Llama-3.2-11B-Vision-Instruct",
    messages=[
        {"role": "system", "content": "Extract player information from the image."},
        {"role": "user", "content": [
            {
                "type": "text",
                "text": "Please provide a list of players visible in this photo with their jersey numbers."
            },
            {
                "type": "image_url",
                "image_url": {
                    "url": "https://ei.marketwatch.com/Multimedia/2019/02/15/Photos/ZH/MW-HE047_nbajer_20190215102153_ZH.jpg"
                }
            }
        ]},
    ],
    response_format=Players,
)
```

Key components:

* **Multimodal input**: Combines text instructions with image data
* **Structured parsing**: Uses the Players model to format the response
* **Inference**: Runs on MAX with OpenAI-compatible API
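
Once the request returns, the validated object is attached to the parsed message. A minimal sketch of reading the result, assuming the `completion` object from the snippet above and a recent OpenAI Python SDK:

```python
# The beta parse API attaches the validated Pydantic object to the
# message; `parsed` is None if the model refused or parsing failed.
players = completion.choices[0].message.parsed

if players is not None:
    for player in players.players:
        print(f"{player.name} wears #{player.number}")
```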

### MAX options

To enable structured output in MAX, include the `--enable-structured-output` flag when starting the server, as shown in the sketch below.
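
For example, a server launch might look like the following sketch. The `--model-path` flag and exact invocation are assumptions here; the recipe itself starts the server via `pixi run server`:

```bash
# Hypothetical invocation; flag names may differ by MAX version,
# and the recipe normally launches the server with `pixi run server`.
max serve \
  --model-path meta-llama/Llama-3.2-11B-Vision-Instruct \
  --enable-structured-output
```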

To see other options, check out the help:

```bash
max serve --help
```

## Conclusion

In this recipe, we've built a system for extracting structured data from images using Llama 3.2 Vision and MAX. We've explored:

* **Basic setup**: Using MAX with GPU support
* **Structured output**: Implementing type-safe data extraction with Pydantic
* **Multimodal processing**: Handling both image and text inputs
* **Local deployment**: Running the model with MAX

This implementation provides a foundation for building more complex vision-based applications with structured output.

## Next steps

* Deploy Llama 3.2 Vision on GPU with MAX to [AWS, GCP or Azure](https://docs.modular.com/max/tutorials/max-serve-local-to-cloud/) or on [Kubernetes](https://docs.modular.com/max/tutorials/deploy-max-serve-on-kubernetes/)
* Explore MAX's [documentation](https://docs.modular.com/max/) for additional features
* Join our [Modular Forum](https://forum.modular.com/) and [Discord community](https://discord.gg/modular) to share your experiences and get support

We're excited to see what you'll build with Llama 3.2 Vision and MAX! Share your projects and experiences with us using `#ModularAI` on social media.
This recipe has moved to the Modular docs: see [Deploy Llama Vision](https://docs.modular.com/max/tutorials/deploy-llama-vision/).