@lucaslie lucaslie commented Jul 24, 2025

VLM Prototyping

A prototype for multi-modal models (starting with image+text to text) for AutoDeploy.

Setup + Testing

llama4.yaml:

args:
  model: meta-llama/Llama-4-Scout-17B-16E-Instruct
  world_size: 4
  runtime: demollm # or: trtllm
  compile_backend: torch-simple # not tested: torch-compile, torch-opt
  attn_page_size: 64
  max_input_len: 4096
  max_seq_len: 8192
  attn_backend: flashinfer
  model_factory: AutoModelForImageTextToText
  # uncomment below to quickly initialize/load a smaller, random weight model
  # skip_loading_weights: false
  # model_kwargs:
  #   text_config:
  #     num_hidden_layers: 3
  #   vision_config:
  #     num_hidden_layers: 3
prompt:
  batch_size: 4
  queries:
    - "How big is the universe? "
    - {"prompt": "In simple words and a single sentence, explain the concept of gravity: "}
    # see for chat template format: https://huggingface.co/docs/transformers/en/chat_templating_multimodal
    - - role: user
        content:
          - type: text
            text: How to fix slicing in golf?
    - - role: user
        content:
          - type: text
            text: Please describe the natural scenery you see in the following images
          - type: image
            url: https://huggingface.co/datasets/YiYiXu/testing-images/resolve/main/seashore.png
          - type: image
            url: https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/inpaint.png
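The `queries` list above mixes three formats: a plain string, a `{"prompt": ...}` dict, and a full chat-message list per HF's multimodal chat-template convention. As an illustrative sketch only (the helper name `normalize_query` is hypothetical, not part of this PR), the three formats can be normalized into the chat-message form like so:

```python
def normalize_query(query):
    """Normalize a query into the HF chat-template message-list format.

    Accepts a plain string, a {"prompt": ...} dict, or an already-formed
    list of chat messages (see the HF multimodal chat templating docs).
    """
    if isinstance(query, str):
        text = query
    elif isinstance(query, dict) and "prompt" in query:
        text = query["prompt"]
    elif isinstance(query, list):
        return query  # already a chat-message list; pass through unchanged
    else:
        raise TypeError(f"unsupported query type: {type(query)!r}")
    return [{"role": "user", "content": [{"type": "text", "text": text}]}]
```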

Command to run

python examples/auto_deploy/build_and_run_ad.py --yaml-configs llama4.yaml

Features

  • Generalized SequenceInfo to handle extra arguments such as images (represented as pixel_values). SequenceInfo is the general interface for sequence inputs/outputs of the graph model.
  • Updated the model factory interface to provide example inputs and to declare extra arguments, reflecting that input arguments are now model-class dependent. The interface informs the SequenceInfo class which arguments are relevant to the model and how to configure them.
  • Added a Llama4 patch to handle conditional inputs as conditional graph execution inside a single graph.
  • Defined a generic ADInputProcessor that wraps HF's chat template, providing a single input-processing utility for any multi-modal model that follows HF's chat-template convention.
  • Correctly passed extra arguments and multi-modal arguments through from the LLM API (and build_and_run_ad.py) to the AutoDeploy backend.
  • Handled mixed batches of text-only and text+image inputs.
  • Runtime support for trtllm and demollm.
  • End-to-end example with Llama4.
  • Some more notes + discussions.
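To illustrate the mixed-batch handling (text-only and text+image requests in one batch), here is a hypothetical sketch of how chat-format requests could be partitioned so that only multi-modal entries carry pixel_values-style extra arguments. The function name `split_batch` is illustrative only and does not appear in this PR:

```python
def split_batch(batch):
    """Partition chat-format requests into text-only and multi-modal groups.

    A request counts as multi-modal if any content item has type "image",
    mirroring the HF chat-template convention used in the YAML above.
    """
    def has_image(messages):
        return any(
            item.get("type") == "image"
            for msg in messages
            for item in msg.get("content", [])
        )

    text_only = [m for m in batch if not has_image(m)]
    multimodal = [m for m in batch if has_image(m)]
    return text_only, multimodal
```

Only the multimodal group then needs to run the vision encoder and supply pixel_values to the graph; the text-only group takes the text path.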

@lucaslie lucaslie self-assigned this Jul 24, 2025
@lucaslie lucaslie requested a review from suyoggupta July 24, 2025 02:23
@lucaslie lucaslie force-pushed the ll/vlm_kickoff branch 6 times, most recently from 05d205d to 92515a7 on July 30, 2025 16:31
Signed-off-by: Lucas Liebenwein <[email protected]>

lucaslie commented Sep 5, 2025

see NVIDIA#7221

@lucaslie lucaslie closed this Sep 5, 2025