@lucaslie lucaslie commented Jul 24, 2025

VLM Prototyping

A prototype for multi-modal models (starting with image+text to text) for AutoDeploy.

Setup + Testing

llama4.yaml:

args:
  model: meta-llama/Llama-4-Scout-17B-16E-Instruct
  world_size: 4
  runtime: demollm # or: trtllm
  compile_backend: torch-simple # not tested: torch-compile, torch-opt
  attn_page_size: 64
  max_input_len: 4096
  max_seq_len: 8192
  attn_backend: flashinfer
  model_factory: AutoModelForImageTextToText
  # uncomment below to quickly initialize/load a smaller, random weight model
  # skip_loading_weights: false
  # model_kwargs:
  #   text_config:
  #     num_hidden_layers: 3
  #   vision_config:
  #     num_hidden_layers: 3
prompt:
  batch_size: 4
  queries:
    - "How big is the universe? "
    - {"prompt": "In simple words and a single sentence, explain the concept of gravity: "}
    # see for chat template format: https://huggingface.co/docs/transformers/en/chat_templating_multimodal
    - - role: user
        content:
          - type: text
            text: How to fix slicing in golf?
    - - role: user
        content:
          - type: text
            text: Please describe the natural scenery you see in the following images
          - type: image
            url: https://huggingface.co/datasets/YiYiXu/testing-images/resolve/main/seashore.png
          - type: image
            url: https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/inpaint.png
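The `queries` list above mixes three formats: a plain string, a `{"prompt": ...}` dict, and a full chat-message list per HF's multimodal chat-template convention. As an illustrative sketch only (the helper name `normalize_query` is hypothetical, not part of this PR), the three formats can be normalized into the chat-message form like so:

```python
def normalize_query(query):
    """Normalize a query into the HF chat-template message-list format.

    Accepts a plain string, a {"prompt": ...} dict, or an already-formed
    list of chat messages (see the HF multimodal chat templating docs).
    """
    if isinstance(query, str):
        text = query
    elif isinstance(query, dict) and "prompt" in query:
        text = query["prompt"]
    elif isinstance(query, list):
        return query  # already a chat-message list; pass through unchanged
    else:
        raise TypeError(f"unsupported query type: {type(query)!r}")
    return [{"role": "user", "content": [{"type": "text", "text": text}]}]
```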

Command to run

python examples/auto_deploy/build_and_run_ad.py --yaml-configs llama4.yaml

Features

  • Generalized SequenceInfo to handle extra arguments such as images (represented as pixel_values). SequenceInfo is the general interface for sequence inputs/outputs of the graph model.
  • Updated the model factory interface to provide example inputs and to declare extra arguments, reflecting that input arguments are now model-class dependent. The interface informs the SequenceInfo class which arguments are relevant to the model and how to configure them.
  • Added a Llama4 patch to handle conditional inputs as conditional graph execution inside a single graph.
  • Defined a generic ADInputProcessor that wraps HF's chat template, providing a single input-processing utility for any multi-modal model that follows HF's chat-template convention.
  • Correctly passed extra arguments and multi-modal arguments through from the LLM API (and build_and_run_ad.py) to the AutoDeploy backend.
  • Handled mixed batches of text-only and text+image inputs.
  • Runtime support for trtllm and demollm.
  • End-to-end example with Llama4.
  • Some more notes + discussions.
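To illustrate the mixed-batch handling (text-only and text+image requests in one batch), here is a hypothetical sketch of how chat-format requests could be partitioned so that only multi-modal entries carry pixel_values-style extra arguments. The function name `split_batch` is illustrative only and does not appear in this PR:

```python
def split_batch(batch):
    """Partition chat-format requests into text-only and multi-modal groups.

    A request counts as multi-modal if any content item has type "image",
    mirroring the HF chat-template convention used in the YAML above.
    """
    def has_image(messages):
        return any(
            item.get("type") == "image"
            for msg in messages
            for item in msg.get("content", [])
        )

    text_only = [m for m in batch if not has_image(m)]
    multimodal = [m for m in batch if has_image(m)]
    return text_only, multimodal
```

Only the multimodal group then needs to run the vision encoder and supply pixel_values to the graph; the text-only group takes the text path.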

@lucaslie lucaslie self-assigned this Jul 24, 2025
@lucaslie lucaslie requested a review from suyoggupta July 24, 2025 02:23
@lucaslie lucaslie force-pushed the ll/vlm_kickoff branch 6 times, most recently from 05d205d to 92515a7 on July 30, 2025 16:31
Signed-off-by: Lucas Liebenwein <[email protected]>

lucaslie commented Sep 5, 2025

see NVIDIA#7221

@lucaslie lucaslie closed this Sep 5, 2025