
[RFC]: Add Multimodal Model Recipes (Qwen2.5-VL, Qwen2.5-Omni, InternVL, etc) #10

@Gaohan123

Description


Motivation
This RFC proposes adding recipes for multimodal models that are already supported, such as Qwen2.5-VL and InternVL3, as well as models that may be supported in the future, such as qwen2.5-omni-talker (#16347) and VILA (#11887).

Compared with text-only LLMs, multimodal models differ widely in how they process multimodal inputs such as images, videos, and audio. There is therefore an urgent need to clarify the input format, usage, and corresponding performance of each model on different tasks.

In addition, RFC #4194 outlines the roadmap for multi-modality support alongside the V1 refactor; many of its features are complete while the rest are in progress. It is therefore also important to provide up-to-date evaluation and performance results on common benchmarks for each multimodal model under the updated architecture.

Proposed points for each recipe

  1. Hyperparameters for different tasks
  2. Input/output processing methods and examples (see the sketch after this list)
  3. Evaluation results on typical benchmarks
  4. Performance data on specific hardware architectures
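
As an illustration of point 2, below is a minimal sketch of what an input-processing example in a recipe could look like, using vLLM's offline `LLM.generate` API with `multi_modal_data`. The model ID, prompt placeholder tokens, and engine arguments shown here are illustrative assumptions and vary by model and vLLM version; each recipe would document the exact values for its model.

```python
# Sketch only: the model name, placeholder tokens, and engine arguments are
# illustrative assumptions; a recipe would pin down the exact values.
from PIL import Image
from vllm import LLM, SamplingParams

# Load a supported multimodal model (Qwen2.5-VL shown as an example).
llm = LLM(model="Qwen/Qwen2.5-VL-3B-Instruct", max_model_len=8192)

# Prepare a single image input.
image = Image.open("example.jpg").convert("RGB")

# Qwen2.5-VL-style chat prompt with an image placeholder; other models use
# different placeholder tokens, which is exactly what a recipe should spell out.
prompt = (
    "<|im_start|>user\n"
    "<|vision_start|><|image_pad|><|vision_end|>"
    "Describe this image.<|im_end|>\n"
    "<|im_start|>assistant\n"
)

# Pass the image alongside the prompt via multi_modal_data.
outputs = llm.generate(
    {"prompt": prompt, "multi_modal_data": {"image": image}},
    SamplingParams(temperature=0.0, max_tokens=128),
)
print(outputs[0].outputs[0].text)
```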

Proposed models to add recipes for (including but not limited to): Qwen2.5-VL, Qwen2.5-Omni, InternVL3, and VILA.
