Motivation
This RFC proposes adding recipes for multimodal models that vLLM already supports, such as Qwen2.5-VL and InternVL3, as well as models that may be supported in the future, such as Qwen2.5-Omni Talker #16347 and VILA #11887.
Compared with pure LLMs, multimodal models have widely varying processing pipelines for inputs such as images, video, and audio. There is therefore a pressing need to clarify the input format, usage, and corresponding performance of each model across different tasks.
In addition, RFC #4194 outlines the roadmap for multi-modality support alongside the V1 refactor; many of its features are complete while the rest are in progress. It is also important to provide up-to-date evaluation and performance results on common benchmarks for each multimodal model under the updated architecture.
Proposed points for each recipe
- Hyperparameters for different tasks
- Input/output processing methods and examples (see the sketch after this list)
- Evaluation results on typical benchmarks
- Performance data on specific hardware architectures
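As an illustration of the input/output processing point, here is a minimal sketch of offline multimodal inference with vLLM's `LLM` API, assuming the Qwen/Qwen2.5-VL-7B-Instruct checkpoint; the image file name is hypothetical, and the exact prompt template and placeholder tokens differ per model, which is precisely what each recipe should document.

```python
from PIL import Image
from vllm import LLM, SamplingParams

# Load a multimodal model for offline inference (model choice is an example).
llm = LLM(model="Qwen/Qwen2.5-VL-7B-Instruct", max_model_len=8192)

# Hypothetical local image used as the multimodal input.
image = Image.open("demo.jpg")

# Qwen2.5-VL-style chat prompt with an image placeholder; other models
# (e.g. InternVL3) use different placeholder tokens and chat templates.
prompt = (
    "<|im_start|>user\n<|vision_start|><|image_pad|><|vision_end|>"
    "Describe this image.<|im_end|>\n<|im_start|>assistant\n"
)

# Pass the image alongside the prompt via multi_modal_data.
outputs = llm.generate(
    {"prompt": prompt, "multi_modal_data": {"image": image}},
    SamplingParams(temperature=0.0, max_tokens=128),
)
print(outputs[0].outputs[0].text)
```

Each recipe would likely also cover the online serving path (`vllm serve <model>` queried through the OpenAI-compatible chat API with `image_url` content), since that is the setup most benchmark and performance measurements use.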
Proposed models to add recipes for (including but not limited to):
- Qwen2.5-VL (Add Qwen2.5VL Guide #30)
- InternVL3 (Add InternVL3 Guide #35)
- Skywork R1V
- Granite-speech-3.3-8b
- Llama 4 (Add recipes for Llama3.3 70B and Llama4 Scout #13)
- Gemma3
- GLM-4.5 (GLM-4.5 and GLM-4.5V #23)
- Qwen2.5-Omni (Talker not supported in vLLM yet)
- VILA (Not supported in vLLM yet)
- BAGEL (Not supported in vLLM yet)