[Model] Add LoRA support for Whisper models #29856
Conversation
Documentation preview: https://vllm--29856.org.readthedocs.build/en/29856/
Code Review
This pull request introduces multi-LoRA support for Whisper models, which is a valuable addition. The implementation is robust and well-engineered. I appreciate that instead of a model-specific hack, the changes generalize the existing LoRA infrastructure to support Whisper's architecture, particularly the KV-only packed layers in cross-attention. The inclusion of comprehensive unit tests and a clear example script significantly enhances the quality and usability of this contribution. The code is clean, the logic is sound, and the changes are well-documented. Overall, this is an excellent pull request.
Force-pushed from 8897a78 to 93182eb:
This PR enables Multi-LoRA support for Whisper speech-to-text models, allowing users to serve multiple fine-tuned Whisper adapters from a single base model.

Changes:
- Add SupportsLoRA interface to WhisperForConditionalGeneration
- Add embedding_modules and embedding_padding_modules attributes
- Update packed_modules_mapping for LoRA compatibility
- Extend MergedQKVParallelLinearWithLoRA to support KV-only (2-slice) configurations used in Whisper's cross-attention layers
- Add fallback to max_target_positions in WorkerLoRAManager for Whisper compatibility
- Add example script for Whisper Multi-LoRA inference
- Add unit tests for Whisper LoRA support

Signed-off-by: daje0601 <[email protected]>
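The `max_target_positions` fallback mentioned in the commit message above could look roughly like the sketch below. This only illustrates the idea, not the PR's actual diff; the helper name and standalone-function form are assumptions.

```python
from transformers import PretrainedConfig


def get_max_positions(hf_config: PretrainedConfig) -> int | None:
    """Sketch: return max_position_embeddings, falling back to Whisper's
    max_target_positions when the usual attribute is absent."""
    max_pos = getattr(hf_config, "max_position_embeddings", None)
    if max_pos is None:
        # WhisperConfig exposes max_target_positions instead of
        # max_position_embeddings.
        max_pos = getattr(hf_config, "max_target_positions", None)
    return max_pos
```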
Force-pushed from 93182eb to ba3826b.
Will look at this PR ASAP, also cc @NickLucche
jeejeelee left a comment:
Thank you for your contribution. The main concern is whether we should use MergedColumnParallelLinear rather than QKVLinear in the base model.
# LoRA-specific attributes
embedding_modules = {}
embedding_padding_modules: list[str] = []
If the model inherits from SupportsLoRA, these two attributes are empty by default
Thank you, I'll remove these redundant attributes.
@@ -0,0 +1,136 @@
# SPDX-License-Identifier: Apache-2.0
It looks like this example is similar to multilora_inference.py, so do we need to add this example?
You're right - it's similar to the existing multilora_inference.py.
I'll remove whisper_multilora_inference.py from this PR.
packed_modules_list: list,
model_config: PretrainedConfig | None = None,
) -> bool:
    return type(source_layer) is QKVParallelLinear and len(packed_modules_list) == 3
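For context, the hunk above is the check being discussed. The KV-only relaxation described in this PR (and reconsidered below) would look roughly like the following standalone sketch; the import paths and function form are assumptions, not the exact diff.

```python
from torch import nn
from transformers import PretrainedConfig

from vllm.config import LoRAConfig
from vllm.model_executor.layers.linear import QKVParallelLinear


def can_replace_layer(
    source_layer: nn.Module,
    lora_config: LoRAConfig,
    packed_modules_list: list,
    model_config: PretrainedConfig | None = None,
) -> bool:
    # Accept the usual 3-slice (q_proj, k_proj, v_proj) packing as well as
    # Whisper's KV-only 2-slice cross-attention packing (k_proj, v_proj).
    return type(source_layer) is QKVParallelLinear and len(
        packed_modules_list) in (2, 3)
```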
Can we use MergedColumnParallelLinear rather than QKVParallelLinear in the base model?
I will:
- Revert my changes to MergedQKVParallelLinearWithLoRA in column_parallel_linear.py
- Update whisper.py to use MergedColumnParallelLinear for the cross-attention's kv_proj layer
I'll update the PR with these changes shortly. Thanks again for the review!
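A rough sketch of what that whisper.py change could look like, assuming vLLM's MergedColumnParallelLinear API; the wrapper class and the names embed_dim, num_heads, and head_dim are illustrative, the bias handling is simplified, and instantiating the layer requires vLLM's parallel state to be initialized.

```python
from torch import nn

from vllm.model_executor.layers.linear import MergedColumnParallelLinear


class WhisperCrossAttentionKV(nn.Module):
    """Illustrative only: fused K/V projection for cross-attention built
    from MergedColumnParallelLinear (two equal output slices, K then V)."""

    def __init__(self, embed_dim: int, num_heads: int, head_dim: int,
                 prefix: str = "") -> None:
        super().__init__()
        self.kv_proj = MergedColumnParallelLinear(
            input_size=embed_dim,
            # One output slice per packed sub-module: k_proj and v_proj.
            output_sizes=[num_heads * head_dim] * 2,
            bias=True,  # simplified; Whisper uses a bias on v_proj but not k_proj
            prefix=f"{prefix}.kv_proj",
        )
```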
This pull request has merge conflicts that must be resolved before it can be merged.
Purpose
This PR enables Multi-LoRA support for Whisper speech-to-text models, allowing users to serve multiple fine-tuned Whisper adapters from a single base model.
Background
Currently, vLLM's `WhisperForConditionalGeneration` does not implement the `SupportsLoRA` interface, preventing users from using LoRA adapters with Whisper models. This limitation requires users to deploy separate model instances for each fine-tuned variant, which is inefficient in terms of GPU memory usage.
Changes
1. `vllm/model_executor/models/whisper.py` (a sketch of this change follows the list)
   - Add `SupportsLoRA` interface to `WhisperForConditionalGeneration`
   - Add `embedding_modules` and `embedding_padding_modules` attributes required by LoRA
   - Update `packed_modules_mapping` to use simplified keys (`qkv_proj`, `kv_proj`) for LoRA compatibility
2. `vllm/lora/layers/column_parallel_linear.py`
   - Extend `MergedQKVParallelLinearWithLoRA` to support KV-only (2-slice) configurations
   - Whisper's cross-attention layers (`encoder_attn.kv_proj`) only have K and V projections, not Q
   - Update `can_replace_layer()` to accept both 2-module and 3-module configurations
   - Update `slice_lora_a()` to dynamically handle a variable number of slices
3. `vllm/lora/worker_manager.py`
   - Add fallback to `max_target_positions` when `max_position_embeddings` is not available (Whisper configs use `max_target_positions` instead of `max_position_embeddings`)
4. `examples/offline_inference/whisper_multilora_inference.py`
   - Add example script for Whisper Multi-LoRA inference
5. `tests/lora/test_whisper_lora.py`
   - Add unit tests for Whisper LoRA support
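As referenced in item 1 above, the model-side wiring might look roughly like this abridged sketch; the class body is heavily trimmed, and the mapping keys are taken from this description rather than from the final diff.

```python
from torch import nn

from vllm.model_executor.models.interfaces import SupportsLoRA


class WhisperForConditionalGeneration(nn.Module, SupportsLoRA):
    # Simplified packed-module keys: self-attention packs q/k/v,
    # cross-attention packs only k/v.
    packed_modules_mapping = {
        "qkv_proj": ["q_proj", "k_proj", "v_proj"],
        "kv_proj": ["k_proj", "v_proj"],
    }
```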
Test Plan
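One plausible way to exercise the new unit tests locally (the exact invocation is an assumption, not taken from the PR):

```bash
pytest tests/lora/test_whisper_lora.py -v
```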
Test Result (Unit Tests)
Manual Testing
Tested with openai/whisper-large-v3-turbo base model and custom LoRA adapters:
Example Usage
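A hypothetical offline-inference sketch: the adapter path, adapter name, and the encoder/decoder prompt structure below are assumptions, and the exact Whisper prompt format in vLLM may differ.

```python
from vllm import LLM, SamplingParams
from vllm.assets.audio import AudioAsset
from vllm.lora.request import LoRARequest

# Hypothetical adapter path; any Whisper LoRA adapter directory would do.
LORA_PATH = "/path/to/whisper-lora-adapter"

llm = LLM(
    model="openai/whisper-large-v3-turbo",
    enable_lora=True,
    max_loras=2,
    max_lora_rank=64,
)

prompt = {
    "encoder_prompt": {
        "prompt": "",
        # Bundled sample audio; replace with your own (waveform, sample_rate).
        "multi_modal_data": {
            "audio": AudioAsset("mary_had_lamb").audio_and_sample_rate
        },
    },
    "decoder_prompt": "<|startoftranscript|>",
}

outputs = llm.generate(
    prompt,
    SamplingParams(temperature=0, max_tokens=200),
    lora_request=LoRARequest("my_whisper_adapter", 1, LORA_PATH),
)
print(outputs[0].outputs[0].text)
```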
or
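serving adapters over the OpenAI-compatible server; the flags, adapter name, and paths below are illustrative, assuming vLLM's `--enable-lora` / `--lora-modules` options and the `/v1/audio/transcriptions` endpoint.

```bash
# Illustrative only: serve the base Whisper model with one named LoRA adapter.
vllm serve openai/whisper-large-v3-turbo \
    --enable-lora \
    --lora-modules my_whisper_adapter=/path/to/whisper-lora-adapter

# Request a transcription against the adapter by name.
curl http://localhost:8000/v1/audio/transcriptions \
    -F model=my_whisper_adapter \
    -F file=@sample.wav
```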