[Model] Add LoRA support for Whisper models #29856
Conversation
Documentation preview: https://vllm--29856.org.readthedocs.build/en/29856/
Code Review
This pull request introduces multi-LoRA support for Whisper models, which is a valuable addition. The implementation is robust and well-engineered. I appreciate that instead of a model-specific hack, the changes generalize the existing LoRA infrastructure to support Whisper's architecture, particularly the KV-only packed layers in cross-attention. The inclusion of comprehensive unit tests and a clear example script significantly enhances the quality and usability of this contribution. The code is clean, the logic is sound, and the changes are well-documented. Overall, this is an excellent pull request.
Force-pushed from 8897a78 to 93182eb:
This PR enables Multi-LoRA support for Whisper speech-to-text models, allowing users to serve multiple fine-tuned Whisper adapters from a single base model.

Changes:
- Add SupportsLoRA interface to WhisperForConditionalGeneration
- Add embedding_modules and embedding_padding_modules attributes
- Update packed_modules_mapping for LoRA compatibility
- Extend MergedQKVParallelLinearWithLoRA to support KV-only (2-slice) configurations used in Whisper's cross-attention layers
- Add fallback to max_target_positions in WorkerLoRAManager for Whisper compatibility
- Add example script for Whisper Multi-LoRA inference
- Add unit tests for Whisper LoRA support

Signed-off-by: daje0601 <[email protected]>
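The `max_target_positions` fallback mentioned in the commit message above could look roughly like the sketch below. This only illustrates the idea, not the PR's actual diff; the helper name and standalone-function form are assumptions.

```python
from transformers import PretrainedConfig


def get_max_positions(hf_config: PretrainedConfig) -> int | None:
    """Sketch: return max_position_embeddings, falling back to Whisper's
    max_target_positions when the usual attribute is absent."""
    max_pos = getattr(hf_config, "max_position_embeddings", None)
    if max_pos is None:
        # WhisperConfig exposes max_target_positions instead of
        # max_position_embeddings.
        max_pos = getattr(hf_config, "max_target_positions", None)
    return max_pos
```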
Force-pushed from 93182eb to ba3826b.
Will look at this PR ASAP, also cc @NickLucche
jeejeelee left a comment:
Thank you for your contribution. The main concern is whether we should use MergedColumnParallelLinear rather than QKVLinear in the base model.
# LoRA-specific attributes
embedding_modules = {}
embedding_padding_modules: list[str] = []
If the model inherits from SupportsLoRA, these two attributes are empty by default
Thank you, I'll remove these redundant attributes.
@@ -0,0 +1,136 @@
# SPDX-License-Identifier: Apache-2.0
It looks like this example is similar to multilora_inference.py, so do we need to add this example?
You're right - it's similar to the existing multilora_inference.py.
I'll remove whisper_multilora_inference.py from this PR.
packed_modules_list: list,
model_config: PretrainedConfig | None = None,
) -> bool:
    return type(source_layer) is QKVParallelLinear and len(packed_modules_list) == 3
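For context, the hunk above is the check being discussed. The KV-only relaxation described in this PR (and reconsidered below) would look roughly like the following standalone sketch; the import paths and function form are assumptions, not the exact diff.

```python
from torch import nn
from transformers import PretrainedConfig

from vllm.config import LoRAConfig
from vllm.model_executor.layers.linear import QKVParallelLinear


def can_replace_layer(
    source_layer: nn.Module,
    lora_config: LoRAConfig,
    packed_modules_list: list,
    model_config: PretrainedConfig | None = None,
) -> bool:
    # Accept the usual 3-slice (q_proj, k_proj, v_proj) packing as well as
    # Whisper's KV-only 2-slice cross-attention packing (k_proj, v_proj).
    return type(source_layer) is QKVParallelLinear and len(
        packed_modules_list) in (2, 3)
```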
Can we use MergedColumnParallelLinear rather than QKVParallelLinear in the base model?
I will:
- Revert my changes to MergedQKVParallelLinearWithLoRA in column_parallel_linear.py
- Update whisper.py to use MergedColumnParallelLinear for the cross-attention's kv_proj layer
I'll update the PR with these changes shortly. Thanks again for the review!
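A rough sketch of what that whisper.py change could look like, assuming vLLM's MergedColumnParallelLinear API; the wrapper class and the names embed_dim, num_heads, and head_dim are illustrative, the bias handling is simplified, and instantiating the layer requires vLLM's parallel state to be initialized.

```python
from torch import nn

from vllm.model_executor.layers.linear import MergedColumnParallelLinear


class WhisperCrossAttentionKV(nn.Module):
    """Illustrative only: fused K/V projection for cross-attention built
    from MergedColumnParallelLinear (two equal output slices, K then V)."""

    def __init__(self, embed_dim: int, num_heads: int, head_dim: int,
                 prefix: str = "") -> None:
        super().__init__()
        self.kv_proj = MergedColumnParallelLinear(
            input_size=embed_dim,
            # One output slice per packed sub-module: k_proj and v_proj.
            output_sizes=[num_heads * head_dim] * 2,
            bias=True,  # simplified; Whisper uses a bias on v_proj but not k_proj
            prefix=f"{prefix}.kv_proj",
        )
```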
This pull request has merge conflicts that must be resolved before it can be merged.
Purpose
This PR enables Multi-LoRA support for Whisper speech-to-text models, allowing users to serve multiple fine-tuned Whisper adapters from a single base model.
Background
Currently, vLLM's `WhisperForConditionalGeneration` does not implement the `SupportsLoRA` interface, preventing users from using LoRA adapters with Whisper models. This limitation requires users to deploy separate model instances for each fine-tuned variant, which is inefficient in terms of GPU memory usage.
Changes
1. `vllm/model_executor/models/whisper.py` (a sketch of this change follows the list)
   - Add `SupportsLoRA` interface to `WhisperForConditionalGeneration`
   - Add `embedding_modules` and `embedding_padding_modules` attributes required by LoRA
   - Update `packed_modules_mapping` to use simplified keys (`qkv_proj`, `kv_proj`) for LoRA compatibility
2. `vllm/lora/layers/column_parallel_linear.py`
   - Extend `MergedQKVParallelLinearWithLoRA` to support KV-only (2-slice) configurations
   - Whisper's cross-attention layers (`encoder_attn.kv_proj`) only have K and V projections, not Q
   - Update `can_replace_layer()` to accept both 2-module and 3-module configurations
   - Update `slice_lora_a()` to dynamically handle a variable number of slices
3. `vllm/lora/worker_manager.py`
   - Add fallback to `max_target_positions` when `max_position_embeddings` is not available (Whisper configs use `max_target_positions` instead of `max_position_embeddings`)
4. `examples/offline_inference/whisper_multilora_inference.py`
   - Add example script for Whisper Multi-LoRA inference
5. `tests/lora/test_whisper_lora.py`
   - Add unit tests for Whisper LoRA support
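As referenced in item 1 above, the model-side wiring might look roughly like this abridged sketch; the class body is heavily trimmed, and the mapping keys are taken from this description rather than from the final diff.

```python
from torch import nn

from vllm.model_executor.models.interfaces import SupportsLoRA


class WhisperForConditionalGeneration(nn.Module, SupportsLoRA):
    # Simplified packed-module keys: self-attention packs q/k/v,
    # cross-attention packs only k/v.
    packed_modules_mapping = {
        "qkv_proj": ["q_proj", "k_proj", "v_proj"],
        "kv_proj": ["k_proj", "v_proj"],
    }
```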
Test Plan
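One plausible way to exercise the new unit tests locally (the exact invocation is an assumption, not taken from the PR):

```bash
pytest tests/lora/test_whisper_lora.py -v
```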
Test Result (Unit Tests)
Manual Testing
Tested with openai/whisper-large-v3-turbo base model and custom LoRA adapters:
Example Usage
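A hypothetical offline-inference sketch: the adapter path, adapter name, and the encoder/decoder prompt structure below are assumptions, and the exact Whisper prompt format in vLLM may differ.

```python
from vllm import LLM, SamplingParams
from vllm.assets.audio import AudioAsset
from vllm.lora.request import LoRARequest

# Hypothetical adapter path; any Whisper LoRA adapter directory would do.
LORA_PATH = "/path/to/whisper-lora-adapter"

llm = LLM(
    model="openai/whisper-large-v3-turbo",
    enable_lora=True,
    max_loras=2,
    max_lora_rank=64,
)

prompt = {
    "encoder_prompt": {
        "prompt": "",
        # Bundled sample audio; replace with your own (waveform, sample_rate).
        "multi_modal_data": {
            "audio": AudioAsset("mary_had_lamb").audio_and_sample_rate
        },
    },
    "decoder_prompt": "<|startoftranscript|>",
}

outputs = llm.generate(
    prompt,
    SamplingParams(temperature=0, max_tokens=200),
    lora_request=LoRARequest("my_whisper_adapter", 1, LORA_PATH),
)
print(outputs[0].outputs[0].text)
```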
or
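serving adapters over the OpenAI-compatible server; the flags, adapter name, and paths below are illustrative, assuming vLLM's `--enable-lora` / `--lora-modules` options and the `/v1/audio/transcriptions` endpoint.

```bash
# Illustrative only: serve the base Whisper model with one named LoRA adapter.
vllm serve openai/whisper-large-v3-turbo \
    --enable-lora \
    --lora-modules my_whisper_adapter=/path/to/whisper-lora-adapter

# Request a transcription against the adapter by name.
curl http://localhost:8000/v1/audio/transcriptions \
    -F model=my_whisper_adapter \
    -F file=@sample.wav
```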