Conversation

wuxun-zhang
Contributor

@wuxun-zhang wuxun-zhang commented Aug 14, 2025

This PR adds data parallel support for the V1 Gaudi plugin.

  • add DP-aware padding
  • use all_gather and reduce_scatter
  • add a data parallel example
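The DP-aware padding above can be sketched as follows. This is a minimal pure-Python illustration, not the plugin's actual code: the function name and signature are assumptions, and in the real implementation the per-rank batch sizes would be exchanged via a collective (e.g. an all_reduce) rather than passed in directly.

```python
def dp_aware_pad(batch_sizes, rank):
    """Pad this rank's batch up to the max batch size across DP ranks.

    All ranks then execute the same graph shape, which keeps collective
    operations (and HPU graph compilation) consistent across ranks.
    Returns (target batch size, number of dummy requests to append).
    """
    target = max(batch_sizes)          # global max across DP ranks
    pad = target - batch_sizes[rank]   # dummy requests this rank appends
    return target, pad

# Ranks with uneven batches all pad up to the largest one.
sizes = [3, 5, 2, 4]                   # batch size per DP rank
target, pad = dp_aware_pad(sizes, rank=2)
print(target, pad)                     # rank 2 pads 3 dummy requests up to 5
```

The padded (dummy) requests are discarded after execution; their only purpose is shape consistency across ranks.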

@wuxun-zhang wuxun-zhang force-pushed the wuxun/v1-dp-attention branch 2 times, most recently from d4a4c41 to 5ad7ff8 Compare August 20, 2025 09:00
@wuxun-zhang wuxun-zhang marked this pull request as ready for review August 20, 2025 09:06
@wuxun-zhang wuxun-zhang force-pushed the wuxun/v1-dp-attention branch from a355755 to e9bc231 Compare August 20, 2025 15:11
@adobrzyn
Collaborator

Please resolve conflicts

@adobrzyn
Collaborator

/run-gaudi-tests

@wuxun-zhang wuxun-zhang force-pushed the wuxun/v1-dp-attention branch 2 times, most recently from 7fdd7dd to c056e11 Compare August 24, 2025 15:12
@wuxun-zhang
Contributor Author

@adobrzyn Removed unused code and rebased. Please review again.

@wuxun-zhang wuxun-zhang force-pushed the wuxun/v1-dp-attention branch from c056e11 to 86c8e41 Compare August 25, 2025 07:37
@wuxun-zhang
Contributor Author

/run-gaudi-tests

@sys-hab-pt-service
Collaborator

Only codeowners can request to run Gaudi tests. Contact list: kzawora-intel, xuechendi, mswiniarsk, adobrzyn

@wuxun-zhang wuxun-zhang force-pushed the wuxun/v1-dp-attention branch 2 times, most recently from eae75cd to 1142665 Compare August 27, 2025 02:06
@wuxun-zhang
Contributor Author

@adobrzyn @xuechendi @mswiniarsk @kzawora-intel Please help review this. Thanks.

@adobrzyn
Collaborator

/run-gaudi-tests

@wuxun-zhang
Contributor Author

It seems upstream vLLM changes break the Gaudi plugin.

FAILED vllm-gaudi/tests/unit_tests/worker/test_hpu_model_runner.py::test_init_kv_cache_without_kv_sharing - AttributeError: 'ModelConfig' object has no attribute 'is_multimodal_raw_input_supported'. Did you mean: 'is_multimodal_raw_input_only_model'?

@wuxun-zhang wuxun-zhang force-pushed the wuxun/v1-dp-attention branch from 7ae55fd to afe1a72 Compare September 3, 2025 08:26
@adobrzyn
Collaborator

adobrzyn commented Sep 3, 2025

/run-gaudi-tests

@adobrzyn adobrzyn requested a review from Copilot September 3, 2025 09:02

@Copilot Copilot AI left a comment

Pull Request Overview

This PR adds data parallel support for the V1 Gaudi plugin by implementing DP-aware padding mechanisms and collective operations.

  • Implements DP-aware padding for prefill and decode batches to ensure consistent tensor shapes across data parallel ranks
  • Adds collective communication operations (all_gather, reduce_scatter) for expert parallelism support
  • Includes a comprehensive data parallel example script with multi-node support
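The dispatch/combine pattern using all_gather and reduce_scatter can be illustrated with a pure-Python simulation. This is not the plugin's actual `hpu_communicator` code (which would use torch.distributed collectives on HPU); the function names and list-based "tensors" here are purely illustrative of the data movement.

```python
def all_gather(per_rank_tokens):
    # Dispatch: every rank receives the concatenation of all ranks' tokens,
    # so each rank's local experts can process the full token set.
    gathered = [t for rank in per_rank_tokens for t in rank]
    return [list(gathered) for _ in per_rank_tokens]

def reduce_scatter(per_rank_partials):
    # Combine: element-wise sum of partial expert outputs across ranks,
    # then each rank keeps only its own shard of the result.
    world = len(per_rank_partials)
    summed = [sum(vals) for vals in zip(*per_rank_partials)]
    shard = len(summed) // world
    return [summed[r * shard:(r + 1) * shard] for r in range(world)]

tokens = [[1, 2], [3, 4]]            # 2 DP ranks, 2 tokens each
dispatched = all_gather(tokens)      # both ranks now see [1, 2, 3, 4]
partials = [[t * 10 for t in r] for r in dispatched]  # stand-in expert compute
combined = reduce_scatter(partials)  # each rank gets its summed shard
print(combined)                      # [[20, 40], [60, 80]]
```

The key invariant is that reduce_scatter is the inverse of all_gather with a reduction folded in: after the round trip, each rank holds results only for its own tokens.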

Reviewed Changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 3 comments.

Summary per file:

  • vllm_gaudi/v1/worker/hpu_worker.py: Updates distributed initialization to handle data parallel configuration and adds dummy batch execution
  • vllm_gaudi/v1/worker/hpu_model_runner.py: Implements DP-aware padding logic and dummy batch creation for consistent execution across ranks
  • vllm_gaudi/platform.py: Adds simple compile backend configuration
  • vllm_gaudi/distributed/device_communicators/hpu_communicator.py: Implements dispatch/combine methods for expert parallelism with collective operations
  • tests/full_tests/ci_tests.sh: Adds a CI test for data parallel functionality
  • examples/data_parallel.py: Provides a complete example demonstrating data parallel usage
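A data parallel example script typically shards the prompt list so each DP rank serves its own slice. The helper below is a hypothetical sketch of that sharding step only (the name `shard_prompts` and its signature are assumptions, not the actual API of examples/data_parallel.py):

```python
def shard_prompts(prompts, dp_rank, dp_size):
    """Give each DP rank a contiguous slice of the prompt list.

    When the prompt count doesn't divide evenly, earlier ranks take
    one extra prompt so the slices cover the list exactly once.
    """
    base, extra = divmod(len(prompts), dp_size)
    start = dp_rank * base + min(dp_rank, extra)
    end = start + base + (1 if dp_rank < extra else 0)
    return prompts[start:end]

prompts = [f"prompt-{i}" for i in range(10)]
shards = [shard_prompts(prompts, r, dp_size=4) for r in range(4)]
print([len(s) for s in shards])  # [3, 3, 2, 2]
```

Each rank then runs its own engine on its shard; the DP-aware padding described above keeps execution in lockstep even when the shards have different sizes.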


wuxun-zhang and others added 2 commits September 3, 2025 16:48
- enable profile run

Signed-off-by: Wuxun Zhang <[email protected]>
Signed-off-by: Wuxun Zhang <[email protected]>
@adobrzyn
Collaborator

adobrzyn commented Sep 5, 2025

/run-gaudi-tests

@adobrzyn
Collaborator

adobrzyn commented Sep 8, 2025

/run-gaudi-tests

Collaborator

@kzawora-intel kzawora-intel left a comment

looks good overall, a couple of nitpicks here and there

Collaborator

@kzawora-intel kzawora-intel left a comment

looks good, thanks!

@kzawora-intel
Collaborator

/run-gaudi-tests

@wuxun-zhang wuxun-zhang force-pushed the wuxun/v1-dp-attention branch from 9508117 to fa5da77 Compare September 9, 2025 14:26
@kzawora-intel
Collaborator

/run-gaudi-tests

@kzawora-intel kzawora-intel enabled auto-merge (squash) September 9, 2025 14:49
auto-merge was automatically disabled September 9, 2025 15:10

Head branch was pushed to by a user without write access

@wuxun-zhang wuxun-zhang force-pushed the wuxun/v1-dp-attention branch from 3fd5295 to a7ca264 Compare September 9, 2025 15:11
@kzawora-intel
Collaborator

/run-gaudi-tests

@kzawora-intel kzawora-intel enabled auto-merge (squash) September 9, 2025 15:12
@kzawora-intel
Collaborator

/run-gaudi-tests

4 similar comments
@kzawora-intel
Collaborator

The CI is unstable and 8 reruns were unsuccessful, all due to infra issues. I'm merging this as is; CI passed multiple times previously.

@kzawora-intel kzawora-intel merged commit a2bcfca into vllm-project:main Sep 9, 2025
9 of 11 checks passed
kfojcik-intel pushed a commit to kfojcik-intel/vllm-gaudi that referenced this pull request Sep 12, 2025
This PR adds data parallel support for the V1 Gaudi plugin.

- [x] add DP-aware padding
- [x] use all_gather and reduce_scatter
- [x] add a data parallel example

---------

Signed-off-by: Wuxun Zhang <[email protected]>
Co-authored-by: Konrad Zawora <[email protected]>
Co-authored-by: Agata Dobrzyniewicz <[email protected]>
Signed-off-by: Katarzyna Fojcik <[email protected]>