Conversation

wuxun-zhang
Contributor

@wuxun-zhang wuxun-zhang commented Aug 14, 2025

This PR adds data parallel support for the V1 Gaudi plugin.

  • add DP-aware padding
  • use all_gather and reduce_scatter
  • add a data parallel example
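The DP-aware padding above can be sketched as follows. This is a minimal pure-Python illustration, not the plugin's actual code: the function name and signature are assumptions, and in the real implementation the per-rank batch sizes would be exchanged via a collective (e.g. an all_reduce) rather than passed in directly.

```python
def dp_aware_pad(batch_sizes, rank):
    """Pad this rank's batch up to the max batch size across DP ranks.

    All ranks then execute the same graph shape, which keeps collective
    operations (and HPU graph compilation) consistent across ranks.
    Returns (target batch size, number of dummy requests to append).
    """
    target = max(batch_sizes)          # global max across DP ranks
    pad = target - batch_sizes[rank]   # dummy requests this rank appends
    return target, pad

# Ranks with uneven batches all pad up to the largest one.
sizes = [3, 5, 2, 4]                   # batch size per DP rank
target, pad = dp_aware_pad(sizes, rank=2)
print(target, pad)                     # rank 2 pads 3 dummy requests up to 5
```

The padded (dummy) requests are discarded after execution; their only purpose is shape consistency across ranks.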

@wuxun-zhang wuxun-zhang force-pushed the wuxun/v1-dp-attention branch 2 times, most recently from d4a4c41 to 5ad7ff8 Compare August 20, 2025 09:00
@wuxun-zhang wuxun-zhang marked this pull request as ready for review August 20, 2025 09:06
@wuxun-zhang wuxun-zhang force-pushed the wuxun/v1-dp-attention branch from a355755 to e9bc231 Compare August 20, 2025 15:11
@adobrzyn
Collaborator

Please resolve conflicts

@adobrzyn
Collaborator

/run-gaudi-tests

@wuxun-zhang wuxun-zhang force-pushed the wuxun/v1-dp-attention branch 2 times, most recently from 7fdd7dd to c056e11 Compare August 24, 2025 15:12
@wuxun-zhang
Contributor Author

@adobrzyn Removed unused code and rebased. Please review again.

@wuxun-zhang wuxun-zhang force-pushed the wuxun/v1-dp-attention branch from c056e11 to 86c8e41 Compare August 25, 2025 07:37
@wuxun-zhang
Contributor Author

/run-gaudi-tests

@sys-hab-pt-service
Collaborator

Only codeowners can request to run Gaudi tests. Contact list: kzawora-intel, xuechendi, mswiniarsk, adobrzyn

@wuxun-zhang wuxun-zhang force-pushed the wuxun/v1-dp-attention branch 2 times, most recently from eae75cd to 1142665 Compare August 27, 2025 02:06
@wuxun-zhang
Contributor Author

@adobrzyn @xuechendi @mswiniarsk @kzawora-intel Please help review this. Thanks.

@adobrzyn
Collaborator

/run-gaudi-tests

@wuxun-zhang
Contributor Author

It seems upstream vLLM changes break the Gaudi plugin.

FAILED vllm-gaudi/tests/unit_tests/worker/test_hpu_model_runner.py::test_init_kv_cache_without_kv_sharing - AttributeError: 'ModelConfig' object has no attribute 'is_multimodal_raw_input_supported'. Did you mean: 'is_multimodal_raw_input_only_model'?

@wuxun-zhang wuxun-zhang force-pushed the wuxun/v1-dp-attention branch from 7ae55fd to afe1a72 Compare September 3, 2025 08:26
@adobrzyn
Collaborator

adobrzyn commented Sep 3, 2025

/run-gaudi-tests

@adobrzyn adobrzyn requested a review from Copilot September 3, 2025 09:02

@Copilot Copilot AI left a comment

Pull Request Overview

This PR adds data parallel support for the V1 Gaudi plugin by implementing DP-aware padding mechanisms and collective operations.

  • Implements DP-aware padding for prefill and decode batches to ensure consistent tensor shapes across data parallel ranks
  • Adds collective communication operations (all_gather, reduce_scatter) for expert parallelism support
  • Includes a comprehensive data parallel example script with multi-node support
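The dispatch/combine pattern using all_gather and reduce_scatter can be illustrated with a pure-Python simulation. This is not the plugin's actual `hpu_communicator` code (which would use torch.distributed collectives on HPU); the function names and list-based "tensors" here are purely illustrative of the data movement.

```python
def all_gather(per_rank_tokens):
    # Dispatch: every rank receives the concatenation of all ranks' tokens,
    # so each rank's local experts can process the full token set.
    gathered = [t for rank in per_rank_tokens for t in rank]
    return [list(gathered) for _ in per_rank_tokens]

def reduce_scatter(per_rank_partials):
    # Combine: element-wise sum of partial expert outputs across ranks,
    # then each rank keeps only its own shard of the result.
    world = len(per_rank_partials)
    summed = [sum(vals) for vals in zip(*per_rank_partials)]
    shard = len(summed) // world
    return [summed[r * shard:(r + 1) * shard] for r in range(world)]

tokens = [[1, 2], [3, 4]]            # 2 DP ranks, 2 tokens each
dispatched = all_gather(tokens)      # both ranks now see [1, 2, 3, 4]
partials = [[t * 10 for t in r] for r in dispatched]  # stand-in expert compute
combined = reduce_scatter(partials)  # each rank gets its summed shard
print(combined)                      # [[20, 40], [60, 80]]
```

The key invariant is that reduce_scatter is the inverse of all_gather with a reduction folded in: after the round trip, each rank holds results only for its own tokens.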

Reviewed Changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 3 comments.

Summary per file:

  • vllm_gaudi/v1/worker/hpu_worker.py: Updates distributed initialization to handle data parallel configuration and adds dummy batch execution
  • vllm_gaudi/v1/worker/hpu_model_runner.py: Implements DP-aware padding logic and dummy batch creation for consistent execution across ranks
  • vllm_gaudi/platform.py: Adds simple compile backend configuration
  • vllm_gaudi/distributed/device_communicators/hpu_communicator.py: Implements dispatch/combine methods for expert parallelism with collective operations
  • tests/full_tests/ci_tests.sh: Adds a CI test for data parallel functionality
  • examples/data_parallel.py: Provides a complete example demonstrating data parallel usage
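A data parallel example script typically shards the prompt list so each DP rank serves its own slice. The helper below is a hypothetical sketch of that sharding step only (the name `shard_prompts` and its signature are assumptions, not the actual API of examples/data_parallel.py):

```python
def shard_prompts(prompts, dp_rank, dp_size):
    """Give each DP rank a contiguous slice of the prompt list.

    When the prompt count doesn't divide evenly, earlier ranks take
    one extra prompt so the slices cover the list exactly once.
    """
    base, extra = divmod(len(prompts), dp_size)
    start = dp_rank * base + min(dp_rank, extra)
    end = start + base + (1 if dp_rank < extra else 0)
    return prompts[start:end]

prompts = [f"prompt-{i}" for i in range(10)]
shards = [shard_prompts(prompts, r, dp_size=4) for r in range(4)]
print([len(s) for s in shards])  # [3, 3, 2, 2]
```

Each rank then runs its own engine on its shard; the DP-aware padding described above keeps execution in lockstep even when the shards have different sizes.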


wuxun-zhang and others added 2 commits September 3, 2025 16:48
- enable profile run

Signed-off-by: Wuxun Zhang <[email protected]>
Signed-off-by: Wuxun Zhang <[email protected]>
@adobrzyn
Collaborator

adobrzyn commented Sep 5, 2025

/run-gaudi-tests

@adobrzyn
Collaborator

adobrzyn commented Sep 8, 2025

/run-gaudi-tests

Collaborator

@kzawora-intel kzawora-intel left a comment

looks good overall, a couple of nitpicks here and there

Collaborator

@kzawora-intel kzawora-intel left a comment

looks good, thanks!

@kzawora-intel
Collaborator

/run-gaudi-tests

@wuxun-zhang wuxun-zhang force-pushed the wuxun/v1-dp-attention branch from 9508117 to fa5da77 Compare September 9, 2025 14:26
@kzawora-intel
Collaborator

/run-gaudi-tests

@kzawora-intel kzawora-intel enabled auto-merge (squash) September 9, 2025 14:49
auto-merge was automatically disabled September 9, 2025 15:10

Head branch was pushed to by a user without write access

@wuxun-zhang wuxun-zhang force-pushed the wuxun/v1-dp-attention branch from 3fd5295 to a7ca264 Compare September 9, 2025 15:11
@kzawora-intel
Collaborator

/run-gaudi-tests

@kzawora-intel kzawora-intel enabled auto-merge (squash) September 9, 2025 15:12
@kzawora-intel
Collaborator

/run-gaudi-tests

4 similar comments
@kzawora-intel
Collaborator

The CI is unstable and 8 reruns were unsuccessful, all due to infra issues. I'm merging this as is; CI passed multiple times previously.

@kzawora-intel kzawora-intel merged commit a2bcfca into vllm-project:main Sep 9, 2025
9 of 11 checks passed
kfojcik-intel pushed a commit to kfojcik-intel/vllm-gaudi that referenced this pull request Sep 12, 2025
This PR adds data parallel support for the V1 Gaudi plugin.

- [x] add DP-aware padding
- [x] use all_gather and reduce_scatter
- [x] add a data parallel example

---------

Signed-off-by: Wuxun Zhang <[email protected]>
Co-authored-by: Konrad Zawora <[email protected]>
Co-authored-by: Agata Dobrzyniewicz <[email protected]>
Signed-off-by: Katarzyna Fojcik <[email protected]>