enable multi-node in external launcher mode #29833
Conversation
Code Review
This pull request enables multi-node support for the external_launcher distributed backend by allowing nnodes > 1. While the change in vllm/config/parallel.py is correct in principle, my review has identified a critical issue that this change exposes in the distributed environment initialization logic. Specifically, the init_distributed_environment function in vllm/distributed/parallel_state.py will incorrectly override the distributed_init_method for external_launcher in a multi-node setup, which will likely cause failures. This needs to be addressed for the feature to work as intended.
vllm/config/parallel.py
Outdated
```diff
 if self.distributed_executor_backend not in ("mp", "uni", "external_launcher") and self.nnodes > 1:
     raise ValueError(
         "nnodes > 1 can only be set when distributed executor "
-        "backend is mp or uni."
+        "backend is mp, uni or external_launcher."
     )
```
This change correctly allows nnodes > 1 for the external_launcher backend. However, it exposes a critical issue in vllm/distributed/parallel_state.py.
The init_distributed_environment function (lines 1172-1178) unconditionally overrides distributed_init_method when nnodes > 1. This will break external_launcher, which relies on distributed_init_method="env://" for multi-node setup with tools like torchrun. The override logic uses master_addr and master_port, which is intended for the mp backend.
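For context, an external launcher such as torchrun exports the rendezvous details as environment variables, and env:// initialization simply reads them. The following minimal sketch shows that flow using standard torch.distributed calls (it is illustrative, not code from this PR), and why replacing the init method with vLLM's own master_addr/master_port would conflict with it:

```python
import os
import torch.distributed as dist

# torchrun sets MASTER_ADDR, MASTER_PORT, RANK and WORLD_SIZE for every process.
# With init_method="env://", init_process_group reads them directly, so vLLM
# must not substitute its own master address/port for external_launcher runs.
dist.init_process_group(
    backend="nccl",
    init_method="env://",
    rank=int(os.environ["RANK"]),
    world_size=int(os.environ["WORLD_SIZE"]),
)
```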
To fix this, the logic in init_distributed_environment should be made conditional, for example:

```python
if (config is not None and config.parallel_config.nnodes > 1 and
        config.parallel_config.distributed_executor_backend == "mp"):
    # ... existing override logic
```

Without this correction in the related file, multi-node execution with external_launcher will likely fail.
Updated accordingly and refactored the if-elif branch in parallel_state.py for simplicity.
💡 Codex Review
Here are some automated review suggestions for this pull request.
vllm/config/parallel.py
Outdated
```diff
 if self.distributed_executor_backend not in ("mp", "uni", "external_launcher") and self.nnodes > 1:
     raise ValueError(
         "nnodes > 1 can only be set when distributed executor "
-        "backend is mp or uni."
+        "backend is mp, uni or external_launcher."
```
Avoid double-counting DP world size for external_launcher multi-node
Allowing nnodes > 1 with distributed_executor_backend="external_launcher" (lines 596-599) now sends these runs down the multi-node init path, but in this mode world_size has already been multiplied by data_parallel_size in post_init (lines 510-512), and world_size_across_dp multiplies by data_parallel_size again (lines 323-326). For external launcher configs that set both data_parallel_size > 1 and nnodes > 1, init_distributed_environment will therefore call init_process_group with a world size that is data_parallel_size times larger than the torchrun WORLD_SIZE, leading to initialization failures or hangs for those multi-node jobs. The DP factor should not be applied twice for external_launcher when nnodes > 1; the sketch below illustrates the mismatch.
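A minimal arithmetic sketch of the mismatch described above; the values are illustrative placeholders, not taken from the PR:

```python
# Illustrative setup: TP=4, PP=1, DP=2 spread across 2 nodes (8 GPUs total).
tensor_parallel_size = 4
pipeline_parallel_size = 1
data_parallel_size = 2

# torchrun launches one process per GPU, so it exports WORLD_SIZE == 8.
torchrun_world_size = 8

# In external_launcher mode, post_init already folds DP into world_size ...
world_size = tensor_parallel_size * pipeline_parallel_size * data_parallel_size  # 8
# ... and world_size_across_dp multiplies by the DP factor a second time.
world_size_across_dp = world_size * data_parallel_size  # 16

# init_process_group would be asked for 16 ranks while torchrun only started 8,
# so initialization fails or hangs.
assert world_size_across_dp != torchrun_world_size
```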
hmellor
left a comment
Was this an oversight or was it previously unsupported?
If it was previously unsupported, can you link to where support was added?
#23691 originally added multi-node support for the mp backend, but it works for external launcher mode as well.
Thanks for the extra context, would it be worth adding a test? Also pre-commit is failing, try:

```bash
uv pip install pre-commit
pre-commit install
pre-commit run -a
```
zhuohan123
left a comment
Please follow the codex comment to add more asserts and also consider adding some tests.
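One possible shape for such an assert, as a hedged sketch only (the helper name and check location are hypothetical, not part of this PR):

```python
import torch.distributed as dist

def _check_external_launcher_world_size(parallel_config) -> None:
    # Hypothetical guard: in external_launcher mode the launcher (e.g. torchrun)
    # owns WORLD_SIZE, so the configured world size must match it exactly and
    # the DP factor must not be applied a second time.
    if dist.is_initialized():
        launcher_world_size = dist.get_world_size()
        assert parallel_config.world_size == launcher_world_size, (
            f"external_launcher expects world_size to equal the launcher's "
            f"WORLD_SIZE, got {parallel_config.world_size} vs {launcher_world_size}"
        )
```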
/gemini review

@codex review
Code Review
This pull request enables multi-node support for the external_launcher mode. This is achieved by allowing nnodes > 1 with this backend and letting torch.distributed handle initialization from environment variables, which is the correct approach for external launchers. The related logic for other distributed backends has been refactored to accommodate this change. The changes look good, but I've identified a minor regression in the refactoring that affects logging, which could be confusing for debugging.
vllm/distributed/parallel_state.py
Outdated
```python
# Use appropriate IP and port based on configuration
if parallel_config.nnodes > 1:
    ip = parallel_config.master_addr
    port = parallel_config.master_port
else:
    ip = parallel_config.data_parallel_master_ip
    port = parallel_config.get_next_dp_init_port()
logger.debug(
    "Adjusting world_size=%d rank=%d distributed_init_method=%s for DP",
    world_size,
    rank,
    distributed_init_method,
)

distributed_init_method = get_distributed_init_method(ip, port)
```
This refactoring has introduced a regression. The logger.debug call on line 1194 now logs the value of distributed_init_method before it is updated with the new ip and port for the data parallelism case (the update happens on line 1201). In the original code, the log happened after the update. This can be misleading when debugging.
To fix this, the get_distributed_init_method call should be moved so that distributed_init_method is updated before it is used in the log message. Suggested change:
Before:

```python
# Use appropriate IP and port based on configuration
if parallel_config.nnodes > 1:
    ip = parallel_config.master_addr
    port = parallel_config.master_port
else:
    ip = parallel_config.data_parallel_master_ip
    port = parallel_config.get_next_dp_init_port()
logger.debug(
    "Adjusting world_size=%d rank=%d distributed_init_method=%s for DP",
    world_size,
    rank,
    distributed_init_method,
)
distributed_init_method = get_distributed_init_method(ip, port)
```

After:

```python
# Use appropriate IP and port based on configuration
if parallel_config.nnodes > 1:
    ip = parallel_config.master_addr
    port = parallel_config.master_port
    distributed_init_method = get_distributed_init_method(ip, port)
else:
    ip = parallel_config.data_parallel_master_ip
    port = parallel_config.get_next_dp_init_port()
    distributed_init_method = get_distributed_init_method(ip, port)
logger.debug(
    "Adjusting world_size=%d rank=%d distributed_init_method=%s for DP",
    world_size,
    rank,
    distributed_init_method,
)
```
updated
Codex Review: Didn't find any major issues. Delightful!
Head branch was pushed to by a user without write access
Signed-off-by: Xingyu Liu <[email protected]>
Multi-node (nnodes > 1) in external launcher mode is a valid use case.
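As an illustration of that use case, here is a hedged sketch of an externally launched multi-node run. The model name, host name, port, and parallel sizes are placeholders; the torchrun flags are the standard CLI ones, and how the new nnodes setting is threaded into the engine configuration is not shown here:

```python
# Hedged sketch: offline inference driven by an external launcher (torchrun).
# Launch on each of the 2 nodes, e.g.:
#   torchrun --nnodes=2 --nproc_per_node=8 --node_rank=<0|1> \
#            --master_addr=<master-host> --master_port=29500 this_script.py
from vllm import LLM, SamplingParams

# torchrun provides RANK/WORLD_SIZE/MASTER_ADDR/MASTER_PORT; with
# distributed_executor_backend="external_launcher", vLLM joins that existing
# process group via env:// instead of spawning its own workers.
llm = LLM(
    model="facebook/opt-125m",        # placeholder model
    tensor_parallel_size=16,          # placeholder; should match the total launcher world size here
    distributed_executor_backend="external_launcher",
)

outputs = llm.generate(["Hello, my name is"], SamplingParams(max_tokens=16))
for out in outputs:
    print(out.outputs[0].text)
```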