
Conversation

@xieyangxu (Contributor) commented Dec 2, 2025

Multi-node (nnodes > 1) in external launcher mode is a valid use case.
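
To make the scope of the change concrete, here is a minimal sketch (not vLLM's actual code; the function name is illustrative) of the validation this PR relaxes: multi-node runs are now accepted for the "external_launcher" backend in addition to "mp" and "uni".

```python
# Illustrative stand-in for the check in vllm/config/parallel.py.
def validate_nnodes(distributed_executor_backend: str, nnodes: int) -> None:
    if (
        distributed_executor_backend not in ("mp", "uni", "external_launcher")
        and nnodes > 1
    ):
        raise ValueError(
            "nnodes > 1 can only be set when distributed executor "
            "backend is mp, uni or external_launcher."
        )

validate_nnodes("external_launcher", 2)  # accepted after this change
```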

@gemini-code-assist (bot) left a comment
Code Review

This pull request enables multi-node support for the external_launcher distributed backend by allowing nnodes > 1. While the change in vllm/config/parallel.py is correct in principle, my review has identified a critical issue that this change exposes in the distributed environment initialization logic. Specifically, the init_distributed_environment function in vllm/distributed/parallel_state.py will incorrectly override the distributed_init_method for external_launcher in a multi-node setup, which will likely cause failures. This needs to be addressed for the feature to work as intended.

Comment on lines 596 to 600

    if self.distributed_executor_backend not in ("mp", "uni", "external_launcher") and self.nnodes > 1:
        raise ValueError(
            "nnodes > 1 can only be set when distributed executor "
-           "backend is mp or uni."
+           "backend is mp, uni or external_launcher."
        )
gemini-code-assist (bot), severity: critical

This change correctly allows nnodes > 1 for the external_launcher backend. However, it exposes a critical issue in vllm/distributed/parallel_state.py.

The init_distributed_environment function (lines 1172-1178) unconditionally overrides distributed_init_method when nnodes > 1. This will break external_launcher, which relies on distributed_init_method="env://" for multi-node setup with tools like torchrun. The override logic uses master_addr and master_port, which is intended for the mp backend.

To fix this, the logic in init_distributed_environment should be made conditional, for example:

if (config is not None and config.parallel_config.nnodes > 1 and
        config.parallel_config.distributed_executor_backend == "mp"):
    # ... existing override logic

Without this correction in the related file, multi-node execution with external_launcher will likely fail.

@xieyangxu (Contributor, author) replied:
Updated accordingly and refactored the if-elif branch in parallel_state.py for simplicity.

@chatgpt-codex-connector (bot) left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment on lines 596 to 599

    if self.distributed_executor_backend not in ("mp", "uni", "external_launcher") and self.nnodes > 1:
        raise ValueError(
            "nnodes > 1 can only be set when distributed executor "
-           "backend is mp or uni."
+           "backend is mp, uni or external_launcher."
P1: Avoid double-counting DP world size for external_launcher multi-node

Allowing nnodes > 1 with distributed_executor_backend="external_launcher" (lines 596-599) now sends these runs down the multi-node init path. However, in this mode world_size has already been multiplied by data_parallel_size in post_init (lines 510-512), and world_size_across_dp multiplies by data_parallel_size again (lines 323-326). For external-launcher configs that set both data_parallel_size > 1 and nnodes > 1, init_distributed_environment will call init_process_group with a world size that is data_parallel_size times larger than the torchrun WORLD_SIZE, leading to initialization failures or hangs for those multi-node jobs. The DP factor should not be applied twice for external_launcher when nnodes > 1.
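
The arithmetic behind this concern can be sketched as follows (names follow the review; the sizes are made-up example values, not from the PR):

```python
# Illustrative only: how the DP factor gets applied twice for
# external_launcher multi-node runs, per the review comment above.
tensor_parallel_size = 8
data_parallel_size = 2

# torchrun launches exactly TP * DP processes across the nodes:
torchrun_world_size = tensor_parallel_size * data_parallel_size  # 16

# post_init has already folded DP into world_size ...
world_size = tensor_parallel_size * data_parallel_size  # 16

# ... and world_size_across_dp multiplies by DP again:
world_size_across_dp = world_size * data_parallel_size  # 32

# init_process_group would then wait for 32 ranks while torchrun only
# started 16, so initialization hangs or fails.
assert world_size_across_dp == data_parallel_size * torchrun_world_size
```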


@hmellor (Member) left a comment

Was this an oversight or was it previously unsupported?

If it was previously unsupported, can you link to where support was added?

@xieyangxu (Contributor, author) replied:

> Was this an oversight or was it previously unsupported?
>
> If it was previously unsupported, can you link to where support was added?

#23691 added multi-node support originally for mp backend but it works for external launcher mode as well.

@hmellor (Member) commented Dec 2, 2025

Thanks for the extra context, would it be worth adding a test?

Also pre-commit is failing, try:

    uv pip install pre-commit
    pre-commit install
    pre-commit run -a

@zhuohan123 (Member) left a comment

Please follow the codex comment to add more asserts and also consider adding some tests.

xieyangxu added a commit to xieyangxu/vllm that referenced this pull request Dec 3, 2025
Summary: Pull Request resolved: vllm-project#29833

Differential Revision: D88115795
@xieyangxu (Contributor, author):

/gemini review

@xieyangxu (Contributor, author):

@codex review

@gemini-code-assist (bot) left a comment

Code Review

This pull request enables multi-node support for the external_launcher mode. This is achieved by allowing nnodes > 1 with this backend and letting torch.distributed handle initialization from environment variables, which is the correct approach for external launchers. The related logic for other distributed backends has been refactored to accommodate this change. The changes look good, but I've identified a minor regression in the refactoring that affects logging, which could be confusing for debugging.

Comment on lines 1187 to 1201

    # Use appropriate IP and port based on configuration
    if parallel_config.nnodes > 1:
        ip = parallel_config.master_addr
        port = parallel_config.master_port
    else:
        ip = parallel_config.data_parallel_master_ip
        port = parallel_config.get_next_dp_init_port()
    logger.debug(
        "Adjusting world_size=%d rank=%d distributed_init_method=%s for DP",
        world_size,
        rank,
        distributed_init_method,
    )

    distributed_init_method = get_distributed_init_method(ip, port)
gemini-code-assist (bot), severity: high

This refactoring introduced a regression. The logger.debug call on line 1194 now logs distributed_init_method before it is updated with the new ip and port for the data-parallel case (the update happens on line 1201); in the original code, the log ran after the update. This can be misleading when debugging.

To fix this, move the get_distributed_init_method call before the logger.debug call so the log shows the updated value.

Suggested change

-   # Use appropriate IP and port based on configuration
-   if parallel_config.nnodes > 1:
-       ip = parallel_config.master_addr
-       port = parallel_config.master_port
-   else:
-       ip = parallel_config.data_parallel_master_ip
-       port = parallel_config.get_next_dp_init_port()
-   logger.debug(
-       "Adjusting world_size=%d rank=%d distributed_init_method=%s for DP",
-       world_size,
-       rank,
-       distributed_init_method,
-   )
-   distributed_init_method = get_distributed_init_method(ip, port)
+   # Use appropriate IP and port based on configuration
+   if parallel_config.nnodes > 1:
+       ip = parallel_config.master_addr
+       port = parallel_config.master_port
+       distributed_init_method = get_distributed_init_method(ip, port)
+   else:
+       ip = parallel_config.data_parallel_master_ip
+       port = parallel_config.get_next_dp_init_port()
+       distributed_init_method = get_distributed_init_method(ip, port)
+   logger.debug(
+       "Adjusting world_size=%d rank=%d distributed_init_method=%s for DP",
+       world_size,
+       rank,
+       distributed_init_method,
+   )

@xieyangxu (Contributor, author) replied:
Updated.

@chatgpt-codex-connector (bot):

Codex Review: Didn't find any major issues. Delightful!


@zhuohan123 added the "ready" label (ONLY add when PR is ready to merge/full CI is needed) Dec 3, 2025
@xieyangxu force-pushed the export-D88115795 branch 2 times, most recently from 439d6bb to 4e5ff2e, December 3, 2025 07:21
@zhuohan123 zhuohan123 enabled auto-merge (squash) December 3, 2025 07:34
auto-merge was automatically disabled December 3, 2025 07:41

Head branch was pushed to by a user without write access

@zhuohan123 merged commit ad32e3e into vllm-project:main Dec 4, 2025
52 checks passed
PatrykSaffer pushed a commit to PatrykSaffer/vllm that referenced this pull request Dec 4, 2025
charlotte12l pushed a commit to charlotte12l/vllm that referenced this pull request Dec 5, 2025
