enable multi-node in external launcher mode #29833
Conversation
Code Review
This pull request enables multi-node support for the external_launcher distributed backend by allowing nnodes > 1. While the change in vllm/config/parallel.py is correct in principle, my review has identified a critical issue that this change exposes in the distributed environment initialization logic. Specifically, the init_distributed_environment function in vllm/distributed/parallel_state.py will incorrectly override the distributed_init_method for external_launcher in a multi-node setup, which will likely cause failures. This needs to be addressed for the feature to work as intended.
vllm/config/parallel.py
Outdated
```diff
 if self.distributed_executor_backend not in ("mp", "uni", "external_launcher") and self.nnodes > 1:
     raise ValueError(
         "nnodes > 1 can only be set when distributed executor "
-        "backend is mp or uni."
+        "backend is mp, uni or external_launcher."
     )
```
This change correctly allows nnodes > 1 for the external_launcher backend. However, it exposes a critical issue in vllm/distributed/parallel_state.py.
The init_distributed_environment function (lines 1172-1178) unconditionally overrides distributed_init_method when nnodes > 1. This will break external_launcher, which relies on distributed_init_method="env://" for multi-node setup with tools like torchrun. The override logic uses master_addr and master_port, which is intended for the mp backend.
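For context, an external launcher such as torchrun exports the rendezvous details as environment variables, and env:// initialization simply reads them. The following minimal sketch shows that flow using standard torch.distributed calls (it is illustrative, not code from this PR), and why replacing the init method with vLLM's own master_addr/master_port would conflict with it:

```python
import os
import torch.distributed as dist

# torchrun sets MASTER_ADDR, MASTER_PORT, RANK and WORLD_SIZE for every process.
# With init_method="env://", init_process_group reads them directly, so vLLM
# must not substitute its own master address/port for external_launcher runs.
dist.init_process_group(
    backend="nccl",
    init_method="env://",
    rank=int(os.environ["RANK"]),
    world_size=int(os.environ["WORLD_SIZE"]),
)
```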
To fix this, the logic in init_distributed_environment should be made conditional, for example:

```python
if (config is not None and config.parallel_config.nnodes > 1 and
        config.parallel_config.distributed_executor_backend == "mp"):
    # ... existing override logic
```

Without this correction in the related file, multi-node execution with external_launcher will likely fail.
Updated accordingly and refactored the if-elif branch in parallel_state.py for simplicity.
💡 Codex Review
Here are some automated review suggestions for this pull request.
vllm/config/parallel.py
Outdated
```diff
 if self.distributed_executor_backend not in ("mp", "uni", "external_launcher") and self.nnodes > 1:
     raise ValueError(
         "nnodes > 1 can only be set when distributed executor "
-        "backend is mp or uni."
+        "backend is mp, uni or external_launcher."
```
Avoid double-counting DP world size for external_launcher multi-node
Allowing nnodes > 1 with distributed_executor_backend="external_launcher" (lines 596-599) now sends these runs down the multi-node init path, but in this mode world_size has already been multiplied by data_parallel_size in post_init (lines 510-512), and world_size_across_dp multiplies by data_parallel_size again (lines 323-326). For external launcher configs that set both data_parallel_size > 1 and nnodes > 1, init_distributed_environment will therefore call init_process_group with a world size that is data_parallel_size times larger than the torchrun WORLD_SIZE, leading to initialization failures or hangs for those multi-node jobs. The DP factor should not be applied twice for external_launcher when nnodes > 1; the sketch below illustrates the mismatch.
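A minimal arithmetic sketch of the mismatch described above; the values are illustrative placeholders, not taken from the PR:

```python
# Illustrative setup: TP=4, PP=1, DP=2 spread across 2 nodes (8 GPUs total).
tensor_parallel_size = 4
pipeline_parallel_size = 1
data_parallel_size = 2

# torchrun launches one process per GPU, so it exports WORLD_SIZE == 8.
torchrun_world_size = 8

# In external_launcher mode, post_init already folds DP into world_size ...
world_size = tensor_parallel_size * pipeline_parallel_size * data_parallel_size  # 8
# ... and world_size_across_dp multiplies by the DP factor a second time.
world_size_across_dp = world_size * data_parallel_size  # 16

# init_process_group would be asked for 16 ranks while torchrun only started 8,
# so initialization fails or hangs.
assert world_size_across_dp != torchrun_world_size
```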
hmellor
left a comment
Was this an oversight or was it previously unsupported?
If it was previously unsupported, can you link to where support was added?
#23691 originally added multi-node support for the mp backend, but it works for external launcher mode as well.
Thanks for the extra context, would it be worth adding a test? Also pre-commit is failing, try:

```bash
uv pip install pre-commit
pre-commit install
pre-commit run -a
```
zhuohan123
left a comment
Please follow the codex comment to add more asserts and also consider adding some tests.
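One possible shape for such an assert, as a hedged sketch only (the helper name and check location are hypothetical, not part of this PR):

```python
import torch.distributed as dist

def _check_external_launcher_world_size(parallel_config) -> None:
    # Hypothetical guard: in external_launcher mode the launcher (e.g. torchrun)
    # owns WORLD_SIZE, so the configured world size must match it exactly and
    # the DP factor must not be applied a second time.
    if dist.is_initialized():
        launcher_world_size = dist.get_world_size()
        assert parallel_config.world_size == launcher_world_size, (
            f"external_launcher expects world_size to equal the launcher's "
            f"WORLD_SIZE, got {parallel_config.world_size} vs {launcher_world_size}"
        )
```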
/gemini review

@codex review
Code Review
This pull request enables multi-node support for the external_launcher mode. This is achieved by allowing nnodes > 1 with this backend and letting torch.distributed handle initialization from environment variables, which is the correct approach for external launchers. The related logic for other distributed backends has been refactored to accommodate this change. The changes look good, but I've identified a minor regression in the refactoring that affects logging, which could be confusing for debugging.
vllm/distributed/parallel_state.py
Outdated
```python
# Use appropriate IP and port based on configuration
if parallel_config.nnodes > 1:
    ip = parallel_config.master_addr
    port = parallel_config.master_port
else:
    ip = parallel_config.data_parallel_master_ip
    port = parallel_config.get_next_dp_init_port()
logger.debug(
    "Adjusting world_size=%d rank=%d distributed_init_method=%s for DP",
    world_size,
    rank,
    distributed_init_method,
)

distributed_init_method = get_distributed_init_method(ip, port)
```
This refactoring has introduced a regression. The logger.debug call on line 1194 now logs the value of distributed_init_method before it is updated with the new ip and port for the data parallelism case (the update happens on line 1201). In the original code, the log happened after the update. This can be misleading when debugging.
To fix this, the get_distributed_init_method call should be moved so that distributed_init_method is updated before it is used in the log message. Suggested change:
Before:

```python
# Use appropriate IP and port based on configuration
if parallel_config.nnodes > 1:
    ip = parallel_config.master_addr
    port = parallel_config.master_port
else:
    ip = parallel_config.data_parallel_master_ip
    port = parallel_config.get_next_dp_init_port()
logger.debug(
    "Adjusting world_size=%d rank=%d distributed_init_method=%s for DP",
    world_size,
    rank,
    distributed_init_method,
)
distributed_init_method = get_distributed_init_method(ip, port)
```

After:

```python
# Use appropriate IP and port based on configuration
if parallel_config.nnodes > 1:
    ip = parallel_config.master_addr
    port = parallel_config.master_port
    distributed_init_method = get_distributed_init_method(ip, port)
else:
    ip = parallel_config.data_parallel_master_ip
    port = parallel_config.get_next_dp_init_port()
    distributed_init_method = get_distributed_init_method(ip, port)
logger.debug(
    "Adjusting world_size=%d rank=%d distributed_init_method=%s for DP",
    world_size,
    rank,
    distributed_init_method,
)
```
updated
Codex Review: Didn't find any major issues. Delightful!
Head branch was pushed to by a user without write access
Signed-off-by: Xingyu Liu <[email protected]>
Multi-node (nnodes > 1) in external launcher mode is a valid use case.
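As an illustration of that use case, here is a hedged sketch of an externally launched multi-node run. The model name, host name, port, and parallel sizes are placeholders; the torchrun flags are the standard CLI ones, and how the new nnodes setting is threaded into the engine configuration is not shown here:

```python
# Hedged sketch: offline inference driven by an external launcher (torchrun).
# Launch on each of the 2 nodes, e.g.:
#   torchrun --nnodes=2 --nproc_per_node=8 --node_rank=<0|1> \
#            --master_addr=<master-host> --master_port=29500 this_script.py
from vllm import LLM, SamplingParams

# torchrun provides RANK/WORLD_SIZE/MASTER_ADDR/MASTER_PORT; with
# distributed_executor_backend="external_launcher", vLLM joins that existing
# process group via env:// instead of spawning its own workers.
llm = LLM(
    model="facebook/opt-125m",        # placeholder model
    tensor_parallel_size=16,          # placeholder; should match the total launcher world size here
    distributed_executor_backend="external_launcher",
)

outputs = llm.generate(["Hello, my name is"], SamplingParams(max_tokens=16))
for out in outputs:
    print(out.outputs[0].text)
```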