@irexyc irexyc commented Mar 28, 2025

Usage

A two-node example with tp=4, dp=4, and device_num=4:

proxy_server
lmdeploy serve proxy --server-name 10.140.24.141

node 0

LMDEPLOY_DP_MASTER_ADDR=10.140.24.141 \
LMDEPLOY_DP_MASTER_PORT=6900 \
lmdeploy serve api_server \
    Qwen/Qwen2.5-7B-Instruct \
    --server-port 29200 \
    --tp 4 \
    --dp 4 \
    --nnodes 2 \
    --node-rank 0 \
    --ngpus-per-node 2 \
    --proxy-url http://10.140.24.141:8000

node 1

LMDEPLOY_DP_MASTER_ADDR=10.140.24.141 \
LMDEPLOY_DP_MASTER_PORT=6900 \
lmdeploy serve api_server \
    Qwen/Qwen2.5-7B-Instruct \
    --server-port 29202 \
    --tp 4 \
    --dp 4 \
    --nnodes 2 \
    --node-rank 1 \
    --ngpus-per-node 2 \
    --proxy-url http://10.140.24.141:8000
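The two node commands above differ only in `--server-port` and `--node-rank`. As a rough sketch of that pattern, the snippet below rebuilds both command lines from shared values. The helper name `api_server_command` is hypothetical; every flag and value is copied from the example above, and nothing here is part of lmdeploy itself.

```python
import shlex

# Environment prefix shared by both nodes (values taken from the example above).
ENV_PREFIX = "LMDEPLOY_DP_MASTER_ADDR=10.140.24.141 LMDEPLOY_DP_MASTER_PORT=6900"


def api_server_command(node_rank: int, server_port: int) -> str:
    """Build the api_server command line for one node of the example (hypothetical helper)."""
    args = [
        "lmdeploy", "serve", "api_server", "Qwen/Qwen2.5-7B-Instruct",
        "--server-port", str(server_port),
        "--tp", "4",
        "--dp", "4",
        "--nnodes", "2",
        "--node-rank", str(node_rank),
        "--ngpus-per-node", "2",
        "--proxy-url", "http://10.140.24.141:8000",
    ]
    return shlex.join(args)


# Node 0 and node 1 use the ports shown in the example.
for rank, port in [(0, 29200), (1, 29202)]:
    print(f"{ENV_PREFIX} {api_server_command(rank, port)}")
```

Generating both lines from one place is only a convenience for keeping the shared flags identical across nodes.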

@irexyc irexyc added the WIP label Mar 28, 2025
@irexyc irexyc changed the title [WIP] Add Gloo communication to turobmind Add Gloo communication to turobmind Apr 10, 2025
@irexyc irexyc removed the WIP label Apr 10, 2025
irexyc commented Apr 10, 2025

oc evaluate diff.csv

@lvhan028 lvhan028 added the enhancement New feature or request label Apr 10, 2025
@lvhan028 lvhan028 requested a review from lzhangzz April 24, 2025 03:45
cfg.devices = cfg.devices if cfg.devices else list(range(cfg.device_num))
cfg.ngpus_per_node = cfg.ngpus_per_node or len(cfg.devices)
# for simplicity, each node has dp
assert cfg.outer_dp_size * cfg.attn_dp_size % cfg.nnodes == 0
Collaborator

It is possible that the model does not fit into a single node.
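To make the concern concrete, here is a small sketch of the divisibility check from the quoted assertion. The function name `dp_splits_evenly` and the sample values are illustrative assumptions, not code from the PR; only the expression itself mirrors the diff.

```python
def dp_splits_evenly(outer_dp_size: int, attn_dp_size: int, nnodes: int) -> bool:
    """Mirror of the quoted assertion: total DP must be a multiple of the node count."""
    # In Python, * and % share precedence and associate left to right,
    # so the original expression already means (outer * attn) % nnodes.
    return (outer_dp_size * attn_dp_size) % nnodes == 0


# Illustrative values only: a total DP of 4 spread over 2 nodes passes.
print(dp_splits_evenly(1, 4, 2))  # True

# A single replica sharded across 2 nodes (total DP of 1) fails the check,
# which is the situation the review comment points at.
print(dp_splits_evenly(1, 1, 2))  # False
```

Under this reading, a model that needs more than one node for a single replica would trip the assertion even though the configuration is otherwise reasonable.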

Comment on lines +79 to +80
std::string host = std::getenv("LMDEPLOY_DP_MASTER_ADDR");
int port = std::stoi(std::getenv("LMDEPLOY_DP_MASTER_PORT"));
Collaborator

How is DP related to the surrounding context of this code?

Collaborator Author

The pt engine uses these variables to build the dist group. Do we need to choose a new name, like dist_init_addr?
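Whatever name is chosen, the quoted C++ lines pass the result of `std::getenv` straight into `std::string` and `std::stoi`, which is undefined behaviour when a variable is unset because `std::getenv` then returns a null pointer. The sketch below shows the kind of validation that could guard against this. It is written in Python for brevity, and the helper name `read_master_endpoint` is hypothetical; only the two variable names come from the PR.

```python
import os


def read_master_endpoint(env=os.environ):
    """Return (host, port) from the two variables, failing loudly if unset or malformed."""
    host = env.get("LMDEPLOY_DP_MASTER_ADDR")
    port_text = env.get("LMDEPLOY_DP_MASTER_PORT")
    if not host:
        raise RuntimeError("LMDEPLOY_DP_MASTER_ADDR is not set")
    if not port_text:
        raise RuntimeError("LMDEPLOY_DP_MASTER_PORT is not set")
    try:
        port = int(port_text)
    except ValueError:
        raise RuntimeError(f"LMDEPLOY_DP_MASTER_PORT is not an integer: {port_text!r}") from None
    if not 0 < port < 65536:
        raise RuntimeError(f"LMDEPLOY_DP_MASTER_PORT is out of range: {port}")
    return host, port


# Example with the values used in the usage section of this PR.
example_env = {
    "LMDEPLOY_DP_MASTER_ADDR": "10.140.24.141",
    "LMDEPLOY_DP_MASTER_PORT": "6900",
}
print(read_master_endpoint(example_env))  # ('10.140.24.141', 6900)
```

The same checks would apply unchanged if the variables were renamed.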

for (int local_rank = 0, offset = engine_param_.ngpus_per_node * engine_param_.node_rank;
local_rank < engine_param_.ngpus_per_node;
++local_rank) {
auto& e = engine_params_[offset + local_rank];
Collaborator

Adjust the size of engine_params_ to fit ngpus_per_node.
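For reference, the index arithmetic in the quoted loop can be sketched as follows. This is a Python illustration of the offset computation only; the function name `global_indices` is hypothetical, and the sample values follow the two-node usage example.

```python
def global_indices(node_rank: int, ngpus_per_node: int) -> list[int]:
    """Indices into a globally sized list that belong to one node."""
    # Same offset as in the quoted loop: ngpus_per_node * node_rank.
    offset = ngpus_per_node * node_rank
    return [offset + local_rank for local_rank in range(ngpus_per_node)]


# With 2 GPUs per node, as in the usage example:
print(global_indices(0, 2))  # [0, 1]
print(global_indices(1, 2))  # [2, 3]
```

One reading of the suggestion is that, if `engine_params_` held only `ngpus_per_node` entries, each node could index it by `local_rank` directly and the offset would no longer be needed.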

3 participants