
[distributed] Slow xgboost distributed communicator setup when scaling past 150 nodes #11270

@justinvyu

Description


I'm running distributed xgboost on a large CPU cluster of 200 m5.4xlarge instances (16 CPUs, 64 GB memory each). I'm using Ray Train's XGBoostTrainer, which launches a worker process on each of the 200 machines, sets the necessary distributed parameters, and wraps the user code with xgboost.collective.CommunicatorContext(**distributed_kwargs).
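For context, here is a minimal sketch of what each worker effectively ends up running. The argument keys and values are illustrative assumptions on my part (Ray Train computes the real ones internally), not the exact output of its `_get_xgboost_args` helper:

```python
import xgboost
from xgboost.collective import CommunicatorContext

# Illustrative per-worker arguments; Ray Train fills in the real values.
distributed_kwargs = {
    "dmlc_communicator": "rabit",         # assumed communicator type
    "dmlc_tracker_uri": "172.24.35.81",   # rank-0 / tracker address (example)
    "dmlc_tracker_port": 34313,           # tracker port (example)
    "dmlc_task_id": "worker-42",          # unique id for this worker (example)
}

def train_fn_per_worker(train_path: str) -> None:
    # Every worker blocks in __enter__ until the communication group is
    # bootstrapped; this is the call that times out or hangs at 200 nodes.
    with CommunicatorContext(**distributed_kwargs):
        dtrain = xgboost.DMatrix(train_path)
        xgboost.train({"tree_method": "hist"}, dtrain, num_boost_round=10)
```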

When using 200 nodes, a worker eventually crashes with a socket "Connection timed out" error and never gets past this CommunicatorContext context manager call:

  File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/train/xgboost/config.py", line 44, in collective_communication_context
    with CommunicatorContext(**_get_xgboost_args()):
  File "/home/ray/anaconda3/lib/python3.11/site-packages/xgboost/collective.py", line 279, in __enter__
    init(**self.args)
  File "/home/ray/anaconda3/lib/python3.11/site-packages/xgboost/collective.py", line 51, in init
    _check_call(_LIB.XGCommunicatorInit(make_jcargs(**args)))
  File "/home/ray/anaconda3/lib/python3.11/site-packages/xgboost/core.py", line 284, in _check_call
    raise XGBoostError(py_str(_LIB.XGBGetLastError()))
xgboost.core.XGBoostError: [02:15:49] /workspace/src/collective/result.cc:78: 
- [comm.cc:219|02:15:49]: Failed to bootstrap the communication group.
- [comm.cc:330|02:15:49]: Failed to connect to other workers.
- [socket.cc:185|02:15:49]: Failed to connect to 172.24.35.81:34313
- [socket.h:357|02:15:49]: Socket error. Connection timed out

I can avoid the error by setting dmlc_retry to a large number, e.g. CommunicatorContext(dmlc_retry=100, ...). However, I often see that a few workers (maybe 4 of the 200) still never enter the context (the call to init never finishes).
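As a concrete sketch of that workaround (dmlc_retry is the only key confirmed here; the rest of the arguments are the same illustrative ones as in the sketch above):

```python
from xgboost.collective import CommunicatorContext

# distributed_kwargs: same illustrative per-worker arguments as in the sketch
# above, plus a much larger retry budget so slow workers keep retrying the
# bootstrap connection instead of failing with "Connection timed out".
distributed_kwargs["dmlc_retry"] = 100

with CommunicatorContext(**distributed_kwargs):
    ...  # training body unchanged
```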

Profile of communicator init() time

I timed how long it takes to enter the with CommunicatorContext(...): block (roughly as in the sketch below) for a few different settings of num_workers. Each graph plots the worker's world rank on the x axis against the time it took that worker to enter the distributed context on the y axis.
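One way to collect such a measurement per worker (a sketch, not necessarily exactly how these numbers were gathered; the reporting is just a print):

```python
import time

from xgboost import collective
from xgboost.collective import CommunicatorContext

start = time.perf_counter()
with CommunicatorContext(**distributed_kwargs):  # kwargs as in the sketches above
    init_time = time.perf_counter() - start
    # One (world rank, init time) point per worker; these are the values
    # plotted below against the rank.
    print(f"rank={collective.get_rank()} init_time={init_time:.1f}s")
```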

Observation: the slowest straggler among the first 20 workers appears to bottleneck all of the workers past rank 20.

100 workers

[plot: init time vs. world rank, 100 workers]

120 workers

With 120 workers, initialization takes longer.

[plot: init time vs. world rank, 120 workers]

200 workers

Sometimes all workers eventually finish initializing, but only after a delay whose length is not consistent. Here are two examples:

[plots: init time vs. world rank for two separate 200-worker runs]

Other times, all but a few workers finish. The stragglers hang forever without emitting any further warning messages; see the gaps in the data around rank 25.

[plot: init time vs. world rank, 200 workers, with a few stragglers missing around rank 25]

Questions

  1. Is this scale (200+ nodes) something that the xgboost team has tested before?
  2. How can the group be initialized more quickly?
