
[distributed] Slow xgboost distributed communicator setup when scaling past 150 nodes #11270

@justinvyu

Description


I'm running distributed xgboost on a large CPU cluster of 200 m5.4xlarge instances (16 CPUs, 64 GB memory each). I'm using Ray Train's XGBoostTrainer, which launches a worker process on each of the 200 machines, sets the necessary distributed parameters, and wraps the user code with xgboost.collective.CommunicatorContext(**distributed_kwargs).
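For context, here is a minimal sketch of what each worker effectively ends up running. The argument keys and values are illustrative assumptions on my part (Ray Train computes the real ones internally), not the exact output of its `_get_xgboost_args` helper:

```python
import xgboost
from xgboost.collective import CommunicatorContext

# Illustrative per-worker arguments; Ray Train fills in the real values.
distributed_kwargs = {
    "dmlc_communicator": "rabit",         # assumed communicator type
    "dmlc_tracker_uri": "172.24.35.81",   # rank-0 / tracker address (example)
    "dmlc_tracker_port": 34313,           # tracker port (example)
    "dmlc_task_id": "worker-42",          # unique id for this worker (example)
}

def train_fn_per_worker(train_path: str) -> None:
    # Every worker blocks in __enter__ until the communication group is
    # bootstrapped; this is the call that times out or hangs at 200 nodes.
    with CommunicatorContext(**distributed_kwargs):
        dtrain = xgboost.DMatrix(train_path)
        xgboost.train({"tree_method": "hist"}, dtrain, num_boost_round=10)
```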

When using 200 nodes, a worker eventually crashes with a socket "Connection timed out" error and never gets past this CommunicatorContext context manager call:

  File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/train/xgboost/config.py", line 44, in collective_communication_context
    with CommunicatorContext(**_get_xgboost_args()):
  File "/home/ray/anaconda3/lib/python3.11/site-packages/xgboost/collective.py", line 279, in __enter__
    init(**self.args)
  File "/home/ray/anaconda3/lib/python3.11/site-packages/xgboost/collective.py", line 51, in init
    _check_call(_LIB.XGCommunicatorInit(make_jcargs(**args)))
  File "/home/ray/anaconda3/lib/python3.11/site-packages/xgboost/core.py", line 284, in _check_call
    raise XGBoostError(py_str(_LIB.XGBGetLastError()))
xgboost.core.XGBoostError: [02:15:49] /workspace/src/collective/result.cc:78: 
- [comm.cc:219|02:15:49]: Failed to bootstrap the communication group.
- [comm.cc:330|02:15:49]: Failed to connect to other workers.
- [socket.cc:185|02:15:49]: Failed to connect to 172.24.35.81:34313
- [socket.h:357|02:15:49]: Socket error. Connection timed out

I can avoid the error by setting dmlc_retry to a large number, e.g. CommunicatorContext(dmlc_retry=100, ...). However, I often see that a few workers (maybe 4 of the 200) still never enter the context (the call to init never finishes).
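As a concrete sketch of that workaround (dmlc_retry is the only key confirmed here; the rest of the arguments are the same illustrative ones as in the sketch above):

```python
from xgboost.collective import CommunicatorContext

# distributed_kwargs: same illustrative per-worker arguments as in the sketch
# above, plus a much larger retry budget so slow workers keep retrying the
# bootstrap connection instead of failing with "Connection timed out".
distributed_kwargs["dmlc_retry"] = 100

with CommunicatorContext(**distributed_kwargs):
    ...  # training body unchanged
```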

Profile of communicator init() time

I timed how long it takes to enter the with CommunicatorContext(...): block (roughly as in the sketch below) for a few different settings of num_workers. Each graph plots the worker's world rank on the x axis against the time it took that worker to enter the distributed context on the y axis.
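One way to collect such a measurement per worker (a sketch, not necessarily exactly how these numbers were gathered; the reporting is just a print):

```python
import time

from xgboost import collective
from xgboost.collective import CommunicatorContext

start = time.perf_counter()
with CommunicatorContext(**distributed_kwargs):  # kwargs as in the sketches above
    init_time = time.perf_counter() - start
    # One (world rank, init time) point per worker; these are the values
    # plotted below against the rank.
    print(f"rank={collective.get_rank()} init_time={init_time:.1f}s")
```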

Observation: the slowest straggler among the first 20 workers appears to bottleneck all of the workers past rank 20.

100 workers

[plot: init time vs. world rank, 100 workers]

120 workers

With 120 workers, initialization takes longer.

[plot: init time vs. world rank, 120 workers]

200 workers

Sometimes all workers eventually finish initializing, but only after a delay whose length is not consistent. Here are two examples:

[plots: init time vs. world rank for two separate 200-worker runs]

Other times, all but a few workers finish. The stragglers hang forever without emitting any further warning messages; see the gaps in the data around rank 25.

[plot: init time vs. world rank, 200 workers, with a few stragglers missing around rank 25]

Questions

  1. Is this scale (200+ nodes) something that the xgboost team has tested before?
  2. How can the group be initialized more quickly?
