Description
I'm running distributed xgboost on a large CPU cluster with 200 m5.4xlarge (16 CPU, 64 GB memory) instances. I'm using Ray Train's XGBoostTrainer, which launches a process on each of the 200 machines, sets the necessary distributed parameters, and wraps the user code with xgboost.collective.CommunicatorContext(**distributed_kwargs).
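For reference, the wrapping is roughly equivalent to the following minimal sketch. The concrete dmlc_* keys and values here are illustrative assumptions (Ray Train fills them in from the tracker it starts on the head node), not Ray Train's actual code:

```python
import xgboost
from xgboost.collective import CommunicatorContext

# Example values only; in practice Ray Train supplies these per worker.
distributed_kwargs = {
    "dmlc_communicator": "rabit",
    "dmlc_tracker_uri": "172.24.35.81",  # tracker host (example address)
    "dmlc_tracker_port": 34313,          # tracker port (example)
}

def train_fn(dtrain: xgboost.DMatrix) -> xgboost.Booster:
    # Each worker enters the communicator context before training so that
    # xgboost.train() performs its allreduce across all workers.
    with CommunicatorContext(**distributed_kwargs):
        return xgboost.train({"tree_method": "hist"}, dtrain, num_boost_round=10)
```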
When using 200 nodes, a worker eventually crashes with a socket connection timed out error and never gets past this CommunicatorContext context manager call:
File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/train/xgboost/config.py", line 44, in collective_communication_context
with CommunicatorContext(**_get_xgboost_args()):
File "/home/ray/anaconda3/lib/python3.11/site-packages/xgboost/collective.py", line 279, in __enter__
init(**self.args)
File "/home/ray/anaconda3/lib/python3.11/site-packages/xgboost/collective.py", line 51, in init
_check_call(_LIB.XGCommunicatorInit(make_jcargs(**args)))
File "/home/ray/anaconda3/lib/python3.11/site-packages/xgboost/core.py", line 284, in _check_call
raise XGBoostError(py_str(_LIB.XGBGetLastError()))
xgboost.core.XGBoostError: [02:15:49] /workspace/src/collective/result.cc:78:
- [comm.cc:219|02:15:49]: Failed to bootstrap the communication group.
- [comm.cc:330|02:15:49]: Failed to connect to other workers.
- [socket.cc:185|02:15:49]: Failed to connect to 172.24.35.81:34313
- [socket.h:357|02:15:49]: Socket error. Connection timed out
I can avoid the error by setting dmlc_retry to a large number, e.g. CommunicatorContext(dmlc_retry=100, ...). However, I often see that a few workers (maybe 4 of the 200) never enter the context (the call to init never finishes).
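For concreteness, the workaround looks roughly like this (a sketch; distributed_kwargs is the tracker configuration supplied by Ray Train, and 100 is simply the value that avoided the timeout for me):

```python
from xgboost.collective import CommunicatorContext

# distributed_kwargs: tracker URI/port etc. as supplied by Ray Train
# (see the sketch in the description above).
args = dict(distributed_kwargs)
args["dmlc_retry"] = 100  # allow many more connection attempts during bootstrap

with CommunicatorContext(**args):
    ...  # training code
```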
Profile of communicator init() time
I timed how long it takes to enter the with CommunicatorContext(...): block with a few different settings of num_workers. Each graph shows the worker's world rank on the x axis and the time it took to enter the distributed context on the y axis.
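The timing was collected roughly as follows (an illustrative sketch, not the exact harness):

```python
import time

from xgboost import collective
from xgboost.collective import CommunicatorContext

# Measure wall-clock time from calling __enter__ until the communicator is
# initialized, and log it together with the worker's world rank.
start = time.perf_counter()
with CommunicatorContext(**distributed_kwargs):
    elapsed = time.perf_counter() - start
    print(f"rank={collective.get_rank()} entered communicator after {elapsed:.1f}s")
    ...  # training proceeds here
```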
Observation: the straggler among the first 20 workers seems to be the bottleneck for the rest of the workers past rank 20.
100 workers
[init-time-vs-rank plot]
120 workers
[init-time-vs-rank plot] 120 workers takes longer than 100.
200 workers
[init-time-vs-rank plots] Sometimes all workers finish initializing properly, but only after a delay whose length is not consistent (two examples shown). Other times all but a few workers finish; the stragglers hang forever without any further warning messages (see the gaps in the data around rank 25).
Questions
- Is this scale (200+ nodes) something that the xgboost team has tested before?
- How can the communicator group be initialized more quickly?