Summary:
Currently, creating an MCCL communicator requires passing a rank at creation time. However, a rank can only be determined after the communicator is created, so semantically it is not possible to provide one during creation. Even if we ignore semantics and pass a rank anyway, as users often do, this approach only works for non-fault-tolerant scenarios, where the scheduler creates processes and sets the environment variable `RANK`.
In fault-tolerant scenarios, process creation is elastic: processes can be added or removed at any time. The same issue arises here: ranks cannot be specified at creation time, because they can only be determined at initialization, once all participants are known.
To address this, we are moving MCCL to a new API where ranks are not provided at creation time.
At initialization, we support two options:
1. **User does not care about rank order:**
MCCL `init` accepts a `std::unordered_set` of URLs, and MCCL is free to assign ranks internally based on its own considerations.
2. **User wants to specify rank order:**
For non-fault-tolerant cases, where ranks are defined by the environment variable `RANK`, the user passes a `std::vector` of URLs to MCCL. MCCL will respect the order of URLs and assign ranks accordingly. Using a vector also ensures rank properties are satisfied: ranks go sequentially from 0 to `nRanks - 1`, with no repetitions or missing ranks.
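The two options can be sketched as a pair of overloads. This is a minimal illustration only, not MCCL's actual interface: `RankMap` and `assignRanks` are hypothetical names, and sorting the URLs in the unordered case is an assumed policy, chosen here so that every participant would derive the same assignment.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <string>
#include <unordered_map>
#include <unordered_set>
#include <vector>

// Hypothetical type: maps each participant URL to its assigned rank.
using RankMap = std::unordered_map<std::string, std::size_t>;

// Option 2 (caller specifies the order): the index of a URL in the vector
// becomes its rank, so ranks run from 0 to nRanks - 1 with no gaps.
RankMap assignRanks(const std::vector<std::string>& urls) {
  RankMap ranks;
  for (std::size_t i = 0; i < urls.size(); ++i) {
    // emplace reports failure if the URL is already present.
    if (!ranks.emplace(urls[i], i).second) {
      throw std::invalid_argument("duplicate URL: " + urls[i]);
    }
  }
  return ranks;
}

// Option 1 (caller does not care about order): the library picks the order.
// Sorting is an assumption of this sketch, not a documented MCCL policy.
RankMap assignRanks(const std::unordered_set<std::string>& urls) {
  std::vector<std::string> ordered(urls.begin(), urls.end());
  std::sort(ordered.begin(), ordered.end());
  RankMap ranks = assignRanks(ordered);
  assert(ranks.size() == urls.size());  // a set cannot contain duplicates
  return ranks;
}
```

In this sketch the container type alone tells the library whether the caller cares about rank order, so no extra flag or rank argument is needed at creation time.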
Reviewed By: saifhhasan
Differential Revision: D84839780
fbshipit-source-id: c482b47976cf8711fca737c0f107896bc3bfe558