-
Notifications
You must be signed in to change notification settings - Fork 169
UCX/BACKEND: Add internal connection establishment #819
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
UCX/BACKEND: Add internal connection establishment #819
Conversation
|
👋 Hi michal-shalev! Thank you for contributing to ai-dynamo/nixl. Your PR reviewers will review your contribution then trigger the CI to test your changes. 🚀 |
Signed-off-by: Michal Shalev <[email protected]>
4f2a5da to
cb389d9
Compare
|
/build |
|
|
||
| // Flush all endpoints to ensure connection establishment | ||
| // and avoid UCS_ERR_NOT_CONNECTED errors during data transfers | ||
| for (size_t i = 0; i < conn->eps.size(); ++i) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i think flush without any previous request could complete without endpoint being connected. if confirmed, we could first send dummy op, then start flush
| // and avoid UCS_ERR_NOT_CONNECTED errors during data transfers | ||
| for (size_t i = 0; i < conn->eps.size(); ++i) { | ||
| nixlUcxReq req; | ||
| nixl_status_t ret = conn->eps[i]->flushEp(req); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
since it is ucx engine api, i think it would be safer to start all the flush operations, then progress all flush requests in another loop, for the case where we would have inter-dependency somehow.
| } | ||
|
|
||
| if (ret != NIXL_SUCCESS) { | ||
| NIXL_WARN << "Failed to flush endpoint " << i << " for " << remote_agent |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
return error
| void | ||
| nixlUcxEngine::performConnectionEstablishment( | ||
| const std::string &remote_agent, | ||
| const std::shared_ptr<nixlUcxConnection> &conn) const { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| const std::shared_ptr<nixlUcxConnection> &conn) const { | |
| const nixlUcxConnection &conn) const { |
Or pass just endpoints.
| nixl_status_t ret = conn->eps[i]->flushEp(req); | ||
|
|
||
| if (ret == NIXL_IN_PROG) { | ||
| nixlUcxWorker *worker = getWorker(i).get(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nixlUcxWorker & ?
| // Flush all endpoints to ensure connection establishment | ||
| // and avoid UCS_ERR_NOT_CONNECTED errors during data transfers |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To me it looks like trying to move workaround from user level to ucx backend instead of fixing UCP API, UCP EP should not return NOT_CONNECTED to avoid blocking on any level. Instead, the request should go on pending until completion as any other operation posted on UCP EP.
|
|
||
| remoteConnMap.insert({remote_agent, conn}); | ||
|
|
||
| performConnectionEstablishment(remote_agent, conn); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
handle possible failure
|
Decided on a different solution |
What?
Add internal connection establishment to UCX backend to prevent
UCS_ERR_NOT_CONNECTEDerrors during data transfers.Why?
UCX requires connection establishment before data transfers, but the current implementation relies on manual workarounds in test code using
genNotif/getNotifspolling. This should be handled automatically at the backend level.How?
performConnectionEstablishment()method that flushes all endpoints using UCX blocking patternloadRemoteConnInfo()after connection setupcompleteWireupworkarounds from all test files