
Conversation

@michal-shalev
Contributor

What?

Add internal connection establishment to the UCX backend to prevent UCS_ERR_NOT_CONNECTED errors during data transfers.

Why?

UCX requires connection establishment before data transfers, but the current implementation relies on manual workarounds in test code using genNotif/getNotifs polling. This should be handled automatically at the backend level.

How?

  • Add a performConnectionEstablishment() method that flushes all endpoints using the UCX blocking pattern (sketched after this list)
  • Call it automatically in loadRemoteConnInfo() after connection setup
  • Remove completeWireup workarounds from all test files
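
Roughly, the proposed change has the following shape. This is a sketch pieced together from the diff hunks quoted in the review comments below, not the exact patch; the blocking-progress step inside the NIXL_IN_PROG branch is not visible in those hunks, so it appears here only as a comment.

// Sketch reconstructed from the quoted diff hunks; not the exact patch.
void
nixlUcxEngine::performConnectionEstablishment(
    const std::string &remote_agent,
    const std::shared_ptr<nixlUcxConnection> &conn) const {
    // Flush all endpoints to ensure connection establishment
    // and avoid UCS_ERR_NOT_CONNECTED errors during data transfers
    for (size_t i = 0; i < conn->eps.size(); ++i) {
        nixlUcxReq req;
        nixl_status_t ret = conn->eps[i]->flushEp(req);
        if (ret == NIXL_IN_PROG) {
            nixlUcxWorker *worker = getWorker(i).get();
            // ... progress the worker until req completes and update ret
            // (the exact blocking pattern is not shown in the quoted hunks)
        }
        if (ret != NIXL_SUCCESS) {
            NIXL_WARN << "Failed to flush endpoint " << i << " for " << remote_agent;
        }
    }
}

// Call site in loadRemoteConnInfo(), after the connection has been registered:
remoteConnMap.insert({remote_agent, conn});
performConnectionEstablishment(remote_agent, conn);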

@github-actions

👋 Hi michal-shalev! Thank you for contributing to ai-dynamo/nixl.

Your PR reviewers will review your contribution then trigger the CI to test your changes.

🚀

@michal-shalev
Contributor Author

/build


// Flush all endpoints to ensure connection establishment
// and avoid UCS_ERR_NOT_CONNECTED errors during data transfers
for (size_t i = 0; i < conn->eps.size(); ++i) {
Contributor:

I think a flush without any previous request could complete without the endpoint being connected. If confirmed, we could first send a dummy op, then start the flush.
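
For illustration, a minimal sketch of that suggestion. postNoOp() below is hypothetical and not part of the API shown in this PR; it only stands in for whatever trivial operation would be posted before the flush.

for (size_t i = 0; i < conn->eps.size(); ++i) {
    // Hypothetical: post a trivial/dummy operation first so the endpoint
    // actually initiates wireup before anything is flushed.
    nixlUcxReq dummy;
    nixl_status_t ret = conn->eps[i]->postNoOp(dummy);   // postNoOp() does not exist today
    // ... progress until the dummy op completes ...

    // Only then start the flush; it can no longer complete on an unconnected EP.
    nixlUcxReq req;
    ret = conn->eps[i]->flushEp(req);
    // ... progress until the flush completes ...
}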

// and avoid UCS_ERR_NOT_CONNECTED errors during data transfers
for (size_t i = 0; i < conn->eps.size(); ++i) {
nixlUcxReq req;
nixl_status_t ret = conn->eps[i]->flushEp(req);
Contributor:

Since this is the UCX engine API, I think it would be safer to start all the flush operations first and then progress all flush requests in a second loop, in case there is some inter-dependency between them.
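
In other words, something along these lines (a sketch reusing the types from the hunk above; how a request is progressed and tested is not shown in this PR, so that step stays a comment):

std::vector<nixlUcxReq> reqs(conn->eps.size());
std::vector<nixl_status_t> rets(conn->eps.size());

// Phase 1: start a flush on every endpoint without waiting on any of them.
for (size_t i = 0; i < conn->eps.size(); ++i) {
    rets[i] = conn->eps[i]->flushEp(reqs[i]);
}

// Phase 2: only now wait, so endpoints that depend on each other can make progress together.
for (size_t i = 0; i < conn->eps.size(); ++i) {
    if (rets[i] == NIXL_IN_PROG) {
        // progress getWorker(i) until reqs[i] completes and update rets[i]
    }
    if (rets[i] != NIXL_SUCCESS) {
        NIXL_WARN << "Failed to flush endpoint " << i << " for " << remote_agent;
    }
}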

}

if (ret != NIXL_SUCCESS) {
NIXL_WARN << "Failed to flush endpoint " << i << " for " << remote_agent
Contributor:

Return an error here.
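
That is, something like the following, assuming performConnectionEstablishment() is changed from void to nixl_status_t so the failure can propagate:

if (ret != NIXL_SUCCESS) {
    NIXL_WARN << "Failed to flush endpoint " << i << " for " << remote_agent;
    return ret;   // requires the method to return nixl_status_t instead of void
}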

void
nixlUcxEngine::performConnectionEstablishment(
const std::string &remote_agent,
const std::shared_ptr<nixlUcxConnection> &conn) const {
Contributor:

Suggested change
-    const std::shared_ptr<nixlUcxConnection> &conn) const {
+    const nixlUcxConnection &conn) const {

https://isocpp.github.io/CppCoreGuidelines/CppCoreGuidelines#f7-for-general-use-take-t-or-t-arguments-rather-than-smart-pointers

Or pass just endpoints.
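
A sketch of that alternative; the element type of conn->eps is not visible in this PR, so nixlUcxEpPtr below is only a stand-in for whatever it actually is:

// Take only what the function uses:
void
nixlUcxEngine::performConnectionEstablishment(const std::string &remote_agent,
                                              const std::vector<nixlUcxEpPtr> &eps) const;

// Call site:
performConnectionEstablishment(remote_agent, conn->eps);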

nixl_status_t ret = conn->eps[i]->flushEp(req);

if (ret == NIXL_IN_PROG) {
nixlUcxWorker *worker = getWorker(i).get();
Contributor:

nixlUcxWorker & ?
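
Presumably meaning something like:

// Bind a reference instead of holding a raw pointer obtained via .get():
nixlUcxWorker &worker = *getWorker(i);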

Comment on lines +1829 to +1830
// Flush all endpoints to ensure connection establishment
// and avoid UCS_ERR_NOT_CONNECTED errors during data transfers
Contributor:

To me this looks like moving the workaround from the user level into the UCX backend instead of fixing the UCP API. A UCP EP should not return NOT_CONNECTED, so that no level has to block; instead, the request should go on the pending queue until completion, like any other operation posted on a UCP EP.


remoteConnMap.insert({remote_agent, conn});

performConnectionEstablishment(remote_agent, conn);
Contributor:

Handle the possible failure here.
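
For instance (a sketch; it assumes performConnectionEstablishment() is changed to return nixl_status_t, as suggested above, so loadRemoteConnInfo() can propagate the status):

remoteConnMap.insert({remote_agent, conn});

nixl_status_t status = performConnectionEstablishment(remote_agent, conn);
if (status != NIXL_SUCCESS) {
    remoteConnMap.erase(remote_agent);   // do not keep a half-established connection
    return status;
}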

@michal-shalev marked this pull request as draft October 21, 2025 00:05
@michal-shalev
Contributor Author

Decided on a different solution

