@adiprerepa adiprerepa commented Aug 20, 2025

Found while using the IPC NVLink transport with SGLang: the sender keeps holding an open IPC handle to the receiver even after the receiver dies and deallocates its memory. This PR adds the ability for the sender to close the IPC handle with cudaIpcCloseMemHandle, which is triggered by engine.free_remote_segment.

cc @ShangmingCai @ByronHsu

@ShangmingCai ShangmingCai requested a review from alogfans August 20, 2025 00:57
.def("batch_unregister_memory",
&TransferEnginePy::batchUnregisterMemory)
.def("get_local_topology", &TransferEnginePy::getLocalTopology)
.def("free_remote_segment", &TransferEnginePy::freeRemoteSegment)
Collaborator

Renaming it to "close_remote_segment" may be better.

LOG(ERROR) << "NvlinkTransport: cudaIpcCloseMemHandle failed: "
<< cudaGetErrorString(err);
}
}
Collaborator

Add one line to remove this entry from remap_entries_ after cudaIpcCloseMemHandle.
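
For illustration, the suggestion could look roughly like the sketch below. It is not the actual NvlinkTransport code. `RemapTable`, `RemapEntry`, the string key, and the injected `close_fn` are hypothetical names, and `close_fn` stands in for cudaIpcCloseMemHandle so the sketch builds without CUDA. The real layout of remap_entries_ is not shown in this thread, so a plain map is assumed.

```cpp
#include <cstddef>
#include <functional>
#include <iostream>
#include <string>
#include <unordered_map>
#include <utility>

// Hypothetical stand-in for one remapped remote segment. The real entry type
// in NvlinkTransport is an assumption here.
struct RemapEntry {
    void* mapped_ptr;  // address obtained when the IPC handle was opened
};

class RemapTable {
public:
    // close_fn stands in for cudaIpcCloseMemHandle and returns 0 on success.
    using CloseFn = std::function<int(void*)>;

    explicit RemapTable(CloseFn close_fn) : close_fn_(std::move(close_fn)) {}

    void add(const std::string& segment, void* mapped_ptr) {
        remap_entries_[segment] = RemapEntry{mapped_ptr};
    }

    // Closes the mapping for `segment` and drops its table entry.
    // Returns false if the segment is unknown or the close call failed.
    bool closeRemoteSegment(const std::string& segment) {
        auto it = remap_entries_.find(segment);
        if (it == remap_entries_.end()) return false;

        int err = close_fn_(it->second.mapped_ptr);
        if (err != 0) {
            std::cerr << "close failed with code " << err << "\n";
        }
        // The one-line fix: erase the entry so later lookups cannot hand out
        // a pointer whose mapping has already been closed. Erasing even on
        // failure is an assumption of this sketch.
        remap_entries_.erase(it);
        return err == 0;
    }

    std::size_t size() const { return remap_entries_.size(); }

private:
    CloseFn close_fn_;
    std::unordered_map<std::string, RemapEntry> remap_entries_;
};
```

Without the erase, a second close or a later transfer to the same segment would find the stale entry and reuse an address that is no longer mapped.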

@alogfans
Collaborator

alogfans commented Sep 3, 2025

Please fix the CI issues before further review.
