-
Notifications
You must be signed in to change notification settings - Fork 387
feat(topology): add GPU entries in topology discovery #673
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
Signed-off-by: staryxchen <[email protected]>
.avail_hca = avail_hca}); | ||
topology.push_back( | ||
TopologyEntry{.name = "cuda:" + std::to_string(i), | ||
TopologyEntry{.name = "gpu:" + std::to_string(i), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just wonder what the difference is between cuda:x and gpu:x, it seems the same thing in torch.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, there is no difference for the Torch API. However, now, in Mooncake, we select the device to use based on the segment buffer name, which is a string comparison. Therefore, if the user registers the memory and declares it as “gpu:”, Mooncake will fall back to using the device corresponding to kWildcardLocation because it cannot find a matching string.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nope, I do think people should use cuda:
instead of gpu:
, see document here: https://github.com/kvcache-ai/Mooncake/blob/main/doc/zh/transfer-engine.md#transferengineregisterlocalmemory
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changing the name may lead compatibility problem after merging it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In other words, you can check whether there is any hard-coded "cuda:" in elsewhere.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In other words, you can check whether there is any hard-coded "cuda:" in elsewhere.
mooncake internal (location and topology) uniformly uses cuda:xx
.
Add detailed topology info in discovery and transport installation logs Signed-off-by: staryxchen <[email protected]>
I will review it later. @starychen |
Summary
This PR adds backward compatibility for GPU memory naming by including both "cuda:xx" and "gpu:xx" entries during topology discovery.
Background
In earlier versions of mooncake, the GPU0 memory was referred to as "gpu:0" rather than "cuda:0," so some users would refer to the memory as "gpu:0" when registering it. To maintain compatibility, this PR adds both "cuda:xx" and "gpu:xx" entries during topology discovery.
Changes
Impact