Conversation

@staryxchen (Contributor) commented Jul 25, 2025

Summary

This PR adds backward compatibility for GPU memory naming by including both "cuda:xx" and "gpu:xx" entries during topology discovery.

Background

In earlier versions of Mooncake, GPU 0's memory was referred to as "gpu:0" rather than "cuda:0", so some users still pass "gpu:0" as the location name when registering memory. To maintain compatibility with that usage, this PR adds both "cuda:xx" and "gpu:xx" entries during topology discovery.

Changes

  • topology.cpp: Modified discoverCudaTopology() function to add "gpu:xx" entries alongside existing "cuda:xx" entries
  • transfer_engine.cpp: Log the discovered topology information to help diagnose anomalies

Impact

  • Maintains backward compatibility with existing code that uses "gpu:xx" naming convention
  • Allows users to register memory using either the "cuda:xx" or the "gpu:xx" format (see the usage sketch after this list)
  • No breaking changes to existing functionality
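
To illustrate the registration point above, here is a minimal usage sketch. It is not taken from the repository: the header path and the simplified three-argument registerLocalMemory() call are assumptions; see the Transfer Engine documentation linked later in this thread for the authoritative API.

    // Hedged sketch: after this PR, both location strings below should resolve
    // to the same GPU's topology entry. The header path and the simplified
    // registerLocalMemory() signature are assumptions, not the verified API.
    #include <cuda_runtime.h>
    #include "transfer_engine.h"  // assumed header name

    void registerGpuBuffer(mooncake::TransferEngine &engine) {
        void *buf = nullptr;
        const size_t len = 1 << 20;  // 1 MiB example buffer
        cudaSetDevice(0);            // allocate on GPU 0
        cudaMalloc(&buf, len);

        // Canonical name produced by discoverCudaTopology().
        engine.registerLocalMemory(buf, len, "cuda:0");

        // Legacy name, accepted again once the duplicate "gpu:xx" entries exist
        // (commented out here because the buffer is already registered above).
        // engine.registerLocalMemory(buf, len, "gpu:0");
    }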

In topology.cpp, discoverCudaTopology() now pushes a second entry with the legacy name (fragment reconstructed from the diff view; fields other than the name are unchanged):

    topology.push_back(
        TopologyEntry{.name = "cuda:" + std::to_string(i),
                      /* ... other fields ... */
                      .avail_hca = avail_hca});
    topology.push_back(
        TopologyEntry{.name = "gpu:" + std::to_string(i),
                      /* ... other fields ... */
                      .avail_hca = avail_hca});
A Collaborator commented:
Just wondering what the difference is between cuda:x and gpu:x; they seem to be the same thing in torch.

@staryxchen (Contributor, author) replied:

Yes, there is no difference at the Torch API level. However, Mooncake currently selects the device to use based on the segment buffer's location name, which is a plain string comparison. So if a user registers memory and declares it as "gpu:x", Mooncake cannot find a matching entry and falls back to the device corresponding to kWildcardLocation.
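
For context, here is a minimal sketch of the string-matching fallback described above. This is hypothetical code, not Mooncake's implementation; the wildcard value and the data layout are assumptions made for illustration.

    #include <string>
    #include <vector>

    struct TopologyEntry {
        std::string name;  // e.g. "cuda:0", "cpu:0", or the wildcard entry
        // ... HCA lists elided ...
    };

    static const std::string kWildcardLocation = "*";  // assumed wildcard name

    // Resolve a user-supplied location string against the discovered topology.
    const TopologyEntry *resolveLocation(
        const std::vector<TopologyEntry> &topology, const std::string &location) {
        // Exact string comparison: "gpu:0" will not match an entry named "cuda:0".
        for (const auto &entry : topology)
            if (entry.name == location) return &entry;
        // No match: fall back to the wildcard entry, losing device affinity.
        for (const auto &entry : topology)
            if (entry.name == kWildcardLocation) return &entry;
        return nullptr;
    }

With the duplicate "gpu:xx" entries added by this PR, the exact-match step now succeeds for callers that still pass the legacy names.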

A Collaborator replied:

No, I think people should use cuda: instead of gpu:; see the documentation here: https://github.com/kvcache-ai/Mooncake/blob/main/doc/zh/transfer-engine.md#transferengineregisterlocalmemory

A Collaborator commented:

Changing the name may lead to compatibility problems after merging it.

@alogfans (Collaborator) commented Jul 31, 2025:

In other words, you can check whether there is any hard-coded "cuda:" anywhere else.

@staryxchen (Contributor, author) replied:

Mooncake internally (for both location and topology) uses cuda:xx uniformly.

Commit: Add detailed topology info in discovery and transport installation logs
Signed-off-by: staryxchen <[email protected]>
@stmatengss (Collaborator) commented:

I will review it later. @starychen
