feat(topology): add GPU entries in topology discovery #673

staryxchen · 2025-07-25T13:51:28Z

Summary

This PR adds backward compatibility for GPU memory naming by including both "cuda:xx" and "gpu:xx" entries during topology discovery.

Background

In earlier versions of Mooncake, the memory of GPU 0 was referred to as "gpu:0" rather than "cuda:0", so some users still register memory under the name "gpu:0". To maintain compatibility, this PR adds both "cuda:xx" and "gpu:xx" entries during topology discovery.
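The dual-naming idea can be sketched as follows. This is a minimal illustration, not the actual body of discoverCudaTopology(): the TopologyEntry struct here is a simplified stand-in that keeps only the two fields visible in the diff (name and avail_hca), and discoverGpuEntries is a hypothetical helper name.

```cpp
#include <string>
#include <vector>

// Simplified, hypothetical stand-in for the TopologyEntry used in topology.cpp.
struct TopologyEntry {
    std::string name;                    // location name, e.g. "cuda:0"
    std::vector<std::string> avail_hca;  // HCAs usable from this location
};

// Emit one "cuda:<i>" and one "gpu:<i>" entry per device.
// Both aliases carry the same HCA list, so either name resolves identically.
std::vector<TopologyEntry> discoverGpuEntries(
    int device_count, const std::vector<std::string>& avail_hca) {
    std::vector<TopologyEntry> topology;
    for (int i = 0; i < device_count; ++i) {
        const std::string index = std::to_string(i);
        topology.push_back(TopologyEntry{"cuda:" + index, avail_hca});
        topology.push_back(TopologyEntry{"gpu:" + index, avail_hca});
    }
    return topology;
}
```

With two devices this yields four entries, so a buffer registered as either "cuda:1" or "gpu:1" finds an exact match.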

Changes

topology.cpp: Modified discoverCudaTopology() function to add "gpu:xx" entries alongside existing "cuda:xx" entries
transfer_engine.cpp: Output topology information for diagnosing anomalies
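As a rough illustration of the kind of diagnostic output the second change refers to, the sketch below renders a topology as one line per entry. The describeTopology helper, the simplified TopologyEntry struct, and the output format are all assumptions made for this example; the PR does not show the actual log format used in transfer_engine.cpp.

```cpp
#include <cstddef>
#include <sstream>
#include <string>
#include <vector>

// Simplified, hypothetical stand-in for the TopologyEntry used in topology.cpp.
struct TopologyEntry {
    std::string name;
    std::vector<std::string> avail_hca;
};

// Render each entry as "<name> -> [hca, hca, ...]" on its own line,
// so a missing or unexpected location name is easy to spot in a log.
std::string describeTopology(const std::vector<TopologyEntry>& topology) {
    std::ostringstream out;
    for (const auto& entry : topology) {
        out << entry.name << " -> [";
        for (std::size_t i = 0; i < entry.avail_hca.size(); ++i) {
            if (i > 0) out << ", ";
            out << entry.avail_hca[i];
        }
        out << "]\n";
    }
    return out.str();
}
```

The returned string can be passed to whatever logging facility the caller already uses.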

Impact

Maintains backward compatibility with existing code that uses the "gpu:xx" naming convention
Allows users to register memory using either "cuda:xx" or "gpu:xx" format
No breaking changes to existing functionality

Signed-off-by: staryxchen <[email protected]>

stmatengss · 2025-07-27T04:33:43Z

mooncake-transfer-engine/src/topology.cpp

+                                         .avail_hca = avail_hca});
        topology.push_back(
-            TopologyEntry{.name = "cuda:" + std::to_string(i),
+            TopologyEntry{.name = "gpu:" + std::to_string(i),


Just wondering what the difference is between cuda:x and gpu:x; they seem to be the same thing in torch.

staryxchen · Jul 27, 2025

Yes, there is no difference in the Torch API. However, Mooncake currently selects the device to use based on the segment buffer name, which is a plain string comparison. Therefore, if the user registers memory and declares its location as "gpu:", Mooncake cannot find a matching string and falls back to the device corresponding to kWildcardLocation.
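The fallback behaviour described in this comment can be sketched roughly as below. This is an illustration under assumptions, not Mooncake's actual selection code: selectDevices and HcaList are hypothetical names, the topology is modelled as a plain map from location name to HCA list, and the value "*" for kWildcardLocation is assumed because the thread names the constant without showing its definition.

```cpp
#include <string>
#include <unordered_map>
#include <vector>

// Assumed value; the thread mentions kWildcardLocation but not its definition.
const std::string kWildcardLocation = "*";

using HcaList = std::vector<std::string>;

// Look up devices by exact string match on the location name.
// If the name is unknown (e.g. "gpu:0" when only "cuda:0" exists),
// fall back to the wildcard entry; return an empty list if that is absent too.
const HcaList& selectDevices(
    const std::unordered_map<std::string, HcaList>& topology,
    const std::string& location) {
    static const HcaList kEmpty;
    auto it = topology.find(location);
    if (it == topology.end()) it = topology.find(kWildcardLocation);
    return it == topology.end() ? kEmpty : it->second;
}
```

Under this model, a topology that contains only "cuda:0" sends a "gpu:0" request to the wildcard devices, whereas adding a "gpu:0" alias makes the same request resolve to the GPU-local devices.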

doujiang24 · Jul 27, 2025

Nope, I do think people should use cuda: instead of gpu:; see the documentation here: https://github.com/kvcache-ai/Mooncake/blob/main/doc/zh/transfer-engine.md#transferengineregisterlocalmemory

alogfans · Jul 28, 2025

Changing the name may lead to compatibility problems after merging.

alogfans · Jul 31, 2025

In other words, you can check whether "cuda:" is hard-coded anywhere else.

staryxchen · Aug 1, 2025

Mooncake's internals (location and topology) uniformly use cuda:xx.


stmatengss · 2025-08-09T16:22:47Z

I will review it later. @starychen

feat(topology): add GPU entries alongside CUDA entries in topo discovery

3e052c0

Signed-off-by: staryxchen <[email protected]>

staryxchen force-pushed the opt/gpu branch from 5209952 to 1683a5c on July 25, 2025, 15:22

stmatengss reviewed Jul 27, 2025


feat(transfer_engine): enhance topology logging

650ad76

Add detailed topology info in discovery and transport installation logs

Signed-off-by: staryxchen <[email protected]>

staryxchen force-pushed the opt/gpu branch from 1683a5c to 650ad76 on July 27, 2025, 09:21
