-
Notifications
You must be signed in to change notification settings - Fork 459
add heterogeneous-tcp #1126
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
add heterogeneous-tcp #1126
Conversation
Summary of ChangesHello @AscendTransport, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request enhances the Highlights
Using Gemini Code AssistThe full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips. Invoking Gemini You can request assistance from Gemini at any point by creating a comment using either
Customization To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a Limitations & Feedback Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here. You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension. Footnotes
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code Review
This pull request introduces a new HeterogeneousTcpTransport to enable communication between Ascend NPUs and GPUs over TCP. The changes are extensive, touching numerous CMake files to integrate the new build option and source files, and adding the new transport's header and implementation. My review has identified several critical issues, including syntax errors in C++ code and logical flaws in CMake scripts that are likely to cause build failures. Additionally, I've noted a high-severity performance issue related to busy-waiting and multiple medium-severity concerns regarding code maintainability, style, and potential copy-paste errors. I have provided specific code suggestions to address these findings.
| #ifdef USE_ASCEND_HETEROGENEOUS_TCP | ||
| else if (std::string(proto) == "ascend") { | ||
| transport = new HeterogeneousTcpTransport(); | ||
| } | ||
| #endif |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This block continues a pattern of broken if/else if chains due to the use of separate #ifdef blocks. If both USE_ASCEND_HETEROGENEOUS and USE_ASCEND_HETEROGENEOUS_TCP are defined, this will cause a compile error. The logic for selecting the 'ascend' transport should be consolidated into a single else if block with nested #if/#elif/#endif preprocessor directives to correctly handle mutual exclusivity and prevent build failures.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maintain consistency with the aforementioned branch
| free_list_.resize(kSlabSizeKBTabLen); | ||
| #if !defined(USE_ASCEND) && !defined(USE_ASCEND_DIRECT) && \ | ||
| !defined(USE_ASCEND_HETEROGENEOUS) | ||
| !defined(USE_ASCEND_HETEROGENEOUS) && && !defined(USE_ASCEND_HETEROGENEOUS_TCP) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is a syntax error in the preprocessor directive. The && && is a typo and will cause a compilation failure. It should be a single &&.
| !defined(USE_ASCEND_HETEROGENEOUS) && && !defined(USE_ASCEND_HETEROGENEOUS_TCP) | |
| !defined(USE_ASCEND_HETEROGENEOUS) && !defined(USE_ASCEND_HETEROGENEOUS_TCP) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed
| if (transferQueues_.empty()) { | ||
| transfer_cond_.wait(lock, [this] { return !transferQueues_.empty() || !running_; }); | ||
| if (transferQueues_.empty()) { | ||
| continue; | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This block of code for waiting on the transfer queue is syntactically incorrect and will not compile. The if statement is unnecessary and its structure is malformed. The correct pattern is to call wait on the condition variable directly with a predicate. This also correctly handles spurious wakeups.
transfer_cond_.wait(lock, [this] { return !transferQueues_.empty() || !running_; });
if (transferQueues_.empty()) {
continue;
}| if (USE_ASCEND_HETEROGENEOUS OR USE_ASCEND_HETEROGENEOUS_TCP) | ||
| add_subdirectory(ascend_transport) | ||
| target_sources(transport PUBLIC $<TARGET_OBJECTS:ascend_transport>) | ||
| endif() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This if block is redundant. The if block on lines 27-30 already handles adding the ascend_transport subdirectory for other USE_ASCEND flags. If any of those flags are enabled along with USE_ASCEND_HETEROGENEOUS or USE_ASCEND_HETEROGENEOUS_TCP, add_subdirectory(ascend_transport) will be called twice, causing a CMake error. These conditions should be combined into a single if block.
| while (transfer_counter_.load() < usedHugeDevNum) { | ||
| std::this_thread::yield(); | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This while loop implements a busy-wait using std::this_thread::yield(). This is inefficient and can consume significant CPU resources, especially if the wait is long. A more efficient approach would be to use a std::condition_variable or a C++20 std::counting_semaphore to wait for the transfer tasks to complete without busy-looping.
| if (USE_ASCEND_HETEROGENEOUS OR USE_ASCEND_HETEROGENEOUS_TCP) | ||
| file(GLOB ASCEND_TOOLKIT_ROOT "/usr/local/Ascend/ascend-toolkit/latest/*-linux") | ||
| set(ASCEND_LIB_DIR "${ASCEND_TOOLKIT_ROOT}/lib64") | ||
| set(ASCEND_INCLUDE_DIR "${ASCEND_TOOLKIT_ROOT}/include") | ||
| add_compile_definitions(USE_ASCEND_HETEROGENEOUS) | ||
| if (USE_ASCEND_HETEROGENEOUS) | ||
| add_compile_definitions(USE_ASCEND_HETEROGENEOUS) | ||
| endif() | ||
| if (USE_ASCEND_HETEROGENEOUS_TCP) | ||
| add_compile_definitions(USE_ASCEND_HETEROGENEOUS_TCP) | ||
| endif() | ||
| include_directories(/usr/local/include /usr/include ${ASCEND_INCLUDE_DIR}) | ||
| link_directories(${ASCEND_LIB_DIR}) | ||
| endif() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is duplicated logic for setting up the Ascend toolkit across multiple CMake files (e.g., mooncake-transfer-engine/CMakeLists.txt, mooncake-transfer-engine/src/CMakeLists.txt). To improve maintainability and avoid inconsistencies, this configuration should be centralized in this common file. The other files should not repeat this setup.
| return -1; | ||
| } | ||
|
|
||
| hostAddr_ = static_cast<char*>(aligned_alloc(64, HUGE_HOST_SIZE)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| LOG(ERROR) << "rdma transport registerLocalMemory error, ret: " | ||
| << ret; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| LOG(ERROR) << "rdma transport unregisterLocalMemory error, ret: " | ||
| << ret; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| #define HUGE_HOST_SIZE (3ULL * 1024 * 1024 * 1024) | ||
| #define HUGE_DEVICE_SIZE (8 * 1024 * 1024) | ||
| #define HUGE_DEVICE_NUM 4 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In modern C++, it is preferred to use constexpr for compile-time constants instead of #define macros. constexpr provides type safety and is scoped, which avoids potential issues with macro expansion.
| #define HUGE_HOST_SIZE (3ULL * 1024 * 1024 * 1024) | |
| #define HUGE_DEVICE_SIZE (8 * 1024 * 1024) | |
| #define HUGE_DEVICE_NUM 4 | |
| constexpr uint64_t HUGE_HOST_SIZE = 3ULL * 1024 * 1024 * 1024; | |
| constexpr uint64_t HUGE_DEVICE_SIZE = 8 * 1024 * 1024; | |
| constexpr int HUGE_DEVICE_NUM = 4; |
d513b3a to
7608bd5
Compare
7608bd5 to
12fd75b
Compare
| Status getTransferStatus(BatchID batch_id, size_t task_id, | ||
| TransferStatus &status) override; | ||
|
|
||
| private: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is this change for?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Similar to rdma.h, for heterogeneous computing cards, some interfaces need to be exposed for external calls.
What is this change for?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What's the difference between ascend_transport and tcp_transport?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- It supports heterogeneous devices, while ascend_transport does not support heterogeneous devices.
- This pull request is only suitable for situations without RDMA hardware, so it is placed in this directory.
Description
Type of Change
How Has This Been Tested?
Checklist