feat: Add support for KVCache transfer from KVCache reuse path #6348
base: main
Conversation
📝 Walkthrough

Replaces DataResponder/DataRequester/DataSender/DataReceiver with CacheSender/CacheReceiver, introduces TransferSession and session-based cache-formatting/transfer APIs, adds block-hash driven send/receive paths and pinning APIs, removes legacy impl files, updates agent/connection naming, and adjusts tests and build lists accordingly.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Request as LlmRequest
    participant Sender as CacheSender
    participant ConnMgr as ConnectionManager
    participant Formatter as BaseCacheFormatter
    participant Receiver as CacheReceiver
    Request->>Sender: sendAsync(request)
    Sender->>ConnMgr: build TransferSession (connections, states, blockHashes)
    Sender->>Formatter: format(TransferSession)
    Sender->>ConnMgr: transmit data (async)
    ConnMgr->>Receiver: deliver request info / data
    Receiver->>Formatter: unformat(TransferSession)
    Receiver->>Request: complete receive (future)
```
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~40 minutes
Actionable comments posted: 3
🔭 Outside diff range comments (4)
cpp/tensorrt_llm/executor/cache_transmission/agent_utils/connection.h (1)

18-18: Use preprocessor guard instead of pragma once.

According to coding guidelines, header files should use a preprocessor guard with prefix TRTLLM_ followed by the filename in caps. Replace #pragma once with a proper header guard:

```diff
-#pragma once
+#ifndef TRTLLM_CONNECTION_H
+#define TRTLLM_CONNECTION_H
```

And add at the end of the file:

```diff
+#endif // TRTLLM_CONNECTION_H
```
cpp/include/tensorrt_llm/batch_manager/cacheTransceiver.h (2)

17-17: Use preprocessor guard instead of pragma once.

According to coding guidelines, header files should use a preprocessor guard with prefix TRTLLM_ followed by the filename in caps. Replace #pragma once with a proper header guard:

```diff
-#pragma once
+#ifndef TRTLLM_CACHETRANSCEIVER_H
+#define TRTLLM_CACHETRANSCEIVER_H
```

And add at the end of the file:

```diff
+#endif // TRTLLM_CACHETRANSCEIVER_H
```

2-2: Update copyright year to include 2025.

According to coding guidelines, all TensorRT-LLM code should contain an NVIDIA copyright header that includes the current year (2025).

```diff
- * Copyright (c) 2023-2024, NVIDIA CORPORATION. All rights reserved.
+ * Copyright (c) 2023-2025, NVIDIA CORPORATION. All rights reserved.
```

cpp/tests/batch_manager/cacheTransceiverTest.cpp (1)

655-658: Remove unused variable assignment.

The variable cacheType is assigned but never used. This appears to be dead code.

```diff
-    if (kvFactor == 1)
-    {
-        auto cacheType = CacheType::kSELFKONLY;
-    }
```
🧹 Nitpick comments (2)
examples/disaggregated/clients/disagg_client.py (1)

81-126: Consider adding timing measurements to send_chat_request for consistency.

The send_chat_request function lacks the same TTFT and ITL timing measurements that were added to send_request. This creates inconsistent behavior between the two endpoints. Consider adding similar timing measurements to send_chat_request, or extracting the timing logic into a shared helper function to maintain consistency across both endpoints.

examples/disaggregated/clients/prompts.json (1)

1-3: Consider maintaining diverse test prompts for comprehensive testing.

While simplifying to a single prompt reduces complexity, having diverse prompts helps test various scenarios (different lengths, complexity, languages, etc.). Consider whether this reduction impacts test coverage adequately.
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (13)

- cpp/include/tensorrt_llm/batch_manager/cacheTransceiver.h (2 hunks)
- cpp/tensorrt_llm/batch_manager/CMakeLists.txt (0 hunks)
- cpp/tensorrt_llm/batch_manager/cacheFormatter.h (3 hunks)
- cpp/tensorrt_llm/batch_manager/cacheTransceiver.cpp (7 hunks)
- cpp/tensorrt_llm/batch_manager/dataTransceiver.cpp (15 hunks)
- cpp/tensorrt_llm/batch_manager/dataTransceiver.h (3 hunks)
- cpp/tensorrt_llm/batch_manager/dataTransceiverImpl.cpp (0 hunks)
- cpp/tensorrt_llm/batch_manager/dataTransceiverImpl.h (0 hunks)
- cpp/tensorrt_llm/executor/cache_transmission/agent_utils/connection.cpp (2 hunks)
- cpp/tensorrt_llm/executor/cache_transmission/agent_utils/connection.h (2 hunks)
- cpp/tests/batch_manager/cacheTransceiverTest.cpp (11 hunks)
- examples/disaggregated/clients/disagg_client.py (3 hunks)
- examples/disaggregated/clients/prompts.json (1 hunk)
💤 Files with no reviewable changes (3)
- cpp/tensorrt_llm/batch_manager/CMakeLists.txt
- cpp/tensorrt_llm/batch_manager/dataTransceiverImpl.h
- cpp/tensorrt_llm/batch_manager/dataTransceiverImpl.cpp
🧰 Additional context used
📓 Path-based instructions (4)
**/*.{cpp,h,hpp,cc,cxx}
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
**/*.{cpp,h,hpp,cc,cxx}
: Closing braces of namespaces should have a comment saying the namespace it closes (e.g., } // namespace foo)
Prefer const or constexpr variables over #defines whenever possible, as the latter are not visible to the compiler.
A variable that is not modified after its initialization should be declared as const.
Except 0, nullptr, true, false, all other literals should only be used for variable initialization.
Use the Allman indentation style for braces in C++ code.
Put the semicolon for an empty for or while loop in a new line.
The statement forming the body of a switch, while, do .. while or for statement shall be a compound statement (use brace-delimited statements).
If and else should always be followed by brace-delimited statements, even if empty or a single statement.
C++ filenames should use camel case with first letter lowercase (e.g., thisIsAFilename.cpp), and must be case-insensitive unique within a compilation target.
All types (including class names) should use camel case with uppercase first letter (e.g., FooBarClass).
Local variables, methods, and namespaces should use camel case with first letter lowercase (e.g., localFooBar).
Non-magic-number global variables that are non-static and not defined in anonymous namespace should use camel case prefixed by a lower case 'g' (e.g., gDontUseGlobalFoos).
Non-magic-number global variables that are static or defined in an anonymous namespace should use camel case prefixed by a lower case 's' (e.g., sMutableStaticGlobal).
Locally visible static variable should use camel case with lowercase prefix 's' as the first letter of the name (e.g., static std::once_flag sFlag;).
Class member variables should use camel case prefixed with an 'm' (e.g., mNbFooValues). Public member variables do not require the 'm' prefix but it is encouraged for clarity.
Enumerations, global constants, static constants at class-scope, and function-scope magic-number/literal constants are uppercase snakecase with prefix...
Files:
cpp/tensorrt_llm/executor/cache_transmission/agent_utils/connection.cpp
cpp/tensorrt_llm/executor/cache_transmission/agent_utils/connection.h
cpp/tensorrt_llm/batch_manager/cacheTransceiver.cpp
cpp/include/tensorrt_llm/batch_manager/cacheTransceiver.h
cpp/tensorrt_llm/batch_manager/cacheFormatter.h
cpp/tensorrt_llm/batch_manager/dataTransceiver.h
cpp/tests/batch_manager/cacheTransceiverTest.cpp
cpp/tensorrt_llm/batch_manager/dataTransceiver.cpp
**/*.{cpp,h,hpp,cc,cxx,cu,py}
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
All TensorRT-LLM Open Source Software code should contain an NVIDIA copyright header that includes the current year. This includes .cpp, .h, .cu, .py, and any other source files which are compiled or interpreted.
Files:
cpp/tensorrt_llm/executor/cache_transmission/agent_utils/connection.cpp
examples/disaggregated/clients/disagg_client.py
cpp/tensorrt_llm/executor/cache_transmission/agent_utils/connection.h
cpp/tensorrt_llm/batch_manager/cacheTransceiver.cpp
cpp/include/tensorrt_llm/batch_manager/cacheTransceiver.h
cpp/tensorrt_llm/batch_manager/cacheFormatter.h
cpp/tensorrt_llm/batch_manager/dataTransceiver.h
cpp/tests/batch_manager/cacheTransceiverTest.cpp
cpp/tensorrt_llm/batch_manager/dataTransceiver.cpp
**/*.py
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
**/*.py
: Python code should conform to Python 3.8+.
Indent Python code with 4 spaces. Do not use tabs.
Always maintain the namespace when importing in Python, even if only one class or function from a module is used.
Python filenames should use snake_case (e.g., some_file.py).
Python classes should use PascalCase (e.g., class SomeClass).
Python functions and methods should use snake_case (e.g., def my_awesome_function():).
Python local variables should use snake_case. Prefix k for variable names that start with a number (e.g., k_99th_percentile).
Python global variables should use upper snake_case and prefix G (e.g., G_MY_GLOBAL).
Python constants should use upper snake_case (e.g., MY_CONSTANT).
Avoid shadowing variables declared in an outer scope in Python.
Initialize all externally visible members of a Python class in the constructor.
For interfaces that may be used outside a Python file, prefer docstrings over comments.
Comments in Python should be reserved for code within a function, or interfaces that are local to a file.
Use Google style docstrings for Python classes and functions, which can be parsed by Sphinx.
Attributes and variables in Python can be documented inline; attribute docstrings will be rendered under the docstring for the class.
Avoid using reflection in Python when functionality can be easily achieved without it.
When using try-except blocks in Python, limit the except to the smallest set of errors possible.
When using try-except blocks to handle multiple possible variable types in Python, keep the body of the try as small as possible, using the else block to implement the logic.
Files:
examples/disaggregated/clients/disagg_client.py
**/*.{h,hpp}
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
Use a preprocessor guard in header files. The guard name must have prefix TRTLLM_ followed by the filename, all in caps, and no trailing underscore.
Files:
cpp/tensorrt_llm/executor/cache_transmission/agent_utils/connection.h
cpp/include/tensorrt_llm/batch_manager/cacheTransceiver.h
cpp/tensorrt_llm/batch_manager/cacheFormatter.h
cpp/tensorrt_llm/batch_manager/dataTransceiver.h
🧠 Learnings (3)
cpp/tensorrt_llm/batch_manager/cacheFormatter.h (1)
Learnt from: amitz-nv
PR: #5616
File: tensorrt_llm/executor/worker.py:375-384
Timestamp: 2025-07-17T09:01:27.402Z
Learning: In tensorrt_llm/executor/worker.py, the LoRA adapter cache optimization logic that checks is_adapter_in_cpu_cache()
and conditionally passes None for weights/config has a known race condition issue that cannot be solved with simple error handling or verification checks. This is a known limitation that requires a more comprehensive solution.
cpp/tensorrt_llm/batch_manager/dataTransceiver.h (1)
Learnt from: amitz-nv
PR: #5616
File: tensorrt_llm/executor/worker.py:375-384
Timestamp: 2025-07-17T09:01:27.402Z
Learning: In tensorrt_llm/executor/worker.py, the LoRA adapter cache optimization logic that checks is_adapter_in_cpu_cache()
and conditionally passes None for weights/config has a known race condition issue that cannot be solved with simple error handling or verification checks. This is a known limitation that requires a more comprehensive solution.
cpp/tensorrt_llm/batch_manager/dataTransceiver.cpp (1)
Learnt from: amitz-nv
PR: #5616
File: tensorrt_llm/executor/worker.py:375-384
Timestamp: 2025-07-17T09:01:27.402Z
Learning: In tensorrt_llm/executor/worker.py, the LoRA adapter cache optimization logic that checks is_adapter_in_cpu_cache()
and conditionally passes None for weights/config has a known race condition issue that cannot be solved with simple error handling or verification checks. This is a known limitation that requires a more comprehensive solution.
🔇 Additional comments (24)

cpp/tensorrt_llm/executor/cache_transmission/agent_utils/connection.h (1)

178-178: Consistent renaming verified.

I performed a repository-wide search and confirmed there are no remaining occurrences of mReceiverBufferDesc. All references in

- cpp/tensorrt_llm/executor/cache_transmission/agent_utils/connection.h
- cpp/tensorrt_llm/executor/cache_transmission/agent_utils/connection.cpp

have been updated to mCacheReceiverBufferDesc. LGTM!
cpp/tensorrt_llm/executor/cache_transmission/agent_utils/connection.cpp (2)

84-84: LGTM! Consistent implementation of the renamed member variable.

The changes correctly implement the renamed mCacheReceiverBufferDesc member variable throughout the implementation, maintaining consistency with the header file updates. Also applies to: 140-142.

468-468: LGTM! Proper namespace closing comment.

The namespace closing comment follows the coding guidelines correctly.
cpp/include/tensorrt_llm/batch_manager/cacheTransceiver.h (1)

37-38: Renaming verification complete.

All references to the old class names have been removed:

- No matches for DataResponder
- No matches for DataRequester
- No matches for mDataSender or mDataReceiver

The renaming to CacheSender/CacheReceiver is consistent and complete. No further action required.

cpp/tests/batch_manager/cacheTransceiverTest.cpp (6)
159-185: LGTM! Mock class properly renamed and implemented.

The mock class follows appropriate testing patterns with proper use of Google Mock features.

186-192: LGTM! Mock receiver class properly implemented.

The mock methods are correctly defined for testing cache receiving functionality.

211-230: LGTM! Test properly updated to use renamed classes.

The test correctly uses CacheSender instead of DataResponder while maintaining the original test logic.

232-252: LGTM! Test properly uses CacheReceiver.

The test correctly replaces DataRequester with CacheReceiver while maintaining test functionality.

396-404: LGTM! Cache transceiver setup properly updated.

The setup correctly instantiates CacheSender and CacheReceiver with proper implementations using CacheSenderImpl and CacheReceiverImpl.

762-768: LGTM! Asymmetric cache test properly updated.

The test correctly uses CacheSender and CacheReceiver with their implementation classes.

cpp/tensorrt_llm/batch_manager/cacheTransceiver.cpp (3)
196-200: LGTM! Proper initialization of cache sender and receiver.

The constructor correctly initializes CacheSender and CacheReceiver with appropriate parameters and formatter creation.

236-298: LGTM! Methods properly updated to use new cache classes.

All async send/receive operations correctly use mCacheSender and mCacheReceiver with proper future handling.

369-436: LGTM! Context transfer status checking properly updated.

The method correctly uses mSenderFutures (renamed from mResponderFutures) while maintaining the original synchronization logic.

cpp/tensorrt_llm/batch_manager/cacheFormatter.h (2)
45-122: LGTM! Well-designed TransferSession class.

The class provides a clean abstraction for managing transfer sessions with proper const-correctness, precondition checks, and encapsulation.

163-222: LGTM! Performance measurement helper properly implemented.

The class provides thread-safe performance measurement with proper file output handling. Good use of the RAII pattern for automatic output on destruction.
cpp/tensorrt_llm/batch_manager/dataTransceiver.h (1)

115-183: LGTM! Clean interface design with pImpl pattern.

The CacheSender and CacheReceiver classes provide clean interfaces with proper use of the pImpl idiom for implementation hiding and ABI stability.

cpp/tensorrt_llm/batch_manager/dataTransceiver.cpp (8)
1-32: LGTM: Copyright header and includes look good.

The copyright header contains the current year (2025) as required, and the new include for agent connection utilities aligns with the refactoring objectives.

38-57: Good RAII design and helper functions.

The new helper structures and functions are well-designed:

- Using declarations improve readability
- ReceiveCacheResource follows RAII principles correctly
- tagFromRequestId implements proper bit manipulation for tag generation

The magic number kDATA_TAG{43} is appropriately named as a constexpr constant.
115-174: Well-designed constructor and core methods.

The implementation demonstrates good practices:

- Constructor properly validates inputs with TLLM_CHECK
- Resource management uses smart pointers appropriately
- Thread safety is maintained with proper mutex usage
- Dependency injection pattern is correctly implemented

The new methods getCounterpartsCount and release provide clean session management.

176-228: Complex but well-structured request info handling.

The method correctly handles both agent and direct connection scenarios:

- Appropriate use of dynamic_cast for type discrimination
- Comprehensive error handling with informative messages
- Proper thread safety with mutex protection
- Session management is correctly implemented

The method's complexity is justified given the dual connection handling requirements.
230-345: Excellent thread management and synchronization.

The implementation demonstrates solid concurrent programming practices:

- Clean separation between sync and async operations
- Proper use of condition variables and mutexes for thread coordination
- Exception handling ensures promises are always resolved
- Thread naming (dataTransResp) aids in debugging
- Producer-consumer pattern is correctly implemented

The variable renaming from responder to sender terminology is consistent with the refactor.

405-529: Solid CacheReceiver implementation with good async handling.

The receiver implementation mirrors the sender's good practices:

- Constructor properly validates inputs and manages resources
- Async operations are correctly implemented with std::async
- Complex connection handling for both agent and direct scenarios
- Proper resource management with ReceiveCacheResource
- Thread safety maintained throughout

The symmetry with CacheSender provides a clean, consistent API.
560-698: Well-implemented async resource management.

The async processing implementation demonstrates good concurrent design:

- AsyncResource properly encapsulates thread synchronization primitives
- Worker thread pattern correctly processes queued requests
- Proper termination handling prevents resource leaks
- Exception handling ensures all promises are resolved
- Thread naming (dataTransRequest) aids debugging
- Use of std::deque for the request queue is appropriate

The cleanup logic in the destructor properly terminates all worker threads.

700-736: Clean public interface implementation using pImpl idiom.

The public interfaces are well-designed:

- Proper use of the pImpl idiom with std::unique_ptr<Impl>
- Constructor delegation to the implementation is clean
- Method delegation is straightforward and maintains const correctness
- Defaulted destructors are appropriate with smart pointer members

The namespace closing comment follows the coding guidelines correctly.
Force-pushed from eed674f to 376d131

/bot run

Force-pushed from 376d131 to 90d8774

PR_Github #12929 [ run ] triggered by Bot
Actionable comments posted: 2
🧹 Nitpick comments (1)

cpp/tensorrt_llm/batch_manager/dataTransceiver.cpp (1)

420-424: Consider a thread pool instead of frequent thread creation.

The TODO comment indicates awareness of the performance issue with frequent thread creation. The current implementation uses std::async with std::launch::async, which may create new threads frequently. Consider implementing a thread pool for better performance:

```diff
- return std::async(std::launch::async, &CacheReceiver::Impl::requestSync, this, std::ref(llmRequest));
+ // Use thread pool implementation instead of direct async call
+ return mThreadPool.submit(&CacheReceiver::Impl::requestSync, this, std::ref(llmRequest));
```
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (11)

- cpp/include/tensorrt_llm/batch_manager/cacheTransceiver.h (2 hunks)
- cpp/tensorrt_llm/batch_manager/CMakeLists.txt (0 hunks)
- cpp/tensorrt_llm/batch_manager/cacheFormatter.h (3 hunks)
- cpp/tensorrt_llm/batch_manager/cacheTransceiver.cpp (7 hunks)
- cpp/tensorrt_llm/batch_manager/dataTransceiver.cpp (15 hunks)
- cpp/tensorrt_llm/batch_manager/dataTransceiver.h (3 hunks)
- cpp/tensorrt_llm/batch_manager/dataTransceiverImpl.cpp (0 hunks)
- cpp/tensorrt_llm/batch_manager/dataTransceiverImpl.h (0 hunks)
- cpp/tensorrt_llm/executor/cache_transmission/agent_utils/connection.cpp (2 hunks)
- cpp/tensorrt_llm/executor/cache_transmission/agent_utils/connection.h (2 hunks)
- cpp/tests/batch_manager/cacheTransceiverTest.cpp (11 hunks)
💤 Files with no reviewable changes (3)
- cpp/tensorrt_llm/batch_manager/CMakeLists.txt
- cpp/tensorrt_llm/batch_manager/dataTransceiverImpl.h
- cpp/tensorrt_llm/batch_manager/dataTransceiverImpl.cpp
🚧 Files skipped from review as they are similar to previous changes (6)
- cpp/tensorrt_llm/executor/cache_transmission/agent_utils/connection.cpp
- cpp/include/tensorrt_llm/batch_manager/cacheTransceiver.h
- cpp/tensorrt_llm/executor/cache_transmission/agent_utils/connection.h
- cpp/tensorrt_llm/batch_manager/cacheTransceiver.cpp
- cpp/tensorrt_llm/batch_manager/cacheFormatter.h
- cpp/tests/batch_manager/cacheTransceiverTest.cpp
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Pre-commit Check
🔇 Additional comments (48)
cpp/tensorrt_llm/batch_manager/dataTransceiver.h (17)
24-24: LGTM: Include added for new cache formatter abstraction.

The new include for cacheFormatter.h is correctly added to support the refactored architecture that uses BaseCacheFormatter and TransferSession from the kv_cache_manager namespace.

42-44: LGTM: Type aliases properly updated for the refactored architecture.

The type aliases correctly reference the new abstractions:

- SizeType32 properly aliased from the runtime namespace
- TransferSession correctly aliased from the kv_cache_manager namespace, replacing the previous internal definition

45-58: LGTM: TransceiverTag struct provides proper message tagging.

The TransceiverTag struct correctly defines:

- Enumerated message types for REQUEST_SEND and TERMINATION
- Appropriate tag constants for ID, INFO_SIZE, and INFO message components
- constexpr compile-time constants as per coding guidelines

The naming follows the coding guidelines with uppercase snake_case for constants prefixed with 'k'.
58-58
: LGTM: BaseCacheFormatter type alias correctly defined.The type alias correctly references the
BaseCacheFormatter
from thekv_cache_manager
namespace, maintaining consistency with the new architecture.
115-121
: LGTM: CacheSender constructor signature properly designed.The constructor correctly accepts:
ConnectionManager*
for managing connectionsCacheState
for cache state informationSizeType32
for self indexstd::unique_ptr<BaseCacheFormatter>
for cache formatting operationsThe use of
std::unique_ptr
follows C++ best practices for single resource ownership as specified in the coding guidelines.
122-126
: LGTM: sendAsync method signature correctly updated.The method signature properly:
- Returns
std::future<void>
for asynchronous operation- Takes
LlmRequest&
as parameter- Uses
[[nodiscard]]
attribute appropriately- Includes comprehensive documentation
130-134
: LGTM: Communication state accessors properly updated.The
getCommState()
andsetCommState()
methods correctly use the newexecutor::kv_cache::CommState
type, maintaining the same interface pattern as before but with updated types.
136-150
: LGTM: Additional methods properly defined for the refactored architecture.The new methods correctly provide:
recvRequestInfo()
for receiving request informationsendSync()
for synchronous sendingsendRequestInfo()
returningTransferSession
receiveSync()
acceptingTransferSession&
All methods use appropriate parameter types and return types for the new architecture.
153-153
: LGTM: Destructor properly declared.The destructor is correctly declared and will be implemented to properly clean up resources.
160-177
: LGTM: CacheReceiver class properly designed.The
CacheReceiver
class correctly:
- Has matching constructor signature with
CacheSender
- Provides
receiveAsync()
method for asynchronous reception- Includes
sendRequestInfo()
andreceiveSync()
methods for coordination- Uses proper parameter and return types
- Follows the same design patterns as
CacheSender
24-24
: LGTM! Proper include for new dependency.The inclusion of
cacheFormatter.h
aligns with the refactoring to usekv_cache_manager
types.
42-44
: LGTM! Clean type aliasing approach.The type aliases consolidate dependencies and the
TransferSession
alias properly delegates to thekv_cache_manager
namespace as intended by the refactor.
45-57
: LGTM! Well-structured communication protocol.The
TransceiverTag
struct provides a clean enumeration of message types and associated tag constants for the communication protocol.
58-58
: LGTM! Consistent type aliasing.The
BaseCacheFormatter
alias maintains consistency with the delegation pattern tokv_cache_manager
.
160-182
: LGTM! Well-designed receiver interface.The
CacheReceiver
class interface mirrors the sender pattern appropriately, with proper constructor signature and async receive capability.
184-184
: LGTM! Proper namespace closure comment.The namespace closure comment follows the coding guidelines requirement for namespace comments.
115-158
: Constructor Validation Confirmed – No Action NeededThe
Impl
constructor already enforces non-null and consistent state checks:
- Line 127:
TLLM_CHECK(mManager);
ensuresmanager
is not null.- Line 128:
TLLM_CHECK(mManager->getCommState().getSelfIdx() == selfIndex);
verifiesselfIndex
matches the manager’s state.All public methods simply forward to
mImpl
, maintaining interface consistency.cpp/tensorrt_llm/batch_manager/dataTransceiver.cpp (31)
26-26
: LGTM: Include added for agent connection utilities.The new include for agent connection utilities is correctly added to support the refactored connection management functionality.
38-45
: LGTM: Helper function and type aliases properly implemented.The code correctly:
- Defines type aliases for the agent connection manager and data context
- Implements
tagFromRequestId()
helper function with proper bit manipulation to create unique tags- Uses
constexpr
for the constant as per coding guidelines
47-57
: LGTM: ReceiveCacheResource struct properly designed.The struct correctly:
- Manages buffer manager and CUDA event resources
- Uses move semantics in the constructor
- Follows proper resource management patterns
- Uses member initializer list as per C++ best practices
115-132
: LGTM: CacheSender::Impl constructor properly implemented.

The constructor correctly:
- Initializes all member variables using a member initializer list
- Performs proper null pointer checks with `TLLM_CHECK`
- Validates self index consistency
- Sets up the CUDA device context
- Launches the response thread asynchronously

The use of `std::make_shared` for `CudaStream` follows smart pointer best practices.
134-149
: LGTM: sendAsync method properly implemented with thread synchronization.

The method correctly:
- Creates promise/future pair for asynchronous operation
- Uses proper mutex locking for thread safety
- Adds requests to the ready responses queue
- Notifies waiting threads via condition variable
- Returns the future for caller synchronization
The nested locking approach prevents race conditions.
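The promise/future hand-off described above can be sketched as follows; `SenderSketch` and its members are illustrative stand-ins, not the real `CacheSender` types:

```cpp
#include <cassert>
#include <condition_variable>
#include <deque>
#include <future>
#include <mutex>

// The caller thread enqueues work and gets a future; a response thread
// later fulfills the promise once the transfer completes.
struct Response
{
    int requestId;
    std::promise<void> promise;
};

class SenderSketch
{
public:
    std::future<void> sendAsync(int requestId)
    {
        std::promise<void> promise;
        auto future = promise.get_future();
        {
            std::unique_lock<std::mutex> lock(mMutex);
            mReadyResponses.push_back(Response{requestId, std::move(promise)});
        }
        mCondVar.notify_one(); // wake the response thread
        return future;
    }

    // Body of the response thread's loop: wait for work, then fulfill it.
    void completeOne()
    {
        std::unique_lock<std::mutex> lock(mMutex);
        mCondVar.wait(lock, [this] { return !mReadyResponses.empty(); });
        mReadyResponses.front().promise.set_value();
        mReadyResponses.pop_front();
    }

private:
    std::mutex mMutex;
    std::condition_variable mCondVar;
    std::deque<Response> mReadyResponses;
};
```

Releasing the lock before `notify_one()` (or notifying after the scoped block, as here) keeps the woken thread from immediately blocking on the still-held mutex.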
230-237
: LGTM: sendSync method properly implemented.

The method correctly:
- Looks up the transfer session by request ID
- Sets the LLM request on the session
- Delegates formatting to the formatter
- Includes proper error checking
251-265
: LGTM: sendAndRemoveResponse method with proper exception handling.

The method correctly:
- Uses CUDA device setting for proper context
- Calls synchronous send and cleanup
- Provides proper exception handling with promise notification
- Uses noexcept specification appropriately
267-345
: LGTM: Response thread implementation with comprehensive synchronization.

The response thread correctly:
- Sets thread name for debugging
- Uses proper CUDA device context
- Implements condition variable wait with proper predicates
- Handles termination gracefully
- Manages request counting and cleanup
- Supports both parallel and sequential sending modes
- Includes comprehensive error handling and logging
The thread synchronization logic appears sound with proper use of mutexes and condition variables.
408-418
: LGTM: CacheReceiver::Impl constructor properly implemented.

The constructor follows the same pattern as `CacheSender::Impl` with:
- Proper member initialization
- Null pointer validation
- Self index consistency check
- CUDA device setup
460-463
: LGTM: receiveSync method properly delegates to formatter.

The method correctly delegates the unformatting operation to the formatter using the transfer session, maintaining proper separation of concerns.
465-529
: LGTM: sendRequestInfo method with comprehensive connection handling.

The method correctly:
- Validates formatter support for cache states
- Handles selective cache transfer optimization
- Supports both agent and non-agent connection managers
- Manages buffer index assignment for agent connections
- Creates transfer session with proper parameters
- Includes comprehensive error checking and validation
The dual path handling for agent vs. non-agent connections is properly implemented.
531-547
: LGTM: getReceiveCacheResource method with proper resource management.

The method correctly:
- Uses scoped lock for thread safety
- Implements per-process resource isolation when concurrent mode is enabled
- Creates resources on-demand with proper RAII
- Uses default process string for non-concurrent mode
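A minimal sketch of that per-process lazy-initialization pattern, with `Resource` standing in for the real `ReceiveCacheResource` (buffer manager plus CUDA event); all names here are assumptions:

```cpp
#include <cassert>
#include <map>
#include <memory>
#include <mutex>
#include <string>

struct Resource
{
    std::string owner;
};

class ResourcePool
{
public:
    explicit ResourcePool(bool concurrent)
        : mConcurrent(concurrent)
    {
    }

    Resource& get(std::string const& processId)
    {
        std::scoped_lock<std::mutex> lock(mMutex);
        // Non-concurrent mode funnels every caller into one default entry.
        std::string const key = mConcurrent ? processId : "default";
        auto& slot = mResources[key];
        if (!slot)
        {
            slot = std::make_unique<Resource>(); // created on first use (RAII)
            slot->owner = key;
        }
        return *slot;
    }

private:
    bool mConcurrent;
    std::mutex mMutex;
    std::map<std::string, std::unique_ptr<Resource>> mResources;
};
```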
549-559
: LGTM: sendRequestInfo helper method properly implemented.

The helper method correctly:
- Serializes request info using the established serialization utilities
- Sends data in the proper order (ID, size, then data)
- Uses appropriate data contexts for different message components
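The ID/size/data wire order can be sketched against a plain byte buffer; the real code writes each piece to a connection under its own data context, so the helper below is an assumption-laden illustration only:

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <string>
#include <vector>

// Append the request id, then the payload length, then the payload itself,
// mirroring the "ID, size, then data" ordering described in the review.
static void sendRequestInfoSketch(std::vector<uint8_t>& wire, uint64_t requestId, std::string const& serializedInfo)
{
    auto append = [&wire](void const* data, size_t n)
    {
        auto const* bytes = static_cast<uint8_t const*>(data);
        wire.insert(wire.end(), bytes, bytes + n);
    };
    uint64_t const size = serializedInfo.size();
    append(&requestId, sizeof(requestId)); // 1) request ID
    append(&size, sizeof(size));           // 2) payload size
    append(serializedInfo.data(), size);   // 3) serialized request info
}
```

Sending the size before the payload lets the receiver post an exactly sized receive for the final message.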
700-704
: LGTM: CacheSender public constructor properly implemented.

The constructor correctly creates the implementation with perfect forwarding of the formatter parameter using `std::move`.
706-721
: LGTM: CacheSender public interface methods properly implemented.

All public methods correctly delegate to the implementation:
- `sendAsync()` forwards the request properly
- `getCommState()` and `setCommState()` maintain the interface contract
- The destructor is properly defaulted
723-734
: LGTM: CacheReceiver public interface properly implemented.

The constructor and methods correctly:
- Create the implementation with proper parameter forwarding
- Delegate `receiveAsync()` to the multi-threaded implementation
- Use a defaulted destructor for proper cleanup
176-228
: Dynamic_cast usage is safe – no action needed

Verified that `ConnectionManager` is indeed polymorphic (it declares virtual methods, including `recvConnect`, `getConnections`, and `getCommState`), so the `dynamic_cast` in `recvRequestInfo()` will behave as intended. The interface is stable and handles both agent and non-agent scenarios correctly.
26-26
: LGTM! Updated include for agent utilities.

The include path update aligns with the refactored cache transmission structure.
38-45
: LGTM! Clean helper function with proper naming.

The `tagFromRequestId` function follows proper naming conventions and provides a clean abstraction for generating unique tags from request IDs.
47-57
: LGTM! Well-structured resource management.

The `ReceiveCacheResource` struct properly encapsulates buffer manager and CUDA event resources with an appropriate constructor.
134-149
: LGTM! Proper async pattern with promise/future.

The `sendAsync` method correctly implements the promise/future pattern with proper synchronization using mutexes and condition variables.
176-228
: LGTM! Comprehensive request info handling with proper validation.

The `recvRequestInfo` method properly handles both agent and direct connection scenarios, with appropriate validation of cache state compatibility and connection management.
230-237
: LGTM! Clean synchronous send implementation.

The `sendSync` method properly retrieves the session and delegates to the formatter, with appropriate error checking.
405-418
: LGTM! Consistent constructor pattern.

The `CacheReceiver::Impl` constructor follows the same validation and initialization pattern as the sender.
465-529
: LGTM! Comprehensive request info sending with agent support.

The `sendRequestInfo` method properly handles both agent and direct connection scenarios, with appropriate buffer management and connection selection logic.
531-547
: LGTM! Thread-safe resource management with proper scoping.

The `getReceiveCacheResource` method uses proper scoped locking and resource management patterns for concurrent access.
549-559
: LGTM! Clean request info serialization.

The `sendRequestInfo` helper method properly serializes and sends request information using the defined communication protocol.
700-735
: LGTM! Clean public interface implementation.

The public interface methods properly delegate to the pImpl pattern implementations with appropriate parameter forwarding and resource management.
736-736
: LGTM! Proper namespace closure comment.

The namespace closure comment follows the coding guidelines requirement.
267-345
: Thread name and exception handling verified

The thread name "dataTransResp" is only set here and does not conflict with other threads. Exception propagation via `set_exception(std::current_exception())` matches the pattern used elsewhere (e.g., in `workerPool.h`). No further changes are needed.
115-132
: Ensure ConnectionManager Access Is Thread-Safe

I didn't find any explicit synchronization (mutexes/locks) around calls to `mManager` in this class. Since you spawn a background thread via `std::async` for `Impl::response`, concurrent calls to `mManager` could race unless `ConnectionManager` itself is fully thread-safe.

Please confirm or address this:
• Constructor (lines 115–132): validates and reads from `mManager`
• `Impl::response` (launched async): invokes `mManager->recvConnect(...)` and other methods
• `receiveAsync()` / other methods: call `mManager->getConnections(...)`

If `executor::kv_cache::ConnectionManager` is not inherently thread-safe, wrap all shared calls in a mutex or otherwise serialize access to `mManager`.
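One way to act on that suggestion is to funnel every manager call through a single mutex; `FakeManager` below is a hypothetical stand-in for `ConnectionManager`, shown only to illustrate the serialization wrapper:

```cpp
#include <cassert>
#include <mutex>

struct FakeManager
{
    int connectCalls{0};

    void recvConnect()
    {
        ++connectCalls;
    }
};

class GuardedManager
{
public:
    explicit GuardedManager(FakeManager* manager)
        : mManager(manager)
    {
    }

    void recvConnect()
    {
        std::lock_guard<std::mutex> lock(mMutex); // serialize access to mManager
        mManager->recvConnect();
    }

private:
    FakeManager* mManager;
    std::mutex mMutex;
};
```

Every public method that touches the wrapped manager takes the same lock, so calls from the response thread and the caller thread can never overlap.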
[[nodiscard]] size_t getCounterpartsCount(LlmRequest::RequestIdType requestId) const
{
    auto it = mRequestToSession.find(requestId);
    TLLM_CHECK(it != mRequestToSession.end());
    return it->second.getConnections().size();
}

void release(LlmRequest::RequestIdType requestId)
{
    auto it = mRequestToSession.find(requestId);
    TLLM_CHECK(it != mRequestToSession.end());
    std::unique_lock<std::mutex> lk(mMtxForMap);
    mRequestToSession.erase(it);
}
🛠️ Refactor suggestion
Check thread safety of session map access.
The `getCounterpartsCount` and `release` methods access the `mRequestToSession` map. Verify that all map accesses are properly synchronized.
[[nodiscard]] size_t getCounterpartsCount(LlmRequest::RequestIdType requestId) const
{
+ std::unique_lock<std::mutex> lk(mMtxForMap);
auto it = mRequestToSession.find(requestId);
TLLM_CHECK(it != mRequestToSession.end());
return it->second.getConnections().size();
}
🤖 Prompt for AI Agents
In cpp/tensorrt_llm/batch_manager/dataTransceiver.cpp around lines 161 to 174,
the methods getCounterpartsCount and release access the mRequestToSession map
without synchronization, which can cause race conditions. To fix this, ensure
that all accesses to mRequestToSession are protected by acquiring the mutex
mMtxForMap before accessing or modifying the map. Add a
std::unique_lock<std::mutex> lock(mMtxForMap) at the start of
getCounterpartsCount and confirm the existing lock in release covers the map
access properly.
PR_Github #12929 [ run ] completed with state
ddc66c6
to
7d3e452
Compare
/bot run
Actionable comments posted: 2
🔭 Outside diff range comments (2)
cpp/tensorrt_llm/batch_manager/dataTransceiver.h (1)
18-18
: Replace #pragma once with required preprocessor guard.

The coding guidelines specifically require preprocessor guards in header files with the format TRTLLM_ followed by the filename in caps. Replace the current directive with the required guard format:

-#pragma once
+#ifndef TRTLLM_DATATRANSCEIVER_H
+#define TRTLLM_DATATRANSCEIVER_H

And add the closing directive at the end of the file:

} // namespace tensorrt_llm::batch_manager
+
+#endif // TRTLLM_DATATRANSCEIVER_H

cpp/tensorrt_llm/batch_manager/cacheFormatter.h (1)
225-261
: Fix member variable naming convention.

The member variable `kvCacheMeasureHelper` should follow the coding guidelines for class member variables. Apply this fix:

- KvCacheMeasureHelper kvCacheMeasureHelper{common::getEnvKVCacheTransferOutputPath()};
+ KvCacheMeasureHelper mKvCacheMeasureHelper{common::getEnvKVCacheTransferOutputPath()};

Class member variables should use camelCase prefixed with 'm' according to the coding guidelines.
♻️ Duplicate comments (2)
cpp/tensorrt_llm/batch_manager/dataTransceiver.cpp (2)
161-174
: Thread safety issue with session map access.

The `getCounterpartsCount` method accesses `mRequestToSession` without proper synchronization, while `release` uses mutex protection. This creates a race condition.

This is the same thread safety issue identified in previous reviews. The `getCounterpartsCount` method needs mutex protection:

[[nodiscard]] size_t getCounterpartsCount(LlmRequest::RequestIdType requestId) const
{
+   std::unique_lock<std::mutex> lk(mMtxForMap);
    auto it = mRequestToSession.find(requestId);
    TLLM_CHECK(it != mRequestToSession.end());
    return it->second.getConnections().size();
}
420-424
: TODO: Address inefficient thread creation pattern.

The method creates a new thread for each async receive operation, which is inefficient as noted in the TODO comment.

This matches the previous review feedback about frequent thread creation. The TODO comment correctly identifies this as needing improvement. Consider using the existing `requestAndReceiveAsyncMultiThreads` pattern or implementing a thread pool.
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (13)
- cpp/include/tensorrt_llm/batch_manager/cacheTransceiver.h (2 hunks)
- cpp/tensorrt_llm/batch_manager/CMakeLists.txt (0 hunks)
- cpp/tensorrt_llm/batch_manager/cacheFormatter.h (3 hunks)
- cpp/tensorrt_llm/batch_manager/cacheTransceiver.cpp (7 hunks)
- cpp/tensorrt_llm/batch_manager/dataTransceiver.cpp (15 hunks)
- cpp/tensorrt_llm/batch_manager/dataTransceiver.h (3 hunks)
- cpp/tensorrt_llm/batch_manager/dataTransceiverImpl.cpp (0 hunks)
- cpp/tensorrt_llm/batch_manager/dataTransceiverImpl.h (0 hunks)
- cpp/tensorrt_llm/executor/cache_transmission/agent_utils/connection.cpp (2 hunks)
- cpp/tensorrt_llm/executor/cache_transmission/agent_utils/connection.h (2 hunks)
- cpp/tensorrt_llm/executor/cache_transmission/ucx_utils/connection.cpp (1 hunks)
- cpp/tests/batch_manager/cacheTransceiverTest.cpp (11 hunks)
- cpp/tests/unit_tests/executor/ucxCommTest.cpp (0 hunks)
💤 Files with no reviewable changes (4)
- cpp/tensorrt_llm/batch_manager/CMakeLists.txt
- cpp/tests/unit_tests/executor/ucxCommTest.cpp
- cpp/tensorrt_llm/batch_manager/dataTransceiverImpl.h
- cpp/tensorrt_llm/batch_manager/dataTransceiverImpl.cpp
✅ Files skipped from review due to trivial changes (1)
- cpp/tensorrt_llm/executor/cache_transmission/ucx_utils/connection.cpp
🚧 Files skipped from review as they are similar to previous changes (5)
- cpp/tensorrt_llm/executor/cache_transmission/agent_utils/connection.h
- cpp/include/tensorrt_llm/batch_manager/cacheTransceiver.h
- cpp/tensorrt_llm/executor/cache_transmission/agent_utils/connection.cpp
- cpp/tests/batch_manager/cacheTransceiverTest.cpp
- cpp/tensorrt_llm/batch_manager/cacheTransceiver.cpp
🧰 Additional context used
📓 Path-based instructions (3)
**/*.{cpp,h,hpp,cc,cxx}
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
**/*.{cpp,h,hpp,cc,cxx}
: Closing braces of namespaces should have a comment saying the namespace it closes (e.g., } // namespace foo)
Prefer const or constexpr variables over #defines whenever possible, as the latter are not visible to the compiler.
A variable that is not modified after its initialization should be declared as const.
Except 0, nullptr, true, false, all other literals should only be used for variable initialization.
Use the Allman indentation style for braces in C++ code.
Put the semicolon for an empty for or while loop in a new line.
The statement forming the body of a switch, while, do .. while or for statement shall be a compound statement (use brace-delimited statements).
If and else should always be followed by brace-delimited statements, even if empty or a single statement.
C++ filenames should use camel case with first letter lowercase (e.g., thisIsAFilename.cpp), and must be case-insensitive unique within a compilation target.
All types (including class names) should use camel case with uppercase first letter (e.g., FooBarClass).
Local variables, methods, and namespaces should use camel case with first letter lowercase (e.g., localFooBar).
Non-magic-number global variables that are non-static and not defined in anonymous namespace should use camel case prefixed by a lower case 'g' (e.g., gDontUseGlobalFoos).
Non-magic-number global variables that are static or defined in an anonymous namespace should use camel case prefixed by a lower case 's' (e.g., sMutableStaticGlobal).
Locally visible static variable should use camel case with lowercase prefix 's' as the first letter of the name (e.g., static std::once_flag sFlag;).
Class member variables should use camel case prefixed with an 'm' (e.g., mNbFooValues). Public member variables do not require the 'm' prefix but it is encouraged for clarity.
Enumerations, global constants, static constants at class-scope, and function-scope magic-number/literal constants are uppercase snakecase with prefix...
Files:
cpp/tensorrt_llm/batch_manager/cacheFormatter.h
cpp/tensorrt_llm/batch_manager/dataTransceiver.h
cpp/tensorrt_llm/batch_manager/dataTransceiver.cpp
**/*.{h,hpp}
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
Use a preprocessor guard in header files. The guard name must have prefix TRTLLM_ followed by the filename, all in caps, and no trailing underscore.
Files:
cpp/tensorrt_llm/batch_manager/cacheFormatter.h
cpp/tensorrt_llm/batch_manager/dataTransceiver.h
**/*.{cpp,h,hpp,cc,cxx,cu,py}
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
All TensorRT-LLM Open Source Software code should contain an NVIDIA copyright header that includes the current year. This includes .cpp, .h, .cu, .py, and any other source files which are compiled or interpreted.
Files:
cpp/tensorrt_llm/batch_manager/cacheFormatter.h
cpp/tensorrt_llm/batch_manager/dataTransceiver.h
cpp/tensorrt_llm/batch_manager/dataTransceiver.cpp
🧠 Learnings (3)
cpp/tensorrt_llm/batch_manager/cacheFormatter.h (1)
Learnt from: amitz-nv
PR: #5616
File: tensorrt_llm/executor/worker.py:375-384
Timestamp: 2025-07-17T09:01:27.402Z
Learning: In tensorrt_llm/executor/worker.py, the LoRA adapter cache optimization logic that checks is_adapter_in_cpu_cache()
and conditionally passes None for weights/config has a known race condition issue that cannot be solved with simple error handling or verification checks. This is a known limitation that requires a more comprehensive solution.
cpp/tensorrt_llm/batch_manager/dataTransceiver.h (1)
Learnt from: amitz-nv
PR: #5616
File: tensorrt_llm/executor/worker.py:375-384
Timestamp: 2025-07-17T09:01:27.402Z
Learning: In tensorrt_llm/executor/worker.py, the LoRA adapter cache optimization logic that checks is_adapter_in_cpu_cache()
and conditionally passes None for weights/config has a known race condition issue that cannot be solved with simple error handling or verification checks. This is a known limitation that requires a more comprehensive solution.
cpp/tensorrt_llm/batch_manager/dataTransceiver.cpp (1)
Learnt from: amitz-nv
PR: #5616
File: tensorrt_llm/executor/worker.py:375-384
Timestamp: 2025-07-17T09:01:27.402Z
Learning: In tensorrt_llm/executor/worker.py, the LoRA adapter cache optimization logic that checks is_adapter_in_cpu_cache()
and conditionally passes None for weights/config has a known race condition issue that cannot be solved with simple error handling or verification checks. This is a known limitation that requires a more comprehensive solution.
🔇 Additional comments (24)
cpp/tensorrt_llm/batch_manager/dataTransceiver.h (6)
1-16
: LGTM! Copyright header complies with coding guidelines.

The NVIDIA copyright header includes the current year (2025) as required by the coding guidelines for TensorRT-LLM Open Source Software.
24-24
: Good addition of cacheFormatter.h include.

The new include aligns with the refactored design that moves the TransferSession definition to the kv_cache_manager namespace.
42-58
: Well-designed TransceiverTag structure follows coding guidelines.

The type aliases and TransceiverTag struct properly follow C++ naming conventions:
- SizeType32 alias uses camelCase as required
- TransferSession alias provides clear abstraction boundary
- Enum values use uppercase naming (REQUEST_SEND, TERMINATION)
- Static constexpr constants use kCONSTANT_NAME pattern with proper k prefix
115-158
: CacheSender class design follows best practices.

The refactored class demonstrates good design principles:
- Proper UpperCamelCase naming for class name
- Method names use lowerCamelCase convention
- Smart pointer usage with unique_ptr for resource management
- Pimpl pattern reduces compilation dependencies
- Constructor parameters use appropriate types (ConnectionManager*, CacheState, etc.)
160-182
: CacheReceiver class maintains design consistency.

The class follows the same excellent design patterns as CacheSender:
- Consistent naming conventions and interface design
- Symmetric constructor signature with CacheSender
- Appropriate method names for receiving operations
- Proper use of pimpl pattern for implementation hiding
184-184
: Perfect namespace closing comment.

The closing brace comment follows the coding guidelines requirement exactly:
} // namespace tensorrt_llm::batch_manager
cpp/tensorrt_llm/batch_manager/cacheFormatter.h (4)
1-16
: Copyright header meets coding guidelines.

The NVIDIA copyright header correctly includes the current year (2025) as required for TensorRT-LLM Open Source Software.
45-121
: TransferSession class demonstrates excellent C++ design practices.

The class properly follows all coding guidelines:
- UpperCamelCase class name
- Member variables with proper 'm' prefix (mConnections, mDataContext, etc.)
- Method names use lowerCamelCase convention
- Proper const-correctness with const methods and const references
- Good use of TLLM_CHECK for assertions
- Clean initializer list in constructor
123-161
: BaseCacheFormatter interface demonstrates clean design.

The interface properly follows coding guidelines:
- Method names use lowerCamelCase convention
- Proper virtual destructor
- Clean signature with TransferSession& parameter
- Appropriate const-correctness
- Good use of [[nodiscard]] attribute for return values
266-266
: Perfect namespace closing comment.

The closing brace comment properly follows coding guidelines with the complete namespace path:
} // namespace tensorrt_llm::batch_manager::kv_cache_manager
cpp/tensorrt_llm/batch_manager/dataTransceiver.cpp (14)
1-16
: Copyright header complies with coding guidelines.

The NVIDIA copyright header correctly includes the current year (2025) as required for TensorRT-LLM Open Source Software.
38-57
: Well-designed utility functions and structures.

The code follows C++ coding guidelines properly:
- The `tagFromRequestId` function uses lowerCamelCase and static linkage
- The `ReceiveCacheResource` struct uses UpperCamelCase naming
- Member variables correctly use the 'm' prefix (mBufferManager, mCudaEvent)
- The constant `kDATA_TAG` follows uppercase with 'k' prefix convention
115-133
: CacheSender::Impl constructor follows best practices.

The constructor demonstrates good C++ practices:
- Proper member initialization list
- Smart pointer usage with unique_ptr
- Appropriate validation with TLLM_CHECK
- Member variables follow 'm' prefix convention
176-228
: Complex but well-structured request info handling.

The method demonstrates good practices:
- Proper type checking with dynamic_cast for agent connections
- Clean lambda usage for agent-specific logic
- Correct mutex synchronization for `mRequestToSession` access
- Good validation of formatter compatibility between cache states
- Appropriate session creation and connection management
230-237
: Clean and focused sendSync implementation.

The method properly delegates to the formatter and manages the session state appropriately.
251-403
: Robust threading and response management.

The response handling demonstrates excellent threading practices:
- Proper exception handling with the promise/future pattern
- Correct CUDA device management across threads
- Thread naming for debugging (`dataTransResp`)
- Proper condition variable usage for synchronization
- Clean termination handling in the destructor
405-418
: CacheReceiver::Impl constructor follows consistent design.

The constructor maintains the same excellent practices as CacheSender::Impl with proper initialization and validation.
426-458
: Excellent async request handling with thread pool pattern.

This method demonstrates the proper solution for efficient async operations:
- Resource pooling per process to avoid thread creation overhead
- Queue-based task distribution with worker threads
- Proper synchronization with mutex and condition variables
- Good exception handling and future management
- This addresses the inefficiency noted in the `receiveAsync` TODO
460-463
: Clean receiveSync delegation.

Simple and appropriate delegation to the formatter's unformat method.
465-559
: Comprehensive sendRequestInfo implementation.

The method handles complex scenarios well:
- Proper validation of formatter support between cache states
- Clean handling of both agent and regular connection managers
- Appropriate serialization of request information
- Good resource management with TransferSession creation
- Proper exception handling and validation throughout
531-559
: Solid resource management with thread safety.

The resource management demonstrates good practices:
- Thread-safe access with `scoped_lock<mutex>`
- Lazy initialization of per-process resources
- Proper RAII with unique_ptr and BufferManager
- A clean helper method for connection communication
561-698
: Excellent async resource management and cleanup.

The implementation demonstrates robust design:
- Proper RAII in the destructor with resource cleanup and future synchronization
- Good CUDA device management and logging in `requestSync`
- A well-designed `AsyncResource` structure for thread coordination
- Proper move semantics in `RequestAndPromise`
- Comprehensive error handling with exception propagation
- Thread naming for debugging (`dataTransRequest`)
700-734
: Clean public interface following pimpl pattern.

The public interface demonstrates excellent design:
- Proper delegation to implementation classes
- Clean constructor forwarding with smart pointer management
- Appropriate method delegation maintaining interface contracts
- Correctly defaulted destructors for pimpl pattern
736-737
: Perfect namespace closing comment.

The closing brace comment properly follows coding guidelines:
} // namespace tensorrt_llm::batch_manager
@@ -18,11 +18,11 @@
#pragma once
🛠️ Refactor suggestion
Replace #pragma once with required preprocessor guard.
The coding guidelines mandate preprocessor guards in header files with the format TRTLLM_ followed by the filename in caps.
Replace with the required guard format:
-#pragma once
+#ifndef TRTLLM_CACHEFORMATTER_H
+#define TRTLLM_CACHEFORMATTER_H
And add the closing directive at the end of the file:
} // namespace tensorrt_llm::batch_manager::kv_cache_manager
+
+#endif // TRTLLM_CACHEFORMATTER_H
🤖 Prompt for AI Agents
In cpp/tensorrt_llm/batch_manager/cacheFormatter.h at line 18, replace the
#pragma once directive with a preprocessor guard following the coding
guidelines. Use the format TRTLLM_CACHEFORMATTER_H for the guard macro. Add
#ifndef TRTLLM_CACHEFORMATTER_H and #define TRTLLM_CACHEFORMATTER_H at the top
of the file, and add the corresponding #endif at the end of the file to close
the guard.
class KvCacheMeasureHelper
{
public:
    KvCacheMeasureHelper(std::string output_path)
        : mOutputPath(std::move(output_path))
    {
    }

    void appendKVCacheTransfer(LlmRequest::RequestIdType requestId, double duration, size_t size)
    {
        auto bandwidth = size * 8 / (duration / 1000) / 1e9;
        if (mOutputPath.empty())
        {
            return;
        }

        std::lock_guard<std::mutex> lock(mMutex);
        mRequestKVCacheTranfserMeasure[requestId].emplace_back(duration, bandwidth);
    }

    ~KvCacheMeasureHelper()
    {
        if (!mRequestKVCacheTranfserMeasure.empty() && !mOutputPath.empty())
        {
            auto rank = mpi::MpiComm::world().getRank();
            std::string outFilePath = mOutputPath + "rank_" + std::to_string(rank) + ".txt";
            std::ofstream outFile(outFilePath);

            TLLM_CHECK_WITH_INFO(outFile.is_open(), "Cannot write to file " + outFilePath);

            size_t numTransferMeasure = mRequestKVCacheTranfserMeasure.begin()->second.size();

            outFile << "RequestID";
            for (size_t i = 0; i < numTransferMeasure; i++)
            {
                outFile << ",TimeDuration,Bandwidth";
            }
            outFile << '\n';

            for (auto const& [requestID, measures] : mRequestKVCacheTranfserMeasure)
            {
                outFile << requestID;

                for (auto const& [time, bandwidth] : measures)
                {
                    outFile << "," << time << "," << bandwidth;
                }
                outFile << '\n';
            }

            outFile.close();
        }
    }

private:
    std::map<LlmRequest::RequestIdType, std::vector<std::pair<double, double>>> mRequestKVCacheTranfserMeasure;
    std::string mOutputPath;
    std::mutex mMutex;
};
🛠️ Refactor suggestion
Fix typo in member variable name and improve parameter naming.
The class implementation is good but has minor naming issues:
Apply these fixes:
-class KvCacheMeasureHelper
-{
-public:
- KvCacheMeasureHelper(std::string output_path)
- : mOutputPath(std::move(output_path))
+class KvCacheMeasureHelper
+{
+public:
+ KvCacheMeasureHelper(std::string outputPath)
+ : mOutputPath(std::move(outputPath))
- std::map<LlmRequest::RequestIdType, std::vector<std::pair<double, double>>> mRequestKVCacheTranfserMeasure;
+ std::map<LlmRequest::RequestIdType, std::vector<std::pair<double, double>>> mRequestKVCacheTransferMeasure;
And update the corresponding usage in the destructor.
Committable suggestion skipped: line range outside the PR's diff.
🤖 Prompt for AI Agents
In cpp/tensorrt_llm/batch_manager/cacheFormatter.h between lines 163 and 221,
fix the typo in the member variable name from mRequestKVCacheTranfserMeasure to
mRequestKVCacheTransferMeasure consistently throughout the class, including in
appendKVCacheTransfer and the destructor. Also, rename the parameters in
appendKVCacheTransfer to more descriptive names such as requestId, durationMs,
and sizeBytes for clarity. Update all usages of these variables accordingly.
PR_Github #13033 [ run ] triggered by Bot
PR_Github #13033 [ run ] completed with state
Signed-off-by: Iman Tabrizian <[email protected]>
Signed-off-by: Iman Tabrizian <[email protected]>
7d3e452
to
1a21dba
Compare
Actionable comments posted: 8
🔭 Outside diff range comments (1)
cpp/include/tensorrt_llm/batch_manager/kvCacheUtils.h (1)
17-17
: Replace #pragma once with required preprocessor guard.

Header files in this project must use include guards named `TRTLLM_<FILENAME>_H` without underscores prefix/suffix. Replace with the required guard format:

-#pragma once
+#ifndef TRTLLM_KVCACHEUTILS_H
+#define TRTLLM_KVCACHEUTILS_H

And add the closing directive at the end of the file:

} // namespace tensorrt_llm::batch_manager::kv_cache_manager
+
+#endif // TRTLLM_KVCACHEUTILS_H
♻️ Duplicate comments (4)
cpp/tensorrt_llm/batch_manager/cacheFormatter.h (2)
18-18
: Replace #pragma once with required preprocessor guard.

The coding guidelines mandate preprocessor guards in header files with the format TRTLLM_ followed by the filename in caps. Replace with the required guard format:

-#pragma once
+#ifndef TRTLLM_CACHEFORMATTER_H
+#define TRTLLM_CACHEFORMATTER_H

And add the closing directive at the end of the file:

} // namespace tensorrt_llm::batch_manager::kv_cache_manager
+
+#endif // TRTLLM_CACHEFORMATTER_H
111-111
: Fix typo in member variable name.

The member variable name contains a typo: "Tranfser" should be "Transfer". Apply this fix:

- std::map<LlmRequest::RequestIdType, std::vector<Measure>> mRequestKVCacheTranfserMeasure;
+ std::map<LlmRequest::RequestIdType, std::vector<Measure>> mRequestKVCacheTransferMeasure;

Also update the usage in line 71:

- mRequestKVCacheTranfserMeasure[requestId].emplace_back(Measure{delay, duration, bandwidth});
+ mRequestKVCacheTransferMeasure[requestId].emplace_back(Measure{delay, duration, bandwidth});

And in the destructor (lines 76 and 86):

- if (!mRequestKVCacheTranfserMeasure.empty() && !mOutputPath.empty())
+ if (!mRequestKVCacheTransferMeasure.empty() && !mOutputPath.empty())
- size_t numTransferMeasure = mRequestKVCacheTranfserMeasure.begin()->second.size();
+ size_t numTransferMeasure = mRequestKVCacheTransferMeasure.begin()->second.size();

And line 95:

- for (auto const& [requestID, measures] : mRequestKVCacheTranfserMeasure)
+ for (auto const& [requestID, measures] : mRequestKVCacheTransferMeasure)

cpp/tensorrt_llm/batch_manager/dataTransceiver.cpp (2)
161-174
: Thread safety issue in `getCounterpartsCount` method.

The method accesses `mRequestToSession` without acquiring the mutex, which could cause race conditions with concurrent modifications.

[[nodiscard]] size_t getCounterpartsCount(LlmRequest::RequestIdType requestId) const
{
+   std::unique_lock<std::mutex> lk(mMtxForMap);
    auto it = mRequestToSession.find(requestId);
    TLLM_CHECK(it != mRequestToSession.end());
    return it->second.getConnections().size();
}
430-434: TODO: Address frequent thread creation issue.

The current implementation creates a new thread for each async receive operation, which is inefficient as noted in the TODO comment. Consider using the existing requestAndReceiveAsyncMultiThreads method or implementing a thread pool:

 [[nodiscard]] std::future<void> receiveAsync(LlmRequest& llmRequest)
 {
-    // TODO: Modify the implementation here to avoid frequent thread creation.
-    return std::async(std::launch::async, &CacheReceiver::Impl::requestSync, this, std::ref(llmRequest));
+    return requestAndReceiveAsyncMultiThreads(llmRequest);
 }
🧹 Nitpick comments (4)
cpp/tensorrt_llm/batch_manager/cacheFormatter.cpp (1)

42-44: Consider improving function signature consistency.

The function now accepts blockHashes but internally converts them to block IDs in line 55. Consider either:

- Renaming the parameter to better reflect its dual purpose
- Accepting block IDs directly if the caller already has them

Consider this alternative approach for better clarity:

 BlockRange getBlockRangeForSending(
-    BaseKVCacheManager* cacheManager, LlmRequest const& llmRequest, std::vector<size_t> const& blockHashes)
+    BaseKVCacheManager* cacheManager, LlmRequest const& llmRequest, std::vector<size_t> const& blockHashesOrIds)

cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h (1)
1-2: Update copyright year to 2025 for new methods.

The file header indicates copyright ending in 2024, but the changes are being made in 2025. Consider updating the copyright year to reflect when these new methods are being added.

cpp/tensorrt_llm/batch_manager/dataTransceiver.h (1)
45-56: Consider making tag constants more maintainable.

The magic numbers for tags (19, 22, 32) should be documented or derived from a base value to make their purpose clearer.

 struct TransceiverTag
 {
     enum class Id : uint64_t
     {
         REQUEST_SEND = 1,
         TERMINATION = 2
     };
-    static constexpr int32_t kID_TAG{19};
-    static constexpr int32_t kINFO_SIZE_TAG{22};
-    static constexpr int32_t kINFO_TAG{32};
+    // Base tag for transceiver operations
+    static constexpr int32_t kBASE_TAG{19};
+    static constexpr int32_t kID_TAG{kBASE_TAG};
+    static constexpr int32_t kINFO_SIZE_TAG{kBASE_TAG + 3};
+    static constexpr int32_t kINFO_TAG{kBASE_TAG + 13};
 };

cpp/tensorrt_llm/batch_manager/dataTransceiver.cpp (1)
475-537: Complex connection management in sendRequestInfo.

The method handles both regular and agent-based connections with extensive branching. Consider extracting the agent-specific logic to a separate method for better maintainability.

 TransferSession sendRequestInfo(LlmRequest const& llmRequest)
 {
     // ... initial setup ...
-    auto* agentConnectionManager = dynamic_cast<executor::kv_cache::AgentConnectionManager*>(mManager);
-    std::optional<size_t> cacheBufferId = std::nullopt;
-    if (agentConnectionManager != nullptr)
-    {
-        cacheBufferId = agentConnectionManager->getCacheTransBufferManager()->assignBufferIndexForRecv();
-        TLLM_CHECK(cacheBufferId.has_value());
-        // memory Desp , validSegmentIdx send
-    }
+    auto cacheBufferId = prepareAgentCacheBuffer();
     // ... connection setup ...
     for (size_t i = 0; i < counterPartConnections.size(); i++)
     {
         auto const* connection = counterPartConnections[i];
-        // if Manager is agentConnectionManager, then send request info to agent
-        auto* agentConnectionManager = dynamic_cast<executor::kv_cache::AgentConnectionManager*>(mManager);
-        if (agentConnectionManager != nullptr)
-        {
-            // TODO: index -> validConnectionIdx conversion
-            auto valideConnectionIdx = std::find(pickUpIdx.begin(), pickUpIdx.end(), i) - pickUpIdx.begin();
-            auto* agentConnection = dynamic_cast<executor::kv_cache::AgentConnection const*>(connection);
-            TLLM_CHECK(agentConnection != nullptr);
-            TLLM_CHECK(cacheBufferId.has_value());
-            const_cast<executor::kv_cache::AgentConnection*>(agentConnection)
-                ->sendRequestAndBufferInfo(requestInfo, cacheBufferId, valideConnectionIdx);
-        }
-        else
-        {
-            sendRequestInfo(connection, requestInfo);
-        }
+        sendRequestInfoToConnection(connection, requestInfo, cacheBufferId, pickUpIdx, i);
     }
     // ... rest of method ...
 }

+private:
+    std::optional<size_t> prepareAgentCacheBuffer()
+    {
+        auto* agentManager = dynamic_cast<executor::kv_cache::AgentConnectionManager*>(mManager);
+        if (agentManager != nullptr)
+        {
+            auto bufferId = agentManager->getCacheTransBufferManager()->assignBufferIndexForRecv();
+            TLLM_CHECK(bufferId.has_value());
+            return bufferId;
+        }
+        return std::nullopt;
+    }
+
+    void sendRequestInfoToConnection(executor::kv_cache::Connection const* connection,
+        RequestInfo const& requestInfo, std::optional<size_t> cacheBufferId,
+        std::vector<size_t> const& pickUpIdx, size_t connectionIdx)
+    {
+        if (cacheBufferId.has_value())
+        {
+            auto validIdx = std::find(pickUpIdx.begin(), pickUpIdx.end(), connectionIdx) - pickUpIdx.begin();
+            auto* agentConn = dynamic_cast<executor::kv_cache::AgentConnection const*>(connection);
+            TLLM_CHECK(agentConn != nullptr);
+            const_cast<executor::kv_cache::AgentConnection*>(agentConn)
+                ->sendRequestAndBufferInfo(requestInfo, cacheBufferId, validIdx);
+        }
+        else
+        {
+            sendRequestInfo(connection, requestInfo);
+        }
+    }
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (18)
- cpp/include/tensorrt_llm/batch_manager/cacheTransceiver.h (2 hunks)
- cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h (5 hunks)
- cpp/include/tensorrt_llm/batch_manager/kvCacheUtils.h (1 hunks)
- cpp/include/tensorrt_llm/batch_manager/llmRequest.h (0 hunks)
- cpp/tensorrt_llm/batch_manager/CMakeLists.txt (0 hunks)
- cpp/tensorrt_llm/batch_manager/cacheFormatter.cpp (3 hunks)
- cpp/tensorrt_llm/batch_manager/cacheFormatter.h (2 hunks)
- cpp/tensorrt_llm/batch_manager/cacheTransceiver.cpp (7 hunks)
- cpp/tensorrt_llm/batch_manager/dataTransceiver.cpp (12 hunks)
- cpp/tensorrt_llm/batch_manager/dataTransceiver.h (3 hunks)
- cpp/tensorrt_llm/batch_manager/dataTransceiverImpl.cpp (0 hunks)
- cpp/tensorrt_llm/batch_manager/dataTransceiverImpl.h (0 hunks)
- cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp (3 hunks)
- cpp/tensorrt_llm/batch_manager/mlaCacheFormatter.cpp (2 hunks)
- cpp/tensorrt_llm/common/envUtils.cpp (0 hunks)
- cpp/tensorrt_llm/executor/cache_transmission/agent_utils/connection.cpp (2 hunks)
- cpp/tensorrt_llm/executor/cache_transmission/agent_utils/connection.h (2 hunks)
- cpp/tensorrt_llm/executor/cache_transmission/ucx_utils/connection.cpp (1 hunks)
💤 Files with no reviewable changes (5)
- cpp/tensorrt_llm/batch_manager/CMakeLists.txt
- cpp/tensorrt_llm/common/envUtils.cpp
- cpp/include/tensorrt_llm/batch_manager/llmRequest.h
- cpp/tensorrt_llm/batch_manager/dataTransceiverImpl.h
- cpp/tensorrt_llm/batch_manager/dataTransceiverImpl.cpp
🚧 Files skipped from review as they are similar to previous changes (5)
- cpp/tensorrt_llm/executor/cache_transmission/agent_utils/connection.h
- cpp/include/tensorrt_llm/batch_manager/cacheTransceiver.h
- cpp/tensorrt_llm/executor/cache_transmission/ucx_utils/connection.cpp
- cpp/tensorrt_llm/batch_manager/cacheTransceiver.cpp
- cpp/tensorrt_llm/executor/cache_transmission/agent_utils/connection.cpp
🧰 Additional context used
📓 Path-based instructions (5)
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh}
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh}
: In C++, close namespaces with a comment naming the namespace (e.g., } // namespace foo)
Prefer const/constexpr variables over #define for constants
Declare variables const if not modified after initialization
Use Allman brace style in C++
C++ filenames use lowerCamelCase and must be case-insensitively unique within a build target
C++ type names use UpperCamelCase
Local variables, methods, and namespaces use lowerCamelCase
Global non-static variables not in anonymous namespace use gPrefix lowerCamelCase (e.g., gExample)
Static globals or globals in anonymous namespaces use sPrefix lowerCamelCase
Locally visible static variables start with 's' (e.g., static std::once_flag sFlag;)
Member variables use mPrefix lowerCamelCase; public members may omit but are encouraged to use 'm'
Constants (enums, global/static/function-scope magic numbers) use kPREFIXED_UPPER_SNAKE (e.g., kDIGIT_NUM)
If macros are unavoidable, use UPPER_SNAKE_CASE (prefer constants over #define)
Constructor parameter that conflicts with a public member name gets trailing underscore (foo_)
Literal suffixes should be uppercase (e.g., 1234L not 1234l)
C++: use spaces only; indent 4 spaces
Run clang-format (LLVM style) before submitting; wrap lines at 120 characters
If formatting must be bypassed, use // clang-format off/on around the section
Prefer smart pointers; use unique_ptr for sole ownership, shared_ptr for shared; weak_ptr only in exceptional cases
Do not use deprecated pre-C++11 smart pointers
Use C++ style comments; avoid C comments except special inline cases; prefer // single-line
Capitalize and punctuate full-sentence comments
Follow Doxygen rules: use //! for comments and //!< for members in C++
Disable code with #if/#endif and mnemonic conditions; avoid commented-out code; avoid dead code
Do not throw exceptions across library boundaries
Use least-forceful casts; avoid removing const/volatile; avoid C-style and functional casts (except constructors); p...
Files:
cpp/include/tensorrt_llm/batch_manager/kvCacheUtils.h
cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp
cpp/tensorrt_llm/batch_manager/mlaCacheFormatter.cpp
cpp/tensorrt_llm/batch_manager/cacheFormatter.cpp
cpp/tensorrt_llm/batch_manager/dataTransceiver.cpp
cpp/tensorrt_llm/batch_manager/cacheFormatter.h
cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h
cpp/tensorrt_llm/batch_manager/dataTransceiver.h
**/*.{h,hpp,hxx,hh,cuh,cpp,cxx,cc,cu}
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
Parameter names must be consistent between declarations and definitions
Files:
cpp/include/tensorrt_llm/batch_manager/kvCacheUtils.h
cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp
cpp/tensorrt_llm/batch_manager/mlaCacheFormatter.cpp
cpp/tensorrt_llm/batch_manager/cacheFormatter.cpp
cpp/tensorrt_llm/batch_manager/dataTransceiver.cpp
cpp/tensorrt_llm/batch_manager/cacheFormatter.h
cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h
cpp/tensorrt_llm/batch_manager/dataTransceiver.h
**/*.{h,hpp,hxx,hh,cuh}
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
Header files must use include guards named TRTLLM_<FILENAME>_H without underscores prefix/suffix (e.g., TRTLLM_FOO_BAR_HELLO_H)
Files:
cpp/include/tensorrt_llm/batch_manager/kvCacheUtils.h
cpp/tensorrt_llm/batch_manager/cacheFormatter.h
cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h
cpp/tensorrt_llm/batch_manager/dataTransceiver.h
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh,py}
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
Prepend NVIDIA copyright header (current year) to all source files
Files:
cpp/include/tensorrt_llm/batch_manager/kvCacheUtils.h
cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp
cpp/tensorrt_llm/batch_manager/mlaCacheFormatter.cpp
cpp/tensorrt_llm/batch_manager/cacheFormatter.cpp
cpp/tensorrt_llm/batch_manager/dataTransceiver.cpp
cpp/tensorrt_llm/batch_manager/cacheFormatter.h
cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h
cpp/tensorrt_llm/batch_manager/dataTransceiver.h
**/*.{cpp,cxx,cc,cu}
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
**/*.{cpp,cxx,cc,cu}
: Avoid literal values except for 0, nullptr, true, false; use named constexpr for other literals
Place semicolon of empty for/while loop on a new line
Always use brace-delimited bodies for switch/while/do-for/if/else
Use inline C comments in argument lists when parameter meaning is unclear (e.g., /* checkForErrors = */ false)
Do not use assignment in subexpressions (e.g., if (x = y) ... is forbidden)
Switch on enums should enumerate all values and omit default to catch new values at compile time
Structure switch statements; prohibit fallthrough except between empty cases; each case ends with break or throw; return at end of case not allowed; put break inside braces for compound case
Prefer anonymous namespaces over static for internal linkage of functions
Every defined function must be called at least once (no unused methods)
Files:
cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp
cpp/tensorrt_llm/batch_manager/mlaCacheFormatter.cpp
cpp/tensorrt_llm/batch_manager/cacheFormatter.cpp
cpp/tensorrt_llm/batch_manager/dataTransceiver.cpp
🧠 Learnings (3)
📚 Learning: 2025-08-06T08:18:28.669Z
Learnt from: zhengd-nv
PR: NVIDIA/TensorRT-LLM#6633
File: cpp/tensorrt_llm/batch_manager/dataTransceiverImpl.cpp:145-155
Timestamp: 2025-08-06T08:18:28.669Z
Learning: In cpp/tensorrt_llm/batch_manager/dataTransceiverImpl.cpp, the existing `mMtxForMap` mutex in DataSenderImpl is sufficient to synchronize measurement file operations in the `release` method, as all file operations occur within the same critical section that protects the `mRequestToSession` map access.
Applied to files:
cpp/tensorrt_llm/batch_manager/dataTransceiver.cpp
cpp/tensorrt_llm/batch_manager/dataTransceiver.h
📚 Learning: 2025-08-12T10:28:57.287Z
Learnt from: CR
PR: NVIDIA/TensorRT-LLM#0
File: CODING_GUIDELINES.md:0-0
Timestamp: 2025-08-12T10:28:57.287Z
Learning: Applies to **/*.{h,hpp,hxx,hh,cuh} : Header files must use include guards named TRTLLM_<FILENAME>_H without underscores prefix/suffix (e.g., TRTLLM_FOO_BAR_HELLO_H)
Applied to files:
cpp/tensorrt_llm/batch_manager/cacheFormatter.h
📚 Learning: 2025-08-08T05:06:31.537Z
Learnt from: sklevtsov-nvidia
PR: NVIDIA/TensorRT-LLM#3294
File: cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/epilogue/fusion/sm90_visitor_scatter.hpp:36-36
Timestamp: 2025-08-08T05:06:31.537Z
Learning: CUTLASS extension files (under cpp/tensorrt_llm/cutlass_extensions/) follow CUTLASS coding style conventions, including using #pragma once instead of TRTLLM_ prefixed header guards, even though they are .hpp files.
Applied to files:
cpp/tensorrt_llm/batch_manager/cacheFormatter.h
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Pre-commit Check
🔇 Additional comments (12)
cpp/tensorrt_llm/batch_manager/cacheFormatter.h (1)

52-53: Constructor parameter name should follow coding guidelines.

Parameter names should use lowerCamelCase without underscores according to the coding guidelines. Apply this fix:

- KvCacheMeasureHelper(std::string output_path)
-     : mOutputPath(std::move(output_path))
+ KvCacheMeasureHelper(std::string outputPath)
+     : mOutputPath(std::move(outputPath))

cpp/tensorrt_llm/batch_manager/mlaCacheFormatter.cpp (1)
96-97: LGTM! Correctly integrates session-based block hash transfer.

The implementation properly retrieves block hashes from the session and passes them to the updated getBlockRangeForSending function, aligning with the new session-based transfer architecture.

cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp (4)
1011-1015: LGTM! Properly implements block hash lookup in reuse tree.

The implementation correctly delegates to the cached blocks root to find matching blocks by hash, supporting the partial reuse and copy-on-partial-reuse configurations.

1468-1474: LGTM! Clean implementation of block pinning across window managers.

The implementation correctly iterates through all window managers to pin blocks for the given sequence, properly propagating the pinning request throughout the hierarchy.

1476-1484: LGTM! Correctly increments reference counts for pinned blocks.

The implementation properly pins all allocated blocks for a sequence by incrementing their reference counts, which will prevent them from being evicted.

2117-2121: LGTM! Clean public API for block pinning.

The implementation provides a simple and clean interface for pinning blocks by request ID, properly retrieving the sequence and delegating to the block manager.
cpp/tensorrt_llm/batch_manager/cacheFormatter.cpp (2)
158-159: LGTM! Correctly integrates block hashes from session.

The implementation properly retrieves block hashes from the session and passes them to the updated getBlockRangeForSending function.

49-56: Selective transfer gating intentionally replaced.

The DisableSelectiveCacheTransfer env-var check has been removed and gating is now driven by poolNum and isEnableBlockReuse(). We confirmed:

- getEnvDisableSelectiveCacheTransfer() is only declared (no usages)
- In getBlockRangeForSending: if poolNum > 1 or !isEnableBlockReuse(), all blocks are sent; otherwise (single pool and reuse enabled), only the requested blockHashes are sent
- The receiving side (getBlockRangeForReceiving) uses the same single-pool/reuse condition for symmetry

No changes are needed; the new logic is intentional and covers all scenarios.
cpp/tensorrt_llm/batch_manager/dataTransceiver.h (2)

1-16: Copyright header updated to 2025 - looks good!

The copyright header correctly reflects the 2025 date for this new implementation.

115-158: Well-structured PIMPL implementation for CacheSender.

The class follows the PIMPL idiom correctly with clear method signatures and proper const-correctness. The documentation is comprehensive.
cpp/tensorrt_llm/batch_manager/dataTransceiver.cpp (2)

240-277: Good error handling in sendAndRemoveResponse.

The method properly handles exceptions and ensures promises are fulfilled or rejected appropriately. The noexcept specification is correct since exceptions are caught internally.

539-555: Resource management looks solid.

The getReceiveCacheResource method properly manages per-process resources with appropriate synchronization and lazy initialization.
void pinBlocks(GenerationRequest& sequence);
🛠️ Refactor suggestion
Add documentation for the new pinBlocks method.

The new pinBlocks method lacks documentation explaining its purpose, behavior, and when it should be called. This is particularly important as it appears to be part of the public API.
Add documentation before the method declaration:
+ //! \brief Pin blocks associated with a sequence to prevent eviction.
+ //! \param sequence The generation request whose blocks should be pinned.
+ //! \details This method marks blocks as pinned in the KV cache to ensure they
+ //! remain available for reuse across multiple requests.
void pinBlocks(GenerationRequest& sequence);
🤖 Prompt for AI Agents
In cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h around lines 569-570,
the public method declaration void pinBlocks(GenerationRequest& sequence); is
missing documentation; add a brief doxygen-style comment immediately above the
declaration describing purpose (what pinning does), expected behavior (which
blocks are pinned, lifetime and thread-safety expectations), when callers should
invoke it (e.g., before generation, after allocation), parameters (explain
GenerationRequest reference and ownership/constness), return/exception behavior
(whether it can fail, how failures are reported), and any side effects (impact
on memory and eviction). Ensure the comment is concise, uses @param and
@throws/@note as appropriate, and matches the style of surrounding method docs
in this header.
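The pinning semantics discussed above ("increment reference counts so pinned blocks are never evicted") can be sketched with toy types. Block, BlockTable, and their methods below are hypothetical stand-ins, not the actual KVCacheManager classes:

```cpp
#include <cstddef>
#include <vector>

// Toy model of ref-count based pinning: a block with refCount > 0
// is considered in use and is skipped by eviction.
struct Block
{
    int refCount{0};
};

class BlockTable
{
public:
    std::size_t addBlock()
    {
        mBlocks.push_back(Block{});
        return mBlocks.size() - 1;
    }

    // Pin every block of a sequence by taking an extra reference,
    // mirroring what pinBlocks is described as doing for a request.
    void pinBlocks(std::vector<std::size_t> const& blockIds)
    {
        for (auto id : blockIds)
        {
            mBlocks.at(id).refCount += 1;
        }
    }

    bool isEvictable(std::size_t id) const
    {
        return mBlocks.at(id).refCount == 0;
    }

private:
    std::vector<Block> mBlocks;
};
```

This is the behavior the requested documentation should spell out: after pinBlocks, the request's blocks cannot be reclaimed until the matching unpin/release drops the count back to zero.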
[[nodiscard]] WindowBlockManager::BlockMapIterRange getBlocksByHash(size_t hash, SizeType32 windowSize) const
{
    return mWindowBlockManagers.at(windowSize).getBlocksByHash(hash);
}

[[nodiscard]] BlockPtr const& getBlockFromReuseTreeByHash(size_t hash, SizeType32 windowSize) const
{
    return mWindowBlockManagers.at(windowSize).getBlockFromReuseTreeByHash(hash);
}
Add bounds checking and documentation for new hash-based lookup methods.
The new hash-based lookup methods getBlocksByHash and getBlockFromReuseTreeByHash access the window manager map without bounds checking and lack documentation.
+ //! \brief Get blocks matching the given hash for a specific window size.
+ //! \param hash The hash value to search for.
+ //! \param windowSize The window size to search within.
+ //! \return Iterator range over matching blocks.
[[nodiscard]] WindowBlockManager::BlockMapIterRange getBlocksByHash(size_t hash, SizeType32 windowSize) const
{
+ auto it = mWindowBlockManagers.find(windowSize);
+ TLLM_CHECK_WITH_INFO(it != mWindowBlockManagers.end(),
+ "Window size %d not found in window block managers", windowSize);
- return mWindowBlockManagers.at(windowSize).getBlocksByHash(hash);
+ return it->second.getBlocksByHash(hash);
}
+ //! \brief Get a block from the reuse tree matching the given hash.
+ //! \param hash The hash value to search for.
+ //! \param windowSize The window size to search within.
+ //! \return Reference to the matching block pointer.
[[nodiscard]] BlockPtr const& getBlockFromReuseTreeByHash(size_t hash, SizeType32 windowSize) const
{
+ auto it = mWindowBlockManagers.find(windowSize);
+ TLLM_CHECK_WITH_INFO(it != mWindowBlockManagers.end(),
+ "Window size %d not found in window block managers", windowSize);
- return mWindowBlockManagers.at(windowSize).getBlockFromReuseTreeByHash(hash);
+ return it->second.getBlockFromReuseTreeByHash(hash);
}
🤖 Prompt for AI Agents
In cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h around lines
1074-1083, the two new methods call mWindowBlockManagers.at(windowSize) without
validating that windowSize exists and they lack documentation; update both
methods to first check that mWindowBlockManagers contains windowSize and handle
the missing-key case gracefully (e.g., return an empty BlockMapIterRange for
getBlocksByHash and return a safe null-equivalent BlockPtr reference or throw a
clear std::out_of_range with contextual message for
getBlockFromReuseTreeByHash), and add Doxygen-style comments above each method
describing the purpose, parameters, return value, and exception behavior so
callers know what to expect.
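The find-then-check pattern the suggestion asks for can be shown in isolation. getWindowValue and the int-keyed map below are simplified stand-ins for the real window-manager lookup, not TRT-LLM API:

```cpp
#include <map>
#include <stdexcept>
#include <string>

// Look up with find() and fail with a contextual message instead of
// letting map::at() throw a bare, context-free std::out_of_range.
int getWindowValue(std::map<int, int> const& managers, int windowSize)
{
    auto it = managers.find(windowSize);
    if (it == managers.end())
    {
        throw std::out_of_range(
            "Window size " + std::to_string(windowSize) + " not found in window block managers");
    }
    return it->second;
}
```

The real fix would use TLLM_CHECK_WITH_INFO rather than throwing directly, but the control flow is the same: validate the key, then dereference the iterator.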
virtual void pinBlocks(LlmRequest::RequestIdType requestId) = 0;
🛠️ Refactor suggestion
Document the pure virtual pinBlocks method in the base class.

The pure virtual pinBlocks method in BaseKVCacheManager lacks documentation explaining the contract that derived classes must fulfill.
+ //! \brief Pin blocks associated with a request to prevent eviction.
+ //! \param requestId The ID of the request whose blocks should be pinned.
+ //! \details Implementations should mark all blocks associated with the request
+ //! as pinned to ensure they remain available for potential reuse.
virtual void pinBlocks(LlmRequest::RequestIdType requestId) = 0;
🤖 Prompt for AI Agents
In cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h around lines
1218-1219, the pure virtual method `pinBlocks(LlmRequest::RequestIdType
requestId)` in BaseKVCacheManager is undocumented; add a brief doxygen-style
comment above the declaration that states the method's contract: what pinning
means, when it will be called, expected behavior (e.g., mark blocks for the
given requestId as pinned to prevent eviction), whether it must be thread-safe,
return/exception expectations (void, should not throw), and any required side
effects or preconditions so derived classes implement consistent behavior.
void pinBlocks(LlmRequest::RequestIdType requestId);
Missing implementation declaration for pinBlocks override.

The pinBlocks method is declared without the override keyword, which could lead to subtle bugs if the signature doesn't match the base class exactly.
+ void pinBlocks(LlmRequest::RequestIdType requestId) override;
🤖 Prompt for AI Agents
In cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h around lines
1607-1608, the method declaration for pinBlocks is missing the override
specifier; update the declaration to match and explicitly mark it as overriding
the base class (e.g. change to "void pinBlocks(LlmRequest::RequestIdType
requestId) override;"), and verify the signature (const/&, noexcept) exactly
matches the base class method so the override is valid.
// TODO(zhengd): pass the hashes directly instead of from llmRequest; use hash instead of block num
auto const& ids = blockRange.getBlockIds();
blockRange.setBlockIds({ids.end() - requestBlockNum, ids.end()});
blockRange.setBlockIds(blockHashes);
Type mismatch: setBlockIds expects std::vector<SizeType32> but receives std::vector<size_t>.

The blockRange.setBlockIds() method expects std::vector<SizeType32> according to kvCacheUtils.h, but blockHashes is of type std::vector<size_t>. This could cause issues on platforms where size_t and SizeType32 have different sizes.

Apply this fix to ensure type safety:
if (requestBlockNum < blockRange.size() && requestBlockNum > 0)
{
- blockRange.setBlockIds(blockHashes);
+ blockRange.setBlockIdsFromHashes(blockHashes);
}
This uses the correct method that accepts std::vector<size_t> for hashes.
🤖 Prompt for AI Agents
In cpp/tensorrt_llm/batch_manager/cacheFormatter.cpp around line 55,
blockRange.setBlockIds(blockHashes) passes a std::vector<size_t> to a function
expecting std::vector<SizeType32>; replace the call with the overload that
accepts std::vector<size_t> (e.g., blockRange.setBlockHashes(blockHashes)) so
the types match and avoid unsafe casts or platform-dependent size mismatches.
void release(LlmRequest::RequestIdType requestId)
{
    auto it = mRequestToSession.find(requestId);
    TLLM_CHECK(it != mRequestToSession.end());
    std::unique_lock<std::mutex> lk(mMtxForMap);
    mRequestToSession.erase(it);
}
Lock acquired too late in release method.

The iterator is obtained before acquiring the lock, which could lead to iterator invalidation if another thread modifies the map.
void release(LlmRequest::RequestIdType requestId)
{
+ std::unique_lock<std::mutex> lk(mMtxForMap);
auto it = mRequestToSession.find(requestId);
TLLM_CHECK(it != mRequestToSession.end());
- std::unique_lock<std::mutex> lk(mMtxForMap);
mRequestToSession.erase(it);
}
🤖 Prompt for AI Agents
In cpp/tensorrt_llm/batch_manager/dataTransceiver.cpp around lines 168-174, the
code calls mRequestToSession.find(requestId) before acquiring mMtxForMap, which
can allow the map to be modified and invalidate the iterator; move the lock
acquisition to before the find, perform the lookup while holding the mutex,
check the iterator under the lock (TLLM_CHECK or equivalent), then erase under
the same lock (use std::unique_lock<std::mutex> or std::lock_guard<std::mutex>
to hold mMtxForMap for the duration of find + check + erase).
/// @return The request information.
[[nodiscard]] RequestInfo recvRequestInfo();
🛠️ Refactor suggestion
Inconsistent method signatures between CacheSender and CacheReceiver.

The sendRequestInfo and receiveSync methods appear in both classes but with different purposes, which could be confusing:

- CacheSender has sendRequestInfo returning TransferSession and receiveSync taking TransferSession&
- CacheReceiver has the same methods with similar signatures

This naming overlap could lead to confusion about which class is responsible for what operations. Consider renaming to clarify the direction of data flow:
class CacheSender
{
// ...
- [[nodiscard]] TransferSession sendRequestInfo(LlmRequest const& llmRequest);
- void receiveSync(TransferSession& session);
+ [[nodiscard]] TransferSession prepareSendSession(LlmRequest const& llmRequest);
+ void completeSendSession(TransferSession& session);
};
class CacheReceiver
{
// ...
- TransferSession sendRequestInfo(LlmRequest const& llmRequest);
- void receiveSync(TransferSession& session);
+ TransferSession prepareReceiveSession(LlmRequest const& llmRequest);
+ void completeReceiveSession(TransferSession& session);
};
Also applies to: 173-175
Signed-off-by: Iman Tabrizian <[email protected]>
1a21dba to bd2757d (Compare)
Actionable comments posted: 8
♻️ Duplicate comments (8)
cpp/tensorrt_llm/batch_manager/dataTransceiver.h (1)
- 144-151: Inconsistent method signatures between CacheSender and CacheReceiver.
cpp/tensorrt_llm/batch_manager/dataTransceiver.cpp (3)
- 161-166: Check thread safety of session map access.
- 168-174: Lock acquired too late in release method.
- 430-434: Address TODO: avoid frequent thread creation.
cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h (4)
- 580-581: Add documentation for the new pinBlocks method.
- 1221-1222: Document the pure virtual pinBlocks method in the base class.
- 1610-1611: Missing implementation declaration for pinBlocks override.
- 901-901: Document the pinBlocks method in BlockManager. The pinBlocks method in BlockManager needs documentation. Add documentation:

+    /// @brief Pin blocks associated with a generation request to prevent eviction.
+    /// @param sequence The generation request whose blocks should be pinned.
     void pinBlocks(GenerationRequest& sequence);
🧹 Nitpick comments (1)
cpp/tensorrt_llm/batch_manager/dataTransceiver.h (1)
- 45-56: Consider making the tag constants more maintainable. The TransceiverTag struct with enum Id and the tag constants looks good. However, the hardcoded integer values (19, 22, 32) for tags might benefit from documentation or being derived from the enum values to avoid magic numbers. Consider adding a comment explaining the choice of these specific tag values:

struct TransceiverTag
{
    enum class Id : uint64_t
    {
        REQUEST_SEND = 1,
        TERMINATION = 2
    };
+   // Tag values for MPI/communication - chosen to avoid conflicts with other subsystems
    static constexpr int32_t kID_TAG{19};
    static constexpr int32_t kINFO_SIZE_TAG{22};
    static constexpr int32_t kINFO_TAG{32};
};
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (13)
- cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h (8 hunks)
- cpp/include/tensorrt_llm/batch_manager/kvCacheUtils.h (1 hunks)
- cpp/include/tensorrt_llm/batch_manager/llmRequest.h (0 hunks)
- cpp/tensorrt_llm/batch_manager/cacheFormatter.cpp (3 hunks)
- cpp/tensorrt_llm/batch_manager/cacheFormatter.h (2 hunks)
- cpp/tensorrt_llm/batch_manager/dataTransceiver.cpp (12 hunks)
- cpp/tensorrt_llm/batch_manager/dataTransceiver.h (3 hunks)
- cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp (3 hunks)
- cpp/tensorrt_llm/batch_manager/mlaCacheFormatter.cpp (2 hunks)
- cpp/tensorrt_llm/common/envUtils.cpp (0 hunks)
- cpp/tensorrt_llm/executor/cache_transmission/agent_utils/connection.cpp (2 hunks)
- cpp/tensorrt_llm/pybind/batch_manager/kvCacheManager.cpp (1 hunks)
- tensorrt_llm/_torch/pyexecutor/py_executor.py (3 hunks)
💤 Files with no reviewable changes (2)
- cpp/tensorrt_llm/common/envUtils.cpp
- cpp/include/tensorrt_llm/batch_manager/llmRequest.h
🚧 Files skipped from review as they are similar to previous changes (5)
- cpp/include/tensorrt_llm/batch_manager/kvCacheUtils.h
- cpp/tensorrt_llm/batch_manager/cacheFormatter.cpp
- cpp/tensorrt_llm/executor/cache_transmission/agent_utils/connection.cpp
- cpp/tensorrt_llm/batch_manager/mlaCacheFormatter.cpp
- cpp/tensorrt_llm/batch_manager/cacheFormatter.h
🧰 Additional context used
📓 Path-based instructions (6)
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh}
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh}
: In C++, close namespaces with a comment naming the namespace (e.g., } // namespace foo)
Prefer const/constexpr variables over #define for constants
Declare variables const if not modified after initialization
Use Allman brace style in C++
C++ filenames use lowerCamelCase and must be case-insensitively unique within a build target
C++ type names use UpperCamelCase
Local variables, methods, and namespaces use lowerCamelCase
Global non-static variables not in anonymous namespace use gPrefix lowerCamelCase (e.g., gExample)
Static globals or globals in anonymous namespaces use sPrefix lowerCamelCase
Locally visible static variables start with 's' (e.g., static std::once_flag sFlag;)
Member variables use mPrefix lowerCamelCase; public members may omit but are encouraged to use 'm'
Constants (enums, global/static/function-scope magic numbers) use kPREFIXED_UPPER_SNAKE (e.g., kDIGIT_NUM)
If macros are unavoidable, use UPPER_SNAKE_CASE (prefer constants over #define)
Constructor parameter that conflicts with a public member name gets trailing underscore (foo_)
Literal suffixes should be uppercase (e.g., 1234L not 1234l)
C++: use spaces only; indent 4 spaces
Run clang-format (LLVM style) before submitting; wrap lines at 120 characters
If formatting must be bypassed, use // clang-format off/on around the section
Prefer smart pointers; use unique_ptr for sole ownership, shared_ptr for shared; weak_ptr only in exceptional cases
Do not use deprecated pre-C++11 smart pointers
Use C++ style comments; avoid C comments except special inline cases; prefer // single-line
Capitalize and punctuate full-sentence comments
Follow Doxygen rules: use //! for comments and //!< for members in C++
Disable code with #if/#endif and mnemonic conditions; avoid commented-out code; avoid dead code
Do not throw exceptions across library boundaries
Use least-forceful casts; avoid removing const/volatile; avoid C-style and functional casts (except constructors); p...
Files:
cpp/tensorrt_llm/pybind/batch_manager/kvCacheManager.cpp
cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h
cpp/tensorrt_llm/batch_manager/dataTransceiver.cpp
cpp/tensorrt_llm/batch_manager/dataTransceiver.h
cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp
**/*.{cpp,cxx,cc,cu}
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
**/*.{cpp,cxx,cc,cu}
: Avoid literal values except for 0, nullptr, true, false; use named constexpr for other literals
Place semicolon of empty for/while loop on a new line
Always use brace-delimited bodies for switch/while/do-for/if/else
Use inline C comments in argument lists when parameter meaning is unclear (e.g., /* checkForErrors = */ false)
Do not use assignment in subexpressions (e.g., if (x = y) ... is forbidden)
Switch on enums should enumerate all values and omit default to catch new values at compile time
Structure switch statements; prohibit fallthrough except between empty cases; each case ends with break or throw; return at end of case not allowed; put break inside braces for compound case
Prefer anonymous namespaces over static for internal linkage of functions
Every defined function must be called at least once (no unused methods)
Files:
cpp/tensorrt_llm/pybind/batch_manager/kvCacheManager.cpp
cpp/tensorrt_llm/batch_manager/dataTransceiver.cpp
cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp
**/*.{h,hpp,hxx,hh,cuh,cpp,cxx,cc,cu}
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
Parameter names must be consistent between declarations and definitions
Files:
cpp/tensorrt_llm/pybind/batch_manager/kvCacheManager.cpp
cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h
cpp/tensorrt_llm/batch_manager/dataTransceiver.cpp
cpp/tensorrt_llm/batch_manager/dataTransceiver.h
cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh,py}
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
Prepend NVIDIA copyright header (current year) to all source files
Files:
cpp/tensorrt_llm/pybind/batch_manager/kvCacheManager.cpp
tensorrt_llm/_torch/pyexecutor/py_executor.py
cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h
cpp/tensorrt_llm/batch_manager/dataTransceiver.cpp
cpp/tensorrt_llm/batch_manager/dataTransceiver.h
cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp
**/*.py
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
**/*.py
: Python code must target Python 3.8+
Python indentation: 4 spaces, no tabs
Maintain module namespace in imports (from package.subpackage import foo; then use foo.SomeClass())
Python file names use snake_case
Python class names use PascalCase
Python functions/methods and local variables use snake_case; variables starting with a number get k_ prefix (e.g., k_99th_percentile)
Global variables use G_ prefixed UPPER_SNAKE_CASE (e.g., G_MY_GLOBAL)
Constants use UPPER_SNAKE_CASE in Python
Avoid shadowing variables from outer scopes in Python
Initialize all externally visible members of a Python class in init
Prefer docstrings for interfaces used outside a file; comments for local code
Use Google-style docstrings for classes and functions (Sphinx-parsable)
Document attributes/variables inline with short docstrings
Avoid reflection when simple alternatives exist (e.g., prefer explicit parameters over dict(**locals()))
In try/except, catch the narrowest exceptions possible
For duck-typing with try/except, keep try body minimal and put logic in else
Files:
tensorrt_llm/_torch/pyexecutor/py_executor.py
**/*.{h,hpp,hxx,hh,cuh}
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
Header files must use include guards named TRTLLM__H without underscores prefix/suffix (e.g., TRTLLM_FOO_BAR_HELLO_H)
Files:
cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h
cpp/tensorrt_llm/batch_manager/dataTransceiver.h
🧠 Learnings (1)
📚 Learning: 2025-08-06T08:18:28.669Z
Learnt from: zhengd-nv
PR: NVIDIA/TensorRT-LLM#6633
File: cpp/tensorrt_llm/batch_manager/dataTransceiverImpl.cpp:145-155
Timestamp: 2025-08-06T08:18:28.669Z
Learning: In cpp/tensorrt_llm/batch_manager/dataTransceiverImpl.cpp, the existing `mMtxForMap` mutex in DataSenderImpl is sufficient to synchronize measurement file operations in the `release` method, as all file operations occur within the same critical section that protects the `mRequestToSession` map access.
Applied to files:
cpp/tensorrt_llm/batch_manager/dataTransceiver.cpp
cpp/tensorrt_llm/batch_manager/dataTransceiver.h
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Pre-commit Check
🔇 Additional comments (10)
cpp/tensorrt_llm/batch_manager/dataTransceiver.h (1)
- 42-44: LGTM! Type aliases are well-defined. The new type aliases SizeType32, TransferSession, and BaseCacheFormatter are correctly defined and improve code readability.
cpp/tensorrt_llm/pybind/batch_manager/kvCacheManager.cpp (1)
- 348-348: LGTM! Python binding for pinBlocks is correctly added. The binding for pinBlocks is properly placed after scheduling_remove_sequence and correctly exposes the C++ method to Python as pin_blocks.
tensorrt_llm/_torch/pyexecutor/py_executor.py (2)
- 183-183: LGTM! Block reuse flag properly initialized. The block_reuse_enabled attribute is correctly initialized from the model engine's KV cache configuration.
- 1641-1645: LGTM! Proper handling of block pinning based on reuse flag. The conditional logic correctly pins blocks when block_reuse_enabled is True and maintains backward compatibility when it's False.
cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h (2)
- 118-119: LGTM! Hash member added to BlockKey. The addition of the hash member to BlockKey is a good optimization to avoid recomputing hashes.
- 174-178: Good optimization for hash caching. The modification to return the cached hash when available is an efficient optimization.
cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp (4)
- 1-16: LGTM: Properly formatted copyright header. The NVIDIA copyright header is correctly formatted with the current year (2025) and follows the Apache 2.0 license template as required by the coding guidelines.
- 1485-1491: LGTM: Well-structured pin blocks delegation. The function properly delegates pinning operations to all window block managers following the established pattern used throughout the codebase.
- 1493-1501: LGTM: Correct implementation of block pinning. The function correctly increases the reference count for all allocated blocks belonging to the sequence, which prevents them from being evicted. This aligns with the intended pinning mechanism.
- 2134-2138: LGTM: Clean API for KV cache level block pinning. The function provides a clean interface for pinning blocks at the KVCacheManager level by resolving the sequence from requestId and delegating to the BlockManager. This follows good encapsulation principles.
explicit BlockKey(size_t hash)
    : hash{hash}
{
}
🛠️ Refactor suggestion
Document the purpose of the hash-only constructor.
The constructor that takes only a hash value needs documentation explaining when it should be used.
Add documentation:
+ /// @brief Construct a BlockKey with a precomputed hash.
+ /// @param hash The precomputed hash value.
+ /// @details Used when retrieving blocks by hash from the reuse tree.
explicit BlockKey(size_t hash)
: hash{hash}
{
}
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
/// @brief Construct a BlockKey with a precomputed hash.
/// @param hash The precomputed hash value.
/// @details Used when retrieving blocks by hash from the reuse tree.
explicit BlockKey(size_t hash)
    : hash{hash}
{
}
🤖 Prompt for AI Agents
In cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h around lines 132 to
135, add a documentation comment for the explicit BlockKey(size_t hash)
constructor that explains it constructs a key from a precomputed hash (used for
lookups/comparisons where only the hash is available), clarifies it does not
populate other key fields, notes any invariants or expected source of the hash
(e.g., result of BlockKey::hash() or external hashing function), and documents
intended usage scenarios (fast-lookup, temporary key for map/set operations) and
ownership/lifetime implications.
@@ -737,6 +750,8 @@
         return 0;
     }

+    [[nodiscard]] std::optional<KVCacheBlock> findBlocksInReuseTreeByHashes(std::vector<size_t> const& hashes) const;
🛠️ Refactor suggestion
Document the new hash-based lookup method.
The findBlocksInReuseTreeByHashes
method needs documentation explaining its purpose and return value.
Add documentation:
+ /// @brief Find blocks in the reuse tree matching the given hashes.
+ /// @param hashes Vector of hash values to search for.
+ /// @return Optional KVCacheBlock if matching blocks are found.
[[nodiscard]] std::optional<KVCacheBlock> findBlocksInReuseTreeByHashes(std::vector<size_t> const& hashes) const;
/// @brief Find blocks in the reuse tree matching the given hashes.
/// @param hashes Vector of hash values to search for.
/// @return Optional KVCacheBlock if matching blocks are found.
[[nodiscard]] std::optional<KVCacheBlock> findBlocksInReuseTreeByHashes(std::vector<size_t> const& hashes) const;
🤖 Prompt for AI Agents
In cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h around line 753, the
new method declaration findBlocksInReuseTreeByHashes lacks documentation; add a
Doxygen-style comment above it that briefly describes the method's purpose
(lookup KVCacheBlock(s) in the reuse tree by a vector of hash values), explains
the parameter (hashes: vector of size_t hash keys to match), describes the
return value (std::optional<KVCacheBlock> — present when a matching block is
found, std::nullopt when none match), and any thread-safety or performance notes
(e.g., const method, does not modify state, complexity if relevant). Ensure the
comment is concise and follows existing file doc style.
static int32_t tagFromRequestId(LlmRequest::RequestIdType requestId)
{
    constexpr int32_t kDATA_TAG{43};
    return ((requestId & 0xFFF) << 8) | (kDATA_TAG & 0xFF);
}
🛠️ Refactor suggestion
Document the magic number in tagFromRequestId.
The function uses a magic number 0xFFF
for masking and 43
as a base tag. These should be documented or made configurable.
Add documentation and consider using named constants:
+// Generate a unique tag for MPI communication based on request ID
+// Uses lower 12 bits of request ID to avoid tag overflow
+static constexpr int32_t kRequestIdMask{0xFFF};
+static constexpr int32_t kDataTagBase{43};
+
static int32_t tagFromRequestId(LlmRequest::RequestIdType requestId)
{
- constexpr int32_t kDATA_TAG{43};
- return ((requestId & 0xFFF) << 8) | (kDATA_TAG & 0xFF);
+ return ((requestId & kRequestIdMask) << 8) | (kDataTagBase & 0xFF);
}
// Generate a unique tag for MPI communication based on request ID
// Uses lower 12 bits of request ID to avoid tag overflow
static constexpr int32_t kRequestIdMask{0xFFF};
static constexpr int32_t kDataTagBase{43};

static int32_t tagFromRequestId(LlmRequest::RequestIdType requestId)
{
    return ((requestId & kRequestIdMask) << 8) | (kDataTagBase & 0xFF);
}
🤖 Prompt for AI Agents
In cpp/tensorrt_llm/batch_manager/dataTransceiver.cpp around lines 41 to 45, the
bitmask 0xFFF and the numeric tag 43 are used without explanation; replace these
magic numbers with clearly named constexpr constants (e.g., kRequestIdMask =
0xFFF, kRequestIdShift = 8, kDataTag = 43 or an enum for tag types), add a short
comment documenting the tag layout (how many bits are reserved for requestId,
where the tag sits, why 0xFFF is chosen), and consider making kDataTag
configurable at a higher scope if different tags may be needed; then update the
function to use those named constants so the intent is explicit and
maintainable.
/// @brief Receive the request information.
/// @return The request information.
[[nodiscard]] RequestInfo recvRequestInfo();
🛠️ Refactor suggestion
Document the purpose of new methods in CacheSender.
The recvRequestInfo
method lacks documentation explaining its purpose and when it should be called.
Add documentation:
+ /// @brief Receive request information from a peer.
+ /// @return The request information received.
[[nodiscard]] RequestInfo recvRequestInfo();
Committable suggestion skipped: line range outside the PR's diff.
🤖 Prompt for AI Agents
In cpp/tensorrt_llm/batch_manager/dataTransceiver.h around lines 136-138, the
recvRequestInfo() declaration lacks a descriptive comment; add a doc comment
that states this method receives and returns the next RequestInfo from the
transceiver, describes the expected caller (e.g., worker or consumer thread),
when it should be called (blocking vs non-blocking behavior, or only after a
prior handshake/enqueue), what the return value represents and ownership
semantics, and any thread-safety or error conditions (e.g., may throw or return
empty/default if closed). Ensure the comment is concise and placed immediately
above the method declaration.
/// @brief Synchronously send data.
/// @param llmRequest The request object to which the data belongs.
void sendSync(LlmRequest const& llmRequest);
🛠️ Refactor suggestion
Document the synchronous send method.
The sendSync
method lacks documentation explaining its blocking behavior and prerequisites.
Add documentation:
+ /// @brief Synchronously send data for the given request.
+ /// @param llmRequest The request object containing the data to send.
+ /// @details This method blocks until the data is sent. The request's data must be ready when called.
void sendSync(LlmRequest const& llmRequest);
Committable suggestion skipped: line range outside the PR's diff.
🤖 Prompt for AI Agents
In cpp/tensorrt_llm/batch_manager/dataTransceiver.h around lines 140-142, the
sendSync declaration lacks a detailed doc comment describing that it is a
blocking, synchronous send; update the Doxygen comment to state that sendSync
blocks the caller until the data for the provided LlmRequest is fully
transmitted (or an acknowledgement/confirmation is received), list preconditions
(e.g., required buffer/state, not to be called from UI/main thread), specify
thread-safety (whether it is thread-safe or must be serialized by caller),
document failure modes and how they are reported (exceptions thrown or error
return semantics) and any timeout or cancellation behavior; keep the comment
concise and actionable so callers know when to use sendSync vs async send.
std::optional<KVCacheBlock> WindowBlockManager::findBlocksInReuseTreeByHashes(std::vector<size_t> const& hashes) const
{
    std::vector<BlockKey> blockKeys;
    for (auto const& hash : hashes)
    {
        blockKeys.emplace_back(hash);
    }

    auto searchRoot = mCachedBlocksRoot;
    for (auto const& blockKey : blockKeys)
    {
        auto [partialMatch, numMatched, matchingBlock] = searchRoot != nullptr
            ? searchRoot->findMatchingBlock(blockKey, false, false)
            : std::make_tuple(false, 0, nullptr);
        if (matchingBlock == nullptr)
        {
            return nullptr;
        }
        searchRoot = std::move(matchingBlock);
    }
    return searchRoot;
}
🛠️ Refactor suggestion
Add input validation and improve error handling.
The function doesn't validate the input hashes
parameter and could return unexpected results if an empty vector is passed.
Apply this diff to add input validation:
std::optional<KVCacheBlock> WindowBlockManager::findBlocksInReuseTreeByHashes(std::vector<size_t> const& hashes) const
{
+ if (hashes.empty())
+ {
+ return std::nullopt;
+ }
+
std::vector<BlockKey> blockKeys;
for (auto const& hash : hashes)
{
blockKeys.emplace_back(hash);
}
auto searchRoot = mCachedBlocksRoot;
for (auto const& blockKey : blockKeys)
{
auto [partialMatch, numMatched, matchingBlock] = searchRoot != nullptr
? searchRoot->findMatchingBlock(blockKey, false, false)
: std::make_tuple(false, 0, nullptr);
if (matchingBlock == nullptr)
{
- return nullptr;
+ return std::nullopt;
}
searchRoot = std::move(matchingBlock);
}
return searchRoot;
}
std::optional<KVCacheBlock> WindowBlockManager::findBlocksInReuseTreeByHashes(std::vector<size_t> const& hashes) const
{
    if (hashes.empty())
    {
        return std::nullopt;
    }

    std::vector<BlockKey> blockKeys;
    for (auto const& hash : hashes)
    {
        blockKeys.emplace_back(hash);
    }

    auto searchRoot = mCachedBlocksRoot;
    for (auto const& blockKey : blockKeys)
    {
        auto [partialMatch, numMatched, matchingBlock] = searchRoot != nullptr
            ? searchRoot->findMatchingBlock(blockKey, false, false)
            : std::make_tuple(false, 0, nullptr);
        if (matchingBlock == nullptr)
        {
            return std::nullopt;
        }
        searchRoot = std::move(matchingBlock);
    }
    return searchRoot;
}
🤖 Prompt for AI Agents
In cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp around lines 1011 to 1032,
validate the input and improve null handling: if hashes.empty() immediately
return std::nullopt; if mCachedBlocksRoot is nullptr also return std::nullopt
before looping; inside the loop keep the existing search logic but when
matchingBlock is nullptr return std::nullopt (not nullptr); at the end, return a
populated std::optional<KVCacheBlock> constructed from the found block (or
std::nullopt if no block), ensuring you convert pointer results into the
optional return type consistently.
💡 Verification agent
🧩 Analysis chain
Verify integration with new BlockKey hash constructor.
The function relies on BlockKey(hash)
constructor to initialize keys from precomputed hashes. Ensure this constructor is properly implemented and tested.
🏁 Script executed:
#!/bin/bash
# Search for BlockKey constructor that takes a hash parameter
ast-grep --pattern $'class BlockKey {
$$$
BlockKey($_, size_t $_) {
$$$
}
$$$
}'
# Also search for any BlockKey constructor with single parameter
ast-grep --pattern $'BlockKey(size_t $_)'
Length of output: 136
🏁 Script executed:
#!/bin/bash
# Search for BlockKey definition and any constructors across the repo
rg "class BlockKey" -n -C3
rg "struct BlockKey" -n -C3
# Look for constructor implementations and declarations
rg "BlockKey::BlockKey" -n -C3
rg "explicit BlockKey" -n -C3
rg "BlockKey(size_t" -n -C3
Length of output: 3613
🏁 Script executed:
#!/bin/bash
# Display BlockKeyTest to verify coverage of the size_t constructor
sed -n '1,200p' cpp/tests/batch_manager/blockKeyTest.cpp
Length of output: 674
Ensure unit tests cover the hash-based BlockKey constructor.
The explicit BlockKey(size_t) ctor in kvCacheManager.h (lines 132–136) is used by findBlocksInReuseTreeByHashes but isn't exercised by any existing test. Please add or extend unit tests to cover:
- Direct construction via BlockKey(hash)
- Its interaction with findMatchingBlock / findBlocksInReuseTreeByHashes
This will validate that precomputed hashes are handled correctly.
🤖 Prompt for AI Agents
In cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp around lines 1011 to 1032,
add unit tests that exercise the explicit BlockKey(size_t) constructor and its
use in findBlocksInReuseTreeByHashes: construct BlockKey objects directly from
precomputed hashes (BlockKey(hash)), insert/create cached blocks in the reuse
tree as needed, call findBlocksInReuseTreeByHashes with vectors of those hashes,
and assert the function returns the expected KVCacheBlock (positive case) and
returns null/empty when hashes don’t match (negative case); include a test that
verifies findMatchingBlock works equivalently when using the size_t ctor versus
the standard ctor to ensure hash-based lookup behaves correctly.
if not self.block_reuse_enabled:
    ctx_transmission_reqs = self._send_disagg_ctx_cache(
        scheduled_batch.context_requests
    ) if self.kv_cache_transceiver else []
Review the duplicate disagg context cache sending logic.
The code appears to have duplicate logic for sending the disagg context cache - once when block_reuse_enabled is False and again unconditionally. This could lead to sending the cache twice.
The logic should be consolidated:
- if not self.block_reuse_enabled:
- ctx_transmission_reqs = self._send_disagg_ctx_cache(
- scheduled_batch.context_requests
- ) if self.kv_cache_transceiver else []
-
if self.kv_cache_transceiver:
ctx_transmission_reqs = self._send_disagg_ctx_cache(
scheduled_batch.context_requests)
Committable suggestion skipped: line range outside the PR's diff.
🤖 Prompt for AI Agents
In tensorrt_llm/_torch/pyexecutor/py_executor.py around lines 961 to 965, there
is duplicate logic that can send the disagg context cache twice: first inside
the if not self.block_reuse_enabled branch and then again unconditionally;
consolidate this by sending the cache exactly once—either keep the send inside
the block_reuse_enabled=False branch and remove the unconditional send, or make
the unconditional send conditional on block_reuse_enabled being False; ensure
the send uses the same kv_cache_transceiver guard (i.e., only call
self._send_disagg_ctx_cache(...) when self.kv_cache_transceiver is truthy) and
remove the redundant call so the cache cannot be sent twice.
@@ -80,6 +80,19 @@ class BlockRange
         mBlockIds = std::move(blockIds);
     }

+    void setBlockIdsFromHashes(std::vector<size_t> blockHashes)
const std::vector<size_t>& is better here.
@@ -39,23 +39,20 @@
 namespace tensorrt_llm::batch_manager::kv_cache_manager
 {

-BlockRange getBlockRangeForSending(BaseKVCacheManager* cacheManager, LlmRequest const& llmRequest)
+BlockRange getBlockRangeForSending(
+    BaseKVCacheManager* cacheManager, LlmRequest const& llmRequest, std::vector<size_t> const& blockHashes)
Is it possible to remove the llmRequest
parameter in the future?
Summary by CodeRabbit
Refactor
New Features
Tests
Description
Test Coverage
GitHub Bot Help
/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...
Provide a user friendly way for developers to interact with a Jenkins server.
Run /bot [-h|--help] to print this help message. See details below for each supported subcommand.
run [--reuse-test (optional)pipeline-id --disable-fail-fast --skip-test --stage-list "A10-PyTorch-1, xxx" --gpu-type "A30, H100_PCIe" --test-backend "pytorch, cpp" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" --detailed-log --debug(experimental)]
Launch build/test pipelines. All previously running jobs will be killed.
--reuse-test (optional)pipeline-id
(OPTIONAL) : Allow the new pipeline to reuse build artifacts and skip successful test stages from a specified pipeline or the last pipeline if no pipeline-id is indicated. If the Git commit ID has changed, this option will be always ignored. The DEFAULT behavior of the bot is to reuse build artifacts and successful test results from the last pipeline.
--disable-reuse-test
(OPTIONAL) : Explicitly prevent the pipeline from reusing build artifacts and skipping successful test stages from a previous pipeline. Ensure that all builds and tests are run regardless of previous successes.
--disable-fail-fast
(OPTIONAL) : Disable fail fast on build/tests/infra failures.
--skip-test
(OPTIONAL) : Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.
--stage-list "A10-PyTorch-1, xxx"
(OPTIONAL) : Only run the specified test stages. Examples: "A10-PyTorch-1, xxx". Note: Does NOT update GitHub check status.
--gpu-type "A30, H100_PCIe"
(OPTIONAL) : Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.
--test-backend "pytorch, cpp"
(OPTIONAL) : Skip test stages which don't match the specified backends. Only support [pytorch, cpp, tensorrt, triton]. Examples: "pytorch, cpp" (does not run test stages with tensorrt or triton backend). Note: Does NOT update GitHub pipeline status.
--only-multi-gpu-test
(OPTIONAL) : Only run the multi-GPU tests. Note: Does NOT update GitHub check status.
--disable-multi-gpu-test
(OPTIONAL) : Disable the multi-GPU tests. Note: Does NOT update GitHub check status.
--add-multi-gpu-test
(OPTIONAL) : Force run the multi-GPU tests in addition to running L0 pre-merge pipeline.
--post-merge
(OPTIONAL) : Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.
--extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx"
(OPTIONAL) : Run the ordinary L0 pre-merge pipeline and specified test stages. Examples: --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx".
--detailed-log
(OPTIONAL) : Enable flushing out all logs to the Jenkins console. This will significantly increase the log volume and may slow down the job.
--debug
(OPTIONAL) : Experimental feature. Enable access to the CI container for debugging purpose. Note: Specify exactly one stage in the stage-list parameter to access the appropriate container environment. Note: Does NOT update GitHub check status.

For guidance on mapping tests to stage names, see docs/source/reference/ci-overview.md and the scripts/test_to_stage_mapping.py helper.

kill
kill
Kill all running builds associated with pull request.
skip
skip --comment COMMENT
Skip testing for latest commit on pull request.
--comment "Reason for skipping build/test" is required. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.
reuse-pipeline
Reuse a previous pipeline to validate current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.