
feat: Add support for KVCache transfer from KVCache reuse path #6348


Draft · wants to merge 3 commits into main from user/imant/useHashIds

Conversation

@Tabrizian Tabrizian (Member) commented Jul 25, 2025

Summary by CodeRabbit

  • Refactor

    • Redesigned cache transfer/session handling for robust multi-connection and agent-based transmission; simplified communication state and removed legacy transceiver components.
  • New Features

    • Per-request KV-cache pinning to prevent eviction during processing, exposed to Python.
    • Block-hash-driven selective block transfer and optional per-transfer metrics collection/export.
    • Executor toggle to change cache-transmission behavior based on block-reuse setting.
  • Tests

    • Updated tests to match the new transfer/session architecture.

Description

Test Coverage

GitHub Bot Help

/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...

Provides a user-friendly way for developers to interact with a Jenkins server.

Run /bot [-h|--help] to print this help message.

See details below for each supported subcommand.

run [--reuse-test (optional)pipeline-id --disable-fail-fast --skip-test --stage-list "A10-PyTorch-1, xxx" --gpu-type "A30, H100_PCIe" --test-backend "pytorch, cpp" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" --detailed-log --debug(experimental)]

Launch build/test pipelines. All previously running jobs will be killed.

--reuse-test (optional)pipeline-id (OPTIONAL) : Allow the new pipeline to reuse build artifacts and skip successful test stages from a specified pipeline, or from the last pipeline if no pipeline-id is given. If the Git commit ID has changed, this option will always be ignored. The DEFAULT behavior of the bot is to reuse build artifacts and successful test results from the last pipeline.

--disable-reuse-test (OPTIONAL) : Explicitly prevent the pipeline from reusing build artifacts and skipping successful test stages from a previous pipeline. Ensures that all builds and tests run regardless of previous successes.

--disable-fail-fast (OPTIONAL) : Disable fail fast on build/tests/infra failures.

--skip-test (OPTIONAL) : Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.

--stage-list "A10-PyTorch-1, xxx" (OPTIONAL) : Only run the specified test stages. Examples: "A10-PyTorch-1, xxx". Note: Does NOT update GitHub check status.

--gpu-type "A30, H100_PCIe" (OPTIONAL) : Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.

--test-backend "pytorch, cpp" (OPTIONAL) : Skip test stages which don't match the specified backends. Only support [pytorch, cpp, tensorrt, triton]. Examples: "pytorch, cpp" (does not run test stages with tensorrt or triton backend). Note: Does NOT update GitHub pipeline status.

--only-multi-gpu-test (OPTIONAL) : Only run the multi-GPU tests. Note: Does NOT update GitHub check status.

--disable-multi-gpu-test (OPTIONAL) : Disable the multi-GPU tests. Note: Does NOT update GitHub check status.

--add-multi-gpu-test (OPTIONAL) : Force run the multi-GPU tests in addition to running L0 pre-merge pipeline.

--post-merge (OPTIONAL) : Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.

--extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" (OPTIONAL) : Run the ordinary L0 pre-merge pipeline and specified test stages. Examples: --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx".

--detailed-log (OPTIONAL) : Enable flushing out all logs to the Jenkins console. This will significantly increase the log volume and may slow down the job.

--debug (OPTIONAL) : Experimental feature. Enable access to the CI container for debugging purpose. Note: Specify exactly one stage in the stage-list parameter to access the appropriate container environment. Note: Does NOT update GitHub check status.

For guidance on mapping tests to stage names, see docs/source/reference/ci-overview.md
and the scripts/test_to_stage_mapping.py helper.

kill

kill

Kill all running builds associated with the pull request.

skip

skip --comment COMMENT

Skip testing for the latest commit on the pull request. --comment "Reason for skipping build/test" is required. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.

reuse-pipeline

reuse-pipeline

Reuse a previous pipeline to validate current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.

coderabbitai bot (Contributor) commented Jul 25, 2025

📝 Walkthrough

Replaces DataResponder/DataRequester/DataSender/DataReceiver with CacheSender/CacheReceiver, introduces TransferSession and session-based cache-formatting/transfer APIs, adds block-hash driven send/receive paths and pinning APIs, removes legacy impl files, updates agent/connection naming, and adjusts tests and build lists accordingly.

Changes

| Cohort | File(s) | Change Summary |
|---|---|---|
| Transceiver header | cpp/include/tensorrt_llm/batch_manager/cacheTransceiver.h | Forward declarations renamed (DataResponder → CacheSender, DataRequester → CacheReceiver); member pointers and future vector renamed (mDataResponder → mCacheSender, mDataRequester → mCacheReceiver, mResponderFutures → mSenderFutures). |
| Transceiver impl | cpp/tensorrt_llm/batch_manager/cacheTransceiver.cpp | Uses CacheSender/CacheReceiver instead of DataResponder/DataRequester; send calls use sendAsync() and store futures in mSenderFutures; receive uses receiveAsync(); removed old impl include. |
| Core redesign (wrappers & impl) | cpp/tensorrt_llm/batch_manager/dataTransceiver.cpp, cpp/tensorrt_llm/batch_manager/dataTransceiver.h | Introduced CacheSender and CacheReceiver wrappers with PIMPLs and new APIs (sendAsync/receiveAsync, get/set comm state, request-info/session helpers); added aliases (SizeType32, TransferSession, BaseCacheFormatter) and TransceiverTag; major multi-connection, TransferSession and promise-based send/receive logic added. |
| Removed legacy impls & build | cpp/tensorrt_llm/batch_manager/dataTransceiverImpl.cpp, .../dataTransceiverImpl.h, cpp/tensorrt_llm/batch_manager/CMakeLists.txt | Deleted legacy DataSenderImpl/DataReceiverImpl and their header; removed dataTransceiverImpl.cpp from SRCS in CMakeLists. |
| Cache formatter & session API | cpp/include/tensorrt_llm/batch_manager/cacheFormatter.h, .../cacheFormatter.cpp, .../mlaCacheFormatter.cpp | Added TransferSession, KvCacheMeasureHelper, BaseCacheFormatter interface; CacheFormatter/MLACacheFormatter rewritten to use TransferSession; getBlockRangeForSending now accepts block hashes and the sending logic uses the provided block-hash list. |
| KV-cache manager & utils | cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h, .../kvCacheManager.cpp, .../kvCacheUtils.h | Added block-pinning APIs (pinBlocks) across managers, reuse-tree lookup by hashes (findBlocksInReuseTreeByHashes), BlockKey.hash field and constructor, and BlockRange::setBlockIdsFromHashes. |
| Request API change | cpp/include/tensorrt_llm/batch_manager/llmRequest.h | Removed public accessors for requested block hashes (setter and getter removed). |
| Agent connection rename | cpp/tensorrt_llm/executor/cache_transmission/agent_utils/connection.h, .../connection.cpp | Renamed parameter/member mReceiverBufferDesc → mCacheReceiverBufferDesc in AgentConnection sender state and setter. |
| Includes & tests | cpp/tensorrt_llm/executor/cache_transmission/ucx_utils/connection.cpp, cpp/tests/batch_manager/cacheTransceiverTest.cpp, cpp/tests/unit_tests/executor/ucxCommTest.cpp | Updated includes to dataTransceiver.h; tests and mocks renamed from Data* to Cache*; removed redundant include lines. |
| Env utility removal | cpp/tensorrt_llm/common/envUtils.cpp | Removed getEnvDisableSelectiveCacheTransfer() (environment-based selective-transfer flag removed). |
| Python binding & executor behavior | cpp/tensorrt_llm/pybind/batch_manager/kvCacheManager.cpp, tensorrt_llm/_torch/pyexecutor/py_executor.py | Exposed BaseKVCacheManager::pinBlocks as pin_blocks in Python; added block_reuse_enabled flag to PyExecutor and conditional flow to pin/terminate or keep ctx-in-transmission depending on the flag. |

Sequence Diagram(s)

sequenceDiagram
    participant Request as LlmRequest
    participant Sender as CacheSender
    participant ConnMgr as ConnectionManager
    participant Formatter as BaseCacheFormatter
    participant Receiver as CacheReceiver

    Request->>Sender: sendAsync(request)
    Sender->>ConnMgr: build TransferSession (connections, states, blockHashes)
    Sender->>Formatter: format(TransferSession)
    Sender->>ConnMgr: transmit data (async)
    ConnMgr->>Receiver: deliver request info / data
    Receiver->>Formatter: unformat(TransferSession)
    Receiver->>Request: complete receive (future)

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~40 minutes

Possibly related PRs

Suggested labels

Community want to contribute, Generic Runtime

Suggested reviewers

  • pcastonguay
  • chuangz0
  • DomBrown
  • Funatiq
  • Shixiaowei02

@coderabbitai coderabbitai bot (Contributor) left a comment


Actionable comments posted: 3

🔭 Outside diff range comments (4)
cpp/tensorrt_llm/executor/cache_transmission/agent_utils/connection.h (1)

18-18: Use preprocessor guard instead of pragma once.

According to coding guidelines, header files should use a preprocessor guard with prefix TRTLLM_ followed by the filename in caps.

Replace #pragma once with a proper header guard:

-#pragma once
+#ifndef TRTLLM_CONNECTION_H
+#define TRTLLM_CONNECTION_H

And add at the end of the file:

+#endif // TRTLLM_CONNECTION_H
cpp/include/tensorrt_llm/batch_manager/cacheTransceiver.h (2)

17-17: Use preprocessor guard instead of pragma once.

According to coding guidelines, header files should use a preprocessor guard with prefix TRTLLM_ followed by the filename in caps.

Replace #pragma once with a proper header guard:

-#pragma once
+#ifndef TRTLLM_CACHETRANSCEIVER_H
+#define TRTLLM_CACHETRANSCEIVER_H

And add at the end of the file:

+#endif // TRTLLM_CACHETRANSCEIVER_H

2-2: Update copyright year to include 2025.

According to coding guidelines, all TensorRT-LLM code should contain an NVIDIA copyright header that includes the current year (2025).

- * Copyright (c) 2023-2024, NVIDIA CORPORATION.  All rights reserved.
+ * Copyright (c) 2023-2025, NVIDIA CORPORATION.  All rights reserved.
cpp/tests/batch_manager/cacheTransceiverTest.cpp (1)

655-658: Remove unused variable assignment.

The variable cacheType is assigned but never used. This appears to be dead code.

-        if (kvFactor == 1)
-        {
-            auto cacheType = CacheType::kSELFKONLY;
-        }
🧹 Nitpick comments (2)
examples/disaggregated/clients/disagg_client.py (1)

81-126: Consider adding timing measurements to send_chat_request for consistency.

The send_chat_request function lacks the same TTFT and ITL timing measurements that were added to send_request. This creates inconsistent behavior between the two endpoints.

Consider adding similar timing measurements to send_chat_request or extracting the timing logic into a shared helper function to maintain consistency across both endpoints.

examples/disaggregated/clients/prompts.json (1)

1-3: Consider maintaining diverse test prompts for comprehensive testing.

While simplifying to a single prompt reduces complexity, having diverse prompts helps test various scenarios (different lengths, complexity, languages, etc.). Consider whether this reduction impacts test coverage adequately.

📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 0f2f11f and eed674f.

📒 Files selected for processing (13)
  • cpp/include/tensorrt_llm/batch_manager/cacheTransceiver.h (2 hunks)
  • cpp/tensorrt_llm/batch_manager/CMakeLists.txt (0 hunks)
  • cpp/tensorrt_llm/batch_manager/cacheFormatter.h (3 hunks)
  • cpp/tensorrt_llm/batch_manager/cacheTransceiver.cpp (7 hunks)
  • cpp/tensorrt_llm/batch_manager/dataTransceiver.cpp (15 hunks)
  • cpp/tensorrt_llm/batch_manager/dataTransceiver.h (3 hunks)
  • cpp/tensorrt_llm/batch_manager/dataTransceiverImpl.cpp (0 hunks)
  • cpp/tensorrt_llm/batch_manager/dataTransceiverImpl.h (0 hunks)
  • cpp/tensorrt_llm/executor/cache_transmission/agent_utils/connection.cpp (2 hunks)
  • cpp/tensorrt_llm/executor/cache_transmission/agent_utils/connection.h (2 hunks)
  • cpp/tests/batch_manager/cacheTransceiverTest.cpp (11 hunks)
  • examples/disaggregated/clients/disagg_client.py (3 hunks)
  • examples/disaggregated/clients/prompts.json (1 hunks)
💤 Files with no reviewable changes (3)
  • cpp/tensorrt_llm/batch_manager/CMakeLists.txt
  • cpp/tensorrt_llm/batch_manager/dataTransceiverImpl.h
  • cpp/tensorrt_llm/batch_manager/dataTransceiverImpl.cpp
🧰 Additional context used
📓 Path-based instructions (4)
**/*.{cpp,h,hpp,cc,cxx}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

**/*.{cpp,h,hpp,cc,cxx}: Closing braces of namespaces should have a comment saying the namespace it closes (e.g., } // namespace foo)
Prefer const or constexpr variables over #defines whenever possible, as the latter are not visible to the compiler.
A variable that is not modified after its initialization should be declared as const.
Except 0, nullptr, true, false, all other literals should only be used for variable initialization.
Use the Allman indentation style for braces in C++ code.
Put the semicolon for an empty for or while loop in a new line.
The statement forming the body of a switch, while, do .. while or for statement shall be a compound statement (use brace-delimited statements).
If and else should always be followed by brace-delimited statements, even if empty or a single statement.
C++ filenames should use camel case with first letter lowercase (e.g., thisIsAFilename.cpp), and must be case-insensitive unique within a compilation target.
All types (including class names) should use camel case with uppercase first letter (e.g., FooBarClass).
Local variables, methods, and namespaces should use camel case with first letter lowercase (e.g., localFooBar).
Non-magic-number global variables that are non-static and not defined in anonymous namespace should use camel case prefixed by a lower case 'g' (e.g., gDontUseGlobalFoos).
Non-magic-number global variables that are static or defined in an anonymous namespace should use camel case prefixed by a lower case 's' (e.g., sMutableStaticGlobal).
Locally visible static variable should use camel case with lowercase prefix 's' as the first letter of the name (e.g., static std::once_flag sFlag;).
Class member variables should use camel case prefixed with an 'm' (e.g., mNbFooValues). Public member variables do not require the 'm' prefix but it is encouraged for clarity.
Enumerations, global constants, static constants at class-scope, and function-scope magic-number/literal constants are uppercase snakecase with prefix...

Files:

  • cpp/tensorrt_llm/executor/cache_transmission/agent_utils/connection.cpp
  • cpp/tensorrt_llm/executor/cache_transmission/agent_utils/connection.h
  • cpp/tensorrt_llm/batch_manager/cacheTransceiver.cpp
  • cpp/include/tensorrt_llm/batch_manager/cacheTransceiver.h
  • cpp/tensorrt_llm/batch_manager/cacheFormatter.h
  • cpp/tensorrt_llm/batch_manager/dataTransceiver.h
  • cpp/tests/batch_manager/cacheTransceiverTest.cpp
  • cpp/tensorrt_llm/batch_manager/dataTransceiver.cpp
**/*.{cpp,h,hpp,cc,cxx,cu,py}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

All TensorRT-LLM Open Source Software code should contain an NVIDIA copyright header that includes the current year. This includes .cpp, .h, .cu, .py, and any other source files which are compiled or interpreted.

Files:

  • cpp/tensorrt_llm/executor/cache_transmission/agent_utils/connection.cpp
  • examples/disaggregated/clients/disagg_client.py
  • cpp/tensorrt_llm/executor/cache_transmission/agent_utils/connection.h
  • cpp/tensorrt_llm/batch_manager/cacheTransceiver.cpp
  • cpp/include/tensorrt_llm/batch_manager/cacheTransceiver.h
  • cpp/tensorrt_llm/batch_manager/cacheFormatter.h
  • cpp/tensorrt_llm/batch_manager/dataTransceiver.h
  • cpp/tests/batch_manager/cacheTransceiverTest.cpp
  • cpp/tensorrt_llm/batch_manager/dataTransceiver.cpp
**/*.py

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

**/*.py: Python code should conform to Python 3.8+.
Indent Python code with 4 spaces. Do not use tabs.
Always maintain the namespace when importing in Python, even if only one class or function from a module is used.
Python filenames should use snake_case (e.g., some_file.py).
Python classes should use PascalCase (e.g., class SomeClass).
Python functions and methods should use snake_case (e.g., def my_awesome_function():).
Python local variables should use snake_case. Prefix k for variable names that start with a number (e.g., k_99th_percentile).
Python global variables should use upper snake_case and prefix G (e.g., G_MY_GLOBAL).
Python constants should use upper snake_case (e.g., MY_CONSTANT).
Avoid shadowing variables declared in an outer scope in Python.
Initialize all externally visible members of a Python class in the constructor.
For interfaces that may be used outside a Python file, prefer docstrings over comments.
Comments in Python should be reserved for code within a function, or interfaces that are local to a file.
Use Google style docstrings for Python classes and functions, which can be parsed by Sphinx.
Attributes and variables in Python can be documented inline; attribute docstrings will be rendered under the docstring for the class.
Avoid using reflection in Python when functionality can be easily achieved without it.
When using try-except blocks in Python, limit the except to the smallest set of errors possible.
When using try-except blocks to handle multiple possible variable types in Python, keep the body of the try as small as possible, using the else block to implement the logic.

Files:

  • examples/disaggregated/clients/disagg_client.py
**/*.{h,hpp}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

Use a preprocessor guard in header files. The guard name must have prefix TRTLLM_ followed by the filename, all in caps, and no trailing underscore.

Files:

  • cpp/tensorrt_llm/executor/cache_transmission/agent_utils/connection.h
  • cpp/include/tensorrt_llm/batch_manager/cacheTransceiver.h
  • cpp/tensorrt_llm/batch_manager/cacheFormatter.h
  • cpp/tensorrt_llm/batch_manager/dataTransceiver.h
🧠 Learnings (3)
cpp/tensorrt_llm/batch_manager/cacheFormatter.h (1)

Learnt from: amitz-nv
PR: #5616
File: tensorrt_llm/executor/worker.py:375-384
Timestamp: 2025-07-17T09:01:27.402Z
Learning: In tensorrt_llm/executor/worker.py, the LoRA adapter cache optimization logic that checks is_adapter_in_cpu_cache() and conditionally passes None for weights/config has a known race condition issue that cannot be solved with simple error handling or verification checks. This is a known limitation that requires a more comprehensive solution.

cpp/tensorrt_llm/batch_manager/dataTransceiver.h (1)

Learnt from: amitz-nv
PR: #5616
File: tensorrt_llm/executor/worker.py:375-384
Timestamp: 2025-07-17T09:01:27.402Z
Learning: In tensorrt_llm/executor/worker.py, the LoRA adapter cache optimization logic that checks is_adapter_in_cpu_cache() and conditionally passes None for weights/config has a known race condition issue that cannot be solved with simple error handling or verification checks. This is a known limitation that requires a more comprehensive solution.

cpp/tensorrt_llm/batch_manager/dataTransceiver.cpp (1)

Learnt from: amitz-nv
PR: #5616
File: tensorrt_llm/executor/worker.py:375-384
Timestamp: 2025-07-17T09:01:27.402Z
Learning: In tensorrt_llm/executor/worker.py, the LoRA adapter cache optimization logic that checks is_adapter_in_cpu_cache() and conditionally passes None for weights/config has a known race condition issue that cannot be solved with simple error handling or verification checks. This is a known limitation that requires a more comprehensive solution.

🔇 Additional comments (24)
cpp/tensorrt_llm/executor/cache_transmission/agent_utils/connection.h (1)

178-178: Consistent renaming verified

I performed a repository-wide search and confirmed there are no remaining occurrences of mReceiverBufferDesc. All references in

  • cpp/tensorrt_llm/executor/cache_transmission/agent_utils/connection.h
  • cpp/tensorrt_llm/executor/cache_transmission/agent_utils/connection.cpp

have been updated to mCacheReceiverBufferDesc.

LGTM!

cpp/tensorrt_llm/executor/cache_transmission/agent_utils/connection.cpp (2)

84-84: LGTM! Consistent implementation of the renamed member variable.

The changes correctly implement the renamed mCacheReceiverBufferDesc member variable throughout the implementation, maintaining consistency with the header file updates.

Also applies to: 140-142


468-468: LGTM! Proper namespace closing comment.

The namespace closing comment follows the coding guidelines correctly.

cpp/include/tensorrt_llm/batch_manager/cacheTransceiver.h (1)

37-38: Renaming Verification Complete

All references to the old class names have been removed:

  • No matches for DataResponder
  • No matches for DataRequester
  • No matches for mDataSender or mDataReceiver

The renaming to CacheSender/CacheReceiver is consistent and complete. No further action required.

cpp/tests/batch_manager/cacheTransceiverTest.cpp (6)

159-185: LGTM! Mock class properly renamed and implemented.

The mock class follows appropriate testing patterns with proper use of Google Mock features.


186-192: LGTM! Mock receiver class properly implemented.

The mock methods are correctly defined for testing cache receiving functionality.


211-230: LGTM! Test properly updated to use renamed classes.

The test correctly uses CacheSender instead of DataResponder while maintaining the original test logic.


232-252: LGTM! Test properly uses CacheReceiver.

The test correctly replaces DataRequester with CacheReceiver while maintaining test functionality.


396-404: LGTM! Cache transceiver setup properly updated.

The setup correctly instantiates CacheSender and CacheReceiver with proper implementations using CacheSenderImpl and CacheReceiverImpl.


762-768: LGTM! Asymmetric cache test properly updated.

The test correctly uses CacheSender and CacheReceiver with their implementation classes.

cpp/tensorrt_llm/batch_manager/cacheTransceiver.cpp (3)

196-200: LGTM! Proper initialization of cache sender and receiver.

The constructor correctly initializes CacheSender and CacheReceiver with appropriate parameters and formatter creation.


236-298: LGTM! Methods properly updated to use new cache classes.

All async send/receive operations correctly use mCacheSender and mCacheReceiver with proper future handling.


369-436: LGTM! Context transfer status checking properly updated.

The method correctly uses mSenderFutures (renamed from mResponderFutures) while maintaining the original synchronization logic.

cpp/tensorrt_llm/batch_manager/cacheFormatter.h (2)

45-122: LGTM! Well-designed TransferSession class.

The class provides a clean abstraction for managing transfer sessions with proper const-correctness, precondition checks, and encapsulation.


163-222: LGTM! Performance measurement helper properly implemented.

The class provides thread-safe performance measurement with proper file output handling. Good use of RAII pattern for automatic output on destruction.

cpp/tensorrt_llm/batch_manager/dataTransceiver.h (1)

115-183: LGTM! Clean interface design with pImpl pattern.

The CacheSender and CacheReceiver classes provide clean interfaces with proper use of the pImpl idiom for implementation hiding and ABI stability.

cpp/tensorrt_llm/batch_manager/dataTransceiver.cpp (8)

1-32: LGTM: Copyright header and includes look good.

The copyright header contains the current year (2025) as required, and the new include for agent connection utilities aligns with the refactoring objectives.


38-57: Good RAII design and helper functions.

The new helper structures and functions are well-designed:

  • Using declarations improve readability
  • ReceiveCacheResource follows RAII principles correctly
  • tagFromRequestId implements proper bit manipulation for tag generation

The magic number kDATA_TAG{43} is appropriately named as a constexpr constant.


115-174: Well-designed constructor and core methods.

The implementation demonstrates good practices:

  • Constructor properly validates inputs with TLLM_CHECK
  • Resource management uses smart pointers appropriately
  • Thread safety is maintained with proper mutex usage
  • Dependency injection pattern is correctly implemented

The new methods getCounterpartsCount and release provide clean session management.


176-228: Complex but well-structured request info handling.

The method correctly handles both agent and direct connection scenarios:

  • Appropriate use of dynamic_cast for type discrimination
  • Comprehensive error handling with informative messages
  • Proper thread safety with mutex protection
  • Session management is correctly implemented

The method's complexity is justified given the dual connection handling requirements.


230-345: Excellent thread management and synchronization.

The implementation demonstrates solid concurrent programming practices:

  • Clean separation between sync and async operations
  • Proper use of condition variables and mutexes for thread coordination
  • Exception handling ensures promises are always resolved
  • Thread naming (dataTransResp) aids in debugging
  • Producer-consumer pattern is correctly implemented

The variable renaming from responder to sender terminology is consistent with the refactor.


405-529: Solid CacheReceiver implementation with good async handling.

The receiver implementation mirrors the sender's good practices:

  • Constructor properly validates inputs and manages resources
  • Async operations are correctly implemented with std::async
  • Complex connection handling for both agent and direct scenarios
  • Proper resource management with ReceiveCacheResource
  • Thread safety maintained throughout

The symmetry with CacheSender provides a clean, consistent API.


560-698: Well-implemented async resource management.

The async processing implementation demonstrates good concurrent design:

  • AsyncResource properly encapsulates thread synchronization primitives
  • Worker thread pattern correctly processes queued requests
  • Proper termination handling prevents resource leaks
  • Exception handling ensures all promises are resolved
  • Thread naming (dataTransRequest) aids debugging
  • Use of std::deque for request queue is appropriate

The cleanup logic in the destructor properly terminates all worker threads.


700-736: Clean public interface implementation using pimpl idiom.

The public interfaces are well-designed:

  • Proper use of pimpl idiom with std::unique_ptr<Impl>
  • Constructor delegation to implementation is clean
  • Method delegation is straightforward and maintains const correctness
  • Defaulted destructors are appropriate with smart pointer members

The namespace closing comment follows the coding guidelines correctly.

@Tabrizian Tabrizian force-pushed the user/imant/useHashIds branch from eed674f to 376d131 on July 25, 2025 at 02:26
@coderabbitai coderabbitai bot requested a review from DomBrown July 25, 2025 02:26
@Tabrizian (Member, Author):

/bot run

@Tabrizian Tabrizian force-pushed the user/imant/useHashIds branch from 376d131 to 90d8774 on July 25, 2025 at 02:28
@coderabbitai coderabbitai bot requested a review from nv-guomingz July 25, 2025 02:30
@tensorrt-cicd (Collaborator):

PR_Github #12929 [ run ] triggered by Bot

@coderabbitai coderabbitai bot (Contributor) left a comment


Actionable comments posted: 2

🧹 Nitpick comments (1)
cpp/tensorrt_llm/batch_manager/dataTransceiver.cpp (1)

420-424: Consider thread pool instead of frequent thread creation.

The TODO comment indicates awareness of the performance issue with frequent thread creation. The current implementation uses std::async with std::launch::async which may create new threads frequently.

Consider implementing a thread pool for better performance:

- return std::async(std::launch::async, &CacheReceiver::Impl::requestSync, this, std::ref(llmRequest));
+ // Use thread pool implementation instead of direct async call
+ return mThreadPool.submit(&CacheReceiver::Impl::requestSync, this, std::ref(llmRequest));
📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 376d131 and 90d8774.

📒 Files selected for processing (11)
  • cpp/include/tensorrt_llm/batch_manager/cacheTransceiver.h (2 hunks)
  • cpp/tensorrt_llm/batch_manager/CMakeLists.txt (0 hunks)
  • cpp/tensorrt_llm/batch_manager/cacheFormatter.h (3 hunks)
  • cpp/tensorrt_llm/batch_manager/cacheTransceiver.cpp (7 hunks)
  • cpp/tensorrt_llm/batch_manager/dataTransceiver.cpp (15 hunks)
  • cpp/tensorrt_llm/batch_manager/dataTransceiver.h (3 hunks)
  • cpp/tensorrt_llm/batch_manager/dataTransceiverImpl.cpp (0 hunks)
  • cpp/tensorrt_llm/batch_manager/dataTransceiverImpl.h (0 hunks)
  • cpp/tensorrt_llm/executor/cache_transmission/agent_utils/connection.cpp (2 hunks)
  • cpp/tensorrt_llm/executor/cache_transmission/agent_utils/connection.h (2 hunks)
  • cpp/tests/batch_manager/cacheTransceiverTest.cpp (11 hunks)
💤 Files with no reviewable changes (3)
  • cpp/tensorrt_llm/batch_manager/CMakeLists.txt
  • cpp/tensorrt_llm/batch_manager/dataTransceiverImpl.h
  • cpp/tensorrt_llm/batch_manager/dataTransceiverImpl.cpp
🚧 Files skipped from review as they are similar to previous changes (6)
  • cpp/tensorrt_llm/executor/cache_transmission/agent_utils/connection.cpp
  • cpp/include/tensorrt_llm/batch_manager/cacheTransceiver.h
  • cpp/tensorrt_llm/executor/cache_transmission/agent_utils/connection.h
  • cpp/tensorrt_llm/batch_manager/cacheTransceiver.cpp
  • cpp/tensorrt_llm/batch_manager/cacheFormatter.h
  • cpp/tests/batch_manager/cacheTransceiverTest.cpp
🧰 Additional context used
📓 Path-based instructions (3)
**/*.{cpp,h,hpp,cc,cxx}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

**/*.{cpp,h,hpp,cc,cxx}: Closing braces of namespaces should have a comment saying the namespace it closes (e.g., } // namespace foo)
Prefer const or constexpr variables over #defines whenever possible, as the latter are not visible to the compiler.
A variable that is not modified after its initialization should be declared as const.
Except 0, nullptr, true, false, all other literals should only be used for variable initialization.
Use the Allman indentation style for braces in C++ code.
Put the semicolon for an empty for or while loop in a new line.
The statement forming the body of a switch, while, do .. while or for statement shall be a compound statement (use brace-delimited statements).
If and else should always be followed by brace-delimited statements, even if empty or a single statement.
C++ filenames should use camel case with first letter lowercase (e.g., thisIsAFilename.cpp), and must be case-insensitive unique within a compilation target.
All types (including class names) should use camel case with uppercase first letter (e.g., FooBarClass).
Local variables, methods, and namespaces should use camel case with first letter lowercase (e.g., localFooBar).
Non-magic-number global variables that are non-static and not defined in anonymous namespace should use camel case prefixed by a lower case 'g' (e.g., gDontUseGlobalFoos).
Non-magic-number global variables that are static or defined in an anonymous namespace should use camel case prefixed by a lower case 's' (e.g., sMutableStaticGlobal).
Locally visible static variable should use camel case with lowercase prefix 's' as the first letter of the name (e.g., static std::once_flag sFlag;).
Class member variables should use camel case prefixed with an 'm' (e.g., mNbFooValues). Public member variables do not require the 'm' prefix but it is encouraged for clarity.
Enumerations, global constants, static constants at class-scope, and function-scope magic-number/literal constants are uppercase snakecase with prefix...

Files:

  • cpp/tensorrt_llm/batch_manager/dataTransceiver.h
  • cpp/tensorrt_llm/batch_manager/dataTransceiver.cpp
**/*.{h,hpp}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

Use a preprocessor guard in header files. The guard name must have prefix TRTLLM_ followed by the filename, all in caps, and no trailing underscore.

Files:

  • cpp/tensorrt_llm/batch_manager/dataTransceiver.h
**/*.{cpp,h,hpp,cc,cxx,cu,py}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

All TensorRT-LLM Open Source Software code should contain an NVIDIA copyright header that includes the current year. This includes .cpp, .h, .cu, .py, and any other source files which are compiled or interpreted.

Files:

  • cpp/tensorrt_llm/batch_manager/dataTransceiver.h
  • cpp/tensorrt_llm/batch_manager/dataTransceiver.cpp
🧠 Learnings (2)
cpp/tensorrt_llm/batch_manager/dataTransceiver.h (1)

Learnt from: amitz-nv
PR: #5616
File: tensorrt_llm/executor/worker.py:375-384
Timestamp: 2025-07-17T09:01:27.402Z
Learning: In tensorrt_llm/executor/worker.py, the LoRA adapter cache optimization logic that checks is_adapter_in_cpu_cache() and conditionally passes None for weights/config has a known race condition issue that cannot be solved with simple error handling or verification checks. This is a known limitation that requires a more comprehensive solution.

cpp/tensorrt_llm/batch_manager/dataTransceiver.cpp (1)

Learnt from: amitz-nv
PR: #5616
File: tensorrt_llm/executor/worker.py:375-384
Timestamp: 2025-07-17T09:01:27.402Z
Learning: In tensorrt_llm/executor/worker.py, the LoRA adapter cache optimization logic that checks is_adapter_in_cpu_cache() and conditionally passes None for weights/config has a known race condition issue that cannot be solved with simple error handling or verification checks. This is a known limitation that requires a more comprehensive solution.

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Pre-commit Check
🔇 Additional comments (48)
cpp/tensorrt_llm/batch_manager/dataTransceiver.h (17)

24-24: LGTM: Include added for new cache formatter abstraction.

The new include for cacheFormatter.h is correctly added to support the refactored architecture that uses BaseCacheFormatter and TransferSession from the kv_cache_manager namespace.


42-44: LGTM: Type aliases properly updated for the refactored architecture.

The type aliases correctly reference the new abstractions:

  • SizeType32 properly aliased from runtime namespace
  • TransferSession correctly aliased from kv_cache_manager namespace, replacing the previous internal definition

45-58: LGTM: TransceiverTag struct provides proper message tagging.

The TransceiverTag struct correctly defines:

  • Enumerated message types for REQUEST_SEND and TERMINATION
  • Appropriate tag constants for ID, INFO_SIZE, and INFO message components
  • Uses constexpr for compile-time constants as per coding guidelines

The naming follows the coding guidelines with uppercase snake_case for constants prefixed with 'k'.
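A sketch of the tagging pattern described above may help; the enum values and tag constants below are illustrative placeholders, not the actual TensorRT-LLM definitions.

```cpp
#include <cassert>
#include <cstdint>

// Illustrative message-tagging struct: enumerated message types plus
// compile-time tag constants following the k-prefixed uppercase convention.
struct TransceiverTag
{
    enum class Id : uint64_t
    {
        REQUEST_SEND = 1,
        TERMINATION = 2
    };

    // Tag constants for the ID, INFO_SIZE, and INFO message components
    // (values here are arbitrary examples).
    static constexpr int32_t kID_TAG{19};
    static constexpr int32_t kINFO_SIZE_TAG{22};
    static constexpr int32_t kINFO_TAG{32};
};
```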


58-58: LGTM: BaseCacheFormatter type alias correctly defined.

The type alias correctly references the BaseCacheFormatter from the kv_cache_manager namespace, maintaining consistency with the new architecture.


115-121: LGTM: CacheSender constructor signature properly designed.

The constructor correctly accepts:

  • ConnectionManager* for managing connections
  • CacheState for cache state information
  • SizeType32 for self index
  • std::unique_ptr<BaseCacheFormatter> for cache formatting operations

The use of std::unique_ptr follows C++ best practices for single resource ownership as specified in the coding guidelines.


122-126: LGTM: sendAsync method signature correctly updated.

The method signature properly:

  • Returns std::future<void> for asynchronous operation
  • Takes LlmRequest& as parameter
  • Uses [[nodiscard]] attribute appropriately
  • Includes comprehensive documentation

130-134: LGTM: Communication state accessors properly updated.

The getCommState() and setCommState() methods correctly use the new executor::kv_cache::CommState type, maintaining the same interface pattern as before but with updated types.


136-150: LGTM: Additional methods properly defined for the refactored architecture.

The new methods correctly provide:

  • recvRequestInfo() for receiving request information
  • sendSync() for synchronous sending
  • sendRequestInfo() returning TransferSession
  • receiveSync() accepting TransferSession&

All methods use appropriate parameter types and return types for the new architecture.


153-153: LGTM: Destructor properly declared.

The destructor is correctly declared and will be implemented to properly clean up resources.


160-177: LGTM: CacheReceiver class properly designed.

The CacheReceiver class correctly:

  • Has matching constructor signature with CacheSender
  • Provides receiveAsync() method for asynchronous reception
  • Includes sendRequestInfo() and receiveSync() methods for coordination
  • Uses proper parameter and return types
  • Follows the same design patterns as CacheSender

24-24: LGTM! Proper include for new dependency.

The inclusion of cacheFormatter.h aligns with the refactoring to use kv_cache_manager types.


42-44: LGTM! Clean type aliasing approach.

The type aliases consolidate dependencies and the TransferSession alias properly delegates to the kv_cache_manager namespace as intended by the refactor.


45-57: LGTM! Well-structured communication protocol.

The TransceiverTag struct provides a clean enumeration of message types and associated tag constants for the communication protocol.


58-58: LGTM! Consistent type aliasing.

The BaseCacheFormatter alias maintains consistency with the delegation pattern to kv_cache_manager.


160-182: LGTM! Well-designed receiver interface.

The CacheReceiver class interface mirrors the sender pattern appropriately, with proper constructor signature and async receive capability.


184-184: LGTM! Proper namespace closure comment.

The namespace closure comment follows the coding guidelines requirement for namespace comments.


115-158: Constructor Validation Confirmed – No Action Needed

The Impl constructor already enforces non-null and consistent state checks:

  • Line 127: TLLM_CHECK(mManager); ensures manager is not null.
  • Line 128: TLLM_CHECK(mManager->getCommState().getSelfIdx() == selfIndex); verifies selfIndex matches the manager’s state.

All public methods simply forward to mImpl, maintaining interface consistency.

cpp/tensorrt_llm/batch_manager/dataTransceiver.cpp (31)

26-26: LGTM: Include added for agent connection utilities.

The new include for agent connection utilities is correctly added to support the refactored connection management functionality.


38-45: LGTM: Helper function and type aliases properly implemented.

The code correctly:

  • Defines type aliases for the agent connection manager and data context
  • Implements tagFromRequestId() helper function with proper bit manipulation to create unique tags
  • Uses constexpr for the constant as per coding guidelines
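The bit-manipulation idea can be sketched as follows, assuming the goal is to pack the request ID into the bits above a fixed base tag; the mask, shift width, and constant value are assumptions for illustration, not the actual implementation.

```cpp
#include <cassert>
#include <cstdint>

// Illustrative base tag constant (k-prefixed uppercase per the guidelines).
static constexpr int32_t kDATA_TAG{43};

// Derive a per-request tag: keep the low 23 bits of the request ID and shift
// them above the base tag's bit range so distinct requests get distinct tags.
static int32_t tagFromRequestId(uint64_t requestId)
{
    constexpr int kBASE_TAG_BITS = 8;
    return static_cast<int32_t>(((requestId & 0x7FFFFF) << kBASE_TAG_BITS)
        | static_cast<uint64_t>(kDATA_TAG));
}
```

The low bits of the combined value still identify the message class, while the upper bits disambiguate concurrent requests.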

47-57: LGTM: ReceiveCacheResource struct properly designed.

The struct correctly:

  • Manages buffer manager and CUDA event resources
  • Uses move semantics in the constructor
  • Follows proper resource management patterns
  • Uses member initializer list as per C++ best practices

115-132: LGTM: CacheSender::Impl constructor properly implemented.

The constructor correctly:

  • Initializes all member variables using member initializer list
  • Performs proper null pointer checks with TLLM_CHECK
  • Validates self index consistency
  • Sets up CUDA device context
  • Launches the response thread asynchronously

The use of std::make_shared for CudaStream follows smart pointer best practices.


134-149: LGTM: sendAsync method properly implemented with thread synchronization.

The method correctly:

  • Creates promise/future pair for asynchronous operation
  • Uses proper mutex locking for thread safety
  • Adds requests to the ready responses queue
  • Notifies waiting threads via condition variable
  • Returns the future for caller synchronization

The nested locking approach prevents race conditions.


230-237: LGTM: sendSync method properly implemented.

The method correctly:

  • Looks up the transfer session by request ID
  • Sets the LLM request on the session
  • Delegates formatting to the formatter
  • Includes proper error checking

251-265: LGTM: sendAndRemoveResponse method with proper exception handling.

The method correctly:

  • Uses CUDA device setting for proper context
  • Calls synchronous send and cleanup
  • Provides proper exception handling with promise notification
  • Uses noexcept specification appropriately

267-345: LGTM: Response thread implementation with comprehensive synchronization.

The response thread correctly:

  • Sets thread name for debugging
  • Uses proper CUDA device context
  • Implements condition variable wait with proper predicates
  • Handles termination gracefully
  • Manages request counting and cleanup
  • Supports both parallel and sequential sending modes
  • Includes comprehensive error handling and logging

The thread synchronization logic appears sound with proper use of mutexes and condition variables.


408-418: LGTM: CacheReceiver::Impl constructor properly implemented.

The constructor follows the same pattern as CacheSender::Impl with:

  • Proper member initialization
  • Null pointer validation
  • Self index consistency check
  • CUDA device setup

460-463: LGTM: receiveSync method properly delegates to formatter.

The method correctly delegates the unformatting operation to the formatter using the transfer session, maintaining proper separation of concerns.


465-529: LGTM: sendRequestInfo method with comprehensive connection handling.

The method correctly:

  • Validates formatter support for cache states
  • Handles selective cache transfer optimization
  • Supports both agent and non-agent connection managers
  • Manages buffer index assignment for agent connections
  • Creates transfer session with proper parameters
  • Includes comprehensive error checking and validation

The dual path handling for agent vs. non-agent connections is properly implemented.


531-547: LGTM: getReceiveCacheResource method with proper resource management.

The method correctly:

  • Uses scoped lock for thread safety
  • Implements per-process resource isolation when concurrent mode is enabled
  • Creates resources on-demand with proper RAII
  • Uses default process string for non-concurrent mode
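The on-demand, per-process resource pattern above can be sketched like this; ReceiveResource, ResourceRegistry, and the "default" key are illustrative stand-ins for the actual buffer-manager/CUDA-event resources.

```cpp
#include <cassert>
#include <map>
#include <memory>
#include <mutex>
#include <string>

// Placeholder for the per-process receive resources (buffer manager, event).
struct ReceiveResource
{
    std::string owner;
};

class ResourceRegistry
{
public:
    // Create-on-demand lookup behind a scoped lock: concurrent mode isolates
    // resources per process, otherwise everyone shares a single default entry.
    std::shared_ptr<ReceiveResource> getResource(std::string const& processId, bool concurrent)
    {
        std::scoped_lock lk(mMtx);
        std::string key = concurrent ? processId : "default";
        auto it = mResources.find(key);
        if (it == mResources.end())
        {
            it = mResources.emplace(key, std::make_shared<ReceiveResource>(ReceiveResource{key})).first;
        }
        return it->second;
    }

private:
    std::mutex mMtx;
    std::map<std::string, std::shared_ptr<ReceiveResource>> mResources;
};
```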

549-559: LGTM: sendRequestInfo helper method properly implemented.

The helper method correctly:

  • Serializes request info using the established serialization utilities
  • Sends data in the proper order (ID, size, then data)
  • Uses appropriate data contexts for different message components

700-704: LGTM: CacheSender public constructor properly implemented.

The constructor correctly creates the implementation, transferring ownership of the formatter parameter with std::move.


706-721: LGTM: CacheSender public interface methods properly implemented.

All public methods correctly delegate to the implementation:

  • sendAsync() forwards the request properly
  • getCommState() and setCommState() maintain the interface contract
  • Destructor is properly defaulted

723-734: LGTM: CacheReceiver public interface properly implemented.

The constructor and methods correctly:

  • Create implementation with proper parameter forwarding
  • Delegate receiveAsync() to the multi-threaded implementation
  • Use defaulted destructor for proper cleanup

176-228: Dynamic_cast usage is safe – no action needed

Verified that ConnectionManager is indeed polymorphic (it declares virtual methods, including recvConnect, getConnections, and getCommState), so the dynamic_cast in recvRequestInfo() will behave as intended. The interface is stable and handles both agent and non-agent scenarios correctly.
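The dual-path dispatch amounts to probing the dynamic type of the manager; a minimal sketch with illustrative class names:

```cpp
#include <cassert>

// Polymorphic base: a virtual destructor is enough for dynamic_cast to work.
class ConnectionManager
{
public:
    virtual ~ConnectionManager() = default;
};

// Agent-specific variant that the receive path treats differently.
class AgentConnectionManager : public ConnectionManager
{
};

// dynamic_cast yields nullptr for non-agent managers, selecting the
// direct-connection path; a non-null result selects the agent path.
static bool isAgentManager(ConnectionManager* manager)
{
    return dynamic_cast<AgentConnectionManager*>(manager) != nullptr;
}
```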


26-26: LGTM! Updated include for agent utilities.

The include path update aligns with the refactored cache transmission structure.


38-45: LGTM! Clean helper function with proper naming.

The tagFromRequestId function follows proper naming conventions and provides a clean abstraction for generating unique tags from request IDs.


47-57: LGTM! Well-structured resource management.

The ReceiveCacheResource struct properly encapsulates buffer manager and CUDA event resources with appropriate constructor.


134-149: LGTM! Proper async pattern with promise/future.

The sendAsync method correctly implements the promise/future pattern with proper synchronization using mutexes and condition variables.


176-228: LGTM! Comprehensive request info handling with proper validation.

The recvRequestInfo method properly handles both agent and direct connection scenarios, with appropriate validation of cache state compatibility and connection management.


230-237: LGTM! Clean synchronous send implementation.

The sendSync method properly retrieves the session and delegates to the formatter, with appropriate error checking.


405-418: LGTM! Consistent constructor pattern.

The CacheReceiver::Impl constructor follows the same validation and initialization pattern as the sender.


465-529: LGTM! Comprehensive request info sending with agent support.

The sendRequestInfo method properly handles both agent and direct connection scenarios, with appropriate buffer management and connection selection logic.


531-547: LGTM! Thread-safe resource management with proper scoping.

The getReceiveCacheResource method uses proper scoped locking and resource management patterns for concurrent access.


549-559: LGTM! Clean request info serialization.

The sendRequestInfo helper method properly serializes and sends request information using the defined communication protocol.


700-735: LGTM! Clean public interface implementation.

The public interface methods properly delegate to the pImpl pattern implementations with appropriate parameter forwarding and resource management.


736-736: LGTM! Proper namespace closure comment.

The namespace closure comment follows the coding guidelines requirement.


267-345: Thread name and exception handling verified

The thread name “dataTransResp” is only set here and does not conflict with other threads. Exception propagation via set_exception(std::current_exception()) matches the pattern used elsewhere (e.g., in workerPool.h). No further changes are needed.


115-132: Ensure ConnectionManager Access Is Thread-Safe

I didn’t find any explicit synchronization (mutexes/locks) around calls to mManager in this class. Since you spawn a background thread via std::async for Impl::response, concurrent calls to mManager could race unless ConnectionManager itself is fully thread-safe.

Please confirm or address this:

• Constructor (lines 115–132): validates and reads from mManager
• Impl::response (launched async): invokes mManager->recvConnect(...) and other methods
• receiveAsync() and other methods: call mManager->getConnections(...)

If executor::kv_cache::ConnectionManager is not inherently thread-safe, wrap all shared calls in a mutex or otherwise serialize access to mManager.

Comment on lines +161 to +174
[[nodiscard]] size_t getCounterpartsCount(LlmRequest::RequestIdType requestId) const
{
auto it = mRequestToSession.find(requestId);
TLLM_CHECK(it != mRequestToSession.end());
return it->second.getConnections().size();
}

void release(LlmRequest::RequestIdType requestId)
{
auto it = mRequestToSession.find(requestId);
TLLM_CHECK(it != mRequestToSession.end());
std::unique_lock<std::mutex> lk(mMtxForMap);
mRequestToSession.erase(it);
}
Contributor

🛠️ Refactor suggestion

Check thread safety of session map access.

The getCounterpartsCount method reads mRequestToSession without taking mMtxForMap, and release performs its find and TLLM_CHECK before acquiring the lock, so both lookups can race with concurrent map mutations. Ensure every access to the map is protected by the mutex.

 [[nodiscard]] size_t getCounterpartsCount(LlmRequest::RequestIdType requestId) const
 {
+    std::unique_lock<std::mutex> lk(mMtxForMap);
     auto it = mRequestToSession.find(requestId);
     TLLM_CHECK(it != mRequestToSession.end());
     return it->second.getConnections().size();
 }
🤖 Prompt for AI Agents
In cpp/tensorrt_llm/batch_manager/dataTransceiver.cpp around lines 161 to 174,
the methods getCounterpartsCount and release access the mRequestToSession map
without full synchronization, which can cause race conditions. To fix this,
ensure that all accesses to mRequestToSession are protected by the mutex
mMtxForMap: add a std::unique_lock<std::mutex> lock(mMtxForMap) at the start of
getCounterpartsCount, and in release acquire the lock before calling find so
the lookup itself is covered.

@tensorrt-cicd
Collaborator

PR_Github #12929 [ run ] completed with state FAILURE
/LLM/main/L0_MergeRequest_PR pipeline #9638 completed with status: 'FAILURE'

@Tabrizian Tabrizian force-pushed the user/imant/useHashIds branch 2 times, most recently from ddc66c6 to 7d3e452 on July 25, 2025 19:27
@Tabrizian
Member Author

/bot run

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

🔭 Outside diff range comments (2)
cpp/tensorrt_llm/batch_manager/dataTransceiver.h (1)

18-18: Replace #pragma once with required preprocessor guard.

The coding guidelines specifically require preprocessor guards in header files with the format TRTLLM_ followed by the filename in caps.

Replace the current directive with the required guard format:

-#pragma once
+#ifndef TRTLLM_DATATRANSCEIVER_H
+#define TRTLLM_DATATRANSCEIVER_H

And add the closing directive at the end of the file:

 } // namespace tensorrt_llm::batch_manager
+
+#endif // TRTLLM_DATATRANSCEIVER_H
cpp/tensorrt_llm/batch_manager/cacheFormatter.h (1)

225-261: Fix member variable naming convention.

The member variable kvCacheMeasureHelper should follow the coding guidelines for class member variables.

Apply this fix:

-    KvCacheMeasureHelper kvCacheMeasureHelper{common::getEnvKVCacheTransferOutputPath()};
+    KvCacheMeasureHelper mKvCacheMeasureHelper{common::getEnvKVCacheTransferOutputPath()};

Class member variables should use camelCase prefixed with 'm' according to the coding guidelines.

♻️ Duplicate comments (2)
cpp/tensorrt_llm/batch_manager/dataTransceiver.cpp (2)

161-174: Thread safety issue with session map access.

The getCounterpartsCount method accesses mRequestToSession without proper synchronization, while release uses mutex protection. This creates a race condition.

This is the same thread safety issue identified in previous reviews. The getCounterpartsCount method needs mutex protection:

 [[nodiscard]] size_t getCounterpartsCount(LlmRequest::RequestIdType requestId) const
 {
+    std::unique_lock<std::mutex> lk(mMtxForMap);
     auto it = mRequestToSession.find(requestId);
     TLLM_CHECK(it != mRequestToSession.end());
     return it->second.getConnections().size();
 }

420-424: TODO: Address inefficient thread creation pattern.

The method creates a new thread for each async receive operation, which is inefficient as noted in the TODO comment.

This matches the previous review feedback about frequent thread creation. The TODO comment correctly identifies this as needing improvement. Consider using the existing requestAndReceiveAsyncMultiThreads pattern or implementing a thread pool.

📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 90d8774 and 7d3e452.

📒 Files selected for processing (13)
  • cpp/include/tensorrt_llm/batch_manager/cacheTransceiver.h (2 hunks)
  • cpp/tensorrt_llm/batch_manager/CMakeLists.txt (0 hunks)
  • cpp/tensorrt_llm/batch_manager/cacheFormatter.h (3 hunks)
  • cpp/tensorrt_llm/batch_manager/cacheTransceiver.cpp (7 hunks)
  • cpp/tensorrt_llm/batch_manager/dataTransceiver.cpp (15 hunks)
  • cpp/tensorrt_llm/batch_manager/dataTransceiver.h (3 hunks)
  • cpp/tensorrt_llm/batch_manager/dataTransceiverImpl.cpp (0 hunks)
  • cpp/tensorrt_llm/batch_manager/dataTransceiverImpl.h (0 hunks)
  • cpp/tensorrt_llm/executor/cache_transmission/agent_utils/connection.cpp (2 hunks)
  • cpp/tensorrt_llm/executor/cache_transmission/agent_utils/connection.h (2 hunks)
  • cpp/tensorrt_llm/executor/cache_transmission/ucx_utils/connection.cpp (1 hunks)
  • cpp/tests/batch_manager/cacheTransceiverTest.cpp (11 hunks)
  • cpp/tests/unit_tests/executor/ucxCommTest.cpp (0 hunks)
💤 Files with no reviewable changes (4)
  • cpp/tensorrt_llm/batch_manager/CMakeLists.txt
  • cpp/tests/unit_tests/executor/ucxCommTest.cpp
  • cpp/tensorrt_llm/batch_manager/dataTransceiverImpl.h
  • cpp/tensorrt_llm/batch_manager/dataTransceiverImpl.cpp
✅ Files skipped from review due to trivial changes (1)
  • cpp/tensorrt_llm/executor/cache_transmission/ucx_utils/connection.cpp
🚧 Files skipped from review as they are similar to previous changes (5)
  • cpp/tensorrt_llm/executor/cache_transmission/agent_utils/connection.h
  • cpp/include/tensorrt_llm/batch_manager/cacheTransceiver.h
  • cpp/tensorrt_llm/executor/cache_transmission/agent_utils/connection.cpp
  • cpp/tests/batch_manager/cacheTransceiverTest.cpp
  • cpp/tensorrt_llm/batch_manager/cacheTransceiver.cpp
🧰 Additional context used
📓 Path-based instructions (3)
**/*.{cpp,h,hpp,cc,cxx}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

**/*.{cpp,h,hpp,cc,cxx}: Closing braces of namespaces should have a comment saying the namespace it closes (e.g., } // namespace foo)
Prefer const or constexpr variables over #defines whenever possible, as the latter are not visible to the compiler.
A variable that is not modified after its initialization should be declared as const.
Except 0, nullptr, true, false, all other literals should only be used for variable initialization.
Use the Allman indentation style for braces in C++ code.
Put the semicolon for an empty for or while loop in a new line.
The statement forming the body of a switch, while, do .. while or for statement shall be a compound statement (use brace-delimited statements).
If and else should always be followed by brace-delimited statements, even if empty or a single statement.
C++ filenames should use camel case with first letter lowercase (e.g., thisIsAFilename.cpp), and must be case-insensitive unique within a compilation target.
All types (including class names) should use camel case with uppercase first letter (e.g., FooBarClass).
Local variables, methods, and namespaces should use camel case with first letter lowercase (e.g., localFooBar).
Non-magic-number global variables that are non-static and not defined in anonymous namespace should use camel case prefixed by a lower case 'g' (e.g., gDontUseGlobalFoos).
Non-magic-number global variables that are static or defined in an anonymous namespace should use camel case prefixed by a lower case 's' (e.g., sMutableStaticGlobal).
Locally visible static variable should use camel case with lowercase prefix 's' as the first letter of the name (e.g., static std::once_flag sFlag;).
Class member variables should use camel case prefixed with an 'm' (e.g., mNbFooValues). Public member variables do not require the 'm' prefix but it is encouraged for clarity.
Enumerations, global constants, static constants at class-scope, and function-scope magic-number/literal constants are uppercase snakecase with prefix...

Files:

  • cpp/tensorrt_llm/batch_manager/cacheFormatter.h
  • cpp/tensorrt_llm/batch_manager/dataTransceiver.h
  • cpp/tensorrt_llm/batch_manager/dataTransceiver.cpp
**/*.{h,hpp}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

Use a preprocessor guard in header files. The guard name must have prefix TRTLLM_ followed by the filename, all in caps, and no trailing underscore.

Files:

  • cpp/tensorrt_llm/batch_manager/cacheFormatter.h
  • cpp/tensorrt_llm/batch_manager/dataTransceiver.h
**/*.{cpp,h,hpp,cc,cxx,cu,py}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

All TensorRT-LLM Open Source Software code should contain an NVIDIA copyright header that includes the current year. This includes .cpp, .h, .cu, .py, and any other source files which are compiled or interpreted.

Files:

  • cpp/tensorrt_llm/batch_manager/cacheFormatter.h
  • cpp/tensorrt_llm/batch_manager/dataTransceiver.h
  • cpp/tensorrt_llm/batch_manager/dataTransceiver.cpp
🧠 Learnings (3)
cpp/tensorrt_llm/batch_manager/cacheFormatter.h (1)

Learnt from: amitz-nv
PR: #5616
File: tensorrt_llm/executor/worker.py:375-384
Timestamp: 2025-07-17T09:01:27.402Z
Learning: In tensorrt_llm/executor/worker.py, the LoRA adapter cache optimization logic that checks is_adapter_in_cpu_cache() and conditionally passes None for weights/config has a known race condition issue that cannot be solved with simple error handling or verification checks. This is a known limitation that requires a more comprehensive solution.

cpp/tensorrt_llm/batch_manager/dataTransceiver.h (1)

Learnt from: amitz-nv
PR: #5616
File: tensorrt_llm/executor/worker.py:375-384
Timestamp: 2025-07-17T09:01:27.402Z
Learning: In tensorrt_llm/executor/worker.py, the LoRA adapter cache optimization logic that checks is_adapter_in_cpu_cache() and conditionally passes None for weights/config has a known race condition issue that cannot be solved with simple error handling or verification checks. This is a known limitation that requires a more comprehensive solution.

cpp/tensorrt_llm/batch_manager/dataTransceiver.cpp (1)

Learnt from: amitz-nv
PR: #5616
File: tensorrt_llm/executor/worker.py:375-384
Timestamp: 2025-07-17T09:01:27.402Z
Learning: In tensorrt_llm/executor/worker.py, the LoRA adapter cache optimization logic that checks is_adapter_in_cpu_cache() and conditionally passes None for weights/config has a known race condition issue that cannot be solved with simple error handling or verification checks. This is a known limitation that requires a more comprehensive solution.

🔇 Additional comments (24)
cpp/tensorrt_llm/batch_manager/dataTransceiver.h (6)

1-16: LGTM! Copyright header complies with coding guidelines.

The NVIDIA copyright header includes the current year (2025) as required by the coding guidelines for TensorRT-LLM Open Source Software.


24-24: Good addition of cacheFormatter.h include.

The new include aligns with the refactored design that moves TransferSession definition to the kv_cache_manager namespace.


42-58: Well-designed TransceiverTag structure follows coding guidelines.

The type aliases and TransceiverTag struct properly follow C++ naming conventions:

  • SizeType32 alias uses camelCase as required
  • TransferSession alias provides clear abstraction boundary
  • Enum values use uppercase naming (REQUEST_SEND, TERMINATION)
  • Static constexpr constants use kCONSTANT_NAME pattern with proper k prefix

115-158: CacheSender class design follows best practices.

The refactored class demonstrates good design principles:

  • Proper UpperCamelCase naming for class name
  • Method names use lowerCamelCase convention
  • Smart pointer usage with unique_ptr for resource management
  • Pimpl pattern reduces compilation dependencies
  • Constructor parameters use appropriate types (ConnectionManager*, CacheState, etc.)

160-182: CacheReceiver class maintains design consistency.

The class follows the same excellent design patterns as CacheSender:

  • Consistent naming conventions and interface design
  • Symmetric constructor signature with CacheSender
  • Appropriate method names for receiving operations
  • Proper use of pimpl pattern for implementation hiding

184-184: Perfect namespace closing comment.

The closing brace comment follows the coding guidelines requirement exactly: } // namespace tensorrt_llm::batch_manager

cpp/tensorrt_llm/batch_manager/cacheFormatter.h (4)

1-16: Copyright header meets coding guidelines.

The NVIDIA copyright header correctly includes the current year (2025) as required for TensorRT-LLM Open Source Software.


45-121: TransferSession class demonstrates excellent C++ design practices.

The class properly follows all coding guidelines:

  • UpperCamelCase class name
  • Member variables with proper 'm' prefix (mConnections, mDataContext, etc.)
  • Method names use lowerCamelCase convention
  • Proper const-correctness with const methods and const references
  • Good use of TLLM_CHECK for assertions
  • Clean initializer list in constructor

123-161: BaseCacheFormatter interface demonstrates clean design.

The interface properly follows coding guidelines:

  • Method names use lowerCamelCase convention
  • Proper virtual destructor
  • Clean signature with TransferSession& parameter
  • Appropriate const-correctness
  • Good use of [[nodiscard]] attribute for return values

266-266: Perfect namespace closing comment.

The closing brace comment properly follows coding guidelines with the complete namespace path: } // namespace tensorrt_llm::batch_manager::kv_cache_manager

cpp/tensorrt_llm/batch_manager/dataTransceiver.cpp (14)

1-16: Copyright header complies with coding guidelines.

The NVIDIA copyright header correctly includes the current year (2025) as required for TensorRT-LLM Open Source Software.


38-57: Well-designed utility functions and structures.

The code follows C++ coding guidelines properly:

  • tagFromRequestId function uses lowerCamelCase and static linkage
  • ReceiveCacheResource struct uses UpperCamelCase naming
  • Member variables correctly use 'm' prefix (mBufferManager, mCudaEvent)
  • Constant kDATA_TAG follows uppercase with 'k' prefix convention

115-133: CacheSender::Impl constructor follows best practices.

The constructor demonstrates good C++ practices:

  • Proper member initialization list
  • Smart pointer usage with unique_ptr
  • Appropriate validation with TLLM_CHECK
  • Member variables follow 'm' prefix convention

176-228: Complex but well-structured request info handling.

The method demonstrates good practices:

  • Proper type checking with dynamic_cast for agent connections
  • Clean lambda usage for agent-specific logic
  • Correct mutex synchronization for mRequestToSession access
  • Good validation of formatter compatibility between cache states
  • Appropriate session creation and connection management

230-237: Clean and focused sendSync implementation.

The method properly delegates to the formatter and manages the session state appropriately.


251-403: Robust threading and response management.

The response handling demonstrates excellent threading practices:

  • Proper exception handling with promise/future pattern
  • Correct CUDA device management across threads
  • Thread naming for debugging (dataTransResp)
  • Proper condition variable usage for synchronization
  • Clean termination handling in destructor

405-418: CacheReceiver::Impl constructor follows consistent design.

The constructor maintains the same excellent practices as CacheSender::Impl with proper initialization and validation.


426-458: Excellent async request handling with thread pool pattern.

This method demonstrates the proper solution for efficient async operations:

  • Resource pooling per process to avoid thread creation overhead
  • Queue-based task distribution with worker threads
  • Proper synchronization with mutex and condition variables
  • Good exception handling and future management
  • This addresses the inefficiency noted in the receiveAsync TODO

460-463: Clean receiveSync delegation.

Simple and appropriate delegation to the formatter's unformat method.


465-559: Comprehensive sendRequestInfo implementation.

The method handles complex scenarios well:

  • Proper validation of formatter support between cache states
  • Clean handling of both agent and regular connection managers
  • Appropriate serialization of request information
  • Good resource management with TransferSession creation
  • Proper exception handling and validation throughout

531-559: Solid resource management with thread safety.

The resource management demonstrates good practices:

  • Thread-safe access with scoped_lock<mutex>
  • Lazy initialization of per-process resources
  • Proper RAII with unique_ptr and BufferManager
  • Clean helper method for connection communication

561-698: Excellent async resource management and cleanup.

The implementation demonstrates robust design:

  • Proper RAII in destructor with resource cleanup and future synchronization
  • Good CUDA device management and logging in requestSync
  • Well-designed AsyncResource structure for thread coordination
  • Proper move semantics in RequestAndPromise
  • Comprehensive error handling with exception propagation
  • Thread naming for debugging (dataTransRequest)

700-734: Clean public interface following pimpl pattern.

The public interface demonstrates excellent design:

  • Proper delegation to implementation classes
  • Clean constructor forwarding with smart pointer management
  • Appropriate method delegation maintaining interface contracts
  • Correctly defaulted destructors for pimpl pattern

736-737: Perfect namespace closing comment.

The closing brace comment properly follows coding guidelines: } // namespace tensorrt_llm::batch_manager

@@ -18,11 +18,11 @@
#pragma once

🛠️ Refactor suggestion

Replace #pragma once with required preprocessor guard.

The coding guidelines mandate preprocessor guards in header files with the format TRTLLM_ followed by the filename in caps.

Replace with the required guard format:

-#pragma once
+#ifndef TRTLLM_CACHEFORMATTER_H
+#define TRTLLM_CACHEFORMATTER_H

And add the closing directive at the end of the file:

 } // namespace tensorrt_llm::batch_manager::kv_cache_manager
+
+#endif // TRTLLM_CACHEFORMATTER_H
🤖 Prompt for AI Agents
In cpp/tensorrt_llm/batch_manager/cacheFormatter.h at line 18, replace the
#pragma once directive with a preprocessor guard following the coding
guidelines. Use the format TRTLLM_CACHEFORMATTER_H for the guard macro. Add
#ifndef TRTLLM_CACHEFORMATTER_H and #define TRTLLM_CACHEFORMATTER_H at the top
of the file, and add the corresponding #endif at the end of the file to close
the guard.

Comment on lines 163 to 221
class KvCacheMeasureHelper
{
public:
KvCacheMeasureHelper(std::string output_path)
: mOutputPath(std::move(output_path))
{
}

void appendKVCacheTransfer(LlmRequest::RequestIdType requestId, double duration, size_t size)
{
auto bandwidth = size * 8 / (duration / 1000) / 1e9;
if (mOutputPath.empty())
{
return;
}

std::lock_guard<std::mutex> lock(mMutex);
mRequestKVCacheTranfserMeasure[requestId].emplace_back(duration, bandwidth);
}

~KvCacheMeasureHelper()
{
if (!mRequestKVCacheTranfserMeasure.empty() && !mOutputPath.empty())
{
auto rank = mpi::MpiComm::world().getRank();
std::string outFilePath = mOutputPath + "rank_" + std::to_string(rank) + ".txt";
std::ofstream outFile(outFilePath);

TLLM_CHECK_WITH_INFO(outFile.is_open(), "Cannot write to file " + outFilePath);

size_t numTransferMeasure = mRequestKVCacheTranfserMeasure.begin()->second.size();

outFile << "RequestID";
for (size_t i = 0; i < numTransferMeasure; i++)
{
outFile << ",TimeDuration,Bandwidth";
}
outFile << '\n';

for (auto const& [requestID, measures] : mRequestKVCacheTranfserMeasure)
{
outFile << requestID;

for (auto const& [time, bandwidth] : measures)
{
outFile << "," << time << "," << bandwidth;
}
outFile << '\n';
}

outFile.close();
}
}

private:
std::map<LlmRequest::RequestIdType, std::vector<std::pair<double, double>>> mRequestKVCacheTranfserMeasure;
std::string mOutputPath;
std::mutex mMutex;
};

🛠️ Refactor suggestion

Fix typo in member variable name and improve parameter naming.

The class implementation is good but has minor naming issues:

Apply these fixes:

-class KvCacheMeasureHelper
-{
-public:
-    KvCacheMeasureHelper(std::string output_path)
-        : mOutputPath(std::move(output_path))
+class KvCacheMeasureHelper
+{
+public:
+    KvCacheMeasureHelper(std::string outputPath)
+        : mOutputPath(std::move(outputPath))
-    std::map<LlmRequest::RequestIdType, std::vector<std::pair<double, double>>> mRequestKVCacheTranfserMeasure;
+    std::map<LlmRequest::RequestIdType, std::vector<std::pair<double, double>>> mRequestKVCacheTransferMeasure;

And update the corresponding usage in the destructor.

Committable suggestion skipped: line range outside the PR's diff.

🤖 Prompt for AI Agents
In cpp/tensorrt_llm/batch_manager/cacheFormatter.h between lines 163 and 221,
fix the typo in the member variable name from mRequestKVCacheTranfserMeasure to
mRequestKVCacheTransferMeasure consistently throughout the class, including in
appendKVCacheTransfer and the destructor. Also, rename the parameters in
appendKVCacheTransfer to more descriptive names such as requestId, durationMs,
and sizeBytes for clarity. Update all usages of these variables accordingly.

@tensorrt-cicd
Collaborator

PR_Github #13033 [ run ] triggered by Bot

@tensorrt-cicd
Collaborator

PR_Github #13033 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #9736 completed with status: 'FAILURE'

Signed-off-by: Iman Tabrizian <[email protected]>
Signed-off-by: Iman Tabrizian <[email protected]>
@Tabrizian Tabrizian force-pushed the user/imant/useHashIds branch from 7d3e452 to 1a21dba Compare August 12, 2025 20:06
@Tabrizian Tabrizian requested review from a team as code owners August 12, 2025 20:06

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 8

🔭 Outside diff range comments (1)
cpp/include/tensorrt_llm/batch_manager/kvCacheUtils.h (1)

17-17: Replace #pragma once with required preprocessor guard.

Header files in this project must use include guards named TRTLLM_<FILENAME>_H without underscores prefix/suffix.

Replace with the required guard format:

-#pragma once
+#ifndef TRTLLM_KVCACHEUTILS_H
+#define TRTLLM_KVCACHEUTILS_H

And add the closing directive at the end of the file:

 } // namespace tensorrt_llm::batch_manager::kv_cache_manager
+
+#endif // TRTLLM_KVCACHEUTILS_H
♻️ Duplicate comments (4)
cpp/tensorrt_llm/batch_manager/cacheFormatter.h (2)

18-18: Replace #pragma once with required preprocessor guard.

The coding guidelines mandate preprocessor guards in header files with the format TRTLLM_ followed by the filename in caps.

Replace with the required guard format:

-#pragma once
+#ifndef TRTLLM_CACHEFORMATTER_H
+#define TRTLLM_CACHEFORMATTER_H

And add the closing directive at the end of the file:

 } // namespace tensorrt_llm::batch_manager::kv_cache_manager
+
+#endif // TRTLLM_CACHEFORMATTER_H

111-111: Fix typo in member variable name.

The member variable name contains a typo: "Tranfser" should be "Transfer".

Apply this fix:

-    std::map<LlmRequest::RequestIdType, std::vector<Measure>> mRequestKVCacheTranfserMeasure;
+    std::map<LlmRequest::RequestIdType, std::vector<Measure>> mRequestKVCacheTransferMeasure;

Also update the usage in line 71:

-        mRequestKVCacheTranfserMeasure[requestId].emplace_back(Measure{delay, duration, bandwidth});
+        mRequestKVCacheTransferMeasure[requestId].emplace_back(Measure{delay, duration, bandwidth});

And in the destructor (lines 76 and 86):

-        if (!mRequestKVCacheTranfserMeasure.empty() && !mOutputPath.empty())
+        if (!mRequestKVCacheTransferMeasure.empty() && !mOutputPath.empty())
-            size_t numTransferMeasure = mRequestKVCacheTranfserMeasure.begin()->second.size();
+            size_t numTransferMeasure = mRequestKVCacheTransferMeasure.begin()->second.size();

And line 95:

-            for (auto const& [requestID, measures] : mRequestKVCacheTranfserMeasure)
+            for (auto const& [requestID, measures] : mRequestKVCacheTransferMeasure)
cpp/tensorrt_llm/batch_manager/dataTransceiver.cpp (2)

161-174: Thread safety issue in getCounterpartsCount method.

The method accesses mRequestToSession without acquiring the mutex, which could cause race conditions with concurrent modifications.

 [[nodiscard]] size_t getCounterpartsCount(LlmRequest::RequestIdType requestId) const
 {
+    std::unique_lock<std::mutex> lk(mMtxForMap);
     auto it = mRequestToSession.find(requestId);
     TLLM_CHECK(it != mRequestToSession.end());
     return it->second.getConnections().size();
 }

430-434: TODO: Address frequent thread creation issue.

The current implementation creates a new thread for each async receive operation, which is inefficient as noted in the TODO comment.

Consider using the existing requestAndReceiveAsyncMultiThreads method or implementing a thread pool:

 [[nodiscard]] std::future<void> receiveAsync(LlmRequest& llmRequest)
 {
-    // TODO: Modify the implementation here to avoid frequent thread creation.
-    return std::async(std::launch::async, &CacheReceiver::Impl::requestSync, this, std::ref(llmRequest));
+    return requestAndReceiveAsyncMultiThreads(llmRequest);
 }
🧹 Nitpick comments (4)
cpp/tensorrt_llm/batch_manager/cacheFormatter.cpp (1)

42-44: Consider improving function signature consistency.

The function now accepts blockHashes but internally converts them to block IDs in line 55. Consider either:

  1. Renaming the parameter to better reflect its dual purpose
  2. Accepting block IDs directly if the caller already has them

Consider this alternative approach for better clarity:

 BlockRange getBlockRangeForSending(
-    BaseKVCacheManager* cacheManager, LlmRequest const& llmRequest, std::vector<size_t> const& blockHashes)
+    BaseKVCacheManager* cacheManager, LlmRequest const& llmRequest, std::vector<size_t> const& blockHashesOrIds)
cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h (1)

1-2: Update copyright year to 2025 for new methods.

The file header indicates copyright ending in 2024, but the changes are being made in 2025. Consider updating the copyright year to reflect when these new methods are being added.

cpp/tensorrt_llm/batch_manager/dataTransceiver.h (1)

45-56: Consider making tag constants more maintainable.

The magic numbers for tags (19, 22, 32) should be documented or derived from a base value to make their purpose clearer.

 struct TransceiverTag
 {
     enum class Id : uint64_t
     {
         REQUEST_SEND = 1,
         TERMINATION = 2
     };
 
-    static constexpr int32_t kID_TAG{19};
-    static constexpr int32_t kINFO_SIZE_TAG{22};
-    static constexpr int32_t kINFO_TAG{32};
+    // Base tag for transceiver operations
+    static constexpr int32_t kBASE_TAG{19};
+    static constexpr int32_t kID_TAG{kBASE_TAG};
+    static constexpr int32_t kINFO_SIZE_TAG{kBASE_TAG + 3};
+    static constexpr int32_t kINFO_TAG{kBASE_TAG + 13};
 };
cpp/tensorrt_llm/batch_manager/dataTransceiver.cpp (1)

475-537: Complex connection management in sendRequestInfo.

The method handles both regular and agent-based connections with extensive branching. Consider extracting the agent-specific logic to a separate method for better maintainability.

 TransferSession sendRequestInfo(LlmRequest const& llmRequest)
 {
     // ... initial setup ...
     
-    auto* agentConnectionManager = dynamic_cast<executor::kv_cache::AgentConnectionManager*>(mManager);
-    std::optional<size_t> cacheBufferId = std::nullopt;
-    if (agentConnectionManager != nullptr)
-    {
-        cacheBufferId = agentConnectionManager->getCacheTransBufferManager()->assignBufferIndexForRecv();
-        TLLM_CHECK(cacheBufferId.has_value());
-        // memory Desp , validSegmentIdx send
-    }
+    auto cacheBufferId = prepareAgentCacheBuffer();
     
     // ... connection setup ...
     
     for (size_t i = 0; i < counterPartConnections.size(); i++)
     {
         auto const* connection = counterPartConnections[i];
-        // if Manager is agentConnectionManager, then send request info to agent
-        auto* agentConnectionManager = dynamic_cast<executor::kv_cache::AgentConnectionManager*>(mManager);
-        if (agentConnectionManager != nullptr)
-        {
-            // TODO: index -> validConnectionIdx conversion
-            auto valideConnectionIdx = std::find(pickUpIdx.begin(), pickUpIdx.end(), i) - pickUpIdx.begin();
-            auto* agentConnection = dynamic_cast<executor::kv_cache::AgentConnection const*>(connection);
-            TLLM_CHECK(agentConnection != nullptr);
-            TLLM_CHECK(cacheBufferId.has_value());
-            const_cast<executor::kv_cache::AgentConnection*>(agentConnection)
-                ->sendRequestAndBufferInfo(requestInfo, cacheBufferId, valideConnectionIdx);
-        }
-        else
-        {
-            sendRequestInfo(connection, requestInfo);
-        }
+        sendRequestInfoToConnection(connection, requestInfo, cacheBufferId, pickUpIdx, i);
     }
     // ... rest of method ...
 }

+private:
+    std::optional<size_t> prepareAgentCacheBuffer()
+    {
+        auto* agentManager = dynamic_cast<executor::kv_cache::AgentConnectionManager*>(mManager);
+        if (agentManager != nullptr)
+        {
+            auto bufferId = agentManager->getCacheTransBufferManager()->assignBufferIndexForRecv();
+            TLLM_CHECK(bufferId.has_value());
+            return bufferId;
+        }
+        return std::nullopt;
+    }
+
+    void sendRequestInfoToConnection(executor::kv_cache::Connection const* connection,
+        RequestInfo const& requestInfo, std::optional<size_t> cacheBufferId,
+        std::vector<size_t> const& pickUpIdx, size_t connectionIdx)
+    {
+        if (cacheBufferId.has_value())
+        {
+            auto validIdx = std::find(pickUpIdx.begin(), pickUpIdx.end(), connectionIdx) - pickUpIdx.begin();
+            auto* agentConn = dynamic_cast<executor::kv_cache::AgentConnection const*>(connection);
+            TLLM_CHECK(agentConn != nullptr);
+            const_cast<executor::kv_cache::AgentConnection*>(agentConn)
+                ->sendRequestAndBufferInfo(requestInfo, cacheBufferId, validIdx);
+        }
+        else
+        {
+            sendRequestInfo(connection, requestInfo);
+        }
+    }
📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 7d3e452 and 1a21dba.

📒 Files selected for processing (18)
  • cpp/include/tensorrt_llm/batch_manager/cacheTransceiver.h (2 hunks)
  • cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h (5 hunks)
  • cpp/include/tensorrt_llm/batch_manager/kvCacheUtils.h (1 hunks)
  • cpp/include/tensorrt_llm/batch_manager/llmRequest.h (0 hunks)
  • cpp/tensorrt_llm/batch_manager/CMakeLists.txt (0 hunks)
  • cpp/tensorrt_llm/batch_manager/cacheFormatter.cpp (3 hunks)
  • cpp/tensorrt_llm/batch_manager/cacheFormatter.h (2 hunks)
  • cpp/tensorrt_llm/batch_manager/cacheTransceiver.cpp (7 hunks)
  • cpp/tensorrt_llm/batch_manager/dataTransceiver.cpp (12 hunks)
  • cpp/tensorrt_llm/batch_manager/dataTransceiver.h (3 hunks)
  • cpp/tensorrt_llm/batch_manager/dataTransceiverImpl.cpp (0 hunks)
  • cpp/tensorrt_llm/batch_manager/dataTransceiverImpl.h (0 hunks)
  • cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp (3 hunks)
  • cpp/tensorrt_llm/batch_manager/mlaCacheFormatter.cpp (2 hunks)
  • cpp/tensorrt_llm/common/envUtils.cpp (0 hunks)
  • cpp/tensorrt_llm/executor/cache_transmission/agent_utils/connection.cpp (2 hunks)
  • cpp/tensorrt_llm/executor/cache_transmission/agent_utils/connection.h (2 hunks)
  • cpp/tensorrt_llm/executor/cache_transmission/ucx_utils/connection.cpp (1 hunks)
💤 Files with no reviewable changes (5)
  • cpp/tensorrt_llm/batch_manager/CMakeLists.txt
  • cpp/tensorrt_llm/common/envUtils.cpp
  • cpp/include/tensorrt_llm/batch_manager/llmRequest.h
  • cpp/tensorrt_llm/batch_manager/dataTransceiverImpl.h
  • cpp/tensorrt_llm/batch_manager/dataTransceiverImpl.cpp
🚧 Files skipped from review as they are similar to previous changes (5)
  • cpp/tensorrt_llm/executor/cache_transmission/agent_utils/connection.h
  • cpp/include/tensorrt_llm/batch_manager/cacheTransceiver.h
  • cpp/tensorrt_llm/executor/cache_transmission/ucx_utils/connection.cpp
  • cpp/tensorrt_llm/batch_manager/cacheTransceiver.cpp
  • cpp/tensorrt_llm/executor/cache_transmission/agent_utils/connection.cpp
🧰 Additional context used
📓 Path-based instructions (5)
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh}: In C++, close namespaces with a comment naming the namespace (e.g., } // namespace foo)
Prefer const/constexpr variables over #define for constants
Declare variables const if not modified after initialization
Use Allman brace style in C++
C++ filenames use lowerCamelCase and must be case-insensitively unique within a build target
C++ type names use UpperCamelCase
Local variables, methods, and namespaces use lowerCamelCase
Global non-static variables not in anonymous namespace use gPrefix lowerCamelCase (e.g., gExample)
Static globals or globals in anonymous namespaces use sPrefix lowerCamelCase
Locally visible static variables start with 's' (e.g., static std::once_flag sFlag;)
Member variables use mPrefix lowerCamelCase; public members may omit but are encouraged to use 'm'
Constants (enums, global/static/function-scope magic numbers) use kPREFIXED_UPPER_SNAKE (e.g., kDIGIT_NUM)
If macros are unavoidable, use UPPER_SNAKE_CASE (prefer constants over #define)
Constructor parameter that conflicts with a public member name gets trailing underscore (foo_)
Literal suffixes should be uppercase (e.g., 1234L not 1234l)
C++: use spaces only; indent 4 spaces
Run clang-format (LLVM style) before submitting; wrap lines at 120 characters
If formatting must be bypassed, use // clang-format off/on around the section
Prefer smart pointers; use unique_ptr for sole ownership, shared_ptr for shared; weak_ptr only in exceptional cases
Do not use deprecated pre-C++11 smart pointers
Use C++ style comments; avoid C comments except special inline cases; prefer // single-line
Capitalize and punctuate full-sentence comments
Follow Doxygen rules: use //! for comments and //!< for members in C++
Disable code with #if/#endif and mnemonic conditions; avoid commented-out code; avoid dead code
Do not throw exceptions across library boundaries
Use least-forceful casts; avoid removing const/volatile; avoid C-style and functional casts (except constructors); p...

Files:

  • cpp/include/tensorrt_llm/batch_manager/kvCacheUtils.h
  • cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp
  • cpp/tensorrt_llm/batch_manager/mlaCacheFormatter.cpp
  • cpp/tensorrt_llm/batch_manager/cacheFormatter.cpp
  • cpp/tensorrt_llm/batch_manager/dataTransceiver.cpp
  • cpp/tensorrt_llm/batch_manager/cacheFormatter.h
  • cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h
  • cpp/tensorrt_llm/batch_manager/dataTransceiver.h
**/*.{h,hpp,hxx,hh,cuh,cpp,cxx,cc,cu}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

Parameter names must be consistent between declarations and definitions

Files:

  • cpp/include/tensorrt_llm/batch_manager/kvCacheUtils.h
  • cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp
  • cpp/tensorrt_llm/batch_manager/mlaCacheFormatter.cpp
  • cpp/tensorrt_llm/batch_manager/cacheFormatter.cpp
  • cpp/tensorrt_llm/batch_manager/dataTransceiver.cpp
  • cpp/tensorrt_llm/batch_manager/cacheFormatter.h
  • cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h
  • cpp/tensorrt_llm/batch_manager/dataTransceiver.h
**/*.{h,hpp,hxx,hh,cuh}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

Header files must use include guards named TRTLLM_<FILENAME>_H without underscores prefix/suffix (e.g., TRTLLM_FOO_BAR_HELLO_H)

Files:

  • cpp/include/tensorrt_llm/batch_manager/kvCacheUtils.h
  • cpp/tensorrt_llm/batch_manager/cacheFormatter.h
  • cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h
  • cpp/tensorrt_llm/batch_manager/dataTransceiver.h
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh,py}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

Prepend NVIDIA copyright header (current year) to all source files

Files:

  • cpp/include/tensorrt_llm/batch_manager/kvCacheUtils.h
  • cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp
  • cpp/tensorrt_llm/batch_manager/mlaCacheFormatter.cpp
  • cpp/tensorrt_llm/batch_manager/cacheFormatter.cpp
  • cpp/tensorrt_llm/batch_manager/dataTransceiver.cpp
  • cpp/tensorrt_llm/batch_manager/cacheFormatter.h
  • cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h
  • cpp/tensorrt_llm/batch_manager/dataTransceiver.h
**/*.{cpp,cxx,cc,cu}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

**/*.{cpp,cxx,cc,cu}: Avoid literal values except for 0, nullptr, true, false; use named constexpr for other literals
Place semicolon of empty for/while loop on a new line
Always use brace-delimited bodies for switch/while/do-for/if/else
Use inline C comments in argument lists when parameter meaning is unclear (e.g., /* checkForErrors = */ false)
Do not use assignment in subexpressions (e.g., if (x = y) ... is forbidden)
Switch on enums should enumerate all values and omit default to catch new values at compile time
Structure switch statements; prohibit fallthrough except between empty cases; each case ends with break or throw; return at end of case not allowed; put break inside braces for compound case
Prefer anonymous namespaces over static for internal linkage of functions
Every defined function must be called at least once (no unused methods)

Files:

  • cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp
  • cpp/tensorrt_llm/batch_manager/mlaCacheFormatter.cpp
  • cpp/tensorrt_llm/batch_manager/cacheFormatter.cpp
  • cpp/tensorrt_llm/batch_manager/dataTransceiver.cpp
🧠 Learnings (3)
📚 Learning: 2025-08-06T08:18:28.669Z
Learnt from: zhengd-nv
PR: NVIDIA/TensorRT-LLM#6633
File: cpp/tensorrt_llm/batch_manager/dataTransceiverImpl.cpp:145-155
Timestamp: 2025-08-06T08:18:28.669Z
Learning: In cpp/tensorrt_llm/batch_manager/dataTransceiverImpl.cpp, the existing `mMtxForMap` mutex in DataSenderImpl is sufficient to synchronize measurement file operations in the `release` method, as all file operations occur within the same critical section that protects the `mRequestToSession` map access.

Applied to files:

  • cpp/tensorrt_llm/batch_manager/dataTransceiver.cpp
  • cpp/tensorrt_llm/batch_manager/dataTransceiver.h
📚 Learning: 2025-08-12T10:28:57.287Z
Learnt from: CR
PR: NVIDIA/TensorRT-LLM#0
File: CODING_GUIDELINES.md:0-0
Timestamp: 2025-08-12T10:28:57.287Z
Learning: Applies to **/*.{h,hpp,hxx,hh,cuh} : Header files must use include guards named TRTLLM_<FILENAME>_H without underscores prefix/suffix (e.g., TRTLLM_FOO_BAR_HELLO_H)

Applied to files:

  • cpp/tensorrt_llm/batch_manager/cacheFormatter.h
📚 Learning: 2025-08-08T05:06:31.537Z
Learnt from: sklevtsov-nvidia
PR: NVIDIA/TensorRT-LLM#3294
File: cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/epilogue/fusion/sm90_visitor_scatter.hpp:36-36
Timestamp: 2025-08-08T05:06:31.537Z
Learning: CUTLASS extension files (under cpp/tensorrt_llm/cutlass_extensions/) follow CUTLASS coding style conventions, including using #pragma once instead of TRTLLM_ prefixed header guards, even though they are .hpp files.

Applied to files:

  • cpp/tensorrt_llm/batch_manager/cacheFormatter.h
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Pre-commit Check
🔇 Additional comments (12)
cpp/tensorrt_llm/batch_manager/cacheFormatter.h (1)

52-53: Constructor parameter name should follow coding guidelines.

Parameter names should use lowerCamelCase without underscores according to the coding guidelines.

Apply this fix:

-    KvCacheMeasureHelper(std::string output_path)
-        : mOutputPath(std::move(output_path))
+    KvCacheMeasureHelper(std::string outputPath)
+        : mOutputPath(std::move(outputPath))
cpp/tensorrt_llm/batch_manager/mlaCacheFormatter.cpp (1)

96-97: LGTM! Correctly integrates session-based block hash transfer.

The implementation properly retrieves block hashes from the session and passes them to the updated getBlockRangeForSending function, aligning with the new session-based transfer architecture.

cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp (4)

1011-1015: LGTM! Properly implements block hash lookup in reuse tree.

The implementation correctly delegates to the cached blocks root to find matching blocks by hash, supporting the partial reuse and copy-on-partial-reuse configurations.


1468-1474: LGTM! Clean implementation of block pinning across window managers.

The implementation correctly iterates through all window managers to pin blocks for the given sequence, properly propagating the pinning request throughout the hierarchy.


1476-1484: LGTM! Correctly increments reference counts for pinned blocks.

The implementation properly pins all allocated blocks for a sequence by incrementing their reference counts, which will prevent them from being evicted.


2117-2121: LGTM! Clean public API for block pinning.

The implementation provides a simple and clean interface for pinning blocks by request ID, properly retrieving the sequence and delegating to the block manager.

cpp/tensorrt_llm/batch_manager/cacheFormatter.cpp (2)

158-159: LGTM! Correctly integrates block hashes from session.

The implementation properly retrieves block hashes from the session and passes them to the updated getBlockRangeForSending function.


49-56: Selective transfer gating intentionally replaced
The DisableSelectiveCacheTransfer env-var check has been removed and gating is now driven by poolNum and isEnableBlockReuse(). We confirmed:

  • getEnvDisableSelectiveCacheTransfer() is only declared (no usages)
  • In getBlockRangeForSending:
    • If poolNum>1 or !isEnableBlockReuse(), all blocks are sent
    • Otherwise (single pool & reuse enabled), only the requested blockHashes are sent
  • The receiving side (getBlockRangeForReceiving) uses the same single-pool/reuse condition for symmetry

No changes are needed—the new logic is intentional and covers all scenarios.

cpp/tensorrt_llm/batch_manager/dataTransceiver.h (2)

1-16: Copyright header updated to 2025 - looks good!

The copyright header correctly reflects the 2025 date for this new implementation.


115-158: Well-structured PIMPL implementation for CacheSender.

The class follows the PIMPL idiom correctly with clear method signatures and proper const-correctness. The documentation is comprehensive.

cpp/tensorrt_llm/batch_manager/dataTransceiver.cpp (2)

240-277: Good error handling in sendAndRemoveResponse.

The method properly handles exceptions and ensures promises are fulfilled or rejected appropriately. The noexcept specification is correct since exceptions are caught internally.


539-555: Resource management looks solid.

The getReceiveCacheResource method properly manages per-process resources with appropriate synchronization and lazy initialization.

Comment on lines +569 to +581
void pinBlocks(GenerationRequest& sequence);


🛠️ Refactor suggestion

Add documentation for the new pinBlocks method.

The new pinBlocks method lacks documentation explaining its purpose, behavior, and when it should be called. This is particularly important as it appears to be part of the public API.

Add documentation before the method declaration:

+    //! \brief Pin blocks associated with a sequence to prevent eviction.
+    //! \param sequence The generation request whose blocks should be pinned.
+    //! \details This method marks blocks as pinned in the KV cache to ensure they
+    //!          remain available for reuse across multiple requests.
     void pinBlocks(GenerationRequest& sequence);
🤖 Prompt for AI Agents
In cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h around lines 569-570,
the public method declaration void pinBlocks(GenerationRequest& sequence); is
missing documentation; add a brief doxygen-style comment immediately above the
declaration describing purpose (what pinning does), expected behavior (which
blocks are pinned, lifetime and thread-safety expectations), when callers should
invoke it (e.g., before generation, after allocation), parameters (explain
GenerationRequest reference and ownership/constness), return/exception behavior
(whether it can fail, how failures are reported), and any side effects (impact
on memory and eviction). Ensure the comment is concise, uses @param and
@throws/@note as appropriate, and matches the style of surrounding method docs
in this header.

Comment on lines 1074 to 1083
[[nodiscard]] WindowBlockManager::BlockMapIterRange getBlocksByHash(size_t hash, SizeType32 windowSize) const
{
return mWindowBlockManagers.at(windowSize).getBlocksByHash(hash);
}

[[nodiscard]] BlockPtr const& getBlockFromReuseTreeByHash(size_t hash, SizeType32 windowSize) const
{
return mWindowBlockManagers.at(windowSize).getBlockFromReuseTreeByHash(hash);
}


⚠️ Potential issue

Add bounds checking and documentation for new hash-based lookup methods.

The new hash-based lookup methods getBlocksByHash and getBlockFromReuseTreeByHash access the window manager map without bounds checking and lack documentation.

+    //! \brief Get blocks matching the given hash for a specific window size.
+    //! \param hash The hash value to search for.
+    //! \param windowSize The window size to search within.
+    //! \return Iterator range over matching blocks.
     [[nodiscard]] WindowBlockManager::BlockMapIterRange getBlocksByHash(size_t hash, SizeType32 windowSize) const
     {
+        auto it = mWindowBlockManagers.find(windowSize);
+        TLLM_CHECK_WITH_INFO(it != mWindowBlockManagers.end(), 
+            "Window size %d not found in window block managers", windowSize);
-        return mWindowBlockManagers.at(windowSize).getBlocksByHash(hash);
+        return it->second.getBlocksByHash(hash);
     }

+    //! \brief Get a block from the reuse tree matching the given hash.
+    //! \param hash The hash value to search for.
+    //! \param windowSize The window size to search within.
+    //! \return Reference to the matching block pointer.
     [[nodiscard]] BlockPtr const& getBlockFromReuseTreeByHash(size_t hash, SizeType32 windowSize) const
     {
+        auto it = mWindowBlockManagers.find(windowSize);
+        TLLM_CHECK_WITH_INFO(it != mWindowBlockManagers.end(), 
+            "Window size %d not found in window block managers", windowSize);
-        return mWindowBlockManagers.at(windowSize).getBlockFromReuseTreeByHash(hash);
+        return it->second.getBlockFromReuseTreeByHash(hash);
     }
🤖 Prompt for AI Agents
In cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h around lines
1074-1083, the two new methods call mWindowBlockManagers.at(windowSize) without
validating that windowSize exists and they lack documentation; update both
methods to first check that mWindowBlockManagers contains windowSize and handle
the missing-key case gracefully (e.g., return an empty BlockMapIterRange for
getBlocksByHash and return a safe null-equivalent BlockPtr reference or throw a
clear std::out_of_range with contextual message for
getBlockFromReuseTreeByHash), and add Doxygen-style comments above each method
describing the purpose, parameters, return value, and exception behavior so
callers know what to expect.

Comment on lines +1218 to +1222
virtual void pinBlocks(LlmRequest::RequestIdType requestId) = 0;


🛠️ Refactor suggestion

Document the pure virtual pinBlocks method in the base class.

The pure virtual pinBlocks method in BaseKVCacheManager lacks documentation explaining the contract that derived classes must fulfill.

+    //! \brief Pin blocks associated with a request to prevent eviction.
+    //! \param requestId The ID of the request whose blocks should be pinned.
+    //! \details Implementations should mark all blocks associated with the request
+    //!          as pinned to ensure they remain available for potential reuse.
     virtual void pinBlocks(LlmRequest::RequestIdType requestId) = 0;
🤖 Prompt for AI Agents
In cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h around lines
1218-1219, the pure virtual method `pinBlocks(LlmRequest::RequestIdType
requestId)` in BaseKVCacheManager is undocumented; add a brief doxygen-style
comment above the declaration that states the method's contract: what pinning
means, when it will be called, expected behavior (e.g., mark blocks for the
given requestId as pinned to prevent eviction), whether it must be thread-safe,
return/exception expectations (void, should not throw), and any required side
effects or preconditions so derived classes implement consistent behavior.

Comment on lines +1607 to +1611
void pinBlocks(LlmRequest::RequestIdType requestId);


⚠️ Potential issue

Missing override specifier on the pinBlocks declaration.

The pinBlocks method is declared without the override keyword, which could lead to subtle bugs if the signature doesn't match the base class exactly.

-    void pinBlocks(LlmRequest::RequestIdType requestId);
+    void pinBlocks(LlmRequest::RequestIdType requestId) override;
🤖 Prompt for AI Agents
In cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h around lines
1607-1608, the method declaration for pinBlocks is missing the override
specifier; update the declaration to match and explicitly mark it as overriding
the base class (e.g. change to "void pinBlocks(LlmRequest::RequestIdType
requestId) override;"), and verify the signature (const/&, noexcept) exactly
matches the base class method so the override is valid.

// TODO(zhengd): pass the hashes directly instead of from llmRequest; use hash instead of block num
auto const& ids = blockRange.getBlockIds();
blockRange.setBlockIds({ids.end() - requestBlockNum, ids.end()});
blockRange.setBlockIds(blockHashes);

⚠️ Potential issue

Type mismatch: setBlockIds expects std::vector&lt;SizeType32&gt; but receives std::vector&lt;size_t&gt;.

The blockRange.setBlockIds() method expects std::vector<SizeType32> according to kvCacheUtils.h, but blockHashes is of type std::vector<size_t>. This could cause issues on platforms where size_t and SizeType32 have different sizes.

Apply this fix to ensure type safety:

     if (requestBlockNum < blockRange.size() && requestBlockNum > 0)
     {
-        blockRange.setBlockIds(blockHashes);
+        blockRange.setBlockIdsFromHashes(blockHashes);
     }

This uses the correct method that accepts std::vector<size_t> for hashes.

🤖 Prompt for AI Agents
In cpp/tensorrt_llm/batch_manager/cacheFormatter.cpp around line 55,
blockRange.setBlockIds(blockHashes) passes a std::vector<size_t> to a function
expecting std::vector<SizeType32>; replace the call with the overload that
accepts std::vector<size_t> (e.g., blockRange.setBlockHashes(blockHashes)) so
the types match and avoid unsafe casts or platform-dependent size mismatches.

Comment on lines +168 to +174
void release(LlmRequest::RequestIdType requestId)
{
auto it = mRequestToSession.find(requestId);
TLLM_CHECK(it != mRequestToSession.end());
std::unique_lock<std::mutex> lk(mMtxForMap);
mRequestToSession.erase(it);
}

⚠️ Potential issue

Lock acquired too late in release method.

The iterator is obtained before acquiring the lock, which could lead to iterator invalidation if another thread modifies the map.

 void release(LlmRequest::RequestIdType requestId)
 {
+    std::unique_lock<std::mutex> lk(mMtxForMap);
     auto it = mRequestToSession.find(requestId);
     TLLM_CHECK(it != mRequestToSession.end());
-    std::unique_lock<std::mutex> lk(mMtxForMap);
     mRequestToSession.erase(it);
 }
🤖 Prompt for AI Agents
In cpp/tensorrt_llm/batch_manager/dataTransceiver.cpp around lines 168-174, the
code calls mRequestToSession.find(requestId) before acquiring mMtxForMap, which
can allow the map to be modified and invalidate the iterator; move the lock
acquisition to before the find, perform the lookup while holding the mutex,
check the iterator under the lock (TLLM_CHECK or equivalent), then erase under
the same lock (use std::unique_lock<std::mutex> or std::lock_guard<std::mutex>
to hold mMtxForMap for the duration of find + check + erase).

Comment on lines +137 to +138
/// @return The request information.
[[nodiscard]] RequestInfo recvRequestInfo();

🛠️ Refactor suggestion

Inconsistent method signatures between CacheSender and CacheReceiver.

The sendRequestInfo and receiveSync methods appear in both classes but with different purposes, which could be confusing:

  • CacheSender has sendRequestInfo returning TransferSession and receiveSync taking TransferSession&
  • CacheReceiver has the same methods with similar signatures

This naming overlap could lead to confusion about which class is responsible for what operations.

Consider renaming to clarify the direction of data flow:

 class CacheSender
 {
     // ...
-    [[nodiscard]] TransferSession sendRequestInfo(LlmRequest const& llmRequest);
-    void receiveSync(TransferSession& session);
+    [[nodiscard]] TransferSession prepareSendSession(LlmRequest const& llmRequest);
+    void completeSendSession(TransferSession& session);
 };

 class CacheReceiver
 {
     // ...
-    TransferSession sendRequestInfo(LlmRequest const& llmRequest);
-    void receiveSync(TransferSession& session);
+    TransferSession prepareReceiveSession(LlmRequest const& llmRequest);
+    void completeReceiveSession(TransferSession& session);
 };

Also applies to: 173-175

@Tabrizian Tabrizian changed the title Refactor dataTransciever classes feat: Add support for KVCache transfer from KVCache reuse path Aug 12, 2025
@Tabrizian Tabrizian force-pushed the user/imant/useHashIds branch from 1a21dba to bd2757d Compare August 13, 2025 00:27
@Tabrizian Tabrizian marked this pull request as draft August 13, 2025 00:28

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 8

♻️ Duplicate comments (8)
cpp/tensorrt_llm/batch_manager/dataTransceiver.h (1)

144-151: Inconsistent method signatures between CacheSender and CacheReceiver.

cpp/tensorrt_llm/batch_manager/dataTransceiver.cpp (3)

161-166: Check thread safety of session map access.


168-174: Lock acquired too late in release method.


430-434: Address TODO: avoid frequent thread creation.

cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h (4)

580-581: Add documentation for the new pinBlocks method.


1221-1222: Document the pure virtual pinBlocks method in the base class.


1610-1611: Missing implementation declaration for pinBlocks override.


901-901: Document the pinBlocks method in BlockManager.

The pinBlocks method in BlockManager needs documentation.

Add documentation:

+    /// @brief Pin blocks associated with a generation request to prevent eviction.
+    /// @param sequence The generation request whose blocks should be pinned.
     void pinBlocks(GenerationRequest& sequence);
🧹 Nitpick comments (1)
cpp/tensorrt_llm/batch_manager/dataTransceiver.h (1)

45-56: Consider making the tag constants more maintainable.

The TransceiverTag struct with enum Id and the tag constants looks good. However, the hardcoded integer values (19, 22, 32) for tags might benefit from documentation or being derived from the enum values to avoid magic numbers.

Consider adding a comment explaining the choice of these specific tag values:

 struct TransceiverTag
 {
     enum class Id : uint64_t
     {
         REQUEST_SEND = 1,
         TERMINATION = 2
     };

+    // Tag values for MPI/communication - chosen to avoid conflicts with other subsystems
     static constexpr int32_t kID_TAG{19};
     static constexpr int32_t kINFO_SIZE_TAG{22};
     static constexpr int32_t kINFO_TAG{32};
 };
📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 1a21dba and bd2757d.

📒 Files selected for processing (13)
  • cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h (8 hunks)
  • cpp/include/tensorrt_llm/batch_manager/kvCacheUtils.h (1 hunks)
  • cpp/include/tensorrt_llm/batch_manager/llmRequest.h (0 hunks)
  • cpp/tensorrt_llm/batch_manager/cacheFormatter.cpp (3 hunks)
  • cpp/tensorrt_llm/batch_manager/cacheFormatter.h (2 hunks)
  • cpp/tensorrt_llm/batch_manager/dataTransceiver.cpp (12 hunks)
  • cpp/tensorrt_llm/batch_manager/dataTransceiver.h (3 hunks)
  • cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp (3 hunks)
  • cpp/tensorrt_llm/batch_manager/mlaCacheFormatter.cpp (2 hunks)
  • cpp/tensorrt_llm/common/envUtils.cpp (0 hunks)
  • cpp/tensorrt_llm/executor/cache_transmission/agent_utils/connection.cpp (2 hunks)
  • cpp/tensorrt_llm/pybind/batch_manager/kvCacheManager.cpp (1 hunks)
  • tensorrt_llm/_torch/pyexecutor/py_executor.py (3 hunks)
💤 Files with no reviewable changes (2)
  • cpp/tensorrt_llm/common/envUtils.cpp
  • cpp/include/tensorrt_llm/batch_manager/llmRequest.h
🚧 Files skipped from review as they are similar to previous changes (5)
  • cpp/include/tensorrt_llm/batch_manager/kvCacheUtils.h
  • cpp/tensorrt_llm/batch_manager/cacheFormatter.cpp
  • cpp/tensorrt_llm/executor/cache_transmission/agent_utils/connection.cpp
  • cpp/tensorrt_llm/batch_manager/mlaCacheFormatter.cpp
  • cpp/tensorrt_llm/batch_manager/cacheFormatter.h
🧰 Additional context used
📓 Path-based instructions (6)
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh}: In C++, close namespaces with a comment naming the namespace (e.g., } // namespace foo)
Prefer const/constexpr variables over #define for constants
Declare variables const if not modified after initialization
Use Allman brace style in C++
C++ filenames use lowerCamelCase and must be case-insensitively unique within a build target
C++ type names use UpperCamelCase
Local variables, methods, and namespaces use lowerCamelCase
Global non-static variables not in anonymous namespace use gPrefix lowerCamelCase (e.g., gExample)
Static globals or globals in anonymous namespaces use sPrefix lowerCamelCase
Locally visible static variables start with 's' (e.g., static std::once_flag sFlag;)
Member variables use mPrefix lowerCamelCase; public members may omit but are encouraged to use 'm'
Constants (enums, global/static/function-scope magic numbers) use kPREFIXED_UPPER_SNAKE (e.g., kDIGIT_NUM)
If macros are unavoidable, use UPPER_SNAKE_CASE (prefer constants over #define)
Constructor parameter that conflicts with a public member name gets trailing underscore (foo_)
Literal suffixes should be uppercase (e.g., 1234L not 1234l)
C++: use spaces only; indent 4 spaces
Run clang-format (LLVM style) before submitting; wrap lines at 120 characters
If formatting must be bypassed, use // clang-format off/on around the section
Prefer smart pointers; use unique_ptr for sole ownership, shared_ptr for shared; weak_ptr only in exceptional cases
Do not use deprecated pre-C++11 smart pointers
Use C++ style comments; avoid C comments except special inline cases; prefer // single-line
Capitalize and punctuate full-sentence comments
Follow Doxygen rules: use //! for comments and //!< for members in C++
Disable code with #if/#endif and mnemonic conditions; avoid commented-out code; avoid dead code
Do not throw exceptions across library boundaries
Use least-forceful casts; avoid removing const/volatile; avoid C-style and functional casts (except constructors); p...

Files:

  • cpp/tensorrt_llm/pybind/batch_manager/kvCacheManager.cpp
  • cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h
  • cpp/tensorrt_llm/batch_manager/dataTransceiver.cpp
  • cpp/tensorrt_llm/batch_manager/dataTransceiver.h
  • cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp
**/*.{cpp,cxx,cc,cu}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

**/*.{cpp,cxx,cc,cu}: Avoid literal values except for 0, nullptr, true, false; use named constexpr for other literals
Place semicolon of empty for/while loop on a new line
Always use brace-delimited bodies for switch/while/do-for/if/else
Use inline C comments in argument lists when parameter meaning is unclear (e.g., /* checkForErrors = */ false)
Do not use assignment in subexpressions (e.g., if (x = y) ... is forbidden)
Switch on enums should enumerate all values and omit default to catch new values at compile time
Structure switch statements; prohibit fallthrough except between empty cases; each case ends with break or throw; return at end of case not allowed; put break inside braces for compound case
Prefer anonymous namespaces over static for internal linkage of functions
Every defined function must be called at least once (no unused methods)

Files:

  • cpp/tensorrt_llm/pybind/batch_manager/kvCacheManager.cpp
  • cpp/tensorrt_llm/batch_manager/dataTransceiver.cpp
  • cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp
**/*.{h,hpp,hxx,hh,cuh,cpp,cxx,cc,cu}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

Parameter names must be consistent between declarations and definitions

Files:

  • cpp/tensorrt_llm/pybind/batch_manager/kvCacheManager.cpp
  • cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h
  • cpp/tensorrt_llm/batch_manager/dataTransceiver.cpp
  • cpp/tensorrt_llm/batch_manager/dataTransceiver.h
  • cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh,py}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

Prepend NVIDIA copyright header (current year) to all source files

Files:

  • cpp/tensorrt_llm/pybind/batch_manager/kvCacheManager.cpp
  • tensorrt_llm/_torch/pyexecutor/py_executor.py
  • cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h
  • cpp/tensorrt_llm/batch_manager/dataTransceiver.cpp
  • cpp/tensorrt_llm/batch_manager/dataTransceiver.h
  • cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp
**/*.py

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

**/*.py: Python code must target Python 3.8+
Python indentation: 4 spaces, no tabs
Maintain module namespace in imports (from package.subpackage import foo; then use foo.SomeClass())
Python file names use snake_case
Python class names use PascalCase
Python functions/methods and local variables use snake_case; variables starting with a number get k_ prefix (e.g., k_99th_percentile)
Global variables use G_ prefixed UPPER_SNAKE_CASE (e.g., G_MY_GLOBAL)
Constants use UPPER_SNAKE_CASE in Python
Avoid shadowing variables from outer scopes in Python
Initialize all externally visible members of a Python class in __init__
Prefer docstrings for interfaces used outside a file; comments for local code
Use Google-style docstrings for classes and functions (Sphinx-parsable)
Document attributes/variables inline with short docstrings
Avoid reflection when simple alternatives exist (e.g., prefer explicit parameters over dict(**locals()))
In try/except, catch the narrowest exceptions possible
For duck-typing with try/except, keep try body minimal and put logic in else

Files:

  • tensorrt_llm/_torch/pyexecutor/py_executor.py
**/*.{h,hpp,hxx,hh,cuh}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

Header files must use include guards named TRTLLM_<NAME>_H without leading/trailing underscores (e.g., TRTLLM_FOO_BAR_HELLO_H)

Files:

  • cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h
  • cpp/tensorrt_llm/batch_manager/dataTransceiver.h
🧠 Learnings (1)
📚 Learning: 2025-08-06T08:18:28.669Z
Learnt from: zhengd-nv
PR: NVIDIA/TensorRT-LLM#6633
File: cpp/tensorrt_llm/batch_manager/dataTransceiverImpl.cpp:145-155
Timestamp: 2025-08-06T08:18:28.669Z
Learning: In cpp/tensorrt_llm/batch_manager/dataTransceiverImpl.cpp, the existing `mMtxForMap` mutex in DataSenderImpl is sufficient to synchronize measurement file operations in the `release` method, as all file operations occur within the same critical section that protects the `mRequestToSession` map access.

Applied to files:

  • cpp/tensorrt_llm/batch_manager/dataTransceiver.cpp
  • cpp/tensorrt_llm/batch_manager/dataTransceiver.h
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Pre-commit Check
🔇 Additional comments (10)
cpp/tensorrt_llm/batch_manager/dataTransceiver.h (1)

42-44: LGTM! Type aliases are well-defined.

The new type aliases SizeType32, TransferSession, and BaseCacheFormatter are correctly defined and improve code readability.

cpp/tensorrt_llm/pybind/batch_manager/kvCacheManager.cpp (1)

348-348: LGTM! Python binding for pinBlocks is correctly added.

The binding for pinBlocks is properly placed after scheduling_remove_sequence and correctly exposes the C++ method to Python as pin_blocks.

tensorrt_llm/_torch/pyexecutor/py_executor.py (2)

183-183: LGTM! Block reuse flag properly initialized.

The block_reuse_enabled attribute is correctly initialized from the model engine's KV cache configuration.


1641-1645: LGTM! Proper handling of block pinning based on reuse flag.

The conditional logic correctly pins blocks when block_reuse_enabled is True and maintains backward compatibility when it's False.

cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h (2)

118-119: LGTM! Hash member added to BlockKey.

The addition of the hash member to BlockKey is a good optimization to avoid recomputing hashes.


174-178: Good optimization for hash caching.

The modification to return the cached hash when available is an efficient optimization.

cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp (4)

1-16: LGTM: Properly formatted copyright header.

The NVIDIA copyright header is correctly formatted with the current year (2025) and follows the Apache 2.0 license template as required by the coding guidelines.


1485-1491: LGTM: Well-structured pin blocks delegation.

The function properly delegates pinning operations to all window block managers following the established pattern used throughout the codebase.


1493-1501: LGTM: Correct implementation of block pinning.

The function correctly increases the reference count for all allocated blocks belonging to the sequence, which prevents them from being evicted. This aligns with the intended pinning mechanism.


2134-2138: LGTM: Clean API for KV cache level block pinning.

The function provides a clean interface for pinning blocks at the KVCacheManager level by resolving the sequence from requestId and delegating to the BlockManager. This follows good encapsulation principles.

Comment on lines +132 to +135
explicit BlockKey(size_t hash)
: hash{hash}
{
}

🛠️ Refactor suggestion

Document the purpose of the hash-only constructor.

The constructor that takes only a hash value needs documentation explaining when it should be used.

Add documentation:

+    /// @brief Construct a BlockKey with a precomputed hash.
+    /// @param hash The precomputed hash value.
+    /// @details Used when retrieving blocks by hash from the reuse tree.
     explicit BlockKey(size_t hash)
         : hash{hash}
     {
     }
🤖 Prompt for AI Agents
In cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h around lines 132 to
135, add a documentation comment for the explicit BlockKey(size_t hash)
constructor that explains it constructs a key from a precomputed hash (used for
lookups/comparisons where only the hash is available), clarifies it does not
populate other key fields, notes any invariants or expected source of the hash
(e.g., result of BlockKey::hash() or external hashing function), and documents
intended usage scenarios (fast-lookup, temporary key for map/set operations) and
ownership/lifetime implications.

@@ -737,6 +750,8 @@
return 0;
}

[[nodiscard]] std::optional<KVCacheBlock> findBlocksInReuseTreeByHashes(std::vector<size_t> const& hashes) const;

🛠️ Refactor suggestion

Document the new hash-based lookup method.

The findBlocksInReuseTreeByHashes method needs documentation explaining its purpose and return value.

Add documentation:

+    /// @brief Find blocks in the reuse tree matching the given hashes.
+    /// @param hashes Vector of hash values to search for.
+    /// @return Optional KVCacheBlock if matching blocks are found.
     [[nodiscard]] std::optional<KVCacheBlock> findBlocksInReuseTreeByHashes(std::vector<size_t> const& hashes) const;
🤖 Prompt for AI Agents
In cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h around line 753, the
new method declaration findBlocksInReuseTreeByHashes lacks documentation; add a
Doxygen-style comment above it that briefly describes the method's purpose
(lookup KVCacheBlock(s) in the reuse tree by a vector of hash values), explains
the parameter (hashes: vector of size_t hash keys to match), describes the
return value (std::optional<KVCacheBlock> — present when a matching block is
found, std::nullopt when none match), and any thread-safety or performance notes
(e.g., const method, does not modify state, complexity if relevant). Ensure the
comment is concise and follows existing file doc style.

Comment on lines +41 to +45
static int32_t tagFromRequestId(LlmRequest::RequestIdType requestId)
{
constexpr int32_t kDATA_TAG{43};
return ((requestId & 0xFFF) << 8) | (kDATA_TAG & 0xFF);
}

🛠️ Refactor suggestion

Document the magic number in tagFromRequestId.

The function uses a magic number 0xFFF for masking and 43 as a base tag. These should be documented or made configurable.

Add documentation and consider using named constants:

+// Generate a unique tag for MPI communication based on request ID
+// Uses lower 12 bits of request ID to avoid tag overflow
+static constexpr int32_t kRequestIdMask{0xFFF};
+static constexpr int32_t kDataTagBase{43};
+
 static int32_t tagFromRequestId(LlmRequest::RequestIdType requestId)
 {
-    constexpr int32_t kDATA_TAG{43};
-    return ((requestId & 0xFFF) << 8) | (kDATA_TAG & 0xFF);
+    return ((requestId & kRequestIdMask) << 8) | (kDataTagBase & 0xFF);
 }
🤖 Prompt for AI Agents
In cpp/tensorrt_llm/batch_manager/dataTransceiver.cpp around lines 41 to 45, the
bitmask 0xFFF and the numeric tag 43 are used without explanation; replace these
magic numbers with clearly named constexpr constants (e.g., kRequestIdMask =
0xFFF, kRequestIdShift = 8, kDataTag = 43 or an enum for tag types), add a short
comment documenting the tag layout (how many bits are reserved for requestId,
where the tag sits, why 0xFFF is chosen), and consider making kDataTag
configurable at a higher scope if different tags may be needed; then update the
function to use those named constants so the intent is explicit and
maintainable.

Comment on lines +136 to +138
/// @brief Receive the request information.
/// @return The request information.
[[nodiscard]] RequestInfo recvRequestInfo();
Contributor

🛠️ Refactor suggestion

Document the purpose of new methods in CacheSender.

The recvRequestInfo method lacks documentation explaining its purpose and when it should be called.

Add documentation:

+    /// @brief Receive request information from a peer.
+    /// @return The request information received.
     [[nodiscard]] RequestInfo recvRequestInfo();

Committable suggestion skipped: line range outside the PR's diff.

🤖 Prompt for AI Agents
In cpp/tensorrt_llm/batch_manager/dataTransceiver.h around lines 136-138, the
recvRequestInfo() declaration lacks a descriptive comment; add a doc comment
that states this method receives and returns the next RequestInfo from the
transceiver, describes the expected caller (e.g., worker or consumer thread),
when it should be called (blocking vs non-blocking behavior, or only after a
prior handshake/enqueue), what the return value represents and ownership
semantics, and any thread-safety or error conditions (e.g., may throw or return
empty/default if closed). Ensure the comment is concise and placed immediately
above the method declaration.

Comment on lines +140 to +142
/// @brief Synchronously send data.
/// @param llmRequest The request object to which the data belongs.
void sendSync(LlmRequest const& llmRequest);
Contributor

🛠️ Refactor suggestion

Document the synchronous send method.

The sendSync method lacks documentation explaining its blocking behavior and prerequisites.

Add documentation:

+    /// @brief Synchronously send data for the given request.
+    /// @param llmRequest The request object containing the data to send.
+    /// @details This method blocks until the data is sent. The request's data must be ready when called.
     void sendSync(LlmRequest const& llmRequest);

Committable suggestion skipped: line range outside the PR's diff.

🤖 Prompt for AI Agents
In cpp/tensorrt_llm/batch_manager/dataTransceiver.h around lines 140-142, the
sendSync declaration lacks a detailed doc comment describing that it is a
blocking, synchronous send; update the Doxygen comment to state that sendSync
blocks the caller until the data for the provided LlmRequest is fully
transmitted (or an acknowledgement/confirmation is received), list preconditions
(e.g., required buffer/state, not to be called from UI/main thread), specify
thread-safety (whether it is thread-safe or must be serialized by caller),
document failure modes and how they are reported (exceptions thrown or error
return semantics) and any timeout or cancellation behavior; keep the comment
concise and actionable so callers know when to use sendSync vs async send.

Comment on lines +1011 to +1032
std::optional<KVCacheBlock> WindowBlockManager::findBlocksInReuseTreeByHashes(std::vector<size_t> const& hashes) const
{
    std::vector<BlockKey> blockKeys;
    for (auto const& hash : hashes)
    {
        blockKeys.emplace_back(hash);
    }

    auto searchRoot = mCachedBlocksRoot;
    for (auto const& blockKey : blockKeys)
    {
        auto [partialMatch, numMatched, matchingBlock] = searchRoot != nullptr
            ? searchRoot->findMatchingBlock(blockKey, false, false)
            : std::make_tuple(false, 0, nullptr);
        if (matchingBlock == nullptr)
        {
            return nullptr;
        }
        searchRoot = std::move(matchingBlock);
    }
    return searchRoot;
}
Contributor

🛠️ Refactor suggestion

Add input validation and improve error handling.

The function doesn't validate the input hashes parameter and could return unexpected results if an empty vector is passed.

Apply this diff to add input validation:

 std::optional<KVCacheBlock> WindowBlockManager::findBlocksInReuseTreeByHashes(std::vector<size_t> const& hashes) const
 {
+    if (hashes.empty())
+    {
+        return std::nullopt;
+    }
+    
     std::vector<BlockKey> blockKeys;
     for (auto const& hash : hashes)
     {
         blockKeys.emplace_back(hash);
     }

     auto searchRoot = mCachedBlocksRoot;
     for (auto const& blockKey : blockKeys)
     {
         auto [partialMatch, numMatched, matchingBlock] = searchRoot != nullptr
             ? searchRoot->findMatchingBlock(blockKey, false, false)
             : std::make_tuple(false, 0, nullptr);
         if (matchingBlock == nullptr)
         {
-            return nullptr;
+            return std::nullopt;
         }
         searchRoot = std::move(matchingBlock);
     }
     return searchRoot;
 }
🤖 Prompt for AI Agents
In cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp around lines 1011 to 1032,
validate the input and improve null handling: if hashes.empty() immediately
return std::nullopt; if mCachedBlocksRoot is nullptr also return std::nullopt
before looping; inside the loop keep the existing search logic but when
matchingBlock is nullptr return std::nullopt (not nullptr); at the end, return a
populated std::optional<KVCacheBlock> constructed from the found block (or
std::nullopt if no block), ensuring you convert pointer results into the
optional return type consistently.

💡 Verification agent

🧩 Analysis chain

Verify integration with new BlockKey hash constructor.

The function relies on BlockKey(hash) constructor to initialize keys from precomputed hashes. Ensure this constructor is properly implemented and tested.


🏁 Script executed:

#!/bin/bash
# Search for BlockKey constructor that takes a hash parameter
ast-grep --pattern $'class BlockKey {
  $$$
  BlockKey($_, size_t $_) {
    $$$
  }
  $$$
}'

# Also search for any BlockKey constructor with single parameter
ast-grep --pattern $'BlockKey(size_t $_)'

Length of output: 136


🏁 Script executed:

#!/bin/bash
# Search for BlockKey definition and any constructors across the repo
rg "class BlockKey" -n -C3
rg "struct BlockKey" -n -C3

# Look for constructor implementations and declarations
rg "BlockKey::BlockKey" -n -C3
rg "explicit BlockKey" -n -C3
rg "BlockKey(size_t" -n -C3

Length of output: 3613


🏁 Script executed:

#!/bin/bash
# Display BlockKeyTest to verify coverage of the size_t constructor
sed -n '1,200p' cpp/tests/batch_manager/blockKeyTest.cpp

Length of output: 674


Ensure unit tests cover the hash‐based BlockKey constructor

The explicit BlockKey(size_t) ctor in kvCacheManager.h (lines 132–136) is used by findBlocksInReuseTreeByHashes but isn’t exercised by any existing test. Please add or extend unit tests to cover:

  • Direct construction via BlockKey(hash)
  • Its interaction with findMatchingBlock/findBlocksInReuseTreeByHashes

This will validate that precomputed hashes are handled correctly.

🤖 Prompt for AI Agents
In cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp around lines 1011 to 1032,
add unit tests that exercise the explicit BlockKey(size_t) constructor and its
use in findBlocksInReuseTreeByHashes: construct BlockKey objects directly from
precomputed hashes (BlockKey(hash)), insert/create cached blocks in the reuse
tree as needed, call findBlocksInReuseTreeByHashes with vectors of those hashes,
and assert the function returns the expected KVCacheBlock (positive case) and
returns null/empty when hashes don’t match (negative case); include a test that
verifies findMatchingBlock works equivalently when using the size_t ctor versus
the standard ctor to ensure hash-based lookup behaves correctly.

Comment on lines +961 to +965
if not self.block_reuse_enabled:
ctx_transmission_reqs = self._send_disagg_ctx_cache(
scheduled_batch.context_requests
) if self.kv_cache_transceiver else []

Contributor

⚠️ Potential issue

Review the duplicate disagg context cache sending logic.

The code appears to have duplicate logic for sending disagg context cache - once when block_reuse_enabled is False and again unconditionally. This could lead to sending the cache twice.

The logic should be consolidated:

-    if not self.block_reuse_enabled:
-        ctx_transmission_reqs = self._send_disagg_ctx_cache(
-            scheduled_batch.context_requests
-        ) if self.kv_cache_transceiver else []
-
     if self.kv_cache_transceiver:
         ctx_transmission_reqs = self._send_disagg_ctx_cache(
             scheduled_batch.context_requests)

Committable suggestion skipped: line range outside the PR's diff.

🤖 Prompt for AI Agents
In tensorrt_llm/_torch/pyexecutor/py_executor.py around lines 961 to 965, there
is duplicate logic that can send the disagg context cache twice: first inside
the if not self.block_reuse_enabled branch and then again unconditionally;
consolidate this by sending the cache exactly once—either keep the send inside
the block_reuse_enabled=False branch and remove the unconditional send, or make
the unconditional send conditional on block_reuse_enabled being False; ensure
the send uses the same kv_cache_transceiver guard (i.e., only call
self._send_disagg_ctx_cache(...) when self.kv_cache_transceiver is truthy) and
remove the redundant call so the cache cannot be sent twice.

@@ -80,6 +80,19 @@ class BlockRange
mBlockIds = std::move(blockIds);
}

void setBlockIdsFromHashes(std::vector<size_t> blockHashes)
Collaborator

const std::vector<size_t>& is better here; it avoids copying the hash vector on every call.

@@ -39,23 +39,20 @@
namespace tensorrt_llm::batch_manager::kv_cache_manager
{

-BlockRange getBlockRangeForSending(BaseKVCacheManager* cacheManager, LlmRequest const& llmRequest)
+BlockRange getBlockRangeForSending(
+    BaseKVCacheManager* cacheManager, LlmRequest const& llmRequest, std::vector<size_t> const& blockHashes)
Collaborator

Is it possible to remove the llmRequest parameter in the future?
