
feat: Add support for KVCache transfer from KVCache reuse path #6348


Draft · wants to merge 3 commits into main from user/imant/useHashIds

Conversation

@Tabrizian Tabrizian (Member) commented Jul 25, 2025

Summary by CodeRabbit

  • Refactor

    • Redesigned cache transfer/session handling for robust multi-connection and agent-based transmission; simplified communication state and removed legacy transceiver components.
  • New Features

    • Per-request KV-cache pinning to prevent eviction during processing, exposed to Python.
    • Block-hash-driven selective block transfer and optional per-transfer metrics collection/export.
    • Executor toggle to change cache-transmission behavior based on block-reuse setting.
  • Tests

    • Updated tests to match the new transfer/session architecture.

Description

Test Coverage

GitHub Bot Help

/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...

Provides a user-friendly way for developers to interact with a Jenkins server.

Run /bot [-h|--help] to print this help message.

See details below for each supported subcommand.

run [--reuse-test (optional)pipeline-id --disable-fail-fast --skip-test --stage-list "A10-PyTorch-1, xxx" --gpu-type "A30, H100_PCIe" --test-backend "pytorch, cpp" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" --detailed-log --debug(experimental)]

Launch build/test pipelines. All previously running jobs will be killed.

--reuse-test (optional)pipeline-id (OPTIONAL) : Allow the new pipeline to reuse build artifacts and skip successful test stages from a specified pipeline, or from the last pipeline if no pipeline-id is given. If the Git commit ID has changed, this option will always be ignored. The DEFAULT behavior of the bot is to reuse build artifacts and successful test results from the last pipeline.

--disable-reuse-test (OPTIONAL) : Explicitly prevent the pipeline from reusing build artifacts and skipping successful test stages from a previous pipeline. Ensures that all builds and tests run regardless of previous successes.

--disable-fail-fast (OPTIONAL) : Disable fail fast on build/tests/infra failures.

--skip-test (OPTIONAL) : Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.

--stage-list "A10-PyTorch-1, xxx" (OPTIONAL) : Only run the specified test stages. Examples: "A10-PyTorch-1, xxx". Note: Does NOT update GitHub check status.

--gpu-type "A30, H100_PCIe" (OPTIONAL) : Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.

--test-backend "pytorch, cpp" (OPTIONAL) : Skip test stages which don't match the specified backends. Only support [pytorch, cpp, tensorrt, triton]. Examples: "pytorch, cpp" (does not run test stages with tensorrt or triton backend). Note: Does NOT update GitHub pipeline status.

--only-multi-gpu-test (OPTIONAL) : Only run the multi-GPU tests. Note: Does NOT update GitHub check status.

--disable-multi-gpu-test (OPTIONAL) : Disable the multi-GPU tests. Note: Does NOT update GitHub check status.

--add-multi-gpu-test (OPTIONAL) : Force run the multi-GPU tests in addition to running L0 pre-merge pipeline.

--post-merge (OPTIONAL) : Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.

--extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" (OPTIONAL) : Run the ordinary L0 pre-merge pipeline and specified test stages. Examples: --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx".

--detailed-log (OPTIONAL) : Enable flushing out all logs to the Jenkins console. This will significantly increase the log volume and may slow down the job.

--debug (OPTIONAL) : Experimental feature. Enable access to the CI container for debugging purpose. Note: Specify exactly one stage in the stage-list parameter to access the appropriate container environment. Note: Does NOT update GitHub check status.

For guidance on mapping tests to stage names, see docs/source/reference/ci-overview.md
and the scripts/test_to_stage_mapping.py helper.

kill

kill

Kill all running builds associated with the pull request.

skip

skip --comment COMMENT

Skip testing for the latest commit on the pull request. --comment "Reason for skipping build/test" is required. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.

reuse-pipeline

reuse-pipeline

Reuse a previous pipeline to validate current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.

coderabbitai bot (Contributor) commented Jul 25, 2025

📝 Walkthrough

Replaces DataResponder/DataRequester/DataSender/DataReceiver with CacheSender/CacheReceiver, introduces TransferSession and session-based cache-formatting/transfer APIs, adds block-hash driven send/receive paths and pinning APIs, removes legacy impl files, updates agent/connection naming, and adjusts tests and build lists accordingly.

Changes

| Cohort | File(s) | Change Summary |
|---|---|---|
| Transceiver header | cpp/include/tensorrt_llm/batch_manager/cacheTransceiver.h | Forward declarations renamed (DataResponder → CacheSender, DataRequester → CacheReceiver); member pointers and future vector renamed (mDataResponder → mCacheSender, mDataRequester → mCacheReceiver, mResponderFutures → mSenderFutures). |
| Transceiver impl | cpp/tensorrt_llm/batch_manager/cacheTransceiver.cpp | Uses CacheSender/CacheReceiver instead of DataResponder/DataRequester; send calls use sendAsync() and store futures in mSenderFutures; receive uses receiveAsync(); removed old impl include. |
| Core redesign (wrappers & impl) | cpp/tensorrt_llm/batch_manager/dataTransceiver.cpp, cpp/tensorrt_llm/batch_manager/dataTransceiver.h | Introduced CacheSender and CacheReceiver wrappers with PIMPLs and new APIs (sendAsync/receiveAsync, get/set comm state, request-info/session helpers); added aliases (SizeType32, TransferSession, BaseCacheFormatter) and TransceiverTag; major multi-connection, TransferSession and promise-based send/receive logic added. |
| Removed legacy impls & build | cpp/tensorrt_llm/batch_manager/dataTransceiverImpl.cpp, .../dataTransceiverImpl.h, cpp/tensorrt_llm/batch_manager/CMakeLists.txt | Deleted legacy DataSenderImpl/DataReceiverImpl and their header; removed dataTransceiverImpl.cpp from SRCS in CMakeLists. |
| Cache formatter & session API | cpp/include/tensorrt_llm/batch_manager/cacheFormatter.h, .../cacheFormatter.cpp, .../mlaCacheFormatter.cpp | Added TransferSession, KvCacheMeasureHelper, BaseCacheFormatter interface; CacheFormatter/MLACacheFormatter rewritten to use TransferSession; getBlockRangeForSending now accepts block hashes and the sending logic uses the provided block-hash list. |
| KV-cache manager & utils | cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h, .../kvCacheManager.cpp, .../kvCacheUtils.h | Added block-pinning APIs (pinBlocks) across managers, reuse-tree lookup by hashes (findBlocksInReuseTreeByHashes), BlockKey.hash field and constructor, and BlockRange::setBlockIdsFromHashes. |
| Request API change | cpp/include/tensorrt_llm/batch_manager/llmRequest.h | Removed public accessors for requested block hashes (setter and getter removed). |
| Agent connection rename | cpp/tensorrt_llm/executor/cache_transmission/agent_utils/connection.h, .../connection.cpp | Renamed parameter/member mReceiverBufferDesc → mCacheReceiverBufferDesc in AgentConnection sender state and setter. |
| Includes & tests | cpp/tensorrt_llm/executor/cache_transmission/ucx_utils/connection.cpp, cpp/tests/batch_manager/cacheTransceiverTest.cpp, cpp/tests/unit_tests/executor/ucxCommTest.cpp | Updated includes to dataTransceiver.h; tests and mocks renamed from Data* to Cache*; removed redundant include lines. |
| Env utility removal | cpp/tensorrt_llm/common/envUtils.cpp | Removed getEnvDisableSelectiveCacheTransfer() (environment-based selective-transfer flag removed). |
| Python binding & executor behavior | cpp/tensorrt_llm/pybind/batch_manager/kvCacheManager.cpp, tensorrt_llm/_torch/pyexecutor/py_executor.py | Exposed BaseKVCacheManager::pinBlocks as pin_blocks in Python; added block_reuse_enabled flag to PyExecutor and conditional flow to pin/terminate or keep ctx-in-transmission depending on the flag. |

Sequence Diagram(s)

sequenceDiagram
    participant Request as LlmRequest
    participant Sender as CacheSender
    participant ConnMgr as ConnectionManager
    participant Formatter as BaseCacheFormatter
    participant Receiver as CacheReceiver

    Request->>Sender: sendAsync(request)
    Sender->>ConnMgr: build TransferSession (connections, states, blockHashes)
    Sender->>Formatter: format(TransferSession)
    Sender->>ConnMgr: transmit data (async)
    ConnMgr->>Receiver: deliver request info / data
    Receiver->>Formatter: unformat(TransferSession)
    Receiver->>Request: complete receive (future)

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~40 minutes

Possibly related PRs

Suggested labels

Community want to contribute, Generic Runtime

Suggested reviewers

  • pcastonguay
  • chuangz0
  • DomBrown
  • Funatiq
  • Shixiaowei02

@coderabbitai coderabbitai bot (Contributor) left a comment


Actionable comments posted: 3

🔭 Outside diff range comments (4)
cpp/tensorrt_llm/executor/cache_transmission/agent_utils/connection.h (1)

18-18: Use preprocessor guard instead of pragma once.

According to coding guidelines, header files should use a preprocessor guard with prefix TRTLLM_ followed by the filename in caps.

Replace #pragma once with a proper header guard:

-#pragma once
+#ifndef TRTLLM_CONNECTION_H
+#define TRTLLM_CONNECTION_H

And add at the end of the file:

+#endif // TRTLLM_CONNECTION_H
cpp/include/tensorrt_llm/batch_manager/cacheTransceiver.h (2)

17-17: Use preprocessor guard instead of pragma once.

According to coding guidelines, header files should use a preprocessor guard with prefix TRTLLM_ followed by the filename in caps.

Replace #pragma once with a proper header guard:

-#pragma once
+#ifndef TRTLLM_CACHETRANSCEIVER_H
+#define TRTLLM_CACHETRANSCEIVER_H

And add at the end of the file:

+#endif // TRTLLM_CACHETRANSCEIVER_H

2-2: Update copyright year to include 2025.

According to coding guidelines, all TensorRT-LLM code should contain an NVIDIA copyright header that includes the current year (2025).

- * Copyright (c) 2023-2024, NVIDIA CORPORATION.  All rights reserved.
+ * Copyright (c) 2023-2025, NVIDIA CORPORATION.  All rights reserved.
cpp/tests/batch_manager/cacheTransceiverTest.cpp (1)

655-658: Remove unused variable assignment.

The variable cacheType is assigned but never used. This appears to be dead code.

-        if (kvFactor == 1)
-        {
-            auto cacheType = CacheType::kSELFKONLY;
-        }
🧹 Nitpick comments (2)
examples/disaggregated/clients/disagg_client.py (1)

81-126: Consider adding timing measurements to send_chat_request for consistency.

The send_chat_request function lacks the same TTFT and ITL timing measurements that were added to send_request. This creates inconsistent behavior between the two endpoints.

Consider adding similar timing measurements to send_chat_request or extracting the timing logic into a shared helper function to maintain consistency across both endpoints.

examples/disaggregated/clients/prompts.json (1)

1-3: Consider maintaining diverse test prompts for comprehensive testing.

While simplifying to a single prompt reduces complexity, having diverse prompts helps test various scenarios (different lengths, complexity, languages, etc.). Consider whether this reduction impacts test coverage adequately.

📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 0f2f11f and eed674f.

📒 Files selected for processing (13)
  • cpp/include/tensorrt_llm/batch_manager/cacheTransceiver.h (2 hunks)
  • cpp/tensorrt_llm/batch_manager/CMakeLists.txt (0 hunks)
  • cpp/tensorrt_llm/batch_manager/cacheFormatter.h (3 hunks)
  • cpp/tensorrt_llm/batch_manager/cacheTransceiver.cpp (7 hunks)
  • cpp/tensorrt_llm/batch_manager/dataTransceiver.cpp (15 hunks)
  • cpp/tensorrt_llm/batch_manager/dataTransceiver.h (3 hunks)
  • cpp/tensorrt_llm/batch_manager/dataTransceiverImpl.cpp (0 hunks)
  • cpp/tensorrt_llm/batch_manager/dataTransceiverImpl.h (0 hunks)
  • cpp/tensorrt_llm/executor/cache_transmission/agent_utils/connection.cpp (2 hunks)
  • cpp/tensorrt_llm/executor/cache_transmission/agent_utils/connection.h (2 hunks)
  • cpp/tests/batch_manager/cacheTransceiverTest.cpp (11 hunks)
  • examples/disaggregated/clients/disagg_client.py (3 hunks)
  • examples/disaggregated/clients/prompts.json (1 hunks)
💤 Files with no reviewable changes (3)
  • cpp/tensorrt_llm/batch_manager/CMakeLists.txt
  • cpp/tensorrt_llm/batch_manager/dataTransceiverImpl.h
  • cpp/tensorrt_llm/batch_manager/dataTransceiverImpl.cpp
🧰 Additional context used
📓 Path-based instructions (4)
**/*.{cpp,h,hpp,cc,cxx}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

**/*.{cpp,h,hpp,cc,cxx}: Closing braces of namespaces should have a comment saying the namespace it closes (e.g., } // namespace foo)
Prefer const or constexpr variables over #defines whenever possible, as the latter are not visible to the compiler.
A variable that is not modified after its initialization should be declared as const.
Except 0, nullptr, true, false, all other literals should only be used for variable initialization.
Use the Allman indentation style for braces in C++ code.
Put the semicolon for an empty for or while loop in a new line.
The statement forming the body of a switch, while, do .. while or for statement shall be a compound statement (use brace-delimited statements).
If and else should always be followed by brace-delimited statements, even if empty or a single statement.
C++ filenames should use camel case with first letter lowercase (e.g., thisIsAFilename.cpp), and must be case-insensitive unique within a compilation target.
All types (including class names) should use camel case with uppercase first letter (e.g., FooBarClass).
Local variables, methods, and namespaces should use camel case with first letter lowercase (e.g., localFooBar).
Non-magic-number global variables that are non-static and not defined in anonymous namespace should use camel case prefixed by a lower case 'g' (e.g., gDontUseGlobalFoos).
Non-magic-number global variables that are static or defined in an anonymous namespace should use camel case prefixed by a lower case 's' (e.g., sMutableStaticGlobal).
Locally visible static variable should use camel case with lowercase prefix 's' as the first letter of the name (e.g., static std::once_flag sFlag;).
Class member variables should use camel case prefixed with an 'm' (e.g., mNbFooValues). Public member variables do not require the 'm' prefix but it is encouraged for clarity.
Enumerations, global constants, static constants at class-scope, and function-scope magic-number/literal constants are uppercase snakecase with prefix...

Files:

  • cpp/tensorrt_llm/executor/cache_transmission/agent_utils/connection.cpp
  • cpp/tensorrt_llm/executor/cache_transmission/agent_utils/connection.h
  • cpp/tensorrt_llm/batch_manager/cacheTransceiver.cpp
  • cpp/include/tensorrt_llm/batch_manager/cacheTransceiver.h
  • cpp/tensorrt_llm/batch_manager/cacheFormatter.h
  • cpp/tensorrt_llm/batch_manager/dataTransceiver.h
  • cpp/tests/batch_manager/cacheTransceiverTest.cpp
  • cpp/tensorrt_llm/batch_manager/dataTransceiver.cpp
**/*.{cpp,h,hpp,cc,cxx,cu,py}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

All TensorRT-LLM Open Source Software code should contain an NVIDIA copyright header that includes the current year. This includes .cpp, .h, .cu, .py, and any other source files which are compiled or interpreted.

Files:

  • cpp/tensorrt_llm/executor/cache_transmission/agent_utils/connection.cpp
  • examples/disaggregated/clients/disagg_client.py
  • cpp/tensorrt_llm/executor/cache_transmission/agent_utils/connection.h
  • cpp/tensorrt_llm/batch_manager/cacheTransceiver.cpp
  • cpp/include/tensorrt_llm/batch_manager/cacheTransceiver.h
  • cpp/tensorrt_llm/batch_manager/cacheFormatter.h
  • cpp/tensorrt_llm/batch_manager/dataTransceiver.h
  • cpp/tests/batch_manager/cacheTransceiverTest.cpp
  • cpp/tensorrt_llm/batch_manager/dataTransceiver.cpp
**/*.py

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

**/*.py: Python code should conform to Python 3.8+.
Indent Python code with 4 spaces. Do not use tabs.
Always maintain the namespace when importing in Python, even if only one class or function from a module is used.
Python filenames should use snake_case (e.g., some_file.py).
Python classes should use PascalCase (e.g., class SomeClass).
Python functions and methods should use snake_case (e.g., def my_awesome_function():).
Python local variables should use snake_case. Prefix k for variable names that start with a number (e.g., k_99th_percentile).
Python global variables should use upper snake_case and prefix G (e.g., G_MY_GLOBAL).
Python constants should use upper snake_case (e.g., MY_CONSTANT).
Avoid shadowing variables declared in an outer scope in Python.
Initialize all externally visible members of a Python class in the constructor.
For interfaces that may be used outside a Python file, prefer docstrings over comments.
Comments in Python should be reserved for code within a function, or interfaces that are local to a file.
Use Google style docstrings for Python classes and functions, which can be parsed by Sphinx.
Attributes and variables in Python can be documented inline; attribute docstrings will be rendered under the docstring for the class.
Avoid using reflection in Python when functionality can be easily achieved without it.
When using try-except blocks in Python, limit the except to the smallest set of errors possible.
When using try-except blocks to handle multiple possible variable types in Python, keep the body of the try as small as possible, using the else block to implement the logic.

Files:

  • examples/disaggregated/clients/disagg_client.py
**/*.{h,hpp}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

Use a preprocessor guard in header files. The guard name must have prefix TRTLLM_ followed by the filename, all in caps, and no trailing underscore.

Files:

  • cpp/tensorrt_llm/executor/cache_transmission/agent_utils/connection.h
  • cpp/include/tensorrt_llm/batch_manager/cacheTransceiver.h
  • cpp/tensorrt_llm/batch_manager/cacheFormatter.h
  • cpp/tensorrt_llm/batch_manager/dataTransceiver.h
🧠 Learnings (3)
cpp/tensorrt_llm/batch_manager/cacheFormatter.h (1)

Learnt from: amitz-nv
PR: #5616
File: tensorrt_llm/executor/worker.py:375-384
Timestamp: 2025-07-17T09:01:27.402Z
Learning: In tensorrt_llm/executor/worker.py, the LoRA adapter cache optimization logic that checks is_adapter_in_cpu_cache() and conditionally passes None for weights/config has a known race condition issue that cannot be solved with simple error handling or verification checks. This is a known limitation that requires a more comprehensive solution.

cpp/tensorrt_llm/batch_manager/dataTransceiver.h (1)

Learnt from: amitz-nv
PR: #5616
File: tensorrt_llm/executor/worker.py:375-384
Timestamp: 2025-07-17T09:01:27.402Z
Learning: In tensorrt_llm/executor/worker.py, the LoRA adapter cache optimization logic that checks is_adapter_in_cpu_cache() and conditionally passes None for weights/config has a known race condition issue that cannot be solved with simple error handling or verification checks. This is a known limitation that requires a more comprehensive solution.

cpp/tensorrt_llm/batch_manager/dataTransceiver.cpp (1)

Learnt from: amitz-nv
PR: #5616
File: tensorrt_llm/executor/worker.py:375-384
Timestamp: 2025-07-17T09:01:27.402Z
Learning: In tensorrt_llm/executor/worker.py, the LoRA adapter cache optimization logic that checks is_adapter_in_cpu_cache() and conditionally passes None for weights/config has a known race condition issue that cannot be solved with simple error handling or verification checks. This is a known limitation that requires a more comprehensive solution.

🔇 Additional comments (24)
cpp/tensorrt_llm/executor/cache_transmission/agent_utils/connection.h (1)

178-178: Consistent renaming verified

I performed a repository-wide search and confirmed there are no remaining occurrences of mReceiverBufferDesc. All references in

  • cpp/tensorrt_llm/executor/cache_transmission/agent_utils/connection.h
  • cpp/tensorrt_llm/executor/cache_transmission/agent_utils/connection.cpp

have been updated to mCacheReceiverBufferDesc.

LGTM!

cpp/tensorrt_llm/executor/cache_transmission/agent_utils/connection.cpp (2)

84-84: LGTM! Consistent implementation of the renamed member variable.

The changes correctly implement the renamed mCacheReceiverBufferDesc member variable throughout the implementation, maintaining consistency with the header file updates.

Also applies to: 140-142


468-468: LGTM! Proper namespace closing comment.

The namespace closing comment follows the coding guidelines correctly.

cpp/include/tensorrt_llm/batch_manager/cacheTransceiver.h (1)

37-38: Renaming Verification Complete

All references to the old class names have been removed:

  • No matches for DataResponder
  • No matches for DataRequester
  • No matches for mDataSender or mDataReceiver

The renaming to CacheSender/CacheReceiver is consistent and complete. No further action required.

cpp/tests/batch_manager/cacheTransceiverTest.cpp (6)

159-185: LGTM! Mock class properly renamed and implemented.

The mock class follows appropriate testing patterns with proper use of Google Mock features.


186-192: LGTM! Mock receiver class properly implemented.

The mock methods are correctly defined for testing cache receiving functionality.


211-230: LGTM! Test properly updated to use renamed classes.

The test correctly uses CacheSender instead of DataResponder while maintaining the original test logic.


232-252: LGTM! Test properly uses CacheReceiver.

The test correctly replaces DataRequester with CacheReceiver while maintaining test functionality.


396-404: LGTM! Cache transceiver setup properly updated.

The setup correctly instantiates CacheSender and CacheReceiver with proper implementations using CacheSenderImpl and CacheReceiverImpl.


762-768: LGTM! Asymmetric cache test properly updated.

The test correctly uses CacheSender and CacheReceiver with their implementation classes.

cpp/tensorrt_llm/batch_manager/cacheTransceiver.cpp (3)

196-200: LGTM! Proper initialization of cache sender and receiver.

The constructor correctly initializes CacheSender and CacheReceiver with appropriate parameters and formatter creation.


236-298: LGTM! Methods properly updated to use new cache classes.

All async send/receive operations correctly use mCacheSender and mCacheReceiver with proper future handling.


369-436: LGTM! Context transfer status checking properly updated.

The method correctly uses mSenderFutures (renamed from mResponderFutures) while maintaining the original synchronization logic.

cpp/tensorrt_llm/batch_manager/cacheFormatter.h (2)

45-122: LGTM! Well-designed TransferSession class.

The class provides a clean abstraction for managing transfer sessions with proper const-correctness, precondition checks, and encapsulation.


163-222: LGTM! Performance measurement helper properly implemented.

The class provides thread-safe performance measurement with proper file output handling. Good use of RAII pattern for automatic output on destruction.

cpp/tensorrt_llm/batch_manager/dataTransceiver.h (1)

115-183: LGTM! Clean interface design with pImpl pattern.

The CacheSender and CacheReceiver classes provide clean interfaces with proper use of the pImpl idiom for implementation hiding and ABI stability.

cpp/tensorrt_llm/batch_manager/dataTransceiver.cpp (8)

1-32: LGTM: Copyright header and includes look good.

The copyright header contains the current year (2025) as required, and the new include for agent connection utilities aligns with the refactoring objectives.


38-57: Good RAII design and helper functions.

The new helper structures and functions are well-designed:

  • Using declarations improve readability
  • ReceiveCacheResource follows RAII principles correctly
  • tagFromRequestId implements proper bit manipulation for tag generation

The magic number kDATA_TAG{43} is appropriately named as a constexpr constant.


115-174: Well-designed constructor and core methods.

The implementation demonstrates good practices:

  • Constructor properly validates inputs with TLLM_CHECK
  • Resource management uses smart pointers appropriately
  • Thread safety is maintained with proper mutex usage
  • Dependency injection pattern is correctly implemented

The new methods getCounterpartsCount and release provide clean session management.


176-228: Complex but well-structured request info handling.

The method correctly handles both agent and direct connection scenarios:

  • Appropriate use of dynamic_cast for type discrimination
  • Comprehensive error handling with informative messages
  • Proper thread safety with mutex protection
  • Session management is correctly implemented

The method's complexity is justified given the dual connection handling requirements.


230-345: Excellent thread management and synchronization.

The implementation demonstrates solid concurrent programming practices:

  • Clean separation between sync and async operations
  • Proper use of condition variables and mutexes for thread coordination
  • Exception handling ensures promises are always resolved
  • Thread naming (dataTransResp) aids in debugging
  • Producer-consumer pattern is correctly implemented

The variable renaming from responder to sender terminology is consistent with the refactor.


405-529: Solid CacheReceiver implementation with good async handling.

The receiver implementation mirrors the sender's good practices:

  • Constructor properly validates inputs and manages resources
  • Async operations are correctly implemented with std::async
  • Complex connection handling for both agent and direct scenarios
  • Proper resource management with ReceiveCacheResource
  • Thread safety maintained throughout

The symmetry with CacheSender provides a clean, consistent API.


560-698: Well-implemented async resource management.

The async processing implementation demonstrates good concurrent design:

  • AsyncResource properly encapsulates thread synchronization primitives
  • Worker thread pattern correctly processes queued requests
  • Proper termination handling prevents resource leaks
  • Exception handling ensures all promises are resolved
  • Thread naming (dataTransRequest) aids debugging
  • Use of std::deque for request queue is appropriate

The cleanup logic in the destructor properly terminates all worker threads.


700-736: Clean public interface implementation using pimpl idiom.

The public interfaces are well-designed:

  • Proper use of pimpl idiom with std::unique_ptr<Impl>
  • Constructor delegation to implementation is clean
  • Method delegation is straightforward and maintains const correctness
  • Defaulted destructors are appropriate with smart pointer members

The namespace closing comment follows the coding guidelines correctly.

@Tabrizian Tabrizian force-pushed the user/imant/useHashIds branch from eed674f to 376d131 on July 25, 2025 at 02:26
@coderabbitai coderabbitai bot requested a review from DomBrown July 25, 2025 02:26
@Tabrizian (Member, Author):

/bot run

@Tabrizian Tabrizian force-pushed the user/imant/useHashIds branch from 376d131 to 90d8774 on July 25, 2025 at 02:28
@coderabbitai coderabbitai bot requested a review from nv-guomingz July 25, 2025 02:30
@tensorrt-cicd (Collaborator):

PR_Github #12929 [ run ] triggered by Bot

@coderabbitai coderabbitai bot (Contributor) left a comment


Actionable comments posted: 2

🧹 Nitpick comments (1)
cpp/tensorrt_llm/batch_manager/dataTransceiver.cpp (1)

420-424: Consider thread pool instead of frequent thread creation.

The TODO comment indicates awareness of the performance issue with frequent thread creation. The current implementation uses std::async with std::launch::async which may create new threads frequently.

Consider implementing a thread pool for better performance:

- return std::async(std::launch::async, &CacheReceiver::Impl::requestSync, this, std::ref(llmRequest));
+ // Use thread pool implementation instead of direct async call
+ return mThreadPool.submit(&CacheReceiver::Impl::requestSync, this, std::ref(llmRequest));
📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 376d131 and 90d8774.

📒 Files selected for processing (11)
  • cpp/include/tensorrt_llm/batch_manager/cacheTransceiver.h (2 hunks)
  • cpp/tensorrt_llm/batch_manager/CMakeLists.txt (0 hunks)
  • cpp/tensorrt_llm/batch_manager/cacheFormatter.h (3 hunks)
  • cpp/tensorrt_llm/batch_manager/cacheTransceiver.cpp (7 hunks)
  • cpp/tensorrt_llm/batch_manager/dataTransceiver.cpp (15 hunks)
  • cpp/tensorrt_llm/batch_manager/dataTransceiver.h (3 hunks)
  • cpp/tensorrt_llm/batch_manager/dataTransceiverImpl.cpp (0 hunks)
  • cpp/tensorrt_llm/batch_manager/dataTransceiverImpl.h (0 hunks)
  • cpp/tensorrt_llm/executor/cache_transmission/agent_utils/connection.cpp (2 hunks)
  • cpp/tensorrt_llm/executor/cache_transmission/agent_utils/connection.h (2 hunks)
  • cpp/tests/batch_manager/cacheTransceiverTest.cpp (11 hunks)
💤 Files with no reviewable changes (3)
  • cpp/tensorrt_llm/batch_manager/CMakeLists.txt
  • cpp/tensorrt_llm/batch_manager/dataTransceiverImpl.h
  • cpp/tensorrt_llm/batch_manager/dataTransceiverImpl.cpp
🚧 Files skipped from review as they are similar to previous changes (6)
  • cpp/tensorrt_llm/executor/cache_transmission/agent_utils/connection.cpp
  • cpp/include/tensorrt_llm/batch_manager/cacheTransceiver.h
  • cpp/tensorrt_llm/executor/cache_transmission/agent_utils/connection.h
  • cpp/tensorrt_llm/batch_manager/cacheTransceiver.cpp
  • cpp/tensorrt_llm/batch_manager/cacheFormatter.h
  • cpp/tests/batch_manager/cacheTransceiverTest.cpp
🧰 Additional context used
📓 Path-based instructions (3)
**/*.{cpp,h,hpp,cc,cxx}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

**/*.{cpp,h,hpp,cc,cxx}: Closing braces of namespaces should have a comment saying the namespace it closes (e.g., } // namespace foo)
Prefer const or constexpr variables over #defines whenever possible, as the latter are not visible to the compiler.
A variable that is not modified after its initialization should be declared as const.
Except 0, nullptr, true, false, all other literals should only be used for variable initialization.
Use the Allman indentation style for braces in C++ code.
Put the semicolon for an empty for or while loop in a new line.
The statement forming the body of a switch, while, do .. while or for statement shall be a compound statement (use brace-delimited statements).
If and else should always be followed by brace-delimited statements, even if empty or a single statement.
C++ filenames should use camel case with first letter lowercase (e.g., thisIsAFilename.cpp), and must be case-insensitive unique within a compilation target.
All types (including class names) should use camel case with uppercase first letter (e.g., FooBarClass).
Local variables, methods, and namespaces should use camel case with first letter lowercase (e.g., localFooBar).
Non-magic-number global variables that are non-static and not defined in anonymous namespace should use camel case prefixed by a lower case 'g' (e.g., gDontUseGlobalFoos).
Non-magic-number global variables that are static or defined in an anonymous namespace should use camel case prefixed by a lower case 's' (e.g., sMutableStaticGlobal).
Locally visible static variable should use camel case with lowercase prefix 's' as the first letter of the name (e.g., static std::once_flag sFlag;).
Class member variables should use camel case prefixed with an 'm' (e.g., mNbFooValues). Public member variables do not require the 'm' prefix but it is encouraged for clarity.
Enumerations, global constants, static constants at class-scope, and function-scope magic-number/literal constants are uppercase snakecase with prefix...

Files:

  • cpp/tensorrt_llm/batch_manager/dataTransceiver.h
  • cpp/tensorrt_llm/batch_manager/dataTransceiver.cpp
**/*.{h,hpp}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

Use a preprocessor guard in header files. The guard name must have prefix TRTLLM_ followed by the filename, all in caps, and no trailing underscore.

Files:

  • cpp/tensorrt_llm/batch_manager/dataTransceiver.h
**/*.{cpp,h,hpp,cc,cxx,cu,py}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

All TensorRT-LLM Open Source Software code should contain an NVIDIA copyright header that includes the current year. This includes .cpp, .h, .cu, .py, and any other source files which are compiled or interpreted.

Files:

  • cpp/tensorrt_llm/batch_manager/dataTransceiver.h
  • cpp/tensorrt_llm/batch_manager/dataTransceiver.cpp
🧠 Learnings (2)
cpp/tensorrt_llm/batch_manager/dataTransceiver.h (1)

Learnt from: amitz-nv
PR: #5616
File: tensorrt_llm/executor/worker.py:375-384
Timestamp: 2025-07-17T09:01:27.402Z
Learning: In tensorrt_llm/executor/worker.py, the LoRA adapter cache optimization logic that checks is_adapter_in_cpu_cache() and conditionally passes None for weights/config has a known race condition issue that cannot be solved with simple error handling or verification checks. This is a known limitation that requires a more comprehensive solution.

cpp/tensorrt_llm/batch_manager/dataTransceiver.cpp (1)

Learnt from: amitz-nv
PR: #5616
File: tensorrt_llm/executor/worker.py:375-384
Timestamp: 2025-07-17T09:01:27.402Z
Learning: In tensorrt_llm/executor/worker.py, the LoRA adapter cache optimization logic that checks is_adapter_in_cpu_cache() and conditionally passes None for weights/config has a known race condition issue that cannot be solved with simple error handling or verification checks. This is a known limitation that requires a more comprehensive solution.

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Pre-commit Check
🔇 Additional comments (48)
cpp/tensorrt_llm/batch_manager/dataTransceiver.h (17)

24-24: LGTM: Include added for new cache formatter abstraction.

The new include for cacheFormatter.h is correctly added to support the refactored architecture that uses BaseCacheFormatter and TransferSession from the kv_cache_manager namespace.


42-44: LGTM: Type aliases properly updated for the refactored architecture.

The type aliases correctly reference the new abstractions:

  • SizeType32 properly aliased from runtime namespace
  • TransferSession correctly aliased from kv_cache_manager namespace, replacing the previous internal definition

45-58: LGTM: TransceiverTag struct provides proper message tagging.

The TransceiverTag struct correctly defines:

  • Enumerated message types for REQUEST_SEND and TERMINATION
  • Appropriate tag constants for ID, INFO_SIZE, and INFO message components
  • Uses constexpr for compile-time constants as per coding guidelines

The naming follows the coding guidelines with uppercase snake_case for constants prefixed with 'k'.
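A sketch of the tagging pattern described above may help; the enum values and tag constants below are illustrative placeholders, not the actual TensorRT-LLM definitions.

```cpp
#include <cassert>
#include <cstdint>

// Illustrative message-tagging struct: enumerated message types plus
// compile-time tag constants following the k-prefixed uppercase convention.
struct TransceiverTag
{
    enum class Id : uint64_t
    {
        REQUEST_SEND = 1,
        TERMINATION = 2
    };

    // Tag constants for the ID, INFO_SIZE, and INFO message components
    // (values here are arbitrary examples).
    static constexpr int32_t kID_TAG{19};
    static constexpr int32_t kINFO_SIZE_TAG{22};
    static constexpr int32_t kINFO_TAG{32};
};
```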


58-58: LGTM: BaseCacheFormatter type alias correctly defined.

The type alias correctly references the BaseCacheFormatter from the kv_cache_manager namespace, maintaining consistency with the new architecture.


115-121: LGTM: CacheSender constructor signature properly designed.

The constructor correctly accepts:

  • ConnectionManager* for managing connections
  • CacheState for cache state information
  • SizeType32 for self index
  • std::unique_ptr<BaseCacheFormatter> for cache formatting operations

The use of std::unique_ptr follows C++ best practices for single resource ownership as specified in the coding guidelines.


122-126: LGTM: sendAsync method signature correctly updated.

The method signature properly:

  • Returns std::future<void> for asynchronous operation
  • Takes LlmRequest& as parameter
  • Uses [[nodiscard]] attribute appropriately
  • Includes comprehensive documentation

130-134: LGTM: Communication state accessors properly updated.

The getCommState() and setCommState() methods correctly use the new executor::kv_cache::CommState type, maintaining the same interface pattern as before but with updated types.


136-150: LGTM: Additional methods properly defined for the refactored architecture.

The new methods correctly provide:

  • recvRequestInfo() for receiving request information
  • sendSync() for synchronous sending
  • sendRequestInfo() returning TransferSession
  • receiveSync() accepting TransferSession&

All methods use appropriate parameter types and return types for the new architecture.


153-153: LGTM: Destructor properly declared.

The destructor is correctly declared and will be implemented to properly clean up resources.


160-177: LGTM: CacheReceiver class properly designed.

The CacheReceiver class correctly:

  • Has matching constructor signature with CacheSender
  • Provides receiveAsync() method for asynchronous reception
  • Includes sendRequestInfo() and receiveSync() methods for coordination
  • Uses proper parameter and return types
  • Follows the same design patterns as CacheSender

24-24: LGTM! Proper include for new dependency.

The inclusion of cacheFormatter.h aligns with the refactoring to use kv_cache_manager types.


42-44: LGTM! Clean type aliasing approach.

The type aliases consolidate dependencies and the TransferSession alias properly delegates to the kv_cache_manager namespace as intended by the refactor.


45-57: LGTM! Well-structured communication protocol.

The TransceiverTag struct provides a clean enumeration of message types and associated tag constants for the communication protocol.


58-58: LGTM! Consistent type aliasing.

The BaseCacheFormatter alias maintains consistency with the delegation pattern to kv_cache_manager.


160-182: LGTM! Well-designed receiver interface.

The CacheReceiver class interface mirrors the sender pattern appropriately, with proper constructor signature and async receive capability.


184-184: LGTM! Proper namespace closure comment.

The namespace closure comment follows the coding guidelines requirement for namespace comments.


115-158: Constructor Validation Confirmed – No Action Needed

The Impl constructor already enforces non-null and consistent state checks:

  • Line 127: TLLM_CHECK(mManager); ensures manager is not null.
  • Line 128: TLLM_CHECK(mManager->getCommState().getSelfIdx() == selfIndex); verifies selfIndex matches the manager’s state.

All public methods simply forward to mImpl, maintaining interface consistency.

cpp/tensorrt_llm/batch_manager/dataTransceiver.cpp (31)

26-26: LGTM: Include added for agent connection utilities.

The new include for agent connection utilities is correctly added to support the refactored connection management functionality.


38-45: LGTM: Helper function and type aliases properly implemented.

The code correctly:

  • Defines type aliases for the agent connection manager and data context
  • Implements tagFromRequestId() helper function with proper bit manipulation to create unique tags
  • Uses constexpr for the constant as per coding guidelines
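The bit-manipulation idea can be sketched as follows, assuming the goal is to pack the request ID into the bits above a fixed base tag; the mask, shift width, and constant value are assumptions for illustration, not the actual implementation.

```cpp
#include <cassert>
#include <cstdint>

// Illustrative base tag constant (k-prefixed uppercase per the guidelines).
static constexpr int32_t kDATA_TAG{43};

// Derive a per-request tag: keep the low 23 bits of the request ID and shift
// them above the base tag's bit range so distinct requests get distinct tags.
static int32_t tagFromRequestId(uint64_t requestId)
{
    constexpr int kBASE_TAG_BITS = 8;
    return static_cast<int32_t>(((requestId & 0x7FFFFF) << kBASE_TAG_BITS)
        | static_cast<uint64_t>(kDATA_TAG));
}
```

The low bits of the combined value still identify the message class, while the upper bits disambiguate concurrent requests.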

47-57: LGTM: ReceiveCacheResource struct properly designed.

The struct correctly:

  • Manages buffer manager and CUDA event resources
  • Uses move semantics in the constructor
  • Follows proper resource management patterns
  • Uses member initializer list as per C++ best practices

115-132: LGTM: CacheSender::Impl constructor properly implemented.

The constructor correctly:

  • Initializes all member variables using member initializer list
  • Performs proper null pointer checks with TLLM_CHECK
  • Validates self index consistency
  • Sets up CUDA device context
  • Launches the response thread asynchronously

The use of std::make_shared for CudaStream follows smart pointer best practices.


134-149: LGTM: sendAsync method properly implemented with thread synchronization.

The method correctly:

  • Creates promise/future pair for asynchronous operation
  • Uses proper mutex locking for thread safety
  • Adds requests to the ready responses queue
  • Notifies waiting threads via condition variable
  • Returns the future for caller synchronization

The nested locking approach prevents race conditions.


230-237: LGTM: sendSync method properly implemented.

The method correctly:

  • Looks up the transfer session by request ID
  • Sets the LLM request on the session
  • Delegates formatting to the formatter
  • Includes proper error checking

251-265: LGTM: sendAndRemoveResponse method with proper exception handling.

The method correctly:

  • Uses CUDA device setting for proper context
  • Calls synchronous send and cleanup
  • Provides proper exception handling with promise notification
  • Uses noexcept specification appropriately

267-345: LGTM: Response thread implementation with comprehensive synchronization.

The response thread correctly:

  • Sets thread name for debugging
  • Uses proper CUDA device context
  • Implements condition variable wait with proper predicates
  • Handles termination gracefully
  • Manages request counting and cleanup
  • Supports both parallel and sequential sending modes
  • Includes comprehensive error handling and logging

The thread synchronization logic appears sound with proper use of mutexes and condition variables.


408-418: LGTM: CacheReceiver::Impl constructor properly implemented.

The constructor follows the same pattern as CacheSender::Impl with:

  • Proper member initialization
  • Null pointer validation
  • Self index consistency check
  • CUDA device setup

460-463: LGTM: receiveSync method properly delegates to formatter.

The method correctly delegates the unformatting operation to the formatter using the transfer session, maintaining proper separation of concerns.


465-529: LGTM: sendRequestInfo method with comprehensive connection handling.

The method correctly:

  • Validates formatter support for cache states
  • Handles selective cache transfer optimization
  • Supports both agent and non-agent connection managers
  • Manages buffer index assignment for agent connections
  • Creates transfer session with proper parameters
  • Includes comprehensive error checking and validation

The dual path handling for agent vs. non-agent connections is properly implemented.


531-547: LGTM: getReceiveCacheResource method with proper resource management.

The method correctly:

  • Uses scoped lock for thread safety
  • Implements per-process resource isolation when concurrent mode is enabled
  • Creates resources on-demand with proper RAII
  • Uses default process string for non-concurrent mode
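The on-demand, per-process resource pattern above can be sketched like this; ReceiveResource, ResourceRegistry, and the "default" key are illustrative stand-ins for the actual buffer-manager/CUDA-event resources.

```cpp
#include <cassert>
#include <map>
#include <memory>
#include <mutex>
#include <string>

// Placeholder for the per-process receive resources (buffer manager, event).
struct ReceiveResource
{
    std::string owner;
};

class ResourceRegistry
{
public:
    // Create-on-demand lookup behind a scoped lock: concurrent mode isolates
    // resources per process, otherwise everyone shares a single default entry.
    std::shared_ptr<ReceiveResource> getResource(std::string const& processId, bool concurrent)
    {
        std::scoped_lock lk(mMtx);
        std::string key = concurrent ? processId : "default";
        auto it = mResources.find(key);
        if (it == mResources.end())
        {
            it = mResources.emplace(key, std::make_shared<ReceiveResource>(ReceiveResource{key})).first;
        }
        return it->second;
    }

private:
    std::mutex mMtx;
    std::map<std::string, std::shared_ptr<ReceiveResource>> mResources;
};
```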

549-559: LGTM: sendRequestInfo helper method properly implemented.

The helper method correctly:

  • Serializes request info using the established serialization utilities
  • Sends data in the proper order (ID, size, then data)
  • Uses appropriate data contexts for different message components

700-704: LGTM: CacheSender public constructor properly implemented.

The constructor correctly creates the implementation, transferring ownership of the formatter parameter with std::move.


706-721: LGTM: CacheSender public interface methods properly implemented.

All public methods correctly delegate to the implementation:

  • sendAsync() forwards the request properly
  • getCommState() and setCommState() maintain the interface contract
  • Destructor is properly defaulted

723-734: LGTM: CacheReceiver public interface properly implemented.

The constructor and methods correctly:

  • Create implementation with proper parameter forwarding
  • Delegate receiveAsync() to the multi-threaded implementation
  • Use defaulted destructor for proper cleanup

176-228: Dynamic_cast usage is safe – no action needed

Verified that ConnectionManager is indeed polymorphic (it declares virtual methods, including recvConnect, getConnections, and getCommState), so the dynamic_cast in recvRequestInfo() will behave as intended. The interface is stable and handles both agent and non-agent scenarios correctly.
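The dual-path dispatch amounts to probing the dynamic type of the manager; a minimal sketch with illustrative class names:

```cpp
#include <cassert>

// Polymorphic base: a virtual destructor is enough for dynamic_cast to work.
class ConnectionManager
{
public:
    virtual ~ConnectionManager() = default;
};

// Agent-specific variant that the receive path treats differently.
class AgentConnectionManager : public ConnectionManager
{
};

// dynamic_cast yields nullptr for non-agent managers, selecting the
// direct-connection path; a non-null result selects the agent path.
static bool isAgentManager(ConnectionManager* manager)
{
    return dynamic_cast<AgentConnectionManager*>(manager) != nullptr;
}
```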


26-26: LGTM! Updated include for agent utilities.

The include path update aligns with the refactored cache transmission structure.


38-45: LGTM! Clean helper function with proper naming.

The tagFromRequestId function follows proper naming conventions and provides a clean abstraction for generating unique tags from request IDs.


47-57: LGTM! Well-structured resource management.

The ReceiveCacheResource struct properly encapsulates buffer manager and CUDA event resources with appropriate constructor.


134-149: LGTM! Proper async pattern with promise/future.

The sendAsync method correctly implements the promise/future pattern with proper synchronization using mutexes and condition variables.


176-228: LGTM! Comprehensive request info handling with proper validation.

The recvRequestInfo method properly handles both agent and direct connection scenarios, with appropriate validation of cache state compatibility and connection management.


230-237: LGTM! Clean synchronous send implementation.

The sendSync method properly retrieves the session and delegates to the formatter, with appropriate error checking.


405-418: LGTM! Consistent constructor pattern.

The CacheReceiver::Impl constructor follows the same validation and initialization pattern as the sender.


465-529: LGTM! Comprehensive request info sending with agent support.

The sendRequestInfo method properly handles both agent and direct connection scenarios, with appropriate buffer management and connection selection logic.


531-547: LGTM! Thread-safe resource management with proper scoping.

The getReceiveCacheResource method uses proper scoped locking and resource management patterns for concurrent access.


549-559: LGTM! Clean request info serialization.

The sendRequestInfo helper method properly serializes and sends request information using the defined communication protocol.


700-735: LGTM! Clean public interface implementation.

The public interface methods properly delegate to the pImpl pattern implementations with appropriate parameter forwarding and resource management.


736-736: LGTM! Proper namespace closure comment.

The namespace closure comment follows the coding guidelines requirement.


267-345: Thread name and exception handling verified

The thread name “dataTransResp” is only set here and does not conflict with other threads. Exception propagation via set_exception(std::current_exception()) matches the pattern used elsewhere (e.g., in workerPool.h). No further changes are needed.


115-132: Ensure ConnectionManager Access Is Thread-Safe

I didn’t find any explicit synchronization (mutexes/locks) around calls to mManager in this class. Since you spawn a background thread via std::async for Impl::response, concurrent calls to mManager could race unless ConnectionManager itself is fully thread-safe.

Please confirm or address this:

• Constructor (lines 115–132): validates and reads from mManager
• Impl::response (launched async): invokes mManager->recvConnect(...) and other methods
• receiveAsync() and other methods: call mManager->getConnections(...)

If executor::kv_cache::ConnectionManager is not inherently thread-safe, wrap all shared calls in a mutex or otherwise serialize access to mManager.

Comment on lines +161 to +174
[[nodiscard]] size_t getCounterpartsCount(LlmRequest::RequestIdType requestId) const
{
auto it = mRequestToSession.find(requestId);
TLLM_CHECK(it != mRequestToSession.end());
return it->second.getConnections().size();
}

void release(LlmRequest::RequestIdType requestId)
{
auto it = mRequestToSession.find(requestId);
TLLM_CHECK(it != mRequestToSession.end());
std::unique_lock<std::mutex> lk(mMtxForMap);
mRequestToSession.erase(it);
}
Contributor

🛠️ Refactor suggestion

Check thread safety of session map access.

The getCounterpartsCount method reads mRequestToSession without taking mMtxForMap, and release performs its find and TLLM_CHECK before acquiring the lock, so both lookups can race with concurrent map mutations. Ensure every access to the map is protected by the mutex.

 [[nodiscard]] size_t getCounterpartsCount(LlmRequest::RequestIdType requestId) const
 {
+    std::unique_lock<std::mutex> lk(mMtxForMap);
     auto it = mRequestToSession.find(requestId);
     TLLM_CHECK(it != mRequestToSession.end());
     return it->second.getConnections().size();
 }
🤖 Prompt for AI Agents
In cpp/tensorrt_llm/batch_manager/dataTransceiver.cpp around lines 161 to 174,
the methods getCounterpartsCount and release access the mRequestToSession map
without full synchronization, which can cause race conditions. To fix this,
ensure that all accesses to mRequestToSession are protected by the mutex
mMtxForMap: add a std::unique_lock<std::mutex> lock(mMtxForMap) at the start of
getCounterpartsCount, and in release acquire the lock before calling find so
the lookup itself is covered.

@tensorrt-cicd
Collaborator

PR_Github #12929 [ run ] completed with state FAILURE
/LLM/main/L0_MergeRequest_PR pipeline #9638 completed with status: 'FAILURE'

@Tabrizian Tabrizian force-pushed the user/imant/useHashIds branch 2 times, most recently from ddc66c6 to 7d3e452 on July 25, 2025 19:27
@Tabrizian
Member Author

/bot run

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

🔭 Outside diff range comments (2)
cpp/tensorrt_llm/batch_manager/dataTransceiver.h (1)

18-18: Replace #pragma once with required preprocessor guard.

The coding guidelines specifically require preprocessor guards in header files with the format TRTLLM_ followed by the filename in caps.

Replace the current directive with the required guard format:

-#pragma once
+#ifndef TRTLLM_DATATRANSCEIVER_H
+#define TRTLLM_DATATRANSCEIVER_H

And add the closing directive at the end of the file:

 } // namespace tensorrt_llm::batch_manager
+
+#endif // TRTLLM_DATATRANSCEIVER_H
cpp/tensorrt_llm/batch_manager/cacheFormatter.h (1)

225-261: Fix member variable naming convention.

The member variable kvCacheMeasureHelper should follow the coding guidelines for class member variables.

Apply this fix:

-    KvCacheMeasureHelper kvCacheMeasureHelper{common::getEnvKVCacheTransferOutputPath()};
+    KvCacheMeasureHelper mKvCacheMeasureHelper{common::getEnvKVCacheTransferOutputPath()};

Class member variables should use camelCase prefixed with 'm' according to the coding guidelines.

♻️ Duplicate comments (2)
cpp/tensorrt_llm/batch_manager/dataTransceiver.cpp (2)

161-174: Thread safety issue with session map access.

The getCounterpartsCount method accesses mRequestToSession without proper synchronization, while release uses mutex protection. This creates a race condition.

This is the same thread safety issue identified in previous reviews. The getCounterpartsCount method needs mutex protection:

 [[nodiscard]] size_t getCounterpartsCount(LlmRequest::RequestIdType requestId) const
 {
+    std::unique_lock<std::mutex> lk(mMtxForMap);
     auto it = mRequestToSession.find(requestId);
     TLLM_CHECK(it != mRequestToSession.end());
     return it->second.getConnections().size();
 }

420-424: TODO: Address inefficient thread creation pattern.

The method creates a new thread for each async receive operation, which is inefficient as noted in the TODO comment.

This matches the previous review feedback about frequent thread creation. The TODO comment correctly identifies this as needing improvement. Consider using the existing requestAndReceiveAsyncMultiThreads pattern or implementing a thread pool.

📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 90d8774 and 7d3e452.

📒 Files selected for processing (13)
  • cpp/include/tensorrt_llm/batch_manager/cacheTransceiver.h (2 hunks)
  • cpp/tensorrt_llm/batch_manager/CMakeLists.txt (0 hunks)
  • cpp/tensorrt_llm/batch_manager/cacheFormatter.h (3 hunks)
  • cpp/tensorrt_llm/batch_manager/cacheTransceiver.cpp (7 hunks)
  • cpp/tensorrt_llm/batch_manager/dataTransceiver.cpp (15 hunks)
  • cpp/tensorrt_llm/batch_manager/dataTransceiver.h (3 hunks)
  • cpp/tensorrt_llm/batch_manager/dataTransceiverImpl.cpp (0 hunks)
  • cpp/tensorrt_llm/batch_manager/dataTransceiverImpl.h (0 hunks)
  • cpp/tensorrt_llm/executor/cache_transmission/agent_utils/connection.cpp (2 hunks)
  • cpp/tensorrt_llm/executor/cache_transmission/agent_utils/connection.h (2 hunks)
  • cpp/tensorrt_llm/executor/cache_transmission/ucx_utils/connection.cpp (1 hunks)
  • cpp/tests/batch_manager/cacheTransceiverTest.cpp (11 hunks)
  • cpp/tests/unit_tests/executor/ucxCommTest.cpp (0 hunks)
💤 Files with no reviewable changes (4)
  • cpp/tensorrt_llm/batch_manager/CMakeLists.txt
  • cpp/tests/unit_tests/executor/ucxCommTest.cpp
  • cpp/tensorrt_llm/batch_manager/dataTransceiverImpl.h
  • cpp/tensorrt_llm/batch_manager/dataTransceiverImpl.cpp
✅ Files skipped from review due to trivial changes (1)
  • cpp/tensorrt_llm/executor/cache_transmission/ucx_utils/connection.cpp
🚧 Files skipped from review as they are similar to previous changes (5)
  • cpp/tensorrt_llm/executor/cache_transmission/agent_utils/connection.h
  • cpp/include/tensorrt_llm/batch_manager/cacheTransceiver.h
  • cpp/tensorrt_llm/executor/cache_transmission/agent_utils/connection.cpp
  • cpp/tests/batch_manager/cacheTransceiverTest.cpp
  • cpp/tensorrt_llm/batch_manager/cacheTransceiver.cpp
🧰 Additional context used
📓 Path-based instructions (3)
**/*.{cpp,h,hpp,cc,cxx}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

**/*.{cpp,h,hpp,cc,cxx}: Closing braces of namespaces should have a comment saying the namespace it closes (e.g., } // namespace foo)
Prefer const or constexpr variables over #defines whenever possible, as the latter are not visible to the compiler.
A variable that is not modified after its initialization should be declared as const.
Except 0, nullptr, true, false, all other literals should only be used for variable initialization.
Use the Allman indentation style for braces in C++ code.
Put the semicolon for an empty for or while loop in a new line.
The statement forming the body of a switch, while, do .. while or for statement shall be a compound statement (use brace-delimited statements).
If and else should always be followed by brace-delimited statements, even if empty or a single statement.
C++ filenames should use camel case with first letter lowercase (e.g., thisIsAFilename.cpp), and must be case-insensitive unique within a compilation target.
All types (including class names) should use camel case with uppercase first letter (e.g., FooBarClass).
Local variables, methods, and namespaces should use camel case with first letter lowercase (e.g., localFooBar).
Non-magic-number global variables that are non-static and not defined in anonymous namespace should use camel case prefixed by a lower case 'g' (e.g., gDontUseGlobalFoos).
Non-magic-number global variables that are static or defined in an anonymous namespace should use camel case prefixed by a lower case 's' (e.g., sMutableStaticGlobal).
Locally visible static variable should use camel case with lowercase prefix 's' as the first letter of the name (e.g., static std::once_flag sFlag;).
Class member variables should use camel case prefixed with an 'm' (e.g., mNbFooValues). Public member variables do not require the 'm' prefix but it is encouraged for clarity.
Enumerations, global constants, static constants at class-scope, and function-scope magic-number/literal constants are uppercase snakecase with prefix...

Files:

  • cpp/tensorrt_llm/batch_manager/cacheFormatter.h
  • cpp/tensorrt_llm/batch_manager/dataTransceiver.h
  • cpp/tensorrt_llm/batch_manager/dataTransceiver.cpp
**/*.{h,hpp}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

Use a preprocessor guard in header files. The guard name must have prefix TRTLLM_ followed by the filename, all in caps, and no trailing underscore.

Files:

  • cpp/tensorrt_llm/batch_manager/cacheFormatter.h
  • cpp/tensorrt_llm/batch_manager/dataTransceiver.h
**/*.{cpp,h,hpp,cc,cxx,cu,py}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

All TensorRT-LLM Open Source Software code should contain an NVIDIA copyright header that includes the current year. This includes .cpp, .h, .cu, .py, and any other source files which are compiled or interpreted.

Files:

  • cpp/tensorrt_llm/batch_manager/cacheFormatter.h
  • cpp/tensorrt_llm/batch_manager/dataTransceiver.h
  • cpp/tensorrt_llm/batch_manager/dataTransceiver.cpp
🧠 Learnings (3)
cpp/tensorrt_llm/batch_manager/cacheFormatter.h (1)

Learnt from: amitz-nv
PR: #5616
File: tensorrt_llm/executor/worker.py:375-384
Timestamp: 2025-07-17T09:01:27.402Z
Learning: In tensorrt_llm/executor/worker.py, the LoRA adapter cache optimization logic that checks is_adapter_in_cpu_cache() and conditionally passes None for weights/config has a known race condition issue that cannot be solved with simple error handling or verification checks. This is a known limitation that requires a more comprehensive solution.

cpp/tensorrt_llm/batch_manager/dataTransceiver.h (1)

Learnt from: amitz-nv
PR: #5616
File: tensorrt_llm/executor/worker.py:375-384
Timestamp: 2025-07-17T09:01:27.402Z
Learning: In tensorrt_llm/executor/worker.py, the LoRA adapter cache optimization logic that checks is_adapter_in_cpu_cache() and conditionally passes None for weights/config has a known race condition issue that cannot be solved with simple error handling or verification checks. This is a known limitation that requires a more comprehensive solution.

cpp/tensorrt_llm/batch_manager/dataTransceiver.cpp (1)

Learnt from: amitz-nv
PR: #5616
File: tensorrt_llm/executor/worker.py:375-384
Timestamp: 2025-07-17T09:01:27.402Z
Learning: In tensorrt_llm/executor/worker.py, the LoRA adapter cache optimization logic that checks is_adapter_in_cpu_cache() and conditionally passes None for weights/config has a known race condition issue that cannot be solved with simple error handling or verification checks. This is a known limitation that requires a more comprehensive solution.

🔇 Additional comments (24)
cpp/tensorrt_llm/batch_manager/dataTransceiver.h (6)

1-16: LGTM! Copyright header complies with coding guidelines.

The NVIDIA copyright header includes the current year (2025) as required by the coding guidelines for TensorRT-LLM Open Source Software.


24-24: Good addition of cacheFormatter.h include.

The new include aligns with the refactored design that moves TransferSession definition to the kv_cache_manager namespace.


42-58: Well-designed TransceiverTag structure follows coding guidelines.

The type aliases and TransceiverTag struct properly follow C++ naming conventions:

  • SizeType32 alias uses camelCase as required
  • TransferSession alias provides clear abstraction boundary
  • Enum values use uppercase naming (REQUEST_SEND, TERMINATION)
  • Static constexpr constants use kCONSTANT_NAME pattern with proper k prefix

115-158: CacheSender class design follows best practices.

The refactored class demonstrates good design principles:

  • Proper UpperCamelCase naming for class name
  • Method names use lowerCamelCase convention
  • Smart pointer usage with unique_ptr for resource management
  • Pimpl pattern reduces compilation dependencies
  • Constructor parameters use appropriate types (ConnectionManager*, CacheState, etc.)

160-182: CacheReceiver class maintains design consistency.

The class follows the same excellent design patterns as CacheSender:

  • Consistent naming conventions and interface design
  • Symmetric constructor signature with CacheSender
  • Appropriate method names for receiving operations
  • Proper use of pimpl pattern for implementation hiding

184-184: Perfect namespace closing comment.

The closing brace comment follows the coding guidelines requirement exactly: } // namespace tensorrt_llm::batch_manager

cpp/tensorrt_llm/batch_manager/cacheFormatter.h (4)

1-16: Copyright header meets coding guidelines.

The NVIDIA copyright header correctly includes the current year (2025) as required for TensorRT-LLM Open Source Software.


45-121: TransferSession class demonstrates excellent C++ design practices.

The class properly follows all coding guidelines:

  • UpperCamelCase class name
  • Member variables with proper 'm' prefix (mConnections, mDataContext, etc.)
  • Method names use lowerCamelCase convention
  • Proper const-correctness with const methods and const references
  • Good use of TLLM_CHECK for assertions
  • Clean initializer list in constructor

123-161: BaseCacheFormatter interface demonstrates clean design.

The interface properly follows coding guidelines:

  • Method names use lowerCamelCase convention
  • Proper virtual destructor
  • Clean signature with TransferSession& parameter
  • Appropriate const-correctness
  • Good use of [[nodiscard]] attribute for return values

266-266: Perfect namespace closing comment.

The closing brace comment properly follows coding guidelines with the complete namespace path: } // namespace tensorrt_llm::batch_manager::kv_cache_manager

cpp/tensorrt_llm/batch_manager/dataTransceiver.cpp (14)

1-16: Copyright header complies with coding guidelines.

The NVIDIA copyright header correctly includes the current year (2025) as required for TensorRT-LLM Open Source Software.


38-57: Well-designed utility functions and structures.

The code follows C++ coding guidelines properly:

  • tagFromRequestId function uses lowerCamelCase and static linkage
  • ReceiveCacheResource struct uses UpperCamelCase naming
  • Member variables correctly use 'm' prefix (mBufferManager, mCudaEvent)
  • Constant kDATA_TAG follows uppercase with 'k' prefix convention

115-133: CacheSender::Impl constructor follows best practices.

The constructor demonstrates good C++ practices:

  • Proper member initialization list
  • Smart pointer usage with unique_ptr
  • Appropriate validation with TLLM_CHECK
  • Member variables follow 'm' prefix convention

176-228: Complex but well-structured request info handling.

The method demonstrates good practices:

  • Proper type checking with dynamic_cast for agent connections
  • Clean lambda usage for agent-specific logic
  • Correct mutex synchronization for mRequestToSession access
  • Good validation of formatter compatibility between cache states
  • Appropriate session creation and connection management

230-237: Clean and focused sendSync implementation.

The method properly delegates to the formatter and manages the session state appropriately.


251-403: Robust threading and response management.

The response handling demonstrates excellent threading practices:

  • Proper exception handling with promise/future pattern
  • Correct CUDA device management across threads
  • Thread naming for debugging (dataTransResp)
  • Proper condition variable usage for synchronization
  • Clean termination handling in destructor

405-418: CacheReceiver::Impl constructor follows consistent design.

The constructor maintains the same excellent practices as CacheSender::Impl with proper initialization and validation.


426-458: Excellent async request handling with thread pool pattern.

This method demonstrates the proper solution for efficient async operations:

  • Resource pooling per process to avoid thread creation overhead
  • Queue-based task distribution with worker threads
  • Proper synchronization with mutex and condition variables
  • Good exception handling and future management
  • This addresses the inefficiency noted in the receiveAsync TODO

460-463: Clean receiveSync delegation.

Simple and appropriate delegation to the formatter's unformat method.


465-559: Comprehensive sendRequestInfo implementation.

The method handles complex scenarios well:

  • Proper validation of formatter support between cache states
  • Clean handling of both agent and regular connection managers
  • Appropriate serialization of request information
  • Good resource management with TransferSession creation
  • Proper exception handling and validation throughout

531-559: Solid resource management with thread safety.

The resource management demonstrates good practices:

  • Thread-safe access with scoped_lock<mutex>
  • Lazy initialization of per-process resources
  • Proper RAII with unique_ptr and BufferManager
  • Clean helper method for connection communication

561-698: Excellent async resource management and cleanup.

The implementation demonstrates robust design:

  • Proper RAII in destructor with resource cleanup and future synchronization
  • Good CUDA device management and logging in requestSync
  • Well-designed AsyncResource structure for thread coordination
  • Proper move semantics in RequestAndPromise
  • Comprehensive error handling with exception propagation
  • Thread naming for debugging (dataTransRequest)

700-734: Clean public interface following pimpl pattern.

The public interface demonstrates excellent design:

  • Proper delegation to implementation classes
  • Clean constructor forwarding with smart pointer management
  • Appropriate method delegation maintaining interface contracts
  • Correctly defaulted destructors for pimpl pattern

736-737: Perfect namespace closing comment.

The closing brace comment properly follows coding guidelines: } // namespace tensorrt_llm::batch_manager

@@ -18,11 +18,11 @@
#pragma once

🛠️ Refactor suggestion

Replace #pragma once with required preprocessor guard.

The coding guidelines mandate preprocessor guards in header files with the format TRTLLM_ followed by the filename in caps.

Replace with the required guard format:

-#pragma once
+#ifndef TRTLLM_CACHEFORMATTER_H
+#define TRTLLM_CACHEFORMATTER_H

And add the closing directive at the end of the file:

 } // namespace tensorrt_llm::batch_manager::kv_cache_manager
+
+#endif // TRTLLM_CACHEFORMATTER_H
🤖 Prompt for AI Agents
In cpp/tensorrt_llm/batch_manager/cacheFormatter.h at line 18, replace the
#pragma once directive with a preprocessor guard following the coding
guidelines. Use the format TRTLLM_CACHEFORMATTER_H for the guard macro. Add
#ifndef TRTLLM_CACHEFORMATTER_H and #define TRTLLM_CACHEFORMATTER_H at the top
of the file, and add the corresponding #endif at the end of the file to close
the guard.

Comment on lines 163 to 221
class KvCacheMeasureHelper
{
public:
KvCacheMeasureHelper(std::string output_path)
: mOutputPath(std::move(output_path))
{
}

void appendKVCacheTransfer(LlmRequest::RequestIdType requestId, double duration, size_t size)
{
auto bandwidth = size * 8 / (duration / 1000) / 1e9;
if (mOutputPath.empty())
{
return;
}

std::lock_guard<std::mutex> lock(mMutex);
mRequestKVCacheTranfserMeasure[requestId].emplace_back(duration, bandwidth);
}

~KvCacheMeasureHelper()
{
if (!mRequestKVCacheTranfserMeasure.empty() && !mOutputPath.empty())
{
auto rank = mpi::MpiComm::world().getRank();
std::string outFilePath = mOutputPath + "rank_" + std::to_string(rank) + ".txt";
std::ofstream outFile(outFilePath);

TLLM_CHECK_WITH_INFO(outFile.is_open(), "Cannot write to file " + outFilePath);

size_t numTransferMeasure = mRequestKVCacheTranfserMeasure.begin()->second.size();

outFile << "RequestID";
for (size_t i = 0; i < numTransferMeasure; i++)
{
outFile << ",TimeDuration,Bandwidth";
}
outFile << '\n';

for (auto const& [requestID, measures] : mRequestKVCacheTranfserMeasure)
{
outFile << requestID;

for (auto const& [time, bandwidth] : measures)
{
outFile << "," << time << "," << bandwidth;
}
outFile << '\n';
}

outFile.close();
}
}

private:
std::map<LlmRequest::RequestIdType, std::vector<std::pair<double, double>>> mRequestKVCacheTranfserMeasure;
std::string mOutputPath;
std::mutex mMutex;
};

🛠️ Refactor suggestion

Fix typo in member variable name and improve parameter naming.

The class implementation is good but has minor naming issues:

Apply these fixes:

-class KvCacheMeasureHelper
-{
-public:
-    KvCacheMeasureHelper(std::string output_path)
-        : mOutputPath(std::move(output_path))
+class KvCacheMeasureHelper
+{
+public:
+    KvCacheMeasureHelper(std::string outputPath)
+        : mOutputPath(std::move(outputPath))
-    std::map<LlmRequest::RequestIdType, std::vector<std::pair<double, double>>> mRequestKVCacheTranfserMeasure;
+    std::map<LlmRequest::RequestIdType, std::vector<std::pair<double, double>>> mRequestKVCacheTransferMeasure;

And update the corresponding usage in the destructor.

Committable suggestion skipped: line range outside the PR's diff.

🤖 Prompt for AI Agents
In cpp/tensorrt_llm/batch_manager/cacheFormatter.h between lines 163 and 221,
fix the typo in the member variable name from mRequestKVCacheTranfserMeasure to
mRequestKVCacheTransferMeasure consistently throughout the class, including in
appendKVCacheTransfer and the destructor. Also, rename the parameters in
appendKVCacheTransfer to more descriptive names such as requestId, durationMs,
and sizeBytes for clarity. Update all usages of these variables accordingly.

@tensorrt-cicd
Collaborator

PR_Github #13033 [ run ] triggered by Bot

@tensorrt-cicd
Collaborator

PR_Github #13033 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #9736 completed with status: 'FAILURE'

Signed-off-by: Iman Tabrizian <[email protected]>
Signed-off-by: Iman Tabrizian <[email protected]>
@Tabrizian Tabrizian force-pushed the user/imant/useHashIds branch from 7d3e452 to 1a21dba Compare August 12, 2025 20:06
@Tabrizian Tabrizian requested review from a team as code owners August 12, 2025 20:06

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 8

🔭 Outside diff range comments (1)
cpp/include/tensorrt_llm/batch_manager/kvCacheUtils.h (1)

17-17: Replace #pragma once with required preprocessor guard.

Header files in this project must use include guards named TRTLLM_<FILENAME>_H without underscores prefix/suffix.

Replace with the required guard format:

-#pragma once
+#ifndef TRTLLM_KVCACHEUTILS_H
+#define TRTLLM_KVCACHEUTILS_H

And add the closing directive at the end of the file:

 } // namespace tensorrt_llm::batch_manager::kv_cache_manager
+
+#endif // TRTLLM_KVCACHEUTILS_H
♻️ Duplicate comments (4)
cpp/tensorrt_llm/batch_manager/cacheFormatter.h (2)

18-18: Replace #pragma once with required preprocessor guard.

The coding guidelines mandate preprocessor guards in header files with the format TRTLLM_ followed by the filename in caps.

Replace with the required guard format:

-#pragma once
+#ifndef TRTLLM_CACHEFORMATTER_H
+#define TRTLLM_CACHEFORMATTER_H

And add the closing directive at the end of the file:

 } // namespace tensorrt_llm::batch_manager::kv_cache_manager
+
+#endif // TRTLLM_CACHEFORMATTER_H

111-111: Fix typo in member variable name.

The member variable name contains a typo: "Tranfser" should be "Transfer".

Apply this fix:

-    std::map<LlmRequest::RequestIdType, std::vector<Measure>> mRequestKVCacheTranfserMeasure;
+    std::map<LlmRequest::RequestIdType, std::vector<Measure>> mRequestKVCacheTransferMeasure;

Also update the usage in line 71:

-        mRequestKVCacheTranfserMeasure[requestId].emplace_back(Measure{delay, duration, bandwidth});
+        mRequestKVCacheTransferMeasure[requestId].emplace_back(Measure{delay, duration, bandwidth});

And in the destructor (lines 76 and 86):

-        if (!mRequestKVCacheTranfserMeasure.empty() && !mOutputPath.empty())
+        if (!mRequestKVCacheTransferMeasure.empty() && !mOutputPath.empty())
-            size_t numTransferMeasure = mRequestKVCacheTranfserMeasure.begin()->second.size();
+            size_t numTransferMeasure = mRequestKVCacheTransferMeasure.begin()->second.size();

And line 95:

-            for (auto const& [requestID, measures] : mRequestKVCacheTranfserMeasure)
+            for (auto const& [requestID, measures] : mRequestKVCacheTransferMeasure)
cpp/tensorrt_llm/batch_manager/dataTransceiver.cpp (2)

161-174: Thread safety issue in getCounterpartsCount method.

The method accesses mRequestToSession without acquiring the mutex, which could cause race conditions with concurrent modifications.

 [[nodiscard]] size_t getCounterpartsCount(LlmRequest::RequestIdType requestId) const
 {
+    std::unique_lock<std::mutex> lk(mMtxForMap);
     auto it = mRequestToSession.find(requestId);
     TLLM_CHECK(it != mRequestToSession.end());
     return it->second.getConnections().size();
 }

430-434: TODO: Address frequent thread creation issue.

The current implementation creates a new thread for each async receive operation, which is inefficient as noted in the TODO comment.

Consider using the existing requestAndReceiveAsyncMultiThreads method or implementing a thread pool:

 [[nodiscard]] std::future<void> receiveAsync(LlmRequest& llmRequest)
 {
-    // TODO: Modify the implementation here to avoid frequent thread creation.
-    return std::async(std::launch::async, &CacheReceiver::Impl::requestSync, this, std::ref(llmRequest));
+    return requestAndReceiveAsyncMultiThreads(llmRequest);
 }
🧹 Nitpick comments (4)
cpp/tensorrt_llm/batch_manager/cacheFormatter.cpp (1)

42-44: Consider improving function signature consistency.

The function now accepts blockHashes but internally converts them to block IDs in line 55. Consider either:

  1. Renaming the parameter to better reflect its dual purpose
  2. Accepting block IDs directly if the caller already has them

Consider this alternative approach for better clarity:

 BlockRange getBlockRangeForSending(
-    BaseKVCacheManager* cacheManager, LlmRequest const& llmRequest, std::vector<size_t> const& blockHashes)
+    BaseKVCacheManager* cacheManager, LlmRequest const& llmRequest, std::vector<size_t> const& blockHashesOrIds)
cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h (1)

1-2: Update copyright year to 2025 for new methods.

The file header indicates copyright ending in 2024, but the changes are being made in 2025. Consider updating the copyright year to reflect when these new methods are being added.

cpp/tensorrt_llm/batch_manager/dataTransceiver.h (1)

45-56: Consider making tag constants more maintainable.

The magic numbers for tags (19, 22, 32) should be documented or derived from a base value to make their purpose clearer.

 struct TransceiverTag
 {
     enum class Id : uint64_t
     {
         REQUEST_SEND = 1,
         TERMINATION = 2
     };
 
-    static constexpr int32_t kID_TAG{19};
-    static constexpr int32_t kINFO_SIZE_TAG{22};
-    static constexpr int32_t kINFO_TAG{32};
+    // Base tag for transceiver operations
+    static constexpr int32_t kBASE_TAG{19};
+    static constexpr int32_t kID_TAG{kBASE_TAG};
+    static constexpr int32_t kINFO_SIZE_TAG{kBASE_TAG + 3};
+    static constexpr int32_t kINFO_TAG{kBASE_TAG + 13};
 };
cpp/tensorrt_llm/batch_manager/dataTransceiver.cpp (1)

475-537: Complex connection management in sendRequestInfo.

The method handles both regular and agent-based connections with extensive branching. Consider extracting the agent-specific logic to a separate method for better maintainability.

 TransferSession sendRequestInfo(LlmRequest const& llmRequest)
 {
     // ... initial setup ...
     
-    auto* agentConnectionManager = dynamic_cast<executor::kv_cache::AgentConnectionManager*>(mManager);
-    std::optional<size_t> cacheBufferId = std::nullopt;
-    if (agentConnectionManager != nullptr)
-    {
-        cacheBufferId = agentConnectionManager->getCacheTransBufferManager()->assignBufferIndexForRecv();
-        TLLM_CHECK(cacheBufferId.has_value());
-        // memory Desp , validSegmentIdx send
-    }
+    auto cacheBufferId = prepareAgentCacheBuffer();
     
     // ... connection setup ...
     
     for (size_t i = 0; i < counterPartConnections.size(); i++)
     {
         auto const* connection = counterPartConnections[i];
-        // if Manager is agentConnectionManager, then send request info to agent
-        auto* agentConnectionManager = dynamic_cast<executor::kv_cache::AgentConnectionManager*>(mManager);
-        if (agentConnectionManager != nullptr)
-        {
-            // TODO: index -> validConnectionIdx conversion
-            auto valideConnectionIdx = std::find(pickUpIdx.begin(), pickUpIdx.end(), i) - pickUpIdx.begin();
-            auto* agentConnection = dynamic_cast<executor::kv_cache::AgentConnection const*>(connection);
-            TLLM_CHECK(agentConnection != nullptr);
-            TLLM_CHECK(cacheBufferId.has_value());
-            const_cast<executor::kv_cache::AgentConnection*>(agentConnection)
-                ->sendRequestAndBufferInfo(requestInfo, cacheBufferId, valideConnectionIdx);
-        }
-        else
-        {
-            sendRequestInfo(connection, requestInfo);
-        }
+        sendRequestInfoToConnection(connection, requestInfo, cacheBufferId, pickUpIdx, i);
     }
     // ... rest of method ...
 }

+private:
+    std::optional<size_t> prepareAgentCacheBuffer()
+    {
+        auto* agentManager = dynamic_cast<executor::kv_cache::AgentConnectionManager*>(mManager);
+        if (agentManager != nullptr)
+        {
+            auto bufferId = agentManager->getCacheTransBufferManager()->assignBufferIndexForRecv();
+            TLLM_CHECK(bufferId.has_value());
+            return bufferId;
+        }
+        return std::nullopt;
+    }
+
+    void sendRequestInfoToConnection(executor::kv_cache::Connection const* connection,
+        RequestInfo const& requestInfo, std::optional<size_t> cacheBufferId,
+        std::vector<size_t> const& pickUpIdx, size_t connectionIdx)
+    {
+        if (cacheBufferId.has_value())
+        {
+            auto validIdx = std::find(pickUpIdx.begin(), pickUpIdx.end(), connectionIdx) - pickUpIdx.begin();
+            auto* agentConn = dynamic_cast<executor::kv_cache::AgentConnection const*>(connection);
+            TLLM_CHECK(agentConn != nullptr);
+            const_cast<executor::kv_cache::AgentConnection*>(agentConn)
+                ->sendRequestAndBufferInfo(requestInfo, cacheBufferId, validIdx);
+        }
+        else
+        {
+            sendRequestInfo(connection, requestInfo);
+        }
+    }
📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 7d3e452 and 1a21dba.

📒 Files selected for processing (18)
  • cpp/include/tensorrt_llm/batch_manager/cacheTransceiver.h (2 hunks)
  • cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h (5 hunks)
  • cpp/include/tensorrt_llm/batch_manager/kvCacheUtils.h (1 hunks)
  • cpp/include/tensorrt_llm/batch_manager/llmRequest.h (0 hunks)
  • cpp/tensorrt_llm/batch_manager/CMakeLists.txt (0 hunks)
  • cpp/tensorrt_llm/batch_manager/cacheFormatter.cpp (3 hunks)
  • cpp/tensorrt_llm/batch_manager/cacheFormatter.h (2 hunks)
  • cpp/tensorrt_llm/batch_manager/cacheTransceiver.cpp (7 hunks)
  • cpp/tensorrt_llm/batch_manager/dataTransceiver.cpp (12 hunks)
  • cpp/tensorrt_llm/batch_manager/dataTransceiver.h (3 hunks)
  • cpp/tensorrt_llm/batch_manager/dataTransceiverImpl.cpp (0 hunks)
  • cpp/tensorrt_llm/batch_manager/dataTransceiverImpl.h (0 hunks)
  • cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp (3 hunks)
  • cpp/tensorrt_llm/batch_manager/mlaCacheFormatter.cpp (2 hunks)
  • cpp/tensorrt_llm/common/envUtils.cpp (0 hunks)
  • cpp/tensorrt_llm/executor/cache_transmission/agent_utils/connection.cpp (2 hunks)
  • cpp/tensorrt_llm/executor/cache_transmission/agent_utils/connection.h (2 hunks)
  • cpp/tensorrt_llm/executor/cache_transmission/ucx_utils/connection.cpp (1 hunks)
💤 Files with no reviewable changes (5)
  • cpp/tensorrt_llm/batch_manager/CMakeLists.txt
  • cpp/tensorrt_llm/common/envUtils.cpp
  • cpp/include/tensorrt_llm/batch_manager/llmRequest.h
  • cpp/tensorrt_llm/batch_manager/dataTransceiverImpl.h
  • cpp/tensorrt_llm/batch_manager/dataTransceiverImpl.cpp
🚧 Files skipped from review as they are similar to previous changes (5)
  • cpp/tensorrt_llm/executor/cache_transmission/agent_utils/connection.h
  • cpp/include/tensorrt_llm/batch_manager/cacheTransceiver.h
  • cpp/tensorrt_llm/executor/cache_transmission/ucx_utils/connection.cpp
  • cpp/tensorrt_llm/batch_manager/cacheTransceiver.cpp
  • cpp/tensorrt_llm/executor/cache_transmission/agent_utils/connection.cpp
🧰 Additional context used
📓 Path-based instructions (5)
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh}: In C++, close namespaces with a comment naming the namespace (e.g., } // namespace foo)
Prefer const/constexpr variables over #define for constants
Declare variables const if not modified after initialization
Use Allman brace style in C++
C++ filenames use lowerCamelCase and must be case-insensitively unique within a build target
C++ type names use UpperCamelCase
Local variables, methods, and namespaces use lowerCamelCase
Global non-static variables not in anonymous namespace use gPrefix lowerCamelCase (e.g., gExample)
Static globals or globals in anonymous namespaces use sPrefix lowerCamelCase
Locally visible static variables start with 's' (e.g., static std::once_flag sFlag;)
Member variables use mPrefix lowerCamelCase; public members may omit but are encouraged to use 'm'
Constants (enums, global/static/function-scope magic numbers) use kPREFIXED_UPPER_SNAKE (e.g., kDIGIT_NUM)
If macros are unavoidable, use UPPER_SNAKE_CASE (prefer constants over #define)
Constructor parameter that conflicts with a public member name gets trailing underscore (foo_)
Literal suffixes should be uppercase (e.g., 1234L not 1234l)
C++: use spaces only; indent 4 spaces
Run clang-format (LLVM style) before submitting; wrap lines at 120 characters
If formatting must be bypassed, use // clang-format off/on around the section
Prefer smart pointers; use unique_ptr for sole ownership, shared_ptr for shared; weak_ptr only in exceptional cases
Do not use deprecated pre-C++11 smart pointers
Use C++ style comments; avoid C comments except special inline cases; prefer // single-line
Capitalize and punctuate full-sentence comments
Follow Doxygen rules: use //! for comments and //!< for members in C++
Disable code with #if/#endif and mnemonic conditions; avoid commented-out code; avoid dead code
Do not throw exceptions across library boundaries
Use least-forceful casts; avoid removing const/volatile; avoid C-style and functional casts (except constructors); p...

Files:

  • cpp/include/tensorrt_llm/batch_manager/kvCacheUtils.h
  • cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp
  • cpp/tensorrt_llm/batch_manager/mlaCacheFormatter.cpp
  • cpp/tensorrt_llm/batch_manager/cacheFormatter.cpp
  • cpp/tensorrt_llm/batch_manager/dataTransceiver.cpp
  • cpp/tensorrt_llm/batch_manager/cacheFormatter.h
  • cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h
  • cpp/tensorrt_llm/batch_manager/dataTransceiver.h
**/*.{h,hpp,hxx,hh,cuh,cpp,cxx,cc,cu}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

Parameter names must be consistent between declarations and definitions

Files:

  • cpp/include/tensorrt_llm/batch_manager/kvCacheUtils.h
  • cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp
  • cpp/tensorrt_llm/batch_manager/mlaCacheFormatter.cpp
  • cpp/tensorrt_llm/batch_manager/cacheFormatter.cpp
  • cpp/tensorrt_llm/batch_manager/dataTransceiver.cpp
  • cpp/tensorrt_llm/batch_manager/cacheFormatter.h
  • cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h
  • cpp/tensorrt_llm/batch_manager/dataTransceiver.h
**/*.{h,hpp,hxx,hh,cuh}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

Header files must use include guards named TRTLLM_<FILENAME>_H without underscores prefix/suffix (e.g., TRTLLM_FOO_BAR_HELLO_H)

Files:

  • cpp/include/tensorrt_llm/batch_manager/kvCacheUtils.h
  • cpp/tensorrt_llm/batch_manager/cacheFormatter.h
  • cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h
  • cpp/tensorrt_llm/batch_manager/dataTransceiver.h
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh,py}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

Prepend NVIDIA copyright header (current year) to all source files

Files:

  • cpp/include/tensorrt_llm/batch_manager/kvCacheUtils.h
  • cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp
  • cpp/tensorrt_llm/batch_manager/mlaCacheFormatter.cpp
  • cpp/tensorrt_llm/batch_manager/cacheFormatter.cpp
  • cpp/tensorrt_llm/batch_manager/dataTransceiver.cpp
  • cpp/tensorrt_llm/batch_manager/cacheFormatter.h
  • cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h
  • cpp/tensorrt_llm/batch_manager/dataTransceiver.h
**/*.{cpp,cxx,cc,cu}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

**/*.{cpp,cxx,cc,cu}: Avoid literal values except for 0, nullptr, true, false; use named constexpr for other literals
Place semicolon of empty for/while loop on a new line
Always use brace-delimited bodies for switch/while/do-for/if/else
Use inline C comments in argument lists when parameter meaning is unclear (e.g., /* checkForErrors = */ false)
Do not use assignment in subexpressions (e.g., if (x = y) ... is forbidden)
Switch on enums should enumerate all values and omit default to catch new values at compile time
Structure switch statements; prohibit fallthrough except between empty cases; each case ends with break or throw; return at end of case not allowed; put break inside braces for compound case
Prefer anonymous namespaces over static for internal linkage of functions
Every defined function must be called at least once (no unused methods)

Files:

  • cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp
  • cpp/tensorrt_llm/batch_manager/mlaCacheFormatter.cpp
  • cpp/tensorrt_llm/batch_manager/cacheFormatter.cpp
  • cpp/tensorrt_llm/batch_manager/dataTransceiver.cpp
🧠 Learnings (3)
📚 Learning: 2025-08-06T08:18:28.669Z
Learnt from: zhengd-nv
PR: NVIDIA/TensorRT-LLM#6633
File: cpp/tensorrt_llm/batch_manager/dataTransceiverImpl.cpp:145-155
Timestamp: 2025-08-06T08:18:28.669Z
Learning: In cpp/tensorrt_llm/batch_manager/dataTransceiverImpl.cpp, the existing `mMtxForMap` mutex in DataSenderImpl is sufficient to synchronize measurement file operations in the `release` method, as all file operations occur within the same critical section that protects the `mRequestToSession` map access.

Applied to files:

  • cpp/tensorrt_llm/batch_manager/dataTransceiver.cpp
  • cpp/tensorrt_llm/batch_manager/dataTransceiver.h
📚 Learning: 2025-08-12T10:28:57.287Z
Learnt from: CR
PR: NVIDIA/TensorRT-LLM#0
File: CODING_GUIDELINES.md:0-0
Timestamp: 2025-08-12T10:28:57.287Z
Learning: Applies to **/*.{h,hpp,hxx,hh,cuh} : Header files must use include guards named TRTLLM_<FILENAME>_H without underscores prefix/suffix (e.g., TRTLLM_FOO_BAR_HELLO_H)

Applied to files:

  • cpp/tensorrt_llm/batch_manager/cacheFormatter.h
📚 Learning: 2025-08-08T05:06:31.537Z
Learnt from: sklevtsov-nvidia
PR: NVIDIA/TensorRT-LLM#3294
File: cpp/tensorrt_llm/cutlass_extensions/include/cutlass_extensions/epilogue/fusion/sm90_visitor_scatter.hpp:36-36
Timestamp: 2025-08-08T05:06:31.537Z
Learning: CUTLASS extension files (under cpp/tensorrt_llm/cutlass_extensions/) follow CUTLASS coding style conventions, including using #pragma once instead of TRTLLM_ prefixed header guards, even though they are .hpp files.

Applied to files:

  • cpp/tensorrt_llm/batch_manager/cacheFormatter.h
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Pre-commit Check
🔇 Additional comments (12)
cpp/tensorrt_llm/batch_manager/cacheFormatter.h (1)

52-53: Constructor parameter name should follow coding guidelines.

Parameter names should use lowerCamelCase without underscores according to the coding guidelines.

Apply this fix:

-    KvCacheMeasureHelper(std::string output_path)
-        : mOutputPath(std::move(output_path))
+    KvCacheMeasureHelper(std::string outputPath)
+        : mOutputPath(std::move(outputPath))
cpp/tensorrt_llm/batch_manager/mlaCacheFormatter.cpp (1)

96-97: LGTM! Correctly integrates session-based block hash transfer.

The implementation properly retrieves block hashes from the session and passes them to the updated getBlockRangeForSending function, aligning with the new session-based transfer architecture.

cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp (4)

1011-1015: LGTM! Properly implements block hash lookup in reuse tree.

The implementation correctly delegates to the cached blocks root to find matching blocks by hash, supporting the partial reuse and copy-on-partial-reuse configurations.


1468-1474: LGTM! Clean implementation of block pinning across window managers.

The implementation correctly iterates through all window managers to pin blocks for the given sequence, properly propagating the pinning request throughout the hierarchy.


1476-1484: LGTM! Correctly increments reference counts for pinned blocks.

The implementation properly pins all allocated blocks for a sequence by incrementing their reference counts, which will prevent them from being evicted.


2117-2121: LGTM! Clean public API for block pinning.

The implementation provides a simple and clean interface for pinning blocks by request ID, properly retrieving the sequence and delegating to the block manager.

cpp/tensorrt_llm/batch_manager/cacheFormatter.cpp (2)

158-159: LGTM! Correctly integrates block hashes from session.

The implementation properly retrieves block hashes from the session and passes them to the updated getBlockRangeForSending function.


49-56: Selective transfer gating intentionally replaced
The DisableSelectiveCacheTransfer env-var check has been removed and gating is now driven by poolNum and isEnableBlockReuse(). We confirmed:

  • getEnvDisableSelectiveCacheTransfer() is only declared (no usages)
  • In getBlockRangeForSending:
    • If poolNum>1 or !isEnableBlockReuse(), all blocks are sent
    • Otherwise (single pool & reuse enabled), only the requested blockHashes are sent
  • The receiving side (getBlockRangeForReceiving) uses the same single-pool/reuse condition for symmetry

No changes are needed—the new logic is intentional and covers all scenarios.

cpp/tensorrt_llm/batch_manager/dataTransceiver.h (2)

1-16: Copyright header updated to 2025 - looks good!

The copyright header correctly reflects the 2025 date for this new implementation.


115-158: Well-structured PIMPL implementation for CacheSender.

The class follows the PIMPL idiom correctly with clear method signatures and proper const-correctness. The documentation is comprehensive.

cpp/tensorrt_llm/batch_manager/dataTransceiver.cpp (2)

240-277: Good error handling in sendAndRemoveResponse.

The method properly handles exceptions and ensures promises are fulfilled or rejected appropriately. The noexcept specification is correct since exceptions are caught internally.


539-555: Resource management looks solid.

The getReceiveCacheResource method properly manages per-process resources with appropriate synchronization and lazy initialization.

Comment on lines +569 to +581
void pinBlocks(GenerationRequest& sequence);


🛠️ Refactor suggestion

Add documentation for the new pinBlocks method.

The new pinBlocks method lacks documentation explaining its purpose, behavior, and when it should be called. This is particularly important as it appears to be part of the public API.

Add documentation before the method declaration:

+    //! \brief Pin blocks associated with a sequence to prevent eviction.
+    //! \param sequence The generation request whose blocks should be pinned.
+    //! \details This method marks blocks as pinned in the KV cache to ensure they
+    //!          remain available for reuse across multiple requests.
     void pinBlocks(GenerationRequest& sequence);
🤖 Prompt for AI Agents
In cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h around lines 569-570,
the public method declaration void pinBlocks(GenerationRequest& sequence); is
missing documentation; add a brief doxygen-style comment immediately above the
declaration describing purpose (what pinning does), expected behavior (which
blocks are pinned, lifetime and thread-safety expectations), when callers should
invoke it (e.g., before generation, after allocation), parameters (explain
GenerationRequest reference and ownership/constness), return/exception behavior
(whether it can fail, how failures are reported), and any side effects (impact
on memory and eviction). Ensure the comment is concise, uses @param and
@throws/@note as appropriate, and matches the style of surrounding method docs
in this header.

Comment on lines 1074 to 1083
[[nodiscard]] WindowBlockManager::BlockMapIterRange getBlocksByHash(size_t hash, SizeType32 windowSize) const
{
return mWindowBlockManagers.at(windowSize).getBlocksByHash(hash);
}

[[nodiscard]] BlockPtr const& getBlockFromReuseTreeByHash(size_t hash, SizeType32 windowSize) const
{
return mWindowBlockManagers.at(windowSize).getBlockFromReuseTreeByHash(hash);
}


⚠️ Potential issue

Add bounds checking and documentation for new hash-based lookup methods.

The new hash-based lookup methods getBlocksByHash and getBlockFromReuseTreeByHash access the window manager map without bounds checking and lack documentation.

+    //! \brief Get blocks matching the given hash for a specific window size.
+    //! \param hash The hash value to search for.
+    //! \param windowSize The window size to search within.
+    //! \return Iterator range over matching blocks.
     [[nodiscard]] WindowBlockManager::BlockMapIterRange getBlocksByHash(size_t hash, SizeType32 windowSize) const
     {
+        auto it = mWindowBlockManagers.find(windowSize);
+        TLLM_CHECK_WITH_INFO(it != mWindowBlockManagers.end(), 
+            "Window size %d not found in window block managers", windowSize);
-        return mWindowBlockManagers.at(windowSize).getBlocksByHash(hash);
+        return it->second.getBlocksByHash(hash);
     }

+    //! \brief Get a block from the reuse tree matching the given hash.
+    //! \param hash The hash value to search for.
+    //! \param windowSize The window size to search within.
+    //! \return Reference to the matching block pointer.
     [[nodiscard]] BlockPtr const& getBlockFromReuseTreeByHash(size_t hash, SizeType32 windowSize) const
     {
+        auto it = mWindowBlockManagers.find(windowSize);
+        TLLM_CHECK_WITH_INFO(it != mWindowBlockManagers.end(), 
+            "Window size %d not found in window block managers", windowSize);
-        return mWindowBlockManagers.at(windowSize).getBlockFromReuseTreeByHash(hash);
+        return it->second.getBlockFromReuseTreeByHash(hash);
     }
🤖 Prompt for AI Agents
In cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h around lines
1074-1083, the two new methods call mWindowBlockManagers.at(windowSize) without
validating that windowSize exists and they lack documentation; update both
methods to first check that mWindowBlockManagers contains windowSize and handle
the missing-key case gracefully (e.g., return an empty BlockMapIterRange for
getBlocksByHash and return a safe null-equivalent BlockPtr reference or throw a
clear std::out_of_range with contextual message for
getBlockFromReuseTreeByHash), and add Doxygen-style comments above each method
describing the purpose, parameters, return value, and exception behavior so
callers know what to expect.

Comment on lines +1218 to +1222
virtual void pinBlocks(LlmRequest::RequestIdType requestId) = 0;


🛠️ Refactor suggestion

Document the pure virtual pinBlocks method in the base class.

The pure virtual pinBlocks method in BaseKVCacheManager lacks documentation explaining the contract that derived classes must fulfill.

+    //! \brief Pin blocks associated with a request to prevent eviction.
+    //! \param requestId The ID of the request whose blocks should be pinned.
+    //! \details Implementations should mark all blocks associated with the request
+    //!          as pinned to ensure they remain available for potential reuse.
     virtual void pinBlocks(LlmRequest::RequestIdType requestId) = 0;
🤖 Prompt for AI Agents
In cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h around lines
1218-1219, the pure virtual method `pinBlocks(LlmRequest::RequestIdType
requestId)` in BaseKVCacheManager is undocumented; add a brief doxygen-style
comment above the declaration that states the method's contract: what pinning
means, when it will be called, expected behavior (e.g., mark blocks for the
given requestId as pinned to prevent eviction), whether it must be thread-safe,
return/exception expectations (void, should not throw), and any required side
effects or preconditions so derived classes implement consistent behavior.

Comment on lines +1607 to +1611
void pinBlocks(LlmRequest::RequestIdType requestId);


⚠️ Potential issue

Missing override specifier on the pinBlocks declaration.

The pinBlocks method is declared without the override keyword, which could lead to subtle bugs if the signature doesn't match the base class exactly.

-    void pinBlocks(LlmRequest::RequestIdType requestId);
+    void pinBlocks(LlmRequest::RequestIdType requestId) override;
🤖 Prompt for AI Agents
In cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h around lines
1607-1608, the method declaration for pinBlocks is missing the override
specifier; update the declaration to match and explicitly mark it as overriding
the base class (e.g. change to "void pinBlocks(LlmRequest::RequestIdType
requestId) override;"), and verify the signature (const/&, noexcept) exactly
matches the base class method so the override is valid.

// TODO(zhengd): pass the hashes directly instead of from llmRequest; use hash instead of block num
auto const& ids = blockRange.getBlockIds();
blockRange.setBlockIds({ids.end() - requestBlockNum, ids.end()});
blockRange.setBlockIds(blockHashes);

⚠️ Potential issue

Type mismatch: setBlockIds expects std::vector&lt;SizeType32&gt; but receives std::vector&lt;size_t&gt;.

The blockRange.setBlockIds() method expects std::vector<SizeType32> according to kvCacheUtils.h, but blockHashes is of type std::vector<size_t>. This could cause issues on platforms where size_t and SizeType32 have different sizes.

Apply this fix to ensure type safety:

     if (requestBlockNum < blockRange.size() && requestBlockNum > 0)
     {
-        blockRange.setBlockIds(blockHashes);
+        blockRange.setBlockIdsFromHashes(blockHashes);
     }

This uses the correct method that accepts std::vector<size_t> for hashes.

🤖 Prompt for AI Agents
In cpp/tensorrt_llm/batch_manager/cacheFormatter.cpp around line 55,
blockRange.setBlockIds(blockHashes) passes a std::vector<size_t> to a function
expecting std::vector<SizeType32>; replace the call with the overload that
accepts std::vector<size_t> (e.g., blockRange.setBlockHashes(blockHashes)) so
the types match and avoid unsafe casts or platform-dependent size mismatches.

Comment on lines +168 to +174
void release(LlmRequest::RequestIdType requestId)
{
auto it = mRequestToSession.find(requestId);
TLLM_CHECK(it != mRequestToSession.end());
std::unique_lock<std::mutex> lk(mMtxForMap);
mRequestToSession.erase(it);
}

⚠️ Potential issue

Lock acquired too late in release method.

The iterator is obtained before acquiring the lock, which could lead to iterator invalidation if another thread modifies the map.

 void release(LlmRequest::RequestIdType requestId)
 {
+    std::unique_lock<std::mutex> lk(mMtxForMap);
     auto it = mRequestToSession.find(requestId);
     TLLM_CHECK(it != mRequestToSession.end());
-    std::unique_lock<std::mutex> lk(mMtxForMap);
     mRequestToSession.erase(it);
 }
🤖 Prompt for AI Agents
In cpp/tensorrt_llm/batch_manager/dataTransceiver.cpp around lines 168-174, the
code calls mRequestToSession.find(requestId) before acquiring mMtxForMap, which
can allow the map to be modified and invalidate the iterator; move the lock
acquisition to before the find, perform the lookup while holding the mutex,
check the iterator under the lock (TLLM_CHECK or equivalent), then erase under
the same lock (use std::unique_lock<std::mutex> or std::lock_guard<std::mutex>
to hold mMtxForMap for the duration of find + check + erase).

Comment on lines +137 to +138
/// @return The request information.
[[nodiscard]] RequestInfo recvRequestInfo();

🛠️ Refactor suggestion

Inconsistent method signatures between CacheSender and CacheReceiver.

The sendRequestInfo and receiveSync methods appear in both classes but with different purposes, which could be confusing:

  • CacheSender has sendRequestInfo returning TransferSession and receiveSync taking TransferSession&
  • CacheReceiver has the same methods with similar signatures

This naming overlap could lead to confusion about which class is responsible for what operations.

Consider renaming to clarify the direction of data flow:

 class CacheSender
 {
     // ...
-    [[nodiscard]] TransferSession sendRequestInfo(LlmRequest const& llmRequest);
-    void receiveSync(TransferSession& session);
+    [[nodiscard]] TransferSession prepareSendSession(LlmRequest const& llmRequest);
+    void completeSendSession(TransferSession& session);
 };

 class CacheReceiver
 {
     // ...
-    TransferSession sendRequestInfo(LlmRequest const& llmRequest);
-    void receiveSync(TransferSession& session);
+    TransferSession prepareReceiveSession(LlmRequest const& llmRequest);
+    void completeReceiveSession(TransferSession& session);
 };

Also applies to: 173-175

@Tabrizian Tabrizian changed the title Refactor dataTransciever classes feat: Add support for KVCache transfer from KVCache reuse path Aug 12, 2025
@Tabrizian Tabrizian force-pushed the user/imant/useHashIds branch from 1a21dba to bd2757d Compare August 13, 2025 00:27
@Tabrizian Tabrizian marked this pull request as draft August 13, 2025 00:28

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 8

♻️ Duplicate comments (8)
cpp/tensorrt_llm/batch_manager/dataTransceiver.h (1)

144-151: Inconsistent method signatures between CacheSender and CacheReceiver.

cpp/tensorrt_llm/batch_manager/dataTransceiver.cpp (3)

161-166: Check thread safety of session map access.


168-174: Lock acquired too late in release method.


430-434: Address TODO: avoid frequent thread creation.

cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h (4)

580-581: Add documentation for the new pinBlocks method.


1221-1222: Document the pure virtual pinBlocks method in the base class.


1610-1611: Missing implementation declaration for pinBlocks override.


901-901: Document the pinBlocks method in BlockManager.

The pinBlocks method in BlockManager needs documentation.

Add documentation:

+    /// @brief Pin blocks associated with a generation request to prevent eviction.
+    /// @param sequence The generation request whose blocks should be pinned.
     void pinBlocks(GenerationRequest& sequence);
🧹 Nitpick comments (1)
cpp/tensorrt_llm/batch_manager/dataTransceiver.h (1)

45-56: Consider making the tag constants more maintainable.

The TransceiverTag struct with enum Id and the tag constants looks good. However, the hardcoded integer values (19, 22, 32) for tags might benefit from documentation or being derived from the enum values to avoid magic numbers.

Consider adding a comment explaining the choice of these specific tag values:

 struct TransceiverTag
 {
     enum class Id : uint64_t
     {
         REQUEST_SEND = 1,
         TERMINATION = 2
     };

+    // Tag values for MPI/communication - chosen to avoid conflicts with other subsystems
     static constexpr int32_t kID_TAG{19};
     static constexpr int32_t kINFO_SIZE_TAG{22};
     static constexpr int32_t kINFO_TAG{32};
 };
📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 1a21dba and bd2757d.

📒 Files selected for processing (13)
  • cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h (8 hunks)
  • cpp/include/tensorrt_llm/batch_manager/kvCacheUtils.h (1 hunks)
  • cpp/include/tensorrt_llm/batch_manager/llmRequest.h (0 hunks)
  • cpp/tensorrt_llm/batch_manager/cacheFormatter.cpp (3 hunks)
  • cpp/tensorrt_llm/batch_manager/cacheFormatter.h (2 hunks)
  • cpp/tensorrt_llm/batch_manager/dataTransceiver.cpp (12 hunks)
  • cpp/tensorrt_llm/batch_manager/dataTransceiver.h (3 hunks)
  • cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp (3 hunks)
  • cpp/tensorrt_llm/batch_manager/mlaCacheFormatter.cpp (2 hunks)
  • cpp/tensorrt_llm/common/envUtils.cpp (0 hunks)
  • cpp/tensorrt_llm/executor/cache_transmission/agent_utils/connection.cpp (2 hunks)
  • cpp/tensorrt_llm/pybind/batch_manager/kvCacheManager.cpp (1 hunks)
  • tensorrt_llm/_torch/pyexecutor/py_executor.py (3 hunks)
💤 Files with no reviewable changes (2)
  • cpp/tensorrt_llm/common/envUtils.cpp
  • cpp/include/tensorrt_llm/batch_manager/llmRequest.h
🚧 Files skipped from review as they are similar to previous changes (5)
  • cpp/include/tensorrt_llm/batch_manager/kvCacheUtils.h
  • cpp/tensorrt_llm/batch_manager/cacheFormatter.cpp
  • cpp/tensorrt_llm/executor/cache_transmission/agent_utils/connection.cpp
  • cpp/tensorrt_llm/batch_manager/mlaCacheFormatter.cpp
  • cpp/tensorrt_llm/batch_manager/cacheFormatter.h
🧰 Additional context used
📓 Path-based instructions (6)
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh}: In C++, close namespaces with a comment naming the namespace (e.g., } // namespace foo)
Prefer const/constexpr variables over #define for constants
Declare variables const if not modified after initialization
Use Allman brace style in C++
C++ filenames use lowerCamelCase and must be case-insensitively unique within a build target
C++ type names use UpperCamelCase
Local variables, methods, and namespaces use lowerCamelCase
Global non-static variables not in anonymous namespace use gPrefix lowerCamelCase (e.g., gExample)
Static globals or globals in anonymous namespaces use sPrefix lowerCamelCase
Locally visible static variables start with 's' (e.g., static std::once_flag sFlag;)
Member variables use mPrefix lowerCamelCase; public members may omit but are encouraged to use 'm'
Constants (enums, global/static/function-scope magic numbers) use kPREFIXED_UPPER_SNAKE (e.g., kDIGIT_NUM)
If macros are unavoidable, use UPPER_SNAKE_CASE (prefer constants over #define)
Constructor parameter that conflicts with a public member name gets trailing underscore (foo_)
Literal suffixes should be uppercase (e.g., 1234L not 1234l)
C++: use spaces only; indent 4 spaces
Run clang-format (LLVM style) before submitting; wrap lines at 120 characters
If formatting must be bypassed, use // clang-format off/on around the section
Prefer smart pointers; use unique_ptr for sole ownership, shared_ptr for shared; weak_ptr only in exceptional cases
Do not use deprecated pre-C++11 smart pointers
Use C++ style comments; avoid C comments except special inline cases; prefer // single-line
Capitalize and punctuate full-sentence comments
Follow Doxygen rules: use //! for comments and //!< for members in C++
Disable code with #if/#endif and mnemonic conditions; avoid commented-out code; avoid dead code
Do not throw exceptions across library boundaries
Use least-forceful casts; avoid removing const/volatile; avoid C-style and functional casts (except constructors); p...

Files:

  • cpp/tensorrt_llm/pybind/batch_manager/kvCacheManager.cpp
  • cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h
  • cpp/tensorrt_llm/batch_manager/dataTransceiver.cpp
  • cpp/tensorrt_llm/batch_manager/dataTransceiver.h
  • cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp
**/*.{cpp,cxx,cc,cu}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

**/*.{cpp,cxx,cc,cu}: Avoid literal values except for 0, nullptr, true, false; use named constexpr for other literals
Place semicolon of empty for/while loop on a new line
Always use brace-delimited bodies for switch/while/do-for/if/else
Use inline C comments in argument lists when parameter meaning is unclear (e.g., /* checkForErrors = */ false)
Do not use assignment in subexpressions (e.g., if (x = y) ... is forbidden)
Switch on enums should enumerate all values and omit default to catch new values at compile time
Structure switch statements; prohibit fallthrough except between empty cases; each case ends with break or throw; return at end of case not allowed; put break inside braces for compound case
Prefer anonymous namespaces over static for internal linkage of functions
Every defined function must be called at least once (no unused methods)

Files:

  • cpp/tensorrt_llm/pybind/batch_manager/kvCacheManager.cpp
  • cpp/tensorrt_llm/batch_manager/dataTransceiver.cpp
  • cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp
**/*.{h,hpp,hxx,hh,cuh,cpp,cxx,cc,cu}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

Parameter names must be consistent between declarations and definitions

Files:

  • cpp/tensorrt_llm/pybind/batch_manager/kvCacheManager.cpp
  • cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h
  • cpp/tensorrt_llm/batch_manager/dataTransceiver.cpp
  • cpp/tensorrt_llm/batch_manager/dataTransceiver.h
  • cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp
**/*.{cpp,cxx,cc,cu,h,hpp,hxx,hh,cuh,py}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

Prepend NVIDIA copyright header (current year) to all source files

Files:

  • cpp/tensorrt_llm/pybind/batch_manager/kvCacheManager.cpp
  • tensorrt_llm/_torch/pyexecutor/py_executor.py
  • cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h
  • cpp/tensorrt_llm/batch_manager/dataTransceiver.cpp
  • cpp/tensorrt_llm/batch_manager/dataTransceiver.h
  • cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp
**/*.py

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

**/*.py: Python code must target Python 3.8+
Python indentation: 4 spaces, no tabs
Maintain module namespace in imports (from package.subpackage import foo; then use foo.SomeClass())
Python file names use snake_case
Python class names use PascalCase
Python functions/methods and local variables use snake_case; variables starting with a number get k_ prefix (e.g., k_99th_percentile)
Global variables use G_ prefixed UPPER_SNAKE_CASE (e.g., G_MY_GLOBAL)
Constants use UPPER_SNAKE_CASE in Python
Avoid shadowing variables from outer scopes in Python
Initialize all externally visible members of a Python class in __init__
Prefer docstrings for interfaces used outside a file; comments for local code
Use Google-style docstrings for classes and functions (Sphinx-parsable)
Document attributes/variables inline with short docstrings
Avoid reflection when simple alternatives exist (e.g., prefer explicit parameters over dict(**locals()))
In try/except, catch the narrowest exceptions possible
For duck-typing with try/except, keep try body minimal and put logic in else

Files:

  • tensorrt_llm/_torch/pyexecutor/py_executor.py
**/*.{h,hpp,hxx,hh,cuh}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

Header files must use include guards named TRTLLM_<NAME>_H without leading/trailing underscores (e.g., TRTLLM_FOO_BAR_HELLO_H)

Files:

  • cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h
  • cpp/tensorrt_llm/batch_manager/dataTransceiver.h
🧠 Learnings (1)
📚 Learning: 2025-08-06T08:18:28.669Z
Learnt from: zhengd-nv
PR: NVIDIA/TensorRT-LLM#6633
File: cpp/tensorrt_llm/batch_manager/dataTransceiverImpl.cpp:145-155
Timestamp: 2025-08-06T08:18:28.669Z
Learning: In cpp/tensorrt_llm/batch_manager/dataTransceiverImpl.cpp, the existing `mMtxForMap` mutex in DataSenderImpl is sufficient to synchronize measurement file operations in the `release` method, as all file operations occur within the same critical section that protects the `mRequestToSession` map access.

Applied to files:

  • cpp/tensorrt_llm/batch_manager/dataTransceiver.cpp
  • cpp/tensorrt_llm/batch_manager/dataTransceiver.h
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Pre-commit Check
🔇 Additional comments (10)
cpp/tensorrt_llm/batch_manager/dataTransceiver.h (1)

42-44: LGTM! Type aliases are well-defined.

The new type aliases SizeType32, TransferSession, and BaseCacheFormatter are correctly defined and improve code readability.

cpp/tensorrt_llm/pybind/batch_manager/kvCacheManager.cpp (1)

348-348: LGTM! Python binding for pinBlocks is correctly added.

The binding for pinBlocks is properly placed after scheduling_remove_sequence and correctly exposes the C++ method to Python as pin_blocks.

tensorrt_llm/_torch/pyexecutor/py_executor.py (2)

183-183: LGTM! Block reuse flag properly initialized.

The block_reuse_enabled attribute is correctly initialized from the model engine's KV cache configuration.


1641-1645: LGTM! Proper handling of block pinning based on reuse flag.

The conditional logic correctly pins blocks when block_reuse_enabled is True and maintains backward compatibility when it's False.

cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h (2)

118-119: LGTM! Hash member added to BlockKey.

The addition of the hash member to BlockKey is a good optimization to avoid recomputing hashes.


174-178: Good optimization for hash caching.

The modification to return the cached hash when available is an efficient optimization.

cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp (4)

1-16: LGTM: Properly formatted copyright header.

The NVIDIA copyright header is correctly formatted with the current year (2025) and follows the Apache 2.0 license template as required by the coding guidelines.


1485-1491: LGTM: Well-structured pin blocks delegation.

The function properly delegates pinning operations to all window block managers following the established pattern used throughout the codebase.


1493-1501: LGTM: Correct implementation of block pinning.

The function correctly increases the reference count for all allocated blocks belonging to the sequence, which prevents them from being evicted. This aligns with the intended pinning mechanism.


2134-2138: LGTM: Clean API for KV cache level block pinning.

The function provides a clean interface for pinning blocks at the KVCacheManager level by resolving the sequence from requestId and delegating to the BlockManager. This follows good encapsulation principles.

Comment on lines +132 to +135
explicit BlockKey(size_t hash)
: hash{hash}
{
}

🛠️ Refactor suggestion

Document the purpose of the hash-only constructor.

The constructor that takes only a hash value needs documentation explaining when it should be used.

Add documentation:

+    /// @brief Construct a BlockKey with a precomputed hash.
+    /// @param hash The precomputed hash value.
+    /// @details Used when retrieving blocks by hash from the reuse tree.
     explicit BlockKey(size_t hash)
         : hash{hash}
     {
     }
🤖 Prompt for AI Agents
In cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h around lines 132 to
135, add a documentation comment for the explicit BlockKey(size_t hash)
constructor that explains it constructs a key from a precomputed hash (used for
lookups/comparisons where only the hash is available), clarifies it does not
populate other key fields, notes any invariants or expected source of the hash
(e.g., result of BlockKey::hash() or external hashing function), and documents
intended usage scenarios (fast-lookup, temporary key for map/set operations) and
ownership/lifetime implications.

@@ -737,6 +750,8 @@
return 0;
}

[[nodiscard]] std::optional<KVCacheBlock> findBlocksInReuseTreeByHashes(std::vector<size_t> const& hashes) const;

🛠️ Refactor suggestion

Document the new hash-based lookup method.

The findBlocksInReuseTreeByHashes method needs documentation explaining its purpose and return value.

Add documentation:

+    /// @brief Find blocks in the reuse tree matching the given hashes.
+    /// @param hashes Vector of hash values to search for.
+    /// @return Optional KVCacheBlock if matching blocks are found.
     [[nodiscard]] std::optional<KVCacheBlock> findBlocksInReuseTreeByHashes(std::vector<size_t> const& hashes) const;
🤖 Prompt for AI Agents
In cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h around line 753, the
new method declaration findBlocksInReuseTreeByHashes lacks documentation; add a
Doxygen-style comment above it that briefly describes the method's purpose
(lookup KVCacheBlock(s) in the reuse tree by a vector of hash values), explains
the parameter (hashes: vector of size_t hash keys to match), describes the
return value (std::optional<KVCacheBlock> — present when a matching block is
found, std::nullopt when none match), and any thread-safety or performance notes
(e.g., const method, does not modify state, complexity if relevant). Ensure the
comment is concise and follows existing file doc style.

Comment on lines +41 to +45
static int32_t tagFromRequestId(LlmRequest::RequestIdType requestId)
{
constexpr int32_t kDATA_TAG{43};
return ((requestId & 0xFFF) << 8) | (kDATA_TAG & 0xFF);
}

🛠️ Refactor suggestion

Document the magic number in tagFromRequestId.

The function uses a magic number 0xFFF for masking and 43 as a base tag. These should be documented or made configurable.

Add documentation and consider using named constants:

+// Generate a unique tag for MPI communication based on request ID
+// Uses lower 12 bits of request ID to avoid tag overflow
+static constexpr int32_t kRequestIdMask{0xFFF};
+static constexpr int32_t kDataTagBase{43};
+
 static int32_t tagFromRequestId(LlmRequest::RequestIdType requestId)
 {
-    constexpr int32_t kDATA_TAG{43};
-    return ((requestId & 0xFFF) << 8) | (kDATA_TAG & 0xFF);
+    return ((requestId & kRequestIdMask) << 8) | (kDataTagBase & 0xFF);
 }
🤖 Prompt for AI Agents
In cpp/tensorrt_llm/batch_manager/dataTransceiver.cpp around lines 41 to 45, the
bitmask 0xFFF and the numeric tag 43 are used without explanation; replace these
magic numbers with clearly named constexpr constants (e.g., kRequestIdMask =
0xFFF, kRequestIdShift = 8, kDataTag = 43 or an enum for tag types), add a short
comment documenting the tag layout (how many bits are reserved for requestId,
where the tag sits, why 0xFFF is chosen), and consider making kDataTag
configurable at a higher scope if different tags may be needed; then update the
function to use those named constants so the intent is explicit and
maintainable.

Comment on lines +136 to +138
/// @brief Receive the request information.
/// @return The request information.
[[nodiscard]] RequestInfo recvRequestInfo();
Contributor

🛠️ Refactor suggestion

Document the purpose of new methods in CacheSender.

The recvRequestInfo method lacks documentation explaining its purpose and when it should be called.

Add documentation:

+    /// @brief Receive request information from a peer.
+    /// @return The request information received.
     [[nodiscard]] RequestInfo recvRequestInfo();

Committable suggestion skipped: line range outside the PR's diff.

🤖 Prompt for AI Agents
In cpp/tensorrt_llm/batch_manager/dataTransceiver.h around lines 136-138, the
recvRequestInfo() declaration lacks a descriptive comment; add a doc comment
that states this method receives and returns the next RequestInfo from the
transceiver, describes the expected caller (e.g., worker or consumer thread),
when it should be called (blocking vs non-blocking behavior, or only after a
prior handshake/enqueue), what the return value represents and ownership
semantics, and any thread-safety or error conditions (e.g., may throw or return
empty/default if closed). Ensure the comment is concise and placed immediately
above the method declaration.

Comment on lines +140 to +142
/// @brief Synchronously send data.
/// @param llmRequest The request object to which the data belongs.
void sendSync(LlmRequest const& llmRequest);
Contributor

🛠️ Refactor suggestion

Document the synchronous send method.

The sendSync method lacks documentation explaining its blocking behavior and prerequisites.

Add documentation:

+    /// @brief Synchronously send data for the given request.
+    /// @param llmRequest The request object containing the data to send.
+    /// @details This method blocks until the data is sent. The request's data must be ready when called.
     void sendSync(LlmRequest const& llmRequest);

Committable suggestion skipped: line range outside the PR's diff.

🤖 Prompt for AI Agents
In cpp/tensorrt_llm/batch_manager/dataTransceiver.h around lines 140-142, the
sendSync declaration lacks a detailed doc comment describing that it is a
blocking, synchronous send; update the Doxygen comment to state that sendSync
blocks the caller until the data for the provided LlmRequest is fully
transmitted (or an acknowledgement/confirmation is received), list preconditions
(e.g., required buffer/state, not to be called from UI/main thread), specify
thread-safety (whether it is thread-safe or must be serialized by caller),
document failure modes and how they are reported (exceptions thrown or error
return semantics) and any timeout or cancellation behavior; keep the comment
concise and actionable so callers know when to use sendSync vs async send.

Comment on lines +1011 to +1032
std::optional<KVCacheBlock> WindowBlockManager::findBlocksInReuseTreeByHashes(std::vector<size_t> const& hashes) const
{
    std::vector<BlockKey> blockKeys;
    for (auto const& hash : hashes)
    {
        blockKeys.emplace_back(hash);
    }

    auto searchRoot = mCachedBlocksRoot;
    for (auto const& blockKey : blockKeys)
    {
        auto [partialMatch, numMatched, matchingBlock] = searchRoot != nullptr
            ? searchRoot->findMatchingBlock(blockKey, false, false)
            : std::make_tuple(false, 0, nullptr);
        if (matchingBlock == nullptr)
        {
            return nullptr;
        }
        searchRoot = std::move(matchingBlock);
    }
    return searchRoot;
}
Contributor

🛠️ Refactor suggestion

Add input validation and improve error handling.

The function doesn't validate the input hashes parameter and could return unexpected results if an empty vector is passed.

Apply this diff to add input validation:

 std::optional<KVCacheBlock> WindowBlockManager::findBlocksInReuseTreeByHashes(std::vector<size_t> const& hashes) const
 {
+    if (hashes.empty())
+    {
+        return std::nullopt;
+    }
+    
     std::vector<BlockKey> blockKeys;
     for (auto const& hash : hashes)
     {
         blockKeys.emplace_back(hash);
     }

     auto searchRoot = mCachedBlocksRoot;
     for (auto const& blockKey : blockKeys)
     {
         auto [partialMatch, numMatched, matchingBlock] = searchRoot != nullptr
             ? searchRoot->findMatchingBlock(blockKey, false, false)
             : std::make_tuple(false, 0, nullptr);
         if (matchingBlock == nullptr)
         {
-            return nullptr;
+            return std::nullopt;
         }
         searchRoot = std::move(matchingBlock);
     }
     return searchRoot;
 }
🤖 Prompt for AI Agents
In cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp around lines 1011 to 1032,
validate the input and improve null handling: if hashes.empty() immediately
return std::nullopt; if mCachedBlocksRoot is nullptr also return std::nullopt
before looping; inside the loop keep the existing search logic but when
matchingBlock is nullptr return std::nullopt (not nullptr); at the end, return a
populated std::optional<KVCacheBlock> constructed from the found block (or
std::nullopt if no block), ensuring you convert pointer results into the
optional return type consistently.

💡 Verification agent

🧩 Analysis chain

Verify integration with new BlockKey hash constructor.

The function relies on BlockKey(hash) constructor to initialize keys from precomputed hashes. Ensure this constructor is properly implemented and tested.


🏁 Script executed:

#!/bin/bash
# Search for BlockKey constructor that takes a hash parameter
ast-grep --pattern $'class BlockKey {
  $$$
  BlockKey($_, size_t $_) {
    $$$
  }
  $$$
}'

# Also search for any BlockKey constructor with single parameter
ast-grep --pattern $'BlockKey(size_t $_)'

Length of output: 136


🏁 Script executed:

#!/bin/bash
# Search for BlockKey definition and any constructors across the repo
rg "class BlockKey" -n -C3
rg "struct BlockKey" -n -C3

# Look for constructor implementations and declarations
rg "BlockKey::BlockKey" -n -C3
rg "explicit BlockKey" -n -C3
rg "BlockKey(size_t" -n -C3

Length of output: 3613


🏁 Script executed:

#!/bin/bash
# Display BlockKeyTest to verify coverage of the size_t constructor
sed -n '1,200p' cpp/tests/batch_manager/blockKeyTest.cpp

Length of output: 674


Ensure unit tests cover the hash‐based BlockKey constructor

The explicit BlockKey(size_t) ctor in kvCacheManager.h (lines 132–136) is used by findBlocksInReuseTreeByHashes but isn’t exercised by any existing test. Please add or extend unit tests to cover:

  • Direct construction via BlockKey(hash)
  • Its interaction with findMatchingBlock/findBlocksInReuseTreeByHashes

This will validate that precomputed hashes are handled correctly.

🤖 Prompt for AI Agents
In cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp around lines 1011 to 1032,
add unit tests that exercise the explicit BlockKey(size_t) constructor and its
use in findBlocksInReuseTreeByHashes: construct BlockKey objects directly from
precomputed hashes (BlockKey(hash)), insert/create cached blocks in the reuse
tree as needed, call findBlocksInReuseTreeByHashes with vectors of those hashes,
and assert the function returns the expected KVCacheBlock (positive case) and
returns null/empty when hashes don’t match (negative case); include a test that
verifies findMatchingBlock works equivalently when using the size_t ctor versus
the standard ctor to ensure hash-based lookup behaves correctly.

Comment on lines +961 to +965
if not self.block_reuse_enabled:
ctx_transmission_reqs = self._send_disagg_ctx_cache(
scheduled_batch.context_requests
) if self.kv_cache_transceiver else []

Contributor

⚠️ Potential issue

Review the duplicate disagg context cache sending logic.

The code appears to have duplicate logic for sending disagg context cache - once when block_reuse_enabled is False and again unconditionally. This could lead to sending the cache twice.

The logic should be consolidated:

-    if not self.block_reuse_enabled:
-        ctx_transmission_reqs = self._send_disagg_ctx_cache(
-            scheduled_batch.context_requests
-        ) if self.kv_cache_transceiver else []
-
     if self.kv_cache_transceiver:
         ctx_transmission_reqs = self._send_disagg_ctx_cache(
             scheduled_batch.context_requests)

Committable suggestion skipped: line range outside the PR's diff.

🤖 Prompt for AI Agents
In tensorrt_llm/_torch/pyexecutor/py_executor.py around lines 961 to 965, there
is duplicate logic that can send the disagg context cache twice: first inside
the if not self.block_reuse_enabled branch and then again unconditionally;
consolidate this by sending the cache exactly once—either keep the send inside
the block_reuse_enabled=False branch and remove the unconditional send, or make
the unconditional send conditional on block_reuse_enabled being False; ensure
the send uses the same kv_cache_transceiver guard (i.e., only call
self._send_disagg_ctx_cache(...) when self.kv_cache_transceiver is truthy) and
remove the redundant call so the cache cannot be sent twice.

@@ -80,6 +80,19 @@ class BlockRange
mBlockIds = std::move(blockIds);
}

void setBlockIdsFromHashes(std::vector<size_t> blockHashes)
Collaborator

const std::vector<size_t>& is better here; it avoids copying the hash vector on every call.

@@ -39,23 +39,20 @@
namespace tensorrt_llm::batch_manager::kv_cache_manager
{

-BlockRange getBlockRangeForSending(BaseKVCacheManager* cacheManager, LlmRequest const& llmRequest)
+BlockRange getBlockRangeForSending(
+    BaseKVCacheManager* cacheManager, LlmRequest const& llmRequest, std::vector<size_t> const& blockHashes)
Collaborator

Is it possible to remove the llmRequest parameter in the future?
