[None][feat] Support NVFP4 KV Cache #6244
base: main
📝 Walkthrough

Adds end-to-end NVFP4 (FP4) KV-cache support: removes `quantSize`, introduces block-scale pool pointers and per-container element accounting, propagates FP4 block-scale buffers through attention/XQA/FMHA/kernels/bindings/torch layers, consolidates Q/K/V scale fields, and updates tests and resource sizing.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Host
    participant AttentionOp
    participant XqaDispatcher
    participant FMHA_Dispatcher
    participant GPU_Kernels
    rect rgb(240,248,255)
        Host->>AttentionOp: enqueue(params, kv_cache_buffer, kv_cache_block_scales_buffer)
    end
    AttentionOp->>XqaDispatcher: run(xqaParams, kv_cache_buffer, kv_cache_block_scales_buffer)
    XqaDispatcher->>FMHA_Dispatcher: run(..., pagedKvCache, pagedKvSfCache)
    FMHA_Dispatcher->>GPU_Kernels: launch(kernelParams with kvPtr, kvSfPtr)
    GPU_Kernels-->>Host: results
```
```mermaid
sequenceDiagram
    participant Host
    participant KVCacheManager
    participant Pools
    rect rgb(248,248,255)
        Host->>KVCacheManager: construct WindowBlockManager(...)
        KVCacheManager->>Pools: create data pools using sizePerHead / numEltsPerContainer
        KVCacheManager->>Pools: create block-scale pools sized per quantBlock using blockScaleSizePerHead
        Pools-->>Host: return data + block-scale pool pointers
    end
```
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60 minutes
Actionable comments posted: 2
🔭 Outside diff range comments (1)
cpp/tensorrt_llm/kernels/decoderMaskedMultiheadAttention/decoderXQAImplPrecompiled.cpp (1)

Lines 215-227: `kv_cache_block_scales_buffer` is left uninitialised – likely a functional bug for the FP4 path

With NVFP4 support, the preprocessing kernel expects a valid block-scale buffer. Here the field is set to an empty brace-initialiser (`{}`), while the immediately adjacent scale pointer `kv_scale_orig_quant` is taken from `xqaParams`. Unless the preprocessing kernel is explicitly guarded to ignore the field for every non-FP4 run, this will dereference a null pointer once the FP4 KV-cache code path is exercised.

```diff
- preprocessingParams.kv_cache_block_scales_buffer = {};
+ // Pass through the block-scale pool allocated in the caller
+ preprocessingParams.kv_cache_block_scales_buffer = xqaParams.kv_cache_block_scales_buffer;
```

Please ensure:
- `xqaParams` exposes `kv_cache_block_scales_buffer` for the generation phase, and
- all kernel-launch helpers (`invokeQKVPreprocessing`, downstream FMHA) have the same expectation.

Failing to wire this pointer will produce silent wrong results or a hard fault when FP4 is enabled.
♻️ Duplicate comments (21)
cpp/tensorrt_llm/kernels/trtllmGenKernels/fmha/cubin/FmhaSm100Kernel_QE4m3KvE2m1OE4m3H128PagedKvDenseP32VarSeqQ16Kv128PersistentSwapsAbForGen_cubin.cpp (1)
Lines 2-3: Same provenance concern as above. The cubin hash & size changed but no build record is provided.
cpp/tensorrt_llm/kernels/trtllmGenKernels/fmha/cubin/FmhaSm100Kernel_QE4m3KvE2m1OE4m3H128PagedKvDenseP32MultiCtasKvCgaVarSeqQ8Kv128StaticSwapsAbForGen_cubin.cpp (1)
Lines 2-3: Same provenance concern as above. The cubin hash & size changed but no build record is provided.
cpp/tensorrt_llm/kernels/trtllmGenKernels/fmha/cubin/FmhaSm100Kernel_QE4m3KvE2m1OE4m3H128PagedKvDenseP64MultiCtasKvCgaVarSeqQ16Kv128StaticSwapsAbForGen_cubin.cpp (1)
Lines 2-3: Same provenance concern as above. The cubin hash & size changed but no build record is provided.
cpp/tensorrt_llm/kernels/trtllmGenKernels/fmha/cubin/FmhaSm100Kernel_QE4m3KvE2m1OE4m3H128PagedKvSlidingOrChunkedCausalP32VarSeqQ128Kv128StaticContext_cubin.cpp (1)
Lines 2-3: Same provenance concern as above. The cubin hash & size changed but no build record is provided.
cpp/tensorrt_llm/kernels/trtllmGenKernels/fmha/cubin/FmhaSm100Kernel_QE4m3KvE2m1OE4m3H128PagedKvDenseP32VarSeqQ8Kv128StaticSwapsAbForGen_cubin.cpp (1)
Lines 2-3: Same provenance concern as the first cubin. See the comment on the first cubin file; the same reproducibility note applies here.
cpp/tensorrt_llm/kernels/trtllmGenKernels/fmha/cubin/FmhaSm100Kernel_QE4m3KvE2m1OE4m3H64PagedKvSlidingOrChunkedCausalP32MultiCtasKvVarSeqQ16Kv128StaticSwapsAbForGen_cubin.cpp (1)
Lines 2-3: Same provenance concern as the first cubin. See the comment on the first cubin file; the same reproducibility note applies here.
cpp/tensorrt_llm/kernels/trtllmGenKernels/fmha/cubin/FmhaSm100Kernel_QE4m3KvE2m1OE4m3H128PagedKvDenseP64VarSeqQ16Kv128StaticSwapsAbForGen_cubin.cpp (1)
Lines 2-3: Same provenance concern as the first cubin. See the comment on the first cubin file; the same reproducibility note applies here.
cpp/tensorrt_llm/kernels/trtllmGenKernels/fmha/cubin/FmhaSm100Kernel_QE4m3KvE2m1OE4m3H128PagedKvDenseP32VarSeqQ8Kv128PersistentSwapsAbForGen_cubin.cpp (1)
Lines 2-3: Same provenance concern as the first cubin. See the comment on the first cubin file; the same reproducibility note applies here.
cpp/tensorrt_llm/kernels/trtllmGenKernels/fmha/cubin/FmhaSm100Kernel_QE4m3KvE2m1OE4m3H128PagedKvDenseP32VarSeqQ128Kv128PersistentContext_cubin.cpp (1)
Lines 1-4: Same verification needed as for the first cubin update. See the previous comment – identical reproducibility and registration checks apply here.
cpp/tensorrt_llm/kernels/trtllmGenKernels/fmha/cubin/FmhaSm100Kernel_QE4m3KvE2m1OE4m3H128PagedKvDenseP64MultiCtasKvVarSeqQ16Kv128StaticSwapsAbForGen_cubin.cpp (1)
Lines 1-4: Same verification needed as for the first cubin update. See the previous comment – identical reproducibility and registration checks apply here.
cpp/tensorrt_llm/kernels/trtllmGenKernels/fmha/cubin/FmhaSm100Kernel_QE4m3KvE2m1OE4m3H128PagedKvCausalP64VarSeqQ128Kv128PersistentContext_cubin.cpp (1)
Lines 1-4: Same verification needed as for the first cubin update. See the previous comment – identical reproducibility and registration checks apply here.
cpp/tensorrt_llm/kernels/trtllmGenKernels/fmha/cubin/FmhaSm100Kernel_QE4m3KvE2m1OE4m3H128PagedKvCausalP32VarSeqQ128Kv128StaticContext_cubin.cpp (1)
Lines 1-4: Same verification needed as for the first cubin update. See the previous comment – identical reproducibility and registration checks apply here.
cpp/tensorrt_llm/kernels/trtllmGenKernels/fmha/cubin/FmhaSm100Kernel_QE4m3KvE2m1OE4m3H64PagedKvDenseP64VarSeqQ16Kv128StaticSwapsAbForGen_cubin.cpp (1)
Lines 2-3: Same concern as in the previous cubin file: please document build provenance and guarantee test coverage for NVFP4 paths.
cpp/tensorrt_llm/kernels/trtllmGenKernels/fmha/cubin/FmhaSm100Kernel_QE4m3KvE2m1OE4m3H64PagedKvSlidingOrChunkedCausalP32VarSeqQ128Kv128StaticContext_cubin.cpp (1)
Lines 2-3: Same concern as in the previous cubin file: please document build provenance and guarantee test coverage for NVFP4 paths.
cpp/tensorrt_llm/kernels/trtllmGenKernels/fmha/cubin/FmhaSm100Kernel_QE4m3KvE2m1OE4m3H64PagedKvCausalP32VarSeqQ128Kv128StaticContext_cubin.cpp (1)
Lines 2-3: Same concern as in the previous cubin file: please document build provenance and guarantee test coverage for NVFP4 paths.
cpp/tensorrt_llm/kernels/trtllmGenKernels/fmha/cubin/FmhaSm100Kernel_QE4m3KvE2m1OE4m3H64PagedKvDenseP64VarSeqQ16Kv128PersistentSwapsAbForGen_cubin.cpp (1)
Lines 2-3: Same concern as in the previous cubin file: please document build provenance and guarantee test coverage for NVFP4 paths.
cpp/tensorrt_llm/kernels/trtllmGenKernels/fmha/cubin/FmhaSm100Kernel_QE4m3KvE2m1OE4m3H64PagedKvDenseP32MultiCtasKvCgaVarSeqQ16Kv128StaticSwapsAbForGen_cubin.cpp (1)
Lines 2-3: Same verification needed for the updated cubin. New hash 1428fe48…, size 596 486. Re-run the HW/QA checks described in the previous comment.
cpp/tensorrt_llm/kernels/trtllmGenKernels/fmha/cubin/FmhaSm100Kernel_QE4m3KvE2m1OE4m3H128PagedKvSlidingOrChunkedCausalP64VarSeqQ8Kv128PersistentSwapsAbForGen_cubin.cpp (1)
Lines 2-3: Same verification needed for the updated cubin. Hash 42163217…, size 647 265.
cpp/tensorrt_llm/kernels/trtllmGenKernels/fmha/cubin/FmhaSm100Kernel_QE4m3KvE2m1OE4m3H64PagedKvSlidingOrChunkedCausalP32VarSeqQ8Kv128StaticSwapsAbForGen_cubin.cpp (1)
Lines 2-3: Same verification needed for the updated cubin. Hash 5a3141a6…, size 493 038.
cpp/tensorrt_llm/kernels/trtllmGenKernels/fmha/cubin/FmhaSm100Kernel_QE4m3KvE2m1OE4m3H128PagedKvDenseP64VarSeqQ128Kv128StaticContext_cubin.cpp (1)
Lines 2-3: Same verification needed for the updated cubin. Hash d91996ed…, size 878 391.
cpp/tensorrt_llm/kernels/trtllmGenKernels/fmha/cubin/FmhaSm100Kernel_QE4m3KvE2m1OE4m3H64PagedKvSlidingOrChunkedCausalP32VarSeqQ16Kv128PersistentSwapsAbForGen_cubin.cpp (1)
Lines 2-3: Same concern as above – confirm compatibility after the SHA change. Ensure the new 5f3641e5… cubin is detectable and selected by the runtime for the H64 paged-KV variant; otherwise the kernel picker may silently fall back to a slower path.
🧹 Nitpick comments (12)
cpp/tensorrt_llm/kernels/trtllmGenKernels/fmha/cubin/FmhaSm100Kernel_QE4m3KvE2m1OE4m3H64PagedKvDenseP64VarSeqQ8Kv128StaticSwapsAbForGen_cubin.cpp (1)
Lines 2-3: Add provenance / build-time metadata for the regenerated cubin

Only the LFS pointer hash & size changed. Without any accompanying information (PTX version, compiler flags, SM target, git SHA of the generator, etc.) it's impossible for downstream maintainers to reproduce or audit the new binary, and CI artifact drift can creep in silently.

Please attach (e.g., in docs/kernel_manifests/…json) a small manifest that records:
- kernel name
- commit that produced it
- exact nvcc/ptxas flags & CUDA version
- git SHA of the source .cu template
- date / author

This keeps the binaries traceable and eases security review.
cpp/tensorrt_llm/kernels/trtllmGenKernels/fmha/cubin/FmhaSm100Kernel_QE4m3KvE2m1OE4m3H64PagedKvSlidingOrChunkedCausalP32VarSeqQ16Kv128StaticSwapsAbForGen_cubin.cpp (1)
Lines 2-3: Document build provenance for the new CUBIN to keep the binary audit-ready

Only the LFS pointer (SHA-256 + size) changed, so reviewers cannot reason about functional differences. Please record in the PR (or a BUILD.md) the exact commit / branch of the kernel generator, compiler versions, and flags used to produce this cubin, or add a CI rule that rebuilds and diff-checks it. That keeps the artifact reproducible and prevents silent drift.
cpp/tensorrt_llm/kernels/trtllmGenKernels/fmha/cubin/FmhaSm100Kernel_QE4m3KvE2m1OE4m3H128PagedKvSlidingOrChunkedCausalP64MultiCtasKvCgaVarSeqQ8Kv128StaticSwapsAbForGen_cubin.cpp (1)
Lines 2-3: Confirm provenance & test coverage for the new cubin

Only the LFS pointer (SHA-256 + size) changed, so the actual binary was swapped in. Because sources & build scripts for these kernels are not part of the repo, reviewers cannot independently verify that the new cubin really corresponds to the advertised NVFP4 support, compiler flags, etc.

Action items:
- Add a short note to the PR description (or a BUILD.md snippet) with:
  - generator commit hash / tag
  - compiler version & flags
  - explicit opt-ins (e.g. -DNVFP4_KV_CACHE=ON)
  - SM target list
- Ensure CI executes FMHA unit / integration tests with DataType::kNvFp4 so the new blob is actually exercised.

This makes future rebuilds reproducible and prevents silent regressions.
cpp/tensorrt_llm/kernels/trtllmGenKernels/fmha/cubin/FmhaSm100Kernel_QE4m3KvE2m1OE4m3H64PagedKvSlidingOrChunkedCausalP64VarSeqQ16Kv128StaticSwapsAbForGen_cubin.cpp (1)
Lines 2-3: Minor: keep the commit message listing all new hashes

Helps future bisects if each updated cubin hash is recorded in the commit message or a CHANGELOG entry.
cpp/tensorrt_llm/kernels/trtllmGenKernels/fmha/cubin/FmhaSm100Kernel_QE4m3KvE2m1OE4m3H128PagedKvDenseP64MultiCtasKvVarSeqQ8Kv128StaticSwapsAbForGen_cubin.cpp (1)
Lines 2-3: Cubin replaced – confirm deterministic rebuild

Because the binary shrank ~35 %, verify that:
- the compile flags (--use_fast_math, PTX version, etc.) match the rest of the FP4 cubins;
- the new build is reproducible; otherwise, future patch releases may diverge.

If you have a reproducibility script, please drop a reference in docs/release_build.md.
cpp/tensorrt_llm/common/attentionOp.h (1)
Lines 132-183: Consider adding the new pointers to debug output for consistency

The enqueueContextParamsToString() function includes the existing host_primary_pool_pointer and host_secondary_pool_pointer (lines 168-169) but doesn't include the newly added scale pool pointers. For debugging consistency, consider adding them:

```diff
 ss << "host_primary_pool_pointer: " << this->host_primary_pool_pointer << std::endl;
 ss << "host_secondary_pool_pointer: " << this->host_secondary_pool_pointer << std::endl;
+ss << "host_primary_block_scale_pool_pointer: " << this->host_primary_block_scale_pool_pointer << std::endl;
+ss << "host_secondary_block_scale_pool_pointer: " << this->host_secondary_block_scale_pool_pointer << std::endl;
```

cpp/tensorrt_llm/kernels/trtllmGenKernels/fmha/cubin/FmhaSm100Kernel_QE4m3KvE2m1OE4m3H64PagedKvDenseP64MultiCtasKvCgaVarSeqQ16Kv128StaticSwapsAbForGen_cubin.cpp (1)
Lines 2-3: Dense-P64 cubin swapped – ensure LFS pointer committed

Just a reminder: confirm .gitattributes still tracks *.cubin.cpp with filter=lfs so the real binary is pulled in downstream clones; otherwise buildbots without LFS will fail at link time.
cpp/tests/unit_tests/batch_manager/kvCacheManagerTest.cpp (2)
Lines 537-538: Avoid hard-coding magic constants; reuse implementation-side definitions

numFp4EltsPerContainer and vectorSize duplicate values that already exist in production code (kNumFp4EltsPerContainer, kVectorSize in kvCacheUtils.h). Importing those symbols (or exposing them via a dedicated test header) prevents drift if the implementation changes.

```diff
- auto constexpr numFp4EltsPerContainer = 2;
- auto constexpr vectorSize = 16;
+#include "tensorrt_llm/kernels/kvCacheUtils.h"
+ auto constexpr numFp4EltsPerContainer = tk::kNumFp4EltsPerContainer;
+ auto constexpr vectorSize = tk::kVectorSize;
```

Lines 560-563: Assertion looks correct but can overflow with larger sizes

The expression blockManager.getBlockSize(0) * numFp4EltsPerContainer is a 32-bit multiplication (it returns SizeType32). With larger heads / tokens an overflow is possible before the subsequent / vectorSize. Cast to 64-bit to make the test future-proof:

```diff
- EXPECT_EQ(blockManager.getBlockSize(0) * numFp4EltsPerContainer / vectorSize,
-     blockManager.getBlockSize(1));
+ EXPECT_EQ(static_cast<std::uint64_t>(blockManager.getBlockSize(0)) * numFp4EltsPerContainer / vectorSize,
+     static_cast<std::uint64_t>(blockManager.getBlockSize(1)));
```

cpp/tensorrt_llm/thop/attentionOp.cpp (2)
Lines 226-241: Correct bit-level calculation for FP4 support

The change from byte-based to bit-based calculation properly handles FP4's 4-bit elements. Consider renaming the variable to cache_elem_size_bits for clarity.

```diff
- // The cache element size in bits.
- int cache_elem_bits;
+ // The cache element size in bits.
+ int cache_elem_size_bits;
```

Line 738: Correct extension of FP8ContextFMHA to support FP4

The flag now properly enables the quantized context FMHA path for both FP8 and FP4 KV cache modes. Consider adding a comment to clarify that this flag supports both quantization modes despite the "FP8" in its name.
cpp/tensorrt_llm/common/attentionOp.cpp (1)
Lines 2101-2117: Consider refactoring the duplicated element size calculation

This element size calculation logic is duplicated from the context phase (lines 1285-1299). Consider extracting it into a helper function to improve maintainability:

```cpp
// Helper function to calculate element size based on quantization mode
static int getElementBits(tc::QuantMode const& kvCacheQuantMode)
{
    if (kvCacheQuantMode.hasInt8KvCache() || kvCacheQuantMode.hasFp8KvCache())
    {
        return 8;
    }
    else if (kvCacheQuantMode.hasFp4KvCache())
    {
        return 4;
    }
    else
    {
        return sizeof(T) * 8;
    }
}
```
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (107)
cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h
(3 hunks)cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp
(3 hunks)cpp/tensorrt_llm/common/attentionOp.cpp
(17 hunks)cpp/tensorrt_llm/common/attentionOp.h
(1 hunks)cpp/tensorrt_llm/kernels/contextFusedMultiHeadAttention/fused_multihead_attention_common.h
(1 hunks)cpp/tensorrt_llm/kernels/decoderMaskedMultiheadAttention/decoderXQAImplJIT/decoderXQAImplJIT.cpp
(1 hunks)cpp/tensorrt_llm/kernels/decoderMaskedMultiheadAttention/decoderXQAImplPrecompiled.cpp
(1 hunks)cpp/tensorrt_llm/kernels/fmhaDispatcher.cpp
(3 hunks)cpp/tensorrt_llm/kernels/trtllmGenKernels/fmha/cubin/FmhaSm100Kernel_QE4m3KvE2m1OE4m3H128PagedKvCausalP32VarSeqQ128Kv128PersistentContext_cubin.cpp
(1 hunks)cpp/tensorrt_llm/kernels/trtllmGenKernels/fmha/cubin/FmhaSm100Kernel_QE4m3KvE2m1OE4m3H128PagedKvCausalP32VarSeqQ128Kv128StaticContext_cubin.cpp
(1 hunks)cpp/tensorrt_llm/kernels/trtllmGenKernels/fmha/cubin/FmhaSm100Kernel_QE4m3KvE2m1OE4m3H128PagedKvCausalP64VarSeqQ128Kv128PersistentContext_cubin.cpp
(1 hunks)cpp/tensorrt_llm/kernels/trtllmGenKernels/fmha/cubin/FmhaSm100Kernel_QE4m3KvE2m1OE4m3H128PagedKvCausalP64VarSeqQ128Kv128StaticContext_cubin.cpp
(1 hunks)cpp/tensorrt_llm/kernels/trtllmGenKernels/fmha/cubin/FmhaSm100Kernel_QE4m3KvE2m1OE4m3H128PagedKvDenseP32MultiCtasKvCgaVarSeqQ16Kv128StaticSwapsAbForGen_cubin.cpp
(1 hunks)cpp/tensorrt_llm/kernels/trtllmGenKernels/fmha/cubin/FmhaSm100Kernel_QE4m3KvE2m1OE4m3H128PagedKvDenseP32MultiCtasKvCgaVarSeqQ8Kv128StaticSwapsAbForGen_cubin.cpp
(1 hunks)cpp/tensorrt_llm/kernels/trtllmGenKernels/fmha/cubin/FmhaSm100Kernel_QE4m3KvE2m1OE4m3H128PagedKvDenseP32MultiCtasKvVarSeqQ16Kv128StaticSwapsAbForGen_cubin.cpp
(1 hunks)cpp/tensorrt_llm/kernels/trtllmGenKernels/fmha/cubin/FmhaSm100Kernel_QE4m3KvE2m1OE4m3H128PagedKvDenseP32MultiCtasKvVarSeqQ8Kv128StaticSwapsAbForGen_cubin.cpp
(1 hunks)cpp/tensorrt_llm/kernels/trtllmGenKernels/fmha/cubin/FmhaSm100Kernel_QE4m3KvE2m1OE4m3H128PagedKvDenseP32VarSeqQ128Kv128PersistentContext_cubin.cpp
(1 hunks)cpp/tensorrt_llm/kernels/trtllmGenKernels/fmha/cubin/FmhaSm100Kernel_QE4m3KvE2m1OE4m3H128PagedKvDenseP32VarSeqQ128Kv128StaticContext_cubin.cpp
(1 hunks)cpp/tensorrt_llm/kernels/trtllmGenKernels/fmha/cubin/FmhaSm100Kernel_QE4m3KvE2m1OE4m3H128PagedKvDenseP32VarSeqQ16Kv128PersistentSwapsAbForGen_cubin.cpp
(1 hunks)cpp/tensorrt_llm/kernels/trtllmGenKernels/fmha/cubin/FmhaSm100Kernel_QE4m3KvE2m1OE4m3H128PagedKvDenseP32VarSeqQ16Kv128StaticSwapsAbForGen_cubin.cpp
(1 hunks)cpp/tensorrt_llm/kernels/trtllmGenKernels/fmha/cubin/FmhaSm100Kernel_QE4m3KvE2m1OE4m3H128PagedKvDenseP32VarSeqQ8Kv128PersistentSwapsAbForGen_cubin.cpp
(1 hunks)cpp/tensorrt_llm/kernels/trtllmGenKernels/fmha/cubin/FmhaSm100Kernel_QE4m3KvE2m1OE4m3H128PagedKvDenseP32VarSeqQ8Kv128StaticSwapsAbForGen_cubin.cpp
(1 hunks)cpp/tensorrt_llm/kernels/trtllmGenKernels/fmha/cubin/FmhaSm100Kernel_QE4m3KvE2m1OE4m3H128PagedKvDenseP64MultiCtasKvCgaVarSeqQ16Kv128StaticSwapsAbForGen_cubin.cpp
(1 hunks)cpp/tensorrt_llm/kernels/trtllmGenKernels/fmha/cubin/FmhaSm100Kernel_QE4m3KvE2m1OE4m3H128PagedKvDenseP64MultiCtasKvCgaVarSeqQ8Kv128StaticSwapsAbForGen_cubin.cpp
(1 hunks)cpp/tensorrt_llm/kernels/trtllmGenKernels/fmha/cubin/FmhaSm100Kernel_QE4m3KvE2m1OE4m3H128PagedKvDenseP64MultiCtasKvVarSeqQ16Kv128StaticSwapsAbForGen_cubin.cpp
(1 hunks)cpp/tensorrt_llm/kernels/trtllmGenKernels/fmha/cubin/FmhaSm100Kernel_QE4m3KvE2m1OE4m3H128PagedKvDenseP64MultiCtasKvVarSeqQ8Kv128StaticSwapsAbForGen_cubin.cpp
(1 hunks)cpp/tensorrt_llm/kernels/trtllmGenKernels/fmha/cubin/FmhaSm100Kernel_QE4m3KvE2m1OE4m3H128PagedKvDenseP64VarSeqQ128Kv128PersistentContext_cubin.cpp
(1 hunks)cpp/tensorrt_llm/kernels/trtllmGenKernels/fmha/cubin/FmhaSm100Kernel_QE4m3KvE2m1OE4m3H128PagedKvDenseP64VarSeqQ128Kv128StaticContext_cubin.cpp
(1 hunks)cpp/tensorrt_llm/kernels/trtllmGenKernels/fmha/cubin/FmhaSm100Kernel_QE4m3KvE2m1OE4m3H128PagedKvDenseP64VarSeqQ16Kv128PersistentSwapsAbForGen_cubin.cpp
(1 hunks)cpp/tensorrt_llm/kernels/trtllmGenKernels/fmha/cubin/FmhaSm100Kernel_QE4m3KvE2m1OE4m3H128PagedKvDenseP64VarSeqQ16Kv128StaticSwapsAbForGen_cubin.cpp
(1 hunks)cpp/tensorrt_llm/kernels/trtllmGenKernels/fmha/cubin/FmhaSm100Kernel_QE4m3KvE2m1OE4m3H128PagedKvDenseP64VarSeqQ8Kv128PersistentSwapsAbForGen_cubin.cpp
(1 hunks)cpp/tensorrt_llm/kernels/trtllmGenKernels/fmha/cubin/FmhaSm100Kernel_QE4m3KvE2m1OE4m3H128PagedKvDenseP64VarSeqQ8Kv128StaticSwapsAbForGen_cubin.cpp
(1 hunks)cpp/tensorrt_llm/kernels/trtllmGenKernels/fmha/cubin/FmhaSm100Kernel_QE4m3KvE2m1OE4m3H128PagedKvSlidingOrChunkedCausalP32MultiCtasKvCgaVarSeqQ16Kv128StaticSwapsAbForGen_cubin.cpp
(1 hunks)cpp/tensorrt_llm/kernels/trtllmGenKernels/fmha/cubin/FmhaSm100Kernel_QE4m3KvE2m1OE4m3H128PagedKvSlidingOrChunkedCausalP32MultiCtasKvCgaVarSeqQ8Kv128StaticSwapsAbForGen_cubin.cpp
(1 hunks)cpp/tensorrt_llm/kernels/trtllmGenKernels/fmha/cubin/FmhaSm100Kernel_QE4m3KvE2m1OE4m3H128PagedKvSlidingOrChunkedCausalP32MultiCtasKvVarSeqQ16Kv128StaticSwapsAbForGen_cubin.cpp
(1 hunks)cpp/tensorrt_llm/kernels/trtllmGenKernels/fmha/cubin/FmhaSm100Kernel_QE4m3KvE2m1OE4m3H128PagedKvSlidingOrChunkedCausalP32MultiCtasKvVarSeqQ8Kv128StaticSwapsAbForGen_cubin.cpp
(1 hunks)cpp/tensorrt_llm/kernels/trtllmGenKernels/fmha/cubin/FmhaSm100Kernel_QE4m3KvE2m1OE4m3H128PagedKvSlidingOrChunkedCausalP32VarSeqQ128Kv128PersistentContext_cubin.cpp
(1 hunks)cpp/tensorrt_llm/kernels/trtllmGenKernels/fmha/cubin/FmhaSm100Kernel_QE4m3KvE2m1OE4m3H128PagedKvSlidingOrChunkedCausalP32VarSeqQ128Kv128StaticContext_cubin.cpp
(1 hunks)cpp/tensorrt_llm/kernels/trtllmGenKernels/fmha/cubin/FmhaSm100Kernel_QE4m3KvE2m1OE4m3H128PagedKvSlidingOrChunkedCausalP32VarSeqQ16Kv128PersistentSwapsAbForGen_cubin.cpp
(1 hunks)cpp/tensorrt_llm/kernels/trtllmGenKernels/fmha/cubin/FmhaSm100Kernel_QE4m3KvE2m1OE4m3H128PagedKvSlidingOrChunkedCausalP32VarSeqQ16Kv128StaticSwapsAbForGen_cubin.cpp
(1 hunks)cpp/tensorrt_llm/kernels/trtllmGenKernels/fmha/cubin/FmhaSm100Kernel_QE4m3KvE2m1OE4m3H128PagedKvSlidingOrChunkedCausalP32VarSeqQ8Kv128PersistentSwapsAbForGen_cubin.cpp
(1 hunks)cpp/tensorrt_llm/kernels/trtllmGenKernels/fmha/cubin/FmhaSm100Kernel_QE4m3KvE2m1OE4m3H128PagedKvSlidingOrChunkedCausalP32VarSeqQ8Kv128StaticSwapsAbForGen_cubin.cpp
(1 hunks)cpp/tensorrt_llm/kernels/trtllmGenKernels/fmha/cubin/FmhaSm100Kernel_QE4m3KvE2m1OE4m3H128PagedKvSlidingOrChunkedCausalP64MultiCtasKvCgaVarSeqQ16Kv128StaticSwapsAbForGen_cubin.cpp
(1 hunks)cpp/tensorrt_llm/kernels/trtllmGenKernels/fmha/cubin/FmhaSm100Kernel_QE4m3KvE2m1OE4m3H128PagedKvSlidingOrChunkedCausalP64MultiCtasKvCgaVarSeqQ8Kv128StaticSwapsAbForGen_cubin.cpp
(1 hunks)cpp/tensorrt_llm/kernels/trtllmGenKernels/fmha/cubin/FmhaSm100Kernel_QE4m3KvE2m1OE4m3H128PagedKvSlidingOrChunkedCausalP64MultiCtasKvVarSeqQ16Kv128StaticSwapsAbForGen_cubin.cpp
(1 hunks)cpp/tensorrt_llm/kernels/trtllmGenKernels/fmha/cubin/FmhaSm100Kernel_QE4m3KvE2m1OE4m3H128PagedKvSlidingOrChunkedCausalP64MultiCtasKvVarSeqQ8Kv128StaticSwapsAbForGen_cubin.cpp
(1 hunks)cpp/tensorrt_llm/kernels/trtllmGenKernels/fmha/cubin/FmhaSm100Kernel_QE4m3KvE2m1OE4m3H128PagedKvSlidingOrChunkedCausalP64VarSeqQ128Kv128PersistentContext_cubin.cpp
(1 hunks)cpp/tensorrt_llm/kernels/trtllmGenKernels/fmha/cubin/FmhaSm100Kernel_QE4m3KvE2m1OE4m3H128PagedKvSlidingOrChunkedCausalP64VarSeqQ128Kv128StaticContext_cubin.cpp
(1 hunks)cpp/tensorrt_llm/kernels/trtllmGenKernels/fmha/cubin/FmhaSm100Kernel_QE4m3KvE2m1OE4m3H128PagedKvSlidingOrChunkedCausalP64VarSeqQ16Kv128PersistentSwapsAbForGen_cubin.cpp
(1 hunks)cpp/tensorrt_llm/kernels/trtllmGenKernels/fmha/cubin/FmhaSm100Kernel_QE4m3KvE2m1OE4m3H128PagedKvSlidingOrChunkedCausalP64VarSeqQ16Kv128StaticSwapsAbForGen_cubin.cpp
(1 hunks)cpp/tensorrt_llm/kernels/trtllmGenKernels/fmha/cubin/FmhaSm100Kernel_QE4m3KvE2m1OE4m3H128PagedKvSlidingOrChunkedCausalP64VarSeqQ8Kv128PersistentSwapsAbForGen_cubin.cpp
(1 hunks)cpp/tensorrt_llm/kernels/trtllmGenKernels/fmha/cubin/FmhaSm100Kernel_QE4m3KvE2m1OE4m3H128PagedKvSlidingOrChunkedCausalP64VarSeqQ8Kv128StaticSwapsAbForGen_cubin.cpp
(1 hunks)cpp/tensorrt_llm/kernels/trtllmGenKernels/fmha/cubin/FmhaSm100Kernel_QE4m3KvE2m1OE4m3H64PagedKvCausalP32VarSeqQ128Kv128PersistentContext_cubin.cpp
(1 hunks)cpp/tensorrt_llm/kernels/trtllmGenKernels/fmha/cubin/FmhaSm100Kernel_QE4m3KvE2m1OE4m3H64PagedKvCausalP32VarSeqQ128Kv128StaticContext_cubin.cpp
(1 hunks)cpp/tensorrt_llm/kernels/trtllmGenKernels/fmha/cubin/FmhaSm100Kernel_QE4m3KvE2m1OE4m3H64PagedKvCausalP64VarSeqQ128Kv128PersistentContext_cubin.cpp
(1 hunks)cpp/tensorrt_llm/kernels/trtllmGenKernels/fmha/cubin/FmhaSm100Kernel_QE4m3KvE2m1OE4m3H64PagedKvCausalP64VarSeqQ128Kv128StaticContext_cubin.cpp
(1 hunks)cpp/tensorrt_llm/kernels/trtllmGenKernels/fmha/cubin/FmhaSm100Kernel_QE4m3KvE2m1OE4m3H64PagedKvDenseP32MultiCtasKvCgaVarSeqQ16Kv128StaticSwapsAbForGen_cubin.cpp
(1 hunks)cpp/tensorrt_llm/kernels/trtllmGenKernels/fmha/cubin/FmhaSm100Kernel_QE4m3KvE2m1OE4m3H64PagedKvDenseP32MultiCtasKvCgaVarSeqQ8Kv128StaticSwapsAbForGen_cubin.cpp
(1 hunks)cpp/tensorrt_llm/kernels/trtllmGenKernels/fmha/cubin/FmhaSm100Kernel_QE4m3KvE2m1OE4m3H64PagedKvDenseP32MultiCtasKvVarSeqQ16Kv128StaticSwapsAbForGen_cubin.cpp
(1 hunks)cpp/tensorrt_llm/kernels/trtllmGenKernels/fmha/cubin/FmhaSm100Kernel_QE4m3KvE2m1OE4m3H64PagedKvDenseP32MultiCtasKvVarSeqQ8Kv128StaticSwapsAbForGen_cubin.cpp
(1 hunks)cpp/tensorrt_llm/kernels/trtllmGenKernels/fmha/cubin/FmhaSm100Kernel_QE4m3KvE2m1OE4m3H64PagedKvDenseP32VarSeqQ128Kv128PersistentContext_cubin.cpp
(1 hunks)cpp/tensorrt_llm/kernels/trtllmGenKernels/fmha/cubin/FmhaSm100Kernel_QE4m3KvE2m1OE4m3H64PagedKvDenseP32VarSeqQ128Kv128StaticContext_cubin.cpp
(1 hunks)cpp/tensorrt_llm/kernels/trtllmGenKernels/fmha/cubin/FmhaSm100Kernel_QE4m3KvE2m1OE4m3H64PagedKvDenseP32VarSeqQ16Kv128PersistentSwapsAbForGen_cubin.cpp
(1 hunks)cpp/tensorrt_llm/kernels/trtllmGenKernels/fmha/cubin/FmhaSm100Kernel_QE4m3KvE2m1OE4m3H64PagedKvDenseP32VarSeqQ16Kv128StaticSwapsAbForGen_cubin.cpp
(1 hunks)cpp/tensorrt_llm/kernels/trtllmGenKernels/fmha/cubin/FmhaSm100Kernel_QE4m3KvE2m1OE4m3H64PagedKvDenseP32VarSeqQ8Kv128PersistentSwapsAbForGen_cubin.cpp
(1 hunks)cpp/tensorrt_llm/kernels/trtllmGenKernels/fmha/cubin/FmhaSm100Kernel_QE4m3KvE2m1OE4m3H64PagedKvDenseP32VarSeqQ8Kv128StaticSwapsAbForGen_cubin.cpp
(1 hunks)cpp/tensorrt_llm/kernels/trtllmGenKernels/fmha/cubin/FmhaSm100Kernel_QE4m3KvE2m1OE4m3H64PagedKvDenseP64MultiCtasKvCgaVarSeqQ16Kv128StaticSwapsAbForGen_cubin.cpp
(1 hunks)cpp/tensorrt_llm/kernels/trtllmGenKernels/fmha/cubin/FmhaSm100Kernel_QE4m3KvE2m1OE4m3H64PagedKvDenseP64MultiCtasKvCgaVarSeqQ8Kv128StaticSwapsAbForGen_cubin.cpp
(1 hunks)cpp/tensorrt_llm/kernels/trtllmGenKernels/fmha/cubin/FmhaSm100Kernel_QE4m3KvE2m1OE4m3H64PagedKvDenseP64MultiCtasKvVarSeqQ16Kv128StaticSwapsAbForGen_cubin.cpp
(1 hunks)cpp/tensorrt_llm/kernels/trtllmGenKernels/fmha/cubin/FmhaSm100Kernel_QE4m3KvE2m1OE4m3H64PagedKvDenseP64MultiCtasKvVarSeqQ8Kv128StaticSwapsAbForGen_cubin.cpp
(1 hunks)cpp/tensorrt_llm/kernels/trtllmGenKernels/fmha/cubin/FmhaSm100Kernel_QE4m3KvE2m1OE4m3H64PagedKvDenseP64VarSeqQ128Kv128PersistentContext_cubin.cpp
(1 hunks)cpp/tensorrt_llm/kernels/trtllmGenKernels/fmha/cubin/FmhaSm100Kernel_QE4m3KvE2m1OE4m3H64PagedKvDenseP64VarSeqQ128Kv128StaticContext_cubin.cpp
(1 hunks)cpp/tensorrt_llm/kernels/trtllmGenKernels/fmha/cubin/FmhaSm100Kernel_QE4m3KvE2m1OE4m3H64PagedKvDenseP64VarSeqQ16Kv128PersistentSwapsAbForGen_cubin.cpp
(1 hunks)cpp/tensorrt_llm/kernels/trtllmGenKernels/fmha/cubin/FmhaSm100Kernel_QE4m3KvE2m1OE4m3H64PagedKvDenseP64VarSeqQ16Kv128StaticSwapsAbForGen_cubin.cpp
(1 hunks)cpp/tensorrt_llm/kernels/trtllmGenKernels/fmha/cubin/FmhaSm100Kernel_QE4m3KvE2m1OE4m3H64PagedKvDenseP64VarSeqQ8Kv128PersistentSwapsAbForGen_cubin.cpp
(1 hunks)cpp/tensorrt_llm/kernels/trtllmGenKernels/fmha/cubin/FmhaSm100Kernel_QE4m3KvE2m1OE4m3H64PagedKvDenseP64VarSeqQ8Kv128StaticSwapsAbForGen_cubin.cpp
(1 hunks)cpp/tensorrt_llm/kernels/trtllmGenKernels/fmha/cubin/FmhaSm100Kernel_QE4m3KvE2m1OE4m3H64PagedKvSlidingOrChunkedCausalP32MultiCtasKvCgaVarSeqQ16Kv128StaticSwapsAbForGen_cubin.cpp
(1 hunks)cpp/tensorrt_llm/kernels/trtllmGenKernels/fmha/cubin/FmhaSm100Kernel_QE4m3KvE2m1OE4m3H64PagedKvSlidingOrChunkedCausalP32MultiCtasKvCgaVarSeqQ8Kv128StaticSwapsAbForGen_cubin.cpp
(1 hunks)cpp/tensorrt_llm/kernels/trtllmGenKernels/fmha/cubin/FmhaSm100Kernel_QE4m3KvE2m1OE4m3H64PagedKvSlidingOrChunkedCausalP32MultiCtasKvVarSeqQ16Kv128StaticSwapsAbForGen_cubin.cpp
(1 hunks)cpp/tensorrt_llm/kernels/trtllmGenKernels/fmha/cubin/FmhaSm100Kernel_QE4m3KvE2m1OE4m3H64PagedKvSlidingOrChunkedCausalP32MultiCtasKvVarSeqQ8Kv128StaticSwapsAbForGen_cubin.cpp
(1 hunks)cpp/tensorrt_llm/kernels/trtllmGenKernels/fmha/cubin/FmhaSm100Kernel_QE4m3KvE2m1OE4m3H64PagedKvSlidingOrChunkedCausalP32VarSeqQ128Kv128PersistentContext_cubin.cpp
(1 hunks)cpp/tensorrt_llm/kernels/trtllmGenKernels/fmha/cubin/FmhaSm100Kernel_QE4m3KvE2m1OE4m3H64PagedKvSlidingOrChunkedCausalP32VarSeqQ128Kv128StaticContext_cubin.cpp
(1 hunks)cpp/tensorrt_llm/kernels/trtllmGenKernels/fmha/cubin/FmhaSm100Kernel_QE4m3KvE2m1OE4m3H64PagedKvSlidingOrChunkedCausalP32VarSeqQ16Kv128PersistentSwapsAbForGen_cubin.cpp
(1 hunks)cpp/tensorrt_llm/kernels/trtllmGenKernels/fmha/cubin/FmhaSm100Kernel_QE4m3KvE2m1OE4m3H64PagedKvSlidingOrChunkedCausalP32VarSeqQ16Kv128StaticSwapsAbForGen_cubin.cpp
(1 hunks)cpp/tensorrt_llm/kernels/trtllmGenKernels/fmha/cubin/FmhaSm100Kernel_QE4m3KvE2m1OE4m3H64PagedKvSlidingOrChunkedCausalP32VarSeqQ8Kv128PersistentSwapsAbForGen_cubin.cpp
(1 hunks)
cpp/tensorrt_llm/kernels/trtllmGenKernels/fmha/cubin/FmhaSm100Kernel_QE4m3KvE2m1OE4m3H64PagedKvSlidingOrChunkedCausalP32VarSeqQ8Kv128StaticSwapsAbForGen_cubin.cpp (1 hunks)
cpp/tensorrt_llm/kernels/trtllmGenKernels/fmha/cubin/FmhaSm100Kernel_QE4m3KvE2m1OE4m3H64PagedKvSlidingOrChunkedCausalP64MultiCtasKvCgaVarSeqQ16Kv128StaticSwapsAbForGen_cubin.cpp (1 hunks)
cpp/tensorrt_llm/kernels/trtllmGenKernels/fmha/cubin/FmhaSm100Kernel_QE4m3KvE2m1OE4m3H64PagedKvSlidingOrChunkedCausalP64MultiCtasKvCgaVarSeqQ8Kv128StaticSwapsAbForGen_cubin.cpp (1 hunks)
cpp/tensorrt_llm/kernels/trtllmGenKernels/fmha/cubin/FmhaSm100Kernel_QE4m3KvE2m1OE4m3H64PagedKvSlidingOrChunkedCausalP64MultiCtasKvVarSeqQ16Kv128StaticSwapsAbForGen_cubin.cpp (1 hunks)
cpp/tensorrt_llm/kernels/trtllmGenKernels/fmha/cubin/FmhaSm100Kernel_QE4m3KvE2m1OE4m3H64PagedKvSlidingOrChunkedCausalP64MultiCtasKvVarSeqQ8Kv128StaticSwapsAbForGen_cubin.cpp (1 hunks)
cpp/tensorrt_llm/kernels/trtllmGenKernels/fmha/cubin/FmhaSm100Kernel_QE4m3KvE2m1OE4m3H64PagedKvSlidingOrChunkedCausalP64VarSeqQ128Kv128PersistentContext_cubin.cpp (1 hunks)
cpp/tensorrt_llm/kernels/trtllmGenKernels/fmha/cubin/FmhaSm100Kernel_QE4m3KvE2m1OE4m3H64PagedKvSlidingOrChunkedCausalP64VarSeqQ128Kv128StaticContext_cubin.cpp (1 hunks)
cpp/tensorrt_llm/kernels/trtllmGenKernels/fmha/cubin/FmhaSm100Kernel_QE4m3KvE2m1OE4m3H64PagedKvSlidingOrChunkedCausalP64VarSeqQ16Kv128PersistentSwapsAbForGen_cubin.cpp (1 hunks)
cpp/tensorrt_llm/kernels/trtllmGenKernels/fmha/cubin/FmhaSm100Kernel_QE4m3KvE2m1OE4m3H64PagedKvSlidingOrChunkedCausalP64VarSeqQ16Kv128StaticSwapsAbForGen_cubin.cpp (1 hunks)
cpp/tensorrt_llm/kernels/trtllmGenKernels/fmha/cubin/FmhaSm100Kernel_QE4m3KvE2m1OE4m3H64PagedKvSlidingOrChunkedCausalP64VarSeqQ8Kv128PersistentSwapsAbForGen_cubin.cpp (1 hunks)
cpp/tensorrt_llm/kernels/trtllmGenKernels/fmha/cubin/FmhaSm100Kernel_QE4m3KvE2m1OE4m3H64PagedKvSlidingOrChunkedCausalP64VarSeqQ8Kv128StaticSwapsAbForGen_cubin.cpp (1 hunks)
cpp/tensorrt_llm/kernels/trtllmGenKernels/fmha/fmhaRunner.cpp (1 hunks)
cpp/tensorrt_llm/kernels/trtllmGenKernels/fmha/fmhaRunnerParams.h (1 hunks)
cpp/tensorrt_llm/kernels/trtllmGenKernels/fmha/kernelParams.h (12 hunks)
cpp/tensorrt_llm/kernels/unfusedAttentionKernels.h (2 hunks)
cpp/tensorrt_llm/kernels/unfusedAttentionKernels/unfusedAttentionKernels_2_template.h (6 hunks)
cpp/tensorrt_llm/kernels/xqaDispatcher.cpp (8 hunks)
cpp/tensorrt_llm/kernels/xqaDispatcher.h (2 hunks)
cpp/tensorrt_llm/pybind/batch_manager/kvCacheManager.cpp (1 hunks)
cpp/tensorrt_llm/pybind/bindings.cpp (2 hunks)
cpp/tensorrt_llm/thop/attentionOp.cpp (10 hunks)
cpp/tests/unit_tests/batch_manager/kvCacheManagerTest.cpp (2 hunks)
⛔ Files not processed due to max files limit (14)
- cpp/tests/unit_tests/kernels/ropeTest.cu
- tensorrt_llm/_torch/attention_backend/trtllm.py
- tensorrt_llm/_torch/custom_ops/torch_custom_ops.py
- tensorrt_llm/_torch/pyexecutor/_util.py
- tensorrt_llm/_torch/pyexecutor/model_engine.py
- tensorrt_llm/_torch/pyexecutor/resource_manager.py
- tensorrt_llm/_utils.py
- tensorrt_llm/llmapi/llm_utils.py
- tests/integration/defs/.test_durations
- tests/integration/defs/accuracy/references/gsm8k.yaml
- tests/integration/defs/accuracy/references/mmlu.yaml
- tests/integration/defs/accuracy/test_llm_api_pytorch.py
- tests/integration/test_lists/test-db/l0_b200.yml
- tests/integration/test_lists/test-db/l0_h100.yml
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Pre-commit Check
🔇 Additional comments (112)
cpp/tensorrt_llm/kernels/trtllmGenKernels/fmha/cubin/FmhaSm100Kernel_QE4m3KvE2m1OE4m3H64PagedKvDenseP32VarSeqQ16Kv128StaticSwapsAbForGen_cubin.cpp (1)
2-3
: Binary cubin reference updated – ensure provenance
Only the Git-LFS pointer (OID & size) changed; there’s no reviewable source.
Please confirm the cubin was rebuilt from the matching source commit with reproducible flags and is covered by CI artifact signing or checksum validation.
cpp/tensorrt_llm/kernels/trtllmGenKernels/fmha/cubin/FmhaSm100Kernel_QE4m3KvE2m1OE4m3H64PagedKvSlidingOrChunkedCausalP64VarSeqQ128Kv128StaticContext_cubin.cpp (1)
2-3
: Binary cubin reference updated – ensure provenance
Same note as above: only metadata changed.
Verify that the new cubin is reproducibly built and signed; otherwise downstream users cannot audit the binary.
cpp/tensorrt_llm/kernels/trtllmGenKernels/fmha/cubin/FmhaSm100Kernel_QE4m3KvE2m1OE4m3H128PagedKvSlidingOrChunkedCausalP64VarSeqQ128Kv128PersistentContext_cubin.cpp (1)
2-3
: Binary cubin reference updated – ensure provenance
No source diff. Please keep build recipes or a hash manifest checked in so the team can reproduce this cubin and verify it matches the committed SHA-256 hash.
cpp/tensorrt_llm/kernels/trtllmGenKernels/fmha/cubin/FmhaSm100Kernel_QE4m3KvE2m1OE4m3H128PagedKvSlidingOrChunkedCausalP32VarSeqQ16Kv128PersistentSwapsAbForGen_cubin.cpp (1)
2-3
: Binary cubin reference updated – ensure provenance
Again, only the LFS pointer changed. Please confirm the binary is built from tagged source; consider adding an automated check in CI that rebuilds and diff-checks cubins.
cpp/tensorrt_llm/kernels/trtllmGenKernels/fmha/cubin/FmhaSm100Kernel_QE4m3KvE2m1OE4m3H64PagedKvDenseP32VarSeqQ8Kv128PersistentSwapsAbForGen_cubin.cpp (1)
2-3
: Binary cubin reference updated – ensure provenance
Only metadata lines updated. Recommend documenting the compiler flags & commit SHA used to generate this cubin so future reviewers can reproduce or audit it.
cpp/tensorrt_llm/kernels/trtllmGenKernels/fmha/cubin/FmhaSm100Kernel_QE4m3KvE2m1OE4m3H128PagedKvSlidingOrChunkedCausalP32VarSeqQ128Kv128PersistentContext_cubin.cpp (1)
2-3
: Binary kernel artifact refreshed – looks fine
Only the LFS pointer (SHA-256 + size) changed. Nothing else to review at source level.
Make sure the rebuilt cubin is reproducible with the current build scripts and that CI/functional tests covering FP4 KV-cache pass with this new binary.
cpp/tensorrt_llm/kernels/trtllmGenKernels/fmha/cubin/FmhaSm100Kernel_QE4m3KvE2m1OE4m3H64PagedKvSlidingOrChunkedCausalP64MultiCtasKvCgaVarSeqQ16Kv128StaticSwapsAbForGen_cubin.cpp (1)
2-3
: Updated cubin acknowledged
Pointer update only. No actionable code issues. Please confirm this binary was compiled with the same tool-chain flags used for the rest of the FP4 roll-out.
cpp/tensorrt_llm/kernels/trtllmGenKernels/fmha/cubin/FmhaSm100Kernel_QE4m3KvE2m1OE4m3H128PagedKvSlidingOrChunkedCausalP64VarSeqQ16Kv128PersistentSwapsAbForGen_cubin.cpp (1)
2-3
: Cubin hash/size updated
No source changes present. Ensure the new kernel aligns with the unified KV scale-factor pointer introduced in the runner params.
cpp/tensorrt_llm/kernels/trtllmGenKernels/fmha/cubin/FmhaSm100Kernel_QE4m3KvE2m1OE4m3H64PagedKvDenseP64MultiCtasKvVarSeqQ16Kv128StaticSwapsAbForGen_cubin.cpp (1)
2-3
: Binary refresh noted
Nothing to review in text. Validate that performance baselines are unchanged/improved with this rebuilt cubin.
cpp/tensorrt_llm/kernels/trtllmGenKernels/fmha/cubin/FmhaSm100Kernel_QE4m3KvE2m1OE4m3H64PagedKvDenseP64VarSeqQ128Kv128PersistentContext_cubin.cpp (1)
2-3
: Acknowledged new cubin pointer
No code concerns. Recommend running regression tests that exercise the persistent-context path.
cpp/tensorrt_llm/kernels/trtllmGenKernels/fmha/cubin/FmhaSm100Kernel_QE4m3KvE2m1OE4m3H64PagedKvSlidingOrChunkedCausalP64VarSeqQ8Kv128PersistentSwapsAbForGen_cubin.cpp (1)
1-4
: Confirm reproducible build & runtime registration for updated cubin.
Only the Git-LFS pointer (SHA / size) changed, so we cannot review the binary logic itself.
Please double-check that:
- The cubin was rebuilt with the same compiler flags (SM 100, PTX version, -G off, determinism enabled) as the rest of the FMHA set – mismatched flags silently break mixed-build deployments.
- Every dispatcher table / manifest that lists built-in kernels is updated to reference this exact cubin name; otherwise the loader will fall back to an older artifact at runtime.
If you have a deterministic-build manifest, consider committing it next to the cubins to make future reviews easier.
cpp/tensorrt_llm/kernels/trtllmGenKernels/fmha/cubin/FmhaSm100Kernel_QE4m3KvE2m1OE4m3H64PagedKvDenseP32VarSeqQ8Kv128StaticSwapsAbForGen_cubin.cpp (1)
2-3
: Binary cubin updated – please confirm runtime compatibility
Only the Git LFS pointer (OID/size) changed, indicating a rebuilt kernel binary. No source code is available for static review.
Make sure:
- The new cubin was compiled with the same (or newer, compatible) CUDA & SM 100 tool-chain flags expected by the dispatcher.
- Any hard-coded SHA/size checks in loader code are updated.
- CI runs end-to-end attention tests exercising FP4 KV-cache paths on an SM100 device.
cpp/tensorrt_llm/kernels/trtllmGenKernels/fmha/cubin/FmhaSm100Kernel_QE4m3KvE2m1OE4m3H128PagedKvDenseP64VarSeqQ16Kv128PersistentSwapsAbForGen_cubin.cpp (1)
2-3
: Recompiled kernel artifact – ensure loader & params stayed in sync
The cubin’s hash/size changed; no textual diff is reviewable.
Please validate:
• New kernel still matches the updated fmhaRunnerParams (packed KV scale ptr).
• Performance regressions are tracked (new cubin is ~35 KB larger).
• Unit tests covering the H128 paged-KV persistent path pass.
cpp/tensorrt_llm/kernels/trtllmGenKernels/fmha/cubin/FmhaSm100Kernel_QE4m3KvE2m1OE4m3H64PagedKvCausalP64VarSeqQ128Kv128PersistentContext_cubin.cpp (1)
2-3
: Large cubin replacement – double-check causal-context kernel launch
Given the FP4 KV-cache refactor, confirm that:
- Kernel launch code sets the packed-KV scale pointer for this causal variant.
- Memory footprint (TMEM descriptors) still fits occupancy targets on SM100.
- Regression tests with long-sequence causal generation succeed.
cpp/tensorrt_llm/kernels/trtllmGenKernels/fmha/cubin/FmhaSm100Kernel_QE4m3KvE2m1OE4m3H64PagedKvSlidingOrChunkedCausalP64MultiCtasKvVarSeqQ8Kv128StaticSwapsAbForGen_cubin.cpp (1)
2-3
: Size drop (~300 KB) – verify no feature loss
The new cubin is significantly smaller. Ensure build flags (e.g., -lineinfo, -res-usage) or disabled paths didn’t strip required functionality. Run sliding-window / chunked-causal benchmarks to catch latent issues.
cpp/tensorrt_llm/kernels/trtllmGenKernels/fmha/cubin/FmhaSm100Kernel_QE4m3KvE2m1OE4m3H64PagedKvDenseP32MultiCtasKvCgaVarSeqQ8Kv128StaticSwapsAbForGen_cubin.cpp (1)
2-3
: Updated cubin – confirm CGA multi-CTA path correctness
With FP4 KV-cache support, multi-CTA CGA kernels are especially sensitive to scale-pool pointer propagation. Please:
• Run distributed-KV stress test.
• Check for mismatched quant-scale leading to NaNs.
• Capture new PTX-generated register usage to watch for spill regressions.
cpp/tensorrt_llm/kernels/trtllmGenKernels/fmha/cubin/FmhaSm100Kernel_QE4m3KvE2m1OE4m3H64PagedKvSlidingOrChunkedCausalP32VarSeqQ8Kv128PersistentSwapsAbForGen_cubin.cpp (1)
2-3
: Confirm .cubin pointer is committed via Git-LFS
Only the LFS pointer (not the raw binary) should live in Git. Please double-check that *.cubin.cpp is covered by the existing .gitattributes and that CI still fetches the blobs.
cpp/tensorrt_llm/kernels/trtllmGenKernels/fmha/cubin/FmhaSm100Kernel_QE4m3KvE2m1OE4m3H64PagedKvSlidingOrChunkedCausalP64VarSeqQ8Kv128StaticSwapsAbForGen_cubin.cpp (1)
2-3
: Same LFS-tracking sanity check
Ensure the new SHA/object is pushed to the LFS store; otherwise downstream git lfs pull will fail.
cpp/tensorrt_llm/kernels/trtllmGenKernels/fmha/cubin/FmhaSm100Kernel_QE4m3KvE2m1OE4m3H128PagedKvDenseP32VarSeqQ128Kv128StaticContext_cubin.cpp (1)
2-3
: Large-binary pointer update looks fine
Hash / size changed as expected after recompilation for FP4 support.
cpp/tensorrt_llm/kernels/trtllmGenKernels/fmha/cubin/FmhaSm100Kernel_QE4m3KvE2m1OE4m3H128PagedKvDenseP32VarSeqQ16Kv128StaticSwapsAbForGen_cubin.cpp (1)
2-3
: Reminder: keep binary artefacts out of review diffs
Nothing actionable, just noting the reduced cubin size (👍).
cpp/tensorrt_llm/kernels/trtllmGenKernels/fmha/cubin/FmhaSm100Kernel_QE4m3KvE2m1OE4m3H128PagedKvCausalP32VarSeqQ128Kv128PersistentContext_cubin.cpp (1)
2-3
: Ensure the updated cubin is registered in all build/package manifests
Binary-only change looks fine, but please double-check that:
- The new SHA (0bf00b1…) is pushed to the LFS store.
- Any CMake-style manifest or Python wheel packaging list has been updated so downstream consumers pull the refreshed cubin.
Missing one of these breaks runtime kernel loading on fresh clones/installs.
cpp/tensorrt_llm/kernels/trtllmGenKernels/fmha/cubin/FmhaSm100Kernel_QE4m3KvE2m1OE4m3H128PagedKvSlidingOrChunkedCausalP64VarSeqQ16Kv128StaticSwapsAbForGen_cubin.cpp (1)
2-3
: Verify kernel compatibility across driver/toolkit matrix
Size dropped ~35 %. Great, but smaller cubins occasionally miss older-SM fallback code paths.
Please run the usual SM70/80 smoke tests before merge.
cpp/tensorrt_llm/kernels/trtllmGenKernels/fmha/cubin/FmhaSm100Kernel_QE4m3KvE2m1OE4m3H128PagedKvDenseP32MultiCtasKvVarSeqQ16Kv128StaticSwapsAbForGen_cubin.cpp (1)
2-3
: LGTM – binary refresh only
No concerns; compilation artefact update is expected with FP4 support.
cpp/tensorrt_llm/kernels/trtllmGenKernels/fmha/cubin/FmhaSm100Kernel_QE4m3KvE2m1OE4m3H128PagedKvDenseP64VarSeqQ8Kv128StaticSwapsAbForGen_cubin.cpp (1)
2-3
: Confirm CI caches get invalidated
Because the file hash changed, cached Docker layers or pre-built wheels may still hold the old cubin.
Make sure the CI scripts force a clean rebuild (e.g., bump the wheel version or clear $HOME/.cache/trtllm).
cpp/tensorrt_llm/kernels/trtllmGenKernels/fmha/cubin/FmhaSm100Kernel_QE4m3KvE2m1OE4m3H64PagedKvDenseP32MultiCtasKvVarSeqQ16Kv128StaticSwapsAbForGen_cubin.cpp (1)
2-3
: Binary artifact updated – please run a smoke test on SM100 hardware.
The cubin hash and size changed, implying a full re-compile. Make sure:
- The CI matrix still flashes this cubin on an SM100 board.
- Runtime kernel selection logic (fmhaDispatcher) resolves to this exact cubin when dtype == NVFP4.
cpp/tensorrt_llm/kernels/trtllmGenKernels/fmha/cubin/FmhaSm100Kernel_QE4m3KvE2m1OE4m3H128PagedKvSlidingOrChunkedCausalP32MultiCtasKvVarSeqQ16Kv128StaticSwapsAbForGen_cubin.cpp (1)
2-3
: Large binary delta – double-check PTX/linker versions.
The size drop (~33 %) may come from switching CUDA versions or new linker options. Ensure the PTX major/minor baked into the cubin is still ≥ the minimum supported by the production driver fleet (documented in
docs/compat_matrix.md
).cpp/tensorrt_llm/kernels/trtllmGenKernels/fmha/cubin/FmhaSm100Kernel_QE4m3KvE2m1OE4m3H64PagedKvSlidingOrChunkedCausalP64MultiCtasKvVarSeqQ16Kv128StaticSwapsAbForGen_cubin.cpp (1)
2-3
: Binary drop-in looks good.
Nothing to flag here beyond the usual runtime-selection sanity checks already noted for the other cubins.
cpp/tensorrt_llm/kernels/trtllmGenKernels/fmha/cubin/FmhaSm100Kernel_QE4m3KvE2m1OE4m3H64PagedKvSlidingOrChunkedCausalP32VarSeqQ128Kv128PersistentContext_cubin.cpp (1)
2-3
: Persistent-context cubin updated – verify memory footprint.
Persistent-context kernels keep on-device buffers alive between launches. The 40 % size reduction is great, but please profile that the saved registers/shared-mem did not regress occupancy vs. the previous build.
cpp/tensorrt_llm/kernels/trtllmGenKernels/fmha/cubin/FmhaSm100Kernel_QE4m3KvE2m1OE4m3H64PagedKvSlidingOrChunkedCausalP64VarSeqQ16Kv128PersistentSwapsAbForGen_cubin.cpp (1)
2-3
: Binary-pointer updated – looks correct
Only the Git-LFS oid and size fields changed to reference the newly-built cubin. Format and placement remain valid; nothing else to review.
cpp/tensorrt_llm/kernels/trtllmGenKernels/fmha/cubin/FmhaSm100Kernel_QE4m3KvE2m1OE4m3H64PagedKvDenseP32VarSeqQ16Kv128PersistentSwapsAbForGen_cubin.cpp (1)
2-3
: Binary-pointer updated – looks correct
LFS pointer metadata refresh only. No issues spotted.
cpp/tensorrt_llm/kernels/trtllmGenKernels/fmha/cubin/FmhaSm100Kernel_QE4m3KvE2m1OE4m3H128PagedKvDenseP64MultiCtasKvCgaVarSeqQ8Kv128StaticSwapsAbForGen_cubin.cpp (1)
2-3
: Binary-pointer updated – looks correct
The oid and size entries were swapped for the new cubin blob; pointer syntax remains valid.
cpp/tensorrt_llm/kernels/trtllmGenKernels/fmha/cubin/FmhaSm100Kernel_QE4m3KvE2m1OE4m3H128PagedKvDenseP32MultiCtasKvVarSeqQ8Kv128StaticSwapsAbForGen_cubin.cpp (1)
2-3
: Binary-pointer updated – looks correct
No functional code touched aside from LFS metadata. Good to go.
cpp/tensorrt_llm/kernels/trtllmGenKernels/fmha/cubin/FmhaSm100Kernel_QE4m3KvE2m1OE4m3H128PagedKvSlidingOrChunkedCausalP32VarSeqQ16Kv128StaticSwapsAbForGen_cubin.cpp (1)
2-3
: Binary-pointer updated – looks correct
Change is limited to Git-LFS oid + size. No further review needed.
cpp/tensorrt_llm/kernels/trtllmGenKernels/fmha/cubin/FmhaSm100Kernel_QE4m3KvE2m1OE4m3H128PagedKvSlidingOrChunkedCausalP32MultiCtasKvCgaVarSeqQ16Kv128StaticSwapsAbForGen_cubin.cpp (1)
2-3
: Binary pointer update looks fine
Only the Git-LFS pointer hash/size changed. No source logic to review.
cpp/tensorrt_llm/kernels/trtllmGenKernels/fmha/cubin/FmhaSm100Kernel_QE4m3KvE2m1OE4m3H128PagedKvSlidingOrChunkedCausalP64MultiCtasKvVarSeqQ16Kv128StaticSwapsAbForGen_cubin.cpp (1)
2-3
: Binary pointer update looks fine
Only the Git-LFS pointer hash/size changed. No source logic to review.
cpp/tensorrt_llm/kernels/trtllmGenKernels/fmha/cubin/FmhaSm100Kernel_QE4m3KvE2m1OE4m3H128PagedKvSlidingOrChunkedCausalP64MultiCtasKvCgaVarSeqQ16Kv128StaticSwapsAbForGen_cubin.cpp (1)
2-3
: Binary pointer update looks fine
Only the Git-LFS pointer hash/size changed. No source logic to review.
cpp/tensorrt_llm/kernels/trtllmGenKernels/fmha/cubin/FmhaSm100Kernel_QE4m3KvE2m1OE4m3H128PagedKvDenseP64VarSeqQ8Kv128PersistentSwapsAbForGen_cubin.cpp (1)
2-3
: Binary pointer update looks fine
Only the Git-LFS pointer hash/size changed. No source logic to review.
cpp/tensorrt_llm/kernels/trtllmGenKernels/fmha/cubin/FmhaSm100Kernel_QE4m3KvE2m1OE4m3H128PagedKvSlidingOrChunkedCausalP32MultiCtasKvVarSeqQ8Kv128StaticSwapsAbForGen_cubin.cpp (1)
2-3
: Binary pointer update looks fine
Only the Git-LFS pointer hash/size changed. No source logic to review.
cpp/tensorrt_llm/kernels/trtllmGenKernels/fmha/cubin/FmhaSm100Kernel_QE4m3KvE2m1OE4m3H128PagedKvSlidingOrChunkedCausalP64VarSeqQ128Kv128StaticContext_cubin.cpp (1)
2-3
: Verify that the new cubin is picked up by the runtime & CI artifacts.
The SHA-256 and size have changed, but nothing in CMake / build scripts was updated in this PR. Please double-check that:
- The new oid is correctly staged in LFS (git lfs ls-files).
- Any internal hash maps / version tables that reference this cubin (e.g. in fmhaDispatcher.cpp) don’t hard-code the previous oid.
- CI runs an inference smoke-test on SM100 GPUs to ensure the kernel loads and produces identical numerical results.
cpp/tensorrt_llm/kernels/trtllmGenKernels/fmha/cubin/FmhaSm100Kernel_QE4m3KvE2m1OE4m3H64PagedKvSlidingOrChunkedCausalP32MultiCtasKvVarSeqQ8Kv128StaticSwapsAbForGen_cubin.cpp (1)
2-3
: Manual disassembly needed for SM100 cubin
The repository doesn’t contain a standalone .cubin artifact (it’s generated at build time), so we couldn’t automatically verify the presence of SM100 code objects. Please, after building the FMHA kernel, run:

nvdisasm --print-code path/to/FmhaSm100Kernel_QE4m3KvE2m1OE4m3H64PagedKvSlidingOrChunkedCausalP32MultiCtasKvVarSeqQ8Kv128StaticSwapsAbForGen.cubin | head -n 20

and confirm you still see the .sm_100 code objects in the output.
cpp/tensorrt_llm/kernels/trtllmGenKernels/fmha/cubin/FmhaSm100Kernel_QE4m3KvE2m1OE4m3H64PagedKvDenseP64VarSeqQ128Kv128StaticContext_cubin.cpp (1)
2-3
: Check that dependent descriptor tables were regenerated together.
This cubin participates in the dense-KV path. Any mismatch between its new TMA descriptors and the updated kernelParams.h will cause silent wrong results. Make sure kernelParamsUnitTest (if present) is re-run against this binary.
cpp/tensorrt_llm/kernels/trtllmGenKernels/fmha/cubin/FmhaSm100Kernel_QE4m3KvE2m1OE4m3H64PagedKvDenseP64MultiCtasKvCgaVarSeqQ8Kv128StaticSwapsAbForGen_cubin.cpp (1)
2-3
: Multi-CTA variant – validate scheduler compatibility.
Because this variant relies on CUDA cooperative groups, ensure the kernel was re-compiled with the same -arch=sm_100 and -rdc=false flags; otherwise, multi-CTA synchronization may break on some driver versions.
cpp/tensorrt_llm/kernels/trtllmGenKernels/fmha/cubin/FmhaSm100Kernel_QE4m3KvE2m1OE4m3H128PagedKvCausalP64VarSeqQ128Kv128StaticContext_cubin.cpp (1)
2-3
: Ensure causal-path perf parity after cubin shrink.
The file size dropped ~39 %. Please compare end-to-end latency of a causal attention benchmark before/after this change to catch accidental register-spilling regressions.
cpp/tensorrt_llm/kernels/trtllmGenKernels/fmha/cubin/FmhaSm100Kernel_QE4m3KvE2m1OE4m3H64PagedKvSlidingOrChunkedCausalP64VarSeqQ128Kv128PersistentContext_cubin.cpp (1)
2-3
: Binary kernel update for FP4 KV cache support.
The updated CUDA kernel binary reflects the coordinated recompilation to support NVFP4 KV cache quantization. The significant size reduction (from ~1.7MB to ~1MB) suggests optimization or updated compilation settings.
cpp/tensorrt_llm/kernels/decoderMaskedMultiheadAttention/decoderXQAImplJIT/decoderXQAImplJIT.cpp (1)
306-306
: Naming consistency fix for KV scale quantization parameter.
The field name change from
kvScaleOrigQuant
tokv_scale_orig_quant
aligns with the standardized snake_case naming convention for FP4 KV cache scale factor handling across the codebase.cpp/tensorrt_llm/kernels/trtllmGenKernels/fmha/cubin/FmhaSm100Kernel_QE4m3KvE2m1OE4m3H64PagedKvCausalP32VarSeqQ128Kv128PersistentContext_cubin.cpp (1)
2-3
: Binary kernel update for FP4 KV cache support.
The updated CUDA kernel binary is part of the comprehensive FP4 KV cache quantization support rollout. The significant size reduction (from ~1.4MB to ~875KB) indicates substantial optimization or updated compilation settings.
cpp/tensorrt_llm/kernels/trtllmGenKernels/fmha/cubin/FmhaSm100Kernel_QE4m3KvE2m1OE4m3H128PagedKvSlidingOrChunkedCausalP64MultiCtasKvVarSeqQ8Kv128StaticSwapsAbForGen_cubin.cpp (1)
2-3
: Binary kernel update for FP4 KV cache support.
The updated CUDA kernel binary continues the pattern of coordinated recompilation for NVFP4 KV cache quantization support. The size reduction (from ~1MB to ~715KB) aligns with the optimization trend observed across all kernel updates.
cpp/tensorrt_llm/kernels/fmhaDispatcher.cpp (3)
138-138
: Proper initialization of KV scale factor pool pointer.
The
kvSfPoolPtr
is correctly initialized tonullptr
for safe handling of FP4 KV cache scale factors.
148-148
: Conditional assignment of KV scale factor pool pointer.The assignment from
runnerParams.pagedKvSfCache.mPrimaryPoolPtr
is correctly placed within theQ_PAGED_KV
layout condition, ensuring it's only used for paged KV cache scenarios where FP4 scale factors are relevant.
169-169
: Proper propagation of KV scale factor pointer to runner.The
kvSfPtr
field assignment correctly passes the FP4 KV cache scale factor buffer to the FMHA runner parameters, completing the integration of FP4 KV cache support in the dispatcher.cpp/tensorrt_llm/kernels/trtllmGenKernels/fmha/cubin/FmhaSm100Kernel_QE4m3KvE2m1OE4m3H64PagedKvSlidingOrChunkedCausalP32MultiCtasKvCgaVarSeqQ8Kv128StaticSwapsAbForGen_cubin.cpp (1)
2-3
: CUDA kernel binary updated for FP4 KV cache support.The kernel binary has been recompiled with a 33% size reduction (887,191 → 583,642 bytes), likely reflecting optimizations for the new FP4 KV cache quantization mode support.
cpp/tensorrt_llm/kernels/trtllmGenKernels/fmha/cubin/FmhaSm100Kernel_QE4m3KvE2m1OE4m3H64PagedKvDenseP64MultiCtasKvVarSeqQ8Kv128StaticSwapsAbForGen_cubin.cpp (1)
2-3
: CUDA kernel binary updated for FP4 KV cache support.The kernel binary has been recompiled with a 33% size reduction (873,881 → 579,976 bytes), consistent with the optimization pattern across FP4-related kernel updates.
cpp/tensorrt_llm/kernels/trtllmGenKernels/fmha/cubin/FmhaSm100Kernel_QE4m3KvE2m1OE4m3H64PagedKvCausalP64VarSeqQ128Kv128StaticContext_cubin.cpp (1)
2-3
: Context CUDA kernel binary updated with significant optimization.
The context kernel binary shows a 39% size reduction (1,397,181 → 853,331 bytes), indicating substantial optimizations for FP4 KV cache support in context attention scenarios.
cpp/tensorrt_llm/common/attentionOp.h (1)
95-96
: LGTM! Properly implemented FP4 KV cache scale pool pointers.
The new pointer members are correctly initialized to
nullptr
and follow the consistent naming convention with existing pool pointer members.cpp/tensorrt_llm/kernels/trtllmGenKernels/fmha/cubin/FmhaSm100Kernel_QE4m3KvE2m1OE4m3H128PagedKvDenseP32MultiCtasKvCgaVarSeqQ16Kv128StaticSwapsAbForGen_cubin.cpp (1)
2-3
: CUDA kernel binary updated for FP4 KV cache support.
The H128 kernel binary shows a 33% size reduction (1,118,485 → 748,905 bytes), completing the systematic kernel recompilation for FP4 KV cache quantization support.
cpp/tensorrt_llm/kernels/trtllmGenKernels/fmha/cubin/FmhaSm100Kernel_QE4m3KvE2m1OE4m3H64PagedKvDenseP32VarSeqQ128Kv128StaticContext_cubin.cpp (1)
2-3
: Binary artifact refreshed — please run smoke-tests on real HW
We can’t statically review
.cubin
contents.
Before merging, ensure the new build (hash19e710…
, size852 663
) is:
• Executed on at least one SM100 board with end-to-end attention ops.
• Free of regressions vs. the previous cubin (perf / accuracy / crashes).
• Produced with the same compile flags & PTX ISA version expected by the runtime loader.
If CI does not already validate this, please trigger a short “FP4 KV-cache” generation test-suite.
cpp/tensorrt_llm/kernels/trtllmGenKernels/fmha/cubin/FmhaSm100Kernel_QE4m3KvE2m1OE4m3H128PagedKvSlidingOrChunkedCausalP32VarSeqQ8Kv128PersistentSwapsAbForGen_cubin.cpp (1)
2-3
: Validate cubin fingerprint & runtime load path
New LFS pointer (sha256 = 7d9e853d…) replaces the previous kernel image.
Please make sure that:
- The loader side (FMHA runner / dispatcher) is not doing any hard-coded size or checksum validation that would now fail.
- CI executes at least one end-to-end inference that exercises this kernel on an SM100 board to guarantee the rebuilt cubin is actually loadable.
If neither check is in place, add a minimal smoke-test; otherwise confirm the existing tests pass.
cpp/tensorrt_llm/kernels/trtllmGenKernels/fmha/cubin/FmhaSm100Kernel_QE4m3KvE2m1OE4m3H64PagedKvSlidingOrChunkedCausalP32MultiCtasKvCgaVarSeqQ16Kv128StaticSwapsAbForGen_cubin.cpp (1)
2-3
: Static-swap cubin updated – verify CI still covers this code path
This variant is only hit with multi-CTA static-swap configs. Double-check the unit/functional tests still include that scenario; otherwise we risk shipping an untested cubin.
cpp/tensorrt_llm/kernels/trtllmGenKernels/fmha/cubin/FmhaSm100Kernel_QE4m3KvE2m1OE4m3H128PagedKvSlidingOrChunkedCausalP64VarSeqQ8Kv128StaticSwapsAbForGen_cubin.cpp (1)
2-3
: Check mixed P64 / H128 path after cubin refresh
This H128 / P64 kernel is critical for large-head models. Please verify the new cubin was compiled with the same max-registers limit; regressions there often surface only at high batch sizes.
cpp/tensorrt_llm/kernels/trtllmGenKernels/fmha/cubin/FmhaSm100Kernel_QE4m3KvE2m1OE4m3H128PagedKvSlidingOrChunkedCausalP32MultiCtasKvCgaVarSeqQ8Kv128StaticSwapsAbForGen_cubin.cpp (1)
1-4
: Binary kernel update for FP4 KV Cache support.
The updated CUDA kernel binary shows a significant size reduction (34%) and a new SHA-256 hash, indicating substantial changes to support NVFP4 KV Cache. The filename suggests specific support for the E2M1 (FP4) format.
Since this is a binary artifact, please ensure comprehensive testing coverage for the new FP4 KV cache functionality to validate correctness and performance.
cpp/tensorrt_llm/kernels/trtllmGenKernels/fmha/cubin/FmhaSm100Kernel_QE4m3KvE2m1OE4m3H64PagedKvSlidingOrChunkedCausalP64MultiCtasKvCgaVarSeqQ8Kv128StaticSwapsAbForGen_cubin.cpp (1)
1-4
: Consistent binary kernel update for FP4 KV Cache support.
This kernel binary also shows a ~34% size reduction and updated hash, consistent with the systematic kernel updates for NVFP4 KV Cache support. The H64 configuration complements the H128 variant from the previous file.
The consistent size reduction pattern across kernel binaries suggests systematic optimization. Ensure testing covers both H64 and H128 head configurations.
cpp/tensorrt_llm/kernels/trtllmGenKernels/fmha/fmhaRunner.cpp (1)
40-42
: DATA_TYPE_E2M1 definition and usage confirmed
Verified that the new FP4 data type is declared and consistently referenced across the codebase:
- Defined in
multiHeadAttentionCommon.h
(enum Data_type
) and infused_multihead_attention_utils.h
- Used in FMHA runner validation (
fmhaRunner.cpp
), kernel parameters (kernelParams.h
), dispatcher (xqaDispatcher.cpp
), and attention operator (common/attentionOp.cpp
)No further changes required.
cpp/tensorrt_llm/pybind/batch_manager/kvCacheManager.cpp (1)
358-369
: Well-implemented Python binding for block scale pool pointers.
The new
get_block_scale_pool_pointers()
method follows the established pattern used by similar methods likeget_block_pool_pointers()
. The implementation correctly handles optional tensor conversion and maintains consistency with the existing codebase.cpp/tensorrt_llm/pybind/bindings.cpp (2)
160-160
: LGTM: Clean addition of NVFP4 data type enum.
The new enum value correctly maps
"NVFP4"
tonvinfer1::DataType::kFP4
, enabling Python access to the FP4 data type for KV cache quantization.
239-239
: LGTM: Proper exposure of FP4 KV cache capability.
The new read-only property
has_fp4_kv_cache
correctly exposes the C++ methodhasFp4KvCache()
, allowing Python code to query FP4 KV cache quantization mode support.cpp/tensorrt_llm/kernels/trtllmGenKernels/fmha/cubin/FmhaSm100Kernel_QE4m3KvE2m1OE4m3H128PagedKvSlidingOrChunkedCausalP32VarSeqQ8Kv128StaticSwapsAbForGen_cubin.cpp (1)
2-3
: Binary kernel artifact updated for FP4 support.
The CUDA kernel binary has been updated with a new SHA-256 hash and reduced size (945,675 → 602,808 bytes). This aligns with the comprehensive FP4 KV cache support being added throughout the codebase. Since this is a binary artifact, the actual kernel implementation changes cannot be reviewed at the source level.
cpp/tensorrt_llm/kernels/unfusedAttentionKernels/unfusedAttentionKernels_2_template.h (5)
584-584
: LGTM! Consistent variable renaming.
The renaming from
kvScaleOrigQuant
tokv_scale_orig_quant
follows consistent naming conventions and maintains the same null-checking logic with appropriate fallback value.
636-639
: LGTM! Consistent renaming in FP4 cache handling.
The variable renaming is applied consistently in the FP4 cache quantization logic while preserving the fallback mechanism between
kv_cache_scale_factors
andkv_scale_orig_quant
.
1002-1003
: LGTM! Consistent variable renaming in V2 kernel.
The variable renaming is consistently applied in the
applyBiasRopeUpdateKVCacheV2
kernel function, maintaining the same null-checking logic and fallback behavior.
1059-1062
: LGTM! Consistent variable renaming in V2 FP4 handling.
The variable renaming is consistently applied with proper null checking and fallback logic in the FP4 cache quantization section of the V2 kernel.
1428-1428
: LGTM! Consistent variable renaming in cross-attention kernel.
The variable renaming follows the established pattern with proper null checking and fallback value in the cross-attention kernel function.
cpp/tensorrt_llm/kernels/trtllmGenKernels/fmha/fmhaRunnerParams.h (1)
195-196
: LGTM! Structural improvement for FP4 KV cache support.
The addition of
kvSfPtr
as a packed KV scaling factor buffer consolidates the scaling factor handling, which aligns well with the FP4 KV cache quantization support being introduced in this PR. This unified approach should simplify the parameter passing and improve maintainability.
cpp/tensorrt_llm/kernels/unfusedAttentionKernels.h (1)
127-127
: LGTM! Improved naming consistency.
The renaming from
kvScaleOrigQuant
tokv_scale_orig_quant
follows the snake_case naming convention, which improves consistency across the codebase. ThetoString()
method has been correctly updated to reflect this change.Also applies to: 242-242
cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp (3)
611-620
: LGTM! Proper FP4 container packing implementation.
The code correctly handles the FP4 data type, where two 4-bit elements are packed into one 8-bit container. The divisibility check ensures proper alignment, and the pool size calculation is appropriately adjusted.
Also applies to: 631-631
716-721
: LGTM! Consistent FP4 handling in block scale pools.
The code maintains consistency with the FP4 packing scheme by properly accounting for
numEltsPerContainer
in both the divisibility check and the block scale size calculation.
Also applies to: 726-729
751-765
: LGTM! Appropriate FP4 storage representation.
The code correctly represents FP4 data as INT8 in memory pools, which aligns with the packing scheme where two 4-bit FP4 values are stored in one 8-bit container.
cpp/tensorrt_llm/kernels/xqaDispatcher.h (1)
81-82
: All XqaDispatcher::run
call sites have been updated to pass three buffers.
Both occurrences in
attentionOp.cpp
now call:
- Line 1209:
mXqaDispatcher->run(xqaParams, kv_cache_buffer, kv_scale_cache_buffer);
- Line 2218:
mXqaDispatcher->run(xqaParams, kv_cache_buffer, kv_scale_cache_buffer);
No two-argument invocations remain.
cpp/tensorrt_llm/kernels/xqaDispatcher.cpp (7)
52-54
: LGTM! Correct extension for FP4 KV cache support.
The modification properly extends the data type check to include
DATA_TYPE_E2M1
(FP4) alongside the existing FP8 support, maintaining the same conversion logic.
172-173
: Good generalization of the data type check.
The change from hardcoded
DATA_TYPE_E4M3
tomFixedParams.mathDataType
makes the check more flexible and maintainable for supporting different quantization modes.
245-257
: LGTM! Proper cache type determination for NVFP4.
The logic correctly extends the cache type detection to include the FP4 KV cache quantization mode.
313-317
: Consistent scale pointer handling for FP4/FP8.The code correctly sets up scale pointers when either FP8 or FP4 KV cache is enabled. Note the member name change from
kvScaleOrigQuant
tokv_scale_orig_quant
for consistency.
330-332
: Clear documentation for NVFP4-specific behavior.Good addition of explanatory comments about the second-level scale factors for NVFP4 KV cache.
449-461
: Correct parameter propagation in KVLinearBuffer overload.The new
kv_cache_block_scales_buffer
parameter is properly added and forwarded torunImpl
.
464-476
: Correct parameter propagation in KVBlockArray overload.The new
kv_cache_block_scales_buffer
parameter is consistently added and forwarded torunImpl
.cpp/tensorrt_llm/thop/attentionOp.cpp (2)
256-279
: Well-implemented FP4 block scale pool handling. The code correctly:
- Checks for FP4 KV cache mode
- Validates required pointers with TLLM_CHECK
- Calculates proper offsets using vector_size=16 and 1 byte per E4M3 scale factor
- Assigns pointers with correct arithmetic
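The scale-pool sizing behind those offsets can be restated compactly. This is a rough sketch under the reviewer's stated assumptions (vector_size=16, one E4M3 scale byte per vector); the function name is illustrative, not the actual implementation.

```python
# One E4M3 scale factor (1 byte) covers each vector of 16 FP4 elements, so
# the block-scale pool holds 1/16th as many entries as the data it quantizes.
VECTOR_SIZE = 16        # FP4 elements covered by one scale factor
SCALE_FACTOR_BYTES = 1  # one E4M3 scale per vector


def block_scale_bytes(tokens_per_block: int, num_kv_heads: int,
                      size_per_head: int) -> int:
    kv_elems = 2 * tokens_per_block * num_kv_heads * size_per_head  # K and V
    assert kv_elems % VECTOR_SIZE == 0
    return (kv_elems // VECTOR_SIZE) * SCALE_FACTOR_BYTES


print(block_scale_bytes(64, 8, 128))  # 8192
```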
783-783
: Correct Torch binding update for FP4 support. The new host_kv_cache_block_scale_pool_pointers parameter is properly added to the Torch library fragment definition.
cpp/tensorrt_llm/kernels/trtllmGenKernels/fmha/kernelParams.h (5)
395-432
: Well-designed TMEM storage mode for FP4 K/V data. The implementation correctly handles two storage modes:
- Standard mode: FP4 elements packed 2 per byte (divide by colIdxDivisor)
- TMEM mode: dimensions reflect individual FP4 elements (no division)
The conditional logic properly adjusts TMA descriptors based on the storage mode.
475-505
: Correct bit-level pointer arithmetic for FP4 support. The conversion from byte-based to bit-based calculations properly handles variable bit widths, with correct division by 8 to convert back to bytes for pointer arithmetic.
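The idea behind the bit-based calculation can be shown in isolation. A minimal sketch, assuming offsets are accumulated in bits so 4-bit, 8-bit, and wider elements share one code path; the helper name is hypothetical.

```python
# Offsets are computed in bits and only divided by 8 at the final pointer
# step, so FP4 (4-bit) elements need no special-cased byte math.
def byte_offset_for_element(elem_index: int, elem_bits: int) -> int:
    bit_offset = elem_index * elem_bits
    # The final offset must land on a byte boundary to form a valid pointer.
    assert bit_offset % 8 == 0, "offset must be byte-aligned"
    return bit_offset // 8


print(byte_offset_for_element(32, 4))  # 16 -> FP4 packs two elements per byte
print(byte_offset_for_element(32, 8))  # 32
```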
521-552
: Excellent support for packed and unpacked FP4 formats. The implementation correctly handles:
- Packed FP4 format using CU_TENSOR_MAP_DATA_TYPE_UINT8
- Unpacked FP4 format using CU_TENSOR_MAP_DATA_TYPE_16U4_ALIGN16B for TMEM
- Appropriate swizzle type selection for each format
677-680
: Hardware-specific TMEM optimization for FP4. The detection logic for storeTransformedKvInTmem requires very specific conditions:
- FP4 KV data type (E2M1)
- FP8 Q data type (E4M3)
- Head dimension of 128
- Kernel uses swapped MMA operations
Please verify these specific hardware requirements are documented and that the optimization is beneficial for the targeted hardware configurations.
720-726
: Good unification of KV scaling factor pointers. The change from separate K/V scale pointers to a unified kvSfPtr simplifies the interface while maintaining functionality.
214-217
: LGTM! The FP4 KV cache data type handling is correctly implemented following the established pattern for other quantization modes.
931-932
: Placeholder for future MLA support is appropriate. The comment clearly indicates that NVFP4 KV cache support for MLA is not yet implemented, and the empty placeholder maintains API consistency.
1285-1299
: Correct element size calculation for different quantization modes. The refactored code properly handles bit widths for FP4 (4 bits), Int8/FP8 (8 bits), and full precision, with accurate conversion to bytes in the final calculation.
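The per-token sizing the review describes reduces to one formula. This is a Python re-statement for illustration, assuming elemBits of 4 for FP4, 8 for INT8/FP8, and 16/32 for full precision; the names mirror the C++ but are not the actual code.

```python
# sizePerToken = numKVHeads * headSize * elemBits / 8, computed in bits so
# sub-byte FP4 elements are handled uniformly with wider types.
def size_per_token_bytes(num_kv_heads: int, head_size: int,
                         elem_bits: int) -> int:
    total_bits = num_kv_heads * head_size * elem_bits
    assert total_bits % 8 == 0
    return total_bits // 8  # bits -> bytes


print(size_per_token_bytes(8, 128, 4))  # 512 (FP4)
print(size_per_token_bytes(8, 128, 8))  # 1024 (FP8/INT8)
```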
1308-1316
: Scale buffer initialization for FP4 KV cache is properly implemented. The code correctly initializes the scale buffer only when FP4 KV cache is enabled, using appropriate pool pointers and size calculation (1/8th of the main buffer size for scale factors).
1324-1324
: Important validation for FP4 KV cache constraints. The check correctly enforces that FP4 KV cache only supports paged KV mode, which is likely required for efficient scale factor management at the block level.
1522-1535
: Cache type determination correctly includes NVFP4. The logic properly sets the cache type based on the quantization mode, with NVFP4 correctly added for FP4 KV cache.
1583-1599
: Preprocessing parameters properly extended for FP4 scale factors. The code correctly adds the scale buffer and documents the two-level scaling approach, with kv_scale_orig_quant as a fallback when separate K/V scales are not provided.
1749-1749
: Scale cache correctly passed to FMHA runner. The pagedKvSfCache parameter is properly set to enable FMHA kernels to access FP4 scale factors.
2127-2134
: Scale buffer initialization in the generation phase is consistent. The code correctly initializes the FP4 scale buffer following the same pattern as the context phase.
2142-2142
: Consistent FP4 validation in the generation phase. The validation correctly enforces the paged KV requirement for FP4 cache in the generation phase.
2209-2209
: Scale cache properly passed to XQA dispatcher. The XQA dispatcher correctly receives the scale cache buffer for FP4 quantized attention computation.
2235-2238
: Clear error messaging for unsupported FP4 KV cache. The error check provides helpful feedback when FP4 KV cache is enabled but no supporting kernels are available.
2504-2507
: Appropriate validation of FP4 KV cache requirements. The check correctly enforces that FP4 KV cache requires FP8 context FMHA, ensuring the necessary infrastructure is available.
2584-2587
: FMHA data type correctly configured for FP4 KV cache. The code properly sets the KV data type to E2M1 when FP4 KV cache is enabled, ensuring FMHA kernels use the correct format.
2789-2793
: XQA parameters correctly configured for FP4 KV cache. The code appropriately sets the KV data type to E2M1 (FP4) for storage and the math data type to E4M3 (FP8) for computation, balancing storage efficiency with computational accuracy.
cpp/tensorrt_llm/kernels/trtllmGenKernels/fmha/cubin/FmhaSm100Kernel_QE4m3KvE2m1OE4m3H128PagedKvDenseP64VarSeqQ128Kv128PersistentContext_cubin.cpp (1)
2-3
: LGTM: Kernel binary updated for FP4 KV cache support. The updated Git LFS pointer indicates successful recompilation of the FMHA kernel binary with FP4 support. The significant size reduction (from 1,635,271 to 984,837 bytes) suggests optimizations were applied during recompilation.
cpp/tensorrt_llm/kernels/trtllmGenKernels/fmha/cubin/FmhaSm100Kernel_QE4m3KvE2m1OE4m3H64PagedKvDenseP64VarSeqQ8Kv128PersistentSwapsAbForGen_cubin.cpp (1)
2-3
: LGTM: Kernel binary optimized for FP4 KV cache. The updated LFS pointer shows successful kernel recompilation with size optimization. This aligns with the broader FP4 KV cache implementation across the kernel stack.
cpp/tensorrt_llm/kernels/trtllmGenKernels/fmha/cubin/FmhaSm100Kernel_QE4m3KvE2m1OE4m3H64PagedKvDenseP32VarSeqQ128Kv128PersistentContext_cubin.cpp (1)
2-3
: LGTM: Consistent kernel recompilation for FP4 support. The binary update maintains consistency with other kernel recompilations in this PR, supporting the new FP4 KV cache quantization features.
cpp/tensorrt_llm/kernels/trtllmGenKernels/fmha/cubin/FmhaSm100Kernel_QE4m3KvE2m1OE4m3H64PagedKvDenseP32MultiCtasKvVarSeqQ8Kv128StaticSwapsAbForGen_cubin.cpp (1)
2-3
: LGTM: Final kernel binary completed for the FP4 implementation. This completes the set of kernel binary updates with another significant size reduction, indicating successful optimization during recompilation for FP4 KV cache support.
cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h (2)
499-499
: LGTM: Correct blockSize calculation for FP4 support. Removing the division by quantSize is correct for FP4 data types, where the element count per container is managed explicitly rather than by adjusting block size through division. This change aligns with the updated FP4 handling in the corresponding implementation file.
1254-1254
: LGTM: Essential interface addition for FP4 block scale management. The new getBlockScalePoolPointers() pure virtual method is necessary for FP4 KV cache quantization support, providing access to block scale pool pointers required for dequantization. This interface addition is properly designed and integrates well with the existing architecture.
cpp/tensorrt_llm/kernels/contextFusedMultiHeadAttention/fused_multihead_attention_common.h
cpp/tensorrt_llm/kernels/unfusedAttentionKernels/unfusedAttentionKernels_2_template.h
/bot run |
PR_Github #12513 [ run ] triggered by Bot |
@kaiyux Hi, could you please add appropriate reviewers, as this PR involves changes to the KV cache manager. |
PR_Github #12513 [ run ] completed with state |
Force-pushed aacf6c6 to a477bff (Compare)
/bot run |
Actionable comments posted: 0
🧹 Nitpick comments (3)
cpp/tensorrt_llm/common/attentionOp.cpp (3)
931-932
: Placeholder for future MLA support. The comment clearly indicates that NVFP4 KV cache support for MLA is not yet implemented. Consider tracking this as a TODO item for future implementation.
Would you like me to create an issue to track the implementation of NVFP4 KV cache support for MLA?
1283-1299
: Logic is correct; minor comment clarification needed. The element size calculation correctly handles different quantization modes. Consider clarifying the comment on line 1299:
- auto sizePerToken = mNumKVHeads * headSize * elemBits / 8 /*bits*/;
+ auto sizePerToken = mNumKVHeads * headSize * elemBits / 8 /*bits to bytes*/;
2093-2110
: Consistent element size calculation for the generation phase. The logic matches the context phase implementation. Consider the same comment clarification:
- auto const sizePerToken = mNumKVHeads * headSize * elemBits / 8 /*bits*/;
+ auto const sizePerToken = mNumKVHeads * headSize * elemBits / 8 /*bits to bytes*/;
📜 Review details
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (5)
cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h
(3 hunks)cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp
(3 hunks)cpp/tensorrt_llm/common/attentionOp.cpp
(18 hunks)cpp/tensorrt_llm/common/attentionOp.h
(1 hunks)cpp/tensorrt_llm/kernels/contextFusedMultiHeadAttention/fused_multihead_attention_common.h
(1 hunks)
✅ Files skipped from review due to trivial changes (2)
- cpp/tensorrt_llm/common/attentionOp.h
- cpp/tensorrt_llm/kernels/contextFusedMultiHeadAttention/fused_multihead_attention_common.h
🚧 Files skipped from review as they are similar to previous changes (2)
- cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp
- cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h
🧰 Additional context used
📓 Path-based instructions (2)
**/*.{cpp,h,hpp,cc,cxx}
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
**/*.{cpp,h,hpp,cc,cxx}
: Closing braces of namespaces should have a comment saying the namespace it closes (e.g., } // namespace foo).
Prefer const or constexpr variables over #defines whenever possible.
A variable that is not modified after its initialization should be declared as const.
Except 0 (used for checking signness/existence/emptiness), nullptr, true, false, all other literals should only be used for variable initialization.
Use the Allman indentation style for braces in C++ code.
Put the semicolon for an empty for or while loop in a new line.
The statement forming the body of a switch, while, do..while, or for statement shall be a compound statement (use brace-delimited statements).
If and else should always be followed by brace-delimited statements, even if empty or a single statement.
C++ filenames should use camel case with the first letter lowercase (e.g., thisIsAFilename.cpp), and all files involved in a compilation target must have case-insensitive unique filenames.
All types (including class names) should use camel case with uppercase first letter (e.g., FooBarClass).
Local variables, methods, and namespaces should use camel case with first letter lowercase (e.g., localFooBar).
Non-magic-number global variables that are non-static and not defined in anonymous namespace should use camel case prefixed by 'g' (e.g., gDontUseGlobalFoos).
Non-magic-number global variables that are static or defined in an anonymous namespace should use camel case prefixed by 's' (e.g., sMutableStaticGlobal).
Locally visible static variables should use camel case with lowercase prefix 's' as the first letter (e.g., static std::once_flag sFlag;).
Class member variables should use camel case prefixed with 'm' (e.g., mNbFooValues). Public member variables do not require the 'm' prefix but it is encouraged for clarity.
Enumerations, global constants, static constants at class-scope, and function-scope magic-number/literal constants should be uppercase snake case with prefix...
Files:
cpp/tensorrt_llm/common/attentionOp.cpp
**/*.{cpp,h,hpp,cc,cxx,cu,py}
📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)
All TensorRT-LLM Open Source Software code should contain an NVIDIA copyright header that includes the current year. This includes .cpp, .h, .cu, .py, and any other source files which are compiled or interpreted.
Files:
cpp/tensorrt_llm/common/attentionOp.cpp
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Pre-commit Check
🔇 Additional comments (12)
cpp/tensorrt_llm/common/attentionOp.cpp (12)
215-218
: LGTM! The FP4 KV cache support is correctly added with the appropriate data type DATA_TYPE_E2M1.
1309-1316
: Correct implementation of FP4 scale buffer allocation. The scale buffer is appropriately sized at 1/8th of the main KV cache buffer and uses dedicated pool pointers for memory management.
1324-1325
: Good validation for FP4 KV cache constraints. The runtime check correctly enforces that FP4 KV cache only supports paged KV cache mode.
1464-1465
: Proper configuration for FP4 scale separation. The decoder parameters correctly configure separate QKV scales when FP4 KV cache is enabled.
1523-1535
: Correct cache type enumeration for FP4. The cache type is properly set to KvCacheDataType::NVFP4 when FP4 KV cache is enabled.
1574-1576
: Preprocessing parameters correctly extended for FP4. The QKV preprocessing parameters properly include the scale buffer for FP4 KV cache support.
Also applies to: 1584-1584, 1597-1597
1747-1747
: FMHA parameters properly configured for FP4 scale cache. The paged KV scale cache buffer is correctly passed to the FMHA runner.
2121-2128
: Consistent scale buffer allocation in the generation phase. The FP4 scale buffer allocation matches the context phase implementation appropriately.
2136-2136
: Proper runtime validation and XQA integration for FP4. The code correctly:
- Enforces the paged KV requirement for FP4
- Passes the scale buffer to the XQA dispatcher
- Provides a clear error when FP4 kernels are unavailable
Also applies to: 2208-2208, 2225-2228
2495-2496
: Important dependency check for FP4 KV cache. The requirement that FP8 context FMHA must be enabled for FP4 KV cache is properly enforced.
2574-2577
: FMHA correctly configured for the FP4 KV data type. The KV data type is properly set to DATA_TYPE_E2M1 for FP4 KV cache in the FMHA configuration.
2779-2783
: XQA dispatcher properly configured for FP4. The configuration correctly uses DATA_TYPE_E2M1 for storage and DATA_TYPE_E4M3 for math operations, balancing storage efficiency with computational precision.
PR_Github #14368 [ run ] triggered by Bot |
/bot run |
PR_Github #14370 [ run ] triggered by Bot |
PR_Github #14368 [ run ] completed with state |
PR_Github #14370 [ run ] completed with state |
/bot run |
PR_Github #14374 [ run ] triggered by Bot |
PR_Github #14374 [ run ] completed with state |
@Tom-Zheng can you describe what is the expected behavior of nvfp4 kv cache in the context phase (when using chunked prefill/kv cache reuse) ? are they compatible ? thanks. |
The behavior regarding chunked prefill/KV cache reuse is the same as FP8 KV, on both the kernel side and the KV cache manager side. For the KV cache manager, there is another pool (the block scale pool) to store the scaling factors. It has the same structure as the data pool, and it is allocated/copied together with the data pool. |
okay, thanks for confirming, so the nvfp4 is fully supported in both context and generation phase. |
LGTM
LGTM on the llmapi changes
Signed-off-by: Tian Zheng <[email protected]>
Hi @Tom-Zheng, for NVFP4 KV, does it mean that the attention (Q@K, A@V) is calculated with FP4 TMA? |
Math is in FP8 for now. |
Hi @Tom-Zheng , does trt-llm support FP4 Gemm in fmha? |
Signed-off-by: Tian Zheng <[email protected]>
Not yet. |
/bot run |
PR_Github #16936 [ run ] triggered by Bot |
@Tom-Zheng Hi Tian, is there any plan to prepare a doc for this NVFP4 KV Cache feature? Also, in the doc I would expect a clear introduction to the interactions of this feature with other features, such as:
This requirement is stricter than before, since we are trying to make sure new features contributed to TensorRT-LLM have coherent interactions with the existing features to reduce future tech debt. You can also reach out to @QiJune and @nv-guomingz so they can provide the needed support. Another question: is there any expected perf speedup with this PR for certain scenarios? |
Sure, I'll work with the team to add the documentation. The plan is to get this PR merged, then create another PR for documentation. Does that work?
Long-sequence (>4K) decode phase should see some speedup.
Signed-off-by: Tian Zheng <[email protected]>
Actionable comments posted: 2
♻️ Duplicate comments (1)
tensorrt_llm/_torch/pyexecutor/resource_manager.py (1)
766-773
: Same ‘self’ undefined bug in the remaining-layers path. Mirror the fix here to resolve Ruff F821.
Apply:
- if dtype == DataType.NVFP4:
-     cache_size_bytes_per_token += self.calculate_scaling_factor_size_bytes(
-         cache_size_per_token, quant_vector_size=16, scaling_factor_dtype=DataType.FP8)
+ if dtype == DataType.NVFP4:
+     cache_size_bytes_per_token += KVCacheManager.calculate_scaling_factor_size_bytes(
+         cache_size_per_token, quant_vector_size=16, scaling_factor_dtype=DataType.FP8)
🧹 Nitpick comments (1)
tensorrt_llm/_torch/pyexecutor/resource_manager.py (1)
1-1
: Add NVIDIA copyright header. Per coding guidelines, prepend the current-year NVIDIA copyright header to this Python file.
Apply:
+# Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved.
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (1)
tensorrt_llm/_torch/pyexecutor/resource_manager.py
(6 hunks)
🧰 Additional context used
📓 Path-based instructions (3)
**/*.{cpp,cc,cxx,cu,py,h,hpp,hh,hxx,cuh}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
Use spaces only; indent 4 spaces
Files:
tensorrt_llm/_torch/pyexecutor/resource_manager.py
**/*.py
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
**/*.py
: Python code must target Python 3.8+
Indent with 4 spaces; do not use tabs
Preserve module namespaces in imports; prefer from package.subpackage import module then module.Symbol
Python file names use snake_case (e.g., some_file.py)
Class names use PascalCase
Function and method names use snake_case
Local variable names use snake_case; if starting with a number, prefix k (e.g., k_99th_percentile)
Global variables use G_ prefix and UPPER_SNAKE_CASE (e.g., G_MY_GLOBAL)
Constants use UPPER_SNAKE_CASE
Avoid shadowing variables from an outer scope
Initialize all externally visible members of a class in init
For interfaces used outside a file, prefer docstrings; reserve comments for internal code or local interfaces
Use Google-style docstrings for classes and functions; document attributes/variables inline as shown
Avoid reflection when simple, explicit code suffices (e.g., prefer def make_complex(x,y) over locals()/dict tricks)
Catch the narrowest exceptions possible in try/except
For duck-typing try/except, keep try body minimal and use else for main logic
Files:
tensorrt_llm/_torch/pyexecutor/resource_manager.py
**/*.{cpp,cc,cxx,cu,h,hpp,hh,hxx,cuh,py}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
Prepend NVIDIA copyright header (current year) to all source files (.cpp, .h, .cu, .py, etc.)
Files:
tensorrt_llm/_torch/pyexecutor/resource_manager.py
🧠 Learnings (2)
📚 Learning: 2025-08-14T21:04:50.248Z
Learnt from: thorjohnsen
PR: NVIDIA/TensorRT-LLM#6910
File: cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:0-0
Timestamp: 2025-08-14T21:04:50.248Z
Learning: In KV cache onboarding logic during prefill in cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp, when calculating which blocks fall within the attention window, use getTokensPerBlock() to advance token indices rather than block->getUniqueTokens().size(), because the calculation needs to consider the post-prefill state where blocks will be filled to capacity, not their current token count.
Applied to files:
tensorrt_llm/_torch/pyexecutor/resource_manager.py
📚 Learning: 2025-08-26T06:07:02.166Z
Learnt from: shaharmor98
PR: NVIDIA/TensorRT-LLM#7231
File: tensorrt_llm/_torch/pyexecutor/_util.py:504-509
Timestamp: 2025-08-26T06:07:02.166Z
Learning: In tensorrt_llm/_torch/pyexecutor/_util.py, when calling model_engine.set_lora_model_config(), pass model_binding_config.mlp_hidden_size directly without multiplying by mapping.tp_size, as the mlp_hidden_size from get_bindings_model_config() is already the per-TP rank value needed for LoRA weight packaging.
Applied to files:
tensorrt_llm/_torch/pyexecutor/resource_manager.py
🧬 Code graph analysis (1)
tensorrt_llm/_torch/pyexecutor/resource_manager.py (1)
tensorrt_llm/_utils.py (4)
get_size_in_bytes (210-213), nvtx_range (850-869), dtype (942-943), dtype (950-960)
🪛 Ruff (0.12.2)
tensorrt_llm/_torch/pyexecutor/resource_manager.py
736-736: Undefined name self
(F821)
769-769: Undefined name self
(F821)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Pre-commit Check
🔇 Additional comments (4)
tensorrt_llm/_torch/pyexecutor/resource_manager.py (4)
17-17
: Import updates look good. The switch to get_size_in_bytes/nvtx_range is appropriate.
536-543
: Confirm scale dtype choice for NVFP4. You assume FP8 scales. If the kernel/manager uses a different scale dtype (e.g., uint8/log or per-block metadata), the size math must match it.
Would you confirm the scale dtype via the C++ API (get_block_scale_pool_pointers element type) and update scaling_factor_dtype if needed?
532-543
: calculate_scaling_factor_size_bytes isn’t static; adjust or keep the instance call. The NVFP4→4 mapping is already present in _binding_dtype_bits. Either remove the suggestion to call via KVCacheManager.calculate_scaling_factor_size_bytes (keep self.calculate_scaling_factor_size_bytes), or refactor the helper to a true @staticmethod (drop the self parameter) before invoking it on the class. Likely an incorrect or invalid review comment.
350-357
: Ignore refactor: stacking block-scale pointers into kv_cache_pool_pointers is intentional, and downstream kernels handle both shapes (see attentionOp.cpp lines 271 and 292). Likely an incorrect or invalid review comment.
Signed-off-by: Tian Zheng <[email protected]>
Signed-off-by: Tian Zheng <[email protected]>
/bot run |
PR_Github #16954 [ run ] triggered by Bot |
PR_Github #16936 [ run ] completed with state |
Actionable comments posted: 3
♻️ Duplicate comments (1)
tensorrt_llm/_torch/pyexecutor/resource_manager.py (1)
511-518
: Fix NVFP4 scale-bytes math: use ceil grouping, not divisibility + floor. cache_size may not be a multiple of quant_vector_size. The current assert is too strict and undercounts when the size is not divisible.
Apply:
- @staticmethod
- def calculate_scaling_factor_size_bytes(
-     cache_size: int, quant_vector_size: int,
-     scaling_factor_dtype: DataType) -> int:
-     assert cache_size % quant_vector_size == 0, "NVFP4 cache size must be divisible by quant vector size"
-     return get_size_in_bytes(cache_size // quant_vector_size,
-                              scaling_factor_dtype)
+ @staticmethod
+ def calculate_scaling_factor_size_bytes(
+     cache_size: int, quant_vector_size: int,
+     scaling_factor_dtype: DataType) -> int:
+     """Return extra bytes for per-block scales: ceil(cache_size / quant_vector_size) scales."""
+     assert cache_size > 0 and quant_vector_size > 0, "cache_size and quant_vector_size must be positive"
+     groups = (cache_size + quant_vector_size - 1) // quant_vector_size
+     return get_size_in_bytes(groups, scaling_factor_dtype)
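The ceil-vs-floor point above can be demonstrated standalone. This sketch stubs out get_size_in_bytes as 1 byte per FP8 scale (an assumption); the function name is illustrative.

```python
# Ceil grouping: every started vector of 16 FP4 elements needs a scale byte,
# so a cache size that is not a multiple of 16 still rounds up.
def scaling_factor_size_bytes(cache_size: int,
                              quant_vector_size: int = 16) -> int:
    groups = (cache_size + quant_vector_size - 1) // quant_vector_size
    return groups  # assumes 1 byte per E4M3/FP8 scale factor


print(scaling_factor_size_bytes(64))  # 4
print(scaling_factor_size_bytes(65))  # 5 -- floor division would undercount to 4
```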
🧹 Nitpick comments (3)
tensorrt_llm/_torch/pyexecutor/resource_manager.py (3)
1-1
: Add NVIDIA copyright header. Per repo guidelines, prepend the 2025 NVIDIA copyright header to all source files.
350-357
: Stacking pool pointers changes tensor rank; verify downstream consumers. self.kv_cache_pool_pointers can be 1-D or 2-D depending on NVFP4 scale availability. Confirm all call sites handle both shapes, or keep a stable shape and expose scales separately.
Option A (stable shape): always stack and pad with zeros when scales absent.
Option B (BC-safe): keep the existing self.kv_cache_pool_pointers and store scales in a new attribute self.kv_cache_block_scale_pool_pointers. I can generate a patch once you confirm the preferred contract.
537-544
: Deduplicate NVFP4 constants. Replace magic numbers with module-level constants to keep FP4 packing coherent across sites.
+NVFP4_QUANT_VECTOR_SIZE = 16
+NVFP4_SCALE_DTYPE = DataType.FP8
...
-     cache_size_bytes_per_token += KVCacheManager.calculate_scaling_factor_size_bytes(
-         cache_size_per_token, quant_vector_size=16, scaling_factor_dtype=DataType.FP8)
+     cache_size_bytes_per_token += KVCacheManager.calculate_scaling_factor_size_bytes(
+         cache_size_per_token, quant_vector_size=NVFP4_QUANT_VECTOR_SIZE, scaling_factor_dtype=NVFP4_SCALE_DTYPE)
Also applies to: 734-741, 767-774
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (1)
tensorrt_llm/_torch/pyexecutor/resource_manager.py
(6 hunks)
🧰 Additional context used
📓 Path-based instructions (3)
**/*.{cpp,cc,cxx,cu,py,h,hpp,hh,hxx,cuh}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
Use spaces only; indent 4 spaces
Files:
tensorrt_llm/_torch/pyexecutor/resource_manager.py
**/*.py
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
**/*.py
: Python code must target Python 3.8+
Indent with 4 spaces; do not use tabs
Preserve module namespaces in imports; prefer from package.subpackage import module then module.Symbol
Python file names use snake_case (e.g., some_file.py)
Class names use PascalCase
Function and method names use snake_case
Local variable names use snake_case; if starting with a number, prefix k (e.g., k_99th_percentile)
Global variables use G_ prefix and UPPER_SNAKE_CASE (e.g., G_MY_GLOBAL)
Constants use UPPER_SNAKE_CASE
Avoid shadowing variables from an outer scope
Initialize all externally visible members of a class in init
For interfaces used outside a file, prefer docstrings; reserve comments for internal code or local interfaces
Use Google-style docstrings for classes and functions; document attributes/variables inline as shown
Avoid reflection when simple, explicit code suffices (e.g., prefer def make_complex(x,y) over locals()/dict tricks)
Catch the narrowest exceptions possible in try/except
For duck-typing try/except, keep try body minimal and use else for main logic
Files:
tensorrt_llm/_torch/pyexecutor/resource_manager.py
**/*.{cpp,cc,cxx,cu,h,hpp,hh,hxx,cuh,py}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
Prepend NVIDIA copyright header (current year) to all source files (.cpp, .h, .cu, .py, etc.)
Files:
tensorrt_llm/_torch/pyexecutor/resource_manager.py
🧠 Learnings (3)
📓 Common learnings
Learnt from: thorjohnsen
PR: NVIDIA/TensorRT-LLM#6910
File: cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:0-0
Timestamp: 2025-08-14T21:04:50.248Z
Learning: In KV cache onboarding logic during prefill in cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp, when calculating which blocks fall within the attention window, use getTokensPerBlock() to advance token indices rather than block->getUniqueTokens().size(), because the calculation needs to consider the post-prefill state where blocks will be filled to capacity, not their current token count.
📚 Learning: 2025-08-14T21:04:50.248Z
Learnt from: thorjohnsen
PR: NVIDIA/TensorRT-LLM#6910
File: cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:0-0
Timestamp: 2025-08-14T21:04:50.248Z
Learning: In KV cache onboarding logic during prefill in cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp, when calculating which blocks fall within the attention window, use getTokensPerBlock() to advance token indices rather than block->getUniqueTokens().size(), because the calculation needs to consider the post-prefill state where blocks will be filled to capacity, not their current token count.
Applied to files:
tensorrt_llm/_torch/pyexecutor/resource_manager.py
📚 Learning: 2025-08-26T06:07:02.166Z
Learnt from: shaharmor98
PR: NVIDIA/TensorRT-LLM#7231
File: tensorrt_llm/_torch/pyexecutor/_util.py:504-509
Timestamp: 2025-08-26T06:07:02.166Z
Learning: In tensorrt_llm/_torch/pyexecutor/_util.py, when calling model_engine.set_lora_model_config(), pass model_binding_config.mlp_hidden_size directly without multiplying by mapping.tp_size, as the mlp_hidden_size from get_bindings_model_config() is already the per-TP rank value needed for LoRA weight packaging.
Applied to files:
tensorrt_llm/_torch/pyexecutor/resource_manager.py
🧬 Code graph analysis (1)
tensorrt_llm/_torch/pyexecutor/resource_manager.py (1)
tensorrt_llm/_utils.py (5)
binding_to_str_dtype (200-203), get_size_in_bytes (210-213), nvtx_range (850-869), dtype (942-943), dtype (950-960)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Pre-commit Check
🔇 Additional comments (3)
tensorrt_llm/_torch/pyexecutor/resource_manager.py (3)
17-17
: Import change LGTM. The switch to get_size_in_bytes centralizes dtype math and removes ad-hoc byte calculations.
533-535
: Enable NVFP4 in supported dtypes — LGTM.
545-574
: Float math for max_tokens is fine; ensure the MPI reduction stays numeric. max_tokens is a float; MPI.MIN over float is intended. Downstream ceil() is applied before block counts.
Do we need to clamp max_tokens to >= 0 after allreducing in heterogeneous nodes? If yes, add max(0, ...).
@juney-nvidia I had a sync with @nv-guomingz. We decided to add the document after #6696 is merged. |
Summary by CodeRabbit
New Features
Improvements
Tests
Description
Test Coverage
GitHub Bot Help
/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...
Provide a user friendly way for developers to interact with a Jenkins server.
Run
/bot [-h|--help]
to print this help message. See details below for each supported subcommand.
run [--reuse-test (optional)pipeline-id --disable-fail-fast --skip-test --stage-list "A10-PyTorch-1, xxx" --gpu-type "A30, H100_PCIe" --test-backend "pytorch, cpp" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" --detailed-log --debug(experimental)]
Launch build/test pipelines. All previously running jobs will be killed.
--reuse-test (optional)pipeline-id
(OPTIONAL) : Allow the new pipeline to reuse build artifacts and skip successful test stages from a specified pipeline, or from the last pipeline if no pipeline-id is indicated. If the Git commit ID has changed, this option will always be ignored. The DEFAULT behavior of the bot is to reuse build artifacts and successful test results from the last pipeline.
--disable-reuse-test
(OPTIONAL) : Explicitly prevent the pipeline from reusing build artifacts and skipping successful test stages from a previous pipeline. Ensures that all builds and tests are run regardless of previous successes.
--disable-fail-fast
(OPTIONAL) : Disable fail fast on build/tests/infra failures.
--skip-test
(OPTIONAL) : Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.
--stage-list "A10-PyTorch-1, xxx"
(OPTIONAL) : Only run the specified test stages. Examples: "A10-PyTorch-1, xxx". Note: Does NOT update GitHub check status.
--gpu-type "A30, H100_PCIe"
(OPTIONAL) : Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.
--test-backend "pytorch, cpp"
(OPTIONAL) : Skip test stages which don't match the specified backends. Only supports [pytorch, cpp, tensorrt, triton]. Examples: "pytorch, cpp" (does not run test stages with the tensorrt or triton backend). Note: Does NOT update GitHub pipeline status.
--only-multi-gpu-test
(OPTIONAL) : Only run the multi-GPU tests. Note: Does NOT update GitHub check status.
--disable-multi-gpu-test
(OPTIONAL) : Disable the multi-GPU tests. Note: Does NOT update GitHub check status.
--add-multi-gpu-test
(OPTIONAL) : Force run the multi-GPU tests in addition to running the L0 pre-merge pipeline.
--post-merge
(OPTIONAL) : Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.
--extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx"
(OPTIONAL) : Run the ordinary L0 pre-merge pipeline and the specified test stages. Examples: --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx".
--detailed-log
(OPTIONAL) : Enable flushing out all logs to the Jenkins console. This will significantly increase the log volume and may slow down the job.
--debug
(OPTIONAL) : Experimental feature. Enable access to the CI container for debugging purposes. Note: Specify exactly one stage in the stage-list parameter to access the appropriate container environment. Note: Does NOT update GitHub check status.
For guidance on mapping tests to stage names, see docs/source/reference/ci-overview.md and the scripts/test_to_stage_mapping.py helper.
kill
kill
Kill all running builds associated with pull request.
skip
skip --comment COMMENT
Skip testing for latest commit on pull request.
--comment "Reason for skipping build/test"
is required. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.
reuse-pipeline
reuse-pipeline
Reuse a previous pipeline to validate current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.