
Conversation


@Tom-Zheng Tom-Zheng commented Jul 22, 2025

Summary by CodeRabbit

  • New Features

    • NVFP4 (FP4) KV-cache support with dedicated block-scale buffers and runtime KV-scale overrides exposed to Python.
  • Improvements

    • Unified bit-based sizing and propagation of FP4 block-scale buffers through attention/FMHA/XQA pipelines.
    • Optional per-Q/K/V dequant scales and plan/run kwargs to override KV scaling; NVFP4 added to data-type/quantization options.
  • Tests

    • New unit and integration tests for NVFP4 KV-cache and guided decoding.

Description

Test Coverage

GitHub Bot Help

/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...

Provides a user-friendly way for developers to interact with a Jenkins server.

Run /bot [-h|--help] to print this help message.

See details below for each supported subcommand.

run [--reuse-test (optional)pipeline-id --disable-fail-fast --skip-test --stage-list "A10-PyTorch-1, xxx" --gpu-type "A30, H100_PCIe" --test-backend "pytorch, cpp" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" --detailed-log --debug(experimental)]

Launch build/test pipelines. All previously running jobs will be killed.

--reuse-test (optional)pipeline-id (OPTIONAL) : Allow the new pipeline to reuse build artifacts and skip successful test stages from a specified pipeline, or from the last pipeline if no pipeline-id is given. If the Git commit ID has changed, this option is always ignored. The DEFAULT behavior of the bot is to reuse build artifacts and successful test results from the last pipeline.

--disable-reuse-test (OPTIONAL) : Explicitly prevent the pipeline from reusing build artifacts and skipping successful test stages from a previous pipeline. Ensures that all builds and tests run regardless of previous successes.

--disable-fail-fast (OPTIONAL) : Disable fail fast on build/tests/infra failures.

--skip-test (OPTIONAL) : Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.

--stage-list "A10-PyTorch-1, xxx" (OPTIONAL) : Only run the specified test stages. Examples: "A10-PyTorch-1, xxx". Note: Does NOT update GitHub check status.

--gpu-type "A30, H100_PCIe" (OPTIONAL) : Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.

--test-backend "pytorch, cpp" (OPTIONAL) : Skip test stages that don't match the specified backends. Only [pytorch, cpp, tensorrt, triton] are supported. Examples: "pytorch, cpp" (does not run test stages with the tensorrt or triton backend). Note: Does NOT update GitHub pipeline status.

--only-multi-gpu-test (OPTIONAL) : Only run the multi-GPU tests. Note: Does NOT update GitHub check status.

--disable-multi-gpu-test (OPTIONAL) : Disable the multi-GPU tests. Note: Does NOT update GitHub check status.

--add-multi-gpu-test (OPTIONAL) : Force run the multi-GPU tests in addition to running L0 pre-merge pipeline.

--post-merge (OPTIONAL) : Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.

--extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" (OPTIONAL) : Run the ordinary L0 pre-merge pipeline and specified test stages. Examples: --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx".

--detailed-log (OPTIONAL) : Enable flushing out all logs to the Jenkins console. This will significantly increase the log volume and may slow down the job.

--debug (OPTIONAL) : Experimental feature. Enable access to the CI container for debugging purposes. Note: Specify exactly one stage in the stage-list parameter to access the appropriate container environment. Note: Does NOT update GitHub check status.

For guidance on mapping tests to stage names, see docs/source/reference/ci-overview.md
and the scripts/test_to_stage_mapping.py helper.
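Putting the flags above together, a typical comment (using the placeholder stage name from this help text) might look like:

    /bot run --stage-list "A10-PyTorch-1" --disable-fail-fast

To force a full rebuild and a complete test pass instead, add --disable-reuse-test.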

kill

kill

Kill all running builds associated with the pull request.

skip

skip --comment COMMENT

Skip testing for the latest commit on the pull request. --comment "Reason for skipping build/test" is required. IMPORTANT NOTE: This is dangerous, since insufficient care and validation can break the top of tree.

reuse-pipeline

reuse-pipeline

Reuse a previous pipeline to validate the current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous, since insufficient care and validation can break the top of tree.

@Tom-Zheng Tom-Zheng requested review from a team as code owners July 22, 2025 03:06
@Tom-Zheng Tom-Zheng requested review from juney-nvidia and brb-nv July 22, 2025 03:06

coderabbitai bot commented Jul 22, 2025

📝 Walkthrough

Adds end-to-end NVFP4 (FP4) KV-cache support: removes quantSize, introduces block-scale pool pointers and per-container element accounting, propagates FP4 block-scale buffers through attention/XQA/FMHA/kernels/bindings/torch layers, consolidates Q/K/V scale fields, and updates tests and resource sizing.

Changes

Cohort / File(s) Summary
KV-cache manager headers
cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h
Remove quantSize; compute blockSize = numKvHeads * sizePerHead * tokensPerBlock; add BaseKVCacheManager::getBlockScalePoolPointers() and KVCacheManager override; add WindowBlockManager::getNumEltsPerContainer().
KV-cache manager impl & pybind
cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp, cpp/tensorrt_llm/pybind/batch_manager/kvCacheManager.cpp
Use per-container element count for FP4 sizing, adjust pool construction/allocation and block-scale pool creation, remove FP4 blockSize halving, add pybind get_block_scale_pool_pointers() binding.
Attention core & host wiring
cpp/tensorrt_llm/common/attentionOp.h, cpp/tensorrt_llm/common/attentionOp.cpp, cpp/tensorrt_llm/thop/attentionOp.cpp, tensorrt_llm/_torch/modules/attention.py
Add host primary/secondary pool pointers and block-scale pool pointers to EnqueueParams; add getKvCacheElemSizeInBits(); switch per-token sizing to bit-based; validate and forward NVFP4 block-scale pointers and scales.
XQA dispatcher API & impl
cpp/tensorrt_llm/kernels/xqaDispatcher.h, cpp/tensorrt_llm/kernels/xqaDispatcher.cpp
Extend run/runImpl overloads to accept kv_cache_block_scales_buffer; detect NVFP4 cache type; populate preprocessing params with block-scale buffer and forward it through run.
FMHA runner / kernel params / dispatch
cpp/tensorrt_llm/kernels/contextFusedMultiHeadAttention/fused_multihead_attention_common.h, cpp/tensorrt_llm/kernels/trtllmGenKernels/fmha/fmhaRunnerParams.h, cpp/tensorrt_llm/kernels/trtllmGenKernels/fmha/fmhaRunner.cpp, cpp/tensorrt_llm/kernels/trtllmGenKernels/fmha/kernelParams.h, cpp/tensorrt_llm/kernels/fmhaDispatcher.cpp
Add pagedKvSfCache; unify K/V SF pointers into kvSfPtr; accept DATA_TYPE_E2M1 for dtypeKv; change TMA/KV shape & stride APIs for transformed FP4 in TMEM; switch pointer arithmetic to bits; rename mStartTokenIdxSfO → mStartTokenIdx; thread kvSfPtr through dispatcher.
Decoder / preprocessing & kernels
cpp/tensorrt_llm/kernels/decoderMaskedMultiheadAttention/..., cpp/tensorrt_llm/kernels/unfusedAttentionKernels.h, cpp/tensorrt_llm/kernels/unfusedAttentionKernels/unfusedAttentionKernels_2_template.h
Rename/consolidate preprocessing scale fields (kvScaleOrigQuant → qkv_scale_orig_quant), remove kv_cache_scale_factors, update kernel code to source separate Q/K/V scales and to use the new fields.
GPT kernels & FMHA prep
cpp/tensorrt_llm/kernels/gptKernels.h, cpp/tensorrt_llm/kernels/gptKernels.cu
Replace dequantScaleQ/dequantScaleKv with dequantScaleQkv + separateQkvScales; compute/use separate Q/K/V dequant scales in FMHA prep and scaling.
Host kernel params / TMA descriptor changes
cpp/tensorrt_llm/kernels/trtllmGenKernels/fmha/kernelParams.h
Change getDevicePtrs to accept bitsPerElt; add storeTransformedKvInTmem/unpack4b options; adjust K/V shapes/strides, TMA data formats, swizzle rules, and SF usage to use unified kvSfPtr.
Host libs, resource manager & utils
tensorrt_llm/_torch/modules/linear.py, tensorrt_llm/_torch/pyexecutor/_util.py, tensorrt_llm/_torch/pyexecutor/resource_manager.py, tensorrt_llm/_utils.py
Add KV-scale attributes and loader (kv_scales / inv_kv_scales) and plan/forward kv_scales_sf inputs; add NVFP4 dtype handling; introduce get_size_in_bytes (bits→bytes) and account NVFP4 block-scale memory overhead in resource manager.
Bindings (pybind / nanobind)
cpp/tensorrt_llm/pybind/bindings.cpp, cpp/tensorrt_llm/nanobind/bindings.cpp
Expose DataType::NVFP4 and QuantMode.has_fp4_kv_cache to Python/nanobind; add pybind accessor for block-scale pool pointers.
Tests & integration
cpp/tests/unit_tests/batch_manager/kvCacheManagerTest.cpp, cpp/tests/unit_tests/kernels/ropeTest.cu, tests/integration/defs/accuracy/*, tests/integration/test_lists/*, tests/integration/defs/accuracy/test_llm_api_pytorch.py
Update unit tests for new KVCacheBlockPool ctor and FP4 element semantics; replace test scale fields with qkv_scale_orig_quant; add NVFP4 accuracy references; add integration test test_nvfp4_kv and guided-decoding test; update test lists.
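To make the FP4 sizing above concrete, here is a small illustrative sketch (not the library's API; the function names are hypothetical) using the constants the updated unit tests assume: two FP4 elements per one-byte container, and one block scale per 16-element vector.

```python
# Illustrative sketch of NVFP4 KV-cache sizing. Constants match the
# unit-test assumptions: 2 FP4 elements per 1-byte container, one block
# scale per 16-element vector.
NUM_FP4_ELTS_PER_CONTAINER = 2
VECTOR_SIZE = 16

def block_elems(num_kv_heads: int, size_per_head: int, tokens_per_block: int) -> int:
    # blockSize is now expressed in elements, with no FP4-specific halving
    return num_kv_heads * size_per_head * tokens_per_block

def size_in_bytes(num_elems: int, bits_per_elem: int) -> int:
    # bit-based sizing rounded up to whole bytes (FP4 = 4 bits per element)
    return (num_elems * bits_per_elem + 7) // 8

def block_scale_elems(elems: int) -> int:
    # the block-scale pool holds one scale per VECTOR_SIZE data elements
    return elems // VECTOR_SIZE

elems = block_elems(num_kv_heads=8, size_per_head=128, tokens_per_block=32)
print(elems)                     # 32768 elements per block
print(size_in_bytes(elems, 4))   # 16384 bytes of packed FP4 data
print(block_scale_elems(elems))  # 2048 block scales
```

This mirrors the relation the new kvCacheManagerTest assertion checks: data-pool containers × numFp4EltsPerContainer / vectorSize equals the block-scale pool size.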

Sequence Diagram(s)

sequenceDiagram
  participant Host
  participant AttentionOp
  participant XqaDispatcher
  participant FMHA_Dispatcher
  participant GPU_Kernels
  rect rgb(240,248,255)
  Host->>AttentionOp: enqueue(params, kv_cache_buffer, kv_cache_block_scales_buffer)
  end
  AttentionOp->>XqaDispatcher: run(xqaParams, kv_cache_buffer, kv_cache_block_scales_buffer)
  XqaDispatcher->>FMHA_Dispatcher: run(..., pagedKvCache, pagedKvSfCache)
  FMHA_Dispatcher->>GPU_Kernels: launch(kernelParams with kvPtr, kvSfPtr)
  GPU_Kernels-->>Host: results
sequenceDiagram
  participant Host
  participant KVCacheManager
  participant Pools
  rect rgb(248,248,255)
  Host->>KVCacheManager: construct WindowBlockManager(...)
  KVCacheManager->>Pools: create data pools using sizePerHead / numEltsPerContainer
  KVCacheManager->>Pools: create block-scale pools sized per quantBlock using blockScaleSizePerHead
  Pools-->>Host: return data + block-scale pool pointers
  end

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

Suggested labels

KV-Cache Management



@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

🔭 Outside diff range comments (1)
cpp/tensorrt_llm/kernels/decoderMaskedMultiheadAttention/decoderXQAImplPrecompiled.cpp (1)

215-227: kv_cache_block_scales_buffer is left uninitialised – likely a functional bug for FP4 path

With NVFP4 support, the preprocessing kernel expects a valid block-scale buffer.
Here the field is set to an empty brace-initialiser ({}), while the immediately-adjacent scale pointer kv_scale_orig_quant is taken from xqaParams.
Unless the preprocessing kernel is explicitly guarded to ignore the field for every non-FP4 run, this will dereference a null pointer once the FP4 KV-cache code-path is exercised.

-        preprocessingParams.kv_cache_block_scales_buffer = {};
+        // Pass through the block-scale pool allocated in the caller
+        preprocessingParams.kv_cache_block_scales_buffer = xqaParams.kv_cache_block_scales_buffer;

Please ensure:

  1. xqaParams exposes kv_cache_block_scales_buffer for the generation phase, and
  2. All kernel-launch helpers (invokeQKVPreprocessing, downstream FMHA) have the same expectation.

Failing to wire this pointer will produce silent wrong results or a hard fault when FP4 is enabled.

♻️ Duplicate comments (21)
cpp/tensorrt_llm/kernels/trtllmGenKernels/fmha/cubin/FmhaSm100Kernel_QE4m3KvE2m1OE4m3H128PagedKvDenseP32VarSeqQ16Kv128PersistentSwapsAbForGen_cubin.cpp (1)

2-3: Same provenance concern as above

The cubin hash & size changed but no build record is provided.

cpp/tensorrt_llm/kernels/trtllmGenKernels/fmha/cubin/FmhaSm100Kernel_QE4m3KvE2m1OE4m3H128PagedKvDenseP32MultiCtasKvCgaVarSeqQ8Kv128StaticSwapsAbForGen_cubin.cpp (1)

2-3: Same provenance concern as above

The cubin hash & size changed but no build record is provided.

cpp/tensorrt_llm/kernels/trtllmGenKernels/fmha/cubin/FmhaSm100Kernel_QE4m3KvE2m1OE4m3H128PagedKvDenseP64MultiCtasKvCgaVarSeqQ16Kv128StaticSwapsAbForGen_cubin.cpp (1)

2-3: Same provenance concern as above

The cubin hash & size changed but no build record is provided.

cpp/tensorrt_llm/kernels/trtllmGenKernels/fmha/cubin/FmhaSm100Kernel_QE4m3KvE2m1OE4m3H128PagedKvSlidingOrChunkedCausalP32VarSeqQ128Kv128StaticContext_cubin.cpp (1)

2-3: Same provenance concern as above

The cubin hash & size changed but no build record is provided.

cpp/tensorrt_llm/kernels/trtllmGenKernels/fmha/cubin/FmhaSm100Kernel_QE4m3KvE2m1OE4m3H128PagedKvDenseP32VarSeqQ8Kv128StaticSwapsAbForGen_cubin.cpp (1)

2-3: Same provenance concern as first cubin

See comment on the first cubin file; the same reproducibility note applies here.

cpp/tensorrt_llm/kernels/trtllmGenKernels/fmha/cubin/FmhaSm100Kernel_QE4m3KvE2m1OE4m3H64PagedKvSlidingOrChunkedCausalP32MultiCtasKvVarSeqQ16Kv128StaticSwapsAbForGen_cubin.cpp (1)

2-3: Same provenance concern as first cubin

See comment on the first cubin file; the same reproducibility note applies here.

cpp/tensorrt_llm/kernels/trtllmGenKernels/fmha/cubin/FmhaSm100Kernel_QE4m3KvE2m1OE4m3H128PagedKvDenseP64VarSeqQ16Kv128StaticSwapsAbForGen_cubin.cpp (1)

2-3: Same provenance concern as first cubin

See comment on the first cubin file; the same reproducibility note applies here.

cpp/tensorrt_llm/kernels/trtllmGenKernels/fmha/cubin/FmhaSm100Kernel_QE4m3KvE2m1OE4m3H128PagedKvDenseP32VarSeqQ8Kv128PersistentSwapsAbForGen_cubin.cpp (1)

2-3: Same provenance concern as first cubin

See comment on the first cubin file; the same reproducibility note applies here.

cpp/tensorrt_llm/kernels/trtllmGenKernels/fmha/cubin/FmhaSm100Kernel_QE4m3KvE2m1OE4m3H128PagedKvDenseP32VarSeqQ128Kv128PersistentContext_cubin.cpp (1)

1-4: Same verification needed as for the first cubin update.

See previous comment – identical reproducibility and registration checks apply here.

cpp/tensorrt_llm/kernels/trtllmGenKernels/fmha/cubin/FmhaSm100Kernel_QE4m3KvE2m1OE4m3H128PagedKvDenseP64MultiCtasKvVarSeqQ16Kv128StaticSwapsAbForGen_cubin.cpp (1)

1-4: Same verification needed as for the first cubin update.

See previous comment – identical reproducibility and registration checks apply here.

cpp/tensorrt_llm/kernels/trtllmGenKernels/fmha/cubin/FmhaSm100Kernel_QE4m3KvE2m1OE4m3H128PagedKvCausalP64VarSeqQ128Kv128PersistentContext_cubin.cpp (1)

1-4: Same verification needed as for the first cubin update.

See previous comment – identical reproducibility and registration checks apply here.

cpp/tensorrt_llm/kernels/trtllmGenKernels/fmha/cubin/FmhaSm100Kernel_QE4m3KvE2m1OE4m3H128PagedKvCausalP32VarSeqQ128Kv128StaticContext_cubin.cpp (1)

1-4: Same verification needed as for the first cubin update.

See previous comment – identical reproducibility and registration checks apply here.

cpp/tensorrt_llm/kernels/trtllmGenKernels/fmha/cubin/FmhaSm100Kernel_QE4m3KvE2m1OE4m3H64PagedKvDenseP64VarSeqQ16Kv128StaticSwapsAbForGen_cubin.cpp (1)

2-3: Same concern as in the previous cubin file: please document build provenance and guarantee test coverage for NVFP4 paths.

cpp/tensorrt_llm/kernels/trtllmGenKernels/fmha/cubin/FmhaSm100Kernel_QE4m3KvE2m1OE4m3H64PagedKvSlidingOrChunkedCausalP32VarSeqQ128Kv128StaticContext_cubin.cpp (1)

2-3: Same concern as in the previous cubin file: please document build provenance and guarantee test coverage for NVFP4 paths.

cpp/tensorrt_llm/kernels/trtllmGenKernels/fmha/cubin/FmhaSm100Kernel_QE4m3KvE2m1OE4m3H64PagedKvCausalP32VarSeqQ128Kv128StaticContext_cubin.cpp (1)

2-3: Same concern as in the previous cubin file: please document build provenance and guarantee test coverage for NVFP4 paths.

cpp/tensorrt_llm/kernels/trtllmGenKernels/fmha/cubin/FmhaSm100Kernel_QE4m3KvE2m1OE4m3H64PagedKvDenseP64VarSeqQ16Kv128PersistentSwapsAbForGen_cubin.cpp (1)

2-3: Same concern as in the previous cubin file: please document build provenance and guarantee test coverage for NVFP4 paths.

cpp/tensorrt_llm/kernels/trtllmGenKernels/fmha/cubin/FmhaSm100Kernel_QE4m3KvE2m1OE4m3H64PagedKvDenseP32MultiCtasKvCgaVarSeqQ16Kv128StaticSwapsAbForGen_cubin.cpp (1)

2-3: Same verification needed for updated cubin

New hash 1428fe48…, size 596 486. Re-run the HW/QA checks described in the previous comment.

cpp/tensorrt_llm/kernels/trtllmGenKernels/fmha/cubin/FmhaSm100Kernel_QE4m3KvE2m1OE4m3H128PagedKvSlidingOrChunkedCausalP64VarSeqQ8Kv128PersistentSwapsAbForGen_cubin.cpp (1)

2-3: Same verification needed for updated cubin

Hash 42163217…, size 647 265.

cpp/tensorrt_llm/kernels/trtllmGenKernels/fmha/cubin/FmhaSm100Kernel_QE4m3KvE2m1OE4m3H64PagedKvSlidingOrChunkedCausalP32VarSeqQ8Kv128StaticSwapsAbForGen_cubin.cpp (1)

2-3: Same verification needed for updated cubin

Hash 5a3141a6…, size 493 038.

cpp/tensorrt_llm/kernels/trtllmGenKernels/fmha/cubin/FmhaSm100Kernel_QE4m3KvE2m1OE4m3H128PagedKvDenseP64VarSeqQ128Kv128StaticContext_cubin.cpp (1)

2-3: Same verification needed for updated cubin

Hash d91996ed…, size 878 391.

cpp/tensorrt_llm/kernels/trtllmGenKernels/fmha/cubin/FmhaSm100Kernel_QE4m3KvE2m1OE4m3H64PagedKvSlidingOrChunkedCausalP32VarSeqQ16Kv128PersistentSwapsAbForGen_cubin.cpp (1)

2-3: Same concern as above – confirm compatibility after sha change

Ensure the new 5f3641e5… cubin is detectable and selected by the runtime for H64 paged-KV variant; otherwise the kernel picker may silently fall back to a slower path.

🧹 Nitpick comments (12)
cpp/tensorrt_llm/kernels/trtllmGenKernels/fmha/cubin/FmhaSm100Kernel_QE4m3KvE2m1OE4m3H64PagedKvDenseP64VarSeqQ8Kv128StaticSwapsAbForGen_cubin.cpp (1)

2-3: Add provenance / build-time metadata for regenerated cubin

Only the LFS pointer hash & size changed. Without any accompanying information (PTX version, compiler flags, SM target, git SHA of the generator, etc.) it’s impossible for downstream maintainers to reproduce or audit the new binary, and CI artifact drift can creep in silently.

Please attach (e.g., in docs/kernel_manifests/…json) a small manifest that records:
• kernel name
• commit that produced it
• exact nvcc/ptxas flags & CUDA version
• git SHA of the source .cu template
• date / author

This keeps the binaries traceable and eases security review.

cpp/tensorrt_llm/kernels/trtllmGenKernels/fmha/cubin/FmhaSm100Kernel_QE4m3KvE2m1OE4m3H64PagedKvSlidingOrChunkedCausalP32VarSeqQ16Kv128StaticSwapsAbForGen_cubin.cpp (1)

2-3: Document build provenance for the new CUBIN to keep the binary audit-ready

Only the LFS pointer (SHA-256 + size) changed, so reviewers cannot reason about functional differences.
Please record in the PR (or a BUILD.md) the exact commit / branch of the kernel generator, compiler versions, and flags used to produce this cubin, or add a CI rule that rebuilds and diff-checks it. That keeps the artifact reproducible and prevents silent drift.

cpp/tensorrt_llm/kernels/trtllmGenKernels/fmha/cubin/FmhaSm100Kernel_QE4m3KvE2m1OE4m3H128PagedKvSlidingOrChunkedCausalP64MultiCtasKvCgaVarSeqQ8Kv128StaticSwapsAbForGen_cubin.cpp (1)

2-3: Confirm provenance & test coverage for the new cubin

Only the LFS pointer (SHA-256 + size) changed, so the actual binary was swapped in.
Because sources & build scripts for these kernels are not part of the repo, reviewers cannot independently verify that the new cubin really corresponds to the advertised NVFP4 support, compiler flags, etc.

Action items

  1. Add a short note to the PR description (or a BUILD.md snippet) with:
    • generator commit hash / tag
    • compiler version & flags
    • explicit opt-ins (e.g. -DNVFP4_KV_CACHE=ON)
    • SM target list
  2. Ensure CI executes FMHA unit / integration tests with DataType::kNvFp4 so the new blob is actually exercised.

This makes future rebuilds reproducible and prevents silent regressions.

cpp/tensorrt_llm/kernels/trtllmGenKernels/fmha/cubin/FmhaSm100Kernel_QE4m3KvE2m1OE4m3H64PagedKvSlidingOrChunkedCausalP64VarSeqQ16Kv128StaticSwapsAbForGen_cubin.cpp (1)

2-3: Minor: keep commit message listing all new hashes

Helps future bisects if each updated cubin hash is recorded in the commit message or a CHANGELOG entry.

cpp/tensorrt_llm/kernels/trtllmGenKernels/fmha/cubin/FmhaSm100Kernel_QE4m3KvE2m1OE4m3H128PagedKvDenseP64MultiCtasKvVarSeqQ8Kv128StaticSwapsAbForGen_cubin.cpp (1)

2-3: Cubin replaced – confirm deterministic rebuild.

Because the binary shrank ~35 %, verify that:

  • The compile flags (--use_fast_math, PTX version, etc.) match the rest of the FP4 cubins.
  • The new build is reproducible; otherwise, future patch releases may diverge.

If you have a reproducibility script, please drop a reference in docs/release_build.md.

cpp/tensorrt_llm/common/attentionOp.h (1)

132-183: Consider adding new pointers to debug output for consistency.

The enqueueContextParamsToString() function includes the existing host_primary_pool_pointer and host_secondary_pool_pointer (lines 168-169) but doesn't include the newly added scale pool pointers. For debugging consistency, consider adding them:

            ss << "host_primary_pool_pointer: " << this->host_primary_pool_pointer << std::endl;
            ss << "host_secondary_pool_pointer: " << this->host_secondary_pool_pointer << std::endl;
+           ss << "host_primary_block_scale_pool_pointer: " << this->host_primary_block_scale_pool_pointer << std::endl;
+           ss << "host_secondary_block_scale_pool_pointer: " << this->host_secondary_block_scale_pool_pointer << std::endl;
cpp/tensorrt_llm/kernels/trtllmGenKernels/fmha/cubin/FmhaSm100Kernel_QE4m3KvE2m1OE4m3H64PagedKvDenseP64MultiCtasKvCgaVarSeqQ16Kv128StaticSwapsAbForGen_cubin.cpp (1)

2-3: Dense-P64 cubin swapped – ensure LFS pointer committed

Just a reminder: confirm .gitattributes still tracks *.cubin.cpp with filter=lfs so the real binary is pulled in downstream clones; otherwise buildbots without LFS will fail at link time.

cpp/tests/unit_tests/batch_manager/kvCacheManagerTest.cpp (2)

537-538: Avoid hard-coding magic constants; reuse implementation-side definitions

numFp4EltsPerContainer and vectorSize duplicate values that already exist in production code (kNumFp4EltsPerContainer, kVectorSize in kvCacheUtils.h). Importing those symbols (or exposing them via a dedicated test header) prevents drift if the implementation changes.

-    auto constexpr numFp4EltsPerContainer = 2;
-    auto constexpr vectorSize = 16;
+#include "tensorrt_llm/kernels/kvCacheUtils.h"
+    auto constexpr numFp4EltsPerContainer = tk::kNumFp4EltsPerContainer;
+    auto constexpr vectorSize = tk::kVectorSize;

560-563: Assertion looks correct but can overflow with larger sizes

The expression blockManager.getBlockSize(0) * numFp4EltsPerContainer is 32-bit-multiplication (returns SizeType32). With larger heads / tokens an overflow is possible before the subsequent / vectorSize.
Cast to 64-bit to make the test future-proof:

-    EXPECT_EQ(blockManager.getBlockSize(0) * numFp4EltsPerContainer / vectorSize,
-              blockManager.getBlockSize(1));
+    EXPECT_EQ(static_cast<std::uint64_t>(blockManager.getBlockSize(0)) * numFp4EltsPerContainer / vectorSize,
+              static_cast<std::uint64_t>(blockManager.getBlockSize(1)));
cpp/tensorrt_llm/thop/attentionOp.cpp (2)

226-241: Correct bit-level calculation for FP4 support.

The change from byte-based to bit-based calculation properly handles FP4's 4-bit elements. Consider renaming the variable to cache_elem_size_bits for clarity.

-        // The cache element size in bits.
-        int cache_elem_bits;
+        // The cache element size in bits.
+        int cache_elem_size_bits;

738-738: Correct extension of FP8ContextFMHA to support FP4.

The flag now properly enables the quantized context FMHA path for both FP8 and FP4 KV cache modes. Consider adding a comment to clarify that this flag supports both quantization modes despite the "FP8" in its name.

cpp/tensorrt_llm/common/attentionOp.cpp (1)

2101-2117: Consider refactoring duplicated element size calculation.

This element size calculation logic is duplicated from the context phase (lines 1285-1299). Consider extracting it into a helper function to improve maintainability.

// Helper function to calculate element size based on quantization mode
static int getElementBits(tc::QuantMode const& kvCacheQuantMode) {
    if (kvCacheQuantMode.hasInt8KvCache() || kvCacheQuantMode.hasFp8KvCache()) {
        return 8;
    } else if (kvCacheQuantMode.hasFp4KvCache()) {
        return 4;
    } else {
        return sizeof(T) * 8;
    }
}
📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between fddb7f1 and 5099b64.

📒 Files selected for processing (107)
  • cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h (3 hunks)
  • cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp (3 hunks)
  • cpp/tensorrt_llm/common/attentionOp.cpp (17 hunks)
  • cpp/tensorrt_llm/common/attentionOp.h (1 hunks)
  • cpp/tensorrt_llm/kernels/contextFusedMultiHeadAttention/fused_multihead_attention_common.h (1 hunks)
  • cpp/tensorrt_llm/kernels/decoderMaskedMultiheadAttention/decoderXQAImplJIT/decoderXQAImplJIT.cpp (1 hunks)
  • cpp/tensorrt_llm/kernels/decoderMaskedMultiheadAttention/decoderXQAImplPrecompiled.cpp (1 hunks)
  • cpp/tensorrt_llm/kernels/fmhaDispatcher.cpp (3 hunks)
  • cpp/tensorrt_llm/kernels/trtllmGenKernels/fmha/cubin/FmhaSm100Kernel_QE4m3KvE2m1OE4m3H128PagedKvCausalP32VarSeqQ128Kv128PersistentContext_cubin.cpp (1 hunks)
  • cpp/tensorrt_llm/kernels/trtllmGenKernels/fmha/cubin/FmhaSm100Kernel_QE4m3KvE2m1OE4m3H128PagedKvCausalP32VarSeqQ128Kv128StaticContext_cubin.cpp (1 hunks)
  • cpp/tensorrt_llm/kernels/trtllmGenKernels/fmha/cubin/FmhaSm100Kernel_QE4m3KvE2m1OE4m3H128PagedKvCausalP64VarSeqQ128Kv128PersistentContext_cubin.cpp (1 hunks)
  • cpp/tensorrt_llm/kernels/trtllmGenKernels/fmha/cubin/FmhaSm100Kernel_QE4m3KvE2m1OE4m3H128PagedKvCausalP64VarSeqQ128Kv128StaticContext_cubin.cpp (1 hunks)
  • cpp/tensorrt_llm/kernels/trtllmGenKernels/fmha/cubin/FmhaSm100Kernel_QE4m3KvE2m1OE4m3H128PagedKvDenseP32MultiCtasKvCgaVarSeqQ16Kv128StaticSwapsAbForGen_cubin.cpp (1 hunks)
  • cpp/tensorrt_llm/kernels/trtllmGenKernels/fmha/cubin/FmhaSm100Kernel_QE4m3KvE2m1OE4m3H128PagedKvDenseP32MultiCtasKvCgaVarSeqQ8Kv128StaticSwapsAbForGen_cubin.cpp (1 hunks)
  • cpp/tensorrt_llm/kernels/trtllmGenKernels/fmha/cubin/FmhaSm100Kernel_QE4m3KvE2m1OE4m3H128PagedKvDenseP32MultiCtasKvVarSeqQ16Kv128StaticSwapsAbForGen_cubin.cpp (1 hunks)
  • cpp/tensorrt_llm/kernels/trtllmGenKernels/fmha/cubin/FmhaSm100Kernel_QE4m3KvE2m1OE4m3H128PagedKvDenseP32MultiCtasKvVarSeqQ8Kv128StaticSwapsAbForGen_cubin.cpp (1 hunks)
  • cpp/tensorrt_llm/kernels/trtllmGenKernels/fmha/cubin/FmhaSm100Kernel_QE4m3KvE2m1OE4m3H128PagedKvDenseP32VarSeqQ128Kv128PersistentContext_cubin.cpp (1 hunks)
  • cpp/tensorrt_llm/kernels/trtllmGenKernels/fmha/cubin/FmhaSm100Kernel_QE4m3KvE2m1OE4m3H128PagedKvDenseP32VarSeqQ128Kv128StaticContext_cubin.cpp (1 hunks)
  • cpp/tensorrt_llm/kernels/trtllmGenKernels/fmha/cubin/FmhaSm100Kernel_QE4m3KvE2m1OE4m3H128PagedKvDenseP32VarSeqQ16Kv128PersistentSwapsAbForGen_cubin.cpp (1 hunks)
  • cpp/tensorrt_llm/kernels/trtllmGenKernels/fmha/cubin/FmhaSm100Kernel_QE4m3KvE2m1OE4m3H128PagedKvDenseP32VarSeqQ16Kv128StaticSwapsAbForGen_cubin.cpp (1 hunks)
  • cpp/tensorrt_llm/kernels/trtllmGenKernels/fmha/cubin/FmhaSm100Kernel_QE4m3KvE2m1OE4m3H128PagedKvDenseP32VarSeqQ8Kv128PersistentSwapsAbForGen_cubin.cpp (1 hunks)
  • cpp/tensorrt_llm/kernels/trtllmGenKernels/fmha/cubin/FmhaSm100Kernel_QE4m3KvE2m1OE4m3H128PagedKvDenseP32VarSeqQ8Kv128StaticSwapsAbForGen_cubin.cpp (1 hunks)
  • cpp/tensorrt_llm/kernels/trtllmGenKernels/fmha/cubin/FmhaSm100Kernel_QE4m3KvE2m1OE4m3H128PagedKvDenseP64MultiCtasKvCgaVarSeqQ16Kv128StaticSwapsAbForGen_cubin.cpp (1 hunks)
  • cpp/tensorrt_llm/kernels/trtllmGenKernels/fmha/cubin/FmhaSm100Kernel_QE4m3KvE2m1OE4m3H128PagedKvDenseP64MultiCtasKvCgaVarSeqQ8Kv128StaticSwapsAbForGen_cubin.cpp (1 hunks)
  • cpp/tensorrt_llm/kernels/trtllmGenKernels/fmha/cubin/FmhaSm100Kernel_QE4m3KvE2m1OE4m3H128PagedKvDenseP64MultiCtasKvVarSeqQ16Kv128StaticSwapsAbForGen_cubin.cpp (1 hunks)
  • cpp/tensorrt_llm/kernels/trtllmGenKernels/fmha/cubin/FmhaSm100Kernel_QE4m3KvE2m1OE4m3H128PagedKvDenseP64MultiCtasKvVarSeqQ8Kv128StaticSwapsAbForGen_cubin.cpp (1 hunks)
  • cpp/tensorrt_llm/kernels/trtllmGenKernels/fmha/cubin/FmhaSm100Kernel_QE4m3KvE2m1OE4m3H128PagedKvDenseP64VarSeqQ128Kv128PersistentContext_cubin.cpp (1 hunks)
  • cpp/tensorrt_llm/kernels/trtllmGenKernels/fmha/cubin/FmhaSm100Kernel_QE4m3KvE2m1OE4m3H128PagedKvDenseP64VarSeqQ128Kv128StaticContext_cubin.cpp (1 hunks)
  • cpp/tensorrt_llm/kernels/trtllmGenKernels/fmha/cubin/FmhaSm100Kernel_QE4m3KvE2m1OE4m3H128PagedKvDenseP64VarSeqQ16Kv128PersistentSwapsAbForGen_cubin.cpp (1 hunks)
  • cpp/tensorrt_llm/kernels/trtllmGenKernels/fmha/cubin/FmhaSm100Kernel_QE4m3KvE2m1OE4m3H128PagedKvDenseP64VarSeqQ16Kv128StaticSwapsAbForGen_cubin.cpp (1 hunks)
  • cpp/tensorrt_llm/kernels/trtllmGenKernels/fmha/cubin/FmhaSm100Kernel_QE4m3KvE2m1OE4m3H128PagedKvDenseP64VarSeqQ8Kv128PersistentSwapsAbForGen_cubin.cpp (1 hunks)
  • cpp/tensorrt_llm/kernels/trtllmGenKernels/fmha/cubin/FmhaSm100Kernel_QE4m3KvE2m1OE4m3H128PagedKvDenseP64VarSeqQ8Kv128StaticSwapsAbForGen_cubin.cpp (1 hunks)
  • cpp/tensorrt_llm/kernels/trtllmGenKernels/fmha/cubin/FmhaSm100Kernel_QE4m3KvE2m1OE4m3H128PagedKvSlidingOrChunkedCausalP32MultiCtasKvCgaVarSeqQ16Kv128StaticSwapsAbForGen_cubin.cpp (1 hunks)
  • cpp/tensorrt_llm/kernels/trtllmGenKernels/fmha/cubin/FmhaSm100Kernel_QE4m3KvE2m1OE4m3H128PagedKvSlidingOrChunkedCausalP32MultiCtasKvCgaVarSeqQ8Kv128StaticSwapsAbForGen_cubin.cpp (1 hunks)
  • cpp/tensorrt_llm/kernels/trtllmGenKernels/fmha/cubin/FmhaSm100Kernel_QE4m3KvE2m1OE4m3H128PagedKvSlidingOrChunkedCausalP32MultiCtasKvVarSeqQ16Kv128StaticSwapsAbForGen_cubin.cpp (1 hunks)
  • cpp/tensorrt_llm/kernels/trtllmGenKernels/fmha/cubin/FmhaSm100Kernel_QE4m3KvE2m1OE4m3H128PagedKvSlidingOrChunkedCausalP32MultiCtasKvVarSeqQ8Kv128StaticSwapsAbForGen_cubin.cpp (1 hunks)
  • cpp/tensorrt_llm/kernels/trtllmGenKernels/fmha/cubin/FmhaSm100Kernel_QE4m3KvE2m1OE4m3H128PagedKvSlidingOrChunkedCausalP32VarSeqQ128Kv128PersistentContext_cubin.cpp (1 hunks)
  • cpp/tensorrt_llm/kernels/trtllmGenKernels/fmha/cubin/FmhaSm100Kernel_QE4m3KvE2m1OE4m3H128PagedKvSlidingOrChunkedCausalP32VarSeqQ128Kv128StaticContext_cubin.cpp (1 hunks)
  • cpp/tensorrt_llm/kernels/trtllmGenKernels/fmha/cubin/FmhaSm100Kernel_QE4m3KvE2m1OE4m3H128PagedKvSlidingOrChunkedCausalP32VarSeqQ16Kv128PersistentSwapsAbForGen_cubin.cpp (1 hunks)
  • cpp/tensorrt_llm/kernels/trtllmGenKernels/fmha/cubin/FmhaSm100Kernel_QE4m3KvE2m1OE4m3H128PagedKvSlidingOrChunkedCausalP32VarSeqQ16Kv128StaticSwapsAbForGen_cubin.cpp (1 hunks)
  • cpp/tensorrt_llm/kernels/trtllmGenKernels/fmha/cubin/FmhaSm100Kernel_QE4m3KvE2m1OE4m3H128PagedKvSlidingOrChunkedCausalP32VarSeqQ8Kv128PersistentSwapsAbForGen_cubin.cpp (1 hunks)
  • cpp/tensorrt_llm/kernels/trtllmGenKernels/fmha/cubin/FmhaSm100Kernel_QE4m3KvE2m1OE4m3H128PagedKvSlidingOrChunkedCausalP32VarSeqQ8Kv128StaticSwapsAbForGen_cubin.cpp (1 hunks)
  • cpp/tensorrt_llm/kernels/trtllmGenKernels/fmha/cubin/FmhaSm100Kernel_QE4m3KvE2m1OE4m3H128PagedKvSlidingOrChunkedCausalP64MultiCtasKvCgaVarSeqQ16Kv128StaticSwapsAbForGen_cubin.cpp (1 hunks)
  • cpp/tensorrt_llm/kernels/trtllmGenKernels/fmha/cubin/FmhaSm100Kernel_QE4m3KvE2m1OE4m3H128PagedKvSlidingOrChunkedCausalP64MultiCtasKvCgaVarSeqQ8Kv128StaticSwapsAbForGen_cubin.cpp (1 hunks)
  • cpp/tensorrt_llm/kernels/trtllmGenKernels/fmha/cubin/FmhaSm100Kernel_QE4m3KvE2m1OE4m3H128PagedKvSlidingOrChunkedCausalP64MultiCtasKvVarSeqQ16Kv128StaticSwapsAbForGen_cubin.cpp (1 hunks)
  • cpp/tensorrt_llm/kernels/trtllmGenKernels/fmha/cubin/FmhaSm100Kernel_QE4m3KvE2m1OE4m3H128PagedKvSlidingOrChunkedCausalP64MultiCtasKvVarSeqQ8Kv128StaticSwapsAbForGen_cubin.cpp (1 hunks)
  • cpp/tensorrt_llm/kernels/trtllmGenKernels/fmha/cubin/FmhaSm100Kernel_QE4m3KvE2m1OE4m3H128PagedKvSlidingOrChunkedCausalP64VarSeqQ128Kv128PersistentContext_cubin.cpp (1 hunks)
  • cpp/tensorrt_llm/kernels/trtllmGenKernels/fmha/cubin/FmhaSm100Kernel_QE4m3KvE2m1OE4m3H128PagedKvSlidingOrChunkedCausalP64VarSeqQ128Kv128StaticContext_cubin.cpp (1 hunks)
  • cpp/tensorrt_llm/kernels/trtllmGenKernels/fmha/cubin/FmhaSm100Kernel_QE4m3KvE2m1OE4m3H128PagedKvSlidingOrChunkedCausalP64VarSeqQ16Kv128PersistentSwapsAbForGen_cubin.cpp (1 hunks)
  • cpp/tensorrt_llm/kernels/trtllmGenKernels/fmha/cubin/FmhaSm100Kernel_QE4m3KvE2m1OE4m3H128PagedKvSlidingOrChunkedCausalP64VarSeqQ16Kv128StaticSwapsAbForGen_cubin.cpp (1 hunks)
  • cpp/tensorrt_llm/kernels/trtllmGenKernels/fmha/cubin/FmhaSm100Kernel_QE4m3KvE2m1OE4m3H128PagedKvSlidingOrChunkedCausalP64VarSeqQ8Kv128PersistentSwapsAbForGen_cubin.cpp (1 hunks)
  • cpp/tensorrt_llm/kernels/trtllmGenKernels/fmha/cubin/FmhaSm100Kernel_QE4m3KvE2m1OE4m3H128PagedKvSlidingOrChunkedCausalP64VarSeqQ8Kv128StaticSwapsAbForGen_cubin.cpp (1 hunks)
  • cpp/tensorrt_llm/kernels/trtllmGenKernels/fmha/cubin/FmhaSm100Kernel_QE4m3KvE2m1OE4m3H64PagedKvCausalP32VarSeqQ128Kv128PersistentContext_cubin.cpp (1 hunks)
  • cpp/tensorrt_llm/kernels/trtllmGenKernels/fmha/cubin/FmhaSm100Kernel_QE4m3KvE2m1OE4m3H64PagedKvCausalP32VarSeqQ128Kv128StaticContext_cubin.cpp (1 hunks)
  • cpp/tensorrt_llm/kernels/trtllmGenKernels/fmha/cubin/FmhaSm100Kernel_QE4m3KvE2m1OE4m3H64PagedKvCausalP64VarSeqQ128Kv128PersistentContext_cubin.cpp (1 hunks)
  • cpp/tensorrt_llm/kernels/trtllmGenKernels/fmha/cubin/FmhaSm100Kernel_QE4m3KvE2m1OE4m3H64PagedKvCausalP64VarSeqQ128Kv128StaticContext_cubin.cpp (1 hunks)
  • cpp/tensorrt_llm/kernels/trtllmGenKernels/fmha/cubin/FmhaSm100Kernel_QE4m3KvE2m1OE4m3H64PagedKvDenseP32MultiCtasKvCgaVarSeqQ16Kv128StaticSwapsAbForGen_cubin.cpp (1 hunks)
  • cpp/tensorrt_llm/kernels/trtllmGenKernels/fmha/cubin/FmhaSm100Kernel_QE4m3KvE2m1OE4m3H64PagedKvDenseP32MultiCtasKvCgaVarSeqQ8Kv128StaticSwapsAbForGen_cubin.cpp (1 hunks)
  • cpp/tensorrt_llm/kernels/trtllmGenKernels/fmha/cubin/FmhaSm100Kernel_QE4m3KvE2m1OE4m3H64PagedKvDenseP32MultiCtasKvVarSeqQ16Kv128StaticSwapsAbForGen_cubin.cpp (1 hunks)
  • cpp/tensorrt_llm/kernels/trtllmGenKernels/fmha/cubin/FmhaSm100Kernel_QE4m3KvE2m1OE4m3H64PagedKvDenseP32MultiCtasKvVarSeqQ8Kv128StaticSwapsAbForGen_cubin.cpp (1 hunks)
  • cpp/tensorrt_llm/kernels/trtllmGenKernels/fmha/cubin/FmhaSm100Kernel_QE4m3KvE2m1OE4m3H64PagedKvDenseP32VarSeqQ128Kv128PersistentContext_cubin.cpp (1 hunks)
  • cpp/tensorrt_llm/kernels/trtllmGenKernels/fmha/cubin/FmhaSm100Kernel_QE4m3KvE2m1OE4m3H64PagedKvDenseP32VarSeqQ128Kv128StaticContext_cubin.cpp (1 hunks)
  • cpp/tensorrt_llm/kernels/trtllmGenKernels/fmha/cubin/FmhaSm100Kernel_QE4m3KvE2m1OE4m3H64PagedKvDenseP32VarSeqQ16Kv128PersistentSwapsAbForGen_cubin.cpp (1 hunks)
  • cpp/tensorrt_llm/kernels/trtllmGenKernels/fmha/cubin/FmhaSm100Kernel_QE4m3KvE2m1OE4m3H64PagedKvDenseP32VarSeqQ16Kv128StaticSwapsAbForGen_cubin.cpp (1 hunks)
  • cpp/tensorrt_llm/kernels/trtllmGenKernels/fmha/cubin/FmhaSm100Kernel_QE4m3KvE2m1OE4m3H64PagedKvDenseP32VarSeqQ8Kv128PersistentSwapsAbForGen_cubin.cpp (1 hunks)
  • cpp/tensorrt_llm/kernels/trtllmGenKernels/fmha/cubin/FmhaSm100Kernel_QE4m3KvE2m1OE4m3H64PagedKvDenseP32VarSeqQ8Kv128StaticSwapsAbForGen_cubin.cpp (1 hunks)
  • cpp/tensorrt_llm/kernels/trtllmGenKernels/fmha/cubin/FmhaSm100Kernel_QE4m3KvE2m1OE4m3H64PagedKvDenseP64MultiCtasKvCgaVarSeqQ16Kv128StaticSwapsAbForGen_cubin.cpp (1 hunks)
  • cpp/tensorrt_llm/kernels/trtllmGenKernels/fmha/cubin/FmhaSm100Kernel_QE4m3KvE2m1OE4m3H64PagedKvDenseP64MultiCtasKvCgaVarSeqQ8Kv128StaticSwapsAbForGen_cubin.cpp (1 hunks)
  • cpp/tensorrt_llm/kernels/trtllmGenKernels/fmha/cubin/FmhaSm100Kernel_QE4m3KvE2m1OE4m3H64PagedKvDenseP64MultiCtasKvVarSeqQ16Kv128StaticSwapsAbForGen_cubin.cpp (1 hunks)
  • cpp/tensorrt_llm/kernels/trtllmGenKernels/fmha/cubin/FmhaSm100Kernel_QE4m3KvE2m1OE4m3H64PagedKvDenseP64MultiCtasKvVarSeqQ8Kv128StaticSwapsAbForGen_cubin.cpp (1 hunks)
  • cpp/tensorrt_llm/kernels/trtllmGenKernels/fmha/cubin/FmhaSm100Kernel_QE4m3KvE2m1OE4m3H64PagedKvDenseP64VarSeqQ128Kv128PersistentContext_cubin.cpp (1 hunks)
  • cpp/tensorrt_llm/kernels/trtllmGenKernels/fmha/cubin/FmhaSm100Kernel_QE4m3KvE2m1OE4m3H64PagedKvDenseP64VarSeqQ128Kv128StaticContext_cubin.cpp (1 hunks)
  • cpp/tensorrt_llm/kernels/trtllmGenKernels/fmha/cubin/FmhaSm100Kernel_QE4m3KvE2m1OE4m3H64PagedKvDenseP64VarSeqQ16Kv128PersistentSwapsAbForGen_cubin.cpp (1 hunks)
  • cpp/tensorrt_llm/kernels/trtllmGenKernels/fmha/cubin/FmhaSm100Kernel_QE4m3KvE2m1OE4m3H64PagedKvDenseP64VarSeqQ16Kv128StaticSwapsAbForGen_cubin.cpp (1 hunks)
  • cpp/tensorrt_llm/kernels/trtllmGenKernels/fmha/cubin/FmhaSm100Kernel_QE4m3KvE2m1OE4m3H64PagedKvDenseP64VarSeqQ8Kv128PersistentSwapsAbForGen_cubin.cpp (1 hunks)
  • cpp/tensorrt_llm/kernels/trtllmGenKernels/fmha/cubin/FmhaSm100Kernel_QE4m3KvE2m1OE4m3H64PagedKvDenseP64VarSeqQ8Kv128StaticSwapsAbForGen_cubin.cpp (1 hunks)
  • cpp/tensorrt_llm/kernels/trtllmGenKernels/fmha/cubin/FmhaSm100Kernel_QE4m3KvE2m1OE4m3H64PagedKvSlidingOrChunkedCausalP32MultiCtasKvCgaVarSeqQ16Kv128StaticSwapsAbForGen_cubin.cpp (1 hunks)
  • cpp/tensorrt_llm/kernels/trtllmGenKernels/fmha/cubin/FmhaSm100Kernel_QE4m3KvE2m1OE4m3H64PagedKvSlidingOrChunkedCausalP32MultiCtasKvCgaVarSeqQ8Kv128StaticSwapsAbForGen_cubin.cpp (1 hunks)
  • cpp/tensorrt_llm/kernels/trtllmGenKernels/fmha/cubin/FmhaSm100Kernel_QE4m3KvE2m1OE4m3H64PagedKvSlidingOrChunkedCausalP32MultiCtasKvVarSeqQ16Kv128StaticSwapsAbForGen_cubin.cpp (1 hunks)
  • cpp/tensorrt_llm/kernels/trtllmGenKernels/fmha/cubin/FmhaSm100Kernel_QE4m3KvE2m1OE4m3H64PagedKvSlidingOrChunkedCausalP32MultiCtasKvVarSeqQ8Kv128StaticSwapsAbForGen_cubin.cpp (1 hunks)
  • cpp/tensorrt_llm/kernels/trtllmGenKernels/fmha/cubin/FmhaSm100Kernel_QE4m3KvE2m1OE4m3H64PagedKvSlidingOrChunkedCausalP32VarSeqQ128Kv128PersistentContext_cubin.cpp (1 hunks)
  • cpp/tensorrt_llm/kernels/trtllmGenKernels/fmha/cubin/FmhaSm100Kernel_QE4m3KvE2m1OE4m3H64PagedKvSlidingOrChunkedCausalP32VarSeqQ128Kv128StaticContext_cubin.cpp (1 hunks)
  • cpp/tensorrt_llm/kernels/trtllmGenKernels/fmha/cubin/FmhaSm100Kernel_QE4m3KvE2m1OE4m3H64PagedKvSlidingOrChunkedCausalP32VarSeqQ16Kv128PersistentSwapsAbForGen_cubin.cpp (1 hunks)
  • cpp/tensorrt_llm/kernels/trtllmGenKernels/fmha/cubin/FmhaSm100Kernel_QE4m3KvE2m1OE4m3H64PagedKvSlidingOrChunkedCausalP32VarSeqQ16Kv128StaticSwapsAbForGen_cubin.cpp (1 hunks)
  • cpp/tensorrt_llm/kernels/trtllmGenKernels/fmha/cubin/FmhaSm100Kernel_QE4m3KvE2m1OE4m3H64PagedKvSlidingOrChunkedCausalP32VarSeqQ8Kv128PersistentSwapsAbForGen_cubin.cpp (1 hunks)
  • cpp/tensorrt_llm/kernels/trtllmGenKernels/fmha/cubin/FmhaSm100Kernel_QE4m3KvE2m1OE4m3H64PagedKvSlidingOrChunkedCausalP32VarSeqQ8Kv128StaticSwapsAbForGen_cubin.cpp (1 hunks)
  • cpp/tensorrt_llm/kernels/trtllmGenKernels/fmha/cubin/FmhaSm100Kernel_QE4m3KvE2m1OE4m3H64PagedKvSlidingOrChunkedCausalP64MultiCtasKvCgaVarSeqQ16Kv128StaticSwapsAbForGen_cubin.cpp (1 hunks)
  • cpp/tensorrt_llm/kernels/trtllmGenKernels/fmha/cubin/FmhaSm100Kernel_QE4m3KvE2m1OE4m3H64PagedKvSlidingOrChunkedCausalP64MultiCtasKvCgaVarSeqQ8Kv128StaticSwapsAbForGen_cubin.cpp (1 hunks)
  • cpp/tensorrt_llm/kernels/trtllmGenKernels/fmha/cubin/FmhaSm100Kernel_QE4m3KvE2m1OE4m3H64PagedKvSlidingOrChunkedCausalP64MultiCtasKvVarSeqQ16Kv128StaticSwapsAbForGen_cubin.cpp (1 hunks)
  • cpp/tensorrt_llm/kernels/trtllmGenKernels/fmha/cubin/FmhaSm100Kernel_QE4m3KvE2m1OE4m3H64PagedKvSlidingOrChunkedCausalP64MultiCtasKvVarSeqQ8Kv128StaticSwapsAbForGen_cubin.cpp (1 hunks)
  • cpp/tensorrt_llm/kernels/trtllmGenKernels/fmha/cubin/FmhaSm100Kernel_QE4m3KvE2m1OE4m3H64PagedKvSlidingOrChunkedCausalP64VarSeqQ128Kv128PersistentContext_cubin.cpp (1 hunks)
  • cpp/tensorrt_llm/kernels/trtllmGenKernels/fmha/cubin/FmhaSm100Kernel_QE4m3KvE2m1OE4m3H64PagedKvSlidingOrChunkedCausalP64VarSeqQ128Kv128StaticContext_cubin.cpp (1 hunks)
  • cpp/tensorrt_llm/kernels/trtllmGenKernels/fmha/cubin/FmhaSm100Kernel_QE4m3KvE2m1OE4m3H64PagedKvSlidingOrChunkedCausalP64VarSeqQ16Kv128PersistentSwapsAbForGen_cubin.cpp (1 hunks)
  • cpp/tensorrt_llm/kernels/trtllmGenKernels/fmha/cubin/FmhaSm100Kernel_QE4m3KvE2m1OE4m3H64PagedKvSlidingOrChunkedCausalP64VarSeqQ16Kv128StaticSwapsAbForGen_cubin.cpp (1 hunks)
  • cpp/tensorrt_llm/kernels/trtllmGenKernels/fmha/cubin/FmhaSm100Kernel_QE4m3KvE2m1OE4m3H64PagedKvSlidingOrChunkedCausalP64VarSeqQ8Kv128PersistentSwapsAbForGen_cubin.cpp (1 hunks)
  • cpp/tensorrt_llm/kernels/trtllmGenKernels/fmha/cubin/FmhaSm100Kernel_QE4m3KvE2m1OE4m3H64PagedKvSlidingOrChunkedCausalP64VarSeqQ8Kv128StaticSwapsAbForGen_cubin.cpp (1 hunks)
  • cpp/tensorrt_llm/kernels/trtllmGenKernels/fmha/fmhaRunner.cpp (1 hunks)
  • cpp/tensorrt_llm/kernels/trtllmGenKernels/fmha/fmhaRunnerParams.h (1 hunks)
  • cpp/tensorrt_llm/kernels/trtllmGenKernels/fmha/kernelParams.h (12 hunks)
  • cpp/tensorrt_llm/kernels/unfusedAttentionKernels.h (2 hunks)
  • cpp/tensorrt_llm/kernels/unfusedAttentionKernels/unfusedAttentionKernels_2_template.h (6 hunks)
  • cpp/tensorrt_llm/kernels/xqaDispatcher.cpp (8 hunks)
  • cpp/tensorrt_llm/kernels/xqaDispatcher.h (2 hunks)
  • cpp/tensorrt_llm/pybind/batch_manager/kvCacheManager.cpp (1 hunks)
  • cpp/tensorrt_llm/pybind/bindings.cpp (2 hunks)
  • cpp/tensorrt_llm/thop/attentionOp.cpp (10 hunks)
  • cpp/tests/unit_tests/batch_manager/kvCacheManagerTest.cpp (2 hunks)
⛔ Files not processed due to max files limit (14)
  • cpp/tests/unit_tests/kernels/ropeTest.cu
  • tensorrt_llm/_torch/attention_backend/trtllm.py
  • tensorrt_llm/_torch/custom_ops/torch_custom_ops.py
  • tensorrt_llm/_torch/pyexecutor/_util.py
  • tensorrt_llm/_torch/pyexecutor/model_engine.py
  • tensorrt_llm/_torch/pyexecutor/resource_manager.py
  • tensorrt_llm/_utils.py
  • tensorrt_llm/llmapi/llm_utils.py
  • tests/integration/defs/.test_durations
  • tests/integration/defs/accuracy/references/gsm8k.yaml
  • tests/integration/defs/accuracy/references/mmlu.yaml
  • tests/integration/defs/accuracy/test_llm_api_pytorch.py
  • tests/integration/test_lists/test-db/l0_b200.yml
  • tests/integration/test_lists/test-db/l0_h100.yml
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Pre-commit Check
🔇 Additional comments (112)
cpp/tensorrt_llm/kernels/trtllmGenKernels/fmha/cubin/FmhaSm100Kernel_QE4m3KvE2m1OE4m3H64PagedKvDenseP32VarSeqQ16Kv128StaticSwapsAbForGen_cubin.cpp (1)

2-3: Binary cubin reference updated – ensure provenance

Only the Git-LFS pointer (OID & size) changed; there’s no reviewable source.
Please confirm the cubin was rebuilt from the matching source commit with reproducible flags and is covered by CI artifact signing or checksum validation.

cpp/tensorrt_llm/kernels/trtllmGenKernels/fmha/cubin/FmhaSm100Kernel_QE4m3KvE2m1OE4m3H64PagedKvSlidingOrChunkedCausalP64VarSeqQ128Kv128StaticContext_cubin.cpp (1)

2-3: Binary cubin reference updated – ensure provenance

Same note as above: only metadata changed.
Verify that the new cubin is reproducibly built and signed; otherwise downstream users cannot audit the binary.

cpp/tensorrt_llm/kernels/trtllmGenKernels/fmha/cubin/FmhaSm100Kernel_QE4m3KvE2m1OE4m3H128PagedKvSlidingOrChunkedCausalP64VarSeqQ128Kv128PersistentContext_cubin.cpp (1)

2-3: Binary cubin reference updated – ensure provenance

No source diff. Please keep build recipes or a hash manifest checked-in so the team can reproduce this cubin and verify it matches the committed SHA-256 hash.

cpp/tensorrt_llm/kernels/trtllmGenKernels/fmha/cubin/FmhaSm100Kernel_QE4m3KvE2m1OE4m3H128PagedKvSlidingOrChunkedCausalP32VarSeqQ16Kv128PersistentSwapsAbForGen_cubin.cpp (1)

2-3: Binary cubin reference updated – ensure provenance

Again, only the LFS pointer changed. Please confirm the binary is built from tagged source; consider adding an automated check in CI that rebuilds and diff-checks cubins.

cpp/tensorrt_llm/kernels/trtllmGenKernels/fmha/cubin/FmhaSm100Kernel_QE4m3KvE2m1OE4m3H64PagedKvDenseP32VarSeqQ8Kv128PersistentSwapsAbForGen_cubin.cpp (1)

2-3: Binary cubin reference updated – ensure provenance

Only metadata lines updated. Recommend documenting the compiler flags & commit SHA used to generate this cubin so future reviewers can reproduce or audit it.

cpp/tensorrt_llm/kernels/trtllmGenKernels/fmha/cubin/FmhaSm100Kernel_QE4m3KvE2m1OE4m3H128PagedKvSlidingOrChunkedCausalP32VarSeqQ128Kv128PersistentContext_cubin.cpp (1)

2-3: Binary kernel artifact refreshed – looks fine

Only the LFS pointer (SHA-256 + size) changed. Nothing else to review at source level.
Make sure the rebuilt cubin is reproducible with the current build scripts and that CI/functional tests covering FP4 KV-cache paths pass with this new binary.

cpp/tensorrt_llm/kernels/trtllmGenKernels/fmha/cubin/FmhaSm100Kernel_QE4m3KvE2m1OE4m3H64PagedKvSlidingOrChunkedCausalP64MultiCtasKvCgaVarSeqQ16Kv128StaticSwapsAbForGen_cubin.cpp (1)

2-3: Updated cubin acknowledged

Pointer update only. No actionable code issues. Please confirm this binary was compiled with the same tool-chain flags used for the rest of the FP4 roll-out.

cpp/tensorrt_llm/kernels/trtllmGenKernels/fmha/cubin/FmhaSm100Kernel_QE4m3KvE2m1OE4m3H128PagedKvSlidingOrChunkedCausalP64VarSeqQ16Kv128PersistentSwapsAbForGen_cubin.cpp (1)

2-3: Cubin hash/size updated

No source changes present. Ensure the new kernel aligns with the unified KV scale-factor pointer introduced in the runner params.

cpp/tensorrt_llm/kernels/trtllmGenKernels/fmha/cubin/FmhaSm100Kernel_QE4m3KvE2m1OE4m3H64PagedKvDenseP64MultiCtasKvVarSeqQ16Kv128StaticSwapsAbForGen_cubin.cpp (1)

2-3: Binary refresh noted

Nothing to review in text. Validate that performance baselines are unchanged/improved with this rebuilt cubin.

cpp/tensorrt_llm/kernels/trtllmGenKernels/fmha/cubin/FmhaSm100Kernel_QE4m3KvE2m1OE4m3H64PagedKvDenseP64VarSeqQ128Kv128PersistentContext_cubin.cpp (1)

2-3: Acknowledged new cubin pointer

No code concerns. Recommend running regression tests that exercise persistent-context path.

cpp/tensorrt_llm/kernels/trtllmGenKernels/fmha/cubin/FmhaSm100Kernel_QE4m3KvE2m1OE4m3H64PagedKvSlidingOrChunkedCausalP64VarSeqQ8Kv128PersistentSwapsAbForGen_cubin.cpp (1)

1-4: Confirm reproducible build & runtime registration for updated cubin.

Only the Git-LFS pointer (SHA / size) changed, so we cannot review the binary logic itself.
Please double-check that:

  1. The cubin was rebuilt with the same compiler flags (SM 100, PTX version, -G off, determinism enabled) as the rest of the FMHA set – mismatched flags silently break mixed-build deployments.
  2. Every dispatcher table / manifest that lists built-in kernels is updated to reference this exact cubin name; otherwise the loader will fall back to an older artifact at runtime.

If you have a deterministic-build manifest, consider committing it next to the cubins to make future reviews easier.

cpp/tensorrt_llm/kernels/trtllmGenKernels/fmha/cubin/FmhaSm100Kernel_QE4m3KvE2m1OE4m3H64PagedKvDenseP32VarSeqQ8Kv128StaticSwapsAbForGen_cubin.cpp (1)

2-3: Binary cubin updated – please confirm runtime compatibility

Only the Git LFS pointer (OID/size) changed, indicating a rebuilt kernel binary. No source code is available for static review.
Make sure:

  1. The new cubin was compiled with the same (or newer, compatible) CUDA & SM 100 tool-chain flags expected by the dispatcher.
  2. Any hard-coded SHA/size checks in loader code are updated.
  3. CI runs end-to-end attention tests exercising FP4 KV-cache paths on an SM100 device.

cpp/tensorrt_llm/kernels/trtllmGenKernels/fmha/cubin/FmhaSm100Kernel_QE4m3KvE2m1OE4m3H128PagedKvDenseP64VarSeqQ16Kv128PersistentSwapsAbForGen_cubin.cpp (1)

2-3: Recompiled kernel artifact – ensure loader & params stayed in sync

The cubin’s hash/size changed; no textual diff is reviewable.
Please validate:

• New kernel still matches updated fmhaRunnerParams (packed KV scale ptr).
• Performance regressions are tracked (new cubin is ~35 KB larger).
• Unit tests covering H128 paged-KV persistent path pass.

cpp/tensorrt_llm/kernels/trtllmGenKernels/fmha/cubin/FmhaSm100Kernel_QE4m3KvE2m1OE4m3H64PagedKvCausalP64VarSeqQ128Kv128PersistentContext_cubin.cpp (1)

2-3: Large cubin replacement – double-check causal-context kernel launch

Given the FP4 KV-cache refactor, confirm that:

  1. Kernel launch code sets the packed-KV scale pointer for this causal variant.
  2. Memory footprint (TMEM descriptors) still fits occupancy targets on SM100.
  3. Regression tests with long-sequence causal generation succeed.

cpp/tensorrt_llm/kernels/trtllmGenKernels/fmha/cubin/FmhaSm100Kernel_QE4m3KvE2m1OE4m3H64PagedKvSlidingOrChunkedCausalP64MultiCtasKvVarSeqQ8Kv128StaticSwapsAbForGen_cubin.cpp (1)

2-3: Size drop (~300 KB) – verify no feature loss

The new cubin is significantly smaller. Ensure build flags (e.g., -lineinfo, -res-usage) or disabled paths didn’t strip required functionality. Run sliding-window / chunked-causal benchmarks to catch latent issues.

cpp/tensorrt_llm/kernels/trtllmGenKernels/fmha/cubin/FmhaSm100Kernel_QE4m3KvE2m1OE4m3H64PagedKvDenseP32MultiCtasKvCgaVarSeqQ8Kv128StaticSwapsAbForGen_cubin.cpp (1)

2-3: Updated cubin – confirm CGA multi-CTAs path correctness

With FP4 KV-cache support, multi-CTA CGA kernels are especially sensitive to scale-pool pointer propagation. Please:

• Run distributed-KV stress test.
• Check for mismatched quant-scale leading to NaNs.
• Capture new PTX-generated register usage to watch for spill regressions.

cpp/tensorrt_llm/kernels/trtllmGenKernels/fmha/cubin/FmhaSm100Kernel_QE4m3KvE2m1OE4m3H64PagedKvSlidingOrChunkedCausalP32VarSeqQ8Kv128PersistentSwapsAbForGen_cubin.cpp (1)

2-3: Confirm .cubin pointer is committed via Git-LFS

Only the LFS pointer (not the raw binary) should live in Git. Please double-check that
*.cubin.cpp is covered by the existing .gitattributes and that CI still fetches the blobs.
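
For reference, a minimal sketch of the kind of .gitattributes rule that keeps these generated cubin wrapper sources in LFS — the exact pattern used in this repository may differ:

```
# Track generated cubin wrapper sources via Git LFS
cpp/tensorrt_llm/kernels/trtllmGenKernels/fmha/cubin/*.cpp filter=lfs diff=lfs merge=lfs -text
```

`git lfs ls-files` on a fresh clone should then list every `*_cubin.cpp` under this directory; any file missing from that output would be committed as a raw blob instead of a pointer.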

cpp/tensorrt_llm/kernels/trtllmGenKernels/fmha/cubin/FmhaSm100Kernel_QE4m3KvE2m1OE4m3H64PagedKvSlidingOrChunkedCausalP64VarSeqQ8Kv128StaticSwapsAbForGen_cubin.cpp (1)

2-3: Same LFS-tracking sanity check

Ensure the new SHA/object is pushed to the LFS store; otherwise downstream git lfs pull will fail.

cpp/tensorrt_llm/kernels/trtllmGenKernels/fmha/cubin/FmhaSm100Kernel_QE4m3KvE2m1OE4m3H128PagedKvDenseP32VarSeqQ128Kv128StaticContext_cubin.cpp (1)

2-3: Large-binary pointer update looks fine

Hash / size changed as expected after recompilation for FP4 support.

cpp/tensorrt_llm/kernels/trtllmGenKernels/fmha/cubin/FmhaSm100Kernel_QE4m3KvE2m1OE4m3H128PagedKvDenseP32VarSeqQ16Kv128StaticSwapsAbForGen_cubin.cpp (1)

2-3: Reminder: keep binary artifacts out of review diffs

Nothing actionable, just noting the reduced cubin size (👍).

cpp/tensorrt_llm/kernels/trtllmGenKernels/fmha/cubin/FmhaSm100Kernel_QE4m3KvE2m1OE4m3H128PagedKvCausalP32VarSeqQ128Kv128PersistentContext_cubin.cpp (1)

2-3: Ensure the updated cubin is registered in all build/package manifests

Binary-only change looks fine, but please double-check that:

  1. The new SHA (0bf00b1…) is pushed to the LFS store.
  2. Any CMake-style manifest or Python wheel packaging list has been updated so downstream consumers pull the refreshed cubin.

Missing one of these breaks runtime kernel loading on fresh clones/installs.

cpp/tensorrt_llm/kernels/trtllmGenKernels/fmha/cubin/FmhaSm100Kernel_QE4m3KvE2m1OE4m3H128PagedKvSlidingOrChunkedCausalP64VarSeqQ16Kv128StaticSwapsAbForGen_cubin.cpp (1)

2-3: Verify kernel compatibility across driver/toolkit matrix

Size dropped ~35 %. Great, but smaller cubins occasionally miss older-SM fallback code-paths.
Please run the usual SM70/80 smoke tests before merge.

cpp/tensorrt_llm/kernels/trtllmGenKernels/fmha/cubin/FmhaSm100Kernel_QE4m3KvE2m1OE4m3H128PagedKvDenseP32MultiCtasKvVarSeqQ16Kv128StaticSwapsAbForGen_cubin.cpp (1)

2-3: LGTM – binary refresh only
No concerns; compilation artifact update is expected with FP4 support.

cpp/tensorrt_llm/kernels/trtllmGenKernels/fmha/cubin/FmhaSm100Kernel_QE4m3KvE2m1OE4m3H128PagedKvDenseP64VarSeqQ8Kv128StaticSwapsAbForGen_cubin.cpp (1)

2-3: Confirm CI caches get invalidated

Because the file hash changed, cached Docker layers or pre-built wheels may still hold the old cubin.
Make sure the CI scripts force a clean rebuild (e.g., bump wheel version or clear $HOME/.cache/trtllm).

cpp/tensorrt_llm/kernels/trtllmGenKernels/fmha/cubin/FmhaSm100Kernel_QE4m3KvE2m1OE4m3H64PagedKvDenseP32MultiCtasKvVarSeqQ16Kv128StaticSwapsAbForGen_cubin.cpp (1)

2-3: Binary artifact updated – please run a smoke test on SM100 hardware.

The cubin hash and size changed, implying a full re-compile. Make sure:

  1. The CI matrix still flashes this cubin on an SM100 board.
  2. Runtime kernel selection logic (fmhaDispatcher) resolves to this exact cubin when dtype == NVFP4.

If either step is missing, the feature will silently fall back or seg-fault.

cpp/tensorrt_llm/kernels/trtllmGenKernels/fmha/cubin/FmhaSm100Kernel_QE4m3KvE2m1OE4m3H128PagedKvSlidingOrChunkedCausalP32MultiCtasKvVarSeqQ16Kv128StaticSwapsAbForGen_cubin.cpp (1)

2-3: Large binary delta – double-check PTX/linker versions.

The size drop (~33 %) may come from switching CUDA versions or new linker options. Ensure the PTX major/minor baked into the cubin is still ≥ the minimum supported by the production driver fleet (documented in docs/compat_matrix.md).

cpp/tensorrt_llm/kernels/trtllmGenKernels/fmha/cubin/FmhaSm100Kernel_QE4m3KvE2m1OE4m3H64PagedKvSlidingOrChunkedCausalP64MultiCtasKvVarSeqQ16Kv128StaticSwapsAbForGen_cubin.cpp (1)

2-3: Binary drop-in looks good.

Nothing to flag here beyond the usual runtime-selection sanity checks already noted for the other cubins.

cpp/tensorrt_llm/kernels/trtllmGenKernels/fmha/cubin/FmhaSm100Kernel_QE4m3KvE2m1OE4m3H64PagedKvSlidingOrChunkedCausalP32VarSeqQ128Kv128PersistentContext_cubin.cpp (1)

2-3: Persistent-context cubin updated – verify memory footprint.

Persistent-context kernels keep on-device buffers alive between launches. The 40 % size reduction is great, but please profile to confirm that the reduced register/shared-memory usage did not regress occupancy vs. the previous build.

cpp/tensorrt_llm/kernels/trtllmGenKernels/fmha/cubin/FmhaSm100Kernel_QE4m3KvE2m1OE4m3H64PagedKvSlidingOrChunkedCausalP64VarSeqQ16Kv128PersistentSwapsAbForGen_cubin.cpp (1)

2-3: Binary-pointer updated – looks correct

Only the Git-LFS oid and size fields changed to reference the newly-built cubin. Format and placement remain valid; nothing else to review.

cpp/tensorrt_llm/kernels/trtllmGenKernels/fmha/cubin/FmhaSm100Kernel_QE4m3KvE2m1OE4m3H64PagedKvDenseP32VarSeqQ16Kv128PersistentSwapsAbForGen_cubin.cpp (1)

2-3: Binary-pointer updated – looks correct

LFS pointer metadata refresh only. No issues spotted.

cpp/tensorrt_llm/kernels/trtllmGenKernels/fmha/cubin/FmhaSm100Kernel_QE4m3KvE2m1OE4m3H128PagedKvDenseP64MultiCtasKvCgaVarSeqQ8Kv128StaticSwapsAbForGen_cubin.cpp (1)

2-3: Binary-pointer updated – looks correct

The oid and size entries were swapped for the new cubin blob; pointer syntax remains valid.

cpp/tensorrt_llm/kernels/trtllmGenKernels/fmha/cubin/FmhaSm100Kernel_QE4m3KvE2m1OE4m3H128PagedKvDenseP32MultiCtasKvVarSeqQ8Kv128StaticSwapsAbForGen_cubin.cpp (1)

2-3: Binary-pointer updated – looks correct

No functional code touched aside from LFS metadata. Good to go.

cpp/tensorrt_llm/kernels/trtllmGenKernels/fmha/cubin/FmhaSm100Kernel_QE4m3KvE2m1OE4m3H128PagedKvSlidingOrChunkedCausalP32VarSeqQ16Kv128StaticSwapsAbForGen_cubin.cpp (1)

2-3: Binary-pointer updated – looks correct

Change is limited to Git-LFS oid + size. No further review needed.

cpp/tensorrt_llm/kernels/trtllmGenKernels/fmha/cubin/FmhaSm100Kernel_QE4m3KvE2m1OE4m3H128PagedKvSlidingOrChunkedCausalP32MultiCtasKvCgaVarSeqQ16Kv128StaticSwapsAbForGen_cubin.cpp (1)

2-3: Binary pointer update looks fine

Only the Git-LFS pointer hash/size changed. No source logic to review.

cpp/tensorrt_llm/kernels/trtllmGenKernels/fmha/cubin/FmhaSm100Kernel_QE4m3KvE2m1OE4m3H128PagedKvSlidingOrChunkedCausalP64MultiCtasKvVarSeqQ16Kv128StaticSwapsAbForGen_cubin.cpp (1)

2-3: Binary pointer update looks fine

Only the Git-LFS pointer hash/size changed. No source logic to review.

cpp/tensorrt_llm/kernels/trtllmGenKernels/fmha/cubin/FmhaSm100Kernel_QE4m3KvE2m1OE4m3H128PagedKvSlidingOrChunkedCausalP64MultiCtasKvCgaVarSeqQ16Kv128StaticSwapsAbForGen_cubin.cpp (1)

2-3: Binary pointer update looks fine

Only the Git-LFS pointer hash/size changed. No source logic to review.

cpp/tensorrt_llm/kernels/trtllmGenKernels/fmha/cubin/FmhaSm100Kernel_QE4m3KvE2m1OE4m3H128PagedKvDenseP64VarSeqQ8Kv128PersistentSwapsAbForGen_cubin.cpp (1)

2-3: Binary pointer update looks fine

Only the Git-LFS pointer hash/size changed. No source logic to review.

cpp/tensorrt_llm/kernels/trtllmGenKernels/fmha/cubin/FmhaSm100Kernel_QE4m3KvE2m1OE4m3H128PagedKvSlidingOrChunkedCausalP32MultiCtasKvVarSeqQ8Kv128StaticSwapsAbForGen_cubin.cpp (1)

2-3: Binary pointer update looks fine

Only the Git-LFS pointer hash/size changed. No source logic to review.

cpp/tensorrt_llm/kernels/trtllmGenKernels/fmha/cubin/FmhaSm100Kernel_QE4m3KvE2m1OE4m3H128PagedKvSlidingOrChunkedCausalP64VarSeqQ128Kv128StaticContext_cubin.cpp (1)

2-3: Verify that the new cubin is picked up by the runtime & CI artifacts.

The SHA 256 and size have changed, but nothing in CMake / build scripts was updated in this PR. Please double-check that:

  1. The new oid is correctly staged in LFS ( git lfs ls-files ).
  2. Any internal hash maps / version tables that reference this cubin (e.g. in fmhaDispatcher.cpp) don’t hard-code the previous oid.
  3. CI runs an inference smoke-test on SM100 GPUs to ensure the kernel loads and produces identical numerical results.

cpp/tensorrt_llm/kernels/trtllmGenKernels/fmha/cubin/FmhaSm100Kernel_QE4m3KvE2m1OE4m3H64PagedKvSlidingOrChunkedCausalP32MultiCtasKvVarSeqQ8Kv128StaticSwapsAbForGen_cubin.cpp (1)

2-3: Manual disassembly needed for SM100 cubin

The repository doesn’t contain a standalone .cubin artifact (it’s generated at build time), so we couldn’t automatically verify the presence of SM100 code objects. Please, after building the FMHA kernel, run:

nvdisasm --print-code path/to/FmhaSm100Kernel_QE4m3KvE2m1OE4m3H64PagedKvSlidingOrChunkedCausalP32MultiCtasKvVarSeqQ8Kv128StaticSwapsAbForGen.cubin | head -n 20

and confirm you still see the .sm_100 code objects in the output.

cpp/tensorrt_llm/kernels/trtllmGenKernels/fmha/cubin/FmhaSm100Kernel_QE4m3KvE2m1OE4m3H64PagedKvDenseP64VarSeqQ128Kv128StaticContext_cubin.cpp (1)

2-3: Check that dependent descriptor tables were regenerated together.

This cubin participates in the dense-KV path. Any mismatch between its new TMA descriptors and the updated kernelParams.h will cause silent wrong-results. Make sure kernelParamsUnitTest (if present) is re-run against this binary.

cpp/tensorrt_llm/kernels/trtllmGenKernels/fmha/cubin/FmhaSm100Kernel_QE4m3KvE2m1OE4m3H64PagedKvDenseP64MultiCtasKvCgaVarSeqQ8Kv128StaticSwapsAbForGen_cubin.cpp (1)

2-3: Multi-CTA variant – validate scheduler compatibility.

Because this variant relies on CUDA cooperative groups, ensure the kernel was re-compiled with the same -arch=sm_100 and -rdc=false flags; otherwise, multi-CTA synchronization may break on some driver versions.

cpp/tensorrt_llm/kernels/trtllmGenKernels/fmha/cubin/FmhaSm100Kernel_QE4m3KvE2m1OE4m3H128PagedKvCausalP64VarSeqQ128Kv128StaticContext_cubin.cpp (1)

2-3: Ensure causal-path perf parity after cubin shrink.

The file size dropped ~39 %. Please compare end-to-end latency of a causal attention benchmark before/after this change to catch accidental register-spilling regressions.

cpp/tensorrt_llm/kernels/trtllmGenKernels/fmha/cubin/FmhaSm100Kernel_QE4m3KvE2m1OE4m3H64PagedKvSlidingOrChunkedCausalP64VarSeqQ128Kv128PersistentContext_cubin.cpp (1)

2-3: Binary kernel update for FP4 KV cache support.

The updated CUDA kernel binary reflects the coordinated recompilation to support NVFP4 KV cache quantization. The significant size reduction (from ~1.7MB to ~1MB) suggests optimization or updated compilation settings.

cpp/tensorrt_llm/kernels/decoderMaskedMultiheadAttention/decoderXQAImplJIT/decoderXQAImplJIT.cpp (1)

306-306: Naming consistency fix for KV scale quantization parameter.

The field name change from kvScaleOrigQuant to kv_scale_orig_quant aligns with the standardized snake_case naming convention for FP4 KV cache scale factor handling across the codebase.

cpp/tensorrt_llm/kernels/trtllmGenKernels/fmha/cubin/FmhaSm100Kernel_QE4m3KvE2m1OE4m3H64PagedKvCausalP32VarSeqQ128Kv128PersistentContext_cubin.cpp (1)

2-3: Binary kernel update for FP4 KV cache support.

The updated CUDA kernel binary is part of the comprehensive FP4 KV cache quantization support rollout. The significant size reduction (from ~1.4MB to ~875KB) indicates substantial optimization or updated compilation settings.

cpp/tensorrt_llm/kernels/trtllmGenKernels/fmha/cubin/FmhaSm100Kernel_QE4m3KvE2m1OE4m3H128PagedKvSlidingOrChunkedCausalP64MultiCtasKvVarSeqQ8Kv128StaticSwapsAbForGen_cubin.cpp (1)

2-3: Binary kernel update for FP4 KV cache support.

The updated CUDA kernel binary continues the pattern of coordinated recompilation for NVFP4 KV cache quantization support. The size reduction (from ~1MB to ~715KB) aligns with the optimization trend observed across all kernel updates.

cpp/tensorrt_llm/kernels/fmhaDispatcher.cpp (3)

138-138: Proper initialization of KV scale factor pool pointer.

The kvSfPoolPtr is correctly initialized to nullptr for safe handling of FP4 KV cache scale factors.


148-148: Conditional assignment of KV scale factor pool pointer.

The assignment from runnerParams.pagedKvSfCache.mPrimaryPoolPtr is correctly placed within the Q_PAGED_KV layout condition, ensuring it's only used for paged KV cache scenarios where FP4 scale factors are relevant.


169-169: Proper propagation of KV scale factor pointer to runner.

The kvSfPtr field assignment correctly passes the FP4 KV cache scale factor buffer to the FMHA runner parameters, completing the integration of FP4 KV cache support in the dispatcher.

cpp/tensorrt_llm/kernels/trtllmGenKernels/fmha/cubin/FmhaSm100Kernel_QE4m3KvE2m1OE4m3H64PagedKvSlidingOrChunkedCausalP32MultiCtasKvCgaVarSeqQ8Kv128StaticSwapsAbForGen_cubin.cpp (1)

2-3: CUDA kernel binary updated for FP4 KV cache support.

The kernel binary has been recompiled with a 33% size reduction (887,191 → 583,642 bytes), likely reflecting optimizations for the new FP4 KV cache quantization mode support.

cpp/tensorrt_llm/kernels/trtllmGenKernels/fmha/cubin/FmhaSm100Kernel_QE4m3KvE2m1OE4m3H64PagedKvDenseP64MultiCtasKvVarSeqQ8Kv128StaticSwapsAbForGen_cubin.cpp (1)

2-3: CUDA kernel binary updated for FP4 KV cache support.

The kernel binary has been recompiled with a 33% size reduction (873,881 → 579,976 bytes), consistent with the optimization pattern across FP4-related kernel updates.

cpp/tensorrt_llm/kernels/trtllmGenKernels/fmha/cubin/FmhaSm100Kernel_QE4m3KvE2m1OE4m3H64PagedKvCausalP64VarSeqQ128Kv128StaticContext_cubin.cpp (1)

2-3: Context CUDA kernel binary updated with significant optimization.

The context kernel binary shows a 39% size reduction (1,397,181 → 853,331 bytes), indicating substantial optimizations for FP4 KV cache support in context attention scenarios.

cpp/tensorrt_llm/common/attentionOp.h (1)

95-96: LGTM! Properly implemented FP4 KV cache scale pool pointers.

The new pointer members are correctly initialized to nullptr and follow the consistent naming convention with existing pool pointer members.

cpp/tensorrt_llm/kernels/trtllmGenKernels/fmha/cubin/FmhaSm100Kernel_QE4m3KvE2m1OE4m3H128PagedKvDenseP32MultiCtasKvCgaVarSeqQ16Kv128StaticSwapsAbForGen_cubin.cpp (1)

2-3: CUDA kernel binary updated for FP4 KV cache support.

The H128 kernel binary shows a 33% size reduction (1,118,485 → 748,905 bytes), completing the systematic kernel recompilation for FP4 KV cache quantization support.

cpp/tensorrt_llm/kernels/trtllmGenKernels/fmha/cubin/FmhaSm100Kernel_QE4m3KvE2m1OE4m3H64PagedKvDenseP32VarSeqQ128Kv128StaticContext_cubin.cpp (1)

2-3: Binary artifact refreshed — please run smoke-tests on real HW

We can’t statically review .cubin contents.
Before merging, ensure the new build (hash 19e710…, 852,663 bytes) is:

• Executed on at least one SM100 board with end-to-end attention ops.
• Free of regressions vs. the previous cubin (perf / accuracy / crashes).
• Produced with the same compile flags & PTX ISA version expected by the runtime loader.

If CI does not already validate this, please trigger a short “FP4 KV-cache” generation test-suite.

cpp/tensorrt_llm/kernels/trtllmGenKernels/fmha/cubin/FmhaSm100Kernel_QE4m3KvE2m1OE4m3H128PagedKvSlidingOrChunkedCausalP32VarSeqQ8Kv128PersistentSwapsAbForGen_cubin.cpp (1)

2-3: Validate cubin fingerprint & runtime load path

New LFS pointer (sha256 = 7d9e853d…) replaces the previous kernel image.
Please make sure that:

  1. The loader side (FMHA runner / dispatcher) is not doing any hard-coded size or checksum validation that would now fail.
  2. CI executes at least one end-to-end inference that exercises this kernel on an SM100 board to guarantee the rebuilt cubin is actually loadable.

If neither check is in place, add a minimal smoke-test; otherwise confirm the existing tests pass.

cpp/tensorrt_llm/kernels/trtllmGenKernels/fmha/cubin/FmhaSm100Kernel_QE4m3KvE2m1OE4m3H64PagedKvSlidingOrChunkedCausalP32MultiCtasKvCgaVarSeqQ16Kv128StaticSwapsAbForGen_cubin.cpp (1)

2-3: Static-swap cubin updated – verify CI still covers this code path

This variant is only hit with multi-CTA static-swap configs. Double-check the unit/functional tests still include that scenario; otherwise we risk shipping an untested cubin.

cpp/tensorrt_llm/kernels/trtllmGenKernels/fmha/cubin/FmhaSm100Kernel_QE4m3KvE2m1OE4m3H128PagedKvSlidingOrChunkedCausalP64VarSeqQ8Kv128StaticSwapsAbForGen_cubin.cpp (1)

2-3: Check mixed P64 / H128 path after cubin refresh

This H128 / P64 kernel is critical for large-head models. Please verify the new cubin was compiled with the same max registers limit; regressions there often surface only at high batch sizes.

cpp/tensorrt_llm/kernels/trtllmGenKernels/fmha/cubin/FmhaSm100Kernel_QE4m3KvE2m1OE4m3H128PagedKvSlidingOrChunkedCausalP32MultiCtasKvCgaVarSeqQ8Kv128StaticSwapsAbForGen_cubin.cpp (1)

1-4: Binary kernel update for FP4 KV Cache support.

The updated CUDA kernel binary shows a significant size reduction (34%) and new SHA-256 hash, indicating substantial changes to support NVFP4 KV Cache. The filename suggests specific support for E2M1 (FP4) format.

Since this is a binary artifact, please ensure comprehensive testing coverage for the new FP4 KV cache functionality to validate correctness and performance.

cpp/tensorrt_llm/kernels/trtllmGenKernels/fmha/cubin/FmhaSm100Kernel_QE4m3KvE2m1OE4m3H64PagedKvSlidingOrChunkedCausalP64MultiCtasKvCgaVarSeqQ8Kv128StaticSwapsAbForGen_cubin.cpp (1)

1-4: Consistent binary kernel update for FP4 KV Cache support.

This kernel binary also shows a ~34% size reduction and updated hash, consistent with the systematic kernel updates for NVFP4 KV Cache support. The H64 configuration complements the H128 variant from the previous file.

The consistent size reduction pattern across kernel binaries suggests systematic optimization. Ensure testing covers both H64 and H128 head configurations.

cpp/tensorrt_llm/kernels/trtllmGenKernels/fmha/fmhaRunner.cpp (1)

40-42: DATA_TYPE_E2M1 definition and usage confirmed
Verified that the new FP4 data type is declared and consistently referenced across the codebase:

  • Defined in multiHeadAttentionCommon.h (enum Data_type) and in fused_multihead_attention_utils.h
  • Used in FMHA runner validation (fmhaRunner.cpp), kernel parameters (kernelParams.h), dispatcher (xqaDispatcher.cpp), and attention operator (common/attentionOp.cpp)

No further changes required.

cpp/tensorrt_llm/pybind/batch_manager/kvCacheManager.cpp (1)

358-369: Well-implemented Python binding for block scale pool pointers.

The new get_block_scale_pool_pointers() method follows the established pattern used by similar methods like get_block_pool_pointers(). The implementation correctly handles optional tensor conversion and maintains consistency with the existing codebase.

cpp/tensorrt_llm/pybind/bindings.cpp (2)

160-160: LGTM: Clean addition of NVFP4 data type enum.

The new enum value correctly maps "NVFP4" to nvinfer1::DataType::kFP4, enabling Python access to the FP4 data type for KV cache quantization.


239-239: LGTM: Proper exposure of FP4 KV cache capability.

The new read-only property has_fp4_kv_cache correctly exposes the C++ method hasFp4KvCache(), allowing Python code to query FP4 KV cache quantization mode support.

cpp/tensorrt_llm/kernels/trtllmGenKernels/fmha/cubin/FmhaSm100Kernel_QE4m3KvE2m1OE4m3H128PagedKvSlidingOrChunkedCausalP32VarSeqQ8Kv128StaticSwapsAbForGen_cubin.cpp (1)

2-3: Binary kernel artifact updated for FP4 support.

The CUDA kernel binary has been updated with a new SHA256 hash and reduced size (945,675 → 602,808 bytes). This aligns with the comprehensive FP4 KV cache support being added throughout the codebase. Since this is a binary artifact, the actual kernel implementation changes cannot be reviewed at the source level.

cpp/tensorrt_llm/kernels/unfusedAttentionKernels/unfusedAttentionKernels_2_template.h (5)

584-584: LGTM! Consistent variable renaming.

The renaming from kvScaleOrigQuant to kv_scale_orig_quant follows consistent naming conventions and maintains the same null-checking logic with appropriate fallback value.


636-639: LGTM! Consistent renaming in FP4 cache handling.

The variable renaming is applied consistently in the FP4 cache quantization logic while preserving the fallback mechanism between kv_cache_scale_factors and kv_scale_orig_quant.


1002-1003: LGTM! Consistent variable renaming in V2 kernel.

The variable renaming is consistently applied in the applyBiasRopeUpdateKVCacheV2 kernel function, maintaining the same null-checking logic and fallback behavior.


1059-1062: LGTM! Consistent variable renaming in V2 FP4 handling.

The variable renaming is consistently applied with proper null checking and fallback logic in the FP4 cache quantization section of the V2 kernel.


1428-1428: LGTM! Consistent variable renaming in cross-attention kernel.

The variable renaming follows the established pattern with proper null checking and fallback value in the cross-attention kernel function.

cpp/tensorrt_llm/kernels/trtllmGenKernels/fmha/fmhaRunnerParams.h (1)

195-196: LGTM! Structural improvement for FP4 KV cache support.

The addition of kvSfPtr as a packed KV scaling factor buffer consolidates the scaling factor handling, which aligns well with the FP4 KV cache quantization support being introduced in this PR. This unified approach should simplify the parameter passing and improve maintainability.

cpp/tensorrt_llm/kernels/unfusedAttentionKernels.h (1)

127-127: LGTM! Improved naming consistency.

The renaming from kvScaleOrigQuant to kv_scale_orig_quant follows the snake_case naming convention, which improves consistency across the codebase. The toString() method has been correctly updated to reflect this change.

Also applies to: 242-242

cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp (3)

611-620: LGTM! Proper FP4 container packing implementation.

The code correctly handles FP4 data type where two 4-bit elements are packed into one 8-bit container. The divisibility check ensures proper alignment, and the pool size calculation is appropriately adjusted.

Also applies to: 631-631
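To make the packing arithmetic concrete — the helper and its parameters below are illustrative, not the actual kvCacheManager API — pool sizing with sub-byte elements behaves like:

```python
def kv_pool_bytes(num_blocks, block_size_elems, elem_bits):
    """Bytes for a KV pool when elements may be sub-byte (e.g. 4-bit FP4)."""
    elts_per_container = 8 // elem_bits  # 2 for FP4, 1 for FP8/INT8
    # Mirrors the divisibility check: a block must pack into whole containers.
    assert block_size_elems % elts_per_container == 0, "block must pack evenly"
    return num_blocks * block_size_elems * elem_bits // 8

# FP4 packs two elements per byte, so the pool is half the 8-bit size.
assert kv_pool_bytes(4, 1024, 4) == 2048
assert kv_pool_bytes(4, 1024, 8) == 4096
```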


716-721: LGTM! Consistent FP4 handling in block scale pools.

The code maintains consistency with the FP4 packing scheme by properly accounting for numEltsPerContainer in both the divisibility check and the block scale size calculation.

Also applies to: 726-729


751-765: LGTM! Appropriate FP4 storage representation.

The code correctly represents FP4 data as INT8 in memory pools, which aligns with the packing scheme where two 4-bit FP4 values are stored in one 8-bit container.

cpp/tensorrt_llm/kernels/xqaDispatcher.h (1)

81-82: All XqaDispatcher::run call sites have been updated to pass three buffers.

Both occurrences in attentionOp.cpp now call:

  • Line 1209: mXqaDispatcher->run(xqaParams, kv_cache_buffer, kv_scale_cache_buffer);
  • Line 2218: mXqaDispatcher->run(xqaParams, kv_cache_buffer, kv_scale_cache_buffer);

No two-argument invocations remain.

cpp/tensorrt_llm/kernels/xqaDispatcher.cpp (7)

52-54: LGTM! Correct extension for FP4 KV cache support.

The modification properly extends the data type check to include DATA_TYPE_E2M1 (FP4) alongside the existing FP8 support, maintaining the same conversion logic.


172-173: Good generalization of the data type check.

The change from hardcoded DATA_TYPE_E4M3 to mFixedParams.mathDataType makes the check more flexible and maintainable for supporting different quantization modes.


245-257: LGTM! Proper cache type determination for NVFP4.

The logic correctly extends the cache type detection to include FP4 KV cache quantization mode.


313-317: Consistent scale pointer handling for FP4/FP8.

The code correctly sets up scale pointers when either FP8 or FP4 KV cache is enabled. Note the member name change from kvScaleOrigQuant to kv_scale_orig_quant for consistency.


330-332: Clear documentation for NVFP4-specific behavior.

Good addition of explanatory comments about the second-level scale factors for NVFP4 KV cache.


449-461: Correct parameter propagation in KVLinearBuffer overload.

The new kv_cache_block_scales_buffer parameter is properly added and forwarded to runImpl.


464-476: Correct parameter propagation in KVBlockArray overload.

The new kv_cache_block_scales_buffer parameter is consistently added and forwarded to runImpl.

cpp/tensorrt_llm/thop/attentionOp.cpp (2)

256-279: Well-implemented FP4 block scale pool handling.

The code correctly:

  • Checks for FP4 KV cache mode
  • Validates required pointers with TLLM_CHECK
  • Calculates proper offsets using vector_size=16 and 1 byte per E4M3 scale factor
  • Assigns pointers with correct arithmetic
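The offset arithmetic described above can be sketched as follows (the constants come from the comment; the helper name and the assumed block layout are hypothetical):

```python
VECTOR_SIZE = 16  # FP4 elements sharing one scale factor
SF_BYTES = 1      # one E4M3 byte per scale

def block_scale_bytes(tokens_per_block, num_kv_heads, head_size):
    """Scale-pool bytes for one KV block of FP4 data."""
    elems = tokens_per_block * num_kv_heads * head_size
    assert elems % VECTOR_SIZE == 0
    return elems // VECTOR_SIZE * SF_BYTES

# 64 tokens x 8 heads x 128 dims -> 65536 FP4 elements -> 4096 scale bytes
assert block_scale_bytes(64, 8, 128) == 4096
```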

783-783: Correct Torch binding update for FP4 support.

The new host_kv_cache_block_scale_pool_pointers parameter is properly added to the Torch library fragment definition.

cpp/tensorrt_llm/kernels/trtllmGenKernels/fmha/kernelParams.h (5)

395-432: Well-designed TMEM storage mode for FP4 K/V data.

The implementation correctly handles two storage modes:

  • Standard mode: FP4 elements packed 2 per byte (divide by colIdxDivisor)
  • TMEM mode: dimensions reflect individual FP4 elements (no division)

The conditional logic properly adjusts TMA descriptors based on the storage mode.
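The two storage modes can be illustrated with a small sketch (the divisor values follow the description above; the helper is hypothetical, not the actual TMA-descriptor code):

```python
def tma_box_cols(num_fp4_elems, store_kv_in_tmem):
    """Columns seen by the TMA descriptor for a row of FP4 elements."""
    # Standard mode: two FP4 values per byte, so the descriptor sees
    # half as many columns. TMEM mode addresses individual FP4 elements.
    col_idx_divisor = 1 if store_kv_in_tmem else 2
    assert num_fp4_elems % col_idx_divisor == 0
    return num_fp4_elems // col_idx_divisor

assert tma_box_cols(128, store_kv_in_tmem=False) == 64
assert tma_box_cols(128, store_kv_in_tmem=True) == 128
```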


475-505: Correct bit-level pointer arithmetic for FP4 support.

The conversion from byte-based to bit-based calculations properly handles variable bit widths, with correct division by 8 to convert to bytes for pointer arithmetic.


521-552: Excellent support for packed and unpacked FP4 formats.

The implementation correctly handles:

  • Packed FP4 format using CU_TENSOR_MAP_DATA_TYPE_UINT8
  • Unpacked FP4 format using CU_TENSOR_MAP_DATA_TYPE_16U4_ALIGN16B for TMEM
  • Appropriate swizzle type selection for each format

677-680: Hardware-specific TMEM optimization for FP4.

The detection logic for storeTransformedKvInTmem requires very specific conditions:

  • FP4 KV data type (E2M1)
  • FP8 Q data type (E4M3)
  • Head dimension of 128
  • Kernel uses swapped MMA operations

Please verify these specific hardware requirements are documented and that the optimization is beneficial for the targeted hardware configurations.


720-726: Good unification of KV scaling factor pointers.

The change from separate K/V scale pointers to a unified kvSfPtr simplifies the interface while maintaining functionality.

cpp/tensorrt_llm/common/attentionOp.cpp (15)

214-217: LGTM!

The FP4 KV cache data type handling is correctly implemented following the established pattern for other quantization modes.


931-932: Placeholder for future MLA support is appropriate.

The comment clearly indicates that NVFP4 KV cache support for MLA is not yet implemented, and the empty placeholder maintains API consistency.


1285-1299: Correct element size calculation for different quantization modes.

The refactored code properly handles bit widths for FP4 (4 bits), Int8/FP8 (8 bits), and full precision, with accurate conversion to bytes in the final calculation.


1308-1316: Scale buffer initialization for FP4 KV cache is properly implemented.

The code correctly initializes the scale buffer only when FP4 KV cache is enabled, using appropriate pool pointers and size calculation (1/8th of main buffer size for scale factors).
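The 1/8 ratio follows from the element sizes: FP4 data occupies half a byte per element, while one 1-byte E4M3 scale covers a 16-element vector. A quick illustrative check (function names are ours, not the codebase's):

```python
def fp4_data_bytes(n_elems):
    return n_elems // 2            # two 4-bit values per byte

def fp4_scale_bytes(n_elems, vector_size=16):
    return n_elems // vector_size  # one 1-byte E4M3 scale per vector

n = 1 << 20
# Scale buffer is 1/8th of the main FP4 data buffer.
assert fp4_scale_bytes(n) * 8 == fp4_data_bytes(n)
```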


1324-1324: Important validation for FP4 KV cache constraints.

The check correctly enforces that FP4 KV cache only supports paged KV mode, which is likely required for efficient scale factor management at the block level.


1522-1535: Cache type determination correctly includes NVFP4.

The logic properly sets the cache type based on the quantization mode, with NVFP4 correctly added for FP4 KV cache.


1583-1599: Preprocessing parameters properly extended for FP4 scale factors.

The code correctly adds the scale buffer and documents the two-level scaling approach, with kv_scale_orig_quant as fallback when separate K/V scales are not provided.


1749-1749: Scale cache correctly passed to FMHA runner.

The pagedKvSfCache parameter is properly set to enable FMHA kernels to access FP4 scale factors.


2127-2134: Scale buffer initialization in generation phase is consistent.

The code correctly initializes the FP4 scale buffer following the same pattern as the context phase.


2142-2142: Consistent FP4 validation in generation phase.

The validation correctly enforces the paged KV requirement for FP4 cache in the generation phase.


2209-2209: Scale cache properly passed to XQA dispatcher.

The XQA dispatcher correctly receives the scale cache buffer for FP4 quantized attention computation.


2235-2238: Clear error messaging for unsupported FP4 KV cache.

The error check provides helpful feedback when FP4 KV cache is enabled but no supporting kernels are available.


2504-2507: Appropriate validation of FP4 KV cache requirements.

The check correctly enforces that FP4 KV cache requires FP8 context FMHA, ensuring the necessary infrastructure is available.


2584-2587: FMHA data type correctly configured for FP4 KV cache.

The code properly sets the KV data type to E2M1 when FP4 KV cache is enabled, ensuring FMHA kernels use the correct format.


2789-2793: XQA parameters correctly configured for FP4 KV cache.

The code appropriately sets KV data type to E2M1 (FP4) for storage and math data type to E4M3 (FP8) for computation, balancing storage efficiency with computational accuracy.

cpp/tensorrt_llm/kernels/trtllmGenKernels/fmha/cubin/FmhaSm100Kernel_QE4m3KvE2m1OE4m3H128PagedKvDenseP64VarSeqQ128Kv128PersistentContext_cubin.cpp (1)

2-3: LGTM: Kernel binary updated for FP4 KV cache support.

The updated Git LFS pointer indicates successful recompilation of the FMHA kernel binary with FP4 support. The significant size reduction (from 1,635,271 to 984,837 bytes) suggests optimizations were applied during recompilation.

cpp/tensorrt_llm/kernels/trtllmGenKernels/fmha/cubin/FmhaSm100Kernel_QE4m3KvE2m1OE4m3H64PagedKvDenseP64VarSeqQ8Kv128PersistentSwapsAbForGen_cubin.cpp (1)

2-3: LGTM: Kernel binary optimized for FP4 KV cache.

The updated LFS pointer shows successful kernel recompilation with size optimization. This aligns with the broader FP4 KV cache implementation across the kernel stack.

cpp/tensorrt_llm/kernels/trtllmGenKernels/fmha/cubin/FmhaSm100Kernel_QE4m3KvE2m1OE4m3H64PagedKvDenseP32VarSeqQ128Kv128PersistentContext_cubin.cpp (1)

2-3: LGTM: Consistent kernel recompilation for FP4 support.

The binary update maintains consistency with other kernel recompilations in this PR, supporting the new FP4 KV cache quantization features.

cpp/tensorrt_llm/kernels/trtllmGenKernels/fmha/cubin/FmhaSm100Kernel_QE4m3KvE2m1OE4m3H64PagedKvDenseP32MultiCtasKvVarSeqQ8Kv128StaticSwapsAbForGen_cubin.cpp (1)

2-3: LGTM: Final kernel binary completed for FP4 implementation.

This completes the set of kernel binary updates with another significant size reduction, indicating successful optimization during recompilation for FP4 KV cache support.

cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h (2)

499-499: LGTM: Correct blockSize calculation for FP4 support.

Removing the division by quantSize is correct for FP4 data types where element count per container is managed explicitly rather than adjusting block size through division. This change aligns with the updated FP4 handling in the corresponding implementation file.


1254-1254: LGTM: Essential interface addition for FP4 block scale management.

The new getBlockScalePoolPointers() pure virtual method is necessary for FP4 KV cache quantization support, providing access to block scale pool pointers required for dequantization. This interface addition is properly designed and integrates well with the existing architecture.

@Tom-Zheng
Collaborator Author

/bot run

@tensorrt-cicd
Collaborator

PR_Github #12513 [ run ] triggered by Bot

@Tom-Zheng Tom-Zheng requested a review from kaiyux July 22, 2025 04:00
@Tom-Zheng
Collaborator Author

@kaiyux Hi, could you please add appropriate reviewers, as this PR involves changes to the KV cache manager.

@tensorrt-cicd
Collaborator

PR_Github #12513 [ run ] completed with state FAILURE
/LLM/main/L0_MergeRequest_PR pipeline #9300 completed with status: 'FAILURE'

@juney-nvidia juney-nvidia requested a review from lowsfer July 29, 2025 10:51
@Tom-Zheng Tom-Zheng force-pushed the support-nvfp4-kv-cache-2 branch from aacf6c6 to a477bff Compare August 7, 2025 03:14
@Tom-Zheng Tom-Zheng requested review from a team as code owners August 7, 2025 03:14
@Tom-Zheng Tom-Zheng changed the title [feat] Support NVFP4 KV Cache [None][feat] Support NVFP4 KV Cache Aug 7, 2025
@Tom-Zheng
Collaborator Author

/bot run

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

🧹 Nitpick comments (3)
cpp/tensorrt_llm/common/attentionOp.cpp (3)

931-932: Placeholder for future MLA support.

The comment clearly indicates that NVFP4 KV cache support for MLA is not yet implemented. Consider tracking this as a TODO item for future implementation.

Would you like me to create an issue to track the implementation of NVFP4 KV cache support for MLA?


1283-1299: Logic is correct, minor comment clarification needed.

The element size calculation correctly handles different quantization modes. Consider clarifying the comment on line 1299:

-    auto sizePerToken = mNumKVHeads * headSize * elemBits / 8 /*bits*/;
+    auto sizePerToken = mNumKVHeads * headSize * elemBits / 8 /*bits to bytes*/;

2093-2110: Consistent element size calculation for generation phase.

The logic matches the context phase implementation. Consider the same comment clarification:

-    auto const sizePerToken = mNumKVHeads * headSize * elemBits / 8 /*bits*/;
+    auto const sizePerToken = mNumKVHeads * headSize * elemBits / 8 /*bits to bytes*/;
📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between aacf6c6 and a477bff.

📒 Files selected for processing (5)
  • cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h (3 hunks)
  • cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp (3 hunks)
  • cpp/tensorrt_llm/common/attentionOp.cpp (18 hunks)
  • cpp/tensorrt_llm/common/attentionOp.h (1 hunks)
  • cpp/tensorrt_llm/kernels/contextFusedMultiHeadAttention/fused_multihead_attention_common.h (1 hunks)
✅ Files skipped from review due to trivial changes (2)
  • cpp/tensorrt_llm/common/attentionOp.h
  • cpp/tensorrt_llm/kernels/contextFusedMultiHeadAttention/fused_multihead_attention_common.h
🚧 Files skipped from review as they are similar to previous changes (2)
  • cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp
  • cpp/include/tensorrt_llm/batch_manager/kvCacheManager.h
🧰 Additional context used
📓 Path-based instructions (2)
**/*.{cpp,h,hpp,cc,cxx}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

**/*.{cpp,h,hpp,cc,cxx}: Closing braces of namespaces should have a comment saying the namespace it closes (e.g., } // namespace foo).
Prefer const or constexpr variables over #defines whenever possible.
A variable that is not modified after its initialization should be declared as const.
Except 0 (used for checking signness/existence/emptiness), nullptr, true, false, all other literals should only be used for variable initialization.
Use the Allman indentation style for braces in C++ code.
Put the semicolon for an empty for or while loop in a new line.
The statement forming the body of a switch, while, do..while, or for statement shall be a compound statement (use brace-delimited statements).
If and else should always be followed by brace-delimited statements, even if empty or a single statement.
C++ filenames should use camel case with the first letter lowercase (e.g., thisIsAFilename.cpp), and all files involved in a compilation target must have case-insensitive unique filenames.
All types (including class names) should use camel case with uppercase first letter (e.g., FooBarClass).
Local variables, methods, and namespaces should use camel case with first letter lowercase (e.g., localFooBar).
Non-magic-number global variables that are non-static and not defined in anonymous namespace should use camel case prefixed by 'g' (e.g., gDontUseGlobalFoos).
Non-magic-number global variables that are static or defined in an anonymous namespace should use camel case prefixed by 's' (e.g., sMutableStaticGlobal).
Locally visible static variables should use camel case with lowercase prefix 's' as the first letter (e.g., static std::once_flag sFlag;).
Class member variables should use camel case prefixed with 'm' (e.g., mNbFooValues). Public member variables do not require the 'm' prefix but it is encouraged for clarity.
Enumerations, global constants, static constants at class-scope, and function-scope magic-number/literal constants should be uppercase snake case with prefix...

Files:

  • cpp/tensorrt_llm/common/attentionOp.cpp
**/*.{cpp,h,hpp,cc,cxx,cu,py}

📄 CodeRabbit Inference Engine (CODING_GUIDELINES.md)

All TensorRT-LLM Open Source Software code should contain an NVIDIA copyright header that includes the current year. This includes .cpp, .h, .cu, .py, and any other source files which are compiled or interpreted.

Files:

  • cpp/tensorrt_llm/common/attentionOp.cpp
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Pre-commit Check
🔇 Additional comments (12)
cpp/tensorrt_llm/common/attentionOp.cpp (12)

215-218: LGTM!

The FP4 KV cache support is correctly added with the appropriate data type DATA_TYPE_E2M1.


1309-1316: Correct implementation of FP4 scale buffer allocation.

The scale buffer is appropriately sized at 1/8th of the main KV cache buffer and uses dedicated pool pointers for memory management.


1324-1325: Good validation for FP4 KV cache constraints.

The runtime check correctly enforces that FP4 KV cache only supports paged KV cache mode.


1464-1465: Proper configuration for FP4 scale separation.

The decoder parameters correctly configure separate QKV scales when FP4 KV cache is enabled.


1523-1535: Correct cache type enumeration for FP4.

The cache type is properly set to KvCacheDataType::NVFP4 when FP4 KV cache is enabled.


1574-1576: Preprocessing parameters correctly extended for FP4.

The QKV preprocessing parameters properly include the scale buffer for FP4 KV cache support.

Also applies to: 1584-1584, 1597-1597


1747-1747: FMHA parameters properly configured for FP4 scale cache.

The paged KV scale cache buffer is correctly passed to the FMHA runner.


2121-2128: Consistent scale buffer allocation in generation phase.

The FP4 scale buffer allocation matches the context phase implementation appropriately.


2136-2136: Proper runtime validation and XQA integration for FP4.

The code correctly:

  1. Enforces paged KV requirement for FP4
  2. Passes scale buffer to XQA dispatcher
  3. Provides clear error when FP4 kernels are unavailable

Also applies to: 2208-2208, 2225-2228


2495-2496: Important dependency check for FP4 KV cache.

The requirement that FP8 context FMHA must be enabled for FP4 KV cache is properly enforced.


2574-2577: FMHA correctly configured for FP4 KV data type.

The KV data type is properly set to DATA_TYPE_E2M1 for FP4 KV cache in the FMHA configuration.


2779-2783: XQA dispatcher properly configured for FP4.

The configuration correctly uses DATA_TYPE_E2M1 for storage and DATA_TYPE_E4M3 for math operations, balancing storage efficiency with computational precision.

@tensorrt-cicd
Collaborator

PR_Github #14368 [ run ] triggered by Bot

@Tom-Zheng
Collaborator Author

/bot run

@tensorrt-cicd
Collaborator

PR_Github #14370 [ run ] triggered by Bot

@tensorrt-cicd
Collaborator

PR_Github #14368 [ run ] completed with state ABORTED

@tensorrt-cicd
Collaborator

PR_Github #14370 [ run ] completed with state FAILURE
/LLM/main/L0_MergeRequest_PR pipeline #10861 completed with status: 'FAILURE'

@Tom-Zheng
Collaborator Author

/bot run

@tensorrt-cicd
Collaborator

PR_Github #14374 [ run ] triggered by Bot

@tensorrt-cicd
Collaborator

PR_Github #14374 [ run ] completed with state FAILURE
/LLM/main/L0_MergeRequest_PR pipeline #10864 completed with status: 'FAILURE'

@PerkzZheng
Collaborator

@PerkzZheng @yuxianq pls follow up with Tian to make sure there is an aligned scope impact of this PR.

@Tom-Zheng can you describe what is the expected behavior of nvfp4 kv cache in the context phase (when using chunked prefill/kv cache reuse) ? are they compatible ? thanks.

@Tom-Zheng
Collaborator Author

Tom-Zheng commented Aug 22, 2025

@PerkzZheng @yuxianq pls follow up with Tian to make sure there is an aligned scope impact of this PR.

@Tom-Zheng can you describe what is the expected behavior of nvfp4 kv cache in the context phase (when using chunked prefill/kv cache reuse) ? are they compatible ? thanks.

The behavior regarding chunked prefill/kv cache reuse is the same as FP8 KV, both kernel side and KV cache manager side. For KV cache manager, there is another pool (block scale pool) to store the scaling factors. It has the same structure with the data pool. It allocates/copies together with the data pool.

@PerkzZheng
Collaborator

@PerkzZheng @yuxianq pls follow up with Tian to make sure there is an aligned scope impact of this PR.

@Tom-Zheng can you describe what is the expected behavior of nvfp4 kv cache in the context phase (when using chunked prefill/kv cache reuse) ? are they compatible ? thanks.

The behavior regarding chunked prefill/kv cache reuse is the same as FP8 KV, both kernel side and KV cache manager side. For KV cache manager, there is another pool (block scale pool) to store the scaling factors. It has the same structure with the data pool. It allocates/copies together with the data pool.

okay, thanks for confirming, so the nvfp4 is fully supported in both context and generation phase.

Collaborator

@HuiGao-NV HuiGao-NV left a comment


LGTM

Collaborator

@Superjomn Superjomn left a comment


LGTM on the llmapi changes

Signed-off-by: Tian Zheng <[email protected]>
@aur61

aur61 commented Aug 28, 2025

Hi @Tom-Zheng, for NVFP4 KV, does it mean that the attention (Q@K, A@V) is calculated with FP4 TMA?

@Tom-Zheng
Collaborator Author

Hi @Tom-Zheng, for NVFP4 KV, does it mean that the attention (Q@K, A@V) is calculated with FP4 TMA?

Math is in FP8 for now.

@mengniwang95

Hi @Tom-Zheng, for NVFP4 KV, does it mean that the attention (Q@K, A@V) is calculated with FP4 TMA?

Math is in FP8 for now.

Hi @Tom-Zheng , does trt-llm support FP4 Gemm in fmha?

Signed-off-by: Tian Zheng <[email protected]>
@Tom-Zheng
Collaborator Author

Hi @Tom-Zheng, for NVFP4 KV, does it mean that the attention (Q@K, A@V) is calculated with FP4 TMA?

Math is in FP8 for now.

Hi @Tom-Zheng , does trt-llm support FP4 Gemm in fmha?

Not yet.

@Tom-Zheng
Collaborator Author

/bot run

@tensorrt-cicd
Collaborator

PR_Github #16936 [ run ] triggered by Bot

@juney-nvidia
Collaborator

juney-nvidia commented Aug 29, 2025

@Tom-Zheng Hi Tian, is there any plan to prepare a doc for this NVFP4 KV Cache feature?

Also, in the doc I would expect a clear introduction to how this feature interacts with other features, such as:

  • Different attention kernel implementations: TRTLLM-Gen/XQA/fmha_v2/MMHA.
  • Disaggregated serving.
  • BF16 attention/FP8 attention (it looks like it should work based on your previous replies).
  • Maybe others?

This requirement is stricter than before, since we are trying to make sure new features contributed to TensorRT-LLM interact coherently with existing features, to reduce future tech debt. You can also reach out to @QiJune and @nv-guomingz so they can provide the needed support.

Another question: is there any expected perf speedup with this PR for certain scenarios?
Thanks
June

@Tom-Zheng
Collaborator Author

Tom-Zheng commented Aug 29, 2025

@Tom-Zheng Hi Tian, is there any plan to prepare a doc for this NVFP4 KV Cache feature?

Also, in the doc I would expect a clear introduction to how this feature interacts with other features, such as:

  • Different attention kernel implementations: TRTLLM-Gen/XQA/fmha_v2/MMHA.
  • Disaggregated serving.
  • BF16 attention/FP8 attention (it looks like it should work based on your previous replies).
  • Maybe others?

This requirement is stricter than before, since we are trying to make sure new features contributed to TensorRT-LLM interact coherently with existing features, to reduce future tech debt. You can also reach out to @QiJune and @nv-guomingz so they can provide the needed support.

Sure, I'll work with the team to add the documentation. The plan is to get this PR merged, then create a follow-up PR for the documentation. Does that work?

is there any expected perf speedup with this PR for certain scenarios?

The decode phase for long sequences (>4K) should see some speedup.

Signed-off-by: Tian Zheng <[email protected]>
Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

♻️ Duplicate comments (1)
tensorrt_llm/_torch/pyexecutor/resource_manager.py (1)

766-773: Same ‘self’ undefined bug in remaining-layers path.

Mirror the fix here to resolve Ruff F821.

Apply:

-                if dtype == DataType.NVFP4:
-                    cache_size_bytes_per_token += self.calculate_scaling_factor_size_bytes(
+                if dtype == DataType.NVFP4:
+                    cache_size_bytes_per_token += KVCacheManager.calculate_scaling_factor_size_bytes(
                         cache_size_per_token,
                         quant_vector_size=16,
                         scaling_factor_dtype=DataType.FP8)
🧹 Nitpick comments (1)
tensorrt_llm/_torch/pyexecutor/resource_manager.py (1)

1-1: Add NVIDIA copyright header.

Per coding guidelines, prepend the current-year NVIDIA copyright header to this Python file.

Apply:

+# Copyright (c) 2025, NVIDIA CORPORATION.  All rights reserved.
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between 21928e4 and 17906ad.

📒 Files selected for processing (1)
  • tensorrt_llm/_torch/pyexecutor/resource_manager.py (6 hunks)
🧰 Additional context used
📓 Path-based instructions (3)
**/*.{cpp,cc,cxx,cu,py,h,hpp,hh,hxx,cuh}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

Use spaces only; indent 4 spaces

Files:

  • tensorrt_llm/_torch/pyexecutor/resource_manager.py
**/*.py

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

**/*.py: Python code must target Python 3.8+
Indent with 4 spaces; do not use tabs
Preserve module namespaces in imports; prefer from package.subpackage import module then module.Symbol
Python file names use snake_case (e.g., some_file.py)
Class names use PascalCase
Function and method names use snake_case
Local variable names use snake_case; if starting with a number, prefix k (e.g., k_99th_percentile)
Global variables use G_ prefix and UPPER_SNAKE_CASE (e.g., G_MY_GLOBAL)
Constants use UPPER_SNAKE_CASE
Avoid shadowing variables from an outer scope
Initialize all externally visible members of a class in init
For interfaces used outside a file, prefer docstrings; reserve comments for internal code or local interfaces
Use Google-style docstrings for classes and functions; document attributes/variables inline as shown
Avoid reflection when simple, explicit code suffices (e.g., prefer def make_complex(x,y) over locals()/dict tricks)
Catch the narrowest exceptions possible in try/except
For duck-typing try/except, keep try body minimal and use else for main logic

Files:

  • tensorrt_llm/_torch/pyexecutor/resource_manager.py
**/*.{cpp,cc,cxx,cu,h,hpp,hh,hxx,cuh,py}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

Prepend NVIDIA copyright header (current year) to all source files (.cpp, .h, .cu, .py, etc.)

Files:

  • tensorrt_llm/_torch/pyexecutor/resource_manager.py
🧠 Learnings (2)
📚 Learning: 2025-08-14T21:04:50.248Z
Learnt from: thorjohnsen
PR: NVIDIA/TensorRT-LLM#6910
File: cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:0-0
Timestamp: 2025-08-14T21:04:50.248Z
Learning: In KV cache onboarding logic during prefill in cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp, when calculating which blocks fall within the attention window, use getTokensPerBlock() to advance token indices rather than block->getUniqueTokens().size(), because the calculation needs to consider the post-prefill state where blocks will be filled to capacity, not their current token count.

Applied to files:

  • tensorrt_llm/_torch/pyexecutor/resource_manager.py
📚 Learning: 2025-08-26T06:07:02.166Z
Learnt from: shaharmor98
PR: NVIDIA/TensorRT-LLM#7231
File: tensorrt_llm/_torch/pyexecutor/_util.py:504-509
Timestamp: 2025-08-26T06:07:02.166Z
Learning: In tensorrt_llm/_torch/pyexecutor/_util.py, when calling model_engine.set_lora_model_config(), pass model_binding_config.mlp_hidden_size directly without multiplying by mapping.tp_size, as the mlp_hidden_size from get_bindings_model_config() is already the per-TP rank value needed for LoRA weight packaging.

Applied to files:

  • tensorrt_llm/_torch/pyexecutor/resource_manager.py
🧬 Code graph analysis (1)
tensorrt_llm/_torch/pyexecutor/resource_manager.py (1)
tensorrt_llm/_utils.py (4)
  • get_size_in_bytes (210-213)
  • nvtx_range (850-869)
  • dtype (942-943)
  • dtype (950-960)
🪛 Ruff (0.12.2)
tensorrt_llm/_torch/pyexecutor/resource_manager.py

736-736: Undefined name self

(F821)


769-769: Undefined name self

(F821)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Pre-commit Check
🔇 Additional comments (4)
tensorrt_llm/_torch/pyexecutor/resource_manager.py (4)

17-17: Import updates look good.

Switch to get_size_in_bytes/nvtx_range is appropriate.


536-543: Confirm scale dtype choice for NVFP4.

You assume FP8 scales. If kernel/manager uses a different scale dtype (e.g., uint8/log or per-block metadata), size math must match it.

Would you confirm the scale dtype via the C++ API (get_block_scale_pool_pointers element type) and update scaling_factor_dtype if needed?


532-543: calculate_scaling_factor_size_bytes isn’t static; adjust or keep instance call
NVFP4→4 mapping already present in _binding_dtype_bits. Either remove the suggestion to call via KVCacheManager.calculate_scaling_factor_size_bytes (keep self.calculate_scaling_factor_size_bytes), or refactor the helper to a true @staticmethod (drop the self parameter) before invoking it on the class.

Likely an incorrect or invalid review comment.


350-357: Ignore refactor: stacking block‐scale pointers into kv_cache_pool_pointers is intentional and downstream kernels handle both shapes (see attentionOp.cpp lines 271 and 292).

Likely an incorrect or invalid review comment.

Signed-off-by: Tian Zheng <[email protected]>
Signed-off-by: Tian Zheng <[email protected]>
@Tom-Zheng
Collaborator Author

/bot run

@tensorrt-cicd
Collaborator

PR_Github #16954 [ run ] triggered by Bot

@tensorrt-cicd
Collaborator

PR_Github #16936 [ run ] completed with state ABORTED

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 3

♻️ Duplicate comments (1)
tensorrt_llm/_torch/pyexecutor/resource_manager.py (1)

511-518: Fix NVFP4 scale-bytes math: use ceil grouping, not divisibility + floor.

cache_size may not be a multiple of quant_vector_size. The current assert is too strict and undercounts when not divisible.

Apply:

-    @staticmethod
-    def calculate_scaling_factor_size_bytes(
-            cache_size: int, quant_vector_size: int,
-            scaling_factor_dtype: DataType) -> int:
-        assert cache_size % quant_vector_size == 0, "NVFP4 cache size must be divisible by quant vector size"
-        return get_size_in_bytes(cache_size // quant_vector_size,
-                                 scaling_factor_dtype)
+    @staticmethod
+    def calculate_scaling_factor_size_bytes(
+            cache_size: int, quant_vector_size: int,
+            scaling_factor_dtype: DataType) -> int:
+        """Return extra bytes for per-block scales: ceil(cache_size / quant_vector_size) scales."""
+        assert cache_size > 0 and quant_vector_size > 0, "cache_size and quant_vector_size must be positive"
+        groups = (cache_size + quant_vector_size - 1) // quant_vector_size
+        return get_size_in_bytes(groups, scaling_factor_dtype)
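A quick standalone check of why the ceil-grouping fix above matters: when `cache_size` is not a multiple of `quant_vector_size`, floor division drops the scale needed for the trailing partial vector, while ceil division counts it.

```python
# Demo of the review suggestion above: floor vs ceil grouping of quant vectors.
# Pure illustration; mirrors the arithmetic in the suggested diff.


def groups_floor(cache_size: int, quant_vector_size: int) -> int:
    return cache_size // quant_vector_size


def groups_ceil(cache_size: int, quant_vector_size: int) -> int:
    return (cache_size + quant_vector_size - 1) // quant_vector_size


# 100 elements in 16-element vectors: the last 4 elements still need a scale.
print(groups_floor(100, 16), groups_ceil(100, 16))  # 6 7
```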
🧹 Nitpick comments (3)
tensorrt_llm/_torch/pyexecutor/resource_manager.py (3)

1-1: Add NVIDIA copyright header.

Per repo guidelines, prepend the 2025 NVIDIA copyright header to all source files.


350-357: Stacking pool pointers changes tensor rank; verify downstream consumers.

self.kv_cache_pool_pointers can be 1-D or 2-D depending on NVFP4 scale availability. Confirm all call sites handle both shapes, or keep a stable shape and expose scales separately.

Option A (stable shape): always stack and pad with zeros when scales absent.
Option B (BC-safe): keep existing self.kv_cache_pool_pointers and store scales in a new attribute self.kv_cache_block_scale_pool_pointers.

I can generate a patch once you confirm preferred contract.


537-544: Deduplicate NVFP4 constants.

Replace magic numbers with module-level constants to keep FP4 packing coherent across sites.

+NVFP4_QUANT_VECTOR_SIZE = 16
+NVFP4_SCALE_DTYPE = DataType.FP8
...
-            cache_size_bytes_per_token += KVCacheManager.calculate_scaling_factor_size_bytes(
-                cache_size_per_token, quant_vector_size=16, scaling_factor_dtype=DataType.FP8)
+            cache_size_bytes_per_token += KVCacheManager.calculate_scaling_factor_size_bytes(
+                cache_size_per_token, quant_vector_size=NVFP4_QUANT_VECTOR_SIZE, scaling_factor_dtype=NVFP4_SCALE_DTYPE)

Also applies to: 734-741, 767-774

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between 4b8eb90 and 8e5d33a.

📒 Files selected for processing (1)
  • tensorrt_llm/_torch/pyexecutor/resource_manager.py (6 hunks)
🧰 Additional context used
📓 Path-based instructions (3)
**/*.{cpp,cc,cxx,cu,py,h,hpp,hh,hxx,cuh}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

Use spaces only; indent 4 spaces

Files:

  • tensorrt_llm/_torch/pyexecutor/resource_manager.py
**/*.py

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

**/*.py: Python code must target Python 3.8+
Indent with 4 spaces; do not use tabs
Preserve module namespaces in imports; prefer from package.subpackage import module then module.Symbol
Python file names use snake_case (e.g., some_file.py)
Class names use PascalCase
Function and method names use snake_case
Local variable names use snake_case; if starting with a number, prefix k (e.g., k_99th_percentile)
Global variables use G_ prefix and UPPER_SNAKE_CASE (e.g., G_MY_GLOBAL)
Constants use UPPER_SNAKE_CASE
Avoid shadowing variables from an outer scope
Initialize all externally visible members of a class in init
For interfaces used outside a file, prefer docstrings; reserve comments for internal code or local interfaces
Use Google-style docstrings for classes and functions; document attributes/variables inline as shown
Avoid reflection when simple, explicit code suffices (e.g., prefer def make_complex(x,y) over locals()/dict tricks)
Catch the narrowest exceptions possible in try/except
For duck-typing try/except, keep try body minimal and use else for main logic

Files:

  • tensorrt_llm/_torch/pyexecutor/resource_manager.py
**/*.{cpp,cc,cxx,cu,h,hpp,hh,hxx,cuh,py}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

Prepend NVIDIA copyright header (current year) to all source files (.cpp, .h, .cu, .py, etc.)

Files:

  • tensorrt_llm/_torch/pyexecutor/resource_manager.py
🧠 Learnings (3)
📓 Common learnings
Learnt from: thorjohnsen
PR: NVIDIA/TensorRT-LLM#6910
File: cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:0-0
Timestamp: 2025-08-14T21:04:50.248Z
Learning: In KV cache onboarding logic during prefill in cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp, when calculating which blocks fall within the attention window, use getTokensPerBlock() to advance token indices rather than block->getUniqueTokens().size(), because the calculation needs to consider the post-prefill state where blocks will be filled to capacity, not their current token count.
📚 Learning: 2025-08-14T21:04:50.248Z
Learnt from: thorjohnsen
PR: NVIDIA/TensorRT-LLM#6910
File: cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:0-0
Timestamp: 2025-08-14T21:04:50.248Z
Learning: In KV cache onboarding logic during prefill in cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp, when calculating which blocks fall within the attention window, use getTokensPerBlock() to advance token indices rather than block->getUniqueTokens().size(), because the calculation needs to consider the post-prefill state where blocks will be filled to capacity, not their current token count.

Applied to files:

  • tensorrt_llm/_torch/pyexecutor/resource_manager.py
📚 Learning: 2025-08-26T06:07:02.166Z
Learnt from: shaharmor98
PR: NVIDIA/TensorRT-LLM#7231
File: tensorrt_llm/_torch/pyexecutor/_util.py:504-509
Timestamp: 2025-08-26T06:07:02.166Z
Learning: In tensorrt_llm/_torch/pyexecutor/_util.py, when calling model_engine.set_lora_model_config(), pass model_binding_config.mlp_hidden_size directly without multiplying by mapping.tp_size, as the mlp_hidden_size from get_bindings_model_config() is already the per-TP rank value needed for LoRA weight packaging.

Applied to files:

  • tensorrt_llm/_torch/pyexecutor/resource_manager.py
🧬 Code graph analysis (1)
tensorrt_llm/_torch/pyexecutor/resource_manager.py (1)
tensorrt_llm/_utils.py (5)
  • binding_to_str_dtype (200-203)
  • get_size_in_bytes (210-213)
  • nvtx_range (850-869)
  • dtype (942-943)
  • dtype (950-960)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Pre-commit Check
🔇 Additional comments (3)
tensorrt_llm/_torch/pyexecutor/resource_manager.py (3)

17-17: Import change LGTM.

Switch to get_size_in_bytes centralizes dtype math and removes ad-hoc byte calculations.


533-535: Enable NVFP4 in supported dtypes — LGTM.


545-574: Float math for max_tokens is fine; ensure MPI reduction stays numeric.

max_tokens is float; MPI.MIN over float is intended. Downstream ceil() is applied before block counts.

Do we need to clamp max_tokens to >= 0 after allreducing in heterogeneous nodes? If yes, add max(0, ...).
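To illustrate the clamp question raised above without pulling in MPI (the values and `tokens_per_block` below are hypothetical stand-ins): a negative per-rank estimate would survive a MIN reduction across heterogeneous nodes, so clamping to zero before the downstream `ceil()` and block-count math is a cheap safeguard.

```python
# Hedged sketch: MIN-reduction of per-rank max_tokens estimates, then clamp.
# `rank_estimates` stands in for the allreduced values; not TRT-LLM code.
import math

rank_estimates = [5120.4, 4096.0, -32.0]    # one misconfigured rank goes negative
max_tokens = max(0.0, min(rank_estimates))  # MPI.MIN analogue, then clamp to >= 0
num_blocks = math.ceil(max_tokens / 64)     # ceil applied downstream (tokens_per_block=64)
print(max_tokens, num_blocks)  # 0.0 0
```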

@Tom-Zheng
Collaborator Author

@juney-nvidia I had a sync with @nv-guomingz. We decided to add the document after #6696 is merged.
