
Conversation

da-x
Contributor

@da-x da-x commented Sep 10, 2025

This change allows the host and disk pools to have larger blocks than the device block pool, with a fixed block size ratio between these two pool classes. During offload and onboarding of blocks, we need to unite small device blocks into large offload blocks, or to split large offloaded blocks back into small device blocks.

The advantage conferred by this change is more efficient management of offloaded blocks; specifically, larger I/Os can always be used when transferring back and forth from disk.

To do:

  • Fix onboarding from disk to device
  • Fix support for TRTLLM
  • Add unit tests
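
The unite/split arithmetic described above can be sketched as follows. This is an illustrative helper only, assuming the fixed ratio between the two pool classes; `group_device_blocks` is not part of this change, and the real pools operate on device memory rather than plain IDs.

```rust
/// Group small device block IDs into chunks that each back one large
/// offload block, given offload_page_size = engine_page_size * ratio.
fn group_device_blocks(device_block_ids: &[usize], ratio: usize) -> Vec<Vec<usize>> {
    assert!(ratio > 0, "ratio must be positive");
    assert_eq!(
        device_block_ids.len() % ratio,
        0,
        "device blocks must fill whole offload blocks"
    );
    device_block_ids
        .chunks(ratio)
        .map(|chunk| chunk.to_vec())
        .collect()
}

fn main() {
    // With the default ratio of 32, 64 device blocks unite into 2 offload blocks;
    // splitting an offload block back is the inverse: one chunk -> `ratio` device blocks.
    let device_ids: Vec<usize> = (0..64).collect();
    let groups = group_device_blocks(&device_ids, 32);
    assert_eq!(groups.len(), 2);
    assert_eq!(groups[0].len(), 32);
}
```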

Summary by CodeRabbit

  • New Features

    • Separate engine and offload block/page sizes with a configurable ratio (default 32) for improved cache and I/O efficiency.
    • Fragmentation-aware block data and transfers, enabling per-fragment views and correct copy semantics across device/host/disk.
    • Enhanced connector logic aligning token/block arithmetic with engine/offload sizes; added metrics wiring.
  • Refactor

    • Updated Python bindings and leaders/workers to require engine_page_size and offload_page_size instead of a single page_size.
    • Worker configuration and layout inference now distinguish engine vs. offload sizing.
  • Chores

    • Added smallvec dependency.


Signed-off-by: Dan Aloni <[email protected]>
@da-x da-x requested review from a team as code owners September 10, 2025 14:20

copy-pr-bot bot commented Sep 10, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.


👋 Hi da-x! Thank you for contributing to ai-dynamo/dynamo.

Just a reminder: The NVIDIA Test Github Validation CI runs an essential subset of the testing framework to quickly catch errors. Your PR reviewers may elect to test the changes comprehensively before approving your changes.

🚀

@github-actions github-actions bot added the external-contribution Pull request is from an external contributor label Sep 10, 2025
Contributor

coderabbitai bot commented Sep 10, 2025

Walkthrough

Introduces separate engine and offload block/page sizes across KV block manager, vLLM/TRT-LLM connectors, and Python bindings. Adds fragmentation support using SmallVec, refactors transfer logic to operate on fragment-aware data, and updates builders/configs and accessors accordingly. Updates Python integration to pass both sizes. Adds smallvec dependency.

Changes

Cohort / File(s) Summary
Dependencies
lib/llm/Cargo.toml, lib/bindings/python/Cargo.toml
Add dependency: smallvec = "1.15.1" with serde feature.
Core KV Block Manager config & API
lib/llm/src/block_manager.rs, lib/llm/src/block_manager/config.rs, lib/llm/src/block_manager/state/resources.rs, lib/llm/src/block_manager/distributed/worker.rs
Replace single block/page size with engine/offload sizes and a ratio; add accessors; update config (offload_block_size_ratio); adjust resource layout sizing; worker config fields renamed/added and used in layout/memory sizing.
Fragmentation & transfer refactor
lib/llm/src/block_manager/block/data.rs, lib/llm/src/block_manager/block/data/local.rs, lib/llm/src/block_manager/block/data/logical.rs, lib/llm/src/block_manager/block/transfer/cuda.rs, lib/llm/src/block_manager/distributed/utils.rs, lib/llm/src/block_manager/distributed/transfer.rs
Introduce SmallVec-backed fragmentation (multiple fragment IDs), add fragment-aware views/accessors, update CUDA copy to handle fragmented src/dst, change transfer request blocks to SmallVec pairs, refactor transfer handler to LocalBlockDataBase-based access.
Python Rust bindings: BlockManager
lib/bindings/python/rust/llm/block_manager.rs, lib/bindings/python/rust/llm/block_manager/distributed/worker.rs
Split block/page size surface into engine/offload; builder gains engine_page_size/offload_page_size; worker builder calls updated accordingly.
vLLM integration: Slot/Connector
lib/bindings/python/rust/llm/block_manager/vllm.rs, lib/bindings/python/rust/llm/block_manager/vllm/connector/leader.rs, lib/bindings/python/rust/llm/block_manager/vllm/connector/leader/recorder.rs, lib/bindings/python/rust/llm/block_manager/vllm/connector/leader/slot.rs, lib/bindings/python/rust/llm/block_manager/vllm/connector/worker.rs
SlotManager and connectors migrate from single size to engine/offload sizes; add ratio handling; update alignment/arithmetic and policy logic; offload/onboard now chunk by engine-per-offload grouping; integrate KvbmMetrics where applicable.
TRT-LLM connector updates
lib/bindings/python/rust/llm/block_manager/vllm/connector/trtllm_leader.rs, lib/bindings/python/rust/llm/block_manager/vllm/connector/trtllm_worker.rs
Replace page_size with engine_page_size and offload_page_size in leader/worker; update PyO3 signatures and builder chains.
Python integration surface
lib/bindings/python/src/dynamo/llm/vllm_integration/connector_leader.py
Passes new extra integer (offload size) argument to RustKvConnectorLeader constructor.

Sequence Diagram(s)

sequenceDiagram
  autonumber
  participant Py as Python (vLLM/TRT-LLM)
  participant PyBind as PyO3 Binding
  participant Bld as BlockManagerBuilder
  participant KBM as KvBlockManager
  participant Slot as SlotManager

  Note over Py,PyBind: Initialization with dual sizes
  Py->>PyBind: new(engine_page_size, offload_page_size, leader)
  PyBind->>Bld: .engine_page_size(eps).offload_page_size(ops)
  Bld->>KBM: build() with eps, ops, ratio
  KBM-->>PyBind: KvBlockManager{engine, offload, ratio}
  PyBind->>Slot: new(engine_block_size, offload_block_size)
  Slot-->>Py: ready

  rect rgba(200,230,255,0.3)
  Note right of Slot: Token/block math uses engine size on device<br/>and offload size on host/disk
  end
sequenceDiagram
  autonumber
  participant Src as Source BlockData
  participant Dst as Dest BlockData
  participant Xfer as CUDA copy_layers

  Note over Src,Dst: Fragment-aware transfer
  Xfer->>Src: is_fragmented?
  alt Src fragmented
    loop for each src fragment
      Xfer->>Dst: copy fragment -> contiguous dst region (advance dst)
    end
  else Dst fragmented
    loop for each dst fragment
      Xfer->>Dst: copy contiguous src -> dst fragment (advance src)
    end
  else
    Xfer->>Dst: single memcpy
  end
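
The three-way dispatch in the diagram above can be modeled as a small copy planner. The `Side` enum and byte-offset plan below are hypothetical stand-ins for the real fragment-aware `BlockData`/CUDA copy path, which works on device memory rather than offsets.

```rust
enum Side {
    Contiguous { len: usize },
    Fragmented { fragment_lens: Vec<usize> },
}

/// Plan (src_offset, dst_offset, len) copy regions for one block transfer,
/// mirroring the diagram: fragmented src advances dst, fragmented dst
/// advances src, and the fully contiguous case is a single copy.
fn plan_copies(src: &Side, dst: &Side) -> Vec<(usize, usize, usize)> {
    match (src, dst) {
        (Side::Fragmented { fragment_lens }, Side::Contiguous { .. }) => {
            // Each src fragment starts at its own base (offset 0 here);
            // the contiguous dst region advances fragment by fragment.
            let mut dst_off = 0;
            fragment_lens
                .iter()
                .map(|&len| {
                    let plan = (0, dst_off, len);
                    dst_off += len;
                    plan
                })
                .collect()
        }
        (Side::Contiguous { .. }, Side::Fragmented { fragment_lens }) => {
            // Contiguous src advances; each dst fragment starts at its base.
            let mut src_off = 0;
            fragment_lens
                .iter()
                .map(|&len| {
                    let plan = (src_off, 0, len);
                    src_off += len;
                    plan
                })
                .collect()
        }
        (Side::Contiguous { len }, Side::Contiguous { .. }) => vec![(0, 0, *len)],
        _ => panic!("fragmented-to-fragmented is not modeled in this sketch"),
    }
}
```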

Estimated code review effort

🎯 5 (Critical) | ⏱️ ~120 minutes

Possibly related PRs

Pre-merge checks (2 passed, 1 warning)

❌ Failed checks (1 warning)

  • Description Check: ⚠️ Warning. The current description does not follow the repository’s required template: it lacks the “#### Overview,” “#### Details,” “#### Where should the reviewer start?” and “#### Related Issues” sections and instead provides free-form paragraphs and a TODO list. This omission makes it unclear where reviewers should focus and which issue the PR addresses. Resolution: update the PR description to use the repository’s template by adding section headings for Overview, Details, Where should the reviewer start, and Related Issues (with appropriate issue references), and then populate each section with the corresponding information about what was changed, why, which files to review, and any linked issues.

✅ Passed checks (2 passed)

  • Title Check: ✅ Passed. The title “Draft: kvbm - bigger page size for offload block pools only” refers directly to a real aspect of the changeset—namely increasing offload block sizes—but includes a “Draft:” prefix and uses an internal acronym, which adds noise without clarifying the primary change. It nevertheless identifies a key part of the PR’s intent by highlighting offload block pool sizing.
  • Docstring Coverage: ✅ Passed. Docstring coverage is 84.34%, which is sufficient. The required threshold is 80.00%.

Poem

A rabbit taps bytes with tiny paws,
Splitting blocks by tidy laws;
Engine nibble, offload chew,
SmallVec pockets, fragments new.
Hops through slots, transfers bright—
Two sizes dance in perfect byte.
Thump-thump: it all aligns just right.

Tip

👮 Agentic pre-merge checks are now available in preview!

Pro plan users can now enable pre-merge checks in their settings to enforce checklists before merging PRs.

  • Built-in checks – Quickly apply ready-made checks to enforce title conventions, require pull request descriptions that follow templates, validate linked issues for compliance, and more.
  • Custom agentic checks – Define your own rules using CodeRabbit’s advanced agentic capabilities to enforce organization-specific policies and workflows. For example, you can instruct CodeRabbit’s agent to verify that API documentation is updated whenever API schema files are modified in a PR. Note: Up to 5 custom checks are currently allowed during the preview period. Pricing for this feature will be announced in a few weeks.

Please see the documentation for more information.

Example:

reviews:
  pre_merge_checks:
    custom_checks:
      - name: "Undocumented Breaking Changes"
        mode: "warning"
        instructions: |
          Pass/fail criteria: All breaking changes to public APIs, CLI flags, environment variables, configuration keys, database schemas, or HTTP/GraphQL endpoints must be documented in the "Breaking Change" section of the PR description and in CHANGELOG.md. Exclude purely internal or private changes (e.g., code not exported from package entry points or explicitly marked as internal).

Please share your feedback with us on this Discord post.


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.


Comment @coderabbitai help to get the list of available commands and usage tips.

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 22

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (6)
lib/bindings/python/rust/llm/block_manager/vllm/connector/worker.rs (1)

166-176: Compute/validate sizes and avoid overflow on multiplication.

Use checked_mul and enforce offload_page_size % engine_page_size == 0 before building the config.

Apply this diff:

@@
-        let config = KvbmWorkerConfig::builder()
+        // Derive and validate sizes
+        let engine_page_size = page_size;
+        let offload_page_size = engine_page_size
+            .checked_mul(self.offload_block_size_ratio)
+            .ok_or_else(|| anyhow::anyhow!(
+                "offload_page_size overflow: engine_page_size={} ratio={}",
+                engine_page_size, self.offload_block_size_ratio
+            ))?;
+        debug_assert!(
+            offload_page_size % engine_page_size == 0,
+            "offload_page_size must be a multiple of engine_page_size"
+        );
+
+        let config = KvbmWorkerConfig::builder()
             .drt(self.drt.clone())
             .num_device_blocks(num_device_blocks)
-            .offload_page_size(page_size * self.offload_block_size_ratio)
-            .engine_page_size(page_size)
+            .offload_page_size(offload_page_size)
+            .engine_page_size(engine_page_size)
             .tensors(vllm_tensors)
             .device_id(device_id)
             .dtype_width_bytes(dtype_width_bytes)
             .barrier_id_prefix(get_barrier_id_prefix())
             .scheduler_client(Some(self.transfer_client.clone()))
             .build()?;

Also applies to: 169-171

lib/bindings/python/rust/llm/block_manager/vllm/connector/leader/recorder.rs (1)

87-96: Validate engine/offload sizes and log ratio at construction.

Guard against zero/invalid sizes and non-integer ratios to prevent downstream inconsistencies.

Apply this diff:

@@
     pub fn new(
         worker_id: String,
         drt: PyDistributedRuntime,
-        engine_page_size: usize,
-        offload_page_size: usize,
+        engine_page_size: usize,
+        offload_page_size: usize,
         leader_py: PyKvbmLeader,
     ) -> Self {
@@
-        let kvbm_metrics = KvbmMetrics::new(&ns);
+        // Validate sizes upfront
+        assert!(engine_page_size > 0 && offload_page_size > 0, "page sizes must be > 0");
+        assert!(
+            offload_page_size % engine_page_size == 0,
+            "offload_page_size ({}) must be a multiple of engine_page_size ({})",
+            offload_page_size, engine_page_size
+        );
+        let ratio = offload_page_size / engine_page_size;
+        tracing::info!("engine_page_size={} offload_page_size={} ratio={}", engine_page_size, offload_page_size, ratio);
+
+        let kvbm_metrics = KvbmMetrics::new(&ns);
@@
                 let block_manager = match BlockManagerBuilder::new()
                     .worker_id(0)
                     .leader(leader_py)
-                    .engine_page_size(engine_page_size)
-                    .offload_page_size(offload_page_size)
+                    .engine_page_size(engine_page_size)
+                    .offload_page_size(offload_page_size)
                     .disable_device_pool(false)
                     .build()
                     .await

Also applies to: 147-149

lib/bindings/python/rust/llm/block_manager/vllm.rs (1)

584-682: Switch matching logic to offload-sized blocks

get_num_new_matched_tokens currently uses engine_block_size for hashing, offsets and token arithmetic, but host()/disk() pools operate on offload_block_size. Update to:

  • compute completed offload blocks: num_completed_offload_blocks = num_computed_tokens / self.offload_block_size
  • call slot.sequence_hashes_offload() instead of slot.sequence_hashes(...)
  • set search_offset = num_completed_offload_blocks
  • calculate num_new_matched_tokens = num_matched_blocks * self.offload_block_size
  • subtract self.offload_block_size in the boundary-pop case

Add a sequence_hashes_offload() API if missing.
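
The offload-sized arithmetic in the bullets above can be sketched as a standalone helper. The function name below is made up for illustration; the real logic lives inside get_num_new_matched_tokens.

```rust
/// Completed offload blocks and newly matched tokens, both measured in
/// offload-sized units rather than engine-sized ones.
fn matched_tokens_offload(
    num_computed_tokens: usize,
    num_matched_blocks: usize,
    offload_block_size: usize,
) -> (usize, usize) {
    // search_offset would be set to num_completed_offload_blocks.
    let num_completed_offload_blocks = num_computed_tokens / offload_block_size;
    let num_new_matched_tokens = num_matched_blocks * offload_block_size;
    (num_completed_offload_blocks, num_new_matched_tokens)
}
```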

lib/bindings/python/rust/llm/block_manager/vllm/connector/leader/slot.rs (1)

549-582: Add tests and document alignment logic in slot.rs
File: lib/bindings/python/rust/llm/block_manager/vllm/connector/leader/slot.rs (549–582)

  • No existing tests cover the aligned_candidates calculation or offload_block_size_ratio path. Add unit tests for:
    • num_candidate_blocks divisible by offload_block_size_ratio
    • num_candidate_blocks with a non-zero remainder
    • num_candidate_blocks smaller than offload_block_size_ratio
  • Insert inline comments clarifying the math, for example:
    // Calculate the largest multiple of offload_block_size_ratio
    // that does not exceed num_candidate_blocks
    let aligned_candidates = (num_candidate_blocks / self.offload_block_size_ratio)
        * self.offload_block_size_ratio;
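
A minimal extraction of that math, with the three suggested cases as unit tests (the free function `align_down` is a made-up name for illustration):

```rust
/// Largest multiple of `ratio` that does not exceed `n` -- the
/// aligned_candidates calculation quoted above, lifted out for testing.
fn align_down(n: usize, ratio: usize) -> usize {
    (n / ratio) * ratio
}

#[cfg(test)]
mod tests {
    use super::align_down;

    #[test]
    fn divisible_by_ratio() {
        assert_eq!(align_down(64, 32), 64);
    }

    #[test]
    fn nonzero_remainder() {
        assert_eq!(align_down(70, 32), 64);
    }

    #[test]
    fn smaller_than_ratio() {
        assert_eq!(align_down(10, 32), 0);
    }
}
```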
lib/bindings/python/rust/llm/block_manager/vllm/connector/leader.rs (2)

112-114: Assert ratio invariants between engine/offload page sizes.

Offload must be a multiple of engine page size; assert early to catch misconfig before scheduling.

Apply:

         let kvbm_metrics = KvbmMetrics::new(&ns);
         let kvbm_metrics_clone = kvbm_metrics.clone();
+
+        debug_assert!(
+            engine_page_size > 0 && offload_page_size > 0,
+            "page sizes must be > 0"
+        );
+        debug_assert!(
+            offload_page_size % engine_page_size == 0,
+            "offload_page_size ({}) must be a multiple of engine_page_size ({})",
+            offload_page_size,
+            engine_page_size
+        );

295-301: Guard against underflow and enforce invariants in computed-tokens math.

If inputs ever drift, this can go negative or misalign silently. Add debug asserts.

Apply:

         // the second call will show num_external_tokens == 0
         // this call is just letting us know the other blocks that are being used for the remainder of the prefill
         if num_external_tokens > 0 {
+            debug_assert!(
+                !block_ids.is_empty(),
+                "num_external_tokens > 0 requires at least one device block"
+            );
+            debug_assert!(
+                num_external_tokens <= block_ids.len() * self.engine_page_size,
+                "num_external_tokens ({}) exceeds capacity of provided device blocks ({} * {})",
+                num_external_tokens,
+                block_ids.len(),
+                self.engine_page_size
+            );
             let num_computed_tokens = block_ids.len() * self.engine_page_size - num_external_tokens;
             slot.record_cached_device_tokens(num_computed_tokens);
             slot.advance_computed_position(num_computed_tokens)?;
🧹 Nitpick comments (27)
lib/bindings/python/Cargo.toml (1)

55-55: Unify smallvec version across the workspace (or gate features).

Looks fine, but please ensure this exact version/feature set matches other crates (llm also adds smallvec). Consider moving to [workspace.dependencies] to prevent drift; if not, keep both crates pinned to the same version/features to avoid duplicate resolutions.

lib/llm/Cargo.toml (1)

159-159: Match Python binding’s smallvec and consider workspace unification.

Same note here: align with lib/bindings/python to the exact version/features or promote to [workspace.dependencies] for consistency.

lib/llm/src/block_manager/config.rs (1)

210-215: Name/docs mismatch vs usage; add guardrails.

The field is named “offload_block_size_ratio” but is used to scale page_size elsewhere. Consider renaming to offload_page_size_ratio or clarifying docs to “page size,” and add upper-bound validation (or cfg-driven limit) to prevent pathological ratios.

Apply this doc tweak locally:

-    /// Ratio between offload block size and engine block size
-    /// Offload blocks are larger to enable more efficient I/O operations
+    /// Ratio between offload page size and engine page size
+    /// Offload pages are larger to enable more efficient I/O operations
lib/llm/src/block_manager/distributed/utils.rs (2)

53-61: Use smallvec! macro for clarity and tiny perf win.

Constructing a single-element SmallVec via iterator is noisy.

-            blocks: blocks
-                .into_iter()
-                .map(|(src, dst)| {
-                    (
-                        [src].iter().map(|x| *x).collect(),
-                        [dst].iter().map(|x| *x).collect(),
-                    )
-                })
-                .collect(),
+            blocks: blocks
+                .into_iter()
+                .map(|(src, dst)| (smallvec![src], smallvec![dst]))
+                .collect(),

75-83: Ditto: simplify constructor mapping with smallvec! macro.

Same simplification here.

-            blocks: blocks
-                .into_iter()
-                .map(|(src, dst)| {
-                    (
-                        [src].iter().map(|x| *x).collect(),
-                        [dst].iter().map(|x| *x).collect(),
-                    )
-                })
-                .collect(),
+            blocks: blocks
+                .into_iter()
+                .map(|(src, dst)| (smallvec![src], smallvec![dst]))
+                .collect(),
lib/llm/src/block_manager.rs (2)

114-117: Avoid storing redundant derived state or assert invariants to prevent drift

You store both engine_block_size, offload_block_size, and the ratio. Either:

  • compute offload_block_size on demand from engine_block_size * offload_block_size_ratio, or
  • assert invariants at construction (offload_block_size >= engine_block_size, divisible, matches ratio).

Given this struct is Clone and long-lived, a future mutation could desync fields.

Apply simple invariant checks during construction (see suggested diffs on new()).


135-147: Clarify accessor docs and units

Both engine_block_size() and offload_block_size() are documented as “Get the block size.” Be explicit about engine/device vs. offload/host-disk and units (bytes).

Apply:

-    /// Get the block size
+    /// Get the engine (device) block size in bytes.
     pub fn engine_block_size(&self) -> usize {
         self.engine_block_size
     }

-    /// Get the block size
+    /// Get the offload (host/disk) block size in bytes.
     pub fn offload_block_size(&self) -> usize {
         self.offload_block_size
     }

-    /// Get the offload block size ratio
+    /// Get the offload block-size ratio (offload/engine). Always >= 1.
     pub fn offload_block_size_ratio(&self) -> usize {
         self.offload_block_size_ratio
     }
lib/bindings/python/rust/llm/block_manager/vllm/connector/trtllm_worker.rs (1)

133-135: Use distinct offload_page_size for TRT-LLM
TRT-LLM currently sets both offload_page_size and engine_page_size to the same page_size, forfeiting potential larger offload I/O. Consider plumbing a separate offload_page_size (e.g. page_size * offload_block_size_ratio) to match vLLM’s behavior.

lib/bindings/python/rust/llm/block_manager/vllm/connector/worker.rs (1)

74-76: Make offload_block_size_ratio configurable and validated against leader.

A hard-coded ratio (32) risks drift with the leader’s explicit engine/offload sizes. Expose this via config/constructor or derive it from leader metadata, and validate consistency at registration time.

Would you like me to wire a setter from Python or plumb it from the leader metadata?

Also applies to: 121-121

lib/bindings/python/rust/llm/block_manager/vllm/connector/leader/recorder.rs (1)

188-196: Surface sizes via metrics or getters to aid debugging.

Consider exporting engine/offload sizes (and ratio) in KvbmMetrics to catch configuration drift at runtime.

Also applies to: 190-192

lib/llm/src/block_manager/block/data/logical.rs (3)

59-59: Avoid allocation; initialize SmallVec directly.

Use SmallVec::from for the single-element case.

Apply this diff:

-            block_ids: [block_id].iter().map(|x| *x).collect(),
+            block_ids: SmallVec::from([block_id]),

75-82: Fix panic message and add backtrace for fragmented block misuse.

Current message references LocalBlockData; add backtrace for parity.

Apply this diff:

     fn block_id(&self) -> BlockId {
-        if self.block_ids.len() == 1 {
-            self.block_ids[0]
-        } else {
-            panic!("used LocalBlockData::block_id() for fragmented block");
-        }
+        if self.block_ids.len() == 1 {
+            self.block_ids[0]
+        } else {
+            tracing::error!("Backtrace: {}", std::backtrace::Backtrace::force_capture());
+            panic!("used LogicalBlockData::block_id() for fragmented block");
+        }
     }

83-91: Bounds-check fragment index and fix panic wording.

Prevent generic index panics and standardize error text.

Apply this diff:

     #[inline(always)]
     fn fragment_block_id(&self, idx: usize) -> BlockId {
-        if self.block_ids.len() != 1 {
-            self.block_ids[idx]
-        } else {
-            tracing::error!("Backtrace: {}", std::backtrace::Backtrace::force_capture());
-            panic!("used LocalBlockData::fragment_block_id() for non-fragmented block");
-        }
+        if self.block_ids.len() != 1 {
+            self.block_ids
+                .get(idx)
+                .copied()
+                .unwrap_or_else(|| {
+                    tracing::error!("Backtrace: {}", std::backtrace::Backtrace::force_capture());
+                    panic!(
+                        "LogicalBlockData::fragment_block_id() idx {} out of range (len={})",
+                        idx,
+                        self.block_ids.len()
+                    )
+                })
+        } else {
+            tracing::error!("Backtrace: {}", std::backtrace::Backtrace::force_capture());
+            panic!("used LogicalBlockData::fragment_block_id() for non-fragmented block");
+        }
     }
lib/bindings/python/rust/llm/block_manager.rs (2)

241-243: Defaults make sense but Default::default() remains footgun

new() sets good defaults, but deriving Default will still produce zeros if someone uses BlockManagerBuilder::default() directly.

Provide an explicit Default to mirror new():

impl Default for BlockManagerBuilder {
    fn default() -> Self {
        Self {
            worker_id: 0,
            leader: None,
            engine_page_size: 32,
            offload_page_size: 1024,
            disable_device_pool: false,
        }
    }
}

251-257: Setters are fine; add basic guards

Tiny guardrails help fail-fast on obvious misconfigurations.

 pub fn engine_page_size(mut self, ps: usize) -> Self {
-        self.engine_page_size = ps;
+        debug_assert!(ps.is_power_of_two(), "engine_page_size should be a power of two");
+        self.engine_page_size = ps;
         self
 }
 pub fn offload_page_size(mut self, ps: usize) -> Self {
-        self.offload_page_size = ps;
+        debug_assert!(ps.is_power_of_two(), "offload_page_size should be a power of two");
+        self.offload_page_size = ps;
         self
 }
lib/llm/src/block_manager/block/data.rs (3)

16-18: Doc nit: clarify fragment_block_id semantics

“The index of the block in the block set” duplicates block_id’s doc. Suggest “Returns the BlockId for the given fragment index of a fragmented block.”


69-81: Guard against invalid fragment access to avoid panics downstream

Add a lightweight debug assertion to catch invalid indices early before local implementations panic.

     fn layer_view_fragment(
         &self,
         fragment_idx: usize,
         layer_idx: usize,
         outer_idx: usize,
     ) -> BlockResult<view::LayerView<'_, S>> {
+        debug_assert!(
+            self.is_fragmented() && fragment_idx < self.num_fragments(),
+            "invalid fragment_idx {} (num_fragments: {})",
+            fragment_idx,
+            self.num_fragments()
+        );
         match self.is_local() {
             Some(views) => views.local_layer_view_fragment(fragment_idx, layer_idx, outer_idx),
             None => Err(BlockError::ViewsNotAvailableOnLogicalBlocks),
         }
     }

94-106: Same guard for mutable fragment views

Mirror the check for the mutable variant.

     fn layer_view_fragment_mut(
         &mut self,
         fragment_idx: usize,
         layer_idx: usize,
         outer_idx: usize,
     ) -> BlockResult<view::LayerViewMut<'_, S>> {
+        debug_assert!(
+            self.is_fragmented() && fragment_idx < self.num_fragments(),
+            "invalid fragment_idx {} (num_fragments: {})",
+            fragment_idx,
+            self.num_fragments()
+        );
         match self.is_local_mut() {
             Some(views) => views.local_layer_view_fragment_mut(fragment_idx, layer_idx, outer_idx),
             None => Err(BlockError::ViewsNotAvailableOnLogicalBlocks),
         }
     }
lib/bindings/python/rust/llm/block_manager/vllm/connector/trtllm_leader.rs (2)

55-57: Remove the unused offload_page_size field or implement its functionality.

The offload_page_size field is marked with #[allow(unused)] but is being passed through the constructor and stored. If this field will be used in a future implementation, consider adding a TODO comment explaining its planned usage. Otherwise, remove it to avoid confusion.

If keeping for future use:

 pub struct KvConnectorLeader {
     slot_manager: Arc<OnceLock<ConnectorSlotManager<String>>>,
     engine_page_size: usize,
+    // TODO: Implement offload page size functionality for TRTLLM connector
     #[allow(unused)]
     offload_page_size: usize,

176-178: Consider extracting the page size alignment validation to a helper method.

The modulo checks for page alignment appear multiple times in the code. Consider extracting this to a helper method for better maintainability and consistency.

+impl KvConnectorLeader {
+    #[inline]
+    fn is_page_aligned(&self, tokens: usize) -> bool {
+        tokens % self.engine_page_size == 0
+    }
+}

 fn get_num_new_matched_tokens(
     &mut self,
     request_id: String,
     request_num_tokens: usize,
     num_computed_tokens: usize,
 ) -> anyhow::Result<(usize, bool)> {
     // ...
-    if num_computed_tokens % self.engine_page_size != 0 {
+    if !self.is_page_aligned(num_computed_tokens) {
         return Ok((0, false));
     }
     // ...
-    debug_assert!((num_computed_tokens + num_external_tokens) % self.engine_page_size == 0);
+    debug_assert!(self.is_page_aligned(num_computed_tokens + num_external_tokens));

Also applies to: 186-187, 201-202

lib/bindings/python/rust/llm/block_manager/vllm/connector/leader/slot.rs (1)

1439-1445: Simplify block pairs construction.

The block pairs construction using iterator chains could be simplified for better readability.

-let block_pairs: Vec<_> = src_block_ids
-    .iter()
-    .zip(
-        onboard_req
-            .dst_block_ids
-            .chunks(engine_blocks_per_offload_block),
-    )
-    .map(|(src, dst)| (SmallVec::from([*src]), SmallVec::from(dst)))
-    .collect();
+let dst_chunks = onboard_req.dst_block_ids.chunks(engine_blocks_per_offload_block);
+let block_pairs: Vec<_> = src_block_ids
+    .iter()
+    .zip(dst_chunks)
+    .map(|(src, dst)| (SmallVec::from([*src]), SmallVec::from(dst)))
+    .collect();
lib/llm/src/block_manager/block/data/local.rs (4)

11-13: Fix typos in fragmentation comment.

Minor clarity nits: “block IDx” → “block IDs”; capitalize “Offloaded”; trim redundant clause.

Apply:

-    // The special case of multiple block IDx will only happen for small KV cache blocks on device.
-    // offloaded blocks will be large and `block_idxs.len() == 1` will hold for them.
+    // The special case of multiple block IDs will only happen for small KV cache blocks on device.
+    // Offloaded blocks will be large and `block_idxs.len() == 1`.

69-69: Prefer smallvec! macro for single-element init.

More idiomatic and avoids an iterator+collect.

Apply:

-            block_idxs: [block_idx].iter().map(|x| *x).collect(),
+            block_idxs: smallvec![block_idx],

And import the macro once:

 use smallvec::SmallVec;
+use smallvec::smallvec;

75-81: Avoid unnecessary clone of WorkerID.

WorkerID appears Copy (assigned by value elsewhere). Drop .clone() to reduce noise.

Apply:

     pub(crate) fn base(&self) -> LocalBlockDataBase<S> {
         LocalBlockDataBase {
             layout: self.layout.clone(),
             block_set_idx: self.block_set_idx,
-            worker_id: self.worker_id.clone(),
+            worker_id: self.worker_id,
         }
     }

11-14: Optional: consider renaming block_idxsblock_ids.

Field stores BlockId values, not indices. A rename would improve readability and align with method names (block_id, fragment_block_id). Non-functional, but helps future contributors.

Also applies to: 21-26, 45-54, 56-82, 88-106, 127-134, 160-211, 213-239

lib/bindings/python/rust/llm/block_manager/vllm/connector/leader.rs (2)

239-243: Tail handling: early-exit may starve small remainders.

Exiting when remaining < offload_page_size means requests with short tails won’t match until more tokens arrive or may never match for short prompts. If intended, fine; otherwise consider checking “tokens to next offload boundary” instead of full page.

Would you like a patch that computes:

  • let need = (self.offload_page_size - (num_computed_tokens % self.offload_page_size)) % self.offload_page_size;
  • return early only if remaining < need?
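For reference, the boundary arithmetic from the bullets above as a self-contained sketch (function name hypothetical):

```rust
/// Tokens needed to reach the next offload-page boundary.
/// Returns 0 when `num_computed_tokens` is already aligned.
fn tokens_to_next_offload_boundary(num_computed_tokens: usize, offload_page_size: usize) -> usize {
    (offload_page_size - (num_computed_tokens % offload_page_size)) % offload_page_size
}

fn main() {
    // With a 512-token offload page:
    assert_eq!(tokens_to_next_offload_boundary(0, 512), 0);   // already aligned
    assert_eq!(tokens_to_next_offload_boundary(500, 512), 12);
    assert_eq!(tokens_to_next_offload_boundary(513, 512), 511);
    println!("ok");
}
```

With this, the early exit fires only when the tail cannot even complete the current offload page, instead of requiring a full page of remaining tokens.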

99-104: Add page-size details to init log for easier debugging/telemetry.

Apply:

-        tracing::info!(
-            "KvConnectorLeader initialized with worker_id: {}",
-            worker_id
-        );
+        tracing::info!(
+            "KvConnectorLeader initialized with worker_id: {}, engine_page_size: {}, offload_page_size: {}",
+            worker_id, engine_page_size, offload_page_size
+        );
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 905c920 and 3d13117.

⛔ Files ignored due to path filters (2)
  • Cargo.lock is excluded by !**/*.lock
  • lib/bindings/python/Cargo.lock is excluded by !**/*.lock
📒 Files selected for processing (22)
  • lib/bindings/python/Cargo.toml (1 hunks)
  • lib/bindings/python/rust/llm/block_manager.rs (4 hunks)
  • lib/bindings/python/rust/llm/block_manager/distributed/worker.rs (1 hunks)
  • lib/bindings/python/rust/llm/block_manager/vllm.rs (9 hunks)
  • lib/bindings/python/rust/llm/block_manager/vllm/connector/leader.rs (10 hunks)
  • lib/bindings/python/rust/llm/block_manager/vllm/connector/leader/recorder.rs (3 hunks)
  • lib/bindings/python/rust/llm/block_manager/vllm/connector/leader/slot.rs (18 hunks)
  • lib/bindings/python/rust/llm/block_manager/vllm/connector/trtllm_leader.rs (8 hunks)
  • lib/bindings/python/rust/llm/block_manager/vllm/connector/trtllm_worker.rs (1 hunks)
  • lib/bindings/python/rust/llm/block_manager/vllm/connector/worker.rs (3 hunks)
  • lib/bindings/python/src/dynamo/llm/vllm_integration/connector_leader.py (1 hunks)
  • lib/llm/Cargo.toml (1 hunks)
  • lib/llm/src/block_manager.rs (5 hunks)
  • lib/llm/src/block_manager/block/data.rs (5 hunks)
  • lib/llm/src/block_manager/block/data/local.rs (8 hunks)
  • lib/llm/src/block_manager/block/data/logical.rs (5 hunks)
  • lib/llm/src/block_manager/block/transfer/cuda.rs (1 hunks)
  • lib/llm/src/block_manager/config.rs (1 hunks)
  • lib/llm/src/block_manager/distributed/transfer.rs (4 hunks)
  • lib/llm/src/block_manager/distributed/utils.rs (4 hunks)
  • lib/llm/src/block_manager/distributed/worker.rs (7 hunks)
  • lib/llm/src/block_manager/state/resources.rs (1 hunks)
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-05-29T06:20:12.901Z
Learnt from: ryanolson
PR: ai-dynamo/dynamo#1093
File: lib/llm/src/block_manager/block/registry.rs:98-122
Timestamp: 2025-05-29T06:20:12.901Z
Learning: In lib/llm/src/block_manager/block/registry.rs, the background task spawned for handling unregister notifications uses detached concurrency by design. The JoinHandle is intentionally not stored as this represents a reasonable architectural tradeoff for a long-running cleanup task.

Applied to files:

  • lib/bindings/python/rust/llm/block_manager/vllm/connector/leader/slot.rs
🧬 Code graph analysis (15)
lib/bindings/python/rust/llm/block_manager/vllm/connector/worker.rs (4)
lib/llm/src/block_manager/block/data.rs (1)
  • page_size (44-44)
lib/llm/src/block_manager/block/data/local.rs (1)
  • page_size (147-149)
lib/llm/src/block_manager/block/data/logical.rs (1)
  • page_size (122-124)
lib/llm/src/block_manager/layout.rs (1)
  • page_size (250-252)
lib/bindings/python/src/dynamo/llm/vllm_integration/connector_leader.py (1)
lib/bindings/python/rust/llm/block_manager.rs (1)
  • leader (259-262)
lib/llm/src/block_manager/distributed/worker.rs (2)
lib/bindings/python/rust/llm/block_manager.rs (2)
  • engine_page_size (251-254)
  • offload_page_size (255-258)
lib/llm/src/block_manager/state/resources.rs (1)
  • layout_builder (84-97)
lib/bindings/python/rust/llm/block_manager/distributed/worker.rs (4)
lib/llm/src/block_manager/block/data.rs (1)
  • page_size (44-44)
lib/llm/src/block_manager/block/data/local.rs (1)
  • page_size (147-149)
lib/llm/src/block_manager/block/data/logical.rs (1)
  • page_size (122-124)
lib/llm/src/block_manager/layout.rs (1)
  • page_size (250-252)
lib/bindings/python/rust/llm/block_manager/vllm.rs (2)
lib/bindings/python/rust/llm/block_manager.rs (4)
  • new (70-174)
  • new (239-245)
  • engine_block_size (176-178)
  • offload_block_size (180-182)
lib/bindings/python/rust/llm/block_manager/vllm/connector/leader/slot.rs (7)
  • new (190-228)
  • new (343-379)
  • new (1098-1111)
  • new (1122-1135)
  • new (1145-1155)
  • new (1489-1494)
  • new (1498-1503)
lib/bindings/python/rust/llm/block_manager/vllm/connector/trtllm_worker.rs (4)
lib/llm/src/block_manager/block/data.rs (1)
  • page_size (44-44)
lib/llm/src/block_manager/block/data/local.rs (1)
  • page_size (147-149)
lib/llm/src/block_manager/block/data/logical.rs (1)
  • page_size (122-124)
lib/llm/src/block_manager/layout.rs (1)
  • page_size (250-252)
lib/bindings/python/rust/llm/block_manager/vllm/connector/leader.rs (4)
lib/bindings/python/rust/llm/block_manager.rs (6)
  • engine_page_size (251-254)
  • offload_page_size (255-258)
  • worker_id (247-250)
  • leader (259-262)
  • new (70-174)
  • new (239-245)
lib/llm/src/block_manager.rs (3)
  • worker_id (165-167)
  • new (196-212)
  • new (245-264)
lib/bindings/python/rust/llm/block_manager/vllm/connector/leader/recorder.rs (1)
  • new (88-203)
lib/bindings/python/src/dynamo/llm/vllm_integration/connector_leader.py (1)
  • KvConnectorLeader (47-204)
lib/bindings/python/rust/llm/block_manager/vllm/connector/leader/recorder.rs (1)
lib/bindings/python/rust/llm/block_manager.rs (2)
  • engine_page_size (251-254)
  • offload_page_size (255-258)
lib/bindings/python/rust/llm/block_manager/vllm/connector/leader/slot.rs (3)
lib/bindings/python/rust/llm/block_manager/vllm.rs (4)
  • block_manager (76-78)
  • new (85-94)
  • new (333-357)
  • new (401-407)
lib/bindings/python/rust/llm/block_manager.rs (4)
  • engine_block_size (176-178)
  • offload_block_size (180-182)
  • new (70-174)
  • new (239-245)
lib/llm/src/block_manager.rs (5)
  • engine_block_size (135-137)
  • offload_block_size (140-142)
  • offload_block_size_ratio (145-147)
  • new (196-212)
  • new (245-264)
lib/llm/src/block_manager/block/data.rs (2)
lib/llm/src/block_manager/block/data/local.rs (6)
  • fragment_block_id (99-106)
  • is_fragmented (127-129)
  • num_fragments (131-133)
  • local_layer_view_fragment (173-186)
  • local_layer_view_mut (188-197)
  • local_layer_view_fragment_mut (199-211)
lib/llm/src/block_manager/block/data/logical.rs (3)
  • fragment_block_id (84-91)
  • is_fragmented (109-111)
  • num_fragments (113-115)
lib/bindings/python/rust/llm/block_manager/vllm/connector/trtllm_leader.rs (5)
lib/bindings/python/rust/llm/block_manager.rs (6)
  • engine_page_size (251-254)
  • offload_page_size (255-258)
  • worker_id (247-250)
  • leader (259-262)
  • new (70-174)
  • new (239-245)
lib/bindings/python/rust/llm/block_manager/vllm.rs (3)
  • new (85-94)
  • new (333-357)
  • new (401-407)
lib/bindings/python/rust/llm/block_manager/vllm/connector/leader.rs (2)
  • new (92-183)
  • new (558-587)
lib/bindings/python/rust/llm/block_manager/vllm/connector/leader/recorder.rs (1)
  • new (88-203)
lib/bindings/python/rust/llm/block_manager/vllm/connector/leader/slot.rs (6)
  • new (190-228)
  • new (343-379)
  • new (1098-1111)
  • new (1122-1135)
  • new (1145-1155)
  • new (1489-1494)
lib/llm/src/block_manager/distributed/utils.rs (1)
lib/llm/src/kv/layer.rs (2)
  • src (653-657)
  • dst (659-663)
lib/llm/src/block_manager/block/data/local.rs (2)
lib/llm/src/block_manager/block/data.rs (8)
  • worker_id (26-26)
  • fragment_block_id (17-17)
  • is_fragmented (35-35)
  • num_fragments (38-38)
  • storage_type (29-29)
  • local_layer_view_fragment (133-138)
  • local_layer_view_fragment_mut (148-153)
  • local_block_view (156-156)
lib/llm/src/block_manager/block/data/logical.rs (5)
  • worker_id (97-99)
  • fragment_block_id (84-91)
  • is_fragmented (109-111)
  • num_fragments (113-115)
  • storage_type (101-103)
lib/llm/src/block_manager/block/data/logical.rs (3)
lib/llm/src/block_manager/block/data.rs (4)
  • block_id (14-14)
  • fragment_block_id (17-17)
  • is_fragmented (35-35)
  • num_fragments (38-38)
lib/llm/src/block_manager/block/data/local.rs (4)
  • block_id (89-96)
  • fragment_block_id (99-106)
  • is_fragmented (127-129)
  • num_fragments (131-133)
lib/llm/src/block_manager/block.rs (2)
  • block_id (372-374)
  • block_id (835-839)
lib/bindings/python/rust/llm/block_manager.rs (1)
lib/llm/src/block_manager.rs (4)
  • engine_block_size (135-137)
  • offload_block_size (140-142)
  • new (196-212)
  • new (245-264)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
  • GitHub Check: pre-merge-rust (lib/bindings/python)
  • GitHub Check: pre-merge-rust (lib/runtime/examples)
  • GitHub Check: pre-merge-rust (.)
  • GitHub Check: Build and Test - dynamo
🔇 Additional comments (12)
lib/llm/src/block_manager/distributed/utils.rs (1)

6-6: LGTM: smallvec import.

Import is correct and feature-gated via serde in Cargo.

lib/llm/src/block_manager/block/data/logical.rs (1)

14-14: No action needed: smallvec is declared in lib/llm/Cargo.toml (v1.15.1 with serde feature).

lib/bindings/python/rust/llm/block_manager.rs (1)

176-182: Python-facing accessors for block sizes look good

Exposes the two-size model cleanly to Python. No concerns.

lib/bindings/python/rust/llm/block_manager/vllm.rs (3)

86-89: Wiring SlotManager with engine/offload sizes is correct

Good propagation of the two-size model into SlotManager.


562-565: ISL metrics split by device/host/disk look correct

Right multipliers: device uses engine size; host/disk use offload size.


292-293: Doc update aligns with two-size model

“num_new_tokens / engine_block_size” clarifies the device-side unit. Looks good.

lib/bindings/python/rust/llm/block_manager/vllm/connector/leader/slot.rs (1)

1196-1202: Review comment resolved: cloning VllmBlockManager is cheap (it derives Clone on an internal Arc), so no change needed.

lib/llm/src/block_manager/distributed/transfer.rs (1)

107-111: No changes needed: LocalBlockDataBase::get_data simply wraps the SmallVec of indices into LocalBlockData and delegates any fragmentation logic to LocalBlockData.

lib/llm/src/block_manager/block/data/local.rs (1)

45-54: LGTM: base → data constructor.

Clean separation via LocalBlockDataBase; cloning layout and reusing identifiers is correct.

lib/bindings/python/rust/llm/block_manager/vllm/connector/leader.rs (3)

251-254: LGTM: alignment assert for onboarding tokens.

The modulo against offload_page_size is the right invariant after matching.


130-138: LGTM: BlockManagerBuilder wiring with dual page sizes.

Builder calls align with the new two-size model; good to see .disable_device_pool(false) explicit.


556-585: LGTM: PyO3 signature updated consistently.

Constructor and recorder path accept both page sizes; matches downstream builder usage.

Comment on lines +233 to 235
offload_page_size: usize,
engine_page_size: usize,
disable_device_pool: bool,

🛠️ Refactor suggestion

offload_page_size is never used; risk of silent misconfiguration

The builder stores offload_page_size but never propagates it into KvBlockManagerConfig. This makes the field effectively dead and prevents callers from controlling the offload size via the builder.

Apply this diff inside build() to derive and set the ratio from the two sizes (with validation):

@@
         let model_config = dynamo_llm::block_manager::KvManagerModelConfig::builder()
             .num_layers(1)
             .outer_dim(1)
             .page_size(self.engine_page_size)
             .inner_dim(1)
             .build()?;
 
         config = config.model(model_config);
 
+        // Derive and apply offload ratio from sizes
+        if self.engine_page_size == 0 || self.offload_page_size == 0 {
+            return Err(anyhow::anyhow!("engine/offload page sizes must be > 0"));
+        }
+        if self.offload_page_size % self.engine_page_size != 0 {
+            return Err(anyhow::anyhow!(
+                "offload_page_size ({}) must be a multiple of engine_page_size ({})",
+                self.offload_page_size,
+                self.engine_page_size
+            ));
+        }
+        let ratio = self.offload_page_size / self.engine_page_size;
+        if !ratio.is_power_of_two() {
+            return Err(anyhow::anyhow!(
+                "offload/engine page-size ratio ({}) must be a power of two",
+                ratio
+            ));
+        }
+        // Requires KvBlockManagerConfigBuilder::offload_block_size_ratio(usize)
+        config = config.offload_block_size_ratio(ratio);

Additionally, consider validating an upper bound if required by pools (e.g., offload ≤ 1024).
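The ratio derivation from the diff above, extracted as a self-contained sketch (function name hypothetical; the real builder would return these as errors from build()):

```rust
/// Derive the offload/engine block-size ratio, validating the invariants
/// the builder should enforce: non-zero sizes, divisibility, power-of-two ratio.
fn derive_offload_ratio(engine_page_size: usize, offload_page_size: usize) -> Result<usize, String> {
    if engine_page_size == 0 || offload_page_size == 0 {
        return Err("engine/offload page sizes must be > 0".into());
    }
    if offload_page_size % engine_page_size != 0 {
        return Err(format!(
            "offload_page_size ({offload_page_size}) must be a multiple of engine_page_size ({engine_page_size})"
        ));
    }
    let ratio = offload_page_size / engine_page_size;
    if !ratio.is_power_of_two() {
        return Err(format!(
            "offload/engine page-size ratio ({ratio}) must be a power of two"
        ));
    }
    Ok(ratio)
}

fn main() {
    assert_eq!(derive_offload_ratio(32, 1024), Ok(32));
    assert!(derive_offload_ratio(32, 48).is_err()); // 48 is not a multiple of 32
    assert!(derive_offload_ratio(32, 96).is_err()); // ratio 3 is not a power of two
    println!("ok");
}
```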

Committable suggestion skipped: line range outside the PR's diff.

.num_layers(1)
.outer_dim(1)
.page_size(self.page_size)
.page_size(self.engine_page_size)

💡 Verification agent

🧩 Analysis chain

Only engine page size is propagated; offload is ignored

build() writes model.page_size but never configures offload_block_size_ratio, so offload size falls back to the global default regardless of the builder’s offload_page_size.

Follow the ratio derivation diff in my earlier comment to ensure offload_page_size actually takes effect.

To confirm the API path exists, run:


🏁 Script executed:

#!/bin/bash
# Verify the builder exposes offload_block_size_ratio and find its usage.
rg -n --glob 'lib/**' 'offload_block_size_ratio' -C2
# Confirm current usage of offload_page_size and engine_page_size.
rg -n --glob 'lib/**' 'offload_page_size|engine_page_size' -C2

Length of output: 38270


Propagate offload_page_size in Python binding build()
In lib/bindings/python/rust/llm/block_manager.rs around line 292, after

    .page_size(self.engine_page_size)

add

    .offload_page_size(self.offload_page_size)

so the builder applies the configured offload size rather than falling back to the default.

🤖 Prompt for AI Agents
In lib/bindings/python/rust/llm/block_manager.rs around line 292, the builder
call stops at .page_size(self.engine_page_size) and doesn't propagate the
configured offload size; add a chained
.offload_page_size(self.offload_page_size) immediately after the page_size call
so the builder uses the self.offload_page_size value (ensure the field exists
and is in scope) instead of falling back to the default.

Comment on lines 137 to +141
let config = KvbmWorkerConfig::builder()
.drt(drt)
.num_device_blocks(num_device_blocks)
.page_size(page_size)
.offload_page_size(page_size)
.engine_page_size(page_size)

⚠️ Potential issue

Configuration uses the same value for both page sizes

The configuration currently sets both offload_page_size and engine_page_size to the same value (page_size), which appears to contradict the PR's objective of allowing the host and disk pools to use larger block sizes than the device block pool. This likely requires the Python caller to pass separate values for these parameters.

Based on the PR objectives stating that offload blocks should be larger than device blocks, the Python bindings should accept two distinct page size parameters. Consider updating the function signature to accept both engine_page_size and offload_page_size:

-    #[pyo3(signature = (num_device_blocks, page_size, tensors, device_id=0, dtype_width_bytes=2, drt=None, layout_blocking=false))]
+    #[pyo3(signature = (num_device_blocks, engine_page_size, offload_page_size, tensors, device_id=0, dtype_width_bytes=2, drt=None, layout_blocking=false))]
     fn new(
         num_device_blocks: usize,
-        page_size: usize,
+        engine_page_size: usize,
+        offload_page_size: usize,
         tensors: Vec<Py<PyAny>>,

And update the configuration accordingly:

         let config = KvbmWorkerConfig::builder()
             .drt(drt)
             .num_device_blocks(num_device_blocks)
-            .offload_page_size(page_size)
-            .engine_page_size(page_size)
+            .offload_page_size(offload_page_size)
+            .engine_page_size(engine_page_size)
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
let config = KvbmWorkerConfig::builder()
.drt(drt)
.num_device_blocks(num_device_blocks)
.page_size(page_size)
.offload_page_size(page_size)
.engine_page_size(page_size)
// Update the Python binding signature to accept two page sizes
#[pyo3(signature = (num_device_blocks, engine_page_size, offload_page_size, tensors, device_id=0, dtype_width_bytes=2, drt=None, layout_blocking=false))]
fn new(
num_device_blocks: usize,
engine_page_size: usize,
offload_page_size: usize,
tensors: Vec<Py<PyAny>>,
device_id: usize,
dtype_width_bytes: usize,
drt: Option<...>,
layout_blocking: bool,
) -> PyResult<Self> {
// ...
let config = KvbmWorkerConfig::builder()
.drt(drt)
.num_device_blocks(num_device_blocks)
.offload_page_size(offload_page_size)
.engine_page_size(engine_page_size)
// ...
}
🤖 Prompt for AI Agents
In lib/bindings/python/rust/llm/block_manager/distributed/worker.rs around lines
137 to 141, the code currently assigns the same page_size to both
offload_page_size and engine_page_size; update the worker API and configuration
to accept distinct engine_page_size and offload_page_size values so offload
(host/disk) can be larger than device blocks. Change the Rust function signature
(and matching Python binding) to take two separate page size parameters, use
engine_page_size for .engine_page_size(...) and offload_page_size for
.offload_page_size(...), and update all callers/tests to pass the appropriate
values. Ensure input validation (e.g., offload >= device) or clear documentation
for callers if needed.

Comment on lines +432 to 433
Slot::new(tokens.into(), self.engine_block_size, salt_hash),
);

🛠️ Refactor suggestion

Slot initialization should carry both sizes to support offload hashing

Passing only engine_block_size makes it hard for Slot to produce offload-sized sequence hashes needed for host/disk caches.

If Slot::new can be extended, pass both sizes:

-                Slot::new(tokens.into(), self.engine_block_size, salt_hash),
+                Slot::new(
+                    tokens.into(),
+                    self.engine_block_size,
+                    self.offload_block_size,
+                    salt_hash,
+                ),

Otherwise, provide a method on Slot to compute offload-sized sequence hashes on demand.

📝 Committable suggestion


Suggested change
Slot::new(tokens.into(), self.engine_block_size, salt_hash),
);
@@ lib/bindings/python/rust/llm/block_manager/vllm.rs
Slot::new(
tokens.into(),
self.engine_block_size,
self.offload_block_size,
salt_hash,
),
);
🤖 Prompt for AI Agents
In lib/bindings/python/rust/llm/block_manager/vllm.rs around lines 432-433, the
Slot is being created with only engine_block_size which prevents Slot from
producing offload-sized sequence hashes; either extend Slot::new to accept both
sizes and change this call to pass engine_block_size and offload_block_size
(plus salt_hash), or if modifying Slot::new is undesirable, add a method on Slot
(e.g., compute_offload_sequence_hash(offload_block_size)) and call that after
construction so Slot can produce offload-sized sequence hashes for host/disk
caches.

Comment on lines +354 to +356
debug_assert!(offload_block_size.is_power_of_two() && offload_block_size <= 1024);
debug_assert!(engine_block_size.is_power_of_two() && engine_block_size <= 1024);
let sequence = TokenBlockSequence::new(tokens, offload_block_size as u32, Some(salt_hash));

⚠️ Potential issue

Add validation for block size relationship.

The code assumes that offload_block_size is a multiple of engine_block_size, but this is only checked with debug assertions. Add a runtime validation to prevent potential issues in release builds.

 let offload_block_size = block_manager.offload_block_size();
 let engine_block_size = block_manager.engine_block_size();
 let offload_block_size_ratio = block_manager.offload_block_size_ratio();
+
+// Validate block size relationship
+if offload_block_size % engine_block_size != 0 {
+    panic!("offload_block_size ({}) must be a multiple of engine_block_size ({})", 
+           offload_block_size, engine_block_size);
+}
+if offload_block_size / engine_block_size != offload_block_size_ratio {
+    panic!("Inconsistent block size ratio: {} / {} != {}", 
+           offload_block_size, engine_block_size, offload_block_size_ratio);
+}
+
 debug_assert!(offload_block_size.is_power_of_two() && offload_block_size <= 1024);
 debug_assert!(engine_block_size.is_power_of_two() && engine_block_size <= 1024);
📝 Committable suggestion


Suggested change
debug_assert!(offload_block_size.is_power_of_two() && offload_block_size <= 1024);
debug_assert!(engine_block_size.is_power_of_two() && engine_block_size <= 1024);
let sequence = TokenBlockSequence::new(tokens, offload_block_size as u32, Some(salt_hash));
let offload_block_size = block_manager.offload_block_size();
let engine_block_size = block_manager.engine_block_size();
let offload_block_size_ratio = block_manager.offload_block_size_ratio();
// Validate block size relationship
if offload_block_size % engine_block_size != 0 {
panic!(
"offload_block_size ({}) must be a multiple of engine_block_size ({})",
offload_block_size,
engine_block_size
);
}
if offload_block_size / engine_block_size != offload_block_size_ratio {
panic!(
"Inconsistent block size ratio: {} / {} != {}",
offload_block_size,
engine_block_size,
offload_block_size_ratio
);
}
debug_assert!(offload_block_size.is_power_of_two() && offload_block_size <= 1024);
debug_assert!(engine_block_size.is_power_of_two() && engine_block_size <= 1024);
let sequence = TokenBlockSequence::new(tokens, offload_block_size as u32, Some(salt_hash));

pub from_pool: BlockTransferPool,
pub to_pool: BlockTransferPool,
pub blocks: Vec<(usize, usize)>,
pub blocks: Vec<(SmallVec<[usize; 1]>, SmallVec<[usize; 1]>)>,

💡 Verification agent

🧩 Analysis chain

Wire format breaking change for blocks; add serde compat or version the message.

Changing blocks to Vec<(SmallVec<[usize;1]>, SmallVec<[usize;1]>)> alters the serialized shape from [[src,dst], …] to [[[...srcs],[...dsts]], …]. Any older producer/consumer (Rust/Python/TRTLLM) deserializing the old shape will break.

Two options:

  • Back-compat serde adapters on this field to accept both shapes (old pair and new fragments).
  • Bump protocol/schema version for ZMQ_TRANSFER_BLOCKS_MESSAGE and migrate all call sites atomically.

I can provide a serde adapter so we accept both without changing external callers. Would you like me to draft it?


🏁 Script executed:

#!/bin/bash
# Locate (de)serializers and external users of transfer_blocks to assess impact.
rg -n 'transfer_blocks|BlockTransferRequest|blocks' -C3

Length of output: 307985


Wire-format breaking change: make BlockTransferRequest.blocks backward-compatible or version the message

Changing blocks to Vec<(SmallVec<[usize;1]>, SmallVec<[usize;1]>)> changes the JSON shape from [[src,dst], …] to [[[srcs],[dsts]], …]; incoming ZMQ/serde_json deserialization will break existing producers/consumers.

Locations:

  • Field definition: lib/llm/src/block_manager/distributed/utils.rs:34–41 (blocks at line 37; ZMQ_TRANSFER_BLOCKS_MESSAGE const at line 11).
  • Deserializer (incoming message): lib/llm/src/block_manager/distributed/transfer.rs:157 (serde_json::from_slice::).
  • Sender/usage sites: lib/llm/src/block_manager/distributed/leader.rs:397–402 (transfer_blocks_request) and callers in lib/llm/src/block_manager/distributed.rs (e.g. around lines 178–206).

Recommended actions (choose one):

  • Add a serde-compatible adapter on the blocks field to accept both legacy pairs [src,dst] and the new pair-of-arrays form (use #[serde(deserialize_with = "...")] with a Visitor that accepts both shapes). This keeps wire-compatibility.
  • Bump the protocol/schema for ZMQ_TRANSFER_BLOCKS_MESSAGE and roll an atomic migration across all producers/consumers.

I will implement the serde adapter if you want it drafted.

🤖 Prompt for AI Agents
In lib/llm/src/block_manager/distributed/utils.rs around line 37, the blocks
field type change altered the JSON wire shape and will break existing
producers/consumers; add a serde-compatible adapter to deserialize both legacy
pairs [src,dst] and the new pair-of-arrays form [[[srcs],[dsts]]]. Implement a
deserialize_with function (in the same module or a nearby utils module) that
provides a Visitor which accepts either (a) a sequence of two integers (map to
SmallVec<[usize;1]> for each side) or (b) a sequence of two sequences of
integers (map each inner sequence to SmallVec<[usize;1]>), and use
#[serde(deserialize_with = "your_adapter")] on the blocks field so incoming
ZMQ/serde_json::from_slice::<BlockTransferRequest> accepts both formats while
preserving the existing type; alternatively, if you prefer schema bumping,
increment the protocol version for ZMQ_TRANSFER_BLOCKS_MESSAGE and coordinate
migrations across sender/receiver sites instead of adding the adapter.
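A std-only sketch of the dual-shape normalization (all names hypothetical; the real adapter would be a serde Visitor on the blocks field, with Node standing in for parsed JSON values):

```rust
#[derive(Debug, PartialEq)]
enum Node {
    Int(usize),
    List(Vec<Node>),
}

/// Accept a `blocks` entry in either wire shape:
/// legacy `[src, dst]` or fragmented `[[srcs...], [dsts...]]`.
fn normalize_pair(pair: &[Node]) -> Result<(Vec<usize>, Vec<usize>), String> {
    let ints = |ns: &[Node]| -> Result<Vec<usize>, String> {
        ns.iter()
            .map(|n| match n {
                Node::Int(v) => Ok(*v),
                Node::List(_) => Err("expected integer block id".to_string()),
            })
            .collect()
    };
    match pair {
        // Legacy shape: a pair of integers becomes single-element fragment lists.
        [Node::Int(src), Node::Int(dst)] => Ok((vec![*src], vec![*dst])),
        // New shape: a pair of integer arrays.
        [Node::List(srcs), Node::List(dsts)] => Ok((ints(srcs)?, ints(dsts)?)),
        _ => Err("malformed blocks entry".to_string()),
    }
}

fn main() {
    let legacy = [Node::Int(3), Node::Int(7)];
    assert_eq!(normalize_pair(&legacy), Ok((vec![3], vec![7])));

    let fragmented = [
        Node::List(vec![Node::Int(3), Node::Int(4)]),
        Node::List(vec![Node::Int(7)]),
    ];
    assert_eq!(normalize_pair(&fragmented), Ok((vec![3, 4], vec![7])));
    println!("ok");
}
```

Mapped onto serde, the same two match arms become the Visitor's two accepted sequence shapes, so old producers keep working unmodified.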

Comment on lines 93 to +97
#[builder(default = "32")]
page_size: usize,
engine_page_size: usize,

#[builder(default = "32")]
offload_page_size: usize,

💡 Verification agent

🧩 Analysis chain

Consider validating the offload_block_size_ratio relationship

The configuration now has separate engine_page_size and offload_page_size fields, but based on the PR objectives and the code snippet from lib/llm/src/block_manager/state/resources.rs, there's an offload_block_size_ratio that should maintain a fixed ratio between these sizes.


🏁 Script executed:

#!/bin/bash
# Check how offload_block_size_ratio is defined and used
ast-grep --pattern 'offload_block_size_ratio'

# Look for any validation of the relationship between engine and offload sizes
rg -n "engine_page_size|offload_page_size" --type rust -C3 | grep -E "(assert|validate|check|verify)"

Length of output: 1790


Validate offload_page_size maintains the configured ratio
Add a check in the WorkerConfig builder (lib/llm/src/block_manager/distributed/worker.rs, lines 93–97) or in build() to assert that

offload_page_size == engine_page_size * offload_block_size_ratio

Currently no validation exists to enforce this relationship (see usages in lib/llm/src/block_manager.rs), so misconfigured values can produce incorrect offload block sizes.

🤖 Prompt for AI Agents
In lib/llm/src/block_manager/distributed/worker.rs around lines 93–97, the
builder sets engine_page_size and offload_page_size but lacks validation that
offload_page_size == engine_page_size * offload_block_size_ratio; add a check in
the WorkerConfig builder's build() (or the builder validation hook) that
computes expected_offload = engine_page_size * offload_block_size_ratio and
compares it to offload_page_size, returning a clear Err/Result (or panic with a
descriptive error) when they differ so misconfigurations cannot produce
incorrect offload block sizes.

};
let num_layers = device_tensors.len();
let inner_dim = shape[2..].iter().product::<usize>() / config.page_size;
let inner_dim = shape[2..].iter().product::<usize>() / config.engine_page_size;

⚠️ Potential issue

Inconsistent page_size usage in inner_dim calculation

The inner_dim calculation uses engine_page_size for non-fully-contiguous layouts (line 177) but offload_page_size for fully-contiguous layouts (line 196). This inconsistency could lead to incorrect memory layout calculations.

Based on the layout type distinction, it appears intentional that different layouts use different page sizes. However, the logging on line 201 incorrectly shows engine_page_size when the calculation used offload_page_size. Fix the logging:

             tracing::info!(
                 "Inferred layout: num_layers={}, outer_dim={}, page_size={}, inner_dim={}",
                 num_layers,
                 outer_dim,
-                config.engine_page_size,
+                config.offload_page_size,
                 inner_dim
             );

Additionally, consider adding a comment explaining why different page sizes are used for different layout types to prevent future confusion.

Also applies to: 196-196

🤖 Prompt for AI Agents
In lib/llm/src/block_manager/distributed/worker.rs around lines 177 and 196, the
inner_dim calculation uses different page sizes for different layouts
(engine_page_size for non-fully-contiguous at line 177 and offload_page_size for
fully-contiguous at line 196) but the log at ~201 prints engine_page_size even
when offload_page_size was used; update the log to print the correct page size
depending on which branch computed inner_dim (use offload_page_size for the
fully-contiguous branch and engine_page_size for the other), and add a short
comment by each branch explaining why different page sizes are chosen for
fully-contiguous vs non-fully-contiguous layouts to prevent future confusion.
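To make the branch discrepancy concrete, a sketch of the inner_dim arithmetic (the tensor shape and page sizes here are illustrative, not taken from the worker code):

```rust
/// Per-token inner dimension: the product of the trailing shape dims
/// divided by the page size the layout is built with.
fn inner_dim(shape: &[usize], page_size: usize) -> usize {
    shape[2..].iter().product::<usize>() / page_size
}

fn main() {
    // Hypothetical tensor shape: [outer, num_blocks, page, heads, head_dim].
    let shape = [2, 128, 32, 8, 64];
    let engine_page_size = 32;
    let offload_page_size = 32 * 32; // ratio of 32

    // The non-fully-contiguous branch divides by the engine page size...
    assert_eq!(inner_dim(&shape, engine_page_size), 512); // 32*8*64 / 32
    // ...while the fully-contiguous branch divides by the offload page size,
    // yielding a 32x smaller inner_dim for the same tensors.
    assert_eq!(inner_dim(&shape, offload_page_size), 16);
    println!("ok");
}
```

This is why the log line must report whichever page size the taken branch actually divided by.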

.create_layout(layout_type, device_tensors)?;

let layout_builder = layout_builder.clone();
let layout_builder = layout_builder.page_size(config.offload_page_size).clone();

🛠️ Refactor suggestion

Builder pattern mutation after build

Line 232 modifies the builder after it has already been used to build a layout on line 230. This is unusual and potentially error-prone as it violates the typical builder pattern where builders are consumed or finalized after build().

Consider creating a new builder instance for the offload configuration instead of mutating the existing one:

-        let layout_builder = layout_builder.page_size(config.offload_page_size).clone();
+        let offload_layout_builder = LayoutConfigBuilder::default()
+            .num_layers(num_layers)
+            .outer_dim(outer_dim)
+            .page_size(config.offload_page_size)
+            .inner_dim(inner_dim)
+            .dtype_width_bytes(config.dtype_width_bytes);

Then pass offload_layout_builder to the initialization functions.

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
-        let layout_builder = layout_builder.page_size(config.offload_page_size).clone();
+        let offload_layout_builder = LayoutConfigBuilder::default()
+            .num_layers(num_layers)
+            .outer_dim(outer_dim)
+            .page_size(config.offload_page_size)
+            .inner_dim(inner_dim)
+            .dtype_width_bytes(config.dtype_width_bytes);
🤖 Prompt for AI Agents
In lib/llm/src/block_manager/distributed/worker.rs around lines 230-232, the
code calls build() on a layout builder and then mutates that same builder
(page_size(...)) on line 232 which breaks the builder pattern; instead, create a
separate builder instance for the offload configuration (e.g., clone or
construct a new LayoutBuilder with the same base settings), set the offload
page_size on that new builder, and pass the new offload_layout_builder into the
offload initialization functions so the original builder is not mutated after
build.
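The non-mutating alternative can be illustrated with a toy consuming builder (a hypothetical stand-in; the real LayoutConfigBuilder has more fields): clone the base builder for each layout instead of reusing one builder after build().

```rust
// Toy builder, hypothetical: stands in for the repo's LayoutConfigBuilder.
#[derive(Clone, Default)]
struct LayoutConfigBuilder {
    num_layers: usize,
    page_size: usize,
}

struct LayoutConfig {
    num_layers: usize,
    page_size: usize,
}

impl LayoutConfigBuilder {
    fn num_layers(mut self, v: usize) -> Self { self.num_layers = v; self }
    fn page_size(mut self, v: usize) -> Self { self.page_size = v; self }
    fn build(self) -> LayoutConfig {
        LayoutConfig { num_layers: self.num_layers, page_size: self.page_size }
    }
}

fn main() {
    // Shared base settings, built once.
    let base = LayoutConfigBuilder::default().num_layers(32).page_size(16);

    // Build the device layout from a clone, leaving `base` untouched...
    let device = base.clone().build();

    // ...then derive the offload builder from the same base, rather than
    // mutating a builder that already produced a layout.
    let offload = base.page_size(16 * 32).build();

    assert_eq!(device.page_size, 16);
    assert_eq!(offload.page_size, 512);
    assert_eq!(device.num_layers, offload.num_layers);
}
```

With consuming builders this pattern is enforced by the compiler: once `build()` takes `self`, the post-build mutation flagged in the comment above would not type-check.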

.num_layers(model.num_layers)
.outer_dim(model.outer_dim)
.page_size(model.page_size)
.page_size(model.page_size * self.config.offload_block_size_ratio)

💡 Verification agent

🧩 Analysis chain

Critical: inflates device page size as well; offload-only ratio must not affect engine/device.

layout_builder is generic and appears to seed both device and offload layouts. Multiplying here will make device pages too large, contradicting the PR goal (device small, host/disk large).

Minimal fix (keep engine/device at base size here; apply the ratio only in host/disk-specific builders/paths):

-            .page_size(model.page_size * self.config.offload_block_size_ratio)
+            .page_size(model.page_size)

Run to find usages that expect offload-sized pages and adjust them to multiply locally:

I can draft host/disk-specific builders (e.g., layout_builder_offload) if helpful.


🏁 Script executed:

#!/bin/bash
rg -n -C3 '\blayout_builder\(' lib/llm | sed -n '1,200p'

Length of output: 1849


Remove offload ratio from generic layout_builder

  • In lib/llm/src/block_manager/state/resources.rs (line 92), change
    - .page_size(model.page_size * self.config.offload_block_size_ratio)
    + .page_size(model.page_size)
  • Apply offload_block_size_ratio only in host/disk–specific layout builders (i.e. after calling resources.layout_builder() in offload code paths) so that device pages remain at the base size.
📝 Committable suggestion


Suggested change
-            .page_size(model.page_size * self.config.offload_block_size_ratio)
+            .page_size(model.page_size)
🤖 Prompt for AI Agents
In lib/llm/src/block_manager/state/resources.rs around line 92, the generic
layout_builder is applying offload_block_size_ratio to the device page size
(".page_size(model.page_size * self.config.offload_block_size_ratio)"); remove
that multiplication so the generic builder uses the base page size (e.g.,
".page_size(model.page_size)"). Then apply the offload_block_size_ratio only in
host/disk-specific layout builders—after calling resources.layout_builder() in
offload code paths—by multiplying the returned layout/page_size there so device
pages remain at the base size while offloaded layouts use the scaled size.
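The intended split can be sketched with two helpers (hypothetical names; the ratio default of 32 comes from the PR summary): the generic builder keeps the engine page size, and the ratio is applied only where host/disk layouts are constructed.

```rust
// Hypothetical sketch of the device/offload page-size split.
const OFFLOAD_BLOCK_SIZE_RATIO: usize = 32; // default ratio per the PR summary

fn engine_page_size(base: usize) -> usize {
    // Device/engine layouts stay at the base size.
    base
}

fn offload_page_size(base: usize) -> usize {
    // Host/disk layouts use larger blocks for bigger IOs.
    base * OFFLOAD_BLOCK_SIZE_RATIO
}

fn main() {
    let base = 16;
    println!(
        "engine={} offload={}",
        engine_page_size(base),
        offload_page_size(base)
    );
}
```

Keeping the multiplication out of the shared builder means the offload ratio cannot accidentally leak into device layouts, which is exactly the regression the comment above flags.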

Labels
external-contribution Pull request is from an external contributor size/XL