
Conversation

da-x
Contributor

@da-x da-x commented Sep 10, 2025

This change allows the host and disk pools to have larger blocks than the device block pool, with a fixed block size ratio between these two pool classes. During offload and onboarding of blocks, we need to unite small device blocks into large offload blocks, or to split large offloaded blocks back into small device blocks.

The advantage conferred by this change is more efficient management of offloaded blocks; specifically, larger I/Os can always be used when transferring back and forth from disk.

To do:

  • Fix onboarding from disk to device
  • Fix support for TRTLLM
  • Add unit tests
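
The unite/split arithmetic described above can be sketched as follows. This is an illustrative helper only, assuming the fixed ratio between the two pool classes; `group_device_blocks` is not part of this change, and the real pools operate on device memory rather than plain IDs.

```rust
/// Group small device block IDs into chunks that each back one large
/// offload block, given offload_page_size = engine_page_size * ratio.
fn group_device_blocks(device_block_ids: &[usize], ratio: usize) -> Vec<Vec<usize>> {
    assert!(ratio > 0, "ratio must be positive");
    assert_eq!(
        device_block_ids.len() % ratio,
        0,
        "device blocks must fill whole offload blocks"
    );
    device_block_ids
        .chunks(ratio)
        .map(|chunk| chunk.to_vec())
        .collect()
}

fn main() {
    // With the default ratio of 32, 64 device blocks unite into 2 offload blocks;
    // splitting an offload block back is the inverse: one chunk -> `ratio` device blocks.
    let device_ids: Vec<usize> = (0..64).collect();
    let groups = group_device_blocks(&device_ids, 32);
    assert_eq!(groups.len(), 2);
    assert_eq!(groups[0].len(), 32);
}
```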

Summary by CodeRabbit

  • New Features

    • Separate engine and offload block/page sizes with a configurable ratio (default 32) for improved cache and I/O efficiency.
    • Fragmentation-aware block data and transfers, enabling per-fragment views and correct copy semantics across device/host/disk.
    • Enhanced connector logic aligning token/block arithmetic with engine/offload sizes; added metrics wiring.
  • Refactor

    • Updated Python bindings and leaders/workers to require engine_page_size and offload_page_size instead of a single page_size.
    • Worker configuration and layout inference now distinguish engine vs. offload sizing.
  • Chores

    • Added smallvec dependency.


Signed-off-by: Dan Aloni <[email protected]>
@da-x da-x requested review from a team as code owners September 10, 2025 14:20

copy-pr-bot bot commented Sep 10, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.


👋 Hi da-x! Thank you for contributing to ai-dynamo/dynamo.

Just a reminder: The NVIDIA Test Github Validation CI runs an essential subset of the testing framework to quickly catch errors. Your PR reviewers may elect to test the changes comprehensively before approving your changes.

🚀

@github-actions github-actions bot added the external-contribution Pull request is from an external contributor label Sep 10, 2025
Contributor

coderabbitai bot commented Sep 10, 2025

Walkthrough

Introduces separate engine and offload block/page sizes across KV block manager, vLLM/TRT-LLM connectors, and Python bindings. Adds fragmentation support using SmallVec, refactors transfer logic to operate on fragment-aware data, and updates builders/configs and accessors accordingly. Updates Python integration to pass both sizes. Adds smallvec dependency.

Changes

Cohort / File(s) Summary
Dependencies
lib/llm/Cargo.toml, lib/bindings/python/Cargo.toml
Add dependency: smallvec = "1.15.1" with serde feature.
Core KV Block Manager config & API
lib/llm/src/block_manager.rs, lib/llm/src/block_manager/config.rs, lib/llm/src/block_manager/state/resources.rs, lib/llm/src/block_manager/distributed/worker.rs
Replace single block/page size with engine/offload sizes and a ratio; add accessors; update config (offload_block_size_ratio); adjust resource layout sizing; worker config fields renamed/added and used in layout/memory sizing.
Fragmentation & transfer refactor
lib/llm/src/block_manager/block/data.rs, lib/llm/src/block_manager/block/data/local.rs, lib/llm/src/block_manager/block/data/logical.rs, lib/llm/src/block_manager/block/transfer/cuda.rs, lib/llm/src/block_manager/distributed/utils.rs, lib/llm/src/block_manager/distributed/transfer.rs
Introduce SmallVec-backed fragmentation (multiple fragment IDs), add fragment-aware views/accessors, update CUDA copy to handle fragmented src/dst, change transfer request blocks to SmallVec pairs, refactor transfer handler to LocalBlockDataBase-based access.
Python Rust bindings: BlockManager
lib/bindings/python/rust/llm/block_manager.rs, lib/bindings/python/rust/llm/block_manager/distributed/worker.rs
Split block/page size surface into engine/offload; builder gains engine_page_size/offload_page_size; worker builder calls updated accordingly.
vLLM integration: Slot/Connector
lib/bindings/python/rust/llm/block_manager/vllm.rs, lib/bindings/python/rust/llm/block_manager/vllm/connector/leader.rs, lib/bindings/python/rust/llm/block_manager/vllm/connector/leader/recorder.rs, lib/bindings/python/rust/llm/block_manager/vllm/connector/leader/slot.rs, lib/bindings/python/rust/llm/block_manager/vllm/connector/worker.rs
SlotManager and connectors migrate from single size to engine/offload sizes; add ratio handling; update alignment/arithmetic and policy logic; offload/onboard now chunk by engine-per-offload grouping; integrate KvbmMetrics where applicable.
TRT-LLM connector updates
lib/bindings/python/rust/llm/block_manager/vllm/connector/trtllm_leader.rs, lib/bindings/python/rust/llm/block_manager/vllm/connector/trtllm_worker.rs
Replace page_size with engine_page_size and offload_page_size in leader/worker; update PyO3 signatures and builder chains.
Python integration surface
lib/bindings/python/src/dynamo/llm/vllm_integration/connector_leader.py
Passes new extra integer (offload size) argument to RustKvConnectorLeader constructor.

Sequence Diagram(s)

sequenceDiagram
  autonumber
  participant Py as Python (vLLM/TRT-LLM)
  participant PyBind as PyO3 Binding
  participant Bld as BlockManagerBuilder
  participant KBM as KvBlockManager
  participant Slot as SlotManager

  Note over Py,PyBind: Initialization with dual sizes
  Py->>PyBind: new(engine_page_size, offload_page_size, leader)
  PyBind->>Bld: .engine_page_size(eps).offload_page_size(ops)
  Bld->>KBM: build() with eps, ops, ratio
  KBM-->>PyBind: KvBlockManager{engine, offload, ratio}
  PyBind->>Slot: new(engine_block_size, offload_block_size)
  Slot-->>Py: ready

  rect rgba(200,230,255,0.3)
  Note right of Slot: Token/block math uses engine size on device<br/>and offload size on host/disk
  end
sequenceDiagram
  autonumber
  participant Src as Source BlockData
  participant Dst as Dest BlockData
  participant Xfer as CUDA copy_layers

  Note over Src,Dst: Fragment-aware transfer
  Xfer->>Src: is_fragmented?
  alt Src fragmented
    loop for each src fragment
      Xfer->>Dst: copy fragment -> contiguous dst region (advance dst)
    end
  else Dst fragmented
    loop for each dst fragment
      Xfer->>Dst: copy contiguous src -> dst fragment (advance src)
    end
  else
    Xfer->>Dst: single memcpy
  end
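
The three-way dispatch in the diagram above can be modeled as a small copy planner. The `Side` enum and byte-offset plan below are hypothetical stand-ins for the real fragment-aware `BlockData`/CUDA copy path, which works on device memory rather than offsets.

```rust
enum Side {
    Contiguous { len: usize },
    Fragmented { fragment_lens: Vec<usize> },
}

/// Plan (src_offset, dst_offset, len) copy regions for one block transfer,
/// mirroring the diagram: fragmented src advances dst, fragmented dst
/// advances src, and the fully contiguous case is a single copy.
fn plan_copies(src: &Side, dst: &Side) -> Vec<(usize, usize, usize)> {
    match (src, dst) {
        (Side::Fragmented { fragment_lens }, Side::Contiguous { .. }) => {
            // Each src fragment starts at its own base (offset 0 here);
            // the contiguous dst region advances fragment by fragment.
            let mut dst_off = 0;
            fragment_lens
                .iter()
                .map(|&len| {
                    let plan = (0, dst_off, len);
                    dst_off += len;
                    plan
                })
                .collect()
        }
        (Side::Contiguous { .. }, Side::Fragmented { fragment_lens }) => {
            // Contiguous src advances; each dst fragment starts at its base.
            let mut src_off = 0;
            fragment_lens
                .iter()
                .map(|&len| {
                    let plan = (src_off, 0, len);
                    src_off += len;
                    plan
                })
                .collect()
        }
        (Side::Contiguous { len }, Side::Contiguous { .. }) => vec![(0, 0, *len)],
        _ => panic!("fragmented-to-fragmented is not modeled in this sketch"),
    }
}
```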

Estimated code review effort

🎯 5 (Critical) | ⏱️ ~120 minutes

Possibly related PRs

Pre-merge checks (2 passed, 1 warning)

❌ Failed checks (1 warning)

  • Description Check: ⚠️ Warning. The current description does not follow the repository’s required template: it lacks the “#### Overview,” “#### Details,” “#### Where should the reviewer start?” and “#### Related Issues” sections and instead provides free-form paragraphs and a TODO list. This omission makes it unclear where reviewers should focus and which issue the PR addresses. Resolution: update the PR description to use the repository’s template by adding section headings for Overview, Details, Where should the reviewer start, and Related Issues (with appropriate issue references), and then populate each section with the corresponding information about what was changed, why, which files to review, and any linked issues.

✅ Passed checks (2 passed)

  • Title Check: ✅ Passed. The title “Draft: kvbm - bigger page size for offload block pools only” refers directly to a real aspect of the changeset—namely increasing offload block sizes—but includes a “Draft:” prefix and uses an internal acronym, which adds noise without clarifying the primary change. It nevertheless identifies a key part of the PR’s intent by highlighting offload block pool sizing.
  • Docstring Coverage: ✅ Passed. Docstring coverage is 84.34%, which is sufficient. The required threshold is 80.00%.

Poem

A rabbit taps bytes with tiny paws,
Splitting blocks by tidy laws;
Engine nibble, offload chew,
SmallVec pockets, fragments new.
Hops through slots, transfers bright—
Two sizes dance in perfect byte.
Thump-thump: it all aligns just right.

Tip

👮 Agentic pre-merge checks are now available in preview!

Pro plan users can now enable pre-merge checks in their settings to enforce checklists before merging PRs.

  • Built-in checks – Quickly apply ready-made checks to enforce title conventions, require pull request descriptions that follow templates, validate linked issues for compliance, and more.
  • Custom agentic checks – Define your own rules using CodeRabbit’s advanced agentic capabilities to enforce organization-specific policies and workflows. For example, you can instruct CodeRabbit’s agent to verify that API documentation is updated whenever API schema files are modified in a PR. Note: Up to 5 custom checks are currently allowed during the preview period. Pricing for this feature will be announced in a few weeks.

Please see the documentation for more information.

Example:

reviews:
  pre_merge_checks:
    custom_checks:
      - name: "Undocumented Breaking Changes"
        mode: "warning"
        instructions: |
          Pass/fail criteria: All breaking changes to public APIs, CLI flags, environment variables, configuration keys, database schemas, or HTTP/GraphQL endpoints must be documented in the "Breaking Change" section of the PR description and in CHANGELOG.md. Exclude purely internal or private changes (e.g., code not exported from package entry points or explicitly marked as internal).

Please share your feedback with us on this Discord post.


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.


Comment @coderabbitai help to get the list of available commands and usage tips.

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 22

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (6)
lib/bindings/python/rust/llm/block_manager/vllm/connector/worker.rs (1)

166-176: Compute/validate sizes and avoid overflow on multiplication.

Use checked_mul and enforce offload_page_size % engine_page_size == 0 before building the config.

Apply this diff:

@@
-        let config = KvbmWorkerConfig::builder()
+        // Derive and validate sizes
+        let engine_page_size = page_size;
+        let offload_page_size = engine_page_size
+            .checked_mul(self.offload_block_size_ratio)
+            .ok_or_else(|| anyhow::anyhow!(
+                "offload_page_size overflow: engine_page_size={} ratio={}",
+                engine_page_size, self.offload_block_size_ratio
+            ))?;
+        debug_assert!(
+            offload_page_size % engine_page_size == 0,
+            "offload_page_size must be a multiple of engine_page_size"
+        );
+
+        let config = KvbmWorkerConfig::builder()
             .drt(self.drt.clone())
             .num_device_blocks(num_device_blocks)
-            .offload_page_size(page_size * self.offload_block_size_ratio)
-            .engine_page_size(page_size)
+            .offload_page_size(offload_page_size)
+            .engine_page_size(engine_page_size)
             .tensors(vllm_tensors)
             .device_id(device_id)
             .dtype_width_bytes(dtype_width_bytes)
             .barrier_id_prefix(get_barrier_id_prefix())
             .scheduler_client(Some(self.transfer_client.clone()))
             .build()?;

Also applies to: 169-171

lib/bindings/python/rust/llm/block_manager/vllm/connector/leader/recorder.rs (1)

87-96: Validate engine/offload sizes and log ratio at construction.

Guard against zero/invalid sizes and non-integer ratios to prevent downstream inconsistencies.

Apply this diff:

@@
     pub fn new(
         worker_id: String,
         drt: PyDistributedRuntime,
-        engine_page_size: usize,
-        offload_page_size: usize,
+        engine_page_size: usize,
+        offload_page_size: usize,
         leader_py: PyKvbmLeader,
     ) -> Self {
@@
-        let kvbm_metrics = KvbmMetrics::new(&ns);
+        // Validate sizes upfront
+        assert!(engine_page_size > 0 && offload_page_size > 0, "page sizes must be > 0");
+        assert!(
+            offload_page_size % engine_page_size == 0,
+            "offload_page_size ({}) must be a multiple of engine_page_size ({})",
+            offload_page_size, engine_page_size
+        );
+        let ratio = offload_page_size / engine_page_size;
+        tracing::info!("engine_page_size={} offload_page_size={} ratio={}", engine_page_size, offload_page_size, ratio);
+
+        let kvbm_metrics = KvbmMetrics::new(&ns);
@@
                 let block_manager = match BlockManagerBuilder::new()
                     .worker_id(0)
                     .leader(leader_py)
-                    .engine_page_size(engine_page_size)
-                    .offload_page_size(offload_page_size)
+                    .engine_page_size(engine_page_size)
+                    .offload_page_size(offload_page_size)
                     .disable_device_pool(false)
                     .build()
                     .await

Also applies to: 147-149

lib/bindings/python/rust/llm/block_manager/vllm.rs (1)

584-682: Switch matching logic to offload-sized blocks

get_num_new_matched_tokens currently uses engine_block_size for hashing, offsets and token arithmetic, but host()/disk() pools operate on offload_block_size. Update to:

  • compute completed offload blocks: num_completed_offload_blocks = num_computed_tokens / self.offload_block_size
  • call slot.sequence_hashes_offload() instead of slot.sequence_hashes(...)
  • set search_offset = num_completed_offload_blocks
  • calculate num_new_matched_tokens = num_matched_blocks * self.offload_block_size
  • subtract self.offload_block_size in the boundary-pop case

Add a sequence_hashes_offload() API if missing.
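
The offload-sized arithmetic in the bullets above can be sketched as a standalone helper. The function name below is made up for illustration; the real logic lives inside get_num_new_matched_tokens.

```rust
/// Completed offload blocks and newly matched tokens, both measured in
/// offload-sized units rather than engine-sized ones.
fn matched_tokens_offload(
    num_computed_tokens: usize,
    num_matched_blocks: usize,
    offload_block_size: usize,
) -> (usize, usize) {
    // search_offset would be set to num_completed_offload_blocks.
    let num_completed_offload_blocks = num_computed_tokens / offload_block_size;
    let num_new_matched_tokens = num_matched_blocks * offload_block_size;
    (num_completed_offload_blocks, num_new_matched_tokens)
}
```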

lib/bindings/python/rust/llm/block_manager/vllm/connector/leader/slot.rs (1)

549-582: Add tests and document alignment logic in slot.rs
File: lib/bindings/python/rust/llm/block_manager/vllm/connector/leader/slot.rs (549–582)

  • No existing tests cover the aligned_candidates calculation or offload_block_size_ratio path. Add unit tests for:
    • num_candidate_blocks divisible by offload_block_size_ratio
    • num_candidate_blocks with a non-zero remainder
    • num_candidate_blocks smaller than offload_block_size_ratio
  • Insert inline comments clarifying the math, for example:
    // Calculate the largest multiple of offload_block_size_ratio
    // that does not exceed num_candidate_blocks
    let aligned_candidates = (num_candidate_blocks / self.offload_block_size_ratio)
        * self.offload_block_size_ratio;
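
A minimal extraction of that math, with the three suggested cases as unit tests (the free function `align_down` is a made-up name for illustration):

```rust
/// Largest multiple of `ratio` that does not exceed `n` -- the
/// aligned_candidates calculation quoted above, lifted out for testing.
fn align_down(n: usize, ratio: usize) -> usize {
    (n / ratio) * ratio
}

#[cfg(test)]
mod tests {
    use super::align_down;

    #[test]
    fn divisible_by_ratio() {
        assert_eq!(align_down(64, 32), 64);
    }

    #[test]
    fn nonzero_remainder() {
        assert_eq!(align_down(70, 32), 64);
    }

    #[test]
    fn smaller_than_ratio() {
        assert_eq!(align_down(10, 32), 0);
    }
}
```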
lib/bindings/python/rust/llm/block_manager/vllm/connector/leader.rs (2)

112-114: Assert ratio invariants between engine/offload page sizes.

Offload must be a multiple of engine page size; assert early to catch misconfig before scheduling.

Apply:

         let kvbm_metrics = KvbmMetrics::new(&ns);
         let kvbm_metrics_clone = kvbm_metrics.clone();
+
+        debug_assert!(
+            engine_page_size > 0 && offload_page_size > 0,
+            "page sizes must be > 0"
+        );
+        debug_assert!(
+            offload_page_size % engine_page_size == 0,
+            "offload_page_size ({}) must be a multiple of engine_page_size ({})",
+            offload_page_size,
+            engine_page_size
+        );

295-301: Guard against underflow and enforce invariants in computed-tokens math.

If inputs ever drift, this can go negative or misalign silently. Add debug asserts.

Apply:

         // the second call will show num_external_tokens == 0
         // this call is just letting us know the other blocks that are being used for the remainder of the prefill
         if num_external_tokens > 0 {
+            debug_assert!(
+                !block_ids.is_empty(),
+                "num_external_tokens > 0 requires at least one device block"
+            );
+            debug_assert!(
+                num_external_tokens <= block_ids.len() * self.engine_page_size,
+                "num_external_tokens ({}) exceeds capacity of provided device blocks ({} * {})",
+                num_external_tokens,
+                block_ids.len(),
+                self.engine_page_size
+            );
             let num_computed_tokens = block_ids.len() * self.engine_page_size - num_external_tokens;
             slot.record_cached_device_tokens(num_computed_tokens);
             slot.advance_computed_position(num_computed_tokens)?;
🧹 Nitpick comments (27)
lib/bindings/python/Cargo.toml (1)

55-55: Unify smallvec version across the workspace (or gate features).

Looks fine, but please ensure this exact version/feature set matches other crates (llm also adds smallvec). Consider moving to [workspace.dependencies] to prevent drift; if not, keep both crates pinned to the same version/features to avoid duplicate resolutions.

lib/llm/Cargo.toml (1)

159-159: Match Python binding’s smallvec and consider workspace unification.

Same note here: align with lib/bindings/python to the exact version/features or promote to [workspace.dependencies] for consistency.

lib/llm/src/block_manager/config.rs (1)

210-215: Name/docs mismatch vs usage; add guardrails.

The field is named “offload_block_size_ratio” but is used to scale page_size elsewhere. Consider renaming to offload_page_size_ratio or clarifying docs to “page size,” and add upper-bound validation (or cfg-driven limit) to prevent pathological ratios.

Apply this doc tweak locally:

-    /// Ratio between offload block size and engine block size
-    /// Offload blocks are larger to enable more efficient I/O operations
+    /// Ratio between offload page size and engine page size
+    /// Offload pages are larger to enable more efficient I/O operations
lib/llm/src/block_manager/distributed/utils.rs (2)

53-61: Use smallvec! macro for clarity and tiny perf win.

Constructing a single-element SmallVec via iterator is noisy.

-            blocks: blocks
-                .into_iter()
-                .map(|(src, dst)| {
-                    (
-                        [src].iter().map(|x| *x).collect(),
-                        [dst].iter().map(|x| *x).collect(),
-                    )
-                })
-                .collect(),
+            blocks: blocks
+                .into_iter()
+                .map(|(src, dst)| (smallvec![src], smallvec![dst]))
+                .collect(),

75-83: Ditto: simplify constructor mapping with smallvec! macro.

Same simplification here.

-            blocks: blocks
-                .into_iter()
-                .map(|(src, dst)| {
-                    (
-                        [src].iter().map(|x| *x).collect(),
-                        [dst].iter().map(|x| *x).collect(),
-                    )
-                })
-                .collect(),
+            blocks: blocks
+                .into_iter()
+                .map(|(src, dst)| (smallvec![src], smallvec![dst]))
+                .collect(),
lib/llm/src/block_manager.rs (2)

114-117: Avoid storing redundant derived state or assert invariants to prevent drift

You store both engine_block_size, offload_block_size, and the ratio. Either:

  • compute offload_block_size on demand from engine_block_size * offload_block_size_ratio, or
  • assert invariants at construction (offload_block_size >= engine_block_size, divisible, matches ratio).

Given this struct is Clone and long-lived, a future mutation could desync fields.

Apply simple invariant checks during construction (see suggested diffs on new()).


135-147: Clarify accessor docs and units

Both engine_block_size() and offload_block_size() are documented as “Get the block size.” Be explicit about engine/device vs. offload/host-disk and units (bytes).

Apply:

-    /// Get the block size
+    /// Get the engine (device) block size in bytes.
     pub fn engine_block_size(&self) -> usize {
         self.engine_block_size
     }

-    /// Get the block size
+    /// Get the offload (host/disk) block size in bytes.
     pub fn offload_block_size(&self) -> usize {
         self.offload_block_size
     }

-    /// Get the offload block size ratio
+    /// Get the offload block-size ratio (offload/engine). Always >= 1.
     pub fn offload_block_size_ratio(&self) -> usize {
         self.offload_block_size_ratio
     }
lib/bindings/python/rust/llm/block_manager/vllm/connector/trtllm_worker.rs (1)

133-135: Use distinct offload_page_size for TRT-LLM
TRT-LLM currently sets both offload_page_size and engine_page_size to the same page_size, forfeiting potential larger offload I/O. Consider plumbing a separate offload_page_size (e.g. page_size * offload_block_size_ratio) to match vLLM’s behavior.

lib/bindings/python/rust/llm/block_manager/vllm/connector/worker.rs (1)

74-76: Make offload_block_size_ratio configurable and validated against leader.

A hard-coded ratio (32) risks drift with the leader’s explicit engine/offload sizes. Expose this via config/constructor or derive it from leader metadata, and validate consistency at registration time.

Would you like me to wire a setter from Python or plumb it from the leader metadata?

Also applies to: 121-121

lib/bindings/python/rust/llm/block_manager/vllm/connector/leader/recorder.rs (1)

188-196: Surface sizes via metrics or getters to aid debugging.

Consider exporting engine/offload sizes (and ratio) in KvbmMetrics to catch configuration drift at runtime.

Also applies to: 190-192

lib/llm/src/block_manager/block/data/logical.rs (3)

59-59: Avoid allocation; initialize SmallVec directly.

Use SmallVec::from for the single-element case.

Apply this diff:

-            block_ids: [block_id].iter().map(|x| *x).collect(),
+            block_ids: SmallVec::from([block_id]),

75-82: Fix panic message and add backtrace for fragmented block misuse.

Current message references LocalBlockData; add backtrace for parity.

Apply this diff:

     fn block_id(&self) -> BlockId {
-        if self.block_ids.len() == 1 {
-            self.block_ids[0]
-        } else {
-            panic!("used LocalBlockData::block_id() for fragmented block");
-        }
+        if self.block_ids.len() == 1 {
+            self.block_ids[0]
+        } else {
+            tracing::error!("Backtrace: {}", std::backtrace::Backtrace::force_capture());
+            panic!("used LogicalBlockData::block_id() for fragmented block");
+        }
     }

83-91: Bounds-check fragment index and fix panic wording.

Prevent generic index panics and standardize error text.

Apply this diff:

     #[inline(always)]
     fn fragment_block_id(&self, idx: usize) -> BlockId {
-        if self.block_ids.len() != 1 {
-            self.block_ids[idx]
-        } else {
-            tracing::error!("Backtrace: {}", std::backtrace::Backtrace::force_capture());
-            panic!("used LocalBlockData::fragment_block_id() for non-fragmented block");
-        }
+        if self.block_ids.len() != 1 {
+            self.block_ids
+                .get(idx)
+                .copied()
+                .unwrap_or_else(|| {
+                    tracing::error!("Backtrace: {}", std::backtrace::Backtrace::force_capture());
+                    panic!(
+                        "LogicalBlockData::fragment_block_id() idx {} out of range (len={})",
+                        idx,
+                        self.block_ids.len()
+                    )
+                })
+        } else {
+            tracing::error!("Backtrace: {}", std::backtrace::Backtrace::force_capture());
+            panic!("used LogicalBlockData::fragment_block_id() for non-fragmented block");
+        }
     }
lib/bindings/python/rust/llm/block_manager.rs (2)

241-243: Defaults make sense but Default::default() remains footgun

new() sets good defaults, but deriving Default will still produce zeros if someone uses BlockManagerBuilder::default() directly.

Provide an explicit Default to mirror new():

impl Default for BlockManagerBuilder {
    fn default() -> Self {
        Self {
            worker_id: 0,
            leader: None,
            engine_page_size: 32,
            offload_page_size: 1024,
            disable_device_pool: false,
        }
    }
}

251-257: Setters are fine; add basic guards

Tiny guardrails help fail-fast on obvious misconfigurations.

 pub fn engine_page_size(mut self, ps: usize) -> Self {
-        self.engine_page_size = ps;
+        debug_assert!(ps.is_power_of_two(), "engine_page_size should be a power of two");
+        self.engine_page_size = ps;
         self
 }
 pub fn offload_page_size(mut self, ps: usize) -> Self {
-        self.offload_page_size = ps;
+        debug_assert!(ps.is_power_of_two(), "offload_page_size should be a power of two");
+        self.offload_page_size = ps;
         self
 }
lib/llm/src/block_manager/block/data.rs (3)

16-18: Doc nit: clarify fragment_block_id semantics

“The index of the block in the block set” duplicates block_id’s doc. Suggest “Returns the BlockId for the given fragment index of a fragmented block.”


69-81: Guard against invalid fragment access to avoid panics downstream

Add a lightweight debug assertion to catch invalid indices early before local implementations panic.

     fn layer_view_fragment(
         &self,
         fragment_idx: usize,
         layer_idx: usize,
         outer_idx: usize,
     ) -> BlockResult<view::LayerView<'_, S>> {
+        debug_assert!(
+            self.is_fragmented() && fragment_idx < self.num_fragments(),
+            "invalid fragment_idx {} (num_fragments: {})",
+            fragment_idx,
+            self.num_fragments()
+        );
         match self.is_local() {
             Some(views) => views.local_layer_view_fragment(fragment_idx, layer_idx, outer_idx),
             None => Err(BlockError::ViewsNotAvailableOnLogicalBlocks),
         }
     }

94-106: Same guard for mutable fragment views

Mirror the check for the mutable variant.

     fn layer_view_fragment_mut(
         &mut self,
         fragment_idx: usize,
         layer_idx: usize,
         outer_idx: usize,
     ) -> BlockResult<view::LayerViewMut<'_, S>> {
+        debug_assert!(
+            self.is_fragmented() && fragment_idx < self.num_fragments(),
+            "invalid fragment_idx {} (num_fragments: {})",
+            fragment_idx,
+            self.num_fragments()
+        );
         match self.is_local_mut() {
             Some(views) => views.local_layer_view_fragment_mut(fragment_idx, layer_idx, outer_idx),
             None => Err(BlockError::ViewsNotAvailableOnLogicalBlocks),
         }
     }
lib/bindings/python/rust/llm/block_manager/vllm/connector/trtllm_leader.rs (2)

55-57: Remove the unused offload_page_size field or implement its functionality.

The offload_page_size field is marked with #[allow(unused)] but is being passed through the constructor and stored. If this field will be used in a future implementation, consider adding a TODO comment explaining its planned usage. Otherwise, remove it to avoid confusion.

If keeping for future use:

 pub struct KvConnectorLeader {
     slot_manager: Arc<OnceLock<ConnectorSlotManager<String>>>,
     engine_page_size: usize,
+    // TODO: Implement offload page size functionality for TRTLLM connector
     #[allow(unused)]
     offload_page_size: usize,

176-178: Consider extracting the page size alignment validation to a helper method.

The modulo checks for page alignment appear multiple times in the code. Consider extracting this to a helper method for better maintainability and consistency.

+impl KvConnectorLeader {
+    #[inline]
+    fn is_page_aligned(&self, tokens: usize) -> bool {
+        tokens % self.engine_page_size == 0
+    }
+}

 fn get_num_new_matched_tokens(
     &mut self,
     request_id: String,
     request_num_tokens: usize,
     num_computed_tokens: usize,
 ) -> anyhow::Result<(usize, bool)> {
     // ...
-    if num_computed_tokens % self.engine_page_size != 0 {
+    if !self.is_page_aligned(num_computed_tokens) {
         return Ok((0, false));
     }
     // ...
-    debug_assert!((num_computed_tokens + num_external_tokens) % self.engine_page_size == 0);
+    debug_assert!(self.is_page_aligned(num_computed_tokens + num_external_tokens));

Also applies to: 186-187, 201-202

lib/bindings/python/rust/llm/block_manager/vllm/connector/leader/slot.rs (1)

1439-1445: Simplify block pairs construction.

The block pairs construction using iterator chains could be simplified for better readability.

-let block_pairs: Vec<_> = src_block_ids
-    .iter()
-    .zip(
-        onboard_req
-            .dst_block_ids
-            .chunks(engine_blocks_per_offload_block),
-    )
-    .map(|(src, dst)| (SmallVec::from([*src]), SmallVec::from(dst)))
-    .collect();
+let dst_chunks = onboard_req.dst_block_ids.chunks(engine_blocks_per_offload_block);
+let block_pairs: Vec<_> = src_block_ids
+    .iter()
+    .zip(dst_chunks)
+    .map(|(src, dst)| (SmallVec::from([*src]), SmallVec::from(dst)))
+    .collect();
lib/llm/src/block_manager/block/data/local.rs (4)

11-13: Fix typos in fragmentation comment.

Minor clarity nits: “block IDx” → “block IDs”; capitalize “Offloaded”; trim redundant clause.

Apply:

-    // The special case of multiple block IDx will only happen for small KV cache blocks on device.
-    // offloaded blocks will be large and `block_idxs.len() == 1` will hold for them.
+    // The special case of multiple block IDs will only happen for small KV cache blocks on device.
+    // Offloaded blocks will be large and `block_idxs.len() == 1`.

69-69: Prefer smallvec! macro for single-element init.

More idiomatic and avoids an iterator+collect.

Apply:

-            block_idxs: [block_idx].iter().map(|x| *x).collect(),
+            block_idxs: smallvec![block_idx],

And import the macro once:

 use smallvec::SmallVec;
+use smallvec::smallvec;

75-81: Avoid unnecessary clone of WorkerID.

WorkerID appears Copy (assigned by value elsewhere). Drop .clone() to reduce noise.

Apply:

     pub(crate) fn base(&self) -> LocalBlockDataBase<S> {
         LocalBlockDataBase {
             layout: self.layout.clone(),
             block_set_idx: self.block_set_idx,
-            worker_id: self.worker_id.clone(),
+            worker_id: self.worker_id,
         }
     }

11-14: Optional: consider renaming block_idxsblock_ids.

Field stores BlockId values, not indices. A rename would improve readability and align with method names (block_id, fragment_block_id). Non-functional, but helps future contributors.

Also applies to: 21-26, 45-54, 56-82, 88-106, 127-134, 160-211, 213-239

lib/bindings/python/rust/llm/block_manager/vllm/connector/leader.rs (2)

239-243: Tail handling: early-exit may starve small remainders.

Exiting when remaining < offload_page_size means requests with short tails won’t match until more tokens arrive or may never match for short prompts. If intended, fine; otherwise consider checking “tokens to next offload boundary” instead of full page.

Would you like a patch that computes:

  • let need = (self.offload_page_size - (num_computed_tokens % self.offload_page_size)) % self.offload_page_size;
  • return early only if remaining < need?
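For reference, the boundary arithmetic from the bullets above as a self-contained sketch (function name hypothetical):

```rust
/// Tokens needed to reach the next offload-page boundary.
/// Returns 0 when `num_computed_tokens` is already aligned.
fn tokens_to_next_offload_boundary(num_computed_tokens: usize, offload_page_size: usize) -> usize {
    (offload_page_size - (num_computed_tokens % offload_page_size)) % offload_page_size
}

fn main() {
    // With a 512-token offload page:
    assert_eq!(tokens_to_next_offload_boundary(0, 512), 0);   // already aligned
    assert_eq!(tokens_to_next_offload_boundary(500, 512), 12);
    assert_eq!(tokens_to_next_offload_boundary(513, 512), 511);
    println!("ok");
}
```

With this, the early exit fires only when the tail cannot even complete the current offload page, instead of requiring a full page of remaining tokens.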

99-104: Add page-size details to init log for easier debugging/telemetry.

Apply:

-        tracing::info!(
-            "KvConnectorLeader initialized with worker_id: {}",
-            worker_id
-        );
+        tracing::info!(
+            "KvConnectorLeader initialized with worker_id: {}, engine_page_size: {}, offload_page_size: {}",
+            worker_id, engine_page_size, offload_page_size
+        );
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 905c920 and 3d13117.

⛔ Files ignored due to path filters (2)
  • Cargo.lock is excluded by !**/*.lock
  • lib/bindings/python/Cargo.lock is excluded by !**/*.lock
📒 Files selected for processing (22)
  • lib/bindings/python/Cargo.toml (1 hunks)
  • lib/bindings/python/rust/llm/block_manager.rs (4 hunks)
  • lib/bindings/python/rust/llm/block_manager/distributed/worker.rs (1 hunks)
  • lib/bindings/python/rust/llm/block_manager/vllm.rs (9 hunks)
  • lib/bindings/python/rust/llm/block_manager/vllm/connector/leader.rs (10 hunks)
  • lib/bindings/python/rust/llm/block_manager/vllm/connector/leader/recorder.rs (3 hunks)
  • lib/bindings/python/rust/llm/block_manager/vllm/connector/leader/slot.rs (18 hunks)
  • lib/bindings/python/rust/llm/block_manager/vllm/connector/trtllm_leader.rs (8 hunks)
  • lib/bindings/python/rust/llm/block_manager/vllm/connector/trtllm_worker.rs (1 hunks)
  • lib/bindings/python/rust/llm/block_manager/vllm/connector/worker.rs (3 hunks)
  • lib/bindings/python/src/dynamo/llm/vllm_integration/connector_leader.py (1 hunks)
  • lib/llm/Cargo.toml (1 hunks)
  • lib/llm/src/block_manager.rs (5 hunks)
  • lib/llm/src/block_manager/block/data.rs (5 hunks)
  • lib/llm/src/block_manager/block/data/local.rs (8 hunks)
  • lib/llm/src/block_manager/block/data/logical.rs (5 hunks)
  • lib/llm/src/block_manager/block/transfer/cuda.rs (1 hunks)
  • lib/llm/src/block_manager/config.rs (1 hunks)
  • lib/llm/src/block_manager/distributed/transfer.rs (4 hunks)
  • lib/llm/src/block_manager/distributed/utils.rs (4 hunks)
  • lib/llm/src/block_manager/distributed/worker.rs (7 hunks)
  • lib/llm/src/block_manager/state/resources.rs (1 hunks)
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-05-29T06:20:12.901Z
Learnt from: ryanolson
PR: ai-dynamo/dynamo#1093
File: lib/llm/src/block_manager/block/registry.rs:98-122
Timestamp: 2025-05-29T06:20:12.901Z
Learning: In lib/llm/src/block_manager/block/registry.rs, the background task spawned for handling unregister notifications uses detached concurrency by design. The JoinHandle is intentionally not stored as this represents a reasonable architectural tradeoff for a long-running cleanup task.

Applied to files:

  • lib/bindings/python/rust/llm/block_manager/vllm/connector/leader/slot.rs
🧬 Code graph analysis (15)
lib/bindings/python/rust/llm/block_manager/vllm/connector/worker.rs (4)
lib/llm/src/block_manager/block/data.rs (1)
  • page_size (44-44)
lib/llm/src/block_manager/block/data/local.rs (1)
  • page_size (147-149)
lib/llm/src/block_manager/block/data/logical.rs (1)
  • page_size (122-124)
lib/llm/src/block_manager/layout.rs (1)
  • page_size (250-252)
lib/bindings/python/src/dynamo/llm/vllm_integration/connector_leader.py (1)
lib/bindings/python/rust/llm/block_manager.rs (1)
  • leader (259-262)
lib/llm/src/block_manager/distributed/worker.rs (2)
lib/bindings/python/rust/llm/block_manager.rs (2)
  • engine_page_size (251-254)
  • offload_page_size (255-258)
lib/llm/src/block_manager/state/resources.rs (1)
  • layout_builder (84-97)
lib/bindings/python/rust/llm/block_manager/distributed/worker.rs (4)
lib/llm/src/block_manager/block/data.rs (1)
  • page_size (44-44)
lib/llm/src/block_manager/block/data/local.rs (1)
  • page_size (147-149)
lib/llm/src/block_manager/block/data/logical.rs (1)
  • page_size (122-124)
lib/llm/src/block_manager/layout.rs (1)
  • page_size (250-252)
lib/bindings/python/rust/llm/block_manager/vllm.rs (2)
lib/bindings/python/rust/llm/block_manager.rs (4)
  • new (70-174)
  • new (239-245)
  • engine_block_size (176-178)
  • offload_block_size (180-182)
lib/bindings/python/rust/llm/block_manager/vllm/connector/leader/slot.rs (7)
  • new (190-228)
  • new (343-379)
  • new (1098-1111)
  • new (1122-1135)
  • new (1145-1155)
  • new (1489-1494)
  • new (1498-1503)
lib/bindings/python/rust/llm/block_manager/vllm/connector/trtllm_worker.rs (4)
lib/llm/src/block_manager/block/data.rs (1)
  • page_size (44-44)
lib/llm/src/block_manager/block/data/local.rs (1)
  • page_size (147-149)
lib/llm/src/block_manager/block/data/logical.rs (1)
  • page_size (122-124)
lib/llm/src/block_manager/layout.rs (1)
  • page_size (250-252)
lib/bindings/python/rust/llm/block_manager/vllm/connector/leader.rs (4)
lib/bindings/python/rust/llm/block_manager.rs (6)
  • engine_page_size (251-254)
  • offload_page_size (255-258)
  • worker_id (247-250)
  • leader (259-262)
  • new (70-174)
  • new (239-245)
lib/llm/src/block_manager.rs (3)
  • worker_id (165-167)
  • new (196-212)
  • new (245-264)
lib/bindings/python/rust/llm/block_manager/vllm/connector/leader/recorder.rs (1)
  • new (88-203)
lib/bindings/python/src/dynamo/llm/vllm_integration/connector_leader.py (1)
  • KvConnectorLeader (47-204)
lib/bindings/python/rust/llm/block_manager/vllm/connector/leader/recorder.rs (1)
lib/bindings/python/rust/llm/block_manager.rs (2)
  • engine_page_size (251-254)
  • offload_page_size (255-258)
lib/bindings/python/rust/llm/block_manager/vllm/connector/leader/slot.rs (3)
lib/bindings/python/rust/llm/block_manager/vllm.rs (4)
  • block_manager (76-78)
  • new (85-94)
  • new (333-357)
  • new (401-407)
lib/bindings/python/rust/llm/block_manager.rs (4)
  • engine_block_size (176-178)
  • offload_block_size (180-182)
  • new (70-174)
  • new (239-245)
lib/llm/src/block_manager.rs (5)
  • engine_block_size (135-137)
  • offload_block_size (140-142)
  • offload_block_size_ratio (145-147)
  • new (196-212)
  • new (245-264)
lib/llm/src/block_manager/block/data.rs (2)
lib/llm/src/block_manager/block/data/local.rs (6)
  • fragment_block_id (99-106)
  • is_fragmented (127-129)
  • num_fragments (131-133)
  • local_layer_view_fragment (173-186)
  • local_layer_view_mut (188-197)
  • local_layer_view_fragment_mut (199-211)
lib/llm/src/block_manager/block/data/logical.rs (3)
  • fragment_block_id (84-91)
  • is_fragmented (109-111)
  • num_fragments (113-115)
lib/bindings/python/rust/llm/block_manager/vllm/connector/trtllm_leader.rs (5)
lib/bindings/python/rust/llm/block_manager.rs (6)
  • engine_page_size (251-254)
  • offload_page_size (255-258)
  • worker_id (247-250)
  • leader (259-262)
  • new (70-174)
  • new (239-245)
lib/bindings/python/rust/llm/block_manager/vllm.rs (3)
  • new (85-94)
  • new (333-357)
  • new (401-407)
lib/bindings/python/rust/llm/block_manager/vllm/connector/leader.rs (2)
  • new (92-183)
  • new (558-587)
lib/bindings/python/rust/llm/block_manager/vllm/connector/leader/recorder.rs (1)
  • new (88-203)
lib/bindings/python/rust/llm/block_manager/vllm/connector/leader/slot.rs (6)
  • new (190-228)
  • new (343-379)
  • new (1098-1111)
  • new (1122-1135)
  • new (1145-1155)
  • new (1489-1494)
lib/llm/src/block_manager/distributed/utils.rs (1)
lib/llm/src/kv/layer.rs (2)
  • src (653-657)
  • dst (659-663)
lib/llm/src/block_manager/block/data/local.rs (2)
lib/llm/src/block_manager/block/data.rs (8)
  • worker_id (26-26)
  • fragment_block_id (17-17)
  • is_fragmented (35-35)
  • num_fragments (38-38)
  • storage_type (29-29)
  • local_layer_view_fragment (133-138)
  • local_layer_view_fragment_mut (148-153)
  • local_block_view (156-156)
lib/llm/src/block_manager/block/data/logical.rs (5)
  • worker_id (97-99)
  • fragment_block_id (84-91)
  • is_fragmented (109-111)
  • num_fragments (113-115)
  • storage_type (101-103)
lib/llm/src/block_manager/block/data/logical.rs (3)
lib/llm/src/block_manager/block/data.rs (4)
  • block_id (14-14)
  • fragment_block_id (17-17)
  • is_fragmented (35-35)
  • num_fragments (38-38)
lib/llm/src/block_manager/block/data/local.rs (4)
  • block_id (89-96)
  • fragment_block_id (99-106)
  • is_fragmented (127-129)
  • num_fragments (131-133)
lib/llm/src/block_manager/block.rs (2)
  • block_id (372-374)
  • block_id (835-839)
lib/bindings/python/rust/llm/block_manager.rs (1)
lib/llm/src/block_manager.rs (4)
  • engine_block_size (135-137)
  • offload_block_size (140-142)
  • new (196-212)
  • new (245-264)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
  • GitHub Check: pre-merge-rust (lib/bindings/python)
  • GitHub Check: pre-merge-rust (lib/runtime/examples)
  • GitHub Check: pre-merge-rust (.)
  • GitHub Check: Build and Test - dynamo
🔇 Additional comments (12)
lib/llm/src/block_manager/distributed/utils.rs (1)

6-6: LGTM: smallvec import.

Import is correct and feature-gated via serde in Cargo.

lib/llm/src/block_manager/block/data/logical.rs (1)

14-14: No action needed: smallvec is declared in lib/llm/Cargo.toml (v1.15.1 with serde feature).

lib/bindings/python/rust/llm/block_manager.rs (1)

176-182: Python-facing accessors for block sizes look good

Exposes the two-size model cleanly to Python. No concerns.

lib/bindings/python/rust/llm/block_manager/vllm.rs (3)

86-89: Wiring SlotManager with engine/offload sizes is correct

Good propagation of the two-size model into SlotManager.


562-565: ISL metrics split by device/host/disk look correct

Right multipliers: device uses engine size; host/disk use offload size.


292-293: Doc update aligns with two-size model

“num_new_tokens / engine_block_size” clarifies the device-side unit. Looks good.

lib/bindings/python/rust/llm/block_manager/vllm/connector/leader/slot.rs (1)

1196-1202: Review comment resolved: cloning VllmBlockManager is cheap (it derives Clone on an internal Arc), so no change needed.

lib/llm/src/block_manager/distributed/transfer.rs (1)

107-111: No changes needed: LocalBlockDataBase::get_data simply wraps the SmallVec of indices into LocalBlockData and delegates any fragmentation logic to LocalBlockData.

lib/llm/src/block_manager/block/data/local.rs (1)

45-54: LGTM: base → data constructor.

Clean separation via LocalBlockDataBase; cloning layout and reusing identifiers is correct.

lib/bindings/python/rust/llm/block_manager/vllm/connector/leader.rs (3)

251-254: LGTM: alignment assert for onboarding tokens.

The modulo against offload_page_size is the right invariant after matching.


130-138: LGTM: BlockManagerBuilder wiring with dual page sizes.

Builder calls align with the new two-size model; good to see .disable_device_pool(false) explicit.


556-585: LGTM: PyO3 signature updated consistently.

Constructor and recorder path accept both page sizes; matches downstream builder usage.

Comment on lines +233 to 235
offload_page_size: usize,
engine_page_size: usize,
disable_device_pool: bool,

🛠️ Refactor suggestion

offload_page_size is never used; risk of silent misconfiguration

The builder stores offload_page_size but never propagates it into KvBlockManagerConfig. This makes the field effectively dead and prevents callers from controlling the offload size via the builder.

Apply this diff inside build() to derive and set the ratio from the two sizes (with validation):

@@
         let model_config = dynamo_llm::block_manager::KvManagerModelConfig::builder()
             .num_layers(1)
             .outer_dim(1)
             .page_size(self.engine_page_size)
             .inner_dim(1)
             .build()?;
 
         config = config.model(model_config);
 
+        // Derive and apply offload ratio from sizes
+        if self.engine_page_size == 0 || self.offload_page_size == 0 {
+            return Err(anyhow::anyhow!("engine/offload page sizes must be > 0"));
+        }
+        if self.offload_page_size % self.engine_page_size != 0 {
+            return Err(anyhow::anyhow!(
+                "offload_page_size ({}) must be a multiple of engine_page_size ({})",
+                self.offload_page_size,
+                self.engine_page_size
+            ));
+        }
+        let ratio = self.offload_page_size / self.engine_page_size;
+        if !ratio.is_power_of_two() {
+            return Err(anyhow::anyhow!(
+                "offload/engine page-size ratio ({}) must be a power of two",
+                ratio
+            ));
+        }
+        // Requires KvBlockManagerConfigBuilder::offload_block_size_ratio(usize)
+        config = config.offload_block_size_ratio(ratio);

Additionally, consider validating an upper bound if required by pools (e.g., offload ≤ 1024).
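The ratio derivation from the diff above, extracted as a self-contained sketch (function name hypothetical; the real builder would return these as errors from build()):

```rust
/// Derive the offload/engine block-size ratio, validating the invariants
/// the builder should enforce: non-zero sizes, divisibility, power-of-two ratio.
fn derive_offload_ratio(engine_page_size: usize, offload_page_size: usize) -> Result<usize, String> {
    if engine_page_size == 0 || offload_page_size == 0 {
        return Err("engine/offload page sizes must be > 0".into());
    }
    if offload_page_size % engine_page_size != 0 {
        return Err(format!(
            "offload_page_size ({offload_page_size}) must be a multiple of engine_page_size ({engine_page_size})"
        ));
    }
    let ratio = offload_page_size / engine_page_size;
    if !ratio.is_power_of_two() {
        return Err(format!(
            "offload/engine page-size ratio ({ratio}) must be a power of two"
        ));
    }
    Ok(ratio)
}

fn main() {
    assert_eq!(derive_offload_ratio(32, 1024), Ok(32));
    assert!(derive_offload_ratio(32, 48).is_err()); // 48 is not a multiple of 32
    assert!(derive_offload_ratio(32, 96).is_err()); // ratio 3 is not a power of two
    println!("ok");
}
```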

Committable suggestion skipped: line range outside the PR's diff.

.num_layers(1)
.outer_dim(1)
.page_size(self.page_size)
.page_size(self.engine_page_size)

💡 Verification agent

🧩 Analysis chain

Only engine page size is propagated; offload is ignored

build() writes model.page_size but never configures offload_block_size_ratio, so offload size falls back to the global default regardless of the builder’s offload_page_size.

Follow the ratio derivation diff in my earlier comment to ensure offload_page_size actually takes effect.

To confirm the API path exists, run:


🏁 Script executed:

#!/bin/bash
# Verify the builder exposes offload_block_size_ratio and find its usage.
rg -n --glob 'lib/**' 'offload_block_size_ratio' -C2
# Confirm current usage of offload_page_size and engine_page_size.
rg -n --glob 'lib/**' 'offload_page_size|engine_page_size' -C2

Length of output: 38270


Propagate offload_page_size in Python binding build()
In lib/bindings/python/rust/llm/block_manager.rs around line 292, after

    .page_size(self.engine_page_size)

add

    .offload_page_size(self.offload_page_size)

so the builder applies the configured offload size rather than falling back to the default.

🤖 Prompt for AI Agents
In lib/bindings/python/rust/llm/block_manager.rs around line 292, the builder
call stops at .page_size(self.engine_page_size) and doesn't propagate the
configured offload size; add a chained
.offload_page_size(self.offload_page_size) immediately after the page_size call
so the builder uses the self.offload_page_size value (ensure the field exists
and is in scope) instead of falling back to the default.

Comment on lines 137 to +141
let config = KvbmWorkerConfig::builder()
.drt(drt)
.num_device_blocks(num_device_blocks)
.page_size(page_size)
.offload_page_size(page_size)
.engine_page_size(page_size)

⚠️ Potential issue

Configuration uses the same value for both page sizes

The configuration currently sets both offload_page_size and engine_page_size to the same value (page_size), which appears to contradict the PR's objective of allowing the host and disk pools to use larger block sizes than the device block pool. This likely requires the Python caller to pass separate values for these parameters.

Based on the PR objectives stating that offload blocks should be larger than device blocks, the Python bindings should accept two distinct page size parameters. Consider updating the function signature to accept both engine_page_size and offload_page_size:

-    #[pyo3(signature = (num_device_blocks, page_size, tensors, device_id=0, dtype_width_bytes=2, drt=None, layout_blocking=false))]
+    #[pyo3(signature = (num_device_blocks, engine_page_size, offload_page_size, tensors, device_id=0, dtype_width_bytes=2, drt=None, layout_blocking=false))]
     fn new(
         num_device_blocks: usize,
-        page_size: usize,
+        engine_page_size: usize,
+        offload_page_size: usize,
         tensors: Vec<Py<PyAny>>,

And update the configuration accordingly:

         let config = KvbmWorkerConfig::builder()
             .drt(drt)
             .num_device_blocks(num_device_blocks)
-            .offload_page_size(page_size)
-            .engine_page_size(page_size)
+            .offload_page_size(offload_page_size)
+            .engine_page_size(engine_page_size)
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
let config = KvbmWorkerConfig::builder()
.drt(drt)
.num_device_blocks(num_device_blocks)
.page_size(page_size)
.offload_page_size(page_size)
.engine_page_size(page_size)
// Update the Python binding signature to accept two page sizes
#[pyo3(signature = (num_device_blocks, engine_page_size, offload_page_size, tensors, device_id=0, dtype_width_bytes=2, drt=None, layout_blocking=false))]
fn new(
num_device_blocks: usize,
engine_page_size: usize,
offload_page_size: usize,
tensors: Vec<Py<PyAny>>,
device_id: usize,
dtype_width_bytes: usize,
drt: Option<...>,
layout_blocking: bool,
) -> PyResult<Self> {
// ...
let config = KvbmWorkerConfig::builder()
.drt(drt)
.num_device_blocks(num_device_blocks)
.offload_page_size(offload_page_size)
.engine_page_size(engine_page_size)
// ...
}
🤖 Prompt for AI Agents
In lib/bindings/python/rust/llm/block_manager/distributed/worker.rs around lines
137 to 141, the code currently assigns the same page_size to both
offload_page_size and engine_page_size; update the worker API and configuration
to accept distinct engine_page_size and offload_page_size values so offload
(host/disk) can be larger than device blocks. Change the Rust function signature
(and matching Python binding) to take two separate page size parameters, use
engine_page_size for .engine_page_size(...) and offload_page_size for
.offload_page_size(...), and update all callers/tests to pass the appropriate
values. Ensure input validation (e.g., offload >= device) or clear documentation
for callers if needed.

Comment on lines +432 to 433
Slot::new(tokens.into(), self.engine_block_size, salt_hash),
);

🛠️ Refactor suggestion

Slot initialization should carry both sizes to support offload hashing

Passing only engine_block_size makes it hard for Slot to produce offload-sized sequence hashes needed for host/disk caches.

If Slot::new can be extended, pass both sizes:

-                Slot::new(tokens.into(), self.engine_block_size, salt_hash),
+                Slot::new(
+                    tokens.into(),
+                    self.engine_block_size,
+                    self.offload_block_size,
+                    salt_hash,
+                ),

Otherwise, provide a method on Slot to compute offload-sized sequence hashes on demand.

📝 Committable suggestion


Suggested change
Slot::new(tokens.into(), self.engine_block_size, salt_hash),
);
@@ lib/bindings/python/rust/llm/block_manager/vllm.rs
Slot::new(
tokens.into(),
self.engine_block_size,
self.offload_block_size,
salt_hash,
),
);
🤖 Prompt for AI Agents
In lib/bindings/python/rust/llm/block_manager/vllm.rs around lines 432-433, the
Slot is being created with only engine_block_size which prevents Slot from
producing offload-sized sequence hashes; either extend Slot::new to accept both
sizes and change this call to pass engine_block_size and offload_block_size
(plus salt_hash), or if modifying Slot::new is undesirable, add a method on Slot
(e.g., compute_offload_sequence_hash(offload_block_size)) and call that after
construction so Slot can produce offload-sized sequence hashes for host/disk
caches.

Comment on lines +354 to +356
debug_assert!(offload_block_size.is_power_of_two() && offload_block_size <= 1024);
debug_assert!(engine_block_size.is_power_of_two() && engine_block_size <= 1024);
let sequence = TokenBlockSequence::new(tokens, offload_block_size as u32, Some(salt_hash));

⚠️ Potential issue

Add validation for block size relationship.

The code assumes that offload_block_size is a multiple of engine_block_size, but this is only checked with debug assertions. Add a runtime validation to prevent potential issues in release builds.

 let offload_block_size = block_manager.offload_block_size();
 let engine_block_size = block_manager.engine_block_size();
 let offload_block_size_ratio = block_manager.offload_block_size_ratio();
+
+// Validate block size relationship
+if offload_block_size % engine_block_size != 0 {
+    panic!("offload_block_size ({}) must be a multiple of engine_block_size ({})", 
+           offload_block_size, engine_block_size);
+}
+if offload_block_size / engine_block_size != offload_block_size_ratio {
+    panic!("Inconsistent block size ratio: {} / {} != {}", 
+           offload_block_size, engine_block_size, offload_block_size_ratio);
+}
+
 debug_assert!(offload_block_size.is_power_of_two() && offload_block_size <= 1024);
 debug_assert!(engine_block_size.is_power_of_two() && engine_block_size <= 1024);
📝 Committable suggestion


Suggested change
debug_assert!(offload_block_size.is_power_of_two() && offload_block_size <= 1024);
debug_assert!(engine_block_size.is_power_of_two() && engine_block_size <= 1024);
let sequence = TokenBlockSequence::new(tokens, offload_block_size as u32, Some(salt_hash));
let offload_block_size = block_manager.offload_block_size();
let engine_block_size = block_manager.engine_block_size();
let offload_block_size_ratio = block_manager.offload_block_size_ratio();
// Validate block size relationship
if offload_block_size % engine_block_size != 0 {
panic!(
"offload_block_size ({}) must be a multiple of engine_block_size ({})",
offload_block_size,
engine_block_size
);
}
if offload_block_size / engine_block_size != offload_block_size_ratio {
panic!(
"Inconsistent block size ratio: {} / {} != {}",
offload_block_size,
engine_block_size,
offload_block_size_ratio
);
}
debug_assert!(offload_block_size.is_power_of_two() && offload_block_size <= 1024);
debug_assert!(engine_block_size.is_power_of_two() && engine_block_size <= 1024);
let sequence = TokenBlockSequence::new(tokens, offload_block_size as u32, Some(salt_hash));

pub from_pool: BlockTransferPool,
pub to_pool: BlockTransferPool,
pub blocks: Vec<(usize, usize)>,
pub blocks: Vec<(SmallVec<[usize; 1]>, SmallVec<[usize; 1]>)>,

💡 Verification agent

🧩 Analysis chain

Wire format breaking change for blocks; add serde compat or version the message.

Changing blocks to Vec<(SmallVec<[usize;1]>, SmallVec<[usize;1]>)> alters the serialized shape from [[src,dst], …] to [[[...srcs],[...dsts]], …]. Any older producer/consumer (Rust/Python/TRTLLM) deserializing the old shape will break.

Two options:

  • Back-compat serde adapters on this field to accept both shapes (old pair and new fragments).
  • Bump protocol/schema version for ZMQ_TRANSFER_BLOCKS_MESSAGE and migrate all call sites atomically.

I can provide a serde adapter so we accept both without changing external callers. Would you like me to draft it?


🏁 Script executed:

#!/bin/bash
# Locate (de)serializers and external users of transfer_blocks to assess impact.
rg -n 'transfer_blocks|BlockTransferRequest|blocks' -C3

Length of output: 307985


Wire-format breaking change: make BlockTransferRequest.blocks backward-compatible or version the message

Changing blocks to Vec<(SmallVec<[usize;1]>, SmallVec<[usize;1]>)> changes the JSON shape from [[src,dst], …] to [[[srcs],[dsts]], …]; incoming ZMQ/serde_json deserialization will break existing producers/consumers.

Locations:

  • Field definition: lib/llm/src/block_manager/distributed/utils.rs:34–41 (blocks at line 37; ZMQ_TRANSFER_BLOCKS_MESSAGE const at line 11).
  • Deserializer (incoming message): lib/llm/src/block_manager/distributed/transfer.rs:157 (serde_json::from_slice::).
  • Sender/usage sites: lib/llm/src/block_manager/distributed/leader.rs:397–402 (transfer_blocks_request) and callers in lib/llm/src/block_manager/distributed.rs (e.g. around lines 178–206).

Recommended actions (choose one):

  • Add a serde-compatible adapter on the blocks field to accept both legacy pairs [src,dst] and the new pair-of-arrays form (use #[serde(deserialize_with = "...")] with a Visitor that accepts both shapes). This keeps wire-compatibility.
  • Bump the protocol/schema for ZMQ_TRANSFER_BLOCKS_MESSAGE and roll an atomic migration across all producers/consumers.

I will implement the serde adapter if you want it drafted.

🤖 Prompt for AI Agents
In lib/llm/src/block_manager/distributed/utils.rs around line 37, the blocks
field type change altered the JSON wire shape and will break existing
producers/consumers; add a serde-compatible adapter to deserialize both legacy
pairs [src,dst] and the new pair-of-arrays form [[[srcs],[dsts]]]. Implement a
deserialize_with function (in the same module or a nearby utils module) that
provides a Visitor which accepts either (a) a sequence of two integers (map to
SmallVec<[usize;1]> for each side) or (b) a sequence of two sequences of
integers (map each inner sequence to SmallVec<[usize;1]>), and use
#[serde(deserialize_with = "your_adapter")] on the blocks field so incoming
ZMQ/serde_json::from_slice::<BlockTransferRequest> accepts both formats while
preserving the existing type; alternatively, if you prefer schema bumping,
increment the protocol version for ZMQ_TRANSFER_BLOCKS_MESSAGE and coordinate
migrations across sender/receiver sites instead of adding the adapter.
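A std-only sketch of the dual-shape normalization (all names hypothetical; the real adapter would be a serde Visitor on the blocks field, with Node standing in for parsed JSON values):

```rust
#[derive(Debug, PartialEq)]
enum Node {
    Int(usize),
    List(Vec<Node>),
}

/// Accept a `blocks` entry in either wire shape:
/// legacy `[src, dst]` or fragmented `[[srcs...], [dsts...]]`.
fn normalize_pair(pair: &[Node]) -> Result<(Vec<usize>, Vec<usize>), String> {
    let ints = |ns: &[Node]| -> Result<Vec<usize>, String> {
        ns.iter()
            .map(|n| match n {
                Node::Int(v) => Ok(*v),
                Node::List(_) => Err("expected integer block id".to_string()),
            })
            .collect()
    };
    match pair {
        // Legacy shape: a pair of integers becomes single-element fragment lists.
        [Node::Int(src), Node::Int(dst)] => Ok((vec![*src], vec![*dst])),
        // New shape: a pair of integer arrays.
        [Node::List(srcs), Node::List(dsts)] => Ok((ints(srcs)?, ints(dsts)?)),
        _ => Err("malformed blocks entry".to_string()),
    }
}

fn main() {
    let legacy = [Node::Int(3), Node::Int(7)];
    assert_eq!(normalize_pair(&legacy), Ok((vec![3], vec![7])));

    let fragmented = [
        Node::List(vec![Node::Int(3), Node::Int(4)]),
        Node::List(vec![Node::Int(7)]),
    ];
    assert_eq!(normalize_pair(&fragmented), Ok((vec![3, 4], vec![7])));
    println!("ok");
}
```

Mapped onto serde, the same two match arms become the Visitor's two accepted sequence shapes, so old producers keep working unmodified.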

Comment on lines 93 to +97
#[builder(default = "32")]
page_size: usize,
engine_page_size: usize,

#[builder(default = "32")]
offload_page_size: usize,

💡 Verification agent

🧩 Analysis chain

Consider validating the offload_block_size_ratio relationship

The configuration now has separate engine_page_size and offload_page_size fields, but based on the PR objectives and the code snippet from lib/llm/src/block_manager/state/resources.rs, there's an offload_block_size_ratio that should maintain a fixed ratio between these sizes.


🏁 Script executed:

#!/bin/bash
# Check how offload_block_size_ratio is defined and used
ast-grep --pattern 'offload_block_size_ratio'

# Look for any validation of the relationship between engine and offload sizes
rg -n "engine_page_size|offload_page_size" --type rust -C3 | grep -E "(assert|validate|check|verify)"

Length of output: 1790


Validate offload_page_size maintains the configured ratio
Add a check in the WorkerConfig builder (lib/llm/src/block_manager/distributed/worker.rs, lines 93–97) or in build() to assert that

offload_page_size == engine_page_size * offload_block_size_ratio

Currently no validation exists to enforce this relationship (see usages in lib/llm/src/block_manager.rs), so misconfigured values can produce incorrect offload block sizes.

🤖 Prompt for AI Agents
In lib/llm/src/block_manager/distributed/worker.rs around lines 93–97, the
builder sets engine_page_size and offload_page_size but lacks validation that
offload_page_size == engine_page_size * offload_block_size_ratio; add a check in
the WorkerConfig builder's build() (or the builder validation hook) that
computes expected_offload = engine_page_size * offload_block_size_ratio and
compares it to offload_page_size, returning a clear Err/Result (or panic with a
descriptive error) when they differ so misconfigurations cannot produce
incorrect offload block sizes.

};
let num_layers = device_tensors.len();
let inner_dim = shape[2..].iter().product::<usize>() / config.page_size;
let inner_dim = shape[2..].iter().product::<usize>() / config.engine_page_size;

⚠️ Potential issue

Inconsistent page_size usage in inner_dim calculation

The inner_dim calculation uses engine_page_size for non-fully-contiguous layouts (line 177) but offload_page_size for fully-contiguous layouts (line 196). This inconsistency could lead to incorrect memory layout calculations.

Based on the layout type distinction, it appears intentional that different layouts use different page sizes. However, the logging on line 201 incorrectly shows engine_page_size when the calculation used offload_page_size. Fix the logging:

             tracing::info!(
                 "Inferred layout: num_layers={}, outer_dim={}, page_size={}, inner_dim={}",
                 num_layers,
                 outer_dim,
-                config.engine_page_size,
+                config.offload_page_size,
                 inner_dim
             );

Additionally, consider adding a comment explaining why different page sizes are used for different layout types to prevent future confusion.

Also applies to: 196-196

🤖 Prompt for AI Agents
In lib/llm/src/block_manager/distributed/worker.rs around lines 177 and 196, the
inner_dim calculation uses different page sizes for different layouts
(engine_page_size for non-fully-contiguous at line 177 and offload_page_size for
fully-contiguous at line 196) but the log at ~201 prints engine_page_size even
when offload_page_size was used; update the log to print the correct page size
depending on which branch computed inner_dim (use offload_page_size for the
fully-contiguous branch and engine_page_size for the other), and add a short
comment by each branch explaining why different page sizes are chosen for
fully-contiguous vs non-fully-contiguous layouts to prevent future confusion.
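To make the branch discrepancy concrete, a sketch of the inner_dim arithmetic (the tensor shape and page sizes here are illustrative, not taken from the worker code):

```rust
/// Per-token inner dimension: the product of the trailing shape dims
/// divided by the page size the layout is built with.
fn inner_dim(shape: &[usize], page_size: usize) -> usize {
    shape[2..].iter().product::<usize>() / page_size
}

fn main() {
    // Hypothetical tensor shape: [outer, num_blocks, page, heads, head_dim].
    let shape = [2, 128, 32, 8, 64];
    let engine_page_size = 32;
    let offload_page_size = 32 * 32; // ratio of 32

    // The non-fully-contiguous branch divides by the engine page size...
    assert_eq!(inner_dim(&shape, engine_page_size), 512); // 32*8*64 / 32
    // ...while the fully-contiguous branch divides by the offload page size,
    // yielding a 32x smaller inner_dim for the same tensors.
    assert_eq!(inner_dim(&shape, offload_page_size), 16);
    println!("ok");
}
```

This is why the log line must report whichever page size the taken branch actually divided by.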

.create_layout(layout_type, device_tensors)?;

let layout_builder = layout_builder.clone();
let layout_builder = layout_builder.page_size(config.offload_page_size).clone();

🛠️ Refactor suggestion

Builder pattern mutation after build

Line 232 modifies the builder after it has already been used to build a layout on line 230. This is unusual and potentially error-prone as it violates the typical builder pattern where builders are consumed or finalized after build().

Consider creating a new builder instance for the offload configuration instead of mutating the existing one:

-        let layout_builder = layout_builder.page_size(config.offload_page_size).clone();
+        let offload_layout_builder = LayoutConfigBuilder::default()
+            .num_layers(num_layers)
+            .outer_dim(outer_dim)
+            .page_size(config.offload_page_size)
+            .inner_dim(inner_dim)
+            .dtype_width_bytes(config.dtype_width_bytes);

Then pass offload_layout_builder to the initialization functions.

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
-        let layout_builder = layout_builder.page_size(config.offload_page_size).clone();
+        let offload_layout_builder = LayoutConfigBuilder::default()
+            .num_layers(num_layers)
+            .outer_dim(outer_dim)
+            .page_size(config.offload_page_size)
+            .inner_dim(inner_dim)
+            .dtype_width_bytes(config.dtype_width_bytes);
🤖 Prompt for AI Agents
In lib/llm/src/block_manager/distributed/worker.rs around lines 230-232, the
code calls build() on a layout builder and then mutates that same builder
(page_size(...)) on line 232 which breaks the builder pattern; instead, create a
separate builder instance for the offload configuration (e.g., clone or
construct a new LayoutBuilder with the same base settings), set the offload
page_size on that new builder, and pass the new offload_layout_builder into the
offload initialization functions so the original builder is not mutated after
build.
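The non-mutating alternative can be illustrated with a toy consuming builder (a hypothetical stand-in; the real LayoutConfigBuilder has more fields): clone the base builder for each layout instead of reusing one builder after build().

```rust
// Toy builder, hypothetical: stands in for the repo's LayoutConfigBuilder.
#[derive(Clone, Default)]
struct LayoutConfigBuilder {
    num_layers: usize,
    page_size: usize,
}

struct LayoutConfig {
    num_layers: usize,
    page_size: usize,
}

impl LayoutConfigBuilder {
    fn num_layers(mut self, v: usize) -> Self { self.num_layers = v; self }
    fn page_size(mut self, v: usize) -> Self { self.page_size = v; self }
    fn build(self) -> LayoutConfig {
        LayoutConfig { num_layers: self.num_layers, page_size: self.page_size }
    }
}

fn main() {
    // Shared base settings, built once.
    let base = LayoutConfigBuilder::default().num_layers(32).page_size(16);

    // Build the device layout from a clone, leaving `base` untouched...
    let device = base.clone().build();

    // ...then derive the offload builder from the same base, rather than
    // mutating a builder that already produced a layout.
    let offload = base.page_size(16 * 32).build();

    assert_eq!(device.page_size, 16);
    assert_eq!(offload.page_size, 512);
    assert_eq!(device.num_layers, offload.num_layers);
}
```

With consuming builders this pattern is enforced by the compiler: once `build()` takes `self`, the post-build mutation flagged in the comment above would not type-check.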

.num_layers(model.num_layers)
.outer_dim(model.outer_dim)
.page_size(model.page_size)
.page_size(model.page_size * self.config.offload_block_size_ratio)

💡 Verification agent

🧩 Analysis chain

Critical: inflates device page size as well; offload-only ratio must not affect engine/device.

layout_builder is generic and appears to seed both device and offload layouts. Multiplying here will make device pages too large, contradicting the PR goal (device small, host/disk large).

Minimal fix (keep engine/device at base size here; apply the ratio only in host/disk-specific builders/paths):

-            .page_size(model.page_size * self.config.offload_block_size_ratio)
+            .page_size(model.page_size)

Run to find usages that expect offload-sized pages and adjust them to multiply locally:

I can draft host/disk-specific builders (e.g., layout_builder_offload) if helpful.


🏁 Script executed:

#!/bin/bash
rg -n -C3 '\blayout_builder\(' lib/llm | sed -n '1,200p'

Length of output: 1849


Remove offload ratio from generic layout_builder

  • In lib/llm/src/block_manager/state/resources.rs (line 92), change
    - .page_size(model.page_size * self.config.offload_block_size_ratio)
    + .page_size(model.page_size)
  • Apply offload_block_size_ratio only in host/disk–specific layout builders (i.e. after calling resources.layout_builder() in offload code paths) so that device pages remain at the base size.
📝 Committable suggestion


Suggested change
-            .page_size(model.page_size * self.config.offload_block_size_ratio)
+            .page_size(model.page_size)
🤖 Prompt for AI Agents
In lib/llm/src/block_manager/state/resources.rs around line 92, the generic
layout_builder is applying offload_block_size_ratio to the device page size
(".page_size(model.page_size * self.config.offload_block_size_ratio)"); remove
that multiplication so the generic builder uses the base page size (e.g.,
".page_size(model.page_size)"). Then apply the offload_block_size_ratio only in
host/disk-specific layout builders—after calling resources.layout_builder() in
offload code paths—by multiplying the returned layout/page_size there so device
pages remain at the base size while offloaded layouts use the scaled size.
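The intended split can be sketched with two helpers (hypothetical names; the ratio default of 32 comes from the PR summary): the generic builder keeps the engine page size, and the ratio is applied only where host/disk layouts are constructed.

```rust
// Hypothetical sketch of the device/offload page-size split.
const OFFLOAD_BLOCK_SIZE_RATIO: usize = 32; // default ratio per the PR summary

fn engine_page_size(base: usize) -> usize {
    // Device/engine layouts stay at the base size.
    base
}

fn offload_page_size(base: usize) -> usize {
    // Host/disk layouts use larger blocks for bigger IOs.
    base * OFFLOAD_BLOCK_SIZE_RATIO
}

fn main() {
    let base = 16;
    println!(
        "engine={} offload={}",
        engine_page_size(base),
        offload_page_size(base)
    );
}
```

Keeping the multiplication out of the shared builder means the offload ratio cannot accidentally leak into device layouts, which is exactly the regression the comment above flags.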

Labels
external-contribution Pull request is from an external contributor size/XL