Draft: kvbm - bigger page size for offload block pools only #2986
Conversation
This change allows the host and disk pools to use larger blocks than the device block pool, with a fixed block-size ratio between the two pool classes. During offload and onboarding of blocks, small device blocks are united into large offload blocks, and large offloaded blocks are split back into small device blocks. The advantage conferred by this change is more efficient management of offloaded blocks: specifically, when transferring back and forth from disk, larger IOs can always be used.

To do:
- Fix onboarding from disk to device
- Fix support for TRT-LLM
- Add unit tests

Signed-off-by: Dan Aloni <[email protected]>
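As a sketch of the bookkeeping this implies (illustrative Rust, not the PR's actual API), uniting and splitting under a fixed ratio reduces to grouping and re-flattening device block IDs:

```rust
/// Offload: unite `ratio` small device blocks into one large offload block.
fn unite(device_block_ids: &[usize], ratio: usize) -> Vec<Vec<usize>> {
    assert_eq!(device_block_ids.len() % ratio, 0, "offload whole blocks only");
    device_block_ids.chunks(ratio).map(|c| c.to_vec()).collect()
}

/// Onboarding: each large offload block fans back out into `ratio` device blocks.
fn split(offload_groups: &[Vec<usize>]) -> Vec<usize> {
    offload_groups.iter().flatten().copied().collect()
}

fn main() {
    // e.g. offload_page_size / engine_page_size = 1024 / 32 = 32
    let ratio = 1024 / 32;
    let device_ids: Vec<usize> = (0..64).collect();
    let groups = unite(&device_ids, ratio);
    assert_eq!(groups.len(), 2); // two offload-sized blocks
    assert_eq!(split(&groups), device_ids); // round-trips back to device IDs
}
```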
👋 Hi da-x! Thank you for contributing to ai-dynamo/dynamo.
Walkthrough

Introduces separate engine and offload block/page sizes across the KV block manager, the vLLM/TRT-LLM connectors, and the Python bindings. Adds fragmentation support using SmallVec, refactors transfer logic to operate on fragment-aware data, and updates builders/configs and accessors accordingly. Updates the Python integration to pass both sizes. Adds the smallvec dependency.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    autonumber
    participant Py as Python (vLLM/TRT-LLM)
    participant PyBind as PyO3 Binding
    participant Bld as BlockManagerBuilder
    participant KBM as KvBlockManager
    participant Slot as SlotManager
    Note over Py,PyBind: Initialization with dual sizes
    Py->>PyBind: new(engine_page_size, offload_page_size, leader)
    PyBind->>Bld: .engine_page_size(eps).offload_page_size(ops)
    Bld->>KBM: build() with eps, ops, ratio
    KBM-->>PyBind: KvBlockManager{engine, offload, ratio}
    PyBind->>Slot: new(engine_block_size, offload_block_size)
    Slot-->>Py: ready
    rect rgba(200,230,255,0.3)
        Note right of Slot: Token/block math uses engine size on device<br/>and offload size on host/disk
    end
```
```mermaid
sequenceDiagram
    autonumber
    participant Src as Source BlockData
    participant Dst as Dest BlockData
    participant Xfer as CUDA copy_layers
    Note over Src,Dst: Fragment-aware transfer
    Xfer->>Src: is_fragmented?
    alt Src fragmented
        loop for each src fragment
            Xfer->>Dst: copy fragment -> contiguous dst region (advance dst)
        end
    else Dst fragmented
        loop for each dst fragment
            Xfer->>Dst: copy contiguous src -> dst fragment (advance src)
        end
    else
        Xfer->>Dst: single memcpy
    end
```
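The control flow the second diagram sketches can be summarized in plain Rust. `BlockRef` and `copy_block` below are illustrative placeholders (the real code operates on BlockData views with CUDA copies); the dst-fragmented branch is symmetric to the src one:

```rust
/// Illustrative placeholder for a block that may be split into fragments.
struct BlockRef<'a> {
    fragments: Vec<&'a [u8]>, // exactly one entry when not fragmented
}

impl<'a> BlockRef<'a> {
    fn is_fragmented(&self) -> bool {
        self.fragments.len() > 1
    }
}

/// Unite small source fragments into one contiguous destination block.
/// The dst-fragmented case (splitting) mirrors this loop with roles swapped.
fn copy_block(src: &BlockRef, dst: &mut [u8]) {
    if src.is_fragmented() {
        // Copy each small source fragment into the next contiguous
        // region of the large destination, advancing the offset as we go.
        let mut offset = 0;
        for frag in &src.fragments {
            dst[offset..offset + frag.len()].copy_from_slice(frag);
            offset += frag.len();
        }
    } else {
        // Neither side fragmented: a single contiguous copy suffices.
        dst[..src.fragments[0].len()].copy_from_slice(src.fragments[0]);
    }
}
```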
Estimated code review effort: 🎯 5 (Critical) | ⏱️ ~120 minutes
Pre-merge checks (2 passed, 1 warning)
❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
Actionable comments posted: 22
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (6)
lib/bindings/python/rust/llm/block_manager/vllm/connector/worker.rs (1)
166-176: Compute/validate sizes and avoid overflow on multiplication. Use checked_mul and enforce `offload_page_size % engine_page_size == 0` before building the config.
Apply this diff:

```diff
@@
-        let config = KvbmWorkerConfig::builder()
+        // Derive and validate sizes
+        let engine_page_size = page_size;
+        let offload_page_size = engine_page_size
+            .checked_mul(self.offload_block_size_ratio)
+            .ok_or_else(|| anyhow::anyhow!(
+                "offload_page_size overflow: engine_page_size={} ratio={}",
+                engine_page_size, self.offload_block_size_ratio
+            ))?;
+        debug_assert!(
+            offload_page_size % engine_page_size == 0,
+            "offload_page_size must be a multiple of engine_page_size"
+        );
+
+        let config = KvbmWorkerConfig::builder()
             .drt(self.drt.clone())
             .num_device_blocks(num_device_blocks)
-            .offload_page_size(page_size * self.offload_block_size_ratio)
-            .engine_page_size(page_size)
+            .offload_page_size(offload_page_size)
+            .engine_page_size(engine_page_size)
             .tensors(vllm_tensors)
             .device_id(device_id)
             .dtype_width_bytes(dtype_width_bytes)
             .barrier_id_prefix(get_barrier_id_prefix())
             .scheduler_client(Some(self.transfer_client.clone()))
             .build()?;
```

Also applies to: 169-171
lib/bindings/python/rust/llm/block_manager/vllm/connector/leader/recorder.rs (1)
87-96: Validate engine/offload sizes and log the ratio at construction. Guard against zero/invalid sizes and non-integer ratios to prevent downstream inconsistencies.
Apply this diff:

```diff
@@
     pub fn new(
         worker_id: String,
         drt: PyDistributedRuntime,
         engine_page_size: usize,
         offload_page_size: usize,
         leader_py: PyKvbmLeader,
     ) -> Self {
@@
-        let kvbm_metrics = KvbmMetrics::new(&ns);
+        // Validate sizes upfront
+        assert!(engine_page_size > 0 && offload_page_size > 0, "page sizes must be > 0");
+        assert!(
+            offload_page_size % engine_page_size == 0,
+            "offload_page_size ({}) must be a multiple of engine_page_size ({})",
+            offload_page_size, engine_page_size
+        );
+        let ratio = offload_page_size / engine_page_size;
+        tracing::info!("engine_page_size={} offload_page_size={} ratio={}", engine_page_size, offload_page_size, ratio);
+
+        let kvbm_metrics = KvbmMetrics::new(&ns);
@@
         let block_manager = match BlockManagerBuilder::new()
             .worker_id(0)
             .leader(leader_py)
             .engine_page_size(engine_page_size)
             .offload_page_size(offload_page_size)
             .disable_device_pool(false)
             .build()
             .await
```

Also applies to: 147-149
lib/bindings/python/rust/llm/block_manager/vllm.rs (1)
584-682: Switch matching logic to offload-sized blocks. get_num_new_matched_tokens currently uses engine_block_size for hashing, offsets, and token arithmetic, but the host()/disk() pools operate on offload_block_size. Update to:

- compute completed offload blocks: `num_completed_offload_blocks = num_computed_tokens / self.offload_block_size`
- call `slot.sequence_hashes_offload()` instead of `slot.sequence_hashes(...)`
- set `search_offset = num_completed_offload_blocks`
- calculate `num_new_matched_tokens = num_matched_blocks * self.offload_block_size`
- subtract `self.offload_block_size` in the boundary-pop case

Add a `sequence_hashes_offload()` API if missing. A sketch of the adjusted arithmetic follows below.
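A minimal sketch of that offload-sized arithmetic, assuming `offload_block_size = engine_block_size * ratio` (the function below is illustrative, not the connector's actual API):

```rust
// Hypothetical condensation of the bullet list above.
fn offload_match_math(
    num_computed_tokens: usize,
    num_matched_blocks: usize,
    offload_block_size: usize,
) -> (usize, usize) {
    // Completed offload blocks double as the search offset into the hash list.
    let num_completed_offload_blocks = num_computed_tokens / offload_block_size;
    let search_offset = num_completed_offload_blocks;
    // Matches are counted in offload-sized blocks, so tokens scale by that size.
    let num_new_matched_tokens = num_matched_blocks * offload_block_size;
    (search_offset, num_new_matched_tokens)
}
```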
lib/bindings/python/rust/llm/block_manager/vllm/connector/leader/slot.rs (1)

549-582: Add tests and document the alignment logic in slot.rs.
File: lib/bindings/python/rust/llm/block_manager/vllm/connector/leader/slot.rs (549-582)

- No existing tests cover the `aligned_candidates` calculation or the `offload_block_size_ratio` path. Add unit tests for:
  - `num_candidate_blocks` divisible by `offload_block_size_ratio`
  - `num_candidate_blocks` with a non-zero remainder
  - `num_candidate_blocks` smaller than `offload_block_size_ratio`
- Insert inline comments clarifying the math, for example (a test sketch for these cases follows below):

```rust
// Calculate the largest multiple of offload_block_size_ratio
// that does not exceed num_candidate_blocks
let aligned_candidates =
    (num_candidate_blocks / self.offload_block_size_ratio) * self.offload_block_size_ratio;
```
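A minimal test sketch for the three cases, with the alignment math pulled into a free function so it can be exercised standalone (a hypothetical helper, not existing code):

```rust
// Hypothetical standalone version of the alignment calculation.
fn aligned_candidates(num_candidate_blocks: usize, ratio: usize) -> usize {
    (num_candidate_blocks / ratio) * ratio
}

#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn alignment_cases() {
        assert_eq!(aligned_candidates(64, 32), 64); // exactly divisible
        assert_eq!(aligned_candidates(70, 32), 64); // non-zero remainder rounds down
        assert_eq!(aligned_candidates(7, 32), 0); // smaller than the ratio
    }
}
```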
lib/bindings/python/rust/llm/block_manager/vllm/connector/leader.rs (2)

112-114: Assert ratio invariants between engine/offload page sizes. Offload must be a multiple of the engine page size; assert early to catch misconfiguration before scheduling.
Apply:

```diff
         let kvbm_metrics = KvbmMetrics::new(&ns);
         let kvbm_metrics_clone = kvbm_metrics.clone();
+
+        debug_assert!(
+            engine_page_size > 0 && offload_page_size > 0,
+            "page sizes must be > 0"
+        );
+        debug_assert!(
+            offload_page_size % engine_page_size == 0,
+            "offload_page_size ({}) must be a multiple of engine_page_size ({})",
+            offload_page_size,
+            engine_page_size
+        );
```
295-301: Guard against underflow and enforce invariants in the computed-tokens math. If inputs ever drift, this can go negative or misalign silently. Add debug asserts.
Apply:

```diff
         // the second call will show num_external_tokens == 0
         // this call is just letting us know the other blocks that are being used for the remainder of the prefill
         if num_external_tokens > 0 {
+            debug_assert!(
+                !block_ids.is_empty(),
+                "num_external_tokens > 0 requires at least one device block"
+            );
+            debug_assert!(
+                num_external_tokens <= block_ids.len() * self.engine_page_size,
+                "num_external_tokens ({}) exceeds capacity of provided device blocks ({} * {})",
+                num_external_tokens,
+                block_ids.len(),
+                self.engine_page_size
+            );
             let num_computed_tokens = block_ids.len() * self.engine_page_size - num_external_tokens;
             slot.record_cached_device_tokens(num_computed_tokens);
             slot.advance_computed_position(num_computed_tokens)?;
```
🧹 Nitpick comments (27)
lib/bindings/python/Cargo.toml (1)
55-55: Unify the smallvec version across the workspace (or gate features). Looks fine, but please ensure this exact version/feature set matches other crates (llm also adds smallvec). Consider moving it to [workspace.dependencies] to prevent drift; if not, keep both crates pinned to the same version/features to avoid duplicate resolutions.
lib/llm/Cargo.toml (1)
159-159: Match the Python binding's smallvec and consider workspace unification. Same note here: align with lib/bindings/python to the exact version/features, or promote to [workspace.dependencies] for consistency.
lib/llm/src/block_manager/config.rs (1)
210-215: Name/docs mismatch vs. usage; add guardrails. The field is named "offload_block_size_ratio" but is used to scale page_size elsewhere. Consider renaming it to offload_page_size_ratio or clarifying the docs to say "page size", and add upper-bound validation (or a cfg-driven limit) to prevent pathological ratios.
Apply this doc tweak locally:

```diff
-    /// Ratio between offload block size and engine block size
-    /// Offload blocks are larger to enable more efficient I/O operations
+    /// Ratio between offload page size and engine page size
+    /// Offload pages are larger to enable more efficient I/O operations
```

lib/llm/src/block_manager/distributed/utils.rs (2)
53-61: Use the smallvec! macro for clarity and a tiny perf win. Constructing a single-element SmallVec via an iterator is noisy.

```diff
-            blocks: blocks
-                .into_iter()
-                .map(|(src, dst)| {
-                    (
-                        [src].iter().map(|x| *x).collect(),
-                        [dst].iter().map(|x| *x).collect(),
-                    )
-                })
-                .collect(),
+            blocks: blocks
+                .into_iter()
+                .map(|(src, dst)| (smallvec![src], smallvec![dst]))
+                .collect(),
```
75-83: Ditto: simplify the constructor mapping with the smallvec! macro. Same simplification here.

```diff
-            blocks: blocks
-                .into_iter()
-                .map(|(src, dst)| {
-                    (
-                        [src].iter().map(|x| *x).collect(),
-                        [dst].iter().map(|x| *x).collect(),
-                    )
-                })
-                .collect(),
+            blocks: blocks
+                .into_iter()
+                .map(|(src, dst)| (smallvec![src], smallvec![dst]))
+                .collect(),
```

lib/llm/src/block_manager.rs (2)
114-117: Avoid storing redundant derived state, or assert invariants to prevent drift. You store engine_block_size, offload_block_size, and the ratio. Either:

- compute offload_block_size on demand from engine_block_size * offload_block_size_ratio, or
- assert invariants at construction (offload_block_size >= engine_block_size, divisible, matches ratio).

Given this struct is Clone and long-lived, a future mutation could desync the fields. Apply simple invariant checks during construction (see the suggested diffs on new()).
135-147: Clarify accessor docs and units. Both engine_block_size() and offload_block_size() are documented as "Get the block size." Be explicit about engine/device vs. offload/host-disk and units (bytes).

Apply:

```diff
-    /// Get the block size
+    /// Get the engine (device) block size in bytes.
     pub fn engine_block_size(&self) -> usize {
         self.engine_block_size
     }

-    /// Get the block size
+    /// Get the offload (host/disk) block size in bytes.
     pub fn offload_block_size(&self) -> usize {
         self.offload_block_size
     }

-    /// Get the offload block size ratio
+    /// Get the offload block-size ratio (offload/engine). Always >= 1.
     pub fn offload_block_size_ratio(&self) -> usize {
         self.offload_block_size_ratio
     }
```

lib/bindings/python/rust/llm/block_manager/vllm/connector/trtllm_worker.rs (1)
133-135: Use a distinct offload_page_size for TRT-LLM. TRT-LLM currently sets both `offload_page_size` and `engine_page_size` to the same `page_size`, forfeiting potential larger offload I/O. Consider plumbing a separate `offload_page_size` (e.g. `page_size * offload_block_size_ratio`) to match vLLM's behavior.

lib/bindings/python/rust/llm/block_manager/vllm/connector/worker.rs (1)
74-76: Make offload_block_size_ratio configurable and validated against the leader. A hard-coded ratio (32) risks drift from the leader's explicit engine/offload sizes. Expose it via config/constructor or derive it from leader metadata, and validate consistency at registration time.

Would you like me to wire a setter from Python or plumb it from the leader metadata?
Also applies to: 121-121
lib/bindings/python/rust/llm/block_manager/vllm/connector/leader/recorder.rs (1)
188-196: Surface sizes via metrics or getters to aid debugging. Consider exporting the engine/offload sizes (and ratio) in KvbmMetrics to catch configuration drift at runtime.
Also applies to: 190-192
lib/llm/src/block_manager/block/data/logical.rs (3)
59-59: Avoid allocation; initialize the SmallVec directly. Use SmallVec::from for the single-element case.

Apply this diff:

```diff
-            block_ids: [block_id].iter().map(|x| *x).collect(),
+            block_ids: SmallVec::from([block_id]),
```
75-82: Fix the panic message and add a backtrace for fragmented-block misuse. The current message references LocalBlockData; add a backtrace for parity.

Apply this diff:

```diff
     fn block_id(&self) -> BlockId {
-        if self.block_ids.len() == 1 {
-            self.block_ids[0]
-        } else {
-            panic!("used LocalBlockData::block_id() for fragmented block");
-        }
+        if self.block_ids.len() == 1 {
+            self.block_ids[0]
+        } else {
+            tracing::error!("Backtrace: {}", std::backtrace::Backtrace::force_capture());
+            panic!("used LogicalBlockData::block_id() for fragmented block");
+        }
     }
```
83-91: Bounds-check the fragment index and fix the panic wording. Prevent generic index panics and standardize the error text.

Apply this diff:

```diff
     #[inline(always)]
     fn fragment_block_id(&self, idx: usize) -> BlockId {
-        if self.block_ids.len() != 1 {
-            self.block_ids[idx]
-        } else {
-            tracing::error!("Backtrace: {}", std::backtrace::Backtrace::force_capture());
-            panic!("used LocalBlockData::fragment_block_id() for non-fragmented block");
-        }
+        if self.block_ids.len() != 1 {
+            self.block_ids
+                .get(idx)
+                .copied()
+                .unwrap_or_else(|| {
+                    tracing::error!("Backtrace: {}", std::backtrace::Backtrace::force_capture());
+                    panic!(
+                        "LogicalBlockData::fragment_block_id() idx {} out of range (len={})",
+                        idx,
+                        self.block_ids.len()
+                    )
+                })
+        } else {
+            tracing::error!("Backtrace: {}", std::backtrace::Backtrace::force_capture());
+            panic!("used LogicalBlockData::fragment_block_id() for non-fragmented block");
+        }
     }
```

lib/bindings/python/rust/llm/block_manager.rs (2)
241-243: Defaults make sense, but Default::default() remains a footgun. new() sets good defaults, but a derived Default will still produce zeros if someone uses BlockManagerBuilder::default() directly.

Provide an explicit Default to mirror new():

```rust
impl Default for BlockManagerBuilder {
    fn default() -> Self {
        Self {
            worker_id: 0,
            leader: None,
            engine_page_size: 32,
            offload_page_size: 1024,
            disable_device_pool: false,
        }
    }
}
```
251-257: Setters are fine; add basic guards. Tiny guardrails help fail fast on obvious misconfigurations.

```diff
     pub fn engine_page_size(mut self, ps: usize) -> Self {
-        self.engine_page_size = ps;
+        debug_assert!(ps.is_power_of_two(), "engine_page_size should be a power of two");
+        self.engine_page_size = ps;
         self
     }
     pub fn offload_page_size(mut self, ps: usize) -> Self {
-        self.offload_page_size = ps;
+        debug_assert!(ps.is_power_of_two(), "offload_page_size should be a power of two");
+        self.offload_page_size = ps;
         self
     }
```

lib/llm/src/block_manager/block/data.rs (3)
16-18: Doc nit: clarify fragment_block_id semantics. "The index of the block in the block set" duplicates block_id's doc. Suggest: "Returns the BlockId for the given fragment index of a fragmented block."
69-81: Guard against invalid fragment access to avoid panics downstream. Add a lightweight debug assertion to catch invalid indices early, before local implementations panic.

```diff
     fn layer_view_fragment(
         &self,
         fragment_idx: usize,
         layer_idx: usize,
         outer_idx: usize,
     ) -> BlockResult<view::LayerView<'_, S>> {
+        debug_assert!(
+            self.is_fragmented() && fragment_idx < self.num_fragments(),
+            "invalid fragment_idx {} (num_fragments: {})",
+            fragment_idx,
+            self.num_fragments()
+        );
         match self.is_local() {
             Some(views) => views.local_layer_view_fragment(fragment_idx, layer_idx, outer_idx),
             None => Err(BlockError::ViewsNotAvailableOnLogicalBlocks),
         }
     }
```
94-106: Same guard for mutable fragment views. Mirror the check for the mutable variant.

```diff
     fn layer_view_fragment_mut(
         &mut self,
         fragment_idx: usize,
         layer_idx: usize,
         outer_idx: usize,
     ) -> BlockResult<view::LayerViewMut<'_, S>> {
+        debug_assert!(
+            self.is_fragmented() && fragment_idx < self.num_fragments(),
+            "invalid fragment_idx {} (num_fragments: {})",
+            fragment_idx,
+            self.num_fragments()
+        );
         match self.is_local_mut() {
             Some(views) => views.local_layer_view_fragment_mut(fragment_idx, layer_idx, outer_idx),
             None => Err(BlockError::ViewsNotAvailableOnLogicalBlocks),
         }
     }
```

lib/bindings/python/rust/llm/block_manager/vllm/connector/trtllm_leader.rs (2)
55-57: Remove the unused `offload_page_size` field or implement its functionality. The `offload_page_size` field is marked with `#[allow(unused)]` but is passed through the constructor and stored. If this field will be used in a future implementation, consider adding a TODO comment explaining its planned usage. Otherwise, remove it to avoid confusion.

If keeping it for future use:

```diff
 pub struct KvConnectorLeader {
     slot_manager: Arc<OnceLock<ConnectorSlotManager<String>>>,
     engine_page_size: usize,
+    // TODO: Implement offload page size functionality for TRTLLM connector
     #[allow(unused)]
     offload_page_size: usize,
```
176-178: Consider extracting the page-size alignment validation into a helper method. The modulo checks for page alignment appear multiple times in the code; a helper improves maintainability and consistency.

```diff
+impl KvConnectorLeader {
+    #[inline]
+    fn is_page_aligned(&self, tokens: usize) -> bool {
+        tokens % self.engine_page_size == 0
+    }
+}

     fn get_num_new_matched_tokens(
         &mut self,
         request_id: String,
         request_num_tokens: usize,
         num_computed_tokens: usize,
     ) -> anyhow::Result<(usize, bool)> {
         // ...
-        if num_computed_tokens % self.engine_page_size != 0 {
+        if !self.is_page_aligned(num_computed_tokens) {
             return Ok((0, false));
         }
         // ...
-        debug_assert!((num_computed_tokens + num_external_tokens) % self.engine_page_size == 0);
+        debug_assert!(self.is_page_aligned(num_computed_tokens + num_external_tokens));
```

Also applies to: 186-187, 201-202
lib/bindings/python/rust/llm/block_manager/vllm/connector/leader/slot.rs (1)
1439-1445: Simplify block-pairs construction. The iterator chain could be simplified for better readability.

```diff
-let block_pairs: Vec<_> = src_block_ids
-    .iter()
-    .zip(
-        onboard_req
-            .dst_block_ids
-            .chunks(engine_blocks_per_offload_block),
-    )
-    .map(|(src, dst)| (SmallVec::from([*src]), SmallVec::from(dst)))
-    .collect();
+let dst_chunks = onboard_req.dst_block_ids.chunks(engine_blocks_per_offload_block);
+let block_pairs: Vec<_> = src_block_ids
+    .iter()
+    .zip(dst_chunks)
+    .map(|(src, dst)| (SmallVec::from([*src]), SmallVec::from(dst)))
+    .collect();
```

lib/llm/src/block_manager/block/data/local.rs (4)
11-13: Fix typos in the fragmentation comment. Minor clarity nits: "block IDx" → "block IDs"; capitalize "Offloaded"; trim the redundant clause.

Apply:

```diff
-    // The special case of multiple block IDx will only happen for small KV cache blocks on device.
-    // offloaded blocks will be large and `block_idxs.len() == 1` will hold for them.
+    // The special case of multiple block IDs will only happen for small KV cache blocks on device.
+    // Offloaded blocks will be large and `block_idxs.len() == 1`.
```
69-69: Prefer the smallvec! macro for single-element init. More idiomatic, and it avoids an iterator + collect.

Apply:

```diff
-            block_idxs: [block_idx].iter().map(|x| *x).collect(),
+            block_idxs: smallvec![block_idx],
```

And import the macro once:

```diff
 use smallvec::SmallVec;
+use smallvec::smallvec;
```
75-81: Avoid an unnecessary clone of WorkerID. WorkerID appears to be Copy (it is assigned by value elsewhere). Drop the `.clone()` to reduce noise.

Apply:

```diff
     pub(crate) fn base(&self) -> LocalBlockDataBase<S> {
         LocalBlockDataBase {
             layout: self.layout.clone(),
             block_set_idx: self.block_set_idx,
-            worker_id: self.worker_id.clone(),
+            worker_id: self.worker_id,
         }
     }
```
11-14: Optional: consider renaming `block_idxs` → `block_ids`. The field stores BlockId values, not indices. A rename would improve readability and align with the method names (`block_id`, `fragment_block_id`). Non-functional, but it helps future contributors.

Also applies to: 21-26, 45-54, 56-82, 88-106, 127-134, 160-211, 213-239
lib/bindings/python/rust/llm/block_manager/vllm/connector/leader.rs (2)
239-243: Tail handling: the early exit may starve small remainders. Exiting when remaining < offload_page_size means requests with short tails won't match until more tokens arrive, and may never match for short prompts. If intended, fine; otherwise consider checking "tokens to the next offload boundary" instead of a full page.

Would you like a patch that computes:

- `let need = (self.offload_page_size - (num_computed_tokens % self.offload_page_size)) % self.offload_page_size;`
- return early only if `remaining < need`?

A sketch of that boundary check follows below.
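A minimal sketch of the proposed condition as a hypothetical free function (`need` is the number of tokens remaining until the next offload-page boundary, zero when exactly on a boundary):

```rust
// Hypothetical extract of the proposed early-exit check.
fn should_defer(remaining: usize, num_computed_tokens: usize, offload_page_size: usize) -> bool {
    let need =
        (offload_page_size - (num_computed_tokens % offload_page_size)) % offload_page_size;
    // Defer only when we cannot even reach the next offload boundary.
    remaining < need
}
```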
99-104: Add page-size details to the init log for easier debugging/telemetry.

Apply:

```diff
-    tracing::info!(
-        "KvConnectorLeader initialized with worker_id: {}",
-        worker_id
-    );
+    tracing::info!(
+        "KvConnectorLeader initialized with worker_id: {}, engine_page_size: {}, offload_page_size: {}",
+        worker_id, engine_page_size, offload_page_size
+    );
```
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
⛔ Files ignored due to path filters (2)
- `Cargo.lock` is excluded by `!**/*.lock`
- `lib/bindings/python/Cargo.lock` is excluded by `!**/*.lock`
📒 Files selected for processing (22)
- lib/bindings/python/Cargo.toml (1 hunks)
- lib/bindings/python/rust/llm/block_manager.rs (4 hunks)
- lib/bindings/python/rust/llm/block_manager/distributed/worker.rs (1 hunks)
- lib/bindings/python/rust/llm/block_manager/vllm.rs (9 hunks)
- lib/bindings/python/rust/llm/block_manager/vllm/connector/leader.rs (10 hunks)
- lib/bindings/python/rust/llm/block_manager/vllm/connector/leader/recorder.rs (3 hunks)
- lib/bindings/python/rust/llm/block_manager/vllm/connector/leader/slot.rs (18 hunks)
- lib/bindings/python/rust/llm/block_manager/vllm/connector/trtllm_leader.rs (8 hunks)
- lib/bindings/python/rust/llm/block_manager/vllm/connector/trtllm_worker.rs (1 hunks)
- lib/bindings/python/rust/llm/block_manager/vllm/connector/worker.rs (3 hunks)
- lib/bindings/python/src/dynamo/llm/vllm_integration/connector_leader.py (1 hunks)
- lib/llm/Cargo.toml (1 hunks)
- lib/llm/src/block_manager.rs (5 hunks)
- lib/llm/src/block_manager/block/data.rs (5 hunks)
- lib/llm/src/block_manager/block/data/local.rs (8 hunks)
- lib/llm/src/block_manager/block/data/logical.rs (5 hunks)
- lib/llm/src/block_manager/block/transfer/cuda.rs (1 hunks)
- lib/llm/src/block_manager/config.rs (1 hunks)
- lib/llm/src/block_manager/distributed/transfer.rs (4 hunks)
- lib/llm/src/block_manager/distributed/utils.rs (4 hunks)
- lib/llm/src/block_manager/distributed/worker.rs (7 hunks)
- lib/llm/src/block_manager/state/resources.rs (1 hunks)
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-05-29T06:20:12.901Z
Learnt from: ryanolson
PR: ai-dynamo/dynamo#1093
File: lib/llm/src/block_manager/block/registry.rs:98-122
Timestamp: 2025-05-29T06:20:12.901Z
Learning: In lib/llm/src/block_manager/block/registry.rs, the background task spawned for handling unregister notifications uses detached concurrency by design. The JoinHandle is intentionally not stored as this represents a reasonable architectural tradeoff for a long-running cleanup task.
Applied to files:
lib/bindings/python/rust/llm/block_manager/vllm/connector/leader/slot.rs
🧬 Code graph analysis (15)

lib/bindings/python/rust/llm/block_manager/vllm/connector/worker.rs (4)
- lib/llm/src/block_manager/block/data.rs (1): page_size (44-44)
- lib/llm/src/block_manager/block/data/local.rs (1): page_size (147-149)
- lib/llm/src/block_manager/block/data/logical.rs (1): page_size (122-124)
- lib/llm/src/block_manager/layout.rs (1): page_size (250-252)

lib/bindings/python/src/dynamo/llm/vllm_integration/connector_leader.py (1)
- lib/bindings/python/rust/llm/block_manager.rs (1): leader (259-262)

lib/llm/src/block_manager/distributed/worker.rs (2)
- lib/bindings/python/rust/llm/block_manager.rs (2): engine_page_size (251-254), offload_page_size (255-258)
- lib/llm/src/block_manager/state/resources.rs (1): layout_builder (84-97)

lib/bindings/python/rust/llm/block_manager/distributed/worker.rs (4)
- lib/llm/src/block_manager/block/data.rs (1): page_size (44-44)
- lib/llm/src/block_manager/block/data/local.rs (1): page_size (147-149)
- lib/llm/src/block_manager/block/data/logical.rs (1): page_size (122-124)
- lib/llm/src/block_manager/layout.rs (1): page_size (250-252)

lib/bindings/python/rust/llm/block_manager/vllm.rs (2)
- lib/bindings/python/rust/llm/block_manager.rs (4): new (70-174), new (239-245), engine_block_size (176-178), offload_block_size (180-182)
- lib/bindings/python/rust/llm/block_manager/vllm/connector/leader/slot.rs (7): new (190-228), new (343-379), new (1098-1111), new (1122-1135), new (1145-1155), new (1489-1494), new (1498-1503)

lib/bindings/python/rust/llm/block_manager/vllm/connector/trtllm_worker.rs (4)
- lib/llm/src/block_manager/block/data.rs (1): page_size (44-44)
- lib/llm/src/block_manager/block/data/local.rs (1): page_size (147-149)
- lib/llm/src/block_manager/block/data/logical.rs (1): page_size (122-124)
- lib/llm/src/block_manager/layout.rs (1): page_size (250-252)

lib/bindings/python/rust/llm/block_manager/vllm/connector/leader.rs (4)
- lib/bindings/python/rust/llm/block_manager.rs (6): engine_page_size (251-254), offload_page_size (255-258), worker_id (247-250), leader (259-262), new (70-174), new (239-245)
- lib/llm/src/block_manager.rs (3): worker_id (165-167), new (196-212), new (245-264)
- lib/bindings/python/rust/llm/block_manager/vllm/connector/leader/recorder.rs (1): new (88-203)
- lib/bindings/python/src/dynamo/llm/vllm_integration/connector_leader.py (1): KvConnectorLeader (47-204)

lib/bindings/python/rust/llm/block_manager/vllm/connector/leader/recorder.rs (1)
- lib/bindings/python/rust/llm/block_manager.rs (2): engine_page_size (251-254), offload_page_size (255-258)

lib/bindings/python/rust/llm/block_manager/vllm/connector/leader/slot.rs (3)
- lib/bindings/python/rust/llm/block_manager/vllm.rs (4): block_manager (76-78), new (85-94), new (333-357), new (401-407)
- lib/bindings/python/rust/llm/block_manager.rs (4): engine_block_size (176-178), offload_block_size (180-182), new (70-174), new (239-245)
- lib/llm/src/block_manager.rs (5): engine_block_size (135-137), offload_block_size (140-142), offload_block_size_ratio (145-147), new (196-212), new (245-264)

lib/llm/src/block_manager/block/data.rs (2)
- lib/llm/src/block_manager/block/data/local.rs (6): fragment_block_id (99-106), is_fragmented (127-129), num_fragments (131-133), local_layer_view_fragment (173-186), local_layer_view_mut (188-197), local_layer_view_fragment_mut (199-211)
- lib/llm/src/block_manager/block/data/logical.rs (3): fragment_block_id (84-91), is_fragmented (109-111), num_fragments (113-115)

lib/bindings/python/rust/llm/block_manager/vllm/connector/trtllm_leader.rs (5)
- lib/bindings/python/rust/llm/block_manager.rs (6): engine_page_size (251-254), offload_page_size (255-258), worker_id (247-250), leader (259-262), new (70-174), new (239-245)
- lib/bindings/python/rust/llm/block_manager/vllm.rs (3): new (85-94), new (333-357), new (401-407)
- lib/bindings/python/rust/llm/block_manager/vllm/connector/leader.rs (2): new (92-183), new (558-587)
- lib/bindings/python/rust/llm/block_manager/vllm/connector/leader/recorder.rs (1): new (88-203)
- lib/bindings/python/rust/llm/block_manager/vllm/connector/leader/slot.rs (6): new (190-228), new (343-379), new (1098-1111), new (1122-1135), new (1145-1155), new (1489-1494)

lib/llm/src/block_manager/distributed/utils.rs (1)
- lib/llm/src/kv/layer.rs (2): src (653-657), dst (659-663)

lib/llm/src/block_manager/block/data/local.rs (2)
- lib/llm/src/block_manager/block/data.rs (8): worker_id (26-26), fragment_block_id (17-17), is_fragmented (35-35), num_fragments (38-38), storage_type (29-29), local_layer_view_fragment (133-138), local_layer_view_fragment_mut (148-153), local_block_view (156-156)
- lib/llm/src/block_manager/block/data/logical.rs (5): worker_id (97-99), fragment_block_id (84-91), is_fragmented (109-111), num_fragments (113-115), storage_type (101-103)

lib/llm/src/block_manager/block/data/logical.rs (3)
- lib/llm/src/block_manager/block/data.rs (4): block_id (14-14), fragment_block_id (17-17), is_fragmented (35-35), num_fragments (38-38)
- lib/llm/src/block_manager/block/data/local.rs (4): block_id (89-96), fragment_block_id (99-106), is_fragmented (127-129), num_fragments (131-133)
- lib/llm/src/block_manager/block.rs (2): block_id (372-374), block_id (835-839)

lib/bindings/python/rust/llm/block_manager.rs (1)
- lib/llm/src/block_manager.rs (4): engine_block_size (135-137), offload_block_size (140-142), new (196-212), new (245-264)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
- GitHub Check: pre-merge-rust (lib/bindings/python)
- GitHub Check: pre-merge-rust (lib/runtime/examples)
- GitHub Check: pre-merge-rust (.)
- GitHub Check: Build and Test - dynamo
🔇 Additional comments (12)
lib/llm/src/block_manager/distributed/utils.rs (1)
6-6: LGTM: smallvec import. The import is correct and feature-gated via serde in Cargo.
lib/llm/src/block_manager/block/data/logical.rs (1)
14-14: No action needed: `smallvec` is declared in `lib/llm/Cargo.toml` (v1.15.1 with the `serde` feature).

lib/bindings/python/rust/llm/block_manager.rs (1)
176-182: Python-facing accessors for block sizes look good. Exposes the two-size model cleanly to Python. No concerns.
lib/bindings/python/rust/llm/block_manager/vllm.rs (3)
86-89: Wiring SlotManager with engine/offload sizes is correct. Good propagation of the two-size model into SlotManager.

562-565: ISL metrics split by device/host/disk look correct. Right multipliers: device uses the engine size; host/disk use the offload size.

292-293: Doc update aligns with the two-size model. "num_new_tokens / engine_block_size" clarifies the device-side unit. Looks good.
lib/bindings/python/rust/llm/block_manager/vllm/connector/leader/slot.rs (1)
1196-1202: Review comment resolved: cloning `VllmBlockManager` is cheap (it derives `Clone` on an internal `Arc`), so no change needed.

lib/llm/src/block_manager/distributed/transfer.rs (1)
107-111: No changes needed: LocalBlockDataBase::get_data simply wraps the SmallVec of indices into LocalBlockData and delegates any fragmentation logic to LocalBlockData.

lib/llm/src/block_manager/block/data/local.rs (1)
45-54: LGTM: base → data constructor. Clean separation via LocalBlockDataBase; cloning the layout and reusing identifiers is correct.
lib/bindings/python/rust/llm/block_manager/vllm/connector/leader.rs (3)
251-254: LGTM: alignment assert for onboarding tokens. The modulo against offload_page_size is the right invariant after matching.

130-138: LGTM: BlockManagerBuilder wiring with dual page sizes. Builder calls align with the new two-size model; good to see `.disable_device_pool(false)` explicit.

556-585: LGTM: PyO3 signature updated consistently. The constructor and recorder path accept both page sizes, matching downstream builder usage.
```rust
    offload_page_size: usize,
    engine_page_size: usize,
    disable_device_pool: bool,
```
🛠️ Refactor suggestion
offload_page_size is never used; risk of silent misconfiguration
The builder stores offload_page_size but never propagates it into KvBlockManagerConfig. This makes the field effectively dead and prevents callers from controlling the offload size via the builder.
Apply this diff inside build() to derive and set the ratio from the two sizes (with validation):
```diff
@@
- let model_config = dynamo_llm::block_manager::KvManagerModelConfig::builder()
+ let model_config = dynamo_llm::block_manager::KvManagerModelConfig::builder()
.num_layers(1)
.outer_dim(1)
- .page_size(self.engine_page_size)
+ .page_size(self.engine_page_size)
.inner_dim(1)
.build()?;
config = config.model(model_config);
+ // Derive and apply offload ratio from sizes
+ if self.engine_page_size == 0 || self.offload_page_size == 0 {
+ return Err(anyhow::anyhow!("engine/offload page sizes must be > 0"));
+ }
+ if self.offload_page_size % self.engine_page_size != 0 {
+ return Err(anyhow::anyhow!(
+ "offload_page_size ({}) must be a multiple of engine_page_size ({})",
+ self.offload_page_size,
+ self.engine_page_size
+ ));
+ }
+ let ratio = self.offload_page_size / self.engine_page_size;
+ if !ratio.is_power_of_two() {
+ return Err(anyhow::anyhow!(
+ "offload/engine page-size ratio ({}) must be a power of two",
+ ratio
+ ));
+ }
+ // Requires KvBlockManagerConfigBuilder::offload_block_size_ratio(usize)
+    config = config.offload_block_size_ratio(ratio);
```
Additionally, consider validating an upper bound if required by pools (e.g., offload ≤ 1024).
Committable suggestion skipped: line range outside the PR's diff.
```diff
             .num_layers(1)
             .outer_dim(1)
-            .page_size(self.page_size)
+            .page_size(self.engine_page_size)
```
💡 Verification agent
🧩 Analysis chain
Only engine page size is propagated; offload is ignored
build() writes model.page_size but never configures offload_block_size_ratio, so offload size falls back to the global default regardless of the builder’s offload_page_size.
Follow the ratio derivation diff in my earlier comment to ensure offload_page_size actually takes effect.
To confirm the API path exists, run:
🏁 Script executed:

```bash
#!/bin/bash
# Verify the builder exposes offload_block_size_ratio and find its usage.
rg -n --glob 'lib/**' 'offload_block_size_ratio' -C2
# Confirm current usage of offload_page_size and engine_page_size.
rg -n --glob 'lib/**' 'offload_page_size|engine_page_size' -C2
```
Length of output: 38270
Propagate offload_page_size in Python binding build()
In lib/bindings/python/rust/llm/block_manager.rs around line 292, after `.page_size(self.engine_page_size)` add `.offload_page_size(self.offload_page_size)` so the builder applies the configured offload size rather than falling back to the default.
🤖 Prompt for AI Agents
In lib/bindings/python/rust/llm/block_manager.rs around line 292, the builder
call stops at .page_size(self.engine_page_size) and doesn't propagate the
configured offload size; add a chained
.offload_page_size(self.offload_page_size) immediately after the page_size call
so the builder uses the self.offload_page_size value (ensure the field exists
and is in scope) instead of falling back to the default.
```diff
 let config = KvbmWorkerConfig::builder()
     .drt(drt)
     .num_device_blocks(num_device_blocks)
-    .page_size(page_size)
+    .offload_page_size(page_size)
+    .engine_page_size(page_size)
```
Configuration uses the same value for both page sizes
The configuration currently sets both `offload_page_size` and `engine_page_size` to the same value (`page_size`), which appears to contradict the PR's objective of having "the host and disk pool use larger block sizes than the device block pool." This likely requires the Python caller to pass separate values for these parameters.
Based on the PR objectives stating that offload blocks should be larger than device blocks, the Python bindings should accept two distinct page-size parameters. Consider updating the function signature to accept both `engine_page_size` and `offload_page_size`:
```diff
-    #[pyo3(signature = (num_device_blocks, page_size, tensors, device_id=0, dtype_width_bytes=2, drt=None, layout_blocking=false))]
+    #[pyo3(signature = (num_device_blocks, engine_page_size, offload_page_size, tensors, device_id=0, dtype_width_bytes=2, drt=None, layout_blocking=false))]
     fn new(
         num_device_blocks: usize,
-        page_size: usize,
+        engine_page_size: usize,
+        offload_page_size: usize,
         tensors: Vec<Py<PyAny>>,
```
And update the configuration accordingly:

```diff
     let config = KvbmWorkerConfig::builder()
         .drt(drt)
         .num_device_blocks(num_device_blocks)
-        .offload_page_size(page_size)
-        .engine_page_size(page_size)
+        .offload_page_size(offload_page_size)
+        .engine_page_size(engine_page_size)
```
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```diff
-    let config = KvbmWorkerConfig::builder()
-        .drt(drt)
-        .num_device_blocks(num_device_blocks)
-        .page_size(page_size)
-        .offload_page_size(page_size)
-        .engine_page_size(page_size)
+    // Update the Python binding signature to accept two page sizes
+    #[pyo3(signature = (num_device_blocks, engine_page_size, offload_page_size, tensors, device_id=0, dtype_width_bytes=2, drt=None, layout_blocking=false))]
+    fn new(
+        num_device_blocks: usize,
+        engine_page_size: usize,
+        offload_page_size: usize,
+        tensors: Vec<Py<PyAny>>,
+        device_id: usize,
+        dtype_width_bytes: usize,
+        drt: Option<...>,
+        layout_blocking: bool,
+    ) -> PyResult<Self> {
+        // ...
+        let config = KvbmWorkerConfig::builder()
+            .drt(drt)
+            .num_device_blocks(num_device_blocks)
+            .offload_page_size(offload_page_size)
+            .engine_page_size(engine_page_size)
+        // ...
+    }
```
🤖 Prompt for AI Agents
In lib/bindings/python/rust/llm/block_manager/distributed/worker.rs around lines
137 to 141, the code currently assigns the same page_size to both
offload_page_size and engine_page_size; update the worker API and configuration
to accept distinct engine_page_size and offload_page_size values so offload
(host/disk) can be larger than device blocks. Change the Rust function signature
(and matching Python binding) to take two separate page size parameters, use
engine_page_size for .engine_page_size(...) and offload_page_size for
.offload_page_size(...), and update all callers/tests to pass the appropriate
values. Ensure input validation (e.g., offload >= device) or clear documentation
for callers if needed.
```rust
    Slot::new(tokens.into(), self.engine_block_size, salt_hash),
);
```
🛠️ Refactor suggestion
Slot initialization should carry both sizes to support offload hashing
Passing only engine_block_size makes it hard for Slot to produce offload-sized sequence hashes needed for host/disk caches.
If Slot::new can be extended, pass both sizes:
```diff
-    Slot::new(tokens.into(), self.engine_block_size, salt_hash),
+    Slot::new(
+        tokens.into(),
+        self.engine_block_size,
+        self.offload_block_size,
+        salt_hash,
+    ),
```

Otherwise, provide a method on Slot to compute offload-sized sequence hashes on demand.
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```rust
@@ lib/bindings/python/rust/llm/block_manager/vllm.rs
    Slot::new(
        tokens.into(),
        self.engine_block_size,
        self.offload_block_size,
        salt_hash,
    ),
);
```
🤖 Prompt for AI Agents
In lib/bindings/python/rust/llm/block_manager/vllm.rs around lines 432-433, the
Slot is being created with only engine_block_size which prevents Slot from
producing offload-sized sequence hashes; either extend Slot::new to accept both
sizes and change this call to pass engine_block_size and offload_block_size
(plus salt_hash), or if modifying Slot::new is undesirable, add a method on Slot
(e.g., compute_offload_sequence_hash(offload_block_size)) and call that after
construction so Slot can produce offload-sized sequence hashes for host/disk
caches.
```rust
debug_assert!(offload_block_size.is_power_of_two() && offload_block_size <= 1024);
debug_assert!(engine_block_size.is_power_of_two() && engine_block_size <= 1024);
let sequence = TokenBlockSequence::new(tokens, offload_block_size as u32, Some(salt_hash));
```
Add validation for block size relationship.
The code assumes that `offload_block_size` is a multiple of `engine_block_size`, but this is only checked with debug assertions. Add runtime validation to prevent potential issues in release builds.
```diff
 let offload_block_size = block_manager.offload_block_size();
 let engine_block_size = block_manager.engine_block_size();
 let offload_block_size_ratio = block_manager.offload_block_size_ratio();
+
+// Validate block size relationship
+if offload_block_size % engine_block_size != 0 {
+    panic!("offload_block_size ({}) must be a multiple of engine_block_size ({})",
+        offload_block_size, engine_block_size);
+}
+if offload_block_size / engine_block_size != offload_block_size_ratio {
+    panic!("Inconsistent block size ratio: {} / {} != {}",
+        offload_block_size, engine_block_size, offload_block_size_ratio);
+}
+
 debug_assert!(offload_block_size.is_power_of_two() && offload_block_size <= 1024);
 debug_assert!(engine_block_size.is_power_of_two() && engine_block_size <= 1024);
```
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```rust
let offload_block_size = block_manager.offload_block_size();
let engine_block_size = block_manager.engine_block_size();
let offload_block_size_ratio = block_manager.offload_block_size_ratio();

// Validate block size relationship
if offload_block_size % engine_block_size != 0 {
    panic!(
        "offload_block_size ({}) must be a multiple of engine_block_size ({})",
        offload_block_size,
        engine_block_size
    );
}
if offload_block_size / engine_block_size != offload_block_size_ratio {
    panic!(
        "Inconsistent block size ratio: {} / {} != {}",
        offload_block_size,
        engine_block_size,
        offload_block_size_ratio
    );
}

debug_assert!(offload_block_size.is_power_of_two() && offload_block_size <= 1024);
debug_assert!(engine_block_size.is_power_of_two() && engine_block_size <= 1024);
let sequence = TokenBlockSequence::new(tokens, offload_block_size as u32, Some(salt_hash));
```
```diff
     pub from_pool: BlockTransferPool,
     pub to_pool: BlockTransferPool,
-    pub blocks: Vec<(usize, usize)>,
+    pub blocks: Vec<(SmallVec<[usize; 1]>, SmallVec<[usize; 1]>)>,
```
💡 Verification agent
🧩 Analysis chain
Wire format breaking change for blocks; add serde compat or version the message.
Changing blocks to Vec<(SmallVec<[usize;1]>, SmallVec<[usize;1]>)> alters the serialized shape from [[src,dst], …] to [[[...srcs],[...dsts]], …]. Any older producer/consumer (Rust/Python/TRTLLM) deserializing the old shape will break.
Two options:
- Back-compat serde adapters on this field to accept both shapes (old pair and new fragments).
- Bump protocol/schema version for ZMQ_TRANSFER_BLOCKS_MESSAGE and migrate all call sites atomically.
I can provide a serde adapter so we accept both without changing external callers. Would you like me to draft it?
🏁 Script executed:

```bash
#!/bin/bash
# Locate (de)serializers and external users of transfer_blocks to assess impact.
rg -n 'transfer_blocks|BlockTransferRequest|blocks' -C3
```
Length of output: 307985
Wire-format breaking change: make BlockTransferRequest.blocks backward-compatible or version the message
Changing blocks to Vec<(SmallVec<[usize;1]>, SmallVec<[usize;1]>)> changes the JSON shape from [[src,dst], …] to [[[srcs],[dsts]], …]; incoming ZMQ/serde_json deserialization will break existing producers/consumers.
Locations:
- Field definition: lib/llm/src/block_manager/distributed/utils.rs:34-41 (blocks at line 37; the ZMQ_TRANSFER_BLOCKS_MESSAGE const at line 11).
- Deserializer (incoming message): lib/llm/src/block_manager/distributed/transfer.rs:157 (`serde_json::from_slice::<BlockTransferRequest>`).
- Sender/usage sites: lib/llm/src/block_manager/distributed/leader.rs:397-402 (transfer_blocks_request) and callers in lib/llm/src/block_manager/distributed.rs (e.g. around lines 178-206).
Recommended actions (choose one):
- Add a serde-compatible adapter on the blocks field to accept both legacy pairs [src,dst] and the new pair-of-arrays form (use #[serde(deserialize_with = "...")] with a Visitor that accepts both shapes). This keeps wire-compatibility.
- Bump the protocol/schema for ZMQ_TRANSFER_BLOCKS_MESSAGE and roll an atomic migration across all producers/consumers.
I will implement the serde adapter if you want it drafted.
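For reference, a minimal sketch of such an adapter, assuming serde plus smallvec with its "serde" feature enabled (the type and function names here are illustrative):

```rust
use serde::{Deserialize, Deserializer};
use smallvec::SmallVec;

type Frag = SmallVec<[usize; 1]>;

// An untagged enum lets serde try the legacy [src, dst] pair first,
// then the new [[srcs...], [dsts...]] shape.
#[derive(Deserialize)]
#[serde(untagged)]
enum BlockPairCompat {
    Legacy(usize, usize),   // old wire shape: [src, dst]
    Fragmented(Frag, Frag), // new wire shape: [[srcs...], [dsts...]]
}

// Attach via #[serde(deserialize_with = "deserialize_block_pairs")] on `blocks`.
fn deserialize_block_pairs<'de, D>(d: D) -> Result<Vec<(Frag, Frag)>, D::Error>
where
    D: Deserializer<'de>,
{
    let raw = Vec::<BlockPairCompat>::deserialize(d)?;
    Ok(raw
        .into_iter()
        .map(|pair| match pair {
            // Normalize the legacy pair into single-element fragments.
            BlockPairCompat::Legacy(src, dst) => (SmallVec::from([src]), SmallVec::from([dst])),
            BlockPairCompat::Fragmented(srcs, dsts) => (srcs, dsts),
        })
        .collect())
}
```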
🤖 Prompt for AI Agents
In lib/llm/src/block_manager/distributed/utils.rs around line 37, the blocks
field type change altered the JSON wire shape and will break existing
producers/consumers; add a serde-compatible adapter to deserialize both legacy
pairs [src,dst] and the new pair-of-arrays form [[[srcs],[dsts]]]. Implement a
deserialize_with function (in the same module or a nearby utils module) that
provides a Visitor which accepts either (a) a sequence of two integers (map to
SmallVec<[usize;1]> for each side) or (b) a sequence of two sequences of
integers (map each inner sequence to SmallVec<[usize;1]>), and use
#[serde(deserialize_with = "your_adapter")] on the blocks field so incoming
ZMQ/serde_json::from_slice::<BlockTransferRequest> accepts both formats while
preserving the existing type; alternatively, if you prefer schema bumping,
increment the protocol version for ZMQ_TRANSFER_BLOCKS_MESSAGE and coordinate
migrations across sender/receiver sites instead of adding the adapter.
#[builder(default = "32")] | ||
page_size: usize, | ||
engine_page_size: usize, | ||
|
||
#[builder(default = "32")] | ||
offload_page_size: usize, |
💡 Verification agent
🧩 Analysis chain
Consider validating the offload_block_size_ratio relationship
The configuration now has separate `engine_page_size` and `offload_page_size` fields, but based on the PR objectives and the code snippet from lib/llm/src/block_manager/state/resources.rs, there is an `offload_block_size_ratio` that should maintain a fixed ratio between these sizes.
🏁 Script executed:

```bash
#!/bin/bash
# Check how offload_block_size_ratio is defined and used
ast-grep --pattern 'offload_block_size_ratio'
# Look for any validation of the relationship between engine and offload sizes
rg -n "engine_page_size|offload_page_size" --type rust -C3 | grep -E "(assert|validate|check|verify)"
```
Length of output: 1790
Validate that offload_page_size maintains the configured ratio

Add a check in the WorkerConfig builder (lib/llm/src/block_manager/distributed/worker.rs, lines 93-97) or in build() to assert that

`offload_page_size == engine_page_size * offload_block_size_ratio`

Currently no validation enforces this relationship (see usages in lib/llm/src/block_manager.rs), so misconfigured values can produce incorrect offload block sizes. A sketch of such a check follows the AI-agent prompt below.
🤖 Prompt for AI Agents
In lib/llm/src/block_manager/distributed/worker.rs around lines 93–97, the
builder sets engine_page_size and offload_page_size but lacks validation that
offload_page_size == engine_page_size * offload_block_size_ratio; add a check in
the WorkerConfig builder's build() (or the builder validation hook) that
computes expected_offload = engine_page_size * offload_block_size_ratio and
compares it to offload_page_size, returning a clear Err/Result (or panic with a
descriptive error) when they differ so misconfigurations cannot produce
incorrect offload block sizes.
```diff
     };
     let num_layers = device_tensors.len();
-    let inner_dim = shape[2..].iter().product::<usize>() / config.page_size;
+    let inner_dim = shape[2..].iter().product::<usize>() / config.engine_page_size;
```
Inconsistent page_size usage in inner_dim calculation
The `inner_dim` calculation uses `engine_page_size` for non-fully-contiguous layouts (line 177) but `offload_page_size` for fully-contiguous layouts (line 196). This inconsistency could lead to incorrect memory-layout calculations. Based on the layout-type distinction, it appears intentional that different layouts use different page sizes. However, the logging on line 201 incorrectly shows `engine_page_size` when the calculation used `offload_page_size`. Fix the logging:
```diff
 tracing::info!(
     "Inferred layout: num_layers={}, outer_dim={}, page_size={}, inner_dim={}",
     num_layers,
     outer_dim,
-    config.engine_page_size,
+    config.offload_page_size,
     inner_dim
 );
```
Additionally, consider adding a comment explaining why different page sizes are used for different layout types to prevent future confusion.
Also applies to: 196-196
🤖 Prompt for AI Agents
In lib/llm/src/block_manager/distributed/worker.rs around lines 177 and 196, the
inner_dim calculation uses different page sizes for different layouts
(engine_page_size for non-fully-contiguous at line 177 and offload_page_size for
fully-contiguous at line 196) but the log at ~201 prints engine_page_size even
when offload_page_size was used; update the log to print the correct page size
depending on which branch computed inner_dim (use offload_page_size for the
fully-contiguous branch and engine_page_size for the other), and add a short
comment by each branch explaining why different page sizes are chosen for
fully-contiguous vs non-fully-contiguous layouts to prevent future confusion.
```diff
         .create_layout(layout_type, device_tensors)?;

-        let layout_builder = layout_builder.clone();
+        let layout_builder = layout_builder.page_size(config.offload_page_size).clone();
```
🛠️ Refactor suggestion
Builder pattern mutation after build
Line 232 modifies the builder after it has already been used to build a layout on line 230. This is unusual and potentially error-prone, as it violates the typical builder pattern where builders are consumed or finalized after `build()`.

Consider creating a new builder instance for the offload configuration instead of mutating the existing one:

```diff
-    let layout_builder = layout_builder.page_size(config.offload_page_size).clone();
+    let offload_layout_builder = LayoutConfigBuilder::default()
+        .num_layers(num_layers)
+        .outer_dim(outer_dim)
+        .page_size(config.offload_page_size)
+        .inner_dim(inner_dim)
+        .dtype_width_bytes(config.dtype_width_bytes);
```

Then pass `offload_layout_builder` to the initialization functions.
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```diff
-let layout_builder = layout_builder.page_size(config.offload_page_size).clone();
+let offload_layout_builder = LayoutConfigBuilder::default()
+    .num_layers(num_layers)
+    .outer_dim(outer_dim)
+    .page_size(config.offload_page_size)
+    .inner_dim(inner_dim)
+    .dtype_width_bytes(config.dtype_width_bytes);
```
🤖 Prompt for AI Agents
In lib/llm/src/block_manager/distributed/worker.rs around lines 230-232, the
code calls build() on a layout builder and then mutates that same builder
(page_size(...)) on line 232 which breaks the builder pattern; instead, create a
separate builder instance for the offload configuration (e.g., clone or
construct a new LayoutBuilder with the same base settings), set the offload
page_size on that new builder, and pass the new offload_layout_builder into the
offload initialization functions so the original builder is not mutated after
build.
```diff
             .num_layers(model.num_layers)
             .outer_dim(model.outer_dim)
-            .page_size(model.page_size)
+            .page_size(model.page_size * self.config.offload_block_size_ratio)
```
💡 Verification agent
🧩 Analysis chain
Critical: inflates device page size as well; offload-only ratio must not affect engine/device.
layout_builder is generic and appears to seed both device and offload layouts. Multiplying here will make device pages too large, contradicting the PR goal (device small, host/disk large).
Minimal fix (keep engine/device at base size here; apply the ratio only in host/disk-specific builders/paths):
```diff
-            .page_size(model.page_size * self.config.offload_block_size_ratio)
+            .page_size(model.page_size)
```
Run to find usages that expect offload-sized pages and adjust them to multiply locally:
I can draft host/disk-specific builders (e.g., layout_builder_offload) if helpful.
🏁 Script executed:

```bash
#!/bin/bash
rg -n -C3 '\blayout_builder\(' lib/llm | sed -n '1,200p'
```
Length of output: 1849
Remove offload ratio from generic layout_builder
- In lib/llm/src/block_manager/state/resources.rs (line 92), change:

```diff
-            .page_size(model.page_size * self.config.offload_block_size_ratio)
+            .page_size(model.page_size)
```
- Apply offload_block_size_ratio only in host/disk–specific layout builders (i.e. after calling resources.layout_builder() in offload code paths) so that device pages remain at the base size.
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```diff
-            .page_size(model.page_size * self.config.offload_block_size_ratio)
+            .page_size(model.page_size)
```
🤖 Prompt for AI Agents
In lib/llm/src/block_manager/state/resources.rs around line 92, the generic
layout_builder is applying offload_block_size_ratio to the device page size
(".page_size(model.page_size * self.config.offload_block_size_ratio)"); remove
that multiplication so the generic builder uses the base page size (e.g.,
".page_size(model.page_size)"). Then apply the offload_block_size_ratio only in
host/disk-specific layout builders—after calling resources.layout_builder() in
offload code paths—by multiplying the returned layout/page_size there so device
pages remain at the base size while offloaded layouts use the scaled size.