feat: Enable fast pinned memory alloc in KVBM #4184
base: main
Conversation
Signed-off-by: jthomson04 <[email protected]>
Walkthrough

Enhances pinned memory allocation in the CUDA block manager with NUMA awareness and huge-pages support. Replaces the single-path allocation with conditional logic (a 2x2 match on the numa_aware and use_huge_pages flags), where each path offers a different allocation strategy with corresponding error handling and fallbacks.

Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
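The 2x2 match described above can be sketched as a pure strategy-selection function. This is an illustrative model only: the variant names and the helper below are hypothetical, not the actual KVBM symbols, and the real code performs the allocations inline rather than returning a strategy value.

```rust
// Hypothetical model of the 2x2 (numa_aware, use_huge_pages) dispatch
// described in the walkthrough. Variant names are illustrative.
#[derive(Debug, PartialEq)]
enum AllocStrategy {
    /// Original single path: cuMemHostAlloc via cudarc.
    CudaHostAlloc,
    /// mmap anonymous huge pages, then cudaHostRegister.
    HugePageMmap,
    /// NUMA-aware allocation, then pinned.
    NumaAware,
    /// NUMA-aware allocation backed by huge pages.
    NumaAwareHugePages,
}

fn select_strategy(numa_aware: bool, use_huge_pages: bool) -> AllocStrategy {
    match (numa_aware, use_huge_pages) {
        (false, false) => AllocStrategy::CudaHostAlloc,
        (false, true) => AllocStrategy::HugePageMmap,
        (true, false) => AllocStrategy::NumaAware,
        (true, true) => AllocStrategy::NumaAwareHugePages,
    }
}

fn main() {
    // Each flag combination maps to exactly one allocation path.
    assert_eq!(select_strategy(false, false), AllocStrategy::CudaHostAlloc);
    assert_eq!(select_strategy(false, true), AllocStrategy::HugePageMmap);
    assert_eq!(select_strategy(true, false), AllocStrategy::NumaAware);
    assert_eq!(select_strategy(true, true), AllocStrategy::NumaAwareHugePages);
}
```

Keeping the dispatch exhaustive over both flags means the compiler rejects any new flag combination that lacks an allocation path.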
Pre-merge checks: ❌ Failed checks (1 warning), ✅ Passed checks (2 passed)
Actionable comments posted: 1
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
lib/llm/src/block_manager/storage/cuda.rs (2 hunks)
🧰 Additional context used
🧠 Learnings (2)
📓 Common learnings
Learnt from: oandreeva-nv
Repo: ai-dynamo/dynamo PR: 2989
File: lib/llm/src/block_manager/block/transfer/cuda.rs:416-420
Timestamp: 2025-09-18T21:49:28.906Z
Learning: Pinned memory (page-locked host memory) being accessible by GPU via DMA doesn't change the fact that host and device pointers exist in separate virtual address spaces, so overlap checks between host and device pointers are still invalid regardless of whether the host memory is pinned.
Applied to files:
lib/llm/src/block_manager/storage/cuda.rs
🧬 Code graph analysis (1)
lib/llm/src/block_manager/storage/cuda.rs (2)
  lib/llm/src/block_manager/numa_allocator.rs (2): is_numa_enabled (15-19), std (56-56)
  lib/llm/src/block_manager/numa_allocator/worker_pool.rs (1): global (311-314)
if cudaHostRegister(ptr, size, 0) != cudaError::cudaSuccess {
    return Err(StorageError::AllocationFailed(
        "Failed to register memory".into(),
    ));
}

ptr as *mut u8
}
Wrong teardown for huge-page allocations

In the huge-page branch we mmap anonymous pages and register them with cudaHostRegister, but Drop still calls cudarc::driver::result::free_host, which is only valid for pointers returned by cuMemHostAlloc. Calling it on an mmap'd region is undefined behaviour and leaks the huge-page reservation. We also skip cudaHostUnregister / munmap on both the success case and the error path where cudaHostRegister fails, so the mapping lives forever.
Please track whether the allocation came from the huge-page path, call cudaHostUnregister + munmap on drop, and unwind the mapping whenever registration fails. Something along these lines should address it:
@@
-use cudarc::runtime::sys::{cudaError, cudaHostRegister};
+use cudarc::runtime::sys::{cudaError, cudaHostRegister, cudaHostUnregister};
@@
pub struct PinnedStorage {
ptr: u64,
size: usize,
handles: RegistrationHandles,
ctx: Arc<CudaContext>,
+ huge_page_alloc: bool,
}
@@
- let ptr = match (numa_aware, use_huge_pages) {
+ let mut huge_page_alloc = false;
+ let ptr = match (numa_aware, use_huge_pages) {
@@
(false, true) => {
let ptr = nix::libc::mmap(
@@
if ptr == MAP_FAILED {
return Err(StorageError::AllocationFailed(
"Failed to allocate pinned memory".into(),
));
}
- if cudaHostRegister(ptr, size, 0) != cudaError::cudaSuccess {
- return Err(StorageError::AllocationFailed(
- "Failed to register memory".into(),
- ));
+ if cudaHostRegister(ptr, size, 0) != cudaError::cudaSuccess {
+ unsafe { nix::libc::munmap(ptr, size) };
+ return Err(StorageError::AllocationFailed(
+ "Failed to register memory".into(),
+ ));
}
+ huge_page_alloc = true;
ptr as *mut u8
}
@@
Ok(Self {
ptr,
size,
handles: RegistrationHandles::new(),
ctx: ctx.clone(),
+ huge_page_alloc,
})
@@
fn drop(&mut self) {
self.handles.release();
- unsafe { cudarc::driver::result::free_host(self.ptr as _) }.unwrap();
+ unsafe {
+ if self.huge_page_alloc {
+ cudaHostUnregister(self.ptr as _);
+ nix::libc::munmap(self.ptr as *mut _, self.size);
+ } else {
+ cudarc::driver::result::free_host(self.ptr as _).unwrap();
+ }
+ }
}

Also applies to: 259-261
🤖 Prompt for AI Agents
In lib/llm/src/block_manager/storage/cuda.rs around lines 228-235 (and similarly
lines 259-261), the huge-page allocation branch mmaps anonymous memory and
registers it with cudaHostRegister but the Drop implementation and error paths
still call cudarc::driver::result::free_host (valid only for cuMemHostAlloc),
and they omit cudaHostUnregister/munmap, causing UB and leaks; change the
allocation tracking to record whether the pointer came from the huge-page (mmap)
path, and on Drop call cudaHostUnregister then munmap for huge-page allocations
(instead of free_host), ensure that if cudaHostRegister fails you immediately
call munmap to unwind the mapping before returning Err, and keep using free_host
only for cuMemHostAlloc-backed allocations so both success and error paths
properly clean up.
Overview:
Details:
Where should the reviewer start?
Related Issues: (use one of the action keywords Closes / Fixes / Resolves / Relates to)
Summary by CodeRabbit

DYN_KVBM_USE_HUGE_PAGES environment variable for memory optimization tuning.
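A possible way to exercise the new flag, assuming it is read as a boolean-style environment variable; the accepted values are not shown in this excerpt, so "1" is an assumption:

```shell
# Hypothetical usage: opt in to huge-page-backed pinned allocations.
# The value "1" is assumed; this excerpt does not show the parser.
export DYN_KVBM_USE_HUGE_PAGES=1

# Note: huge pages must be reserved by the OS beforehand, e.g. (as root):
#   echo 1024 > /proc/sys/vm/nr_hugepages
echo "DYN_KVBM_USE_HUGE_PAGES=${DYN_KVBM_USE_HUGE_PAGES}"
```

If the kernel has no huge pages reserved, the mmap path in the diff above fails and the code falls back through its error handling, so reserving pages is part of enabling the feature.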