
Conversation

Contributor

@jthomson04 jthomson04 commented Nov 7, 2025

Overview:

Details:

Where should the reviewer start?

Related Issues: (use one of the action keywords Closes / Fixes / Resolves / Relates to)

  • closes GitHub issue: #xxx

Summary by CodeRabbit

  • Performance Improvements
    • Enhanced pinned memory allocation with optional NUMA-aware and huge-pages support.
    • Introduced DYN_KVBM_USE_HUGE_PAGES environment variable for memory optimization tuning.
    • Improved memory allocation robustness with intelligent fallback mechanisms and better error handling.

Signed-off-by: jthomson04 <[email protected]>
@jthomson04 jthomson04 requested a review from a team as a code owner November 7, 2025 17:32
@github-actions github-actions bot added the feat label Nov 7, 2025
Contributor

coderabbitai bot commented Nov 7, 2025

Walkthrough

Enhances pinned memory allocation in the CUDA block manager with NUMA-awareness and huge-pages support. Replaces the single-path allocation with conditional logic (a 2x2 match on the numa_aware and use_huge_pages flags), each path providing a different allocation strategy with its own error handling and fallbacks.

Changes

Cohort / File(s): CUDA Memory Allocation Enhancement (lib/llm/src/block_manager/storage/cuda.rs)
Summary: Adds NUMA-awareness and huge-pages support to pinned memory allocation, with huge pages enabled via the DYN_KVBM_USE_HUGE_PAGES environment variable. Replaces the single-path allocation with four conditional paths: an InvalidConfig error when both features are enabled, the NUMA allocator with a host-malloc fallback when only NUMA is enabled, mmap-based huge-page allocation registered via cudaHostRegister when only huge pages are enabled, and standard malloc_host when neither is enabled. Updates error handling, logging, and imports for cudaError, cudaHostRegister, and the mmap constants.
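
For orientation, the selection logic can be pictured with a minimal, self-contained Rust sketch. DYN_KVBM_USE_HUGE_PAGES is the variable named in this PR, but the parsing rule, the AllocationPath names, and main are illustrative assumptions, not the PR's actual code.

use std::env;

// The four allocation strategies described in the summary above.
// (Illustrative names; the PR returns real allocations, not an enum.)
#[derive(Debug, PartialEq)]
enum AllocationPath {
    InvalidConfig,    // both NUMA-awareness and huge pages requested
    NumaWithFallback, // NUMA allocator, falling back to host malloc
    MmapHugePages,    // mmap + cudaHostRegister
    MallocHost,       // standard pinned allocation
}

// Assumed parsing: treat "1" or "true" as enabled. The PR may differ.
fn use_huge_pages_from_env() -> bool {
    env::var("DYN_KVBM_USE_HUGE_PAGES")
        .map(|v| v == "1" || v.eq_ignore_ascii_case("true"))
        .unwrap_or(false)
}

// The 2x2 match the walkthrough refers to.
fn choose_path(numa_aware: bool, use_huge_pages: bool) -> AllocationPath {
    match (numa_aware, use_huge_pages) {
        (true, true) => AllocationPath::InvalidConfig,
        (true, false) => AllocationPath::NumaWithFallback,
        (false, true) => AllocationPath::MmapHugePages,
        (false, false) => AllocationPath::MallocHost,
    }
}

fn main() {
    let path = choose_path(false, use_huge_pages_from_env());
    println!("selected allocation path: {path:?}");
}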

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

  • Review conditional logic across all four allocation paths (NUMA + huge pages combinations) for correctness and mutual exclusivity
  • Verify low-level CUDA and mmap operations, including cudaHostRegister behavior and memory registration error handling (a libc-only sketch of the mmap shape follows this list)
  • Examine NUMA allocator fallback logic and host malloc error recovery paths
  • Validate environment variable parsing and configuration validation (InvalidConfig scenario)
  • Check proper resource cleanup and potential memory leaks in failure paths
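
As a reference point for the mmap items above, here is a libc-only sketch of the huge-page allocation shape on Linux. The flag combination and error wording are assumptions, and the cudaHostRegister call (plus the munmap unwind requested in the review comment below) appears only as comments so the sketch runs without CUDA; it returns an error unless huge pages are configured on the machine.

use std::ptr;

// Map `size` bytes of anonymous huge-page-backed memory, or report failure.
fn alloc_huge_pages(size: usize) -> Result<*mut u8, String> {
    let ptr = unsafe {
        libc::mmap(
            ptr::null_mut(),
            size,
            libc::PROT_READ | libc::PROT_WRITE,
            libc::MAP_PRIVATE | libc::MAP_ANONYMOUS | libc::MAP_HUGETLB,
            -1,
            0,
        )
    };
    if ptr == libc::MAP_FAILED {
        return Err("mmap(MAP_HUGETLB) failed; are huge pages configured?".into());
    }

    // In the PR, cudaHostRegister(ptr, size, 0) would run here. If registration
    // fails, the mapping has to be unwound before returning the error:
    //     unsafe { libc::munmap(ptr, size) };

    Ok(ptr as *mut u8)
}

fn main() {
    let size = 2 * 1024 * 1024; // 2 MiB, the common default huge-page size on x86_64 Linux
    match alloc_huge_pages(size) {
        Ok(p) => {
            println!("mapped huge pages at {p:p}");
            unsafe { libc::munmap(p as *mut libc::c_void, size) };
        }
        Err(e) => eprintln!("{e}"),
    }
}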

Poem

🐰 Whiskers twitch with NUMA glee,
Huge pages dance across the sea,
Four paths now for memory's flow,
Pinned and swift, our buffers glow!
Smart allocation, oh what a show! 🎉

Pre-merge checks

❌ Failed checks (1 warning)
  • Description check ⚠️ Warning: The pull request description is still the template, with empty sections and placeholder values, and contains no concrete information about the changes. Resolution: fill in all sections with concrete details: summarize the NUMA and huge-pages implementation, explain the 2x2 allocation strategy, name the modified file (cuda.rs), and replace the placeholder issue reference with an actual issue number.
✅ Passed checks (2 passed)
  • Title check ✅ Passed: The title accurately reflects the main change: enabling fast pinned memory allocation in KVBM with the addition of NUMA-awareness and huge-pages support.
  • Docstring Coverage ✅ Passed: Docstring coverage is 100.00%, which meets the required threshold of 80.00%.


Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between f509493 and afdd18d.

📒 Files selected for processing (1)
  • lib/llm/src/block_manager/storage/cuda.rs (2 hunks)
🧰 Additional context used
🧠 Learnings (2)
📓 Common learnings
Learnt from: oandreeva-nv
Repo: ai-dynamo/dynamo PR: 2989
File: lib/llm/src/block_manager/block/transfer/cuda.rs:416-420
Timestamp: 2025-09-18T21:49:28.906Z
Learning: Pinned memory (page-locked host memory) being accessible by GPU via DMA doesn't change the fact that host and device pointers exist in separate virtual address spaces, so overlap checks between host and device pointers are still invalid regardless of whether the host memory is pinned.
📚 Learning: 2025-09-18T21:49:28.906Z
Learnt from: oandreeva-nv
Repo: ai-dynamo/dynamo PR: 2989
File: lib/llm/src/block_manager/block/transfer/cuda.rs:416-420
Timestamp: 2025-09-18T21:49:28.906Z
Learning: Pinned memory (page-locked host memory) being accessible by GPU via DMA doesn't change the fact that host and device pointers exist in separate virtual address spaces, so overlap checks between host and device pointers are still invalid regardless of whether the host memory is pinned.

Applied to files:

  • lib/llm/src/block_manager/storage/cuda.rs
🧬 Code graph analysis (1)
lib/llm/src/block_manager/storage/cuda.rs (2)
lib/llm/src/block_manager/numa_allocator.rs (2)
  • is_numa_enabled (15-19)
  • std (56-56)
lib/llm/src/block_manager/numa_allocator/worker_pool.rs (1)
  • global (311-314)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (17)
  • GitHub Check: trtllm (arm64)
  • GitHub Check: trtllm (amd64)
  • GitHub Check: sglang (arm64)
  • GitHub Check: operator (arm64)
  • GitHub Check: sglang (amd64)
  • GitHub Check: vllm (amd64)
  • GitHub Check: operator (amd64)
  • GitHub Check: vllm (arm64)
  • GitHub Check: clippy (lib/runtime/examples)
  • GitHub Check: tests (.)
  • GitHub Check: tests (lib/bindings/python)
  • GitHub Check: tests (lib/runtime/examples)
  • GitHub Check: clippy (lib/bindings/python)
  • GitHub Check: tests (launch/dynamo-run)
  • GitHub Check: clippy (launch/dynamo-run)
  • GitHub Check: clippy (.)
  • GitHub Check: Build and Test - dynamo

Comment on lines +228 to +235
                    if cudaHostRegister(ptr, size, 0) != cudaError::cudaSuccess {
                        return Err(StorageError::AllocationFailed(
                            "Failed to register memory".into(),
                        ));
                    }

                    ptr as *mut u8
                }

⚠️ Potential issue | 🔴 Critical

Wrong teardown for huge-page allocations

In the huge-page branch we mmap anonymous pages and register them with cudaHostRegister, but Drop still calls cudarc::driver::result::free_host, which is only valid for pointers returned by cuMemHostAlloc. Doing that on an mmap'd region is undefined behaviour and leaks the huge-page reservation. We also skip cudaHostUnregister / munmap on both the success case and the error path where cudaHostRegister fails, so the mapping lives forever.

Please track whether the allocation came from the huge-page path, call cudaHostUnregister + munmap on drop, and unwind the mapping whenever registration fails. Something along these lines should address it:

@@
-use cudarc::runtime::sys::{cudaError, cudaHostRegister};
+use cudarc::runtime::sys::{cudaError, cudaHostRegister, cudaHostUnregister};
@@
 pub struct PinnedStorage {
     ptr: u64,
     size: usize,
     handles: RegistrationHandles,
     ctx: Arc<CudaContext>,
+    huge_page_alloc: bool,
 }
@@
-            let ptr = match (numa_aware, use_huge_pages) {
+            let mut huge_page_alloc = false;
+            let ptr = match (numa_aware, use_huge_pages) {
@@
                 (false, true) => {
                     let ptr = nix::libc::mmap(
@@
                     if ptr == MAP_FAILED {
                         return Err(StorageError::AllocationFailed(
                             "Failed to allocate pinned memory".into(),
                         ));
                     }
 
-                    if cudaHostRegister(ptr, size, 0) != cudaError::cudaSuccess {
-                        return Err(StorageError::AllocationFailed(
-                            "Failed to register memory".into(),
-                        ));
+                    if cudaHostRegister(ptr, size, 0) != cudaError::cudaSuccess {
+                        unsafe { nix::libc::munmap(ptr, size) };
+                        return Err(StorageError::AllocationFailed(
+                            "Failed to register memory".into(),
+                        ));
                     }
 
+                    huge_page_alloc = true;
                     ptr as *mut u8
                 }
@@
             Ok(Self {
                 ptr,
                 size,
                 handles: RegistrationHandles::new(),
                 ctx: ctx.clone(),
+                huge_page_alloc,
             })
@@
     fn drop(&mut self) {
         self.handles.release();
-        unsafe { cudarc::driver::result::free_host(self.ptr as _) }.unwrap();
+        unsafe {
+            if self.huge_page_alloc {
+                cudaHostUnregister(self.ptr as _);
+                nix::libc::munmap(self.ptr as *mut _, self.size);
+            } else {
+                cudarc::driver::result::free_host(self.ptr as _).unwrap();
+            }
+        }
     }

Also applies to: 259-261
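
To make the suggested teardown split concrete outside the diff context, here is a self-contained sketch. Backing and PinnedRegion are illustrative names (the diff itself just adds a huge_page_alloc: bool to PinnedStorage), and the CUDA calls are left as comments so the example runs without a GPU; only the munmap half is real here.

use std::ptr;

// Records which allocator produced the pointer so Drop can release it correctly.
#[allow(dead_code)]
enum Backing {
    HostAlloc,    // from cuMemHostAlloc / malloc_host: release with free_host
    HugePageMmap, // from mmap + cudaHostRegister: unregister, then munmap
}

struct PinnedRegion {
    ptr: *mut u8,
    size: usize,
    backing: Backing,
}

impl Drop for PinnedRegion {
    fn drop(&mut self) {
        match self.backing {
            Backing::HostAlloc => {
                // cudarc::driver::result::free_host(self.ptr as _) in the real code.
            }
            Backing::HugePageMmap => {
                // cudaHostUnregister(self.ptr as _) would run first in the real code.
                unsafe { libc::munmap(self.ptr as *mut libc::c_void, self.size) };
            }
        }
    }
}

fn main() {
    // Map an anonymous region (regular pages, for demonstration) and let Drop
    // tear it down through the mmap branch.
    let size = 2 * 1024 * 1024;
    let ptr = unsafe {
        libc::mmap(
            ptr::null_mut(),
            size,
            libc::PROT_READ | libc::PROT_WRITE,
            libc::MAP_PRIVATE | libc::MAP_ANONYMOUS,
            -1,
            0,
        )
    };
    assert_ne!(ptr, libc::MAP_FAILED);
    let _region = PinnedRegion { ptr: ptr as *mut u8, size, backing: Backing::HugePageMmap };
    // _region goes out of scope here; Drop performs the munmap.
}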

🤖 Prompt for AI Agents
In lib/llm/src/block_manager/storage/cuda.rs around lines 228-235 (and similarly
lines 259-261), the huge-page allocation branch mmaps anonymous memory and
registers it with cudaHostRegister but the Drop implementation and error paths
still call cudarc::driver::result::free_host (valid only for cuMemHostAlloc),
and they omit cudaHostUnregister/munmap, causing UB and leaks; change the
allocation tracking to record whether the pointer came from the huge-page (mmap)
path, and on Drop call cudaHostUnregister then munmap for huge-page allocations
(instead of free_host), ensure that if cudaHostRegister fails you immediately
call munmap to unwind the mapping before returning Err, and keep using free_host
only for cuMemHostAlloc-backed allocations so both success and error paths
properly clean up.

