@ykwd ykwd commented Jul 18, 2025

Content

Import a new offset allocator to address the following issues:

  1. Support memory allocations larger than 16 MB, which helps eliminate redundant memcpy calls in the non-zero-copy get interfaces. See #624, "refactor(store): replace manual slice management with offset_allocator".
  2. Automatically merge freed memory to avoid the bug reported in #632, "[Feature Request]: does mooncacke-store need a cachelib rebalance strategy".

Compared with the original implementation, the imported OffsetAllocator is wrapped to support registering or allocating sizes larger than 3.75 GB, and is further optimized for LLM inference scenarios. The optimization:
a) slightly decreases the memory utilization ratio in general cases;
b) makes no difference when the allocated size equals a bin size;
c) greatly improves the memory utilization ratio when the allocated sizes are mostly uniform and not equal to any bin size.
LLM inference tasks mostly put objects of uniform size, which falls into the latter two cases.
See the following benchmark reports for details.

Result

  • alloc size: the size of each object, in bytes
  • utilization ratio: total allocated size / total space
  • time: time in nanoseconds per object allocation
  • OffsetAllocator optimization: whether the allocated size is rounded up to a bin size
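
The round-up named in the last bullet can be sketched as follows. This is a minimal illustration, not the code from this PR: it assumes bins are spaced like a small float with 3 mantissa bits (eight bins per power of two), which is consistent with the ratios reported below (for example 111 / 112 ≈ 0.991071), and `round_up_to_bin` is a hypothetical name.

```cpp
#include <cassert>
#include <cstdint>

// Assumption: 3 mantissa bits, i.e. eight evenly spaced bins per power of two.
constexpr int kMantissaBits = 3;

// Hypothetical helper: round `size` up to the nearest bin size.
std::uint64_t round_up_to_bin(std::uint64_t size) {
    const std::uint64_t bins_per_octave = std::uint64_t{1} << kMantissaBits;
    // Sizes below 8 each get their own exact bin.
    if (size < bins_per_octave) return size;
    // Keep the sketch free of overflow in the addition below.
    assert(size <= (std::uint64_t{1} << 62));

    // Find the index of the highest set bit.
    int high = 63;
    while (((size >> high) & 1u) == 0) --high;

    // Within [2^high, 2^(high+1)) the bins are spaced 2^(high-3) apart.
    const std::uint64_t step = std::uint64_t{1} << (high - kMantissaBits);
    // Round up to the next multiple of that spacing.
    return (size + step - 1) & ~(step - 1);
}
```

Under this assumption, a uniform workload of size s reaches a utilization ratio of at most s / round_up_to_bin(s): 49 rounds up to 52 and 145 rounds up to 160, while sizes that already equal a bin size, such as 32 or 28, are unchanged.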

Uniform size, size equals power of 2

OffsetAllocator (After Optimization)

Alloc size: 32, min util ratio: 1, avg util ratio: 1, time: 544 ns
Alloc size: 128, min util ratio: 1, avg util ratio: 1, time: 417 ns
Alloc size: 512, min util ratio: 1, avg util ratio: 1, time: 174 ns
Alloc size: 2048, min util ratio: 1, avg util ratio: 1, time: 406 ns
Alloc size: 8192, min util ratio: 1, avg util ratio: 1, time: 180 ns
Alloc size: 32768, min util ratio: 1, avg util ratio: 1, time: 133 ns
Alloc size: 131072, min util ratio: 1, avg util ratio: 1, time: 109 ns
Alloc size: 524288, min util ratio: 1, avg util ratio: 1, time: 100 ns
Alloc size: 2097152, min util ratio: 1, avg util ratio: 1, time: 99 ns
Alloc size: 8388608, min util ratio: 1, avg util ratio: 1, time: 99 ns
Alloc size: 33554432, min util ratio: 1, avg util ratio: 1, time: 98 ns

OffsetAllocator (Before Optimization)

Alloc size: 32, min util ratio: 1, avg util ratio: 1, time: 539 ns
Alloc size: 128, min util ratio: 1, avg util ratio: 1, time: 419 ns
Alloc size: 512, min util ratio: 1, avg util ratio: 1, time: 217 ns
Alloc size: 2048, min util ratio: 1, avg util ratio: 1, time: 408 ns
Alloc size: 8192, min util ratio: 1, avg util ratio: 1, time: 175 ns
Alloc size: 32768, min util ratio: 1, avg util ratio: 1, time: 130 ns
Alloc size: 131072, min util ratio: 1, avg util ratio: 1, time: 107 ns
Alloc size: 524288, min util ratio: 1, avg util ratio: 1, time: 99 ns
Alloc size: 2097152, min util ratio: 1, avg util ratio: 1, time: 100 ns
Alloc size: 8388608, min util ratio: 1, avg util ratio: 1, time: 98 ns
Alloc size: 33554432, min util ratio: 1, avg util ratio: 1, time: 98 ns

Uniform size, size equals power of 2 +/- 17

OffsetAllocator (After Optimization)

Alloc size: 15, min util ratio: 1, avg util ratio: 1, time: 568 ns
Alloc size: 111, min util ratio: 0.991071, avg util ratio: 0.991071, time: 441 ns
Alloc size: 495, min util ratio: 0.966797, avg util ratio: 0.966797, time: 178 ns
Alloc size: 2031, min util ratio: 0.991699, avg util ratio: 0.991699, time: 418 ns
Alloc size: 8175, min util ratio: 0.997925, avg util ratio: 0.997925, time: 170 ns
Alloc size: 32751, min util ratio: 0.999481, avg util ratio: 0.999481, time: 133 ns
Alloc size: 131055, min util ratio: 0.99987, avg util ratio: 0.99987, time: 109 ns
Alloc size: 524271, min util ratio: 0.999968, avg util ratio: 0.999968, time: 100 ns
Alloc size: 2097135, min util ratio: 0.999992, avg util ratio: 0.999992, time: 99 ns
Alloc size: 8388591, min util ratio: 0.999998, avg util ratio: 0.999998, time: 98 ns
Alloc size: 33554415, min util ratio: 0.999999, avg util ratio: 0.999999, time: 99 ns
Alloc size: 49, min util ratio: 0.942308, avg util ratio: 0.942308, time: 508 ns
Alloc size: 145, min util ratio: 0.906249, avg util ratio: 0.906249, time: 372 ns
Alloc size: 529, min util ratio: 0.918399, avg util ratio: 0.918399, time: 172 ns
Alloc size: 2065, min util ratio: 0.896267, avg util ratio: 0.896267, time: 403 ns
Alloc size: 8209, min util ratio: 0.89073, avg util ratio: 0.89073, time: 174 ns
Alloc size: 32785, min util ratio: 0.889347, avg util ratio: 0.889347, time: 131 ns
Alloc size: 131089, min util ratio: 0.88897, avg util ratio: 0.88897, time: 105 ns
Alloc size: 524305, min util ratio: 0.888701, avg util ratio: 0.888701, time: 102 ns
Alloc size: 2097169, min util ratio: 0.888679, avg util ratio: 0.888679, time: 100 ns
Alloc size: 8388625, min util ratio: 0.886721, avg util ratio: 0.886721, time: 100 ns
Alloc size: 33554449, min util ratio: 0.875, avg util ratio: 0.875, time: 100 ns

OffsetAllocator (Before Optimization)

Alloc size: 15, min util ratio: 1, avg util ratio: 1, time: 566 ns
Alloc size: 111, min util ratio: 0.669866, avg util ratio: 0.710845, time: 703 ns
Alloc size: 495, min util ratio: 0.665779, avg util ratio: 0.676874, time: 238 ns
Alloc size: 2031, min util ratio: 0.668333, avg util ratio: 0.705411, time: 637 ns
Alloc size: 8175, min util ratio: 0.666175, avg util ratio: 0.676474, time: 242 ns
Alloc size: 32751, min util ratio: 0.664435, avg util ratio: 0.669078, time: 168 ns
Alloc size: 131055, min util ratio: 0.66062, avg util ratio: 0.667341, time: 124 ns
Alloc size: 524271, min util ratio: 0.653055, avg util ratio: 0.666993, time: 118 ns
Alloc size: 2097135, min util ratio: 0.64062, avg util ratio: 0.666873, time: 116 ns
Alloc size: 8388591, min util ratio: 0.605468, avg util ratio: 0.667812, time: 115 ns
Alloc size: 33554415, min util ratio: 0.5625, avg util ratio: 0.670944, time: 116 ns
Alloc size: 49, min util ratio: 0.692229, avg util ratio: 0.753062, time: 1122 ns
Alloc size: 145, min util ratio: 0.667789, avg util ratio: 0.700907, time: 572 ns
Alloc size: 529, min util ratio: 0.66577, avg util ratio: 0.676238, time: 238 ns
Alloc size: 2065, min util ratio: 0.667926, avg util ratio: 0.704884, time: 632 ns
Alloc size: 8209, min util ratio: 0.665708, avg util ratio: 0.676372, time: 239 ns
Alloc size: 32785, min util ratio: 0.664224, avg util ratio: 0.669058, time: 168 ns
Alloc size: 131089, min util ratio: 0.659631, avg util ratio: 0.667287, time: 129 ns
Alloc size: 524305, min util ratio: 0.652609, avg util ratio: 0.666884, time: 122 ns
Alloc size: 2097169, min util ratio: 0.638677, avg util ratio: 0.666516, time: 120 ns
Alloc size: 8388625, min util ratio: 0.60547, avg util ratio: 0.665131, time: 121 ns
Alloc size: 33554449, min util ratio: 0.546875, avg util ratio: 0.660917, time: 120 ns

Uniform size, size equals power of 2 multiplied by 0.9 or 1.1

OffsetAllocator (After Optimization)

Alloc size: 28, min util ratio: 1, avg util ratio: 1, time: 543 ns
Alloc size: 115, min util ratio: 0.958333, avg util ratio: 0.958333, time: 418 ns
Alloc size: 460, min util ratio: 0.958332, avg util ratio: 0.958332, time: 189 ns
Alloc size: 1843, min util ratio: 0.959896, avg util ratio: 0.959896, time: 418 ns
Alloc size: 7372, min util ratio: 0.959895, avg util ratio: 0.959895, time: 197 ns
Alloc size: 29491, min util ratio: 0.959993, avg util ratio: 0.959993, time: 135 ns
Alloc size: 117964, min util ratio: 0.959979, avg util ratio: 0.959979, time: 111 ns
Alloc size: 471859, min util ratio: 0.959985, avg util ratio: 0.959985, time: 100 ns
Alloc size: 1887436, min util ratio: 0.959765, avg util ratio: 0.959765, time: 99 ns
Alloc size: 7549747, min util ratio: 0.959766, avg util ratio: 0.959766, time: 99 ns
Alloc size: 30198988, min util ratio: 0.95625, avg util ratio: 0.95625, time: 99 ns
Alloc size: 35, min util ratio: 0.972222, avg util ratio: 0.972222, time: 531 ns
Alloc size: 140, min util ratio: 0.972222, avg util ratio: 0.972222, time: 397 ns
Alloc size: 563, min util ratio: 0.977427, avg util ratio: 0.977427, time: 180 ns
Alloc size: 2252, min util ratio: 0.97743, avg util ratio: 0.97743, time: 389 ns
Alloc size: 9011, min util ratio: 0.977752, avg util ratio: 0.977752, time: 183 ns
Alloc size: 36044, min util ratio: 0.977752, avg util ratio: 0.977752, time: 133 ns
Alloc size: 144179, min util ratio: 0.977739, avg util ratio: 0.977739, time: 106 ns
Alloc size: 576716, min util ratio: 0.977538, avg util ratio: 0.977538, time: 103 ns
Alloc size: 2306867, min util ratio: 0.977539, avg util ratio: 0.977539, time: 99 ns
Alloc size: 9227468, min util ratio: 0.975391, avg util ratio: 0.975391, time: 99 ns
Alloc size: 36909875, min util ratio: 0.9625, avg util ratio: 0.9625, time: 100 ns

OffsetAllocator (Before Optimization)

Alloc size: 28, min util ratio: 1, avg util ratio: 1, time: 539 ns
Alloc size: 115, min util ratio: 0.669299, avg util ratio: 0.709245, time: 701 ns
Alloc size: 460, min util ratio: 0.665825, avg util ratio: 0.677532, time: 255 ns
Alloc size: 1843, min util ratio: 0.669352, avg util ratio: 0.709202, time: 691 ns
Alloc size: 7372, min util ratio: 0.66619, avg util ratio: 0.677401, time: 260 ns
Alloc size: 29491, min util ratio: 0.664311, avg util ratio: 0.669511, time: 172 ns
Alloc size: 117964, min util ratio: 0.661812, avg util ratio: 0.667356, time: 133 ns
Alloc size: 471859, min util ratio: 0.654345, avg util ratio: 0.667048, time: 123 ns
Alloc size: 1887436, min util ratio: 0.640722, avg util ratio: 0.666447, time: 121 ns
Alloc size: 7549747, min util ratio: 0.611719, avg util ratio: 0.666847, time: 119 ns
Alloc size: 30198988, min util ratio: 0.548437, avg util ratio: 0.669799, time: 125 ns
Alloc size: 35, min util ratio: 0.7098, avg util ratio: 0.774162, time: 1306 ns
Alloc size: 140, min util ratio: 0.667934, avg util ratio: 0.702151, time: 599 ns
Alloc size: 563, min util ratio: 0.665599, avg util ratio: 0.675548, time: 239 ns
Alloc size: 2252, min util ratio: 0.667371, avg util ratio: 0.701623, time: 601 ns
Alloc size: 9011, min util ratio: 0.665485, avg util ratio: 0.675528, time: 244 ns
Alloc size: 36044, min util ratio: 0.663248, avg util ratio: 0.668912, time: 170 ns
Alloc size: 144179, min util ratio: 0.660308, avg util ratio: 0.666934, time: 127 ns
Alloc size: 576716, min util ratio: 0.654467, avg util ratio: 0.66679, time: 122 ns
Alloc size: 2306867, min util ratio: 0.633789, avg util ratio: 0.666159, time: 121 ns
Alloc size: 9227468, min util ratio: 0.597266, avg util ratio: 0.666037, time: 118 ns
Alloc size: 36909875, min util ratio: 0.55, avg util ratio: 0.669564, time: 121 ns

Random Size

OffsetAllocator (After Optimization)

util ratio (min / p99 / p90 / p50 / max / avg): 0.544250 / 0.713338 / 0.779739 / 0.847867 / 0.952591 / 0.841576
avg alloc time: 145.575738 ns/op

OffsetAllocator (Before Optimization)

util ratio (min / p99 / p90 / p50 / max / avg): 0.569255 / 0.712076 / 0.781224 / 0.855046 / 0.976057 / 0.848873
avg alloc time: 142.508508 ns/op

xiaguan and others added 15 commits July 15, 2025 17:44
- remove SliceGuard and SliceBuffer RAII classes
- replace SimpleAllocator with offset_allocator::Allocator
- use AllocationHandle for automatic memory management
- simplify buffer allocation with single contiguous allocations
- add offset_allocator implementation in new directory
- update CMakeLists to include new allocator source files
- add comprehensive test suite for offset_allocator

Signed-off-by: Jinyang Su <[email protected]>
@ykwd ykwd requested a review from xiaguan July 18, 2025 04:25

xiaguan commented Jul 18, 2025

One thing is important, related to a lifetime bug (#639). I believe we could accept a unique_ptr resource with the allocator, as they should share the same lifetime. And we can hold a shared_ptr in the handle. This way, we can ensure the handle is always valid?


ykwd commented Jul 18, 2025

> One thing is important, related to a lifetime bug (#639). I believe we could accept a unique_ptr resource with the allocator, as they should share the same lifetime. And we can hold a shared_ptr in the handle. This way, we can ensure the handle is always valid?

This bug occurs because SliceBuffer holds a reference to DistributedObjectStore and attempts to release memory without being aware of its lifetime. The proposed fix adds complexity by introducing a three-layer if condition check. In contrast, by following the design of AllocatedBuffer, using a weak_ptr is sufficient to prevent double-freeing memory after the allocator has been destroyed. As for handle validation, it should depend on the lifetime of the actual underlying memory, which is not managed by the allocator itself.
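
The weak_ptr idea in this reply can be illustrated with a small sketch. It is not the project's AllocatedBuffer: `ToyAllocator` and `ToyHandle` are hypothetical names, and the allocator here only counts live allocations. The point is that the handle releases its allocation only when the allocator can still be locked.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>

// Hypothetical allocator that only counts live allocations.
class ToyAllocator {
public:
    std::size_t allocate() { ++live_; return next_offset_++; }
    void release(std::size_t /*offset*/) {
        assert(live_ > 0);  // releasing more than was allocated is a bug
        --live_;
    }
    std::size_t live() const { return live_; }

private:
    std::size_t live_ = 0;
    std::size_t next_offset_ = 0;
};

// Handle that gives its allocation back only if the allocator still exists.
class ToyHandle {
public:
    explicit ToyHandle(const std::shared_ptr<ToyAllocator>& alloc)
        : allocator_(alloc), offset_(alloc->allocate()) {}
    ToyHandle(const ToyHandle&) = delete;
    ToyHandle& operator=(const ToyHandle&) = delete;

    ~ToyHandle() {
        // lock() yields nullptr once the allocator has been destroyed,
        // so nothing is released into a dead allocator.
        if (auto alloc = allocator_.lock()) {
            alloc->release(offset_);
        }
    }

    bool allocator_alive() const { return !allocator_.expired(); }

private:
    std::weak_ptr<ToyAllocator> allocator_;
    std::size_t offset_;
};
```

A handle that outlives the allocator simply becomes a no-op on destruction. Whether the underlying memory is still valid is a separate question, as the reply notes.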


xiaguan commented Jul 18, 2025

> One thing is important, related to a lifetime bug (#639). I believe we could accept a unique_ptr resource with the allocator, as they should share the same lifetime. And we can hold a shared_ptr in the handle. This way, we can ensure the handle is always valid?

> This bug occurs because SliceBuffer holds a reference to DistributedObjectStore and attempts to release memory without being aware of its lifetime. The proposed fix adds complexity by introducing a three-layer if condition check. In contrast, by following the design of AllocatedBuffer, using a weak_ptr is sufficient to prevent double-freeing memory after the allocator has been destroyed. As for handle validation, it should depend on the lifetime of the actual underlying memory, which is not managed by the allocator itself.

Fine, so I would wrap the allocator to bind any managed allocator and handle the lifetime of the resources.

@xiaguan xiaguan left a comment

LGTM

@xiaguan xiaguan requested a review from stmatengss July 18, 2025 07:13

ykwd commented Jul 18, 2025

> One thing is important, related to a lifetime bug (#639). I believe we could accept a unique_ptr resource with the allocator, as they should share the same lifetime. And we can hold a shared_ptr in the handle. This way, we can ensure the handle is always valid?

> This bug occurs because SliceBuffer holds a reference to DistributedObjectStore and attempts to release memory without being aware of its lifetime. The proposed fix adds complexity by introducing a three-layer if condition check. In contrast, by following the design of AllocatedBuffer, using a weak_ptr is sufficient to prevent double-freeing memory after the allocator has been destroyed. As for handle validation, it should depend on the lifetime of the actual underlying memory, which is not managed by the allocator itself.

> Fine, so I would wrap the allocator to bind any managed allocator and handle the lifetime of the resources.

Separating the responsibilities of the Allocator from the actual allocation and deallocation of physical memory via malloc is reasonable. On both the Master and Client sides, the lifecycle of the Allocator may not align with the lifecycle of the actual physical memory.

For example:

  • On the Master side, the Allocator may manage memory that was allocated by a Client.

  • On the Client side, the Allocator may manage memory passed down from an upper layer via register_buffer.

Currently, memory allocated by the Client itself is managed by SegmentDeleter, but any other memory (such as that allocated elsewhere) must still be managed by whoever performed the malloc for it.
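
The separation described above can be sketched roughly as follows. Everything here is hypothetical and simplified: `OffsetOnlyAllocator` is a bump allocator standing in for the real one, and `FreeDeleter` only plays the role the thread ascribes to SegmentDeleter. The allocator hands out offsets and never touches memory, so the same class works for memory the client owns and for a buffer borrowed from an upper layer.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <memory>

// Hypothetical deleter for memory the client allocated itself.
struct FreeDeleter {
    void operator()(void* p) const { std::free(p); }
};
using OwnedSegment = std::unique_ptr<void, FreeDeleter>;

// Case 1: client-owned memory; the unique_ptr frees it on destruction.
inline OwnedSegment make_owned_segment(std::size_t size) {
    return OwnedSegment(std::malloc(size));
}

// Offset bookkeeping only: this class never allocates or frees memory.
class OffsetOnlyAllocator {
public:
    explicit OffsetOnlyAllocator(std::size_t capacity) : capacity_(capacity) {}

    // Returns true and writes the offset on success, false when full.
    bool allocate(std::size_t size, std::size_t& offset) {
        if (size > capacity_ - used_) return false;
        offset = used_;
        used_ += size;
        return true;
    }

private:
    std::size_t capacity_;
    std::size_t used_ = 0;
};

// Case 2: for a borrowed buffer, whoever allocated `base` must keep it
// alive and free it; the allocator only supplies the offset.
inline void* resolve(void* base, std::size_t offset) {
    assert(base != nullptr);
    return static_cast<char*>(base) + offset;
}
```

In this sketch the lifetime of the bytes is decided entirely by the owner of the segment or buffer, which is why the allocator's lifetime does not have to match it.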

@ykwd ykwd merged commit 4c825d6 into kvcache-ai:main Jul 18, 2025
10 checks passed
@@ -0,0 +1,176 @@
# Allocator Memory Utilization Benchmark

Nit: Move it to doc dir?

Sounds good. Will do in next PR

@ykwd ykwd mentioned this pull request Aug 22, 2025
@ykwd ykwd deleted the offset_allocator branch September 25, 2025 03:01