feat(rdma): add parallel memory region registration support #855

staryxchen · 2025-09-17T13:16:03Z

Summary

This PR introduces a configurable parallel memory region registration feature with significant performance improvements for pre-allocated memory scenarios, while maintaining backward compatibility.

I conducted several tests to validate performance (test code is also attached).
perf.tar.gz

Test Configuration

Memory Size: 500GB (DRAM)
Test Scenarios:
- Pre-allocated memory (memory allocated and initialized)
- Non-pre-allocated memory (memory allocated but not initialized)
Configuration Options:
- With Optimization: MC_DISABLE_PARALLEL_REG_MR not set (parallel registration enabled)
- Without Optimization: MC_DISABLE_PARALLEL_REG_MR=1 (sequential registration)
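The toggle described above can be sketched as a small environment lookup. This is only an illustration: the helper name `parallel_reg_mr_enabled` is hypothetical, and treating any value other than `1` as "not disabled" is an assumption, since the description only covers the unset and `=1` cases. A later commit in this PR switches to `MC_ENABLE_PARALLEL_REG_MR` with the feature off by default.

```cpp
#include <cstdlib>
#include <cstring>

// Hypothetical helper, not taken from the patch: returns true when parallel
// MR registration should be used. Assumes the semantics from the test
// configuration: enabled unless MC_DISABLE_PARALLEL_REG_MR is set to "1".
inline bool parallel_reg_mr_enabled() {
    const char* value = std::getenv("MC_DISABLE_PARALLEL_REG_MR");
    if (value == nullptr) return true;    // variable not set: parallel path
    return std::strcmp(value, "1") != 0;  // "1" selects the sequential path
}
```

Reading the variable once at startup and caching the result would avoid repeated `getenv` calls on the registration path.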

Performance Results

Pre-allocated Memory Scenario

| Operation | With Optimization (Parallel) | Without Optimization (Sequential) | Performance Improvement |
| --- | --- | --- | --- |
| allocate_memory | 3.557 seconds | 3.810 seconds | N/A |
| register_memory | 11.662 seconds | 49.198 seconds | 321.9% faster |
| unregister_memory | 1.459 seconds | 11.754 seconds | 705.4% faster |
| Total Time | 17.353 seconds | 65.399 seconds | 276.8% faster |

Non-pre-allocated Memory Scenario

| Operation | With Optimization (Parallel) | Without Optimization (Sequential) | Performance Improvement |
| --- | --- | --- | --- |
| allocate_memory | 0.000 seconds | 0.001 seconds | N/A |
| register_memory | 461.999 seconds | 86.930 seconds | 431.3% slower |
| unregister_memory | 1.822 seconds | 11.498 seconds | 531.0% faster |
| Total Time | 464.469 seconds | 98.945 seconds | 369.4% slower |

I had the AI summarize and analyze the test results. Below is the AI's output:

Key Performance Findings

Pre-allocated Memory (500GB)

- Memory registration: 4.2x faster (49.2s → 11.7s)
- Memory unregistration: 8.1x faster (11.8s → 1.5s)
- Total operation time: 3.8x faster (65.4s → 17.4s)

Non-pre-allocated Memory (500GB)

- Memory registration: 5.3x slower (87s → 462s)
- Memory unregistration: 6.3x faster (11.5s → 1.8s)
- Total operation time: 4.7x slower (99s → 464s)
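The relationship between the "N% faster" figures in the tables and the "Nx" factors can be checked with a few lines of arithmetic. The sketch below is only a worked example; the function names are made up for illustration, and the percentage formula is the one that reproduces the "faster" rows of the tables.

```cpp
#include <cassert>

// Speedup factor of the parallel path relative to the sequential path.
// Above 1 means parallel is faster; below 1 means parallel is slower.
inline double speedup(double sequential_s, double parallel_s) {
    assert(sequential_s > 0.0 && parallel_s > 0.0);
    return sequential_s / parallel_s;
}

// "Percent faster" as used in the "faster" rows of the tables:
// (sequential - parallel) / parallel * 100, which equals (speedup - 1) * 100.
inline double percent_faster(double sequential_s, double parallel_s) {
    return (speedup(sequential_s, parallel_s) - 1.0) * 100.0;
}
```

For pre-allocated registration, `speedup(49.198, 11.662)` is about 4.22 and `percent_faster` is about 321.9, so "321.9% faster" and "4.2x faster" describe the same measurement. For the pre-allocated total, `speedup(65.399, 17.353)` is about 3.77. For non-pre-allocated registration, `speedup(86.930, 461.999)` is about 0.19, i.e. roughly 5.3x slower.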

Analysis

Pre-allocated memory benefits from parallel registration because:

- Pages are already faulted in and resident in physical memory, so registration only has to pin them
- Multiple RDMA contexts can be utilized simultaneously
- CPU cores are better utilized

Non-pre-allocated memory performs better with sequential registration because:

- It reduces memory paging overhead and I/O contention
- It avoids kernel-level resource conflicts during large allocations

Conclusion

The parallel memory registration optimization provides significant performance benefits for pre-allocated memory scenarios, with up to 8x improvement in unregistration performance. However, for large non-pre-allocated memory allocations, sequential registration performs better due to reduced resource contention and kernel overhead.

The MC_DISABLE_PARALLEL_REG_MR environment variable lets users choose the better strategy for their use case and memory allocation pattern.

- Add ``parallel_reg_mr`` config option with environment variable control
- Implement parallel registration/unregistration using std::async
- Maintain backward compatibility with sequential mode

Signed-off-by: staryxchen <[email protected]>
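As a rough illustration of the `std::async` approach named in the commit message, the sketch below fans one registration call out per context and then collects the results. It is not the code from this patch: `Context`, `register_on_all`, and the `register_one` callback are hypothetical names, and fanning out per RDMA context is an assumption based on the description.

```cpp
#include <cstddef>
#include <functional>
#include <future>
#include <vector>

// Hypothetical stand-in for one RDMA context (for example, one NIC).
struct Context { int id; };

// Caller-supplied registration callback; returns 0 on success. In a real
// transport this would wrap the per-context registration call.
using RegisterFn = std::function<int(const Context&, void*, std::size_t)>;

// Registers [addr, addr + length) on every context, sequentially or in parallel.
inline int register_on_all(const std::vector<Context>& contexts, void* addr,
                           std::size_t length, bool parallel,
                           const RegisterFn& register_one) {
    if (!parallel) {
        for (const Context& ctx : contexts) {
            int rc = register_one(ctx, addr, length);
            if (rc != 0) return rc;  // stop at the first failure
        }
        return 0;
    }
    std::vector<std::future<int>> pending;
    pending.reserve(contexts.size());
    for (const Context& ctx : contexts) {
        // std::launch::async requests a separate thread for each task.
        pending.push_back(std::async(
            std::launch::async, [&register_one, &ctx, addr, length] {
                return register_one(ctx, addr, length);
            }));
    }
    int first_error = 0;
    for (std::future<int>& task : pending) {
        int rc = task.get();  // wait for every task, even after a failure
        if (rc != 0 && first_error == 0) first_error = rc;
    }
    return first_error;
}
```

Keeping the sequential branch in the same function makes it easy to switch strategies with a single flag, which matches the backward-compatibility goal in the commit message.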

staryxchen · 2025-09-17T13:17:39Z

Related issue: #848

Signed-off-by: staryxchen <[email protected]>

xiaguan · 2025-09-18T02:36:27Z

The zip file you provided seems to be empty?

staryxchen · 2025-09-18T02:52:19Z

> The zip file you provided seems to be empty?

Sorry, I've updated the file. Please try again.

…sabled
- Default value of parallel_reg_mr changed from true to false
- Environment variable switched from MC_DISABLE_PARALLEL_REG_MR to MC_ENABLE_PARALLEL_REG_MR

Signed-off-by: staryxchen <[email protected]>

staryxchen · 2025-09-19T07:30:57Z

Hi @xiaguan
Since this patch may cause a regression when the registered memory is not pre-allocated, I have disabled this optimization by default.
By the way, my tests show that pre-allocating memory via touch-read does not eliminate the regression caused by parallel MR registration. The bottleneck appears to be page pinning. Could you further verify the effect of this patch when combined with the Mooncake Store?
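For readers unfamiliar with the touch-read technique mentioned above, here is a minimal sketch of touching one byte per page, with a write-based variant for comparison. The function names and the explicit `page_size` parameter are illustrative assumptions, not code from the test suite. One general Linux behavior worth keeping in mind, which this thread does not confirm as the cause: read faults on never-written anonymous memory can be served by a shared zero page, so a read-only touch does not necessarily allocate physical pages.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Reads one byte from each page of the buffer. The returned sum only exists
// so that the reads have an observable result.
inline std::uint64_t touch_read(const volatile unsigned char* buf,
                                std::size_t len, std::size_t page_size) {
    assert(page_size > 0);
    std::uint64_t sum = 0;
    for (std::size_t off = 0; off < len; off += page_size) sum += buf[off];
    return sum;
}

// Writes one byte back to each page without changing its value. A write
// fault forces a private physical page to be allocated for that page.
inline void touch_write(volatile unsigned char* buf, std::size_t len,
                        std::size_t page_size) {
    assert(page_size > 0);
    for (std::size_t off = 0; off < len; off += page_size) buf[off] = buf[off];
}
```

Comparing registration time after `touch_read` with registration time after `touch_write` would be one way to test whether the kind of pre-fault matters here.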

xiaguan · 2025-09-22T02:20:49Z

Sure, I'll give it a try. I'll share the results later.

xiaguan · 2025-09-22T08:15:40Z

In the simple dual-NIC test setup, registration speed doesn't seem to improve with or without pre-allocation; there's actually a bit of regression.

Not sure how it performs on 8 NICs yet. I'll test it once I get access to such a machine. In the meantime, feel free to keep this PR open.

staryxchen · 2025-09-22T08:19:27Z

> In the simple dual-NIC test setup, registration speed doesn't seem to improve with or without pre-allocation; there's actually a bit of regression.
>
> Not sure how it performs on 8 NICs yet. I'll test it once I get access to such a machine. In the meantime, feel free to keep this PR open.

What is the size of the registered memory in your test?

xiaguan · 2025-09-22T08:27:48Z

> > In the simple dual-NIC test setup, registration speed doesn't seem to improve with or without pre-allocation; there's actually a bit of regression.
> > Not sure how it performs on 8 NICs yet. I'll test it once I get access to such a machine. In the meantime, feel free to keep this PR open.
>
> What is the size of the registered memory in your test?

(40GB, 4GB)

staryxchen · 2025-09-22T08:31:06Z

> > > In the simple dual-NIC test setup, registration speed doesn't seem to improve with or without pre-allocation; there's actually a bit of regression.
> > > Not sure how it performs on 8 NICs yet. I'll test it once I get access to such a machine. In the meantime, feel free to keep this PR open.
> >
> > What is the size of the registered memory in your test?
>
> (40GB, 4GB)

I think the size is not enough to show the improvements of this patch. Could we try a larger capacity, like 400GB?

xiaguan · 2025-09-22T09:06:01Z

8 NICs, 200GB, without pre-allocation
Default:

I0922 08:58:12.550942 32380 rdma_transport.cpp:143] Memory registration took 71332.3 ms

With this PR:

I0922 08:56:06.719657 30195 rdma_transport.cpp:143] Memory registration took 420163 ms

8 NICs, 200GB, with pre-allocation
Default:

I0922 09:01:26.652885 33920 rdma_transport.cpp:143] Memory registration took 29864.5 ms

With this PR:

I0922 09:04:40.057971 34793 rdma_transport.cpp:143] Memory registration took 81101.3 ms
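To compare runs like the ones above without reading the numbers off by hand, the timing can be pulled out of each log line. The sketch below assumes only the message text visible in these logs ("Memory registration took <number> ms"); the function name `parse_registration_ms` is hypothetical.

```cpp
#include <regex>
#include <string>

// Extracts the millisecond value from a line containing
// "Memory registration took <number> ms". Returns -1.0 when the line does
// not contain that message.
inline double parse_registration_ms(const std::string& line) {
    static const std::regex pattern(
        R"(Memory registration took ([0-9]+(?:\.[0-9]+)?) ms)");
    std::smatch match;
    if (!std::regex_search(line, match, pattern)) return -1.0;
    return std::stod(match[1].str());
}
```

Dividing the parsed values for the four lines above gives a slowdown of roughly 5.9x with this PR without pre-allocation (420163 ms vs 71332.3 ms) and roughly 2.7x with pre-allocation (81101.3 ms vs 29864.5 ms).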

staryxchen mentioned this pull request Sep 17, 2025

[Feature Request]: Mooncake Store Speed up large-size segment mounting #848

Closed

1 task

format and remove some comment

0aa5f89

Signed-off-by: staryxchen <[email protected]>

fix(config): change parallel memory region registration to default di…

3015f10

…sabled
- Default value of parallel_reg_mr changed from true to false
- Environment variable switched from MC_DISABLE_PARALLEL_REG_MR to MC_ENABLE_PARALLEL_REG_MR

Signed-off-by: staryxchen <[email protected]>
