staryxchen
Contributor

@staryxchen staryxchen commented Sep 17, 2025

Summary

This PR introduces a configurable parallel memory region registration feature with significant performance improvements for pre-allocated memory scenarios, while maintaining backward compatibility.

I conducted several tests to validate performance (test code is also attached).
perf.tar.gz

Test Configuration

  • Memory Size: 500GB (DRAM)
  • Test Scenarios:
    • Pre-allocated memory (memory allocated and initialized)
    • Non-pre-allocated memory (memory allocated but not initialized)
  • Configuration Options:
    • With Optimization: MC_DISABLE_PARALLEL_REG_MR not set (parallel registration enabled)
    • Without Optimization: MC_DISABLE_PARALLEL_REG_MR=1 (sequential registration)
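The switch described above could be read roughly as follows. This is a minimal sketch, not the PR's actual code: `env_flag_set` and `use_parallel_reg_mr` are hypothetical names, and treating an empty value or "0" as "not set" is an assumption (the description only covers the unset and `=1` cases). Note that a later commit in this thread inverts the switch to MC_ENABLE_PARALLEL_REG_MR.

```cpp
#include <cassert>
#include <cstdlib>
#include <cstring>

// Hypothetical helper: true when the named environment variable is present,
// non-empty, and not "0". The "0"/empty handling is an assumption.
bool env_flag_set(const char* name) {
    assert(name != nullptr);
    const char* value = std::getenv(name);
    return value != nullptr && *value != '\0' && std::strcmp(value, "0") != 0;
}

// Parallel registration stays enabled unless the disable flag is set,
// matching the "With/Without Optimization" rows listed above.
bool use_parallel_reg_mr() {
    return !env_flag_set("MC_DISABLE_PARALLEL_REG_MR");
}
```

Running the benchmark with `MC_DISABLE_PARALLEL_REG_MR=1` would then select the sequential path, and leaving the variable unset would select the parallel one.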

Performance Results

Pre-allocated Memory Scenario

| Operation | With Optimization (Parallel) | Without Optimization (Sequential) | Performance Improvement |
| --- | --- | --- | --- |
| allocate_memory | 3.557 seconds | 3.810 seconds | N/A |
| register_memory | 11.662 seconds | 49.198 seconds | 321.9% faster |
| unregister_memory | 1.459 seconds | 11.754 seconds | 705.4% faster |
| Total Time | 17.353 seconds | 65.399 seconds | 276.8% faster |

Non-pre-allocated Memory Scenario

| Operation | With Optimization (Parallel) | Without Optimization (Sequential) | Performance Improvement |
| --- | --- | --- | --- |
| allocate_memory | 0.000 seconds | 0.001 seconds | N/A |
| register_memory | 461.999 seconds | 86.930 seconds | 431.3% slower |
| unregister_memory | 1.822 seconds | 11.498 seconds | 531.0% faster |
| Total Time | 464.469 seconds | 98.945 seconds | 369.4% slower |
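For readers checking the numbers: the percentages in both tables appear to be the time difference divided by the smaller of the two times, so "321.9% faster" corresponds to a ratio of about 4.2x. The sketch below reproduces that arithmetic; the function names are hypothetical and the formula is inferred from the reported figures, not stated in the PR. Small last-digit differences from the tables are expected because the displayed times are rounded.

```cpp
#include <algorithm>
#include <cassert>

// Ratio of the slower time to the faster one, e.g. 49.198 / 11.662 is about 4.2.
double speed_factor(double parallel_s, double sequential_s) {
    assert(parallel_s > 0.0 && sequential_s > 0.0);
    return std::max(parallel_s, sequential_s) /
           std::min(parallel_s, sequential_s);
}

// Signed percentage: positive when the parallel run is faster, negative when
// it is slower. The difference is divided by the smaller time (inferred
// convention, see the note above).
double signed_percent(double parallel_s, double sequential_s) {
    assert(parallel_s > 0.0 && sequential_s > 0.0);
    return (sequential_s - parallel_s) /
           std::min(parallel_s, sequential_s) * 100.0;
}
```

With this convention the ratio is always the percentage divided by 100, plus one: a 369.4% slowdown in total time is a factor of about 4.7.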

I had the AI summarize and analyze the test results. Below is the AI's output:

Key Performance Findings

Pre-allocated Memory (500GB)

  • Memory registration: 4.2x faster (sequential 49.2s → parallel 11.7s)
  • Memory unregistration: 8.1x faster (sequential 11.8s → parallel 1.5s)
  • Total operation time: 3.8x faster (sequential 65.4s → parallel 17.4s)

Non-pre-allocated Memory (500GB)

  • Memory registration: 5.3x slower (sequential 87s → parallel 462s)
  • Memory unregistration: 6.3x faster (sequential 11.5s → parallel 1.8s)
  • Total operation time: 4.7x slower (sequential 99s → parallel 464s)

Analysis

Pre-allocated memory benefits from parallel registration because:

  • Pages are already faulted in and resident in physical memory, so registration only has to pin them
  • Multiple RDMA contexts can be utilized simultaneously
  • Better CPU core utilization

Non-pre-allocated memory performs better with sequential registration because:

  • Reduces memory paging overhead and I/O contention
  • Avoids kernel-level resource conflicts during large allocations

Conclusion

The parallel memory registration optimization provides significant performance benefits for pre-allocated memory scenarios, with up to 8x improvement in unregistration performance. However, for large non-pre-allocated memory allocations, sequential registration performs better due to reduced resource contention and kernel overhead.

The parallel_reg_mr configuration option (controlled through the MC_DISABLE_PARALLEL_REG_MR environment variable) provides the flexibility to choose the optimal strategy based on the specific use case and memory allocation patterns of the application.

- Add ``parallel_reg_mr`` config option with environment variable control
- Implement parallel registration/unregistration using std::async
- Maintain backward compatibility with sequential mode

Signed-off-by: staryxchen <[email protected]>
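To illustrate the std::async approach named in the commit message, here is a minimal sketch that registers one buffer on several device contexts, either in order or with one task per context. It is not the PR's implementation: `Context`, `register_region`, and `register_on_all` are hypothetical names, and the placeholder body stands in for the real per-device registration call.

```cpp
#include <cassert>
#include <cstddef>
#include <future>
#include <vector>

// Hypothetical stand-in for one RDMA device context. In real code,
// register_region would invoke the device's memory registration routine.
struct Context {
    int id;
    bool register_region(void* addr, std::size_t length) {
        return addr != nullptr && length > 0;  // placeholder for the real call
    }
};

// Registers one buffer on every context. With parallel == true each context
// gets its own std::async task; otherwise contexts are handled one by one,
// which is the backward-compatible sequential mode.
bool register_on_all(std::vector<Context>& contexts, void* addr,
                     std::size_t length, bool parallel) {
    bool ok = true;
    if (!parallel) {
        for (auto& ctx : contexts) ok = ctx.register_region(addr, length) && ok;
        return ok;
    }
    std::vector<std::future<bool>> tasks;
    tasks.reserve(contexts.size());
    for (auto& ctx : contexts) {
        tasks.push_back(std::async(std::launch::async, [&ctx, addr, length] {
            return ctx.register_region(addr, length);
        }));
    }
    for (auto& task : tasks) ok = task.get() && ok;  // wait for every task
    return ok;
}
```

Using `std::launch::async` forces each task onto its own thread; the default launch policy may defer execution, which would quietly fall back to sequential behavior.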
@staryxchen
Contributor Author

Related issue: #848

@xiaguan
Collaborator

xiaguan commented Sep 18, 2025

The zip file you provided seems to be empty?

@staryxchen
Contributor Author

The zip file you provided seems to be empty?

Sorry, I've updated the file. Please try again.

…sabled

- Default value of parallel_reg_mr changed from true to false
- Environment variable switched from MC_DISABLE_PARALLEL_REG_MR to MC_ENABLE_PARALLEL_REG_MR

Signed-off-by: staryxchen <[email protected]>
@staryxchen
Contributor Author

Hi @xiaguan
Since this patch can cause a regression when the registered memory is not pre-allocated, I have disabled this optimization by default.
BTW, my tests show that pre-allocating memory via touch-read does not eliminate the regression caused by parallel MR registration. The bottleneck appears to be memory pinning. Could you further verify the effect of this patch when combined with the Mooncake Store?
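For reference, a touch-based pre-allocation pass can be sketched as below, with a read variant and a write variant. The function names are hypothetical and this is not the test code from perf.tar.gz. One possible explanation for the observation above, offered as an assumption rather than something confirmed in this thread: on Linux, a read fault on untouched anonymous memory is typically served by a shared zero page, so only a write forces a private physical page to be allocated before pinning.

```cpp
#include <cassert>
#include <cstddef>

// Read-touch: load one byte from every page. Returns the number of pages
// visited. May not force physical allocation (see the note above).
std::size_t touch_read(const volatile unsigned char* buf, std::size_t len,
                       std::size_t page_size) {
    assert(page_size > 0);
    std::size_t pages = 0;
    unsigned sink = 0;
    for (std::size_t off = 0; off < len; off += page_size) {
        sink += buf[off];
        ++pages;
    }
    (void)sink;  // the value is irrelevant; only the memory access matters
    return pages;
}

// Write-touch: read one byte per page and store it back, so the page is
// written without changing the buffer contents.
std::size_t touch_write(volatile unsigned char* buf, std::size_t len,
                        std::size_t page_size) {
    assert(page_size > 0);
    std::size_t pages = 0;
    for (std::size_t off = 0; off < len; off += page_size) {
        buf[off] = buf[off];
        ++pages;
    }
    return pages;
}
```

Comparing registration time after `touch_read` against registration time after `touch_write` on the same buffer size would show whether the read-only pass is the reason pre-allocation did not help.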

@xiaguan
Collaborator

xiaguan commented Sep 22, 2025

Sure, I'll give it a try. I'll share the results later.

@xiaguan
Collaborator

xiaguan commented Sep 22, 2025

In the simple dual-NIC test setup, registration speed doesn't seem to improve with or without pre-allocation—there's actually a bit of regression.

Not sure how it performs on 8 NICs yet. I'll test it once I get access to such a machine. In the meantime, feel free to keep this PR open.

@staryxchen
Contributor Author

In the simple dual-NIC test setup, registration speed doesn't seem to improve with or without pre-allocation—there's actually a bit of regression.

Not sure how it performs on 8 NICs yet. I'll test it once I get access to such a machine. In the meantime, feel free to keep this PR open.

What is the size of the registered memory in your test?

@xiaguan
Collaborator

xiaguan commented Sep 22, 2025

What is the size of the registered memory in your test?

(40GB, 4GB)

@staryxchen
Contributor Author

(40GB, 4GB)

I think the size is not enough to show the improvements of this patch. Could we try a larger capacity, like 400GB?

@xiaguan
Collaborator

xiaguan commented Sep 22, 2025

8 NICs, 200GB

Without pre-allocation, default:

I0922 08:58:12.550942 32380 rdma_transport.cpp:143] Memory registration took 71332.3 ms

Without pre-allocation, with this PR:

I0922 08:56:06.719657 30195 rdma_transport.cpp:143] Memory registration took 420163 ms

With pre-allocation, default:

I0922 09:01:26.652885 33920 rdma_transport.cpp:143] Memory registration took 29864.5 ms

With pre-allocation, with this PR:

I0922 09:04:40.057971 34793 rdma_transport.cpp:143] Memory registration took 81101.3 ms
