
Conversation

jerrychenhf
Contributor

Fix #799

To realize zero copy, we allocate a new buffer after getting the total length. The buffer is registered directly with the transfer engine so that data is read directly into it. The buffer is then returned with ownership transferred to the caller, who uses it to create the tensor. For any error case, the buffer must be freed unless it has already been used to back the tensor (numpy array).

@ykwd
Collaborator

ykwd commented Sep 2, 2025

Although zero-copy is generally desirable for better performance, I’m concerned that invoking register_buffer_internal each time a tensor is retrieved may add more overhead than the potential benefits the zero-copy can provide.

@xiaguan
Collaborator

xiaguan commented Sep 2, 2025

I roughly understand the optimization - good feat!

I have some questions though. In this case, who's responsible for releasing the memory allocated with new?

@xiaguan
Collaborator

xiaguan commented Sep 2, 2025

If we don't need ownership of the tensor, can we actually achieve zero-copy through get_tensor in python code?

@jerrychenhf
Contributor Author

I roughly understand the optimization - good feat!

I have some questions though. In this case, who's responsible for releasing the memory allocated with new?

In the get_tensor case, the py::array's free_when_done capsule takes responsibility for freeing the memory:

template <typename T>
py::array create_typed_array(char *exported_data, size_t offset,
                             size_t total_length) {
    // The capsule frees the exported buffer once the numpy array is
    // garbage collected, so ownership passes to the Python side.
    py::capsule free_when_done(
        exported_data, [](void *p) { delete[] static_cast<char *>(p); });
    return py::array_t<T>({static_cast<ssize_t>(total_length / sizeof(T))},
                          (T *)(exported_data + offset), free_when_done);
}

Since get_allocated_internal is not exposed to Python users, we don't need to worry about that case. Any other API that calls get_allocated_internal to obtain allocated memory, such as get_tensor, needs to take care of freeing the returned memory, just as get_tensor does.

@jerrychenhf
Contributor Author

Although zero-copy is generally desirable for better performance, I’m concerned that invoking register_buffer_internal each time a tensor is retrieved may add more overhead than the potential benefits the zero-copy can provide.

I am not sure how heavy the register_buffer_internal call is. I saw that put_tensor calls register_buffer_internal on every put, so I assumed it was somewhat lightweight.
Maybe other experts on Mooncake can address this question?

@jerrychenhf
Contributor Author

jerrychenhf commented Sep 3, 2025

If we don't need ownership of the tensor, can we actually achieve zero-copy through get_tensor in python code?

I don't understand what "we don't need ownership of the tensor" means. The tensor is just a Python object that is returned; I don't see any ownership concept here. Would you please clarify?

I don't see any other way to achieve zero copy beyond reading directly into the buffer that backs the tensor memory.

@jerrychenhf
Contributor Author

One question for the Mooncake experts related to register_buffer: I saw the following code for put_tensor:

            char *buffer = reinterpret_cast<char *>(data_ptr);
            char *metadata_buffer = reinterpret_cast<char *>(&metadata);
            std::vector<std::span<const char>> values;
            values.emplace_back(
                std::span<const char>(metadata_buffer, sizeof(TensorMetadata)));
            values.emplace_back(std::span<const char>(buffer, tensor_size));

            auto register_result = store_.register_buffer_internal(
                reinterpret_cast<void *>(data_ptr), tensor_size);
            if (!register_result) {
                return -static_cast<int>(register_result.error());
            }

            // Use put_parts to put metadata and tensor together
            auto put_result = store_.put_parts_internal(key, values);

register_buffer_internal is called for the buffer pointer (the tensor data pointer), but it is not called for metadata_buffer. Both of them are used in the put_parts_internal call.

Will this cause a potential problem? What is the principle for when register_buffer_internal should be called?

@xiaguan
Collaborator

xiaguan commented Sep 3, 2025

Here's an example using get_buffer for zero copy and zero registration.

For reading a tensor:

buffer = store.get_buffer(key)
retrieved_view = memoryview(buffer)
metadata_bytes = retrieved_view[:METADATA_BYTES_LEN]
metadata = Metadata.deserialize(metadata_bytes)
num_elements = math.prod(metadata.shape)  # assuming `import math`; element count derived from the stored shape
temp_tensor = torch.frombuffer(
    buffer,
    dtype=metadata.dtype,
    offset=METADATA_BYTES_LEN,
    count=num_elements,
).reshape(metadata.shape)

For writing a tensor:

metadata_bytes = RemoteMetadata(
    len(kv_bytes), kv_shape, kv_dtype, memory_format
).serialize()
assert len(metadata_bytes) == METADATA_BYTES_LEN
self.store.put_parts(key_str, metadata_bytes, kv_bytes)

@ykwd
Collaborator

ykwd commented Sep 3, 2025

Although zero-copy is generally desirable for better performance, I’m concerned that invoking register_buffer_internal each time a tensor is retrieved may add more overhead than the potential benefits the zero-copy can provide.

I am not sure how heavy the register_buffer_internal call is. I saw that put_tensor calls register_buffer_internal on every put, so I assumed it was somewhat lightweight. Maybe other experts on Mooncake can address this question?

I am surprised to learn that put_tensor calls register_buffer_internal every time. Additionally, put_tensor calls put_parts_internal, but put_parts_internal uses memcpy instead of zero copy, so the registered buffer has not been used at all.

As for the cost of register_buffer_internal, though there is no formal benchmark yet, I think it is not small, especially compared with a memory copy. This function checks all the installed transports, calls ibv_reg_mr() for each RDMA context/device, and updates the memory buffer metadata. Unregistering is a similar but reverse process. This method is not supposed to be called frequently, for performance reasons.

@jerrychenhf
Contributor Author

jerrychenhf commented Sep 3, 2025

Although zero-copy is generally desirable for better performance, I’m concerned that invoking register_buffer_internal each time a tensor is retrieved may add more overhead than the potential benefits the zero-copy can provide.

I am not sure how heavy the register_buffer_internal call is. I saw that put_tensor calls register_buffer_internal on every put, so I assumed it was somewhat lightweight. Maybe other experts on Mooncake can address this question?

I am surprised to learn that put_tensor calls register_buffer_internal every time. Additionally, put_tensor calls put_parts_internal, but put_parts_internal uses memcpy instead of zero copy, so the registered buffer has not been used at all.

As for the cost of register_buffer_internal, though there is no formal benchmark yet, I think it is not small, especially compared with a memory copy. This function checks all the installed transports, calls ibv_reg_mr() for each RDMA context/device, and updates the memory buffer metadata. Unregistering is a similar but reverse process. This method is not supposed to be called frequently, for performance reasons.

I understand. So even the current put_tensor implementation is not using register_buffer_internal in the right way. As you said, put_parts_internal actually memcpys the data, so there is no need to call register_buffer_internal at all, since the data is copied into an already registered buffer. (So for put_tensor, if the tensor memory is not backed by a registered buffer, a data copy seems unavoidable.)

On the get tensor side, @xiaguan provided a method that uses a registered buffer to back the tensor. That tensor shares the client buffer memory space, which is the way to realize zero copy for get_tensor if sharing the client buffer space is desired.

Per the discussion so far, we should avoid calling register_buffer_internal on every get or put and instead reuse the client buffer memory space to realize zero copy. Otherwise a memcpy has to be used.

@xiaguan
Collaborator

xiaguan commented Sep 4, 2025

For the write path, we could memcpy metadata to the client and register the data pointer to RDMA (particularly when dealing with exceptionally large tensors).

@ykwd
Collaborator

ykwd commented Sep 4, 2025

For the write path, we could memcpy metadata to the client and register the data pointer to RDMA (particularly when dealing with exceptionally large tensors).

As far as I know, the registering algorithm also has linear time complexity with respect to the length of the tensor. At startup, this becomes noticeable: if the configured segment size is large, the mount segment step can take a long time.

@jerrychenhf
Contributor Author

It seems that implementing zero-copy tensor put and get depends very much on where the tensor memory resides:

  1. If the tensor memory can reside in the client local buffer, zero-copy put and get are possible without a register buffer call (since the buffer is already registered). This is the preferred situation for performance.
  2. If the tensor memory resides outside the client local buffer, there are two choices:
  • Use register buffer with zero copy (this may not be the preferred solution if registering the buffer is too heavyweight)
  • Copy the data without registering the buffer

The current get_tensor and put_tensor take the data-copy approach, so the tensor memory resides outside the client local buffer. That is still a valid option, as described above.

For tensor memory that resides in the client local buffer, it seems we can currently use a combination of existing APIs to realize this (such as get_buffer, put_from, ...; we need to think about whether any API is missing for such functionality).

For this case, we can either wrap a separate pair of tensor get/put APIs, or simply not provide them at all and leave users to do the wrapping at the Python level (a rough sketch of such a wrapper follows).
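For illustration, here is a rough sketch of what such a Python-level get wrapper could look like, following @xiaguan's get_buffer example above. It is only a sketch: get_tensor_zero_copy, Metadata, and METADATA_BYTES_LEN are names taken from or modeled on that example, not an existing API.

import math
import torch

METADATA_BYTES_LEN = 128  # assumed fixed-size metadata header, mirroring the example above

def get_tensor_zero_copy(store, key):
    # Read side: the returned tensor aliases the client-local buffer
    # returned by get_buffer, so no extra copy happens at the Python level.
    buffer = store.get_buffer(key)
    view = memoryview(buffer)
    metadata = Metadata.deserialize(view[:METADATA_BYTES_LEN])  # helper modeled on the example above
    num_elements = math.prod(metadata.shape)
    return torch.frombuffer(
        buffer,
        dtype=metadata.dtype,
        offset=METADATA_BYTES_LEN,
        count=num_elements,
    ).reshape(metadata.shape)

The returned tensor shares memory with the object returned by get_buffer, so its lifetime is tied to that buffer.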

Any comments?

Anyway, we need to remove the register buffer call from the current put_tensor, since put_parts copies the data. I submitted: #803

@xiaguan
Collaborator

xiaguan commented Sep 4, 2025

The registration cost is marginally lower than a memcpy, provided the memory doesn't trigger page faults.

Let's summarize:

For write path:

  1. Pass a pointer that's already registered beforehand (put_from)
  2. Pass a pointer then copy to registered internal buffer (put)
  3. Pass a pointer and register it (not supported yet)

For read path:

  1. Directly read data into a pre-registered pointer (get_into)
  2. Return internal buffer to caller (get_buffer)
  3. Register and read (not supported)
  4. Read, copy, then return (get)
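To make the two pre-registered paths concrete (write path 1 and read path 1), here is a minimal Python sketch. The method names register_buffer, put_from, and get_into come from this thread, but the (address, length) style signatures used below are assumptions, not the confirmed API:

import torch

# `store` stands for an already initialized Mooncake store client.
tensor = torch.empty(1 << 20, dtype=torch.float32)  # pre-allocated tensor memory
ptr = tensor.data_ptr()
size = tensor.numel() * tensor.element_size()

store.register_buffer(ptr, size)     # register once, up front, not per operation

store.put_from("my_key", ptr, size)  # write path 1: pointer already registered
store.get_into("my_key", ptr, size)  # read path 1: read directly into registered memory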

@ykwd
Collaborator

ykwd commented Sep 4, 2025

I’m still concerned about the performance implications of RegisterLocalMemory for the following reasons:

  1. The function involves several system calls and relies on a linear-time scan algorithm, yet we currently have no benchmark results to assess its overhead.
  2. It sends a request to the metadata server to update metadata, but the metadata server handles all updates with a simple mutex. In large-scale data processing scenarios, this could easily become a performance bottleneck.

@xiaguan
Collaborator

xiaguan commented Sep 4, 2025

You could simply benchmark it. In my personal benchmarks, when not dealing with page faults, it outperforms memcpy. I suspect it would be even faster with hugepages.

Does it incur additional overhead? Yes.
Does it outperform memcpy? That depends on the workload.
Is this feature reasonable? Yes.

The remaining issues are matters of implementation and usage patterns.

@ykwd
Collaborator

ykwd commented Sep 4, 2025

It’d be great if you could share the benchmark results and potential use cases with us.

In real deployments, frequent large-scale calls to this interface may turn the metadata server into a single-point bottleneck, since it does not have the same multi-threaded processing mechanism as the master. This limitation could restrict its practical use cases and make the benefit of adding this option rather marginal.

@stmatengss
Collaborator

Although zero-copy is generally desirable for better performance, I’m concerned that invoking register_buffer_internal each time a tensor is retrieved may add more overhead than the potential benefits the zero-copy can provide.

I am not sure how heavy the register_buffer_internal call is. I saw that put_tensor calls register_buffer_internal on every put, so I assumed it was somewhat lightweight. Maybe other experts on Mooncake can address this question?

I am surprised to learn that put_tensor calls register_buffer_internal every time. Additionally, put_tensor calls put_parts_internal, but put_parts_internal uses memcpy instead of zero copy, so the registered buffer has not been used at all.
As for the cost of register_buffer_internal, though there is no formal benchmark yet, I think it is not small, especially compared with a memory copy. This function checks all the installed transports, calls ibv_reg_mr() for each RDMA context/device, and updates the memory buffer metadata. Unregistering is a similar but reverse process. This method is not supposed to be called frequently, for performance reasons.

I understand. So even the current put_tensor implementation is not using register_buffer_internal in the right way. As you said, put_parts_internal actually memcpys the data, so there is no need to call register_buffer_internal at all, since the data is copied into an already registered buffer. (So for put_tensor, if the tensor memory is not backed by a registered buffer, a data copy seems unavoidable.)

On the get tensor side, @xiaguan provided a method that uses a registered buffer to back the tensor. That tensor shares the client buffer memory space, which is the way to realize zero copy for get_tensor if sharing the client buffer space is desired.

Per the discussion so far, we should avoid calling register_buffer_internal on every get or put and instead reuse the client buffer memory space to realize zero copy. Otherwise a memcpy has to be used.

I agree that registering memory is costly and should be avoided. Each registration incurs an additional system call. However, for large tensors, using memory registration instead of memcpy might be more efficient.

@stmatengss
Collaborator

Could you update this PR and related benchmarks? THX! @jerrychenhf

Successfully merging this pull request may close these issues.

[Feature Request]: Zero copy for get_tensor Mooncake store Python API