[Store] Zero copy for Mooncake store get_tensor API #800
base: main
Conversation
Although zero-copy is generally desirable for better performance, I'm concerned about the cost of invoking register_buffer_internal on every call.
I roughly understand the optimization - good feature! I have some questions though. In this case, who's responsible for releasing the memory allocated with new?
If we don't need ownership of the tensor, can we actually achieve zero-copy through the existing get_buffer API?
In the get_tensor case, the py::array's free_when_done capsule takes responsibility for freeing the memory.
Since get_allocated_internal is not exposed to Python users, we don't need to worry about that case. For any API such as get_tensor that calls get_allocated_internal to obtain allocated memory, the caller needs to take care of freeing the returned memory, as get_tensor does.
I am not sure how heavy the register_buffer_internal call is. Since put_tensor calls register_buffer_internal for every put, I assumed it should be fairly lightweight.
I don't understand what "we don't need ownership of the tensor" means. The tensor is just a Python object that gets returned; I don't see any ownership concept here. Would you please clarify? I also don't see any way to achieve zero-copy other than reading directly into the buffer that backs the tensor memory.
One question for the Mooncake experts related to register_buffer. In the put_tensor code, register_buffer_internal is called for the buffer pointer (the tensor data pointer), but not for metadata_buffer, even though both are passed to the put_parts_internal call. Will this cause potential problems? What is the principle for when register_buffer_internal should be called?
Here's an example. For reading a tensor:

```python
buffer = store.get_buffer(key)
retrieved_view = memoryview(buffer)
metadata_bytes = retrieved_view[:METADATA_BYTES_LEN]
metadata = Metadata.deserialize(metadata_bytes)
temp_tensor = torch.frombuffer(
    buffer,
    dtype=metadata.dtype,
    offset=METADATA_BYTES_LEN,
    count=num_elements,
).reshape(metadata.shape)
```

For writing a tensor:

```python
metadata_bytes = RemoteMetadata(
    len(kv_bytes), kv_shape, kv_dtype, memory_format
).serialize()
assert len(metadata_bytes) == METADATA_BYTES_LEN
self.store.put_parts(key_str, metadata_bytes, kv_bytes)
```
For more usage, you could check out https://github.com/LMCache/LMCache/blob/525618e760bc7809b3bb09658497d75d55667f97/lmcache/v1/storage_backend/connector/mooncakestore_connector.py
I am surprised to learn that the current put_tensor calls register_buffer_internal on every put. As for the cost of register_buffer, put_parts_internal already memcpys the data into registered buffer space, so the per-call registration isn't really needed there.
I understand. So even the current put_tensor implementation is using register_buffer_internal in somewhat the wrong way. As you said, put_parts_internal actually memcpys the data, so there is no need to call register_buffer_internal at all: the data is copied into an already registered buffer anyway. (Which means that for put_tensor, if the tensor memory is not backed by a registered buffer, a data copy seems unavoidable.)

On the get side, @xiaguan provided a method that uses the registered buffer to back a tensor, so the tensor shares the client buffer memory space. That is the way to realize zero-copy for get_tensor, if sharing the client buffer space is desired.

Per the discussion so far, we should avoid calling register_buffer_internal on every get or put and instead reuse the client buffer memory space to realize zero-copy; otherwise a memcpy has to be used.
For the write path, we could memcpy the metadata to the client buffer and register the data pointer for RDMA (particularly when dealing with exceptionally large tensors).
As far as I know, the registering algorithm also has linear time complexity with respect to the length of the tensor. At startup, this becomes noticeable: if the configured segment size is large, the mount segment step can take a long time. |
It seems that implementing zero-copy tensor put and get depends very much on where the tensor memory resides:

- The current get_tensor and put_tensor take the data-copy approach, so the tensor memory resides outside the client local buffer. That is still a valid option, as described above.
- For tensor memory that resides in the client local buffer, it seems we can currently combine existing APIs to realize this (get_buffer, put_from, ...; we need to think about whether any API is missing for such functionality). For this case, we can either wrap a separate pair of tensor get/put APIs or simply not provide them at all and leave users to wrap them at the Python level (a sketch of such a wrapper follows below).

Any comments? In any case, we need to remove the register_buffer call from the current put_tensor, since put_parts copies the data. I submitted #803.
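For illustration, here is a minimal Python-level wrapper along those lines. It is only a sketch: it assumes store.get_buffer(key) returns a buffer-protocol object backed by registered client-buffer memory (as in the LMCache connector above) and that a fixed-size metadata header encodes dtype and shape. METADATA_BYTES_LEN and decode_metadata are placeholders, not real Mooncake APIs.

```python
import torch

METADATA_BYTES_LEN = 28  # illustrative header size, not the real constant


def zero_copy_get_tensor(store, key, decode_metadata):
    """Wrap get_buffer() into a tensor without copying the payload.

    `decode_metadata` is a user-supplied callable that turns the header
    bytes into (dtype, shape); it stands in for Metadata.deserialize above.
    """
    buffer = store.get_buffer(key)
    if buffer is None:
        return None
    header = bytes(memoryview(buffer)[:METADATA_BYTES_LEN])
    dtype, shape = decode_metadata(header)
    num_elements = 1
    for dim in shape:
        num_elements *= dim
    # torch.frombuffer shares memory with `buffer`; keep `buffer` alive
    # for as long as the returned tensor is in use.
    return torch.frombuffer(
        buffer, dtype=dtype, offset=METADATA_BYTES_LEN, count=num_elements
    ).reshape(shape)
```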
Register cost is marginally lower than memcpy, provided the memory doesn't trigger page faults. Let's summarize.

For the write path: memcpy the metadata into the client buffer, and for very large tensors register the data pointer for RDMA instead of copying; otherwise copy into the already registered client buffer.

For the read path: back the tensor with already registered client buffer memory (get_buffer plus torch.frombuffer) so the data is read without an extra copy; otherwise a memcpy is needed.
I'm still concerned about the performance implications of registering the buffer on every call.
You could simply benchmark it. In my personal benchmarks, when not dealing with page faults, it outperforms memcpy. I suspect it would be even faster with hugepages. Does it incur additional overhead? Yes. The remaining issues are matters of implementation and usage patterns.
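A rough harness for such a comparison could look like the sketch below. It only contrasts the copy-based get_tensor path with the get_buffer zero-copy path from the caller's side; it assumes an already connected store handle and a key that has previously been put (the key name "bench_tensor" is hypothetical), and results will vary with page-fault behaviour and tensor size, as noted above.

```python
import time

import torch


def bench(fn, iters=50):
    # Warm up once so page faults and lazy setup are not billed to either path.
    fn()
    start = time.perf_counter()
    for _ in range(iters):
        fn()
    return (time.perf_counter() - start) / iters


def run(store, key="bench_tensor"):
    copy_path = lambda: store.get_tensor(key)        # copy-based path
    zero_copy = lambda: torch.frombuffer(            # zero-copy path (raw bytes)
        store.get_buffer(key), dtype=torch.uint8
    )
    print(f"get_tensor : {bench(copy_path) * 1e3:.3f} ms/op")
    print(f"get_buffer : {bench(zero_copy) * 1e3:.3f} ms/op")
```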
It’d be great if you could share the benchmark results and potential use cases with us. In real deployments, frequent large-scale calls to this interface may turn the metadata server into a single-point bottleneck, since it does not have the same multi-threaded processing mechanism as the master. This limitation could restrict its practical use cases and make the benefit of adding this option rather marginal.
I agree that registering memory is costly and should be avoided. Each registration incurs an additional system call. However, for large tensors, registering memory instead of memcpy-ing might be more efficient.
Could you update this PR and related benchmarks? THX! @jerrychenhf
Fix #799
To realize zero copy, we allocate a new buffer after getting the total length. The buffer is registered directly with the transfer engine so that data can be read directly into it. The buffer is then returned with ownership transferred to the caller and used to create the tensor. For any error case, the buffer needs to be freed, up until the point it is handed over to the tensor (NumPy array).
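From the caller's side, the ownership transfer looks roughly like the sketch below. It assumes get_tensor returns a NumPy array backed by the freshly allocated, registered buffer (per the py::array free_when_done discussion above); the key name is hypothetical.

```python
def consume(store, key="layer0.kv"):
    arr = store.get_tensor(key)  # zero copy: the array owns the new buffer
    view = arr.reshape(-1)       # views share the same buffer through arr
    total = view.sum()
    # No explicit free is needed: once arr and view are released, the
    # free_when_done capsule deletes the underlying buffer.
    return total
```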