Remove unnecessary synchronization (miss-sync) during Parquet reading (Part 4: vector_factories) #20120
base: branch-25.12
Conversation
Signed-off-by: Jigao Luo <[email protected]>
This reverts commit c7ad2e8.
I’ve marked this as a draft to remind myself to run the script and count how many pageable copies this PR eliminates before merging.
This links back to the draft PR for reference:
https://github.com/rapidsai/cudf/pull/18968/files#diff-b281f280563cbbee7c16afb29ef989d808476a355c9c36a8f4e27fc5dc2ca4fd
This links back to the draft PR for reference, but covering the full change in reduction.cuh:
  auto pinned_initial = cudf::detail::make_pinned_vector_async<OutputType>(1, stream);
  pinned_initial[0] = initial_value;
  using ScalarType = cudf::scalar_type_t<OutputType>;
- auto result = std::make_unique<ScalarType>(initial_value, true, stream, mr);
+ auto result = std::make_unique<ScalarType>(pinned_initial[0], true, stream, mr);
As we discussed on Slack: assign initial_value to element zero of a pinned vector, effectively treating it like a pinned scalar.
I forgot most of the context here :(
Are we passing the value by reference here?
No, we are not passing by reference here.
Let me bring back some context from our Slack chat. The goal is for ScalarType and cub::DeviceReduce::Reduce to copy the initial_value from host-pinned memory.
Back around August 19th in Slack, we discussed:
- placing the initial_value in a pinned host vector of size 1
- and then assigning the value to the first element [0].
Change the type of offsets and buff_addrs to cudf::detail::host_vector for calling the write_final_offsets function. This is the only place where the function is called.
So there is no need to change the write_final_offsets function in cpp/src/io/parquet/page_data.cu.
Co-authored-by: Vukasin Milovanovic <[email protected]>
  auto pinned_initial = cudf::detail::make_pinned_vector_async<OutputType>(1, stream);
  pinned_initial[0] = initial_value;
I don't think we need the pinned vector here, since cudf::detail::device_scalar will use the bounce buffer for the H2D copy anyway.
That’s true; I’ll revert it. But I have one question: does cub::DeviceReduce::Reduce actually copy the initial_value from host memory?
To investigate this question, I ran experiments using both a pinned host vector and a regular one.
$ nsys profile ./REDUCTIONS_TEST
$ nsys export --output report1.sqlite --type sqlite report1.nsys-rep
$ nsys analyze -r cuda_memcpy_async:rows=-1 report1.nsys-rep | wc -l
I didn’t observe any difference in the pageable-copy counts, which suggests that CUB avoids pageable memory internally.
- What makes this confusing is that I recall doing a similar experiment a few months ago to pinpoint a pageable memory bottleneck. I’m fairly sure I found one and managed to eliminate it back then.
- (I also tried reading through the CUB source code, but it gets pretty hard to follow after the dispatch logic and various specialization paths.)
  auto out_buffers = cudf::detail::make_host_vector<size_type*>(0, _stream);
  auto final_offsets = cudf::detail::make_host_vector<size_type>(0, _stream);
Note: cudf::detail::host_vector should work well as a thrust::host_vector when using the cudf/RMM memory allocator. The reason I raise this is that most existing uses of host_vector in cudf treat it as a fixed-size array. In contrast, this particular case starts zero-sized and relies on dynamic resizing.
Description
For issue #18967, this PR is the first part of merging the PR Draft #18968. In this PR, I added host-pinned vector construction in vector_factories.hpp. After a careful read-through, I’ve improved the comments in this file as well. (As discussed, I’ve also made manual changes to reduction.cuh and page_data.cu.)
Checklist