Conversation

JigaoLuo
Contributor

Description

For issue #18967, this PR is the first part of merging draft PR #18968. It adds host-pinned vector construction in vector_factories.hpp; after a careful read-through, I have also improved the comments in that file.
(As discussed, I have additionally made manual changes to reduction.cuh and page_data.cu.)

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@JigaoLuo JigaoLuo requested a review from a team as a code owner September 26, 2025 10:58

copy-pr-bot bot commented Sep 26, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@github-actions github-actions bot added the libcudf Affects libcudf (C++/CUDA) code. label Sep 26, 2025
@JigaoLuo JigaoLuo marked this pull request as draft September 26, 2025 11:00
@JigaoLuo
Contributor Author

I’ve marked this as a draft to remind myself to run the script and count how many pageable copies this PR eliminates before merging.

Contributor Author


For reference, this links back to the draft PR, which covers the full change in reduction.cuh:

https://github.com/rapidsai/cudf/pull/18968/files#diff-d99740825ef0d2e73c3e8392d06ca11b229400d864913b4221f3f3626ad95f85

Comment on lines 69 to 72
+ auto pinned_initial = cudf::detail::make_pinned_vector_async<OutputType>(1, stream);
+ pinned_initial[0] = initial_value;
  using ScalarType = cudf::scalar_type_t<OutputType>;
- auto result = std::make_unique<ScalarType>(initial_value, true, stream, mr);
+ auto result = std::make_unique<ScalarType>(pinned_initial[0], true, stream, mr);
Contributor Author


As we discussed on Slack: assign initial_value to element zero of a pinned vector, effectively treating it like a pinned scalar.

Contributor


I forgot most of the context here :(
Are we passing the value by reference here?

Contributor Author


No, we are not passing by reference here.

Contributor Author


Let me bring back some context from our Slack chat. The goal is for ScalarType and cub::DeviceReduce::Reduce to copy the initial_value from host-pinned memory.

Back around August 19th in Slack, we discussed:

  • placing the initial_value in a pinned host vector of size 1
  • and then assigning the value to the first element [0].


@JigaoLuo JigaoLuo force-pushed the no-miss-sync-pinned-factory branch from 1f8216e to 4c8591b Compare October 3, 2025 13:40
Contributor Author

@JigaoLuo JigaoLuo Oct 3, 2025


Changed the types of offsets and buff_addrs to cudf::detail::host_vector for the call to the write_final_offsets function. This is the only place where the function is called.

Contributor Author


So there is no need to change the write_final_offsets function in cpp/src/io/parquet/page_data.cu.

@JigaoLuo JigaoLuo force-pushed the no-miss-sync-pinned-factory branch from 9cb8f6f to 1bb499f Compare October 3, 2025 19:06
Comment on lines 129 to 130
auto pinned_initial = cudf::detail::make_pinned_vector_async<OutputType>(1, stream);
pinned_initial[0] = initial_value;
Contributor


I don't think we need the pinned vector here, since cudf::detail::device_scalar will use the bounce buffer for the H2D copy anyway.

Contributor Author

@JigaoLuo JigaoLuo Oct 4, 2025


That's true; I'll revert it. But I have one question: does cub::DeviceReduce::Reduce actually copy the initial_value from host memory?

To investigate this question, I ran experiments using both a pinned host vector and a regular one.

$ nsys profile ./REDUCTIONS_TEST 
$ nsys export --output report1.sqlite --type sqlite report1.nsys-rep
$ nsys analyze -r cuda_memcpy_async:rows=-1 report1.nsys-rep | wc -l 

I didn't observe any difference in the pageable-copy count, which suggests that CUB avoids pageable memory internally.


  • What makes this confusing is that I recall doing a similar experiment a few months ago to pinpoint a pageable memory bottleneck. I’m fairly sure I found one and managed to eliminate it back then.
  • (I also tried reading through the CUB source code, but it gets pretty hard to follow past the dispatch logic and the various specialization paths.)

Comment on lines +421 to +422
auto out_buffers = cudf::detail::make_host_vector<size_type*>(0, _stream);
auto final_offsets = cudf::detail::make_host_vector<size_type>(0, _stream);
Contributor Author


Note: cudf::detail::host_vector should behave like thrust::host_vector while using the cudf/RMM memory allocator.

The reason I raise this is that most existing uses of host_vector in cudf treat it as a fixed-size array. In contrast, this particular case starts zero-sized and relies on dynamic resizing.
