-
Notifications
You must be signed in to change notification settings - Fork 1.9k
Optimize muti-column grouping with StringView/ByteView (option 2) - 25% faster #19413
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Optimize muti-column grouping with StringView/ByteView (option 2) - 25% faster #19413
Conversation
| } | ||
| } | ||
|
|
||
| fn value(&self, buffer_index: usize, offset: usize, length: usize) -> &[u8] { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I inlined this function at the callsite and converted it to unsafe - I felt it was easier to understand why this is safe when it was next to its use
| let has_nulls = array.null_count() != 0; | ||
| let array = array.as_byte_view::<B>(); | ||
| let has_buffers = !array.data_buffers().is_empty(); | ||
| // call specialized version based on nulls and buffers presence |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is the key optimization -- create specialized versions of the comparison based on properties of the input array
This comment was marked as outdated.
This comment was marked as outdated.
|
🤖 |
This comment was marked as outdated.
This comment was marked as outdated.
This comment was marked as outdated.
This comment was marked as outdated.
|
🤖: Benchmark completed Details
|
|
🤖 |
|
🤖: Benchmark completed Details
|
|
🤖 |
|
run benchmark clickbench_partitioned |
|
🤖: Benchmark completed Details
|
|
🤖 |
|
🤖: Benchmark completed Details
|
Which issue does this PR close?
Note this is an alternate to #19364
Rationale for this change
@camuel found a query where DuckDB's raw grouping is is faster.
I looked into it and much of the difference can be explained by better vectorization in the comparisons and short string optimizations
What changes are included in this PR?
Optimize (will comment inline)
Are these changes tested?
By CI. See also benchmark results below. I tested manually as well
Create Data:
Run query:
hyperfine --warmup 3 " datafusion-cli -c \"select l_returnflag,l_linestatus, count(*) as count_order from 'lineitem.parquet' group by l_returnflag, l_linestatus;\" "Before (main): 1.368s
Benchmark 1: datafusion-cli -c "select l_returnflag,l_linestatus, count(*) as count_order from 'lineitem.parquet' group by l_returnflag, l_linestatus;" Time (mean ± σ): 1.393 s ± 0.020 s [User: 16.778 s, System: 0.688 s] Range (min … max): 1.368 s … 1.438 s 10 runsAfter (this PR) 1.022s
Benchmark 1: ./datafusion-cli-multi-gby-try2 -c "select l_returnflag,l_linestatus, count(*) as count_order from 'lineitem.parquet' group by l_returnflag, l_linestatus;" Time (mean ± σ): 1.022 s ± 0.015 s [User: 11.685 s, System: 0.644 s] Range (min … max): 1.005 s … 1.052 s 10 runsI have a PR that improves string view hashing performance too, see
Are there any user-facing changes?
Faster performance