TESTING: Optimize byte view comparison in multi groupby #19344
Conversation
I think there is more performance to be had by squeezing the hashing for ByteView arrays.

run benchmarks

Yeah, I think this could be a really nice optimization.
datafusion/common/src/hash_utils.rs
```rust
for (hash, &v) in hashes_buffer.iter_mut().zip(array.views().iter()) {
    let view_len = v as u32;
    // if the length is not inlined, then we need to hash the bytes as well
    if view_len > 12 {
```
Could also eliminate this branch when having no data buffers at all.
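A minimal sketch of what that could look like, assuming the arrow-rs `views()` and `data_buffers()` accessors and Arrow's ByteView layout (the low 32 bits of each view hold the length, and values of 12 bytes or less are stored inline). This is a hypothetical helper for illustration, not the code in this PR:

```rust
use std::hash::BuildHasher;

use arrow::array::StringViewArray;

/// Hypothetical array-level fast path: when the array has no data buffers,
/// every value must be inlined (12 bytes or less), so the per-row
/// `view_len > 12` branch can be dropped and the inline bytes hashed
/// directly. Null handling is omitted for brevity.
fn hash_inline_only_views<S: BuildHasher>(
    array: &StringViewArray,
    random_state: &S,
    hashes_buffer: &mut [u64],
) {
    debug_assert!(array.data_buffers().is_empty());
    for (hash, &view) in hashes_buffer.iter_mut().zip(array.views().iter()) {
        let len = view as u32 as usize; // low 32 bits of the view hold the length
        let inline = view.to_le_bytes(); // inline data starts at byte 4, after the length
        *hash = random_state.hash_one(&inline[4..4 + len]);
    }
}
```

Whatever hash is produced here has to match the one the general (non-inline) path produces for the same bytes, otherwise group-by lookups across batches would miss.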
🤖: Benchmark completed
run benchmark tpch

(BTW I am testing some other improvements to hash locally.) My next plan is:
run benchmarks

🤖: Benchmark completed

🤖: Benchmark completed
I broke the first part of this PR out into a separate PR. Will put the hash improvements in their own PR.
```diff
-            return result;
+            result
         } else {
             self.do_equal_to_inner_values_only(lhs_row, array, rhs_row)
```
perhaps this could be optimized as well (in the outer loop) for empty data buffers
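A hedged sketch of the kind of outer-loop fast path being suggested here, not the actual `do_equal_to_inner_values_only` implementation. It leans on the Arrow spec's requirement that unused inline-view bytes are zero-padded, so two inline values are equal exactly when their raw 128-bit views are equal:

```rust
use arrow::array::StringViewArray;

/// Hypothetical equality check for the "no data buffers" case: every value is
/// inlined (12 bytes or less), so no length branch or data-buffer lookup is
/// needed, just a single 128-bit compare of the views (null handling omitted).
fn inline_views_equal(
    lhs: &StringViewArray,
    lhs_row: usize,
    rhs: &StringViewArray,
    rhs_row: usize,
) -> bool {
    debug_assert!(lhs.data_buffers().is_empty() && rhs.data_buffers().is_empty());
    lhs.views()[lhs_row] == rhs.views()[rhs_row]
}
```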
I will give it a try
Let's move this conversation to #19364
```rust
);

// if the input array has no nulls, can skip null check
for (&lhs_row, &rhs_row, equal_to_result) in iter {
```
I wonder if we can get this to compile to a well-vectorized loop.
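One shape that usually helps, shown here as an illustrative stand-alone loop rather than this PR's code: keep the body free of early exits and data-dependent branches so the compiler has a chance to auto-vectorize, for example in the all-inline case:

```rust
/// Illustrative only: a branch-free comparison loop over pairs of row indices.
/// The straight-line body (no early return, one result written per element) is
/// the shape LLVM is most likely to auto-vectorize; the indexed gathers on
/// `lhs_rows`/`rhs_rows` remain the main obstacle on targets without gather
/// instructions.
fn compare_inline_views(
    group_views: &[u128],
    input_views: &[u128],
    lhs_rows: &[usize],
    rhs_rows: &[usize],
    results: &mut [bool],
) {
    for ((&lhs_row, &rhs_row), result) in
        lhs_rows.iter().zip(rhs_rows.iter()).zip(results.iter_mut())
    {
        *result = group_views[lhs_row] == input_views[rhs_row];
    }
}
```

Writing the result unconditionally per element, rather than breaking out on the first mismatch, is what keeps the loop straight-line; whether the compiler actually vectorizes it still depends on how the row indices are gathered.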
With the specialized hash function it gets even faster.

run benchmark tpch_mem

show benchmark queue
🤖 Hi @alamb, you asked to view the benchmark queue (#19344 (comment)).
Benchmark script failed with exit code 2.
I have broken the changes in this PR into two chunks, so I am closing this one down to work on them.
…5% faster (#19413)

## Which issue does this PR close?

- Part of #18411
- Closes #19344
- Closes #19364

Note this is an alternate to #19364

## Rationale for this change

@camuel found a query where DuckDB's raw grouping is faster. I looked into it, and much of the difference can be explained by better vectorization in the comparisons and short string optimizations.

## What changes are included in this PR?

Optimize (will comment inline)

## Are these changes tested?

By CI. See also benchmark results below. I tested manually as well.

Create Data:

```shell
nice tpchgen-cli --tables=lineitem --format=parquet --scale-factor 100
```

Run query:

```shell
hyperfine --warmup 3 " datafusion-cli -c \"select l_returnflag,l_linestatus, count(*) as count_order from 'lineitem.parquet' group by l_returnflag, l_linestatus;\" "
```

Before (main): 1.368s

```shell
Benchmark 1: datafusion-cli -c "select l_returnflag,l_linestatus, count(*) as count_order from 'lineitem.parquet' group by l_returnflag, l_linestatus;"
  Time (mean ± σ):      1.393 s ±  0.020 s    [User: 16.778 s, System: 0.688 s]
  Range (min … max):    1.368 s …  1.438 s    10 runs
```

After (this PR): 1.022s

```shell
Benchmark 1: ./datafusion-cli-multi-gby-try2 -c "select l_returnflag,l_linestatus, count(*) as count_order from 'lineitem.parquet' group by l_returnflag, l_linestatus;"
  Time (mean ± σ):      1.022 s ±  0.015 s    [User: 11.685 s, System: 0.644 s]
  Range (min … max):    1.005 s …  1.052 s    10 runs
```

I have a PR that improves string view hashing performance too, see
- #19374

## Are there any user-facing changes?

Faster performance
## Which issue does this PR close?

- builds on #19373
- part of #18411
- Broken out of #19344
- Closes #19344

## Rationale for this change

While looking at performance as part of #18411, I noticed we could speed up string view hashing by optimizing for small strings.

## What changes are included in this PR?

Optimize StringView hashing, specifically by using the inlined view for short strings.

## Are these changes tested?

Functionally by existing coverage.

Performance by benchmarks (added in #19373), which show:

* 15%-20% faster for mixed short/long strings
* 50%-70% faster for "short" arrays where we know there are no strings longer than 12 bytes

```
utf8_view (small): multiple, no nulls    1.00     47.9±1.71µs   ? ?/sec    4.00    191.6±1.15µs   ? ?/sec
utf8_view (small): multiple, nulls       1.00     78.4±0.48µs   ? ?/sec    3.08    241.6±1.11µs   ? ?/sec
utf8_view (small): single, no nulls      1.00     13.9±0.19µs   ? ?/sec    4.29     59.7±0.30µs   ? ?/sec
utf8_view (small): single, nulls         1.00     23.8±0.20µs   ? ?/sec    3.10     73.7±1.03µs   ? ?/sec
utf8_view: multiple, no nulls            1.00    235.4±2.14µs   ? ?/sec    1.11    262.2±1.34µs   ? ?/sec
utf8_view: multiple, nulls               1.00    227.2±2.11µs   ? ?/sec    1.34    303.9±2.23µs   ? ?/sec
utf8_view: single, no nulls              1.00     71.6±0.74µs   ? ?/sec    1.05     75.2±1.27µs   ? ?/sec
utf8_view: single, nulls                 1.00     71.5±1.92µs   ? ?/sec    1.28     91.6±4.65µs
```

<details><summary>Details</summary>
<p>

```
Gnuplot not found, using plotters backend
utf8_view: single, no nulls
                        time:   [20.872 µs 20.906 µs 20.944 µs]
                        change: [−15.863% −15.614% −15.331%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 13 outliers among 100 measurements (13.00%)
  8 (8.00%) high mild
  5 (5.00%) high severe
utf8_view: single, nulls
                        time:   [22.968 µs 23.050 µs 23.130 µs]
                        change: [−17.796% −17.384% −16.918%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 7 outliers among 100 measurements (7.00%)
  3 (3.00%) high mild
  4 (4.00%) high severe
utf8_view: multiple, no nulls
                        time:   [66.005 µs 66.155 µs 66.325 µs]
                        change: [−19.077% −18.785% −18.512%] (p = 0.00 < 0.05)
                        Performance has improved.
utf8_view: multiple, nulls
                        time:   [72.155 µs 72.375 µs 72.649 µs]
                        change: [−17.944% −17.612% −17.266%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 11 outliers among 100 measurements (11.00%)
  6 (6.00%) high mild
  5 (5.00%) high severe
utf8_view (small): single, no nulls
                        time:   [6.1401 µs 6.1563 µs 6.1747 µs]
                        change: [−69.623% −69.484% −69.333%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 6 outliers among 100 measurements (6.00%)
  3 (3.00%) high mild
  3 (3.00%) high severe
utf8_view (small): single, nulls
                        time:   [10.234 µs 10.250 µs 10.270 µs]
                        change: [−53.969% −53.815% −53.666%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 5 outliers among 100 measurements (5.00%)
  5 (5.00%) high severe
utf8_view (small): multiple, no nulls
                        time:   [20.853 µs 20.905 µs 20.961 µs]
                        change: [−66.006% −65.883% −65.759%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 9 outliers among 100 measurements (9.00%)
  7 (7.00%) high mild
  2 (2.00%) high severe
utf8_view (small): multiple, nulls
                        time:   [32.519 µs 32.600 µs 32.675 µs]
                        change: [−53.937% −53.581% −53.232%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high mild
```

</p>
</details>

## Are there any user-facing changes?
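To make the "inlined view" idea above concrete, here is a hedged per-value sketch (a hypothetical function, not necessarily how #19374 implements it): for strings of 12 bytes or less the data already sits in the view, so the hash can be computed without following the data-buffer pointer.

```rust
use std::hash::BuildHasher;

use arrow::array::StringViewArray;

/// Hedged sketch: hash a StringViewArray using the inlined bytes for short
/// strings and the general accessor only for long ones (nulls omitted).
fn hash_string_view<S: BuildHasher>(
    array: &StringViewArray,
    random_state: &S,
    hashes_buffer: &mut [u64],
) {
    for (i, (hash, &view)) in hashes_buffer
        .iter_mut()
        .zip(array.views().iter())
        .enumerate()
    {
        let len = view as u32 as usize;
        *hash = if len <= 12 {
            // short string: bytes 4..4+len of the little-endian view are the data
            let inline = view.to_le_bytes();
            random_state.hash_one(&inline[4..4 + len])
        } else {
            // long string: resolve through the data buffers
            random_state.hash_one(array.value(i).as_bytes())
        };
    }
}
```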
## Which issue does this PR close?

## Rationale for this change

Testing some ideas to make group by faster.

## What changes are included in this PR?

## Are these changes tested?

I benchmarked this manually like this:

Main:

This branch:

## Are there any user-facing changes?