@alamb
Contributor

@alamb alamb commented Dec 15, 2025

Which issue does this PR close?

Rationale for this change

Testing some ideas to make gby faster

What changes are included in this PR?

Are these changes tested?

I benchmarked this manually with the following command:

time ~/Software/datafusion2/target/profiling/datafusion-cli  -c "select l_returnflag,l_linestatus, count(*) as count_order from 'lineitem.parquet' group by l_returnflag, l_linestatus;"

Main

real	0m1.461s
user	0m16.802s
sys	0m0.826s

This branch

real	0m1.168s
user	0m12.951s
sys	0m0.759s
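
As a rough sanity check on the two timings above, the speedup is the ratio of the wall-clock (`real`) times, and the same ratio can be taken for `user` CPU time. The helper names below are illustrative only and are not part of DataFusion.

```rust
/// Speedup factor: how many times faster `after` is than `before`.
fn speedup(before_secs: f64, after_secs: f64) -> f64 {
    before_secs / after_secs
}

/// Percentage of the original time that was saved.
fn percent_saved(before_secs: f64, after_secs: f64) -> f64 {
    (before_secs - after_secs) / before_secs * 100.0
}
```

With the numbers reported above, `speedup(1.461, 1.168)` is roughly 1.25 (about 20% less wall-clock time), and `speedup(16.802, 12.951)` is roughly 1.30 for user CPU time.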

Are there any user-facing changes?

@github-actions github-actions bot added the physical-plan Changes to the physical-plan crate label Dec 15, 2025
@alamb
Contributor Author

alamb commented Dec 15, 2025

I think there is more performance to be had by squeezing hashing for byteview

@alamb
Contributor Author

alamb commented Dec 15, 2025

run benchmarks

@github-actions github-actions bot added the common Related to common crate label Dec 16, 2025
@Dandandan
Contributor

> I think there is more performance to be had by squeezing hashing for byteview

Yeah I think this could be a really nice optimization.

for (hash, &v) in hashes_buffer.iter_mut().zip(array.views().iter()) {
let view_len = v as u32;
// if the length is not inlined, then we need to hash the bytes as well
if view_len > 12 {
Contributor

Could also eliminate this branch when having no data buffers at all.
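
To make the suggestion concrete, here is a minimal sketch of the idea; it is not the code in this PR. It models each view as a raw `u128` in the Arrow view layout: the length sits in the low 32 bits, values of 12 bytes or fewer are stored inline, and longer values keep a 4-byte prefix, a buffer index, and an offset. `make_view`, `hash_one`, and `hash_views` are hypothetical names, and a single data buffer plus the standard library's `DefaultHasher` stand in for the real buffers and hasher.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Build a view for `s`. Long values are appended to `buffer`, which plays
/// the role of a single data buffer (buffer index 0) in this sketch.
fn make_view(s: &[u8], buffer: &mut Vec<u8>) -> u128 {
    let len = s.len() as u32;
    let mut bytes = [0u8; 16];
    bytes[0..4].copy_from_slice(&len.to_le_bytes());
    if len <= 12 {
        // Inline value: the bytes live in the view, zero padded.
        bytes[4..4 + s.len()].copy_from_slice(s);
    } else {
        // Long value: 4-byte prefix, buffer index (left as 0), then offset.
        bytes[4..8].copy_from_slice(&s[..4]);
        let offset = buffer.len() as u32;
        bytes[12..16].copy_from_slice(&offset.to_le_bytes());
        buffer.extend_from_slice(s);
    }
    u128::from_le_bytes(bytes)
}

/// Hash a single value with the standard library hasher (an assumption;
/// the real code uses its own hashing utilities).
fn hash_one<T: Hash + ?Sized>(value: &T) -> u64 {
    let mut hasher = DefaultHasher::new();
    value.hash(&mut hasher);
    hasher.finish()
}

/// Fill `hashes` with one hash per view.
fn hash_views(views: &[u128], buffer: &[u8], hashes: &mut [u64]) {
    if buffer.is_empty() {
        // No data buffer means every value is inline, so the per-row
        // length check can be dropped entirely.
        for (hash, &v) in hashes.iter_mut().zip(views.iter()) {
            *hash = hash_one(&v);
        }
        return;
    }
    for (hash, &v) in hashes.iter_mut().zip(views.iter()) {
        let view_len = v as u32; // low 32 bits hold the length
        *hash = if view_len <= 12 {
            // Inline: the view already contains length and bytes.
            hash_one(&v)
        } else {
            // Not inline: look up the bytes in the data buffer.
            let offset = (v >> 96) as u32 as usize;
            hash_one(&buffer[offset..offset + view_len as usize])
        };
    }
}
```

Because inline views are zero padded, two equal short strings always produce identical `u128` values, which is what makes hashing the view directly safe in this sketch.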

@alamb-ghbot

🤖 ./gh_compare_branch.sh gh_compare_branch.sh Running
Linux aal-dev 6.14.0-1018-gcp #19~24.04.1-Ubuntu SMP Wed Sep 24 23:23:09 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
Comparing alamb/optimize_the_byte_view (9a83e18) to 9d4fe15 diff using: tpch_mem clickbench_partitioned clickbench_extended
Results will be posted here when complete

@alamb-ghbot

🤖: Benchmark completed

Details

Comparing HEAD and alamb_optimize_the_byte_view
--------------------
Benchmark clickbench_extended.json
--------------------
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Query        ┃        HEAD ┃ alamb_optimize_the_byte_view ┃        Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ QQuery 0     │  2652.09 ms │                   2572.10 ms │     no change │
│ QQuery 1     │  1257.62 ms │                   1134.09 ms │ +1.11x faster │
│ QQuery 2     │  2393.02 ms │                   2154.42 ms │ +1.11x faster │
│ QQuery 3     │  1152.57 ms │                   1175.51 ms │     no change │
│ QQuery 4     │  2295.10 ms │                   2304.54 ms │     no change │
│ QQuery 5     │ 28526.58 ms │                  28507.26 ms │     no change │
│ QQuery 6     │  3980.18 ms │                   3954.83 ms │     no change │
│ QQuery 7     │  3642.53 ms │                   3648.31 ms │     no change │
└──────────────┴─────────────┴──────────────────────────────┴───────────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┓
┃ Benchmark Summary                           ┃            ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━┩
│ Total Time (HEAD)                           │ 45899.69ms │
│ Total Time (alamb_optimize_the_byte_view)   │ 45451.05ms │
│ Average Time (HEAD)                         │  5737.46ms │
│ Average Time (alamb_optimize_the_byte_view) │  5681.38ms │
│ Queries Faster                              │          2 │
│ Queries Slower                              │          0 │
│ Queries with No Change                      │          6 │
│ Queries with Failure                        │          0 │
└─────────────────────────────────────────────┴────────────┘
--------------------
Benchmark clickbench_partitioned.json
--------------------
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Query        ┃        HEAD ┃ alamb_optimize_the_byte_view ┃        Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ QQuery 0     │     2.52 ms │                      2.20 ms │ +1.14x faster │
│ QQuery 1     │    51.48 ms │                     50.45 ms │     no change │
│ QQuery 2     │   132.58 ms │                    134.11 ms │     no change │
│ QQuery 3     │   153.36 ms │                    156.20 ms │     no change │
│ QQuery 4     │  1068.20 ms │                   1110.62 ms │     no change │
│ QQuery 5     │  1488.34 ms │                   1465.03 ms │     no change │
│ QQuery 6     │     2.17 ms │                      2.26 ms │     no change │
│ QQuery 7     │    57.51 ms │                     55.52 ms │     no change │
│ QQuery 8     │  1442.79 ms │                   1449.27 ms │     no change │
│ QQuery 9     │  1779.36 ms │                   1897.01 ms │  1.07x slower │
│ QQuery 10    │   391.15 ms │                    379.83 ms │     no change │
│ QQuery 11    │   430.63 ms │                    430.04 ms │     no change │
│ QQuery 12    │  1327.45 ms │                   1377.90 ms │     no change │
│ QQuery 13    │  2027.07 ms │                   2034.16 ms │     no change │
│ QQuery 14    │  1245.35 ms │                   1271.65 ms │     no change │
│ QQuery 15    │  1230.59 ms │                   1252.15 ms │     no change │
│ QQuery 16    │  2609.32 ms │                   2583.85 ms │     no change │
│ QQuery 17    │  2605.69 ms │                   2558.04 ms │     no change │
│ QQuery 18    │  5678.36 ms │                   4890.13 ms │ +1.16x faster │
│ QQuery 19    │   119.21 ms │                    121.78 ms │     no change │
│ QQuery 20    │  1982.37 ms │                   1861.90 ms │ +1.06x faster │
│ QQuery 21    │  2279.91 ms │                   2204.19 ms │     no change │
│ QQuery 22    │  7587.70 ms │                   3771.69 ms │ +2.01x faster │
│ QQuery 23    │ 22746.61 ms │                  12614.73 ms │ +1.80x faster │
│ QQuery 24    │   212.85 ms │                    217.54 ms │     no change │
│ QQuery 25    │   483.62 ms │                    476.16 ms │     no change │
│ QQuery 26    │   229.86 ms │                    222.78 ms │     no change │
│ QQuery 27    │  2771.86 ms │                   2673.74 ms │     no change │
│ QQuery 28    │ 24560.90 ms │                  24025.71 ms │     no change │
│ QQuery 29    │   960.72 ms │                    989.92 ms │     no change │
│ QQuery 30    │  1327.38 ms │                   1344.65 ms │     no change │
│ QQuery 31    │  1348.26 ms │                   1328.22 ms │     no change │
│ QQuery 32    │  4903.73 ms │                   4650.76 ms │ +1.05x faster │
│ QQuery 33    │  5884.95 ms │                   5725.91 ms │     no change │
│ QQuery 34    │  6207.45 ms │                   6072.41 ms │     no change │
│ QQuery 35    │  1940.99 ms │                   1933.14 ms │     no change │
│ QQuery 36    │    67.94 ms │                     67.44 ms │     no change │
│ QQuery 37    │    45.66 ms │                     45.53 ms │     no change │
│ QQuery 38    │    66.81 ms │                     65.54 ms │     no change │
│ QQuery 39    │   102.91 ms │                    101.49 ms │     no change │
│ QQuery 40    │    27.82 ms │                     28.65 ms │     no change │
│ QQuery 41    │    22.22 ms │                     23.97 ms │  1.08x slower │
│ QQuery 42    │    20.39 ms │                     19.53 ms │     no change │
└──────────────┴─────────────┴──────────────────────────────┴───────────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┓
┃ Benchmark Summary                           ┃             ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━┩
│ Total Time (HEAD)                           │ 109626.08ms │
│ Total Time (alamb_optimize_the_byte_view)   │  93687.81ms │
│ Average Time (HEAD)                         │   2549.44ms │
│ Average Time (alamb_optimize_the_byte_view) │   2178.79ms │
│ Queries Faster                              │           6 │
│ Queries Slower                              │           2 │
│ Queries with No Change                      │          35 │
│ Queries with Failure                        │           0 │
└─────────────────────────────────────────────┴─────────────┘
--------------------
Benchmark tpch_mem_sf1.json
--------------------
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Query        ┃      HEAD ┃ alamb_optimize_the_byte_view ┃        Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ QQuery 1     │ 140.99 ms │                    126.61 ms │ +1.11x faster │
│ QQuery 2     │  27.45 ms │                     28.44 ms │     no change │
│ QQuery 3     │  37.71 ms │                     38.44 ms │     no change │
│ QQuery 4     │  28.54 ms │                     29.21 ms │     no change │
│ QQuery 5     │  88.25 ms │                     89.46 ms │     no change │
│ QQuery 6     │  20.13 ms │                     19.93 ms │     no change │
│ QQuery 7     │ 235.91 ms │                    229.12 ms │     no change │
│ QQuery 8     │  38.00 ms │                     41.45 ms │  1.09x slower │
│ QQuery 9     │ 106.54 ms │                    110.82 ms │     no change │
│ QQuery 10    │  64.18 ms │                     65.59 ms │     no change │
│ QQuery 11    │  18.27 ms │                     19.04 ms │     no change │
│ QQuery 12    │  51.65 ms │                     51.69 ms │     no change │
│ QQuery 13    │  49.99 ms │                     48.64 ms │     no change │
│ QQuery 14    │  13.98 ms │                     14.16 ms │     no change │
│ QQuery 15    │  25.32 ms │                     25.02 ms │     no change │
│ QQuery 16    │  25.18 ms │                     24.86 ms │     no change │
│ QQuery 17    │ 152.33 ms │                    156.61 ms │     no change │
│ QQuery 18    │ 288.60 ms │                    290.98 ms │     no change │
│ QQuery 19    │  38.70 ms │                     37.36 ms │     no change │
│ QQuery 20    │  49.98 ms │                     51.02 ms │     no change │
│ QQuery 21    │ 332.22 ms │                    342.45 ms │     no change │
│ QQuery 22    │  17.49 ms │                     17.58 ms │     no change │
└──────────────┴───────────┴──────────────────────────────┴───────────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━┓
┃ Benchmark Summary                           ┃           ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━┩
│ Total Time (HEAD)                           │ 1851.42ms │
│ Total Time (alamb_optimize_the_byte_view)   │ 1858.47ms │
│ Average Time (HEAD)                         │   84.16ms │
│ Average Time (alamb_optimize_the_byte_view) │   84.48ms │
│ Queries Faster                              │         1 │
│ Queries Slower                              │         1 │
│ Queries with No Change                      │        20 │
│ Queries with Failure                        │         0 │
└─────────────────────────────────────────────┴───────────┘

@Dandandan
Contributor

run benchmark tpch

@alamb alamb changed the title Optimize byte view comparison in groupby Optimize byte view comparison in multi groupby Dec 16, 2025
@alamb
Contributor Author

alamb commented Dec 16, 2025

(BTW I am testing some other improvements to hash locally)

My next plan is:

  1. Finish up messing around with string view hashing

  2. Confirm benchmarks

  3. Split this up into a few PRs for easier review:

     - faster comparison in gby stringview

     - faster hash computation for string view

     - Unchecked access in counts.rs
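
The contents of counts.rs are not shown in this thread, so the following is only an illustration of what unchecked access usually means in Rust: validate the indices once, then skip the per-element bounds check inside the hot loop. `update_counts` and its parameters are hypothetical names.

```rust
/// Increment one counter per group index.
///
/// Hypothetical sketch: the indices are validated once up front, so the
/// loop body can use `get_unchecked_mut` and avoid a bounds check per row.
fn update_counts(counts: &mut [i64], group_indices: &[usize]) {
    assert!(
        group_indices.iter().all(|&g| g < counts.len()),
        "group index out of range"
    );
    for &group_index in group_indices {
        // SAFETY: every index was checked against `counts.len()` above.
        unsafe {
            *counts.get_unchecked_mut(group_index) += 1;
        }
    }
}
```

Whether this is a win depends on whether the compiler could already remove the bounds checks, so it is the kind of change that needs the benchmarks mentioned in step 2.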

@alamb
Contributor Author

alamb commented Dec 16, 2025

run benchmarks

@alamb-ghbot

🤖 ./gh_compare_branch.sh gh_compare_branch.sh Running
Linux aal-dev 6.14.0-1018-gcp #19~24.04.1-Ubuntu SMP Wed Sep 24 23:23:09 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
Comparing alamb/optimize_the_byte_view (3031658) to 9d4fe15 diff using: tpch
Results will be posted here when complete

@alamb-ghbot

🤖: Benchmark completed

Details

Comparing HEAD and alamb_optimize_the_byte_view
--------------------
Benchmark tpch_sf1.json
--------------------
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Query        ┃      HEAD ┃ alamb_optimize_the_byte_view ┃        Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ QQuery 1     │ 227.82 ms │                    195.17 ms │ +1.17x faster │
│ QQuery 2     │  96.78 ms │                     99.12 ms │     no change │
│ QQuery 3     │ 123.84 ms │                    130.77 ms │  1.06x slower │
│ QQuery 4     │  79.17 ms │                     78.46 ms │     no change │
│ QQuery 5     │ 176.30 ms │                    182.57 ms │     no change │
│ QQuery 6     │  69.02 ms │                     63.58 ms │ +1.09x faster │
│ QQuery 7     │ 224.07 ms │                    217.59 ms │     no change │
│ QQuery 8     │ 164.76 ms │                    163.15 ms │     no change │
│ QQuery 9     │ 229.94 ms │                    230.60 ms │     no change │
│ QQuery 10    │ 193.37 ms │                    188.98 ms │     no change │
│ QQuery 11    │  77.62 ms │                     76.80 ms │     no change │
│ QQuery 12    │ 125.10 ms │                    115.16 ms │ +1.09x faster │
│ QQuery 13    │ 235.89 ms │                    228.80 ms │     no change │
│ QQuery 14    │ 100.37 ms │                     98.79 ms │     no change │
│ QQuery 15    │ 125.42 ms │                    127.09 ms │     no change │
│ QQuery 16    │  59.48 ms │                     59.98 ms │     no change │
│ QQuery 17    │ 311.02 ms │                    311.52 ms │     no change │
│ QQuery 18    │ 336.21 ms │                    323.24 ms │     no change │
│ QQuery 19    │ 145.26 ms │                    140.32 ms │     no change │
│ QQuery 20    │ 130.93 ms │                    129.32 ms │     no change │
│ QQuery 21    │ 284.14 ms │                    266.93 ms │ +1.06x faster │
│ QQuery 22    │  45.74 ms │                     44.74 ms │     no change │
└──────────────┴───────────┴──────────────────────────────┴───────────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━┓
┃ Benchmark Summary                           ┃           ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━┩
│ Total Time (HEAD)                           │ 3562.25ms │
│ Total Time (alamb_optimize_the_byte_view)   │ 3472.69ms │
│ Average Time (HEAD)                         │  161.92ms │
│ Average Time (alamb_optimize_the_byte_view) │  157.85ms │
│ Queries Faster                              │         4 │
│ Queries Slower                              │         1 │
│ Queries with No Change                      │        17 │
│ Queries with Failure                        │         0 │
└─────────────────────────────────────────────┴───────────┘

@alamb-ghbot

🤖 ./gh_compare_branch.sh gh_compare_branch.sh Running
Linux aal-dev 6.14.0-1018-gcp #19~24.04.1-Ubuntu SMP Wed Sep 24 23:23:09 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
Comparing alamb/optimize_the_byte_view (3031658) to 9d4fe15 diff using: tpch_mem clickbench_partitioned clickbench_extended
Results will be posted here when complete

@alamb-ghbot

🤖: Benchmark completed

Details

Comparing HEAD and alamb_optimize_the_byte_view
--------------------
Benchmark clickbench_extended.json
--------------------
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Query        ┃        HEAD ┃ alamb_optimize_the_byte_view ┃        Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ QQuery 0     │  2799.69 ms │                   2574.25 ms │ +1.09x faster │
│ QQuery 1     │  1279.95 ms │                   1039.89 ms │ +1.23x faster │
│ QQuery 2     │  2460.40 ms │                   2131.44 ms │ +1.15x faster │
│ QQuery 3     │  1196.03 ms │                   1155.16 ms │     no change │
│ QQuery 4     │  2346.79 ms │                   2308.15 ms │     no change │
│ QQuery 5     │ 29110.84 ms │                  29343.89 ms │     no change │
│ QQuery 6     │  4050.00 ms │                   3864.15 ms │     no change │
│ QQuery 7     │  3737.20 ms │                   3764.35 ms │     no change │
└──────────────┴─────────────┴──────────────────────────────┴───────────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┓
┃ Benchmark Summary                           ┃            ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━┩
│ Total Time (HEAD)                           │ 46980.90ms │
│ Total Time (alamb_optimize_the_byte_view)   │ 46181.29ms │
│ Average Time (HEAD)                         │  5872.61ms │
│ Average Time (alamb_optimize_the_byte_view) │  5772.66ms │
│ Queries Faster                              │          3 │
│ Queries Slower                              │          0 │
│ Queries with No Change                      │          5 │
│ Queries with Failure                        │          0 │
└─────────────────────────────────────────────┴────────────┘
--------------------
Benchmark clickbench_partitioned.json
--------------------
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Query        ┃        HEAD ┃ alamb_optimize_the_byte_view ┃        Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ QQuery 0     │     2.80 ms │                      2.88 ms │     no change │
│ QQuery 1     │    52.08 ms │                     50.82 ms │     no change │
│ QQuery 2     │   135.85 ms │                    138.08 ms │     no change │
│ QQuery 3     │   159.01 ms │                    159.77 ms │     no change │
│ QQuery 4     │  1138.73 ms │                   1165.41 ms │     no change │
│ QQuery 5     │  1503.09 ms │                   1499.15 ms │     no change │
│ QQuery 6     │     2.09 ms │                      2.39 ms │  1.15x slower │
│ QQuery 7     │    55.44 ms │                     56.54 ms │     no change │
│ QQuery 8     │  1436.89 ms │                   1493.36 ms │     no change │
│ QQuery 9     │  1926.85 ms │                   1914.12 ms │     no change │
│ QQuery 10    │   391.74 ms │                    364.48 ms │ +1.07x faster │
│ QQuery 11    │   454.37 ms │                    415.55 ms │ +1.09x faster │
│ QQuery 12    │  1371.60 ms │                   1399.67 ms │     no change │
│ QQuery 13    │  2134.23 ms │                   2061.67 ms │     no change │
│ QQuery 14    │  1288.14 ms │                   1268.32 ms │     no change │
│ QQuery 15    │  1270.42 ms │                   1317.36 ms │     no change │
│ QQuery 16    │  2730.47 ms │                   2631.29 ms │     no change │
│ QQuery 17    │  2721.96 ms │                   2627.06 ms │     no change │
│ QQuery 18    │  5648.28 ms │                   4957.78 ms │ +1.14x faster │
│ QQuery 19    │   124.02 ms │                    124.89 ms │     no change │
│ QQuery 20    │  1956.32 ms │                   1887.89 ms │     no change │
│ QQuery 21    │  2204.17 ms │                   2181.39 ms │     no change │
│ QQuery 22    │  3815.05 ms │                   3817.00 ms │     no change │
│ QQuery 23    │ 16708.90 ms │                  12274.50 ms │ +1.36x faster │
│ QQuery 24    │   224.45 ms │                    219.19 ms │     no change │
│ QQuery 25    │   488.60 ms │                    462.32 ms │ +1.06x faster │
│ QQuery 26    │   226.79 ms │                    225.29 ms │     no change │
│ QQuery 27    │  2855.22 ms │                   2799.23 ms │     no change │
│ QQuery 28    │ 24531.31 ms │                  23924.61 ms │     no change │
│ QQuery 29    │   997.42 ms │                    956.93 ms │     no change │
│ QQuery 30    │  1386.97 ms │                   1311.89 ms │ +1.06x faster │
│ QQuery 31    │  1368.49 ms │                   1358.16 ms │     no change │
│ QQuery 32    │  5479.22 ms │                   4750.64 ms │ +1.15x faster │
│ QQuery 33    │  6185.36 ms │                   5865.26 ms │ +1.05x faster │
│ QQuery 34    │  6434.11 ms │                   6287.72 ms │     no change │
│ QQuery 35    │  1981.57 ms │                   1948.81 ms │     no change │
│ QQuery 36    │    68.07 ms │                     65.34 ms │     no change │
│ QQuery 37    │    47.89 ms │                     45.68 ms │     no change │
│ QQuery 38    │    68.16 ms │                     66.41 ms │     no change │
│ QQuery 39    │   107.73 ms │                    102.58 ms │     no change │
│ QQuery 40    │    27.87 ms │                     26.80 ms │     no change │
│ QQuery 41    │    24.30 ms │                     24.12 ms │     no change │
│ QQuery 42    │    20.82 ms │                     20.84 ms │     no change │
└──────────────┴─────────────┴──────────────────────────────┴───────────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┓
┃ Benchmark Summary                           ┃             ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━┩
│ Total Time (HEAD)                           │ 101756.86ms │
│ Total Time (alamb_optimize_the_byte_view)   │  94273.20ms │
│ Average Time (HEAD)                         │   2366.44ms │
│ Average Time (alamb_optimize_the_byte_view) │   2192.40ms │
│ Queries Faster                              │           8 │
│ Queries Slower                              │           1 │
│ Queries with No Change                      │          34 │
│ Queries with Failure                        │           0 │
└─────────────────────────────────────────────┴─────────────┘
--------------------
Benchmark tpch_mem_sf1.json
--------------------
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Query        ┃      HEAD ┃ alamb_optimize_the_byte_view ┃        Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ QQuery 1     │ 143.59 ms │                    116.45 ms │ +1.23x faster │
│ QQuery 2     │  28.01 ms │                     29.39 ms │     no change │
│ QQuery 3     │  34.93 ms │                     39.75 ms │  1.14x slower │
│ QQuery 4     │  29.40 ms │                     29.30 ms │     no change │
│ QQuery 5     │  90.93 ms │                     90.67 ms │     no change │
│ QQuery 6     │  20.10 ms │                     19.87 ms │     no change │
│ QQuery 7     │ 253.07 ms │                    240.28 ms │ +1.05x faster │
│ QQuery 8     │  39.41 ms │                     39.42 ms │     no change │
│ QQuery 9     │ 107.87 ms │                    113.96 ms │  1.06x slower │
│ QQuery 10    │  66.53 ms │                     65.10 ms │     no change │
│ QQuery 11    │  17.69 ms │                     20.24 ms │  1.14x slower │
│ QQuery 12    │  52.29 ms │                     52.81 ms │     no change │
│ QQuery 13    │  49.88 ms │                     48.08 ms │     no change │
│ QQuery 14    │  14.31 ms │                     13.94 ms │     no change │
│ QQuery 15    │  25.76 ms │                     25.18 ms │     no change │
│ QQuery 16    │  25.33 ms │                     25.02 ms │     no change │
│ QQuery 17    │ 157.64 ms │                    158.63 ms │     no change │
│ QQuery 18    │ 285.29 ms │                    289.96 ms │     no change │
│ QQuery 19    │  38.48 ms │                     39.13 ms │     no change │
│ QQuery 20    │  52.35 ms │                     53.25 ms │     no change │
│ QQuery 21    │ 338.74 ms │                    333.22 ms │     no change │
│ QQuery 22    │  18.10 ms │                     18.45 ms │     no change │
└──────────────┴───────────┴──────────────────────────────┴───────────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━┓
┃ Benchmark Summary                           ┃           ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━┩
│ Total Time (HEAD)                           │ 1889.72ms │
│ Total Time (alamb_optimize_the_byte_view)   │ 1862.10ms │
│ Average Time (HEAD)                         │   85.90ms │
│ Average Time (alamb_optimize_the_byte_view) │   84.64ms │
│ Queries Faster                              │         2 │
│ Queries Slower                              │         3 │
│ Queries with No Change                      │        17 │
│ Queries with Failure                        │         0 │
└─────────────────────────────────────────────┴───────────┘

@alamb
Contributor Author

alamb commented Dec 16, 2025

I broke the first part of this PR into

Will put the hash improvements in their own PR

return result;
result
} else {
self.do_equal_to_inner_values_only(lhs_row, array, rhs_row)
Contributor

perhaps this could be optimized as well (in the outer loop) for empty data buffers

Contributor Author

I will give it a try

Contributor Author

Let's move this conversation to #19364

);

// if the input array has no nulls, can skip null check
for (&lhs_row, &rhs_row, equal_to_result) in iter {
Contributor

I wonder if we can get this to compile to a well-vectorized loop.
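
One shape such a loop could take is sketched below, under the assumption that every value is inline (12 bytes or fewer), so comparing the raw `u128` views compares length and bytes in one step. `inline_view` and `vectorized_equal_inline` are hypothetical names, not functions from this PR, and whether the compiler actually vectorizes the loop depends on the target and the indexed loads.

```rust
/// Build an inline view (length in the low 32 bits, bytes zero padded).
fn inline_view(s: &[u8]) -> u128 {
    assert!(s.len() <= 12, "only inline values are modelled in this sketch");
    let mut bytes = [0u8; 16];
    bytes[0..4].copy_from_slice(&(s.len() as u32).to_le_bytes());
    bytes[4..4 + s.len()].copy_from_slice(s);
    u128::from_le_bytes(bytes)
}

/// For each (lhs_row, rhs_row) pair, AND the equality of the two views
/// into the matching slot of `results`. Assumes there are no nulls.
fn vectorized_equal_inline(
    group_views: &[u128],
    input_views: &[u128],
    lhs_rows: &[usize],
    rhs_rows: &[usize],
    results: &mut [bool],
) {
    let iter = lhs_rows
        .iter()
        .zip(rhs_rows.iter())
        .zip(results.iter_mut());
    for ((&lhs_row, &rhs_row), equal_to_result) in iter {
        // A plain `&=` keeps the loop body free of data-dependent branches,
        // which gives the compiler a better chance to vectorize it.
        *equal_to_result &= group_views[lhs_row] == input_views[rhs_row];
    }
}
```

The body has no early exit and no per-row length check, which is the property the comment above is asking about.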

@alamb alamb changed the title Optimize byte view comparison in multi groupby TESTING: Optimize byte view comparison in multi groupby Dec 17, 2025
@alamb
Contributor Author

alamb commented Dec 17, 2025

With the specialized hash function it gets even faster


time ~/Software/datafusion2/target/profiling/datafusion-cli  -c "select l_returnflag,l_linestatus, count(*) as count_order from 'lineitem.parquet' group by l_returnflag, l_linestatus;"
DataFusion CLI v51.0.0
+--------------+--------------+-------------+
| l_returnflag | l_linestatus | count_order |
+--------------+--------------+-------------+
| N            | F            | 3864590     |
| A            | F            | 148047881   |
| R            | F            | 148067261   |
| N            | O            | 300058170   |
+--------------+--------------+-------------+
4 row(s) fetched.
Elapsed 1.024 seconds.


real	0m1.080s
user	0m12.899s
sys	0m0.678s

@alamb
Contributor Author

alamb commented Dec 17, 2025

run benchmark tpch_mem

@alamb
Contributor Author

alamb commented Dec 17, 2025

show benchmark queue

@alamb-ghbot

🤖 Hi @alamb, you asked to view the benchmark queue (#19344 (comment)).

Job User Benchmarks Comment
19364_3664020397.sh Dandandan default https://github.com/apache/datafusion/pull/19364#issuecomment-3664020397
arrow-8990-3662017225.sh alamb view_types concatenate_kernel https://github.com/apache/arrow-rs/pull/8990#issuecomment-3662017225
19344_3665142711.sh alamb default https://github.com/apache/datafusion/pull/19344#issuecomment-3665142711

@alamb-ghbot

Benchmark script failed with exit code 2.

Last 10 lines of output:

this is  a purposely a feailure
ls: cannot access 'gaaa': No such file or directory

@alamb
Contributor Author

alamb commented Dec 17, 2025

I have broken the changes in this PR into two chunks:

So closing this one down to work on them

@alamb alamb closed this Dec 17, 2025
github-merge-queue bot pushed a commit that referenced this pull request Dec 20, 2025
…5% faster (#19413)

## Which issue does this PR close?

- Part of #18411
- Closes #19344
- Closes #19364

Note this is an alternative to
#19364

## Rationale for this change

@camuel found a query where DuckDB's raw grouping is faster.

I looked into it and much of the difference can be explained by better
vectorization in the comparisons and short string optimizations

## What changes are included in this PR?

Optimize (will comment inline)

## Are these changes tested?

By CI. See also benchmark results below. I tested manually as well

Create Data:
```shell
nice tpchgen-cli --tables=lineitem --format=parquet --scale-factor 100
```

Run query:
```shell
hyperfine --warmup 3 " datafusion-cli   -c \"select l_returnflag,l_linestatus, count(*) as count_order from 'lineitem.parquet' group by l_returnflag, l_linestatus;\" "
```

Before (main): 1.368s
```shell
Benchmark 1:  datafusion-cli   -c "select l_returnflag,l_linestatus, count(*) as count_order from 'lineitem.parquet' group by l_returnflag, l_linestatus;"
  Time (mean ± σ):      1.393 s ±  0.020 s    [User: 16.778 s, System: 0.688 s]
  Range (min … max):    1.368 s …  1.438 s    10 runs
```

After (this PR): 1.022s
```shell
Benchmark 1:  ./datafusion-cli-multi-gby-try2   -c "select l_returnflag,l_linestatus, count(*) as count_order from 'lineitem.parquet' group by l_returnflag, l_linestatus;"
  Time (mean ± σ):      1.022 s ±  0.015 s    [User: 11.685 s, System: 0.644 s]
  Range (min … max):    1.005 s …  1.052 s    10 runs
```

I have a PR that improves string view hashing performance too, see
- #19374

## Are there any user-facing changes?
Faster performance
github-merge-queue bot pushed a commit that referenced this pull request Dec 20, 2025
## Which issue does this PR close?

- builds on #19373
- part of #18411
- Broken out of #19344
- Closes #19344

## Rationale for this change

While looking at performance as part of
#18411, I noticed we could
speed up string view hashing by optimizing for small strings

## What changes are included in this PR?

Optimize StringView hashing, specifically by using the inlined view for
short strings

## Are these changes tested?

Functionally by existing coverage

Performance by benchmarks (added in
#19373) which show
* 15%-20% faster for mixed short/long strings
* 50%-70% faster for "short" arrays where we know there are no strings
longer than 12 bytes

```
utf8_view (small): multiple, no nulls        1.00     47.9±1.71µs        ? ?/sec    4.00    191.6±1.15µs        ? ?/sec
utf8_view (small): multiple, nulls           1.00     78.4±0.48µs        ? ?/sec    3.08    241.6±1.11µs        ? ?/sec
utf8_view (small): single, no nulls          1.00     13.9±0.19µs        ? ?/sec    4.29     59.7±0.30µs        ? ?/sec
utf8_view (small): single, nulls             1.00     23.8±0.20µs        ? ?/sec    3.10     73.7±1.03µs        ? ?/sec
utf8_view: multiple, no nulls                1.00    235.4±2.14µs        ? ?/sec    1.11    262.2±1.34µs        ? ?/sec
utf8_view: multiple, nulls                   1.00    227.2±2.11µs        ? ?/sec    1.34    303.9±2.23µs        ? ?/sec
utf8_view: single, no nulls                  1.00     71.6±0.74µs        ? ?/sec    1.05     75.2±1.27µs        ? ?/sec
utf8_view: single, nulls                     1.00     71.5±1.92µs        ? ?/sec    1.28     91.6±4.65µs  
```


<details><summary>Details</summary>
<p>

```
Gnuplot not found, using plotters backend
utf8_view: single, no nulls
                        time:   [20.872 µs 20.906 µs 20.944 µs]
                        change: [−15.863% −15.614% −15.331%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 13 outliers among 100 measurements (13.00%)
  8 (8.00%) high mild
  5 (5.00%) high severe

utf8_view: single, nulls
                        time:   [22.968 µs 23.050 µs 23.130 µs]
                        change: [−17.796% −17.384% −16.918%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 7 outliers among 100 measurements (7.00%)
  3 (3.00%) high mild
  4 (4.00%) high severe

utf8_view: multiple, no nulls
                        time:   [66.005 µs 66.155 µs 66.325 µs]
                        change: [−19.077% −18.785% −18.512%] (p = 0.00 < 0.05)
                        Performance has improved.

utf8_view: multiple, nulls
                        time:   [72.155 µs 72.375 µs 72.649 µs]
                        change: [−17.944% −17.612% −17.266%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 11 outliers among 100 measurements (11.00%)
  6 (6.00%) high mild
  5 (5.00%) high severe

utf8_view (small): single, no nulls
                        time:   [6.1401 µs 6.1563 µs 6.1747 µs]
                        change: [−69.623% −69.484% −69.333%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 6 outliers among 100 measurements (6.00%)
  3 (3.00%) high mild
  3 (3.00%) high severe

utf8_view (small): single, nulls
                        time:   [10.234 µs 10.250 µs 10.270 µs]
                        change: [−53.969% −53.815% −53.666%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 5 outliers among 100 measurements (5.00%)
  5 (5.00%) high severe

utf8_view (small): multiple, no nulls
                        time:   [20.853 µs 20.905 µs 20.961 µs]
                        change: [−66.006% −65.883% −65.759%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 9 outliers among 100 measurements (9.00%)
  7 (7.00%) high mild
  2 (2.00%) high severe

utf8_view (small): multiple, nulls
                        time:   [32.519 µs 32.600 µs 32.675 µs]
                        change: [−53.937% −53.581% −53.232%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high mild
```

</p>
</details> 

## Are there any user-facing changes?

Labels

common Related to common crate physical-plan Changes to the physical-plan crate

Projects

None yet

Development

Successfully merging this pull request may close these issues.

TPCH q1 with no predicates is 2x slower than duckdb

3 participants