TESTING: Optimize byte view comparison in multi groupby #19344
Conversation
I think there is more performance to be had by squeezing the hashing for ByteView arrays.

run benchmarks

Yeah, I think this could be a really nice optimization.
datafusion/common/src/hash_utils.rs
```rust
for (hash, &v) in hashes_buffer.iter_mut().zip(array.views().iter()) {
    let view_len = v as u32;
    // if the length is not inlined, then we need to hash the bytes as well
    if view_len > 12 {
```
Could also eliminate this branch when having no data buffers at all.
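A minimal sketch of what that could look like, assuming the arrow-rs `views()` and `data_buffers()` accessors and Arrow's ByteView layout (the low 32 bits of each view hold the length, and values of 12 bytes or less are stored inline). This is a hypothetical helper for illustration, not the code in this PR:

```rust
use std::hash::BuildHasher;

use arrow::array::StringViewArray;

/// Hypothetical array-level fast path: when the array has no data buffers,
/// every value must be inlined (12 bytes or less), so the per-row
/// `view_len > 12` branch can be dropped and the inline bytes hashed
/// directly. Null handling is omitted for brevity.
fn hash_inline_only_views<S: BuildHasher>(
    array: &StringViewArray,
    random_state: &S,
    hashes_buffer: &mut [u64],
) {
    debug_assert!(array.data_buffers().is_empty());
    for (hash, &view) in hashes_buffer.iter_mut().zip(array.views().iter()) {
        let len = view as u32 as usize; // low 32 bits of the view hold the length
        let inline = view.to_le_bytes(); // inline data starts at byte 4, after the length
        *hash = random_state.hash_one(&inline[4..4 + len]);
    }
}
```

Whatever hash is produced here has to match the one the general (non-inline) path produces for the same bytes, otherwise group-by lookups across batches would miss.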
🤖: Benchmark completed
run benchmark tpch

(BTW I am testing some other improvements to hash locally.) My next plan is:
run benchmarks

🤖: Benchmark completed

🤖: Benchmark completed
I broke the first part of this PR out into a separate PR. Will put the hash improvements in their own PR.
```diff
-            return result;
+            result
         } else {
             self.do_equal_to_inner_values_only(lhs_row, array, rhs_row)
```
perhaps this could be optimized as well (in the outer loop) for empty data buffers
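A hedged sketch of the kind of outer-loop fast path being suggested here, not the actual `do_equal_to_inner_values_only` implementation. It leans on the Arrow spec's requirement that unused inline-view bytes are zero-padded, so two inline values are equal exactly when their raw 128-bit views are equal:

```rust
use arrow::array::StringViewArray;

/// Hypothetical equality check for the "no data buffers" case: every value is
/// inlined (12 bytes or less), so no length branch or data-buffer lookup is
/// needed, just a single 128-bit compare of the views (null handling omitted).
fn inline_views_equal(
    lhs: &StringViewArray,
    lhs_row: usize,
    rhs: &StringViewArray,
    rhs_row: usize,
) -> bool {
    debug_assert!(lhs.data_buffers().is_empty() && rhs.data_buffers().is_empty());
    lhs.views()[lhs_row] == rhs.views()[rhs_row]
}
```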
I will give it a try
Let's move this conversation to #19364
```rust
);

// if the input array has no nulls, can skip null check
for (&lhs_row, &rhs_row, equal_to_result) in iter {
```
I wonder if we can get this to compile to a well-vectorized loop.
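One shape that usually helps, shown here as an illustrative stand-alone loop rather than this PR's code: keep the body free of early exits and data-dependent branches so the compiler has a chance to auto-vectorize, for example in the all-inline case:

```rust
/// Illustrative only: a branch-free comparison loop over pairs of row indices.
/// The straight-line body (no early return, one result written per element) is
/// the shape LLVM is most likely to auto-vectorize; the indexed gathers on
/// `lhs_rows`/`rhs_rows` remain the main obstacle on targets without gather
/// instructions.
fn compare_inline_views(
    group_views: &[u128],
    input_views: &[u128],
    lhs_rows: &[usize],
    rhs_rows: &[usize],
    results: &mut [bool],
) {
    for ((&lhs_row, &rhs_row), result) in
        lhs_rows.iter().zip(rhs_rows.iter()).zip(results.iter_mut())
    {
        *result = group_views[lhs_row] == input_views[rhs_row];
    }
}
```

Writing the result unconditionally per element, rather than breaking out on the first mismatch, is what keeps the loop straight-line; whether the compiler actually vectorizes it still depends on how the row indices are gathered.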
With the specialized hash function it gets even faster.

run benchmark tpch_mem

show benchmark queue
🤖 Hi @alamb, you asked to view the benchmark queue (#19344 (comment)).
Benchmark script failed with exit code 2.
I have broken the changes in this PR into two chunks, so I am closing this one down to work on them.
…5% faster (#19413)

## Which issue does this PR close?

- Part of #18411
- Closes #19344
- Closes #19364

Note this is an alternate to #19364

## Rationale for this change

@camuel found a query where DuckDB's raw grouping is faster. I looked into it, and much of the difference can be explained by better vectorization in the comparisons and short string optimizations.

## What changes are included in this PR?

Optimize (will comment inline)

## Are these changes tested?

By CI. See also benchmark results below. I tested manually as well.

Create Data:

```shell
nice tpchgen-cli --tables=lineitem --format=parquet --scale-factor 100
```

Run query:

```shell
hyperfine --warmup 3 " datafusion-cli -c \"select l_returnflag,l_linestatus, count(*) as count_order from 'lineitem.parquet' group by l_returnflag, l_linestatus;\" "
```

Before (main): 1.368s

```shell
Benchmark 1: datafusion-cli -c "select l_returnflag,l_linestatus, count(*) as count_order from 'lineitem.parquet' group by l_returnflag, l_linestatus;"
  Time (mean ± σ):      1.393 s ±  0.020 s    [User: 16.778 s, System: 0.688 s]
  Range (min … max):    1.368 s …  1.438 s    10 runs
```

After (this PR): 1.022s

```shell
Benchmark 1: ./datafusion-cli-multi-gby-try2 -c "select l_returnflag,l_linestatus, count(*) as count_order from 'lineitem.parquet' group by l_returnflag, l_linestatus;"
  Time (mean ± σ):      1.022 s ±  0.015 s    [User: 11.685 s, System: 0.644 s]
  Range (min … max):    1.005 s …  1.052 s    10 runs
```

I have a PR that improves string view hashing performance too, see
- #19374

## Are there any user-facing changes?

Faster performance
## Which issue does this PR close?

- builds on #19373
- part of #18411
- Broken out of #19344
- Closes #19344

## Rationale for this change

While looking at performance as part of #18411, I noticed we could speed up string view hashing by optimizing for small strings.

## What changes are included in this PR?

Optimize StringView hashing, specifically by using the inlined view for short strings.

## Are these changes tested?

Functionally by existing coverage.

Performance by benchmarks (added in #19373), which show:

* 15%-20% faster for mixed short/long strings
* 50%-70% faster for "short" arrays where we know there are no strings longer than 12 bytes

```
utf8_view (small): multiple, no nulls    1.00     47.9±1.71µs   ? ?/sec    4.00    191.6±1.15µs   ? ?/sec
utf8_view (small): multiple, nulls       1.00     78.4±0.48µs   ? ?/sec    3.08    241.6±1.11µs   ? ?/sec
utf8_view (small): single, no nulls      1.00     13.9±0.19µs   ? ?/sec    4.29     59.7±0.30µs   ? ?/sec
utf8_view (small): single, nulls         1.00     23.8±0.20µs   ? ?/sec    3.10     73.7±1.03µs   ? ?/sec
utf8_view: multiple, no nulls            1.00    235.4±2.14µs   ? ?/sec    1.11    262.2±1.34µs   ? ?/sec
utf8_view: multiple, nulls               1.00    227.2±2.11µs   ? ?/sec    1.34    303.9±2.23µs   ? ?/sec
utf8_view: single, no nulls              1.00     71.6±0.74µs   ? ?/sec    1.05     75.2±1.27µs   ? ?/sec
utf8_view: single, nulls                 1.00     71.5±1.92µs   ? ?/sec    1.28     91.6±4.65µs
```

<details><summary>Details</summary>
<p>

```
Gnuplot not found, using plotters backend
utf8_view: single, no nulls
                        time:   [20.872 µs 20.906 µs 20.944 µs]
                        change: [−15.863% −15.614% −15.331%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 13 outliers among 100 measurements (13.00%)
  8 (8.00%) high mild
  5 (5.00%) high severe
utf8_view: single, nulls
                        time:   [22.968 µs 23.050 µs 23.130 µs]
                        change: [−17.796% −17.384% −16.918%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 7 outliers among 100 measurements (7.00%)
  3 (3.00%) high mild
  4 (4.00%) high severe
utf8_view: multiple, no nulls
                        time:   [66.005 µs 66.155 µs 66.325 µs]
                        change: [−19.077% −18.785% −18.512%] (p = 0.00 < 0.05)
                        Performance has improved.
utf8_view: multiple, nulls
                        time:   [72.155 µs 72.375 µs 72.649 µs]
                        change: [−17.944% −17.612% −17.266%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 11 outliers among 100 measurements (11.00%)
  6 (6.00%) high mild
  5 (5.00%) high severe
utf8_view (small): single, no nulls
                        time:   [6.1401 µs 6.1563 µs 6.1747 µs]
                        change: [−69.623% −69.484% −69.333%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 6 outliers among 100 measurements (6.00%)
  3 (3.00%) high mild
  3 (3.00%) high severe
utf8_view (small): single, nulls
                        time:   [10.234 µs 10.250 µs 10.270 µs]
                        change: [−53.969% −53.815% −53.666%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 5 outliers among 100 measurements (5.00%)
  5 (5.00%) high severe
utf8_view (small): multiple, no nulls
                        time:   [20.853 µs 20.905 µs 20.961 µs]
                        change: [−66.006% −65.883% −65.759%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 9 outliers among 100 measurements (9.00%)
  7 (7.00%) high mild
  2 (2.00%) high severe
utf8_view (small): multiple, nulls
                        time:   [32.519 µs 32.600 µs 32.675 µs]
                        change: [−53.937% −53.581% −53.232%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high mild
```

</p>
</details>

## Are there any user-facing changes?
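To make the "inlined view" idea above concrete, here is a hedged per-value sketch (a hypothetical function, not necessarily how #19374 implements it): for strings of 12 bytes or less the data already sits in the view, so the hash can be computed without following the data-buffer pointer.

```rust
use std::hash::BuildHasher;

use arrow::array::StringViewArray;

/// Hedged sketch: hash a StringViewArray using the inlined bytes for short
/// strings and the general accessor only for long ones (nulls omitted).
fn hash_string_view<S: BuildHasher>(
    array: &StringViewArray,
    random_state: &S,
    hashes_buffer: &mut [u64],
) {
    for (i, (hash, &view)) in hashes_buffer
        .iter_mut()
        .zip(array.views().iter())
        .enumerate()
    {
        let len = view as u32 as usize;
        *hash = if len <= 12 {
            // short string: bytes 4..4+len of the little-endian view are the data
            let inline = view.to_le_bytes();
            random_state.hash_one(&inline[4..4 + len])
        } else {
            // long string: resolve through the data buffers
            random_state.hash_one(array.value(i).as_bytes())
        };
    }
}
```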
## Which issue does this PR close?

## Rationale for this change

Testing some ideas to make group by faster.

## What changes are included in this PR?

## Are these changes tested?

I benchmarked this manually like this:

Main:

This branch:

## Are there any user-facing changes?