Optimize multi-column grouping with StringView/ByteView #19364

alamb · 2025-12-16T21:16:05Z

Which issue does this PR close?

Rationale for this change

@camuel found a query where DuckDB's raw grouping is faster.

I looked into it, and much of the difference can be explained by better vectorization in the comparisons and by short-string optimizations.

What changes are included in this PR?

Optimize the comparisons used for multi-column grouping on StringView/ByteView columns (I will comment inline on each change).

Are these changes tested?

By CI. See also the benchmark results below. I also tested manually, as follows.

Create Data:

nice tpchgen-cli --tables=lineitem --format=parquet --scale-factor 100

Run query:

time datafusion-cli  -c "select l_returnflag,l_linestatus, count(*) as count_order from 'lineitem.parquet' group by l_returnflag, l_linestatus;"

Before (main)

real	0m1.320s
user	0m16.887s
sys	0m0.669s

After (this PR)

real	0m1.037s
user	0m12.365s
sys	0m0.682s
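For reference, the two timings above work out to roughly a 1.27x wall-clock speedup and a 1.37x reduction in user CPU time. A minimal sketch of that arithmetic (the `speedup` helper is just for illustration):

```rust
/// Speedup factor of `after` relative to `before`, both in seconds.
/// A value above 1.0 means `after` is faster.
fn speedup(before: f64, after: f64) -> f64 {
    before / after
}

/// Ratios for the manual run above: (wall-clock, user CPU).
fn manual_run_speedups() -> (f64, f64) {
    let real = speedup(1.320, 1.037); // "real" before / after
    let user = speedup(16.887, 12.365); // "user" before / after
    (real, user)
}
```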

I also have some ideas for improving string view hashing performance. I will make those a separate PR.

Are there any user-facing changes?

Faster performance

alamb · 2025-12-16T21:17:03Z

datafusion/physical-plan/src/aggregates/group_values/multi_group_by/bytes_view.rs


-    fn vectorized_equal_to_inner(
+    /// Comparison when there are no nulls in array
+    fn vectorized_equal_to_no_nulls(


Optimization 1: create a second copy of the loop for the case when there are no nulls (this avoids the null check and gives LLVM a better chance to optimize)
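As a rough illustration of this kind of specialization (not the code in this PR), the sketch below checks once per batch whether either side has nulls and then runs either a null-free loop or a null-aware loop. The `Column` struct, the function names, and the choice to treat two nulls as equal are assumptions made for the example; the real code operates on `GenericByteViewArray`.

```rust
/// Hypothetical stand-in for a nullable column: values plus an optional
/// validity mask (`None` means the column has no nulls).
struct Column<'a> {
    values: &'a [u64],
    validity: Option<&'a [bool]>,
}

/// For each pair (lhs_rows[i], rhs_rows[i]), clear `equal[i]` if the rows differ.
/// The null check is hoisted out of the loop: it is done once per batch.
fn vectorized_equal_to(
    lhs: &Column<'_>,
    lhs_rows: &[usize],
    rhs: &Column<'_>,
    rhs_rows: &[usize],
    equal: &mut [bool],
) {
    match (lhs.validity, rhs.validity) {
        (None, None) => equal_no_nulls(lhs.values, lhs_rows, rhs.values, rhs_rows, equal),
        _ => equal_with_nulls(lhs, lhs_rows, rhs, rhs_rows, equal),
    }
}

/// Null-free loop: a single comparison per row, no validity branch.
fn equal_no_nulls(
    lhs: &[u64],
    lhs_rows: &[usize],
    rhs: &[u64],
    rhs_rows: &[usize],
    equal: &mut [bool],
) {
    for ((&l, &r), eq) in lhs_rows.iter().zip(rhs_rows).zip(equal.iter_mut()) {
        *eq &= lhs[l] == rhs[r];
    }
}

/// Null-aware loop: consults validity for every row.
fn equal_with_nulls(
    lhs: &Column<'_>,
    lhs_rows: &[usize],
    rhs: &Column<'_>,
    rhs_rows: &[usize],
    equal: &mut [bool],
) {
    for ((&l, &r), eq) in lhs_rows.iter().zip(rhs_rows).zip(equal.iter_mut()) {
        let l_valid = lhs.validity.map_or(true, |v| v[l]);
        let r_valid = rhs.validity.map_or(true, |v| v[r]);
        *eq &= match (l_valid, r_valid) {
            (true, true) => lhs.values[l] == rhs.values[r],
            (false, false) => true, // assumption: null keys group together
            _ => false,
        };
    }
}
```

The trade-off is some duplicated loop code in exchange for a branch-free inner loop in the common no-null case.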

alamb · 2025-12-16T21:17:19Z

datafusion/physical-plan/src/aggregates/group_values/multi_group_by/bytes_view.rs

+        array: &GenericByteViewArray<B>,
+        rhs_row: usize,
+    ) -> bool {
+        // SAFETY: the row indexes passed to vectorized_equal are in bounds


Optimization 2: skip bounds check on data access
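To illustrate the general technique (this is a sketch, not the code from this PR), the example below contrasts a bounds-checked access with `get_unchecked`. The function names are hypothetical. Here the indexes are validated once up front, which is an assumption of the example; in the PR the safety comment relies on the caller passing in-bounds row indexes.

```rust
/// Checked version: every index is bounds-checked and panics if out of range.
fn values_equal_checked(lhs: &[u128], l: usize, rhs: &[u128], r: usize) -> bool {
    lhs[l] == rhs[r]
}

/// Unchecked version: no bounds checks in the hot path.
///
/// # Safety
/// The caller must guarantee `l < lhs.len()` and `r < rhs.len()`.
unsafe fn values_equal_unchecked(lhs: &[u128], l: usize, rhs: &[u128], r: usize) -> bool {
    // SAFETY: in-bounds indexes are guaranteed by the caller (see above).
    unsafe { *lhs.get_unchecked(l) == *rhs.get_unchecked(r) }
}

/// Counts matching row pairs, validating all indexes once before the loop.
fn count_equal(lhs: &[u128], lhs_rows: &[usize], rhs: &[u128], rhs_rows: &[usize]) -> usize {
    assert!(lhs_rows.iter().all(|&l| l < lhs.len()));
    assert!(rhs_rows.iter().all(|&r| r < rhs.len()));
    lhs_rows
        .iter()
        .zip(rhs_rows)
        // SAFETY: every index was checked against the slice lengths above.
        .filter(|&(&l, &r)| unsafe { values_equal_unchecked(lhs, l, rhs, r) })
        .count()
}
```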

alamb · 2025-12-16T21:17:40Z

datafusion/physical-plan/src/aggregates/group_values/multi_group_by/bytes_view.rs

-                )
-            };
-            exist_inline == input_inline
+            // the views are inlined and the lengths are equal, so just


Optimization 3: just compare the view directly rather than breaking it into parts first
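A small sketch of why the direct comparison is valid for inlined views, assuming the Arrow view layout: a 16-byte value whose first 4 bytes are the little-endian length and whose remaining 12 bytes hold the data, zero padded. The helper names below are hypothetical and the code is illustrative only, not taken from this PR.

```rust
/// Builds an inline view for data of at most 12 bytes:
/// bytes 0..4 are the length (little-endian u32), bytes 4..16 the zero-padded data.
fn make_inline_view(data: &[u8]) -> u128 {
    assert!(data.len() <= 12, "only short values are stored inline");
    let mut bytes = [0u8; 16];
    bytes[..4].copy_from_slice(&(data.len() as u32).to_le_bytes());
    bytes[4..4 + data.len()].copy_from_slice(data);
    u128::from_le_bytes(bytes)
}

/// Comparison by parts: decode the lengths and the inline bytes, then compare.
/// Only meaningful for inline views (length <= 12).
fn equal_by_parts(a: u128, b: u128) -> bool {
    let (a, b) = (a.to_le_bytes(), b.to_le_bytes());
    let a_len = u32::from_le_bytes([a[0], a[1], a[2], a[3]]) as usize;
    let b_len = u32::from_le_bytes([b[0], b[1], b[2], b[3]]) as usize;
    a_len == b_len && a[4..4 + a_len] == b[4..4 + b_len]
}

/// Direct comparison: because the padding is zero, two inline views are
/// equal exactly when their 128-bit representations are equal.
fn equal_direct(a: u128, b: u128) -> bool {
    a == b
}
```

The direct form is a single 128-bit integer comparison, with no length decoding or slicing.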

alamb · 2025-12-16T21:17:58Z

run benchmarks

alamb · 2025-12-16T21:18:23Z

run benchmark tpch

Dandandan · 2025-12-17T07:22:50Z

run benchmarks

alamb-ghbot · 2025-12-17T11:32:06Z

🤖 ./gh_compare_branch.sh gh_compare_branch.sh Running
Linux aal-dev 6.14.0-1018-gcp #19~24.04.1-Ubuntu SMP Wed Sep 24 23:23:09 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
Comparing alamb/multi_byte_view_comparison (39690eb) to 50d20dd diff using: tpch_mem clickbench_partitioned clickbench_extended
Results will be posted here when complete

alamb · 2025-12-17T11:51:20Z

My runner script keeps dying when there is a problem with the scripts. I am working on a way to keep it going (make it more resilient to failures)

alamb-ghbot · 2025-12-17T12:10:57Z

🤖: Benchmark completed

Details

Comparing HEAD and alamb_multi_byte_view_comparison
--------------------
Benchmark clickbench_extended.json
--------------------
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━┓
┃ Query        ┃        HEAD ┃ alamb_multi_byte_view_comparison ┃    Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━┩
│ QQuery 0     │  2660.36 ms │                       2601.22 ms │ no change │
│ QQuery 1     │  1286.95 ms │                       1310.89 ms │ no change │
│ QQuery 2     │  2550.99 ms │                       2486.51 ms │ no change │
│ QQuery 3     │  1154.27 ms │                       1105.38 ms │ no change │
│ QQuery 4     │  2233.04 ms │                       2244.30 ms │ no change │
│ QQuery 5     │ 28548.05 ms │                      28375.50 ms │ no change │
│ QQuery 6     │  3979.59 ms │                       3939.18 ms │ no change │
│ QQuery 7     │  3379.67 ms │                       3441.10 ms │ no change │
└──────────────┴─────────────┴──────────────────────────────────┴───────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┓
┃ Benchmark Summary                               ┃            ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━┩
│ Total Time (HEAD)                               │ 45792.92ms │
│ Total Time (alamb_multi_byte_view_comparison)   │ 45504.08ms │
│ Average Time (HEAD)                             │  5724.11ms │
│ Average Time (alamb_multi_byte_view_comparison) │  5688.01ms │
│ Queries Faster                                  │          0 │
│ Queries Slower                                  │          0 │
│ Queries with No Change                          │          8 │
│ Queries with Failure                            │          0 │
└─────────────────────────────────────────────────┴────────────┘
--------------------
Benchmark clickbench_partitioned.json
--------------------
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Query        ┃        HEAD ┃ alamb_multi_byte_view_comparison ┃        Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ QQuery 0     │     2.46 ms │                          2.25 ms │ +1.10x faster │
│ QQuery 1     │    49.63 ms │                         50.76 ms │     no change │
│ QQuery 2     │   130.55 ms │                        134.38 ms │     no change │
│ QQuery 3     │   155.21 ms │                        154.41 ms │     no change │
│ QQuery 4     │  1058.23 ms │                       1044.99 ms │     no change │
│ QQuery 5     │  1427.87 ms │                       1458.75 ms │     no change │
│ QQuery 6     │     2.08 ms │                          2.13 ms │     no change │
│ QQuery 7     │    57.50 ms │                         57.19 ms │     no change │
│ QQuery 8     │  1364.42 ms │                       1357.62 ms │     no change │
│ QQuery 9     │  1763.00 ms │                       1810.69 ms │     no change │
│ QQuery 10    │   386.07 ms │                        379.89 ms │     no change │
│ QQuery 11    │   433.70 ms │                        419.83 ms │     no change │
│ QQuery 12    │  1325.84 ms │                       1333.81 ms │     no change │
│ QQuery 13    │  2025.58 ms │                       2010.85 ms │     no change │
│ QQuery 14    │  1265.55 ms │                       1216.11 ms │     no change │
│ QQuery 15    │  1187.30 ms │                       1182.08 ms │     no change │
│ QQuery 16    │  2590.85 ms │                       2521.91 ms │     no change │
│ QQuery 17    │  2566.20 ms │                       2490.35 ms │     no change │
│ QQuery 18    │  4892.63 ms │                       4738.91 ms │     no change │
│ QQuery 19    │   119.61 ms │                        121.86 ms │     no change │
│ QQuery 20    │  1875.55 ms │                       1837.59 ms │     no change │
│ QQuery 21    │  2175.43 ms │                       2173.27 ms │     no change │
│ QQuery 22    │  3739.74 ms │                       3703.37 ms │     no change │
│ QQuery 23    │ 12492.36 ms │                      12431.98 ms │     no change │
│ QQuery 24    │   216.07 ms │                        209.73 ms │     no change │
│ QQuery 25    │   474.88 ms │                        481.87 ms │     no change │
│ QQuery 26    │   216.52 ms │                        221.48 ms │     no change │
│ QQuery 27    │  2699.74 ms │                       2686.46 ms │     no change │
│ QQuery 28    │ 24092.13 ms │                      23987.16 ms │     no change │
│ QQuery 29    │   978.97 ms │                        953.06 ms │     no change │
│ QQuery 30    │  1296.55 ms │                       1318.28 ms │     no change │
│ QQuery 31    │  1297.85 ms │                       1338.78 ms │     no change │
│ QQuery 32    │  4919.38 ms │                       4499.39 ms │ +1.09x faster │
│ QQuery 33    │  5814.56 ms │                       5724.61 ms │     no change │
│ QQuery 34    │  5769.92 ms │                       5726.25 ms │     no change │
│ QQuery 35    │  1884.52 ms │                       1873.58 ms │     no change │
│ QQuery 36    │    66.11 ms │                         68.66 ms │     no change │
│ QQuery 37    │    45.93 ms │                         44.88 ms │     no change │
│ QQuery 38    │    63.87 ms │                         70.53 ms │  1.10x slower │
│ QQuery 39    │    98.66 ms │                        103.01 ms │     no change │
│ QQuery 40    │    27.37 ms │                         26.79 ms │     no change │
│ QQuery 41    │    23.14 ms │                         23.49 ms │     no change │
│ QQuery 42    │    19.29 ms │                         19.89 ms │     no change │
└──────────────┴─────────────┴──────────────────────────────────┴───────────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┓
┃ Benchmark Summary                               ┃            ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━┩
│ Total Time (HEAD)                               │ 93092.83ms │
│ Total Time (alamb_multi_byte_view_comparison)   │ 92012.89ms │
│ Average Time (HEAD)                             │  2164.95ms │
│ Average Time (alamb_multi_byte_view_comparison) │  2139.83ms │
│ Queries Faster                                  │          2 │
│ Queries Slower                                  │          1 │
│ Queries with No Change                          │         40 │
│ Queries with Failure                            │          0 │
└─────────────────────────────────────────────────┴────────────┘
--------------------
Benchmark tpch_mem_sf1.json
--------------------
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Query        ┃      HEAD ┃ alamb_multi_byte_view_comparison ┃        Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ QQuery 1     │ 143.22 ms │                        118.24 ms │ +1.21x faster │
│ QQuery 2     │  29.31 ms │                         29.34 ms │     no change │
│ QQuery 3     │  36.72 ms │                         38.94 ms │  1.06x slower │
│ QQuery 4     │  29.68 ms │                         28.68 ms │     no change │
│ QQuery 5     │  90.00 ms │                         87.45 ms │     no change │
│ QQuery 6     │  24.32 ms │                         19.99 ms │ +1.22x faster │
│ QQuery 7     │ 273.16 ms │                        226.90 ms │ +1.20x faster │
│ QQuery 8     │  43.96 ms │                         34.12 ms │ +1.29x faster │
│ QQuery 9     │ 116.15 ms │                        106.64 ms │ +1.09x faster │
│ QQuery 10    │  75.61 ms │                         62.61 ms │ +1.21x faster │
│ QQuery 11    │  19.65 ms │                         19.71 ms │     no change │
│ QQuery 12    │  63.90 ms │                         51.38 ms │ +1.24x faster │
│ QQuery 13    │  56.73 ms │                         48.39 ms │ +1.17x faster │
│ QQuery 14    │  16.31 ms │                         14.01 ms │ +1.16x faster │
│ QQuery 15    │  27.79 ms │                         24.65 ms │ +1.13x faster │
│ QQuery 16    │  26.23 ms │                         24.79 ms │ +1.06x faster │
│ QQuery 17    │ 179.65 ms │                        149.04 ms │ +1.21x faster │
│ QQuery 18    │ 294.61 ms │                        276.83 ms │ +1.06x faster │
│ QQuery 19    │  45.16 ms │                         37.14 ms │ +1.22x faster │
│ QQuery 20    │  49.01 ms │                         50.20 ms │     no change │
│ QQuery 21    │ 327.96 ms │                        301.84 ms │ +1.09x faster │
│ QQuery 22    │  17.25 ms │                         17.67 ms │     no change │
└──────────────┴───────────┴──────────────────────────────────┴───────────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━┓
┃ Benchmark Summary                               ┃           ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━┩
│ Total Time (HEAD)                               │ 1986.39ms │
│ Total Time (alamb_multi_byte_view_comparison)   │ 1768.54ms │
│ Average Time (HEAD)                             │   90.29ms │
│ Average Time (alamb_multi_byte_view_comparison) │   80.39ms │
│ Queries Faster                                  │        15 │
│ Queries Slower                                  │         1 │
│ Queries with No Change                          │         6 │
│ Queries with Failure                            │         0 │
└─────────────────────────────────────────────────┴───────────┘

alamb-ghbot · 2025-12-17T12:11:01Z

🤖 ./gh_compare_branch.sh gh_compare_branch.sh Running
Linux aal-dev 6.14.0-1018-gcp #19~24.04.1-Ubuntu SMP Wed Sep 24 23:23:09 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
Comparing alamb/multi_byte_view_comparison (39690eb) to 50d20dd diff using: tpch
Results will be posted here when complete

alamb-ghbot · 2025-12-17T12:11:38Z

🤖: Benchmark completed

Details

Comparing HEAD and alamb_multi_byte_view_comparison
--------------------
Benchmark tpch_sf1.json
--------------------
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Query        ┃      HEAD ┃ alamb_multi_byte_view_comparison ┃        Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ QQuery 1     │ 212.24 ms │                        197.29 ms │ +1.08x faster │
│ QQuery 2     │  97.69 ms │                         94.62 ms │     no change │
│ QQuery 3     │ 120.75 ms │                        123.57 ms │     no change │
│ QQuery 4     │  75.88 ms │                         76.94 ms │     no change │
│ QQuery 5     │ 175.33 ms │                        175.18 ms │     no change │
│ QQuery 6     │  67.79 ms │                         67.88 ms │     no change │
│ QQuery 7     │ 207.70 ms │                        209.71 ms │     no change │
│ QQuery 8     │ 165.65 ms │                        165.16 ms │     no change │
│ QQuery 9     │ 223.06 ms │                        224.60 ms │     no change │
│ QQuery 10    │ 183.75 ms │                        183.52 ms │     no change │
│ QQuery 11    │  74.98 ms │                         73.93 ms │     no change │
│ QQuery 12    │ 119.37 ms │                        118.45 ms │     no change │
│ QQuery 13    │ 215.24 ms │                        213.45 ms │     no change │
│ QQuery 14    │  93.30 ms │                         88.17 ms │ +1.06x faster │
│ QQuery 15    │ 122.27 ms │                        123.95 ms │     no change │
│ QQuery 16    │  58.69 ms │                         58.23 ms │     no change │
│ QQuery 17    │ 269.21 ms │                        273.34 ms │     no change │
│ QQuery 18    │ 307.57 ms │                        310.35 ms │     no change │
│ QQuery 19    │ 136.22 ms │                        134.94 ms │     no change │
│ QQuery 20    │ 124.35 ms │                        123.55 ms │     no change │
│ QQuery 21    │ 266.98 ms │                        265.49 ms │     no change │
│ QQuery 22    │  43.72 ms │                         42.24 ms │     no change │
└──────────────┴───────────┴──────────────────────────────────┴───────────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━┓
┃ Benchmark Summary                               ┃           ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━┩
│ Total Time (HEAD)                               │ 3361.73ms │
│ Total Time (alamb_multi_byte_view_comparison)   │ 3344.56ms │
│ Average Time (HEAD)                             │  152.81ms │
│ Average Time (alamb_multi_byte_view_comparison) │  152.03ms │
│ Queries Faster                                  │         2 │
│ Queries Slower                                  │         0 │
│ Queries with No Change                          │        20 │
│ Queries with Failure                            │         0 │
└─────────────────────────────────────────────────┴───────────┘

alamb-ghbot · 2025-12-17T12:11:41Z

🤖 ./gh_compare_branch.sh gh_compare_branch.sh Running
Linux aal-dev 6.14.0-1018-gcp #19~24.04.1-Ubuntu SMP Wed Sep 24 23:23:09 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
Comparing alamb/multi_byte_view_comparison (39690eb) to 50d20dd diff using: tpch_mem clickbench_partitioned clickbench_extended
Results will be posted here when complete

alamb · 2025-12-17T12:16:32Z

TPCH Q1 does have the pattern that is optimized in this PR (multiple group by columns), so it is plausible that the improvement measured there is real.

datafusion/benchmarks/queries/q1.sql

Lines 1 to 21 in 32951c3

    
    select
        l_returnflag,
        l_linestatus,
        sum(l_quantity) as sum_qty,
        sum(l_extendedprice) as sum_base_price,
        sum(l_extendedprice * (1 - l_discount)) as sum_disc_price,
        sum(l_extendedprice * (1 - l_discount) * (1 + l_tax)) as sum_charge,
        avg(l_quantity) as avg_qty,
        avg(l_extendedprice) as avg_price,
        avg(l_discount) as avg_disc,
        count(*) as count_order
    from
        lineitem
    where
            l_shipdate <= date '1998-09-02'
    group by
        l_returnflag,
        l_linestatus
    order by
        l_returnflag,
        l_linestatus;

alamb · 2025-12-17T12:31:10Z

My runner script keeps dying when there is a problem with the scripts. I am working on a way to keep it going (make it more resilient to failures)

I added some error checking here: alamb/datafusion-benchmarking@64ebd3a

alamb-ghbot · 2025-12-17T12:37:19Z

🤖: Benchmark completed

Details

Comparing HEAD and alamb_multi_byte_view_comparison
--------------------
Benchmark clickbench_extended.json
--------------------
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━┓
┃ Query        ┃        HEAD ┃ alamb_multi_byte_view_comparison ┃    Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━┩
│ QQuery 0     │  2654.93 ms │                       2698.13 ms │ no change │
│ QQuery 1     │  1318.69 ms │                       1258.88 ms │ no change │
│ QQuery 2     │  2438.94 ms │                       2526.07 ms │ no change │
│ QQuery 3     │  1132.09 ms │                       1130.33 ms │ no change │
│ QQuery 4     │  2250.77 ms │                       2241.25 ms │ no change │
│ QQuery 5     │ 28459.49 ms │                      28764.05 ms │ no change │
│ QQuery 6     │  3966.94 ms │                       3978.11 ms │ no change │
│ QQuery 7     │  3455.76 ms │                       3546.16 ms │ no change │
└──────────────┴─────────────┴──────────────────────────────────┴───────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┓
┃ Benchmark Summary                               ┃            ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━┩
│ Total Time (HEAD)                               │ 45677.61ms │
│ Total Time (alamb_multi_byte_view_comparison)   │ 46142.98ms │
│ Average Time (HEAD)                             │  5709.70ms │
│ Average Time (alamb_multi_byte_view_comparison) │  5767.87ms │
│ Queries Faster                                  │          0 │
│ Queries Slower                                  │          0 │
│ Queries with No Change                          │          8 │
│ Queries with Failure                            │          0 │
└─────────────────────────────────────────────────┴────────────┘
--------------------
Benchmark clickbench_partitioned.json
--------------------
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Query        ┃        HEAD ┃ alamb_multi_byte_view_comparison ┃        Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ QQuery 0     │     2.45 ms │                          2.24 ms │ +1.09x faster │
│ QQuery 1     │    52.67 ms │                         51.50 ms │     no change │
│ QQuery 2     │   134.23 ms │                        134.41 ms │     no change │
│ QQuery 3     │   151.39 ms │                        154.17 ms │     no change │
│ QQuery 4     │  1069.20 ms │                       1251.10 ms │  1.17x slower │
│ QQuery 5     │  1457.17 ms │                       1737.72 ms │  1.19x slower │
│ QQuery 6     │     2.20 ms │                          2.12 ms │     no change │
│ QQuery 7     │    57.10 ms │                         55.54 ms │     no change │
│ QQuery 8     │  1418.06 ms │                       1568.71 ms │  1.11x slower │
│ QQuery 9     │  1820.46 ms │                       1982.91 ms │  1.09x slower │
│ QQuery 10    │   383.95 ms │                        381.85 ms │     no change │
│ QQuery 11    │   436.87 ms │                        429.02 ms │     no change │
│ QQuery 12    │  1364.82 ms │                       1644.13 ms │  1.20x slower │
│ QQuery 13    │  2030.01 ms │                       2213.73 ms │  1.09x slower │
│ QQuery 14    │  1245.92 ms │                       1399.10 ms │  1.12x slower │
│ QQuery 15    │  1220.64 ms │                       1433.16 ms │  1.17x slower │
│ QQuery 16    │  2610.27 ms │                       2731.04 ms │     no change │
│ QQuery 17    │  2593.14 ms │                       2730.75 ms │  1.05x slower │
│ QQuery 18    │  5201.98 ms │                       5053.13 ms │     no change │
│ QQuery 19    │   123.45 ms │                        124.49 ms │     no change │
│ QQuery 20    │  1942.33 ms │                       1904.46 ms │     no change │
│ QQuery 21    │  2212.15 ms │                       2192.22 ms │     no change │
│ QQuery 22    │  3829.95 ms │                       3806.20 ms │     no change │
│ QQuery 23    │ 16774.03 ms │                      12620.76 ms │ +1.33x faster │
│ QQuery 24    │   214.67 ms │                        222.76 ms │     no change │
│ QQuery 25    │   469.39 ms │                        488.55 ms │     no change │
│ QQuery 26    │   238.30 ms │                        210.19 ms │ +1.13x faster │
│ QQuery 27    │  2733.25 ms │                       2671.62 ms │     no change │
│ QQuery 28    │ 24468.41 ms │                      24083.89 ms │     no change │
│ QQuery 29    │   953.81 ms │                        970.17 ms │     no change │
│ QQuery 30    │  1358.21 ms │                       1374.33 ms │     no change │
│ QQuery 31    │  1386.23 ms │                       1382.56 ms │     no change │
│ QQuery 32    │  5287.29 ms │                       4855.42 ms │ +1.09x faster │
│ QQuery 33    │  5831.31 ms │                       5784.91 ms │     no change │
│ QQuery 34    │  6047.35 ms │                       5824.61 ms │     no change │
│ QQuery 35    │  1929.25 ms │                       2081.71 ms │  1.08x slower │
│ QQuery 36    │    70.23 ms │                         68.42 ms │     no change │
│ QQuery 37    │    46.23 ms │                         47.32 ms │     no change │
│ QQuery 38    │    67.06 ms │                         68.95 ms │     no change │
│ QQuery 39    │   104.26 ms │                        107.45 ms │     no change │
│ QQuery 40    │    27.77 ms │                         26.43 ms │     no change │
│ QQuery 41    │    23.46 ms │                         24.41 ms │     no change │
│ QQuery 42    │    20.52 ms │                         20.13 ms │     no change │
└──────────────┴─────────────┴──────────────────────────────────┴───────────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┓
┃ Benchmark Summary                               ┃            ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━┩
│ Total Time (HEAD)                               │ 99411.41ms │
│ Total Time (alamb_multi_byte_view_comparison)   │ 95918.30ms │
│ Average Time (HEAD)                             │  2311.89ms │
│ Average Time (alamb_multi_byte_view_comparison) │  2230.66ms │
│ Queries Faster                                  │          4 │
│ Queries Slower                                  │         10 │
│ Queries with No Change                          │         29 │
│ Queries with Failure                            │          0 │
└─────────────────────────────────────────────────┴────────────┘
--------------------
Benchmark tpch_mem_sf1.json
--------------------
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Query        ┃      HEAD ┃ alamb_multi_byte_view_comparison ┃        Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ QQuery 1     │ 131.95 ms │                        118.14 ms │ +1.12x faster │
│ QQuery 2     │  28.62 ms │                         28.00 ms │     no change │
│ QQuery 3     │  38.36 ms │                         37.47 ms │     no change │
│ QQuery 4     │  29.07 ms │                         28.98 ms │     no change │
│ QQuery 5     │  87.08 ms │                         88.72 ms │     no change │
│ QQuery 6     │  19.72 ms │                         19.91 ms │     no change │
│ QQuery 7     │ 230.33 ms │                        228.48 ms │     no change │
│ QQuery 8     │  35.99 ms │                         38.56 ms │  1.07x slower │
│ QQuery 9     │ 109.87 ms │                        113.87 ms │     no change │
│ QQuery 10    │  64.34 ms │                         64.48 ms │     no change │
│ QQuery 11    │  17.76 ms │                         19.00 ms │  1.07x slower │
│ QQuery 12    │  51.67 ms │                         51.38 ms │     no change │
│ QQuery 13    │  48.76 ms │                         52.19 ms │  1.07x slower │
│ QQuery 14    │  14.23 ms │                         14.00 ms │     no change │
│ QQuery 15    │  24.49 ms │                         26.71 ms │  1.09x slower │
│ QQuery 16    │  26.54 ms │                         24.76 ms │ +1.07x faster │
│ QQuery 17    │ 151.67 ms │                        154.51 ms │     no change │
│ QQuery 18    │ 280.64 ms │                        283.08 ms │     no change │
│ QQuery 19    │  37.06 ms │                         37.22 ms │     no change │
│ QQuery 20    │  50.88 ms │                         48.96 ms │     no change │
│ QQuery 21    │ 318.31 ms │                        316.90 ms │     no change │
│ QQuery 22    │  17.58 ms │                         17.90 ms │     no change │
└──────────────┴───────────┴──────────────────────────────────┴───────────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━┓
┃ Benchmark Summary                               ┃           ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━┩
│ Total Time (HEAD)                               │ 1814.93ms │
│ Total Time (alamb_multi_byte_view_comparison)   │ 1813.21ms │
│ Average Time (HEAD)                             │   82.50ms │
│ Average Time (alamb_multi_byte_view_comparison) │   82.42ms │
│ Queries Faster                                  │         2 │
│ Queries Slower                                  │         4 │
│ Queries with No Change                          │        16 │
│ Queries with Failure                            │         0 │
└─────────────────────────────────────────────────┴───────────┘

alamb · 2025-12-17T22:59:41Z

Marking back to draft as I try out some other ideas from @Dandandan

Dandandan · 2025-12-18T06:14:37Z

run benchmark tpcds

alamb-ghbot · 2025-12-18T06:14:45Z

🤖 ./gh_compare_branch.sh gh_compare_branch.sh Running
Linux aal-dev 6.14.0-1018-gcp #19~24.04.1-Ubuntu SMP Wed Sep 24 23:23:09 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
Comparing alamb/multi_byte_view_comparison (da34c4c) to 14cd71e diff using: tpcds
Results will be posted here when complete

alamb-ghbot · 2025-12-18T06:38:12Z

🤖: Benchmark completed

Details

Comparing HEAD and alamb_multi_byte_view_comparison
--------------------
Benchmark tpcds_sf1.json
--------------------
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Query        ┃        HEAD ┃ alamb_multi_byte_view_comparison ┃        Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ QQuery 1     │    61.79 ms │                         61.64 ms │     no change │
│ QQuery 2     │   203.56 ms │                        200.17 ms │     no change │
│ QQuery 3     │   153.87 ms │                        159.01 ms │     no change │
│ QQuery 4     │  2033.58 ms │                       1931.14 ms │ +1.05x faster │
│ QQuery 5     │   267.11 ms │                        262.78 ms │     no change │
│ QQuery 6     │  1550.81 ms │                       1531.81 ms │     no change │
│ QQuery 7     │   502.46 ms │                        495.18 ms │     no change │
│ QQuery 8     │   168.22 ms │                        169.04 ms │     no change │
│ QQuery 9     │   267.41 ms │                        268.63 ms │     no change │
│ QQuery 10    │   165.27 ms │                        165.74 ms │     no change │
│ QQuery 11    │  1380.34 ms │                       1378.82 ms │     no change │
│ QQuery 12    │    70.46 ms │                         73.32 ms │     no change │
│ QQuery 13    │   555.03 ms │                        555.82 ms │     no change │
│ QQuery 14    │  1991.64 ms │                       2017.76 ms │     no change │
│ QQuery 15    │    27.90 ms │                         28.44 ms │     no change │
│ QQuery 16    │    59.48 ms │                         59.13 ms │     no change │
│ QQuery 17    │   351.16 ms │                        346.88 ms │     no change │
│ QQuery 18    │   189.18 ms │                        193.45 ms │     no change │
│ QQuery 19    │   227.42 ms │                        232.14 ms │     no change │
│ QQuery 20    │    22.62 ms │                         23.24 ms │     no change │
│ QQuery 21    │    34.71 ms │                         34.96 ms │     no change │
│ QQuery 22    │   971.80 ms │                        902.51 ms │ +1.08x faster │
│ QQuery 23    │  1812.07 ms │                       1801.98 ms │     no change │
│ QQuery 24    │   633.93 ms │                        632.98 ms │     no change │
│ QQuery 25    │   506.75 ms │                        512.24 ms │     no change │
│ QQuery 26    │   122.62 ms │                        128.65 ms │     no change │
│ QQuery 27    │   488.57 ms │                        494.95 ms │     no change │
│ QQuery 28    │   302.27 ms │                        302.85 ms │     no change │
│ QQuery 29    │   440.73 ms │                        436.42 ms │     no change │
│ QQuery 30    │    63.60 ms │                         65.63 ms │     no change │
│ QQuery 31    │   297.01 ms │                        302.12 ms │     no change │
│ QQuery 32    │    76.73 ms │                         80.40 ms │     no change │
│ QQuery 33    │   192.92 ms │                        189.27 ms │     no change │
│ QQuery 34    │   158.22 ms │                        162.63 ms │     no change │
│ QQuery 35    │   169.30 ms │                        165.74 ms │     no change │
│ QQuery 36    │   289.25 ms │                        292.32 ms │     no change │
│ QQuery 37    │   257.86 ms │                        260.83 ms │     no change │
│ QQuery 38    │   146.98 ms │                        145.08 ms │     no change │
│ QQuery 39    │   213.20 ms │                        216.10 ms │     no change │
│ QQuery 40    │   188.79 ms │                        188.73 ms │     no change │
│ QQuery 41    │    16.29 ms │                         16.25 ms │     no change │
│ QQuery 42    │   140.42 ms │                        141.79 ms │     no change │
│ QQuery 43    │   124.25 ms │                        124.67 ms │     no change │
│ QQuery 44    │    15.84 ms │                         15.85 ms │     no change │
│ QQuery 45    │    82.22 ms │                         84.98 ms │     no change │
│ QQuery 46    │   325.38 ms │                        328.71 ms │     no change │
│ QQuery 47    │  1314.45 ms │                       1272.21 ms │     no change │
│ QQuery 48    │   416.60 ms │                        417.76 ms │     no change │
│ QQuery 49    │   352.27 ms │                        356.68 ms │     no change │
│ QQuery 50    │   335.37 ms │                        338.14 ms │     no change │
│ QQuery 51    │   295.39 ms │                        299.23 ms │     no change │
│ QQuery 52    │   142.53 ms │                        142.80 ms │     no change │
│ QQuery 53    │   149.14 ms │                        149.82 ms │     no change │
│ QQuery 54    │   207.33 ms │                        213.63 ms │     no change │
│ QQuery 55    │   143.22 ms │                        143.18 ms │     no change │
│ QQuery 56    │   192.50 ms │                        195.91 ms │     no change │
│ QQuery 57    │   315.98 ms │                        316.97 ms │     no change │
│ QQuery 58    │   513.48 ms │                        520.51 ms │     no change │
│ QQuery 59    │   282.42 ms │                        281.60 ms │     no change │
│ QQuery 60    │   198.26 ms │                        196.24 ms │     no change │
│ QQuery 61    │   233.36 ms │                        237.64 ms │     no change │
│ QQuery 62    │  1350.10 ms │                       1342.62 ms │     no change │
│ QQuery 63    │   147.61 ms │                        149.77 ms │     no change │
│ QQuery 64    │  1155.50 ms │                       1162.17 ms │     no change │
│ QQuery 65    │   352.19 ms │                        362.20 ms │     no change │
│ QQuery 66    │   380.25 ms │                        370.56 ms │     no change │
│ QQuery 67    │   630.71 ms │                        610.62 ms │     no change │
│ QQuery 68    │   388.26 ms │                        383.19 ms │     no change │
│ QQuery 69    │   167.36 ms │                        163.01 ms │     no change │
│ QQuery 70    │   511.00 ms │                        496.85 ms │     no change │
│ QQuery 71    │   183.38 ms │                        181.99 ms │     no change │
│ QQuery 72    │  2533.36 ms │                       2500.78 ms │     no change │
│ QQuery 73    │   154.52 ms │                        153.24 ms │     no change │
│ QQuery 74    │   874.07 ms │                        852.25 ms │     no change │
│ QQuery 75    │   394.61 ms │                        400.61 ms │     no change │
│ QQuery 76    │   186.05 ms │                        186.17 ms │     no change │
│ QQuery 77    │   261.26 ms │                        263.80 ms │     no change │
│ QQuery 78    │   937.04 ms │                        930.09 ms │     no change │
│ QQuery 79    │   338.60 ms │                        337.28 ms │     no change │
│ QQuery 80    │   500.89 ms │                        494.00 ms │     no change │
│ QQuery 81    │    42.96 ms │                         41.36 ms │     no change │
│ QQuery 82    │   294.18 ms │                        296.19 ms │     no change │
│ QQuery 83    │    68.78 ms │                         65.59 ms │     no change │
│ QQuery 84    │    64.69 ms │                         63.12 ms │     no change │
│ QQuery 85    │   218.62 ms │                        222.75 ms │     no change │
│ QQuery 86    │    58.20 ms │                         57.96 ms │     no change │
│ QQuery 87    │   151.80 ms │                        144.76 ms │     no change │
│ QQuery 88    │   240.58 ms │                        243.38 ms │     no change │
│ QQuery 89    │   167.52 ms │                        167.34 ms │     no change │
│ QQuery 90    │    36.75 ms │                         34.93 ms │     no change │
│ QQuery 91    │    99.56 ms │                         94.41 ms │ +1.05x faster │
│ QQuery 92    │    77.48 ms │                         78.27 ms │     no change │
│ QQuery 93    │   267.09 ms │                        264.30 ms │     no change │
│ QQuery 94    │    86.23 ms │                         83.12 ms │     no change │
│ QQuery 95    │   261.23 ms │                        261.66 ms │     no change │
│ QQuery 96    │   109.22 ms │                        113.19 ms │     no change │
│ QQuery 97    │   190.34 ms │                        182.64 ms │     no change │
│ QQuery 98    │   239.16 ms │                        233.03 ms │     no change │
│ QQuery 99    │ 14834.27 ms │                      14821.76 ms │     no change │
└──────────────┴─────────────┴──────────────────────────────────┴───────────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┓
┃ Benchmark Summary                               ┃            ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━┩
│ Total Time (HEAD)                               │ 52368.73ms │
│ Total Time (alamb_multi_byte_view_comparison)   │ 52076.04ms │
│ Average Time (HEAD)                             │   528.98ms │
│ Average Time (alamb_multi_byte_view_comparison) │   526.02ms │
│ Queries Faster                                  │          3 │
│ Queries Slower                                  │          0 │
│ Queries with No Change                          │         96 │
│ Queries with Failure                            │          0 │
└─────────────────────────────────────────────────┴────────────┘

alamb · 2025-12-18T12:45:28Z

I benchmarked with and without the no-buffers optimization, and it seems to be worth around 1% (10 ms out of 990 ms).
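For context, here is a minimal sketch of why an array with no data buffers can be compared cheaply. It assumes the Arrow view layout, in which values of at most 12 bytes are stored inline in the 16-byte view and zero padded, so an array without data buffers holds only inline values. `ViewArray`, `inline_view`, and `equal_to_all_inline` are hypothetical names for illustration; this is not the code from da34c4c and it does not use the arrow-rs API.

```rust
/// Simplified stand-in for an Arrow byte-view array (hypothetical type,
/// not the arrow-rs `GenericByteViewArray`).
struct ViewArray {
    /// One 16-byte view per row. The low 32 bits hold the length; for values
    /// of at most 12 bytes the remaining bytes hold the data, zero padded.
    views: Vec<u128>,
    /// Out-of-line data for values longer than 12 bytes.
    buffers: Vec<Vec<u8>>,
}

const MAX_INLINE_LEN: usize = 12;

/// Build an inline view for a short value (panics if the value is too long).
fn inline_view(value: &[u8]) -> u128 {
    assert!(value.len() <= MAX_INLINE_LEN);
    let mut bytes = [0u8; 16];
    bytes[..4].copy_from_slice(&(value.len() as u32).to_le_bytes());
    bytes[4..4 + value.len()].copy_from_slice(value);
    u128::from_le_bytes(bytes)
}

impl ViewArray {
    /// With no data buffers, every value must be stored inline.
    fn all_inline(&self) -> bool {
        self.buffers.is_empty()
    }
}

/// Compare row pairs when both arrays hold only inline values.
/// Each comparison is a single u128 equality: no length branch and
/// no data-buffer lookup. Results are ANDed into `equal_to_results`
/// so mismatches recorded for earlier columns are preserved
/// (an assumption about how the caller combines columns).
fn equal_to_all_inline(
    lhs: &ViewArray,
    lhs_rows: &[usize],
    rhs: &ViewArray,
    rhs_rows: &[usize],
    equal_to_results: &mut [bool],
) {
    debug_assert!(lhs.all_inline() && rhs.all_inline());
    for ((&l, &r), result) in lhs_rows
        .iter()
        .zip(rhs_rows)
        .zip(equal_to_results.iter_mut())
    {
        *result &= lhs.views[l] == rhs.views[r];
    }
}

fn main() {
    let flags = ViewArray {
        views: vec![inline_view(b"A"), inline_view(b"N"), inline_view(b"R")],
        buffers: Vec::new(),
    };
    let groups = ViewArray {
        views: vec![inline_view(b"N"), inline_view(b"R")],
        buffers: Vec::new(),
    };
    let mut results = vec![true; 3];
    equal_to_all_inline(&flags, &[0, 1, 2], &groups, &[0, 0, 1], &mut results);
    println!("{results:?}");
}
```

Because the length lives in the view itself, two inline values of different lengths never compare equal, even when one is a zero-padded prefix of the other.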

hyperfine --warmup 3 " ./datafusion-cli-starting  -c \"select l_returnflag,l_linestatus, count(*) as count_order from 'lineitem.parquet' group by l_returnflag, l_linestatus;\" "

Starting (before da34c4c)

andrewlamb@Andrews-MacBook-Pro-3:~/Downloads$ hyperfine --warmup 3 " ./datafusion-cli-starting  -c \"select l_returnflag,l_linestatus, count(*) as count_order from 'lineitem.parquet' group by l_returnflag, l_linestatus;\" "
Benchmark 1:  ./datafusion-cli-starting  -c "select l_returnflag,l_linestatus, count(*) as count_order from 'lineitem.parquet' group by l_returnflag, l_linestatus;"
  Time (mean ± σ):     989.4 ms ±   9.2 ms    [User: 11962.4 ms, System: 620.9 ms]
  Range (min … max):   980.0 ms … 1007.0 ms    10 runs

After da34c4c

Benchmark 1:  ./datafusion-cli-no-buffers-opt  -c "select l_returnflag,l_linestatus, count(*) as count_order from 'lineitem.parquet' group by l_returnflag, l_linestatus;"
  Time (mean ± σ):     982.1 ms ±  10.0 ms    [User: 11962.7 ms, System: 607.3 ms]
  Range (min … max):   968.4 ms … 996.9 ms    10 runs

It seems reproducible:

andrewlamb@Andrews-MacBook-Pro-3:~/Downloads$ hyperfine --warmup 3 " ./datafusion-cli-starting  -c \"select l_returnflag,l_linestatus, count(*) as count_order from 'lineitem.parquet' group by l_returnflag, l_linestatus;\" "
Benchmark 1:  ./datafusion-cli-starting  -c "select l_returnflag,l_linestatus, count(*) as count_order from 'lineitem.parquet' group by l_returnflag, l_linestatus;"
  Time (mean ± σ):     985.9 ms ±  11.2 ms    [User: 11979.8 ms, System: 621.6 ms]
  Range (min … max):   970.4 ms … 1002.0 ms    10 runs

andrewlamb@Andrews-MacBook-Pro-3:~/Downloads$ hyperfine --warmup 3 " ./datafusion-cli-no-buffers-opt  -c \"select l_returnflag,l_linestatus, count(*) as count_order from 'lineitem.parquet' group by l_returnflag, l_linestatus;\" "
Benchmark 1:  ./datafusion-cli-no-buffers-opt  -c "select l_returnflag,l_linestatus, count(*) as count_order from 'lineitem.parquet' group by l_returnflag, l_linestatus;"
  Time (mean ± σ):     995.5 ms ±  17.8 ms    [User: 12140.3 ms, System: 627.2 ms]
  Range (min … max):   964.1 ms … 1023.3 ms    10 runs
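To put the hyperfine numbers above in context, here is a small sketch that computes the relative change between two means and checks whether the gap is larger than the reported standard deviations. The `Run` struct and both helper functions are hypothetical; the figures are copied from the runs above, and the noise check is a rough rule of thumb rather than a statistical test.

```rust
/// One hyperfine result: mean and standard deviation in milliseconds.
#[derive(Clone, Copy)]
struct Run {
    mean_ms: f64,
    sigma_ms: f64,
}

/// Relative change of `candidate` versus `baseline`, in percent.
/// Negative means the candidate is faster.
fn relative_change_pct(baseline: Run, candidate: Run) -> f64 {
    (candidate.mean_ms - baseline.mean_ms) / baseline.mean_ms * 100.0
}

/// True when the gap between the means is larger than both standard
/// deviations (a rule of thumb, not a significance test).
fn exceeds_noise(baseline: Run, candidate: Run) -> bool {
    let gap = (candidate.mean_ms - baseline.mean_ms).abs();
    gap > baseline.sigma_ms.max(candidate.sigma_ms)
}

fn main() {
    // (label, datafusion-cli-starting, datafusion-cli-no-buffers-opt)
    let pairs = [
        (
            "first pair",
            Run { mean_ms: 989.4, sigma_ms: 9.2 },
            Run { mean_ms: 982.1, sigma_ms: 10.0 },
        ),
        (
            "second pair",
            Run { mean_ms: 985.9, sigma_ms: 11.2 },
            Run { mean_ms: 995.5, sigma_ms: 17.8 },
        ),
    ];
    for (label, base, cand) in pairs {
        println!(
            "{label}: {:+.2}% (exceeds noise: {})",
            relative_change_pct(base, cand),
            exceeds_noise(base, cand)
        );
    }
}
```

A check like this helps judge whether a difference of about 1% stands out from run-to-run variation.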

alamb · 2025-12-19T16:52:07Z

I have a better version here:

Optimize muti-column grouping with StringView/ByteView (option 2) - 25% faster #19413

@camuel

…5% faster (#19413)

## Which issue does this PR close?

- Part of #18411
- Closes #19344
- Closes #19364

Note this is an alternate to #19364

## Rationale for this change

@camuel found a query where DuckDB's raw grouping is faster.

I looked into it and much of the difference can be explained by better vectorization in the comparisons and short string optimizations

## What changes are included in this PR?

Optimize (will comment inline)

## Are these changes tested?

By CI. See also benchmark results below. I tested manually as well

Create Data:

```shell
nice tpchgen-cli --tables=lineitem --format=parquet --scale-factor 100
```

Run query:

```shell
hyperfine --warmup 3 " datafusion-cli -c \"select l_returnflag,l_linestatus, count(*) as count_order from 'lineitem.parquet' group by l_returnflag, l_linestatus;\" "
```

Before (main): 1.368s

```shell
Benchmark 1: datafusion-cli -c "select l_returnflag,l_linestatus, count(*) as count_order from 'lineitem.parquet' group by l_returnflag, l_linestatus;"
  Time (mean ± σ):     1.393 s ±  0.020 s    [User: 16.778 s, System: 0.688 s]
  Range (min … max):   1.368 s …  1.438 s    10 runs
```

After (this PR) 1.022s

```shell
Benchmark 1: ./datafusion-cli-multi-gby-try2 -c "select l_returnflag,l_linestatus, count(*) as count_order from 'lineitem.parquet' group by l_returnflag, l_linestatus;"
  Time (mean ± σ):     1.022 s ±  0.015 s    [User: 11.685 s, System: 0.644 s]
  Range (min … max):   1.005 s …  1.052 s    10 runs
```

I have a PR that improves string view hashing performance too, see

- #19374

## Are there any user-facing changes?

Faster performance

github-actions bot added the physical-plan Changes to the physical-plan crate label Dec 16, 2025

alamb commented Dec 16, 2025


alamb mentioned this pull request Dec 16, 2025

TESTING: Optimize byte view comparison in multi groupby #19344

Closed

alamb marked this pull request as ready for review December 17, 2025 13:14

alamb added 2 commits December 17, 2025 17:53

Optimize muti-column grouping with StringView/ByteView

5406190

Add special case for no-buffers

da34c4c

alamb force-pushed the alamb/multi_byte_view_comparison branch from 39690eb to da34c4c Compare December 17, 2025 22:58

alamb marked this pull request as draft December 17, 2025 22:59

alamb mentioned this pull request Dec 19, 2025

Optimize muti-column grouping with StringView/ByteView (option 2) - 25% faster #19413

Merged

alamb closed this Dec 19, 2025
