@alamb
Contributor

@alamb alamb commented Dec 16, 2025

Which issue does this PR close?

Rationale for this change

@camuel found a query where DuckDB's raw grouping is faster.

I looked into it, and much of the difference can be explained by better vectorization in the comparisons and by short-string optimizations.

What changes are included in this PR?

Optimize the comparison of byte view (`GenericByteViewArray`) group keys (will comment inline)

Are these changes tested?

By CI. See also the benchmark results below. I tested manually as well:

Create Data:

nice tpchgen-cli --tables=lineitem --format=parquet --scale-factor 100

Run query:

time datafusion-cli  -c "select l_returnflag,l_linestatus, count(*) as count_order from 'lineitem.parquet' group by l_returnflag, l_linestatus;"

Before (main)

real	0m1.320s
user	0m16.887s
sys	0m0.669s

After (this PR)

real	0m1.037s
user	0m12.365s
sys	0m0.682s
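
As a quick sanity check on the timings above, the ratio of before to after works out to roughly 1.27x in wall-clock time and roughly 1.37x in user CPU time. The helper below is only an illustrative sketch; `speedup` is a hypothetical name, not part of DataFusion or its benchmark tooling.

```rust
/// Hypothetical helper: ratio of the old time to the new time.
/// A value above 1.0 means the new code is faster.
fn speedup(before_secs: f64, after_secs: f64) -> f64 {
    before_secs / after_secs
}

/// Ratios for the `time` output quoted above: (wall clock, user CPU).
fn reported_speedups() -> (f64, f64) {
    let wall = speedup(1.320, 1.037); // `real` before / after
    let user = speedup(16.887, 12.365); // `user` before / after
    (wall, user)
}
```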

I also have some thoughts on improving string view hashing performance; I will make that a separate PR.

Are there any user-facing changes?

Faster performance

@github-actions github-actions bot added the physical-plan Changes to the physical-plan crate label Dec 16, 2025

fn vectorized_equal_to_inner(
/// Comparison when there are no nulls in array
fn vectorized_equal_to_no_nulls(
Contributor Author

Optimization 1: create a second copy of the loop for the case where there are no nulls, which avoids the per-row null check and gives LLVM a better chance to optimize the loop.
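
A minimal sketch of the idea, using simplified stand-in types rather than the real arrow/DataFusion ones: `Column`, `equal_to_with_nulls`, and this cut-down `equal_to_no_nulls` are hypothetical names for illustration, and treating two nulls as equal is an assumption about grouping semantics.

```rust
/// Hypothetical, simplified stand-in for a nullable column of `u64` values
/// (not DataFusion's real type).
struct Column {
    values: Vec<u64>,
    /// `None` means the column contains no nulls at all.
    validity: Option<Vec<bool>>,
}

impl Column {
    fn is_valid(&self, row: usize) -> bool {
        self.validity.as_ref().map_or(true, |v| v[row])
    }
}

/// General loop: checks nullness for every row before comparing values.
fn equal_to_with_nulls(
    lhs: &Column,
    lhs_rows: &[usize],
    rhs: &Column,
    rhs_rows: &[usize],
    results: &mut [bool],
) {
    for ((&l, &r), out) in lhs_rows.iter().zip(rhs_rows).zip(results.iter_mut()) {
        let equal = match (lhs.is_valid(l), rhs.is_valid(r)) {
            (true, true) => lhs.values[l] == rhs.values[r],
            // Assumption: grouping treats two nulls as the same key.
            (false, false) => true,
            _ => false,
        };
        *out = *out && equal;
    }
}

/// Second copy of the loop: no null branch in the body, so the compiler
/// sees a plain load-compare-store loop.
fn equal_to_no_nulls(
    lhs: &Column,
    lhs_rows: &[usize],
    rhs: &Column,
    rhs_rows: &[usize],
    results: &mut [bool],
) {
    for ((&l, &r), out) in lhs_rows.iter().zip(rhs_rows).zip(results.iter_mut()) {
        *out = *out && lhs.values[l] == rhs.values[r];
    }
}

/// Pick the loop once per batch instead of branching on nulls per row.
fn vectorized_equal_to(
    lhs: &Column,
    lhs_rows: &[usize],
    rhs: &Column,
    rhs_rows: &[usize],
    results: &mut [bool],
) {
    if lhs.validity.is_none() && rhs.validity.is_none() {
        equal_to_no_nulls(lhs, lhs_rows, rhs, rhs_rows, results);
    } else {
        equal_to_with_nulls(lhs, lhs_rows, rhs, rhs_rows, results);
    }
}
```

The null check moves out of the hot loop and into a single per-batch decision, at the cost of some duplicated code.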

array: &GenericByteViewArray<B>,
rhs_row: usize,
) -> bool {
// SAFETY: the row indexes passed to vectorized_equal are in bounds
Contributor Author

Optimization 2: skip the bounds check on data access.
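
For illustration, here is roughly what skipping the bounds check looks like on plain slices of 16-byte views. This is a sketch with hypothetical function names (`equal_checked`, `equal_unchecked`), not the code in this PR; the only std API it relies on is `slice::get_unchecked`.

```rust
/// Checked access: each index is compared against the slice length and
/// panics if it is out of bounds.
fn equal_checked(lhs: &[u128], lhs_row: usize, rhs: &[u128], rhs_row: usize) -> bool {
    lhs[lhs_row] == rhs[rhs_row]
}

/// Unchecked access: no bounds test is emitted.
///
/// # Safety
/// The caller must guarantee `lhs_row < lhs.len()` and `rhs_row < rhs.len()`.
unsafe fn equal_unchecked(lhs: &[u128], lhs_row: usize, rhs: &[u128], rhs_row: usize) -> bool {
    // SAFETY: the caller guarantees that both indexes are in bounds.
    unsafe { *lhs.get_unchecked(lhs_row) == *rhs.get_unchecked(rhs_row) }
}
```

The unchecked form is only sound when the caller already upholds the in-bounds invariant, which is what the `// SAFETY` comment in the diff asserts about the row indexes.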

)
};
exist_inline == input_inline
// the views are inlined and the lengths are equal, so just
Contributor Author

Optimization 3: compare the views directly rather than breaking them into parts first.
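
To show why the direct comparison is enough for inlined values, here is a small sketch. It assumes the Arrow view layout (a 4-byte little-endian length followed by up to 12 inline data bytes, zero padded) and uses hypothetical helper names (`make_inline_view`, `inline_equal_by_parts`, `inline_equal_direct`) that are not part of arrow-rs.

```rust
/// Maximum number of data bytes stored inline in a 16-byte view.
const MAX_INLINE_LEN: usize = 12;

/// Build an inline view: length in the low 32 bits, then the data bytes,
/// with the unused bytes left as zero.
fn make_inline_view(data: &[u8]) -> u128 {
    assert!(data.len() <= MAX_INLINE_LEN);
    let mut bytes = [0u8; 16];
    bytes[..4].copy_from_slice(&(data.len() as u32).to_le_bytes());
    bytes[4..4 + data.len()].copy_from_slice(data);
    u128::from_le_bytes(bytes)
}

/// Comparison that first splits each inline view into length and data.
fn inline_equal_by_parts(a: u128, b: u128) -> bool {
    let (a_len, b_len) = ((a as u32) as usize, (b as u32) as usize);
    let (a_bytes, b_bytes) = (a.to_le_bytes(), b.to_le_bytes());
    a_len == b_len && a_bytes[4..4 + a_len] == b_bytes[4..4 + b_len]
}

/// Direct comparison: because the padding is zero, two inline views hold
/// the same value exactly when all 16 bytes match.
fn inline_equal_direct(a: u128, b: u128) -> bool {
    a == b
}
```

Both functions agree for inline views, but the direct form is a single 128-bit compare with no slicing. Values longer than 12 bytes are not stored inline and still need their data buffers compared.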

@alamb
Contributor Author

alamb commented Dec 16, 2025

run benchmarks

@alamb
Contributor Author

alamb commented Dec 16, 2025

run benchmark tpch

@Dandandan
Contributor

run benchmarks

@alamb-ghbot

🤖 ./gh_compare_branch.sh gh_compare_branch.sh Running
Linux aal-dev 6.14.0-1018-gcp #19~24.04.1-Ubuntu SMP Wed Sep 24 23:23:09 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
Comparing alamb/multi_byte_view_comparison (39690eb) to 50d20dd diff using: tpch_mem clickbench_partitioned clickbench_extended
Results will be posted here when complete

@alamb
Contributor Author

alamb commented Dec 17, 2025

My runner script keeps dying when there is a problem with the scripts. I am working on a way to keep it going (make it more resilient to failures).

@alamb-ghbot

🤖: Benchmark completed

Details

Comparing HEAD and alamb_multi_byte_view_comparison
--------------------
Benchmark clickbench_extended.json
--------------------
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━┓
┃ Query        ┃        HEAD ┃ alamb_multi_byte_view_comparison ┃    Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━┩
│ QQuery 0     │  2660.36 ms │                       2601.22 ms │ no change │
│ QQuery 1     │  1286.95 ms │                       1310.89 ms │ no change │
│ QQuery 2     │  2550.99 ms │                       2486.51 ms │ no change │
│ QQuery 3     │  1154.27 ms │                       1105.38 ms │ no change │
│ QQuery 4     │  2233.04 ms │                       2244.30 ms │ no change │
│ QQuery 5     │ 28548.05 ms │                      28375.50 ms │ no change │
│ QQuery 6     │  3979.59 ms │                       3939.18 ms │ no change │
│ QQuery 7     │  3379.67 ms │                       3441.10 ms │ no change │
└──────────────┴─────────────┴──────────────────────────────────┴───────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┓
┃ Benchmark Summary                               ┃            ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━┩
│ Total Time (HEAD)                               │ 45792.92ms │
│ Total Time (alamb_multi_byte_view_comparison)   │ 45504.08ms │
│ Average Time (HEAD)                             │  5724.11ms │
│ Average Time (alamb_multi_byte_view_comparison) │  5688.01ms │
│ Queries Faster                                  │          0 │
│ Queries Slower                                  │          0 │
│ Queries with No Change                          │          8 │
│ Queries with Failure                            │          0 │
└─────────────────────────────────────────────────┴────────────┘
--------------------
Benchmark clickbench_partitioned.json
--------------------
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Query        ┃        HEAD ┃ alamb_multi_byte_view_comparison ┃        Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ QQuery 0     │     2.46 ms │                          2.25 ms │ +1.10x faster │
│ QQuery 1     │    49.63 ms │                         50.76 ms │     no change │
│ QQuery 2     │   130.55 ms │                        134.38 ms │     no change │
│ QQuery 3     │   155.21 ms │                        154.41 ms │     no change │
│ QQuery 4     │  1058.23 ms │                       1044.99 ms │     no change │
│ QQuery 5     │  1427.87 ms │                       1458.75 ms │     no change │
│ QQuery 6     │     2.08 ms │                          2.13 ms │     no change │
│ QQuery 7     │    57.50 ms │                         57.19 ms │     no change │
│ QQuery 8     │  1364.42 ms │                       1357.62 ms │     no change │
│ QQuery 9     │  1763.00 ms │                       1810.69 ms │     no change │
│ QQuery 10    │   386.07 ms │                        379.89 ms │     no change │
│ QQuery 11    │   433.70 ms │                        419.83 ms │     no change │
│ QQuery 12    │  1325.84 ms │                       1333.81 ms │     no change │
│ QQuery 13    │  2025.58 ms │                       2010.85 ms │     no change │
│ QQuery 14    │  1265.55 ms │                       1216.11 ms │     no change │
│ QQuery 15    │  1187.30 ms │                       1182.08 ms │     no change │
│ QQuery 16    │  2590.85 ms │                       2521.91 ms │     no change │
│ QQuery 17    │  2566.20 ms │                       2490.35 ms │     no change │
│ QQuery 18    │  4892.63 ms │                       4738.91 ms │     no change │
│ QQuery 19    │   119.61 ms │                        121.86 ms │     no change │
│ QQuery 20    │  1875.55 ms │                       1837.59 ms │     no change │
│ QQuery 21    │  2175.43 ms │                       2173.27 ms │     no change │
│ QQuery 22    │  3739.74 ms │                       3703.37 ms │     no change │
│ QQuery 23    │ 12492.36 ms │                      12431.98 ms │     no change │
│ QQuery 24    │   216.07 ms │                        209.73 ms │     no change │
│ QQuery 25    │   474.88 ms │                        481.87 ms │     no change │
│ QQuery 26    │   216.52 ms │                        221.48 ms │     no change │
│ QQuery 27    │  2699.74 ms │                       2686.46 ms │     no change │
│ QQuery 28    │ 24092.13 ms │                      23987.16 ms │     no change │
│ QQuery 29    │   978.97 ms │                        953.06 ms │     no change │
│ QQuery 30    │  1296.55 ms │                       1318.28 ms │     no change │
│ QQuery 31    │  1297.85 ms │                       1338.78 ms │     no change │
│ QQuery 32    │  4919.38 ms │                       4499.39 ms │ +1.09x faster │
│ QQuery 33    │  5814.56 ms │                       5724.61 ms │     no change │
│ QQuery 34    │  5769.92 ms │                       5726.25 ms │     no change │
│ QQuery 35    │  1884.52 ms │                       1873.58 ms │     no change │
│ QQuery 36    │    66.11 ms │                         68.66 ms │     no change │
│ QQuery 37    │    45.93 ms │                         44.88 ms │     no change │
│ QQuery 38    │    63.87 ms │                         70.53 ms │  1.10x slower │
│ QQuery 39    │    98.66 ms │                        103.01 ms │     no change │
│ QQuery 40    │    27.37 ms │                         26.79 ms │     no change │
│ QQuery 41    │    23.14 ms │                         23.49 ms │     no change │
│ QQuery 42    │    19.29 ms │                         19.89 ms │     no change │
└──────────────┴─────────────┴──────────────────────────────────┴───────────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┓
┃ Benchmark Summary                               ┃            ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━┩
│ Total Time (HEAD)                               │ 93092.83ms │
│ Total Time (alamb_multi_byte_view_comparison)   │ 92012.89ms │
│ Average Time (HEAD)                             │  2164.95ms │
│ Average Time (alamb_multi_byte_view_comparison) │  2139.83ms │
│ Queries Faster                                  │          2 │
│ Queries Slower                                  │          1 │
│ Queries with No Change                          │         40 │
│ Queries with Failure                            │          0 │
└─────────────────────────────────────────────────┴────────────┘
--------------------
Benchmark tpch_mem_sf1.json
--------------------
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Query        ┃      HEAD ┃ alamb_multi_byte_view_comparison ┃        Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ QQuery 1     │ 143.22 ms │                        118.24 ms │ +1.21x faster │
│ QQuery 2     │  29.31 ms │                         29.34 ms │     no change │
│ QQuery 3     │  36.72 ms │                         38.94 ms │  1.06x slower │
│ QQuery 4     │  29.68 ms │                         28.68 ms │     no change │
│ QQuery 5     │  90.00 ms │                         87.45 ms │     no change │
│ QQuery 6     │  24.32 ms │                         19.99 ms │ +1.22x faster │
│ QQuery 7     │ 273.16 ms │                        226.90 ms │ +1.20x faster │
│ QQuery 8     │  43.96 ms │                         34.12 ms │ +1.29x faster │
│ QQuery 9     │ 116.15 ms │                        106.64 ms │ +1.09x faster │
│ QQuery 10    │  75.61 ms │                         62.61 ms │ +1.21x faster │
│ QQuery 11    │  19.65 ms │                         19.71 ms │     no change │
│ QQuery 12    │  63.90 ms │                         51.38 ms │ +1.24x faster │
│ QQuery 13    │  56.73 ms │                         48.39 ms │ +1.17x faster │
│ QQuery 14    │  16.31 ms │                         14.01 ms │ +1.16x faster │
│ QQuery 15    │  27.79 ms │                         24.65 ms │ +1.13x faster │
│ QQuery 16    │  26.23 ms │                         24.79 ms │ +1.06x faster │
│ QQuery 17    │ 179.65 ms │                        149.04 ms │ +1.21x faster │
│ QQuery 18    │ 294.61 ms │                        276.83 ms │ +1.06x faster │
│ QQuery 19    │  45.16 ms │                         37.14 ms │ +1.22x faster │
│ QQuery 20    │  49.01 ms │                         50.20 ms │     no change │
│ QQuery 21    │ 327.96 ms │                        301.84 ms │ +1.09x faster │
│ QQuery 22    │  17.25 ms │                         17.67 ms │     no change │
└──────────────┴───────────┴──────────────────────────────────┴───────────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━┓
┃ Benchmark Summary                               ┃           ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━┩
│ Total Time (HEAD)                               │ 1986.39ms │
│ Total Time (alamb_multi_byte_view_comparison)   │ 1768.54ms │
│ Average Time (HEAD)                             │   90.29ms │
│ Average Time (alamb_multi_byte_view_comparison) │   80.39ms │
│ Queries Faster                                  │        15 │
│ Queries Slower                                  │         1 │
│ Queries with No Change                          │         6 │
│ Queries with Failure                            │         0 │
└─────────────────────────────────────────────────┴───────────┘

@alamb-ghbot

🤖 ./gh_compare_branch.sh gh_compare_branch.sh Running
Linux aal-dev 6.14.0-1018-gcp #19~24.04.1-Ubuntu SMP Wed Sep 24 23:23:09 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
Comparing alamb/multi_byte_view_comparison (39690eb) to 50d20dd diff using: tpch
Results will be posted here when complete

@alamb-ghbot

🤖: Benchmark completed

Details

Comparing HEAD and alamb_multi_byte_view_comparison
--------------------
Benchmark tpch_sf1.json
--------------------
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Query        ┃      HEAD ┃ alamb_multi_byte_view_comparison ┃        Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ QQuery 1     │ 212.24 ms │                        197.29 ms │ +1.08x faster │
│ QQuery 2     │  97.69 ms │                         94.62 ms │     no change │
│ QQuery 3     │ 120.75 ms │                        123.57 ms │     no change │
│ QQuery 4     │  75.88 ms │                         76.94 ms │     no change │
│ QQuery 5     │ 175.33 ms │                        175.18 ms │     no change │
│ QQuery 6     │  67.79 ms │                         67.88 ms │     no change │
│ QQuery 7     │ 207.70 ms │                        209.71 ms │     no change │
│ QQuery 8     │ 165.65 ms │                        165.16 ms │     no change │
│ QQuery 9     │ 223.06 ms │                        224.60 ms │     no change │
│ QQuery 10    │ 183.75 ms │                        183.52 ms │     no change │
│ QQuery 11    │  74.98 ms │                         73.93 ms │     no change │
│ QQuery 12    │ 119.37 ms │                        118.45 ms │     no change │
│ QQuery 13    │ 215.24 ms │                        213.45 ms │     no change │
│ QQuery 14    │  93.30 ms │                         88.17 ms │ +1.06x faster │
│ QQuery 15    │ 122.27 ms │                        123.95 ms │     no change │
│ QQuery 16    │  58.69 ms │                         58.23 ms │     no change │
│ QQuery 17    │ 269.21 ms │                        273.34 ms │     no change │
│ QQuery 18    │ 307.57 ms │                        310.35 ms │     no change │
│ QQuery 19    │ 136.22 ms │                        134.94 ms │     no change │
│ QQuery 20    │ 124.35 ms │                        123.55 ms │     no change │
│ QQuery 21    │ 266.98 ms │                        265.49 ms │     no change │
│ QQuery 22    │  43.72 ms │                         42.24 ms │     no change │
└──────────────┴───────────┴──────────────────────────────────┴───────────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━┓
┃ Benchmark Summary                               ┃           ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━┩
│ Total Time (HEAD)                               │ 3361.73ms │
│ Total Time (alamb_multi_byte_view_comparison)   │ 3344.56ms │
│ Average Time (HEAD)                             │  152.81ms │
│ Average Time (alamb_multi_byte_view_comparison) │  152.03ms │
│ Queries Faster                                  │         2 │
│ Queries Slower                                  │         0 │
│ Queries with No Change                          │        20 │
│ Queries with Failure                            │         0 │
└─────────────────────────────────────────────────┴───────────┘

@alamb-ghbot

🤖 ./gh_compare_branch.sh gh_compare_branch.sh Running
Linux aal-dev 6.14.0-1018-gcp #19~24.04.1-Ubuntu SMP Wed Sep 24 23:23:09 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
Comparing alamb/multi_byte_view_comparison (39690eb) to 50d20dd diff using: tpch_mem clickbench_partitioned clickbench_extended
Results will be posted here when complete

@alamb
Contributor Author

alamb commented Dec 17, 2025

TPCH Q1 does have the pattern that is optimized in this PR (multiple group by columns), so it is plausible that the benefits measured there are real:

select
l_returnflag,
l_linestatus,
sum(l_quantity) as sum_qty,
sum(l_extendedprice) as sum_base_price,
sum(l_extendedprice * (1 - l_discount)) as sum_disc_price,
sum(l_extendedprice * (1 - l_discount) * (1 + l_tax)) as sum_charge,
avg(l_quantity) as avg_qty,
avg(l_extendedprice) as avg_price,
avg(l_discount) as avg_disc,
count(*) as count_order
from
lineitem
where
l_shipdate <= date '1998-09-02'
group by
l_returnflag,
l_linestatus
order by
l_returnflag,
l_linestatus;

@alamb
Contributor Author

alamb commented Dec 17, 2025

My runner script keeps dying when there is a problem with the scripts. I am working on a way to keep it going (make it more resilient to failures).

I added some error checking here: alamb/datafusion-benchmarking@64ebd3a

@alamb-ghbot

🤖: Benchmark completed

Details

Comparing HEAD and alamb_multi_byte_view_comparison
--------------------
Benchmark clickbench_extended.json
--------------------
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━┓
┃ Query        ┃        HEAD ┃ alamb_multi_byte_view_comparison ┃    Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━┩
│ QQuery 0     │  2654.93 ms │                       2698.13 ms │ no change │
│ QQuery 1     │  1318.69 ms │                       1258.88 ms │ no change │
│ QQuery 2     │  2438.94 ms │                       2526.07 ms │ no change │
│ QQuery 3     │  1132.09 ms │                       1130.33 ms │ no change │
│ QQuery 4     │  2250.77 ms │                       2241.25 ms │ no change │
│ QQuery 5     │ 28459.49 ms │                      28764.05 ms │ no change │
│ QQuery 6     │  3966.94 ms │                       3978.11 ms │ no change │
│ QQuery 7     │  3455.76 ms │                       3546.16 ms │ no change │
└──────────────┴─────────────┴──────────────────────────────────┴───────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┓
┃ Benchmark Summary                               ┃            ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━┩
│ Total Time (HEAD)                               │ 45677.61ms │
│ Total Time (alamb_multi_byte_view_comparison)   │ 46142.98ms │
│ Average Time (HEAD)                             │  5709.70ms │
│ Average Time (alamb_multi_byte_view_comparison) │  5767.87ms │
│ Queries Faster                                  │          0 │
│ Queries Slower                                  │          0 │
│ Queries with No Change                          │          8 │
│ Queries with Failure                            │          0 │
└─────────────────────────────────────────────────┴────────────┘
--------------------
Benchmark clickbench_partitioned.json
--------------------
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Query        ┃        HEAD ┃ alamb_multi_byte_view_comparison ┃        Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ QQuery 0     │     2.45 ms │                          2.24 ms │ +1.09x faster │
│ QQuery 1     │    52.67 ms │                         51.50 ms │     no change │
│ QQuery 2     │   134.23 ms │                        134.41 ms │     no change │
│ QQuery 3     │   151.39 ms │                        154.17 ms │     no change │
│ QQuery 4     │  1069.20 ms │                       1251.10 ms │  1.17x slower │
│ QQuery 5     │  1457.17 ms │                       1737.72 ms │  1.19x slower │
│ QQuery 6     │     2.20 ms │                          2.12 ms │     no change │
│ QQuery 7     │    57.10 ms │                         55.54 ms │     no change │
│ QQuery 8     │  1418.06 ms │                       1568.71 ms │  1.11x slower │
│ QQuery 9     │  1820.46 ms │                       1982.91 ms │  1.09x slower │
│ QQuery 10    │   383.95 ms │                        381.85 ms │     no change │
│ QQuery 11    │   436.87 ms │                        429.02 ms │     no change │
│ QQuery 12    │  1364.82 ms │                       1644.13 ms │  1.20x slower │
│ QQuery 13    │  2030.01 ms │                       2213.73 ms │  1.09x slower │
│ QQuery 14    │  1245.92 ms │                       1399.10 ms │  1.12x slower │
│ QQuery 15    │  1220.64 ms │                       1433.16 ms │  1.17x slower │
│ QQuery 16    │  2610.27 ms │                       2731.04 ms │     no change │
│ QQuery 17    │  2593.14 ms │                       2730.75 ms │  1.05x slower │
│ QQuery 18    │  5201.98 ms │                       5053.13 ms │     no change │
│ QQuery 19    │   123.45 ms │                        124.49 ms │     no change │
│ QQuery 20    │  1942.33 ms │                       1904.46 ms │     no change │
│ QQuery 21    │  2212.15 ms │                       2192.22 ms │     no change │
│ QQuery 22    │  3829.95 ms │                       3806.20 ms │     no change │
│ QQuery 23    │ 16774.03 ms │                      12620.76 ms │ +1.33x faster │
│ QQuery 24    │   214.67 ms │                        222.76 ms │     no change │
│ QQuery 25    │   469.39 ms │                        488.55 ms │     no change │
│ QQuery 26    │   238.30 ms │                        210.19 ms │ +1.13x faster │
│ QQuery 27    │  2733.25 ms │                       2671.62 ms │     no change │
│ QQuery 28    │ 24468.41 ms │                      24083.89 ms │     no change │
│ QQuery 29    │   953.81 ms │                        970.17 ms │     no change │
│ QQuery 30    │  1358.21 ms │                       1374.33 ms │     no change │
│ QQuery 31    │  1386.23 ms │                       1382.56 ms │     no change │
│ QQuery 32    │  5287.29 ms │                       4855.42 ms │ +1.09x faster │
│ QQuery 33    │  5831.31 ms │                       5784.91 ms │     no change │
│ QQuery 34    │  6047.35 ms │                       5824.61 ms │     no change │
│ QQuery 35    │  1929.25 ms │                       2081.71 ms │  1.08x slower │
│ QQuery 36    │    70.23 ms │                         68.42 ms │     no change │
│ QQuery 37    │    46.23 ms │                         47.32 ms │     no change │
│ QQuery 38    │    67.06 ms │                         68.95 ms │     no change │
│ QQuery 39    │   104.26 ms │                        107.45 ms │     no change │
│ QQuery 40    │    27.77 ms │                         26.43 ms │     no change │
│ QQuery 41    │    23.46 ms │                         24.41 ms │     no change │
│ QQuery 42    │    20.52 ms │                         20.13 ms │     no change │
└──────────────┴─────────────┴──────────────────────────────────┴───────────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┓
┃ Benchmark Summary                               ┃            ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━┩
│ Total Time (HEAD)                               │ 99411.41ms │
│ Total Time (alamb_multi_byte_view_comparison)   │ 95918.30ms │
│ Average Time (HEAD)                             │  2311.89ms │
│ Average Time (alamb_multi_byte_view_comparison) │  2230.66ms │
│ Queries Faster                                  │          4 │
│ Queries Slower                                  │         10 │
│ Queries with No Change                          │         29 │
│ Queries with Failure                            │          0 │
└─────────────────────────────────────────────────┴────────────┘
--------------------
Benchmark tpch_mem_sf1.json
--------------------
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Query        ┃      HEAD ┃ alamb_multi_byte_view_comparison ┃        Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ QQuery 1     │ 131.95 ms │                        118.14 ms │ +1.12x faster │
│ QQuery 2     │  28.62 ms │                         28.00 ms │     no change │
│ QQuery 3     │  38.36 ms │                         37.47 ms │     no change │
│ QQuery 4     │  29.07 ms │                         28.98 ms │     no change │
│ QQuery 5     │  87.08 ms │                         88.72 ms │     no change │
│ QQuery 6     │  19.72 ms │                         19.91 ms │     no change │
│ QQuery 7     │ 230.33 ms │                        228.48 ms │     no change │
│ QQuery 8     │  35.99 ms │                         38.56 ms │  1.07x slower │
│ QQuery 9     │ 109.87 ms │                        113.87 ms │     no change │
│ QQuery 10    │  64.34 ms │                         64.48 ms │     no change │
│ QQuery 11    │  17.76 ms │                         19.00 ms │  1.07x slower │
│ QQuery 12    │  51.67 ms │                         51.38 ms │     no change │
│ QQuery 13    │  48.76 ms │                         52.19 ms │  1.07x slower │
│ QQuery 14    │  14.23 ms │                         14.00 ms │     no change │
│ QQuery 15    │  24.49 ms │                         26.71 ms │  1.09x slower │
│ QQuery 16    │  26.54 ms │                         24.76 ms │ +1.07x faster │
│ QQuery 17    │ 151.67 ms │                        154.51 ms │     no change │
│ QQuery 18    │ 280.64 ms │                        283.08 ms │     no change │
│ QQuery 19    │  37.06 ms │                         37.22 ms │     no change │
│ QQuery 20    │  50.88 ms │                         48.96 ms │     no change │
│ QQuery 21    │ 318.31 ms │                        316.90 ms │     no change │
│ QQuery 22    │  17.58 ms │                         17.90 ms │     no change │
└──────────────┴───────────┴──────────────────────────────────┴───────────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━┓
┃ Benchmark Summary                               ┃           ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━┩
│ Total Time (HEAD)                               │ 1814.93ms │
│ Total Time (alamb_multi_byte_view_comparison)   │ 1813.21ms │
│ Average Time (HEAD)                             │   82.50ms │
│ Average Time (alamb_multi_byte_view_comparison) │   82.42ms │
│ Queries Faster                                  │         2 │
│ Queries Slower                                  │         4 │
│ Queries with No Change                          │        16 │
│ Queries with Failure                            │         0 │
└─────────────────────────────────────────────────┴───────────┘

@alamb alamb marked this pull request as ready for review December 17, 2025 13:14
@alamb alamb force-pushed the alamb/multi_byte_view_comparison branch from 39690eb to da34c4c Compare December 17, 2025 22:58
@alamb alamb marked this pull request as draft December 17, 2025 22:59
@alamb
Contributor Author

alamb commented Dec 17, 2025

Marking this back to draft while I try out some other ideas from @Dandandan.

@Dandandan
Contributor

run benchmark tpcds

@alamb-ghbot

🤖 ./gh_compare_branch.sh gh_compare_branch.sh Running
Linux aal-dev 6.14.0-1018-gcp #19~24.04.1-Ubuntu SMP Wed Sep 24 23:23:09 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
Comparing alamb/multi_byte_view_comparison (da34c4c) to 14cd71e diff using: tpcds
Results will be posted here when complete

@alamb-ghbot

🤖: Benchmark completed

Details

Comparing HEAD and alamb_multi_byte_view_comparison
--------------------
Benchmark tpcds_sf1.json
--------------------
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Query        ┃        HEAD ┃ alamb_multi_byte_view_comparison ┃        Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ QQuery 1     │    61.79 ms │                         61.64 ms │     no change │
│ QQuery 2     │   203.56 ms │                        200.17 ms │     no change │
│ QQuery 3     │   153.87 ms │                        159.01 ms │     no change │
│ QQuery 4     │  2033.58 ms │                       1931.14 ms │ +1.05x faster │
│ QQuery 5     │   267.11 ms │                        262.78 ms │     no change │
│ QQuery 6     │  1550.81 ms │                       1531.81 ms │     no change │
│ QQuery 7     │   502.46 ms │                        495.18 ms │     no change │
│ QQuery 8     │   168.22 ms │                        169.04 ms │     no change │
│ QQuery 9     │   267.41 ms │                        268.63 ms │     no change │
│ QQuery 10    │   165.27 ms │                        165.74 ms │     no change │
│ QQuery 11    │  1380.34 ms │                       1378.82 ms │     no change │
│ QQuery 12    │    70.46 ms │                         73.32 ms │     no change │
│ QQuery 13    │   555.03 ms │                        555.82 ms │     no change │
│ QQuery 14    │  1991.64 ms │                       2017.76 ms │     no change │
│ QQuery 15    │    27.90 ms │                         28.44 ms │     no change │
│ QQuery 16    │    59.48 ms │                         59.13 ms │     no change │
│ QQuery 17    │   351.16 ms │                        346.88 ms │     no change │
│ QQuery 18    │   189.18 ms │                        193.45 ms │     no change │
│ QQuery 19    │   227.42 ms │                        232.14 ms │     no change │
│ QQuery 20    │    22.62 ms │                         23.24 ms │     no change │
│ QQuery 21    │    34.71 ms │                         34.96 ms │     no change │
│ QQuery 22    │   971.80 ms │                        902.51 ms │ +1.08x faster │
│ QQuery 23    │  1812.07 ms │                       1801.98 ms │     no change │
│ QQuery 24    │   633.93 ms │                        632.98 ms │     no change │
│ QQuery 25    │   506.75 ms │                        512.24 ms │     no change │
│ QQuery 26    │   122.62 ms │                        128.65 ms │     no change │
│ QQuery 27    │   488.57 ms │                        494.95 ms │     no change │
│ QQuery 28    │   302.27 ms │                        302.85 ms │     no change │
│ QQuery 29    │   440.73 ms │                        436.42 ms │     no change │
│ QQuery 30    │    63.60 ms │                         65.63 ms │     no change │
│ QQuery 31    │   297.01 ms │                        302.12 ms │     no change │
│ QQuery 32    │    76.73 ms │                         80.40 ms │     no change │
│ QQuery 33    │   192.92 ms │                        189.27 ms │     no change │
│ QQuery 34    │   158.22 ms │                        162.63 ms │     no change │
│ QQuery 35    │   169.30 ms │                        165.74 ms │     no change │
│ QQuery 36    │   289.25 ms │                        292.32 ms │     no change │
│ QQuery 37    │   257.86 ms │                        260.83 ms │     no change │
│ QQuery 38    │   146.98 ms │                        145.08 ms │     no change │
│ QQuery 39    │   213.20 ms │                        216.10 ms │     no change │
│ QQuery 40    │   188.79 ms │                        188.73 ms │     no change │
│ QQuery 41    │    16.29 ms │                         16.25 ms │     no change │
│ QQuery 42    │   140.42 ms │                        141.79 ms │     no change │
│ QQuery 43    │   124.25 ms │                        124.67 ms │     no change │
│ QQuery 44    │    15.84 ms │                         15.85 ms │     no change │
│ QQuery 45    │    82.22 ms │                         84.98 ms │     no change │
│ QQuery 46    │   325.38 ms │                        328.71 ms │     no change │
│ QQuery 47    │  1314.45 ms │                       1272.21 ms │     no change │
│ QQuery 48    │   416.60 ms │                        417.76 ms │     no change │
│ QQuery 49    │   352.27 ms │                        356.68 ms │     no change │
│ QQuery 50    │   335.37 ms │                        338.14 ms │     no change │
│ QQuery 51    │   295.39 ms │                        299.23 ms │     no change │
│ QQuery 52    │   142.53 ms │                        142.80 ms │     no change │
│ QQuery 53    │   149.14 ms │                        149.82 ms │     no change │
│ QQuery 54    │   207.33 ms │                        213.63 ms │     no change │
│ QQuery 55    │   143.22 ms │                        143.18 ms │     no change │
│ QQuery 56    │   192.50 ms │                        195.91 ms │     no change │
│ QQuery 57    │   315.98 ms │                        316.97 ms │     no change │
│ QQuery 58    │   513.48 ms │                        520.51 ms │     no change │
│ QQuery 59    │   282.42 ms │                        281.60 ms │     no change │
│ QQuery 60    │   198.26 ms │                        196.24 ms │     no change │
│ QQuery 61    │   233.36 ms │                        237.64 ms │     no change │
│ QQuery 62    │  1350.10 ms │                       1342.62 ms │     no change │
│ QQuery 63    │   147.61 ms │                        149.77 ms │     no change │
│ QQuery 64    │  1155.50 ms │                       1162.17 ms │     no change │
│ QQuery 65    │   352.19 ms │                        362.20 ms │     no change │
│ QQuery 66    │   380.25 ms │                        370.56 ms │     no change │
│ QQuery 67    │   630.71 ms │                        610.62 ms │     no change │
│ QQuery 68    │   388.26 ms │                        383.19 ms │     no change │
│ QQuery 69    │   167.36 ms │                        163.01 ms │     no change │
│ QQuery 70    │   511.00 ms │                        496.85 ms │     no change │
│ QQuery 71    │   183.38 ms │                        181.99 ms │     no change │
│ QQuery 72    │  2533.36 ms │                       2500.78 ms │     no change │
│ QQuery 73    │   154.52 ms │                        153.24 ms │     no change │
│ QQuery 74    │   874.07 ms │                        852.25 ms │     no change │
│ QQuery 75    │   394.61 ms │                        400.61 ms │     no change │
│ QQuery 76    │   186.05 ms │                        186.17 ms │     no change │
│ QQuery 77    │   261.26 ms │                        263.80 ms │     no change │
│ QQuery 78    │   937.04 ms │                        930.09 ms │     no change │
│ QQuery 79    │   338.60 ms │                        337.28 ms │     no change │
│ QQuery 80    │   500.89 ms │                        494.00 ms │     no change │
│ QQuery 81    │    42.96 ms │                         41.36 ms │     no change │
│ QQuery 82    │   294.18 ms │                        296.19 ms │     no change │
│ QQuery 83    │    68.78 ms │                         65.59 ms │     no change │
│ QQuery 84    │    64.69 ms │                         63.12 ms │     no change │
│ QQuery 85    │   218.62 ms │                        222.75 ms │     no change │
│ QQuery 86    │    58.20 ms │                         57.96 ms │     no change │
│ QQuery 87    │   151.80 ms │                        144.76 ms │     no change │
│ QQuery 88    │   240.58 ms │                        243.38 ms │     no change │
│ QQuery 89    │   167.52 ms │                        167.34 ms │     no change │
│ QQuery 90    │    36.75 ms │                         34.93 ms │     no change │
│ QQuery 91    │    99.56 ms │                         94.41 ms │ +1.05x faster │
│ QQuery 92    │    77.48 ms │                         78.27 ms │     no change │
│ QQuery 93    │   267.09 ms │                        264.30 ms │     no change │
│ QQuery 94    │    86.23 ms │                         83.12 ms │     no change │
│ QQuery 95    │   261.23 ms │                        261.66 ms │     no change │
│ QQuery 96    │   109.22 ms │                        113.19 ms │     no change │
│ QQuery 97    │   190.34 ms │                        182.64 ms │     no change │
│ QQuery 98    │   239.16 ms │                        233.03 ms │     no change │
│ QQuery 99    │ 14834.27 ms │                      14821.76 ms │     no change │
└──────────────┴─────────────┴──────────────────────────────────┴───────────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┓
┃ Benchmark Summary                               ┃            ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━┩
│ Total Time (HEAD)                               │ 52368.73ms │
│ Total Time (alamb_multi_byte_view_comparison)   │ 52076.04ms │
│ Average Time (HEAD)                             │   528.98ms │
│ Average Time (alamb_multi_byte_view_comparison) │   526.02ms │
│ Queries Faster                                  │          3 │
│ Queries Slower                                  │          0 │
│ Queries with No Change                          │         96 │
│ Queries with Failure                            │          0 │
└─────────────────────────────────────────────────┴────────────┘


alamb commented Dec 18, 2025

I benchmarked with and without the no-buffers optimization, and it seems to be worth around 1% (10ms out of 990ms).

hyperfine --warmup 3 " ./datafusion-cli-starting  -c \"select l_returnflag,l_linestatus, count(*) as count_order from 'lineitem.parquet' group by l_returnflag, l_linestatus;\" "

Starting (before da34c4c)

andrewlamb@Andrews-MacBook-Pro-3:~/Downloads$ hyperfine --warmup 3 " ./datafusion-cli-starting  -c \"select l_returnflag,l_linestatus, count(*) as count_order from 'lineitem.parquet' group by l_returnflag, l_linestatus;\" "
Benchmark 1:  ./datafusion-cli-starting  -c "select l_returnflag,l_linestatus, count(*) as count_order from 'lineitem.parquet' group by l_returnflag, l_linestatus;"
  Time (mean ± σ):     989.4 ms ±   9.2 ms    [User: 11962.4 ms, System: 620.9 ms]
  Range (min … max):   980.0 ms … 1007.0 ms    10 runs

After da34c4c

Benchmark 1:  ./datafusion-cli-no-buffers-opt  -c "select l_returnflag,l_linestatus, count(*) as count_order from 'lineitem.parquet' group by l_returnflag, l_linestatus;"
  Time (mean ± σ):     982.1 ms ±  10.0 ms    [User: 11962.7 ms, System: 607.3 ms]
  Range (min … max):   968.4 ms … 996.9 ms    10 runs

It seems reproducible:

andrewlamb@Andrews-MacBook-Pro-3:~/Downloads$ hyperfine --warmup 3 " ./datafusion-cli-starting  -c \"select l_returnflag,l_linestatus, count(*) as count_order from 'lineitem.parquet' group by l_returnflag, l_linestatus;\" "
Benchmark 1:  ./datafusion-cli-starting  -c "select l_returnflag,l_linestatus, count(*) as count_order from 'lineitem.parquet' group by l_returnflag, l_linestatus;"
  Time (mean ± σ):     985.9 ms ±  11.2 ms    [User: 11979.8 ms, System: 621.6 ms]
  Range (min … max):   970.4 ms … 1002.0 ms    10 runs

andrewlamb@Andrews-MacBook-Pro-3:~/Downloads$ hyperfine --warmup 3 " ./datafusion-cli-no-buffers-opt  -c \"select l_returnflag,l_linestatus, count(*) as count_order from 'lineitem.parquet' group by l_returnflag, l_linestatus;\" "
Benchmark 1:  ./datafusion-cli-no-buffers-opt  -c "select l_returnflag,l_linestatus, count(*) as count_order from 'lineitem.parquet' group by l_returnflag, l_linestatus;"
  Time (mean ± σ):     995.5 ms ±  17.8 ms    [User: 12140.3 ms, System: 627.2 ms]
  Range (min … max):   964.1 ms … 1023.3 ms    10 runs
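One rough way to read the four hyperfine runs above is to compare the difference in means against the reported standard deviations. The sketch below does that with the numbers quoted in this comment; the `Run` struct, the helper names, and the "smaller than the larger σ" rule are assumptions made for illustration, not a proper statistical test and not part of hyperfine.

```rust
/// Mean and standard deviation of one hyperfine benchmark, in milliseconds.
/// (Hypothetical helper type, introduced only for this sketch.)
struct Run {
    mean_ms: f64,
    stddev_ms: f64,
}

/// Baseline mean minus candidate mean; positive means the candidate was faster.
fn delta_ms(baseline: &Run, candidate: &Run) -> f64 {
    baseline.mean_ms - candidate.mean_ms
}

/// Crude noise check: is the gap between the means smaller than the larger
/// of the two standard deviations?
fn within_noise(baseline: &Run, candidate: &Run) -> bool {
    delta_ms(baseline, candidate).abs() < baseline.stddev_ms.max(candidate.stddev_ms)
}

fn main() {
    // (starting, no-buffers-opt) pairs copied from the hyperfine output above.
    let pairs = [
        (
            Run { mean_ms: 989.4, stddev_ms: 9.2 },
            Run { mean_ms: 982.1, stddev_ms: 10.0 },
        ),
        (
            Run { mean_ms: 985.9, stddev_ms: 11.2 },
            Run { mean_ms: 995.5, stddev_ms: 17.8 },
        ),
    ];
    for (i, (starting, no_buffers)) in pairs.iter().enumerate() {
        println!(
            "pair {}: delta = {:.1} ms, within noise = {}",
            i + 1,
            delta_ms(starting, no_buffers),
            within_noise(starting, no_buffers)
        );
    }
}
```

Under this rule both pairs fall within one standard deviation, and the sign of the difference flips between the first and second pair, so the roughly 1% figure is best treated as approximate.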


alamb commented Dec 19, 2025

alamb closed this Dec 19, 2025
github-merge-queue bot pushed a commit that referenced this pull request Dec 20, 2025
…5% faster (#19413)

## Which issue does this PR close?


- Part of #18411
- Closes #19344
- Closes #19364

Note: this is an alternative to #19364.

## Rationale for this change

@camuel found a query where DuckDB's raw grouping is faster.

I looked into it and much of the difference can be explained by better
vectorization in the comparisons and short string optimizations
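To illustrate why short strings can be compared cheaply, here is a minimal sketch that assumes the Arrow view layout: a 16-byte value whose first 4 bytes hold the length and, for strings of at most 12 bytes, whose remaining bytes hold the zero-padded data inline. The function names are hypothetical and this is not the code from this PR; it only shows that equality of two inlined values reduces to a single `u128` comparison.

```rust
/// Pack a short byte string (at most 12 bytes) into a 16-byte view:
/// a 4-byte little-endian length followed by zero-padded inline data.
/// (Hypothetical helper, written for this sketch only.)
fn make_inline_view(data: &[u8]) -> u128 {
    assert!(data.len() <= 12, "only short strings are stored inline");
    let mut bytes = [0u8; 16];
    bytes[..4].copy_from_slice(&(data.len() as u32).to_le_bytes());
    bytes[4..4 + data.len()].copy_from_slice(data);
    u128::from_le_bytes(bytes)
}

/// The length occupies the low 32 bits of the view.
fn view_len(view: u128) -> u32 {
    view as u32
}

/// Equality for two inlined views: the length and the padded bytes live in
/// the same 128-bit word, so one integer comparison covers both and no
/// data buffer needs to be read.
fn inline_views_equal(lhs: u128, rhs: u128) -> bool {
    debug_assert!(view_len(lhs) <= 12 && view_len(rhs) <= 12);
    lhs == rhs
}

fn main() {
    // Single-character flag values, like those in l_returnflag.
    let r = make_inline_view(b"R");
    let n = make_inline_view(b"N");
    println!("R == R: {}", inline_views_equal(r, make_inline_view(b"R")));
    println!("R == N: {}", inline_views_equal(r, n));
}
```

The columns in the benchmark query (`l_returnflag`, `l_linestatus`) hold single-character values, so every comparison in that query would take this inline path under the assumed layout.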

## What changes are included in this PR?

Optimize the byte view comparisons used for multi-column grouping (will comment inline)

## Are these changes tested?

By CI. See also benchmark results below. I tested manually as well

Create Data:
```shell
nice tpchgen-cli --tables=lineitem --format=parquet --scale-factor 100
```

Run query:
```shell
hyperfine --warmup 3 " datafusion-cli   -c \"select l_returnflag,l_linestatus, count(*) as count_order from 'lineitem.parquet' group by l_returnflag, l_linestatus;\" "
```

Before (main): 1.393s mean (1.368s min)
```shell
Benchmark 1:  datafusion-cli   -c "select l_returnflag,l_linestatus, count(*) as count_order from 'lineitem.parquet' group by l_returnflag, l_linestatus;"
  Time (mean ± σ):      1.393 s ±  0.020 s    [User: 16.778 s, System: 0.688 s]
  Range (min … max):    1.368 s …  1.438 s    10 runs
```

After (this PR): 1.022s mean (1.005s min)
```shell
Benchmark 1:  ./datafusion-cli-multi-gby-try2   -c "select l_returnflag,l_linestatus, count(*) as count_order from 'lineitem.parquet' group by l_returnflag, l_linestatus;"
  Time (mean ± σ):      1.022 s ±  0.015 s    [User: 11.685 s, System: 0.644 s]
  Range (min … max):    1.005 s …  1.052 s    10 runs
```
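As a quick check of the arithmetic, the sketch below computes the relative improvement from the two hyperfine means reported above (1.393 s before, 1.022 s after). The helper names are made up for this example.

```rust
/// Percentage of wall-clock time saved going from `before` to `after`.
fn percent_reduction(before: f64, after: f64) -> f64 {
    (before - after) / before * 100.0
}

/// Speedup factor: how many times faster `after` is than `before`.
fn speedup(before: f64, after: f64) -> f64 {
    before / after
}

fn main() {
    // Mean times in seconds, taken from the hyperfine output above.
    let (before, after) = (1.393, 1.022);
    println!("time saved: {:.1}%", percent_reduction(before, after));
    println!("speedup:    {:.2}x", speedup(before, after));
}
```

Using the means for both runs gives a reduction of a little over 26%, or a speedup of about 1.36x.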

I have a PR that improves string view hashing performance too, see
- #19374

## Are there any user-facing changes?
Faster performance