@alamb
Contributor

@alamb alamb commented Nov 7, 2025

Which issue does this PR close?

Rationale for this change

While experimenting with other bitwise operations, I became fairly sure that these operations can be optimized further.

Let's try using aligned u64 operations when possible.

What changes are included in this PR?

Special-case the bitwise operations for data that is already aligned to u64 (a reasonably common case).
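
As a rough illustration of the idea (not the code in this PR), the fast path can be sketched with `slice::align_to`. The function name `and_bytes` is hypothetical, and the sketch works on plain byte slices rather than arrow buffers:

```rust
/// Hypothetical helper, not part of arrow: AND two equal-length byte slices,
/// taking a u64 fast path when both are 8-byte aligned with no leftover bytes.
fn and_bytes(left: &[u8], right: &[u8]) -> Vec<u8> {
    assert_eq!(left.len(), right.len());

    // SAFETY: every bit pattern is a valid u64, so viewing u8 data as u64 is sound.
    let (l_prefix, l_u64s, l_suffix) = unsafe { left.align_to::<u64>() };
    let (r_prefix, r_u64s, r_suffix) = unsafe { right.align_to::<u64>() };

    if l_prefix.is_empty()
        && r_prefix.is_empty()
        && l_suffix.is_empty()
        && r_suffix.is_empty()
    {
        // Fast path: one AND per 64-bit word, a loop the compiler can vectorize.
        return l_u64s
            .iter()
            .zip(r_u64s)
            .flat_map(|(l, r)| (l & r).to_ne_bytes())
            .collect();
    }

    // Fallback: byte-at-a-time, used whenever either input is unaligned
    // or its length is not a multiple of 8 bytes.
    left.iter().zip(right).map(|(l, r)| l & r).collect()
}
```

Both paths return the same bytes, so callers cannot observe which one ran; only the speed differs.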

Are these changes tested?

Yes, by CI.

Are there any user-facing changes?

No, just faster performance.

@github-actions github-actions bot added the arrow Changes to the arrow crate label Nov 7, 2025
@alamb
Contributor Author

alamb commented Nov 7, 2025

🤖 ./gh_compare_arrow.sh Benchmark Script Running
Linux aal-dev 6.14.0-1018-gcp #19~24.04.1-Ubuntu SMP Wed Sep 24 23:23:09 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
Comparing alamb/test_offset_zero (aff95f2) to d379b98 diff
BENCH_NAME=boolean_kernels
BENCH_COMMAND=cargo bench --features=arrow,async,test_common,experimental --bench boolean_kernels
BENCH_FILTER=
BENCH_BRANCH_NAME=alamb_test_offset_zero
Results will be posted here when complete

&& left_suffix.is_empty()
&& right_suffix.is_empty()
{
let result_u64s = left_u64s
Contributor Author

I am pretty excited to see how much this actually helps with performance. This code should vectorize pretty spectacularly

Contributor Author

TLDR: 30-50% faster 😎

@alamb
Contributor Author

alamb commented Nov 7, 2025

🤖: Benchmark completed

Details

group         alamb_test_offset_zero                 main
-----         ----------------------                 ----
and           1.00    208.2±0.55ns        ? ?/sec    1.34    279.5±1.21ns        ? ?/sec
and_sliced    1.00   1227.4±1.83ns        ? ?/sec    1.00   1227.3±4.50ns        ? ?/sec
not           1.00    143.0±0.21ns        ? ?/sec    1.50    215.0±0.41ns        ? ?/sec
not_sliced    1.05    732.9±0.85ns        ? ?/sec    1.00    698.3±1.12ns        ? ?/sec
or            1.00    199.0±0.43ns        ? ?/sec    1.26    250.6±0.54ns        ? ?/sec
or_sliced     1.00   1095.3±1.49ns        ? ?/sec    1.00  1099.6±15.01ns        ? ?/sec

@alamb
Contributor Author

alamb commented Nov 7, 2025

🤖 ./gh_compare_arrow.sh Benchmark Script Running
Linux aal-dev 6.14.0-1018-gcp #19~24.04.1-Ubuntu SMP Wed Sep 24 23:23:09 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
Comparing alamb/test_offset_zero (aff95f2) to d379b98 diff
BENCH_NAME=buffer_bit_ops
BENCH_COMMAND=cargo bench --features=arrow,async,test_common,experimental --bench buffer_bit_ops
BENCH_FILTER=
BENCH_BRANCH_NAME=alamb_test_offset_zero
Results will be posted here when complete

@alamb alamb mentioned this pull request Nov 7, 2025
@alamb
Contributor Author

alamb commented Nov 7, 2025

🤖: Benchmark completed

30% - 50% faster 😎

@alamb
Contributor Author

alamb commented Nov 7, 2025

🤖: Benchmark completed

Details

group                                alamb_test_offset_zero                 main
-----                                ----------------------                 ----
buffer_binary_ops/and                1.00    211.4±0.31ns    67.7 GB/sec    1.22    257.4±0.32ns    55.6 GB/sec
buffer_binary_ops/and_with_offset    1.13   1488.6±3.34ns     9.6 GB/sec    1.00   1319.1±1.74ns    10.8 GB/sec
buffer_binary_ops/or                 1.00    208.0±0.71ns    68.8 GB/sec    1.23    255.8±0.38ns    55.9 GB/sec
buffer_binary_ops/or_with_offset     1.00   1351.8±2.57ns    10.6 GB/sec    1.10   1482.3±3.39ns     9.7 GB/sec
buffer_unary_ops/not                 1.00    204.4±1.50ns    46.7 GB/sec    1.08    221.6±0.52ns    43.0 GB/sec
buffer_unary_ops/not_with_offset     1.00    908.4±1.14ns    10.5 GB/sec    1.27  1157.3±11.06ns     8.2 GB/sec

@alamb
Contributor Author

alamb commented Nov 7, 2025

🤖 ./gh_compare_arrow.sh Benchmark Script Running
Linux aal-dev 6.14.0-1018-gcp #19~24.04.1-Ubuntu SMP Wed Sep 24 23:23:09 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
Comparing alamb/test_offset_zero (aff95f2) to d379b98 diff
BENCH_NAME=boolean_kernels
BENCH_COMMAND=cargo bench --features=arrow,async,test_common,experimental --bench boolean_kernels
BENCH_FILTER=
BENCH_BRANCH_NAME=alamb_test_offset_zero
Results will be posted here when complete

@alamb
Contributor Author

alamb commented Nov 7, 2025

🤖: Benchmark completed

Details

group         alamb_test_offset_zero                 main
-----         ----------------------                 ----
and           1.00    208.0±0.56ns        ? ?/sec    1.33    275.8±0.79ns        ? ?/sec
and_sliced    1.00   1228.8±3.74ns        ? ?/sec    1.00   1226.5±1.16ns        ? ?/sec
not           1.00    143.0±0.36ns        ? ?/sec    1.52    216.9±1.43ns        ? ?/sec
not_sliced    1.05    736.1±1.29ns        ? ?/sec    1.00    698.2±1.57ns        ? ?/sec
or            1.00    198.6±0.31ns        ? ?/sec    1.26    251.0±0.41ns        ? ?/sec
or_sliced     1.00   1095.7±2.09ns        ? ?/sec    1.00   1100.0±1.93ns        ? ?/sec

@alamb
Contributor Author

alamb commented Nov 7, 2025

🤖 ./gh_compare_arrow.sh Benchmark Script Running
Linux aal-dev 6.14.0-1018-gcp #19~24.04.1-Ubuntu SMP Wed Sep 24 23:23:09 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
Comparing alamb/test_offset_zero (aff95f2) to d379b98 diff
BENCH_NAME=buffer_bit_ops
BENCH_COMMAND=cargo bench --features=arrow,async,test_common,experimental --bench buffer_bit_ops
BENCH_FILTER=
BENCH_BRANCH_NAME=alamb_test_offset_zero
Results will be posted here when complete

@alamb
Contributor Author

alamb commented Nov 7, 2025

🤖: Benchmark completed

Details

group                                alamb_test_offset_zero                 main
-----                                ----------------------                 ----
buffer_binary_ops/and                1.00    211.4±0.35ns    67.7 GB/sec    1.22    257.6±0.87ns    55.5 GB/sec
buffer_binary_ops/and_with_offset    1.13   1488.8±3.99ns     9.6 GB/sec    1.00   1320.2±3.26ns    10.8 GB/sec
buffer_binary_ops/or                 1.00    208.1±0.45ns    68.7 GB/sec    1.23    255.7±0.53ns    55.9 GB/sec
buffer_binary_ops/or_with_offset     1.00   1351.5±3.51ns    10.6 GB/sec    1.10   1482.3±2.84ns     9.7 GB/sec
buffer_unary_ops/not                 1.00    204.2±1.34ns    46.7 GB/sec    1.09    222.1±0.56ns    42.9 GB/sec
buffer_unary_ops/not_with_offset     1.00    909.1±1.19ns    10.5 GB/sec    1.27   1155.6±1.45ns     8.3 GB/sec

@alamb
Contributor Author

alamb commented Nov 7, 2025

It is strange that buffer_binary_ops/and_with_offset is reported to be slower but and_sliced is not, even though both call the same implementation. Perhaps the overhead of checking alignment dominates the actual call 🤔

@alamb
Contributor Author

alamb commented Nov 7, 2025

Since I think no one really calls the bitwise kernels directly, I would be inclined to consolidate the benchmarks as part of #8806

#[allow(clippy::cast_ptr_alignment)]
let raw_data = self.buffer.as_ptr() as *const u64;

// bit-packed buffers are stored starting with the least-significant byte first
Member

@rluvaton rluvaton Nov 9, 2025

Wouldn't your change to use align_to stop working on some endiannesses? (to_le)

Contributor Author

As currently written, I think this PR will work on both big- and little-endian targets, because it only applies to buffers that are perfectly aligned and whose length is an exact multiple of 64 bits (that is, no byte-wise operations occur).

I think endianness will come into play if we try to extend this technique to data that is not an exact multiple of 64 bits (the handling at the ends of the loop would likely differ).
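
A minimal sketch of why a purely bitwise operation does not depend on byte order: loading the same eight bytes as a little-endian or big-endian word, ANDing, and storing back with the same byte order yields the bytes that a byte-at-a-time AND produces. The function names are hypothetical and serve only to simulate both byte orders on one machine:

```rust
/// AND two 8-byte chunks by treating each as a little-endian u64.
fn and_as_le(l: [u8; 8], r: [u8; 8]) -> [u8; 8] {
    (u64::from_le_bytes(l) & u64::from_le_bytes(r)).to_le_bytes()
}

/// AND two 8-byte chunks by treating each as a big-endian u64.
fn and_as_be(l: [u8; 8], r: [u8; 8]) -> [u8; 8] {
    (u64::from_be_bytes(l) & u64::from_be_bytes(r)).to_be_bytes()
}

/// Reference result: AND the chunks one byte at a time.
fn and_bytewise(l: [u8; 8], r: [u8; 8]) -> [u8; 8] {
    std::array::from_fn(|i| l[i] & r[i])
}
```

For any inputs, `and_as_le`, `and_as_be`, and `and_bytewise` return identical arrays, because AND combines each bit only with the bit at the same position and the load and store use the same byte order.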

Member

But the user expects the bits to be in a specific order when the callback is called, and this violates that, no?

Contributor Author

I don't think there is any problem, but I clearly don't fully understand the concern.

Member

@rluvaton rluvaton Nov 10, 2025

consider the following (inefficient) code:

let mut index = 0;
bitwise_bin_op_helper(
    ... some args that are u64 aligned
    |left, right| {
        for each bit in left and right {
            some_other_array[index] = left_bit & right_bit;
            index += 1;
        }
        return left & right
    }
)

Before and after your change, this code will see the bits in a different order on certain endiannesses.

Contributor Author

Ah, I see

Yes, I agree that an operation that moves bits around within the u64 word could give different results on different endiannesses.

I think that is the case for the current bitwise operations too (they are supposed to be bitwise, not bit shuffling) 🤔

I will try and make a PR to update the docs to make this clearer
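
To make the concern concrete, here is a small sketch of a position-sensitive operation, under the assumption that the callback inspects bit positions inside the word. The helper names are hypothetical and not part of arrow; `from_le_bytes` and `from_be_bytes` stand in for how the same eight bytes would load on a little-endian and a big-endian target:

```rust
/// Position of the lowest set bit when the bytes are read as a little-endian word.
fn first_set_bit_le(bytes: [u8; 8]) -> u32 {
    u64::from_le_bytes(bytes).trailing_zeros()
}

/// Position of the lowest set bit when the same bytes are read as a big-endian word.
fn first_set_bit_be(bytes: [u8; 8]) -> u32 {
    u64::from_be_bytes(bytes).trailing_zeros()
}
```

For the bytes `[0x01, 0, 0, 0, 0, 0, 0, 0]`, `first_set_bit_le` returns 0 and `first_set_bit_be` returns 56: the closure sees the same bits, but at different positions within the word. Operations such as AND, OR, and NOT are unaffected, while anything that depends on bit position is not.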

@alamb alamb added the enhancement (Any new improvement worthy of a entry in the changelog) and performance labels Nov 9, 2025