-
Notifications
You must be signed in to change notification settings - Fork 1k
WIP: special case bitwise ops when buffers are u64 aligned #8807
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
|
🤖 |
| && left_suffix.is_empty() | ||
| && right_suffix.is_empty() | ||
| { | ||
| let result_u64s = left_u64s |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am pretty excited to see how much this actually helps with performance. This code should vectorize pretty spectacularly
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TLDR: 30-50% faster 😎
|
🤖: Benchmark completed Details
|
|
🤖 |
30% - 50% faster 😎 |
|
🤖: Benchmark completed Details
|
|
🤖 |
|
🤖: Benchmark completed Details
|
|
🤖 |
|
🤖: Benchmark completed Details
|
|
It is strange that |
|
Since I think no one really calls the bitwise kernels directly, I would be inclined to consolidate the benchmarks as part of #8806 |
| #[allow(clippy::cast_ptr_alignment)] | ||
| let raw_data = self.buffer.as_ptr() as *const u64; | ||
|
|
||
| // bit-packed buffers are stored starting with the least-significant byte first |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
wouldn't your change of using align_to would not work on all bit alignments? (to_le)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As currently written I think this PR will work on both big and little endian -- as it only applies to buffers which are perfectly aligned and a multiple of a 64-bit boundary (aka there are no byte-wise operations occuring)
I think the endianess will come into play if we try to expand this technique to work data that is not an exact multiple of 64bits (aka the loop ends would likely be different)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
but the user expect it to be at specific order when the callback is called and this violate that, no?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think there is any problem. But I clearly don't really understand the concern
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
consider the following (inefficient) code:
let mut index = 0;
bitwise_bin_op_helper(
... some args that are u64 aligned
|left, right| {
for each bit in left and right {
some_other_array[index] = left_bit & right_bit;
index += 1;
}
return left & right
}
)before and after your change will give different order of bits on certain endians
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, I see
yes I agree that some operation that moves the bits around in the u64 word could give different results on different endianesses
I think that is the case for the current bitwise operations too (they are supposed to be bitwise, not bit shuffling) 🤔
I will try and make a PR to update the docs to make this clearer
Which issue does this PR close?
apply_unary_opandapply_binary_opbitwise operations #8619Rationale for this change
While messing around with other bitwise operations, I am pretty sure we can optimize these operations more
Let's try using aligned u64 operations when possible
What changes are included in this PR?
Special case bitwise operations when the data is already aligned to u64 (a reasonably common special case)
Are these changes tested?
Yes by CI
Are there any user-facing changes?
No just faster performance