Align vectors on disk for optimal performance on ARM CPUs #15341
This PR does not have an entry in lucene/CHANGES.txt. Consider adding one. If the PR doesn't need a changelog entry, then add the skip-changelog label to it and you will stop receiving this reminder on future updates to the PR.
I wrote a small JMH benchmark to "pad" float vectors on disk with some
My machine uses the 256-bit variant of Panama to score vectors, so I saw optimal performance when floats are aligned to 32 bytes, but I'm keeping it at 64 here to cover the max case.
cc @mikemccand who found this^ byte-misalignment possibility offline!
Also noting that for byte vectors, I saw no impact of padding:
..so I'm not changing their alignment in this PR.
Wow, alignment still matters, and it matters a lot (24 -> 33 ops/us)! Thank you @kaivalnp for testing. Was this an It's frustrating how the CPU just silently runs slower ... but what else could it do. I wonder whether modern x86-64 (Intel, AMD) CPUs also show this effect. I'll test this PR on nightly Lucene benchy box ( |
Yes, it was a Graviton3 (
I tested on
I applied this PR, built (
Net/net it seems like alignment of the mapped in-RAM (virtual address space) doesn't matter? I also tested a newer CPU (Raptor Lake) -- I'll post that shortly.
Raptor Lake box is i9-13900K:
Results:
There might be some small misalignment penalty for float SIMD?
from the cpu's optimization guide: |
Thanks @mikemccand, there doesn't seem to be any performance penalty on "beast3 (nightly benchmarking box) -- a Ryzen Threadripper 3990X". There's definitely some impact of alignment on "Raptor Lake box is i9-13900K", but this is lower than my machine (<10%) -- so this alignment issue is mostly on Graviton, or ARM CPUs in general, as @rmuir shared? I tried running
This PR (64-byte-alignment)
Indexing was sped up by ~7.6%, while search was sped up by ~3.8%. I see another action item from this benchmark: I wasn't aligning the output inside this merge function, which is used by HNSW-based vector formats for merging (see that
Thanks @rmuir. How does the Panama Vector API handle alignment? Does it have methods to allocate aligned on-heap or off-heap vectors? Hmm it looks like |
Oh good catch! I wonder what other places might write the flat vectors? Is the alignment also (or maybe less) important for the quantized cases? (Your results above are for Maybe at least |
This reverts commit 5764ac8.
Hmm this did not help for some reason (merge time increased)..
This PR (64-byte-alignment)
I'll still add a commit + revert, so people can see what I tried, and comment if I'm missing something!
I think alignment is less important for quantized vectors (which are stored as byte vectors on disk) -- because none of the JMH benchmarks show non-trivial variation with padding? (see
Yeah, those benchmarks^ are for float vectors
I added some print statements to complain if non-64-byte-aligned addresses were used ( Not committing because it may not be needed after this PR? |
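For readers following along, a check of the kind described above (complain when an address is not 64-byte aligned) could look roughly like the sketch below. This is not the code from the comment, which is not shown here; the class name `AlignmentCheckSketch` and both helper methods are hypothetical, and the address is passed in as a plain `long` rather than read from a real memory segment.

```java
public final class AlignmentCheckSketch {

  /** Returns how many bytes past the previous aligned boundary the address sits. */
  static long misalignment(long address, int alignmentBytes) {
    // Assumes alignmentBytes is a power of two, so masking gives address % alignmentBytes.
    return address & (alignmentBytes - 1L);
  }

  /** True if the address is a multiple of alignmentBytes (a power of two). */
  static boolean isAligned(long address, int alignmentBytes) {
    return misalignment(address, alignmentBytes) == 0;
  }

  public static void main(String[] args) {
    // Hypothetical addresses, chosen only to exercise both branches.
    long[] addresses = {128L, 132L, 4096L};
    for (long address : addresses) {
      if (!isAligned(address, 64)) {
        System.out.println(
            "address " + address + " is off by " + misalignment(address, 64) + " bytes");
      }
    }
  }
}
```

With the three sample addresses, only 132 triggers a message, because it sits 4 bytes past the 128 boundary.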
Description
Today, float vectors are aligned to 4 bytes in a Lucene index, but with Panama we can work with up to 512 bits (== 64 bytes, or 16 floats) at the same time.
I wonder if we should change this alignment to 64 bytes, in order to get optimal vector search performance with Panama?
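To make the arithmetic concrete, here is a minimal sketch of rounding a file offset up to a 64-byte boundary, plus the bits-to-floats conversion mentioned above. It is only an illustration under assumptions: the class `AlignmentSketch` and its methods are hypothetical names, not the code in this PR, and the sample offset of 100 is made up.

```java
public final class AlignmentSketch {

  /** Rounds offset up to the next multiple of alignmentBytes (must be a power of two). */
  static long alignOffset(long offset, int alignmentBytes) {
    if (offset < 0) {
      throw new IllegalArgumentException("offset must be non-negative: " + offset);
    }
    if (alignmentBytes <= 0 || Integer.bitCount(alignmentBytes) != 1) {
      throw new IllegalArgumentException("alignment must be a power of two: " + alignmentBytes);
    }
    // Add (alignment - 1), then clear the low bits to land on a boundary.
    return (offset + alignmentBytes - 1L) & -(long) alignmentBytes;
  }

  /** Number of padding bytes needed to reach the next aligned offset. */
  static long paddingBytes(long offset, int alignmentBytes) {
    return alignOffset(offset, alignmentBytes) - offset;
  }

  /** Number of 32-bit floats that fit in a SIMD register of the given width. */
  static int floatsPerRegister(int simdBits) {
    return simdBits / Float.SIZE;
  }

  public static void main(String[] args) {
    long offset = 100; // made-up position where vector data would otherwise start
    System.out.println("aligned offset: " + alignOffset(offset, 64));
    System.out.println("padding bytes:  " + paddingBytes(offset, 64));
    System.out.println("floats in 512 bits: " + floatsPerRegister(512));
    System.out.println("bytes in 512 bits:  " + 512 / Byte.SIZE);
  }
}
```

The cost of this approach is at most `alignmentBytes - 1` padding bytes before the vector data begins. Note that each vector only stays on a 64-byte boundary if its byte length is also a multiple of 64, that is, if its dimension is a multiple of 16 floats.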