[WIP, Please benchmark] Use homogeneous coordinates in pippenger #1767
This adds a new representation `geh` of group elements, namely in homogeneous (also called projective) coordinates. This is supposed to be faster for unmixed addition (i.e., the second summand is not `ge`) in terms of field operations, namely 12M+0S+25etc vs. 12M+4S+11etc for Jacobian coordinates.

The addition and doubling formulas are due to Renes, Costello, and Batina 2016, Algorithms 7 and 9. The formulas are complete, i.e., they have no special cases. However, this implementation still keeps track of infinity in a dedicated boolean flag for performance reasons. Since the buckets in Pippenger's algorithm are initialized with infinity (=zero), we'll have many additions involving infinity, and going through the entire formula for each of those hurts performance (and the entire point of this PR is performance).
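To illustrate the infinity fast path described above, here is a minimal sketch on a toy curve. Everything in it is an assumption made for illustration: the field is F_43 rather than secp256k1's field, the curve is y^2 = x^3 + 7 over that field (group order 31), and `toy_geh`, `toy_geh_add`, and `toy_geh_add_complete` are hypothetical names, not the functions in this PR. The complete addition follows the a = 0 formulas from Renes, Costello, and Batina and uses 12 general multiplications plus two multiplications by the constant 3b.

```c
#include <assert.h>
#include <stdint.h>

/* Toy parameters (NOT secp256k1): y^2 = x^3 + 7 over F_43, group order 31. */
#define P  43u
#define B3 21u /* 3*b with b = 7 */

static uint32_t fe_add(uint32_t a, uint32_t b) { return (a + b) % P; }
static uint32_t fe_sub(uint32_t a, uint32_t b) { return (a + P - b) % P; }
static uint32_t fe_mul(uint32_t a, uint32_t b) { return (a * b) % P; }

/* Hypothetical stand-in for the PR's geh type: (X:Y:Z) plus an infinity flag. */
typedef struct { uint32_t x, y, z; int infinity; } toy_geh;

/* Complete addition for a = 0: no branches, also valid for doubling and infinity. */
static void toy_geh_add_complete(toy_geh *r, const toy_geh *a, const toy_geh *b) {
    uint32_t t0 = fe_mul(a->x, b->x); /* X1*X2 */
    uint32_t t1 = fe_mul(a->y, b->y); /* Y1*Y2 */
    uint32_t t2 = fe_mul(a->z, b->z); /* Z1*Z2 */
    /* Cross terms, one multiplication each: (u1+v1)(u2+v2) - u1*u2 - v1*v2. */
    uint32_t xy = fe_sub(fe_mul(fe_add(a->x, a->y), fe_add(b->x, b->y)), fe_add(t0, t1));
    uint32_t yz = fe_sub(fe_mul(fe_add(a->y, a->z), fe_add(b->y, b->z)), fe_add(t1, t2));
    uint32_t xz = fe_sub(fe_mul(fe_add(a->x, a->z), fe_add(b->x, b->z)), fe_add(t0, t2));
    uint32_t x3  = fe_add(fe_add(t0, t0), t0); /* 3*X1*X2 */
    uint32_t bz  = fe_mul(B3, t2);             /* 3b*Z1*Z2 */
    uint32_t s   = fe_add(t1, bz);             /* Y1*Y2 + 3b*Z1*Z2 */
    uint32_t d   = fe_sub(t1, bz);             /* Y1*Y2 - 3b*Z1*Z2 */
    uint32_t bxz = fe_mul(B3, xz);             /* 3b*(X1*Z2 + X2*Z1) */
    /* All inputs have been read, so r may alias a or b. */
    r->x = fe_sub(fe_mul(xy, d), fe_mul(yz, bxz));
    r->y = fe_add(fe_mul(d, s), fe_mul(bxz, x3));
    r->z = fe_add(fe_mul(s, yz), fe_mul(x3, xy));
    r->infinity = (r->z == 0);
}

/* Same result, but additions involving infinity skip the formula entirely. */
static void toy_geh_add(toy_geh *r, const toy_geh *a, const toy_geh *b) {
    if (a->infinity) { *r = *b; return; }
    if (b->infinity) { *r = *a; return; }
    toy_geh_add_complete(r, a, b);
}
```

With `o = {0, 1, 0, 1}` as infinity and `g = {2, 12, 1, 0}`, `toy_geh_add(&r, &o, &g)` copies `g` without any field multiplication, while `toy_geh_add_complete(&s, &o, &g)` reaches the same projective point only after running the full formula. How the real implementation sets and checks its flag may differ from this sketch.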
The formulas were implemented by giving GPT-5 mini screenshots of the algorithms in the paper and the `field.h`. The result was not awesome but I could clean it up manually.

The new representation is used in Pippenger's `ecmult_multi` for accumulating the buckets after every window iteration. Buckets are still constructed as
`gej` (because it has faster mixed addition) and only converted to `geh` before accumulation. This is still supposed to be faster even if the conversion is accounted for. The conversion costs 2M+1S but we then do two `geh` additions in a row, saving 8S.

This PR has three different variants of how `geh` could be used:

1. [...] `geh`.
2. [...] `geh`.
3. [...] `gej` for rows of doublings.

Unfortunately, none of these turns out to be really faster in
`bench_ecmult pippenger_wnaf` on my x86_64 system with gcc 15.2.1 or clang 21.1.4. The best variant (2) beats master by just 0.21%; the other variants are slower than master. :/ If I compile in 32-bit mode, all three variants beat master consistently, but only by 1.2%. But this latter result gives at least some hope that this PR could pay off on some platform. I'm not even sure how much we care about 32-bit platforms. Maybe we care about hardware wallets in general, but probably not when it comes to `ecmult_multi`. (Plus this would need real benchmarks; I didn't even run this on a native 32-bit CPU.)

But we'd certainly care about ARM64, which I couldn't test on. Anyone with an ARM Mac willing to benchmark this?
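For reference, the 2M+1S figure quoted above for the `gej` to `geh` conversion can be reproduced as follows. This is only a sketch under assumptions: it uses a toy field F_43 instead of secp256k1's field, and `toy_gej`, `toy_geh`, and `toy_geh_set_gej` are hypothetical names, so the PR's actual conversion may be organized differently. A Jacobian point (X, Y, Z) stands for the affine point (X/Z^2, Y/Z^3), a homogeneous point (X', Y', Z') stands for (X'/Z', Y'/Z'), so (X*Z, Y, Z^3) represents the same point.

```c
#include <assert.h>
#include <stdint.h>

#define P 43u /* toy prime, NOT secp256k1's field */

static uint32_t fe_mul(uint32_t a, uint32_t b) { return (a * b) % P; }
static uint32_t fe_sqr(uint32_t a) { return (a * a) % P; }

/* Hypothetical stand-ins for the Jacobian and homogeneous point types. */
typedef struct { uint32_t x, y, z; int infinity; } toy_gej;
typedef struct { uint32_t x, y, z; int infinity; } toy_geh;

/* Jacobian (X, Y, Z) -> homogeneous (X*Z, Y, Z^3): one squaring, two multiplications. */
static void toy_geh_set_gej(toy_geh *r, const toy_gej *a) {
    uint32_t zz = fe_sqr(a->z);  /* 1S: Z^2 */
    r->x = fe_mul(a->x, a->z);   /* 1M: X*Z */
    r->y = a->y;                 /* Y is reused unchanged */
    r->z = fe_mul(zz, a->z);     /* 1M: Z^3 */
    r->infinity = a->infinity;
}
```

For example, the affine point (35, 21) on y^2 = x^3 + 7 over F_43 has the Jacobian representation (15, 2, 5), which this function maps to the homogeneous representation (32, 2, 39). Taking the per-addition counts from the description at face value, two `geh` additions avoid 2 * 4S = 8S compared to two `gej` additions, against the 2M+1S spent here.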
The exact benchmark command was
`SECP256K1_BENCH_ITERS=100000 bench_ecmult pippenger_wnaf` (or `20000` iters for 32-bit). Don't forget the `pippenger_wnaf` argument to make sure you don't benchmark Strauss' algorithm instead, at least below the threshold where we switch to Pippenger automatically. I did this on a 12th Gen Intel(R) Core(TM) i7-1260P, pinned to a P-core, and with TurboBoost disabled. See the attached spreadsheet for details: benchmark-gcc.ods

If you want to benchmark this, I think it makes sense to get four runs per setup: one for the baseline (d0f3123, just disabling low point counts in `bench_ecmult` for quicker benchmarking) and the three "step" commits as mentioned above. You could just extend the spreadsheet with your results.

Also, if you have any ideas on how to improve this further, I'd be happy to hear them. I tried various micro-optimizations, but none of them turned out to be significant on my machine. In fact, most of them made the code slower in practice. In theory, this PR should make it possible to increase the window size a bit, but playing around with the window size didn't make a difference either in practice.
edit: Don't care about CI. It fails on some platforms because I forgot to mark functions `static`. This should compile locally without issues.