avx512f: implement slide_left and slide_right. #1148

Merged
1 commit merged on Jul 24, 2025

Conversation

degasus (Contributor) commented Jul 22, 2025

With a fast path for N = 4*i and a split version otherwise (inspired by avx512bw).

As vpermd actually has lower latency than vpermw / vpermb on some Intel CPUs, also use it instead of the avx512bw and avx512vbmi implementations where trivially possible.
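
For illustration, a minimal sketch of the whole-lane fast path, assuming it boils down to a single zero-masked vpermd (the helper name and constants are illustrative, not the PR's actual code):

```cpp
#include <immintrin.h>

// Sketch: slide a 512-bit register left by LANES whole 32-bit lanes
// (i.e. N = 4 * LANES bytes) with one masked vpermd, zeroing the low lanes.
template <int LANES>
__m512i slide_left_whole_lanes(__m512i x) {
  const __m512i idx = _mm512_sub_epi32(
      _mm512_set_epi32(15, 14, 13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 0),
      _mm512_set1_epi32(LANES));                        // lane j reads lane j - LANES
  const __mmask16 keep = (__mmask16)(0xFFFFu << LANES); // the low LANES lanes become 0
  return _mm512_maskz_permutexvar_epi32(keep, idx, x);  // vpermd with zeroing mask
}
```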

Also reverse the order of the slow path for lower latency. It used to be:

```
SLR -> PERM ->
               OR -> PERM
SLL         ->
```

Now the latency is reduced to:

```
SLR -> PERM ->
               OR
SLL -> PERM ->
```

And it should now generate even better code on avx512bw for N=63, with only one PERM (as already done for N=1).
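
As a rough sketch of what the reordered split path could look like for N not a multiple of 4 (an assumed reconstruction, not the PR's code): each 32-bit lane is shifted left and right to separate the bytes that stay in their destination lane from those that spill into the next one, both halves are permuted independently, and a single OR joins the two chains.

```cpp
#include <immintrin.h>

// Sketch: slide_left<N> for N % 4 != 0. Both vpermd operations are
// independent, so only the final OR sits on both dependency chains.
template <size_t N>
__m512i slide_left_split(__m512i x) {
  constexpr int bits = (N % 4) * 8;
  const __m512i lanes = _mm512_set_epi32(15, 14, 13, 12, 11, 10, 9, 8,
                                         7, 6, 5, 4, 3, 2, 1, 0);
  // SLL part: bytes that stay inside their destination 32-bit lane
  __m512i lo = _mm512_slli_epi32(x, bits);
  lo = _mm512_maskz_permutexvar_epi32(
      (__mmask16)(0xFFFFu << (N / 4)),
      _mm512_sub_epi32(lanes, _mm512_set1_epi32(N / 4)), lo);
  // SLR part: bytes that spill over into the next 32-bit lane
  __m512i hi = _mm512_srli_epi32(x, 32 - bits);
  hi = _mm512_maskz_permutexvar_epi32(
      (__mmask16)(0xFFFFu << (N / 4 + 1)),
      _mm512_sub_epi32(lanes, _mm512_set1_epi32(N / 4 + 1)), hi);
  return _mm512_or_si512(lo, hi);  // single OR joins the two chains
}
```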

For N=16,32,48, vshufi32x4 is preferred over vpermd for lower latency on Zen 4 and decreased register usage.
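
For example, the N = 16 case is exactly one 128-bit lane, so a single masked vshufi32x4 can do it (a sketch under that assumption, not the actual xsimd code):

```cpp
#include <immintrin.h>

// Sketch: slide_left<16> as one masked vshufi32x4.
// Destination 128-bit lanes become [0, x0, x1, x2]: imm 0x90 selects source
// lanes {-, 0, 1, 2} and the 0xFFF0 mask zeroes the lowest 128-bit lane.
inline __m512i slide_left_16_bytes(__m512i x) {
  return _mm512_maskz_shuffle_i32x4((__mmask16)0xFFF0, x, x, 0x90);
}
```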

degasus (Contributor, Author) commented Jul 22, 2025

FYI, the reason I'm interested in these slide_left operations is that my code requires a cumsum:

```cpp
// slide_left<N> needs a compile-time N, so unroll the log2(size) steps via recursion
template <size_t Step = 1>
batch cumsum(batch x) {
  if constexpr (Step < batch::size) {
    x += xsimd::slide_left<Step * sizeof(batch::value_type)>(x);  // add lanes Step elements below
    return cumsum<Step * 2>(x);
  }
  return x;
}
```
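
A quick usage sketch, assuming the cumsum helper above and a 512-bit int32 batch (the alias and expected output are illustrative):

```cpp
#include <xsimd/xsimd.hpp>
#include <cstdint>
#include <cstdio>

using batch = xsimd::batch<int32_t, xsimd::avx512f>;  // 16 x int32 lanes

int main() {
  batch ones(1);                 // every lane set to 1
  batch prefix = cumsum(ones);   // expected lanes: 1, 2, 3, ..., 16
  alignas(64) int32_t out[batch::size];
  prefix.store_aligned(out);
  for (std::size_t i = 0; i < batch::size; ++i)
    std::printf("%d ", out[i]);
  std::printf("\n");
}
```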

Is this a method you're interested in as part of xsimd's public API?

serge-sans-paille (Contributor) commented Jul 24, 2025

cumsum... isn't that https://xsimd.readthedocs.io/en/latest/api/reducer_index.html#_CPPv4I00E10reduce_add1TRK5batchI1T1AE ?
EDIT: it's not :-/ Yeah, of course, add it as a generic operation!

serge-sans-paille merged commit 4b8842c into xtensor-stack:master on Jul 24, 2025
63 checks passed