avx512f: implement slide_left and slide_right. #1148

Merged
1 commit merged on Jul 24, 2025

Conversation

degasus (Contributor) commented Jul 22, 2025

With a fast path for N = 4*i and a split version otherwise (inspired by avx512bw).

As vpermd actually has lower latency than vpermw / vpermb on some Intel CPUs, also use it instead of the avx512bw and avx512vbmi implementations where trivially possible.
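
For illustration, a minimal sketch of the whole-lane fast path, assuming it boils down to a single zero-masked vpermd (the helper name and constants are illustrative, not the PR's actual code):

```cpp
#include <immintrin.h>

// Sketch: slide a 512-bit register left by LANES whole 32-bit lanes
// (i.e. N = 4 * LANES bytes) with one masked vpermd, zeroing the low lanes.
template <int LANES>
__m512i slide_left_whole_lanes(__m512i x) {
  const __m512i idx = _mm512_sub_epi32(
      _mm512_set_epi32(15, 14, 13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 0),
      _mm512_set1_epi32(LANES));                        // lane j reads lane j - LANES
  const __mmask16 keep = (__mmask16)(0xFFFFu << LANES); // the low LANES lanes become 0
  return _mm512_maskz_permutexvar_epi32(keep, idx, x);  // vpermd with zeroing mask
}
```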

Also reverse the order of the slow path for lower latency. It used to be:

```
SLR -> PERM ->
               OR -> PERM
SLL         ->
```

Now the latency is reduced to:

```
SLR -> PERM ->
               OR
SLL -> PERM ->
```

And it should now generate even better code on avx512bw for N=63, with only one PERM (as already done for N=1).
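
As a rough sketch of what the reordered split path could look like for N not a multiple of 4 (an assumed reconstruction, not the PR's code): each 32-bit lane is shifted left and right to separate the bytes that stay in their destination lane from those that spill into the next one, both halves are permuted independently, and a single OR joins the two chains.

```cpp
#include <immintrin.h>

// Sketch: slide_left<N> for N % 4 != 0. Both vpermd operations are
// independent, so only the final OR sits on both dependency chains.
template <size_t N>
__m512i slide_left_split(__m512i x) {
  constexpr int bits = (N % 4) * 8;
  const __m512i lanes = _mm512_set_epi32(15, 14, 13, 12, 11, 10, 9, 8,
                                         7, 6, 5, 4, 3, 2, 1, 0);
  // SLL part: bytes that stay inside their destination 32-bit lane
  __m512i lo = _mm512_slli_epi32(x, bits);
  lo = _mm512_maskz_permutexvar_epi32(
      (__mmask16)(0xFFFFu << (N / 4)),
      _mm512_sub_epi32(lanes, _mm512_set1_epi32(N / 4)), lo);
  // SLR part: bytes that spill over into the next 32-bit lane
  __m512i hi = _mm512_srli_epi32(x, 32 - bits);
  hi = _mm512_maskz_permutexvar_epi32(
      (__mmask16)(0xFFFFu << (N / 4 + 1)),
      _mm512_sub_epi32(lanes, _mm512_set1_epi32(N / 4 + 1)), hi);
  return _mm512_or_si512(lo, hi);  // single OR joins the two chains
}
```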

For N=16,32,48, vshufi32x4 is preferred over vpermd for lower latency on Zen 4 and decreased register usage.
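
For example, the N = 16 case is exactly one 128-bit lane, so a single masked vshufi32x4 can do it (a sketch under that assumption, not the actual xsimd code):

```cpp
#include <immintrin.h>

// Sketch: slide_left<16> as one masked vshufi32x4.
// Destination 128-bit lanes become [0, x0, x1, x2]: imm 0x90 selects source
// lanes {-, 0, 1, 2} and the 0xFFF0 mask zeroes the lowest 128-bit lane.
inline __m512i slide_left_16_bytes(__m512i x) {
  return _mm512_maskz_shuffle_i32x4((__mmask16)0xFFF0, x, x, 0x90);
}
```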

degasus (Contributor, Author) commented Jul 22, 2025

FYI, the reason I'm interested in these slide_left operations is that my code requires a cumsum:

```cpp
// slide_left<N> needs a compile-time N, so unroll the log2(size) steps via recursion
template <size_t Step = 1>
batch cumsum(batch x) {
  if constexpr (Step < batch::size) {
    x += xsimd::slide_left<Step * sizeof(batch::value_type)>(x);  // add lanes Step elements below
    return cumsum<Step * 2>(x);
  }
  return x;
}
```
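
A quick usage sketch, assuming the cumsum helper above and a 512-bit int32 batch (the alias and expected output are illustrative):

```cpp
#include <xsimd/xsimd.hpp>
#include <cstdint>
#include <cstdio>

using batch = xsimd::batch<int32_t, xsimd::avx512f>;  // 16 x int32 lanes

int main() {
  batch ones(1);                 // every lane set to 1
  batch prefix = cumsum(ones);   // expected lanes: 1, 2, 3, ..., 16
  alignas(64) int32_t out[batch::size];
  prefix.store_aligned(out);
  for (std::size_t i = 0; i < batch::size; ++i)
    std::printf("%d ", out[i]);
  std::printf("\n");
}
```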

Is this a method you're interested in as part of xsimd's public API?

serge-sans-paille (Contributor) commented Jul 24, 2025

cumsum... isn't that https://xsimd.readthedocs.io/en/latest/api/reducer_index.html#_CPPv4I00E10reduce_add1TRK5batchI1T1AE ?
EDIT: it's not :-/ Yeah, of course, add it as a generic operation!

serge-sans-paille merged commit 4b8842c into xtensor-stack:master on Jul 24, 2025
63 checks passed