Skip to content

Conversation

silvanshade
Copy link
Contributor

@silvanshade silvanshade commented Jul 23, 2025

This PR adds AES support for RV64 scalar and RVV.

Some caveats:

The implementation chooses par block of 8 for scalar and 64 for RVV. I don't have any capable hardware to test this with and am not sure what the ideal configuration should be. The RVV code uses VLA style so the par block size can be easily adapted to anything. In the long run it would probably be better to provide an API that can take advantage of this directly.

If RVV is selected for the backend and scalar crypto is not also supported (no idea how common this is in practice), then AES-192 support is disabled. This is because RVV crypto does not provide key expansion instructions for AES-192, only for AES-128 and AES-256. If scalar crypto is available, we can fall back to the scalar just for key expansion for AES-192.

It would be possible to implement something special for that case instead of completely disabling it if someone wants to take that on. Maybe the fixslice implementation can be adapted somehow.

Some remaining work:

  • encode tail blocks with VLA
  • re-implement CI tests

@newpavlov
Copy link
Member

newpavlov commented Jul 23, 2025

Thank you! I would really like for this code to be tested somehow. Do you know what status of RVV support in QEMU? IIRC the scalar crypto extension should be properly supported by it.

Also note that RISC-V has an annoyingly horrible (IMO) handling of misaligned loads, in sha2 I had to work around this using this hack. I am not familiar with RVV enough, but I heard it also has some issues with processing misaligned data.

This is because RVV crypto does not provide key expansion instructions for AES-192, only for AES-128 and AES-256.

I think you should fall back to the soft key expansion here, we certainly should not remove Aes192 from public exports.

The implementation chooses par block of 8 for scalar and 64 for RVV

Do you know how much blocks is processed in parallel by existing hardware with support of the vector crypto extension? Have you inspected the resulting assembly for the scalar crypto code? It would be nice to have godbolt links for these. I would expect for this number to vary between AES variants and RV32/64.

@newpavlov
Copy link
Member

newpavlov commented Jul 23, 2025

It also would be better to create a separate PR for the scalar crypto support. We probably will be able to merge it faster.

@silvanshade
Copy link
Contributor Author

silvanshade commented Jul 23, 2025

Thank you! I would really like for this code to be tested somehow. Do you know what status of RVV support in QEMU? IIRC the scalar crypto extension should be properly supported by it.

Yes. I have tested it in QEMU (both scalar and vector) and it does pass all the tests. The main issue is I don't have a point of reference for how much faster it is than software since I don't have any RISC-V hardware with crypto extensions.

You should be able to run it with the following .cargo/config.toml:

# rv64 only
[target.riscv64gc-unknown-linux-gnu]
runner = "qemu-riscv64-static -cpu rv64,zkne=true,zknd=true"
linker = "riscv64-linux-gnu-gcc"
rustflags = [
    "-Ctarget-feature=+zkne,+zknd"
]
# rv64 + rvv
[target.riscv64gc-unknown-linux-gnu]
runner = "qemu-riscv64-static -cpu rv64,v=true,vext_spec=v1.0,zkne=true,zknd=true,zvkned=true"
linker = "riscv64-linux-gnu-gcc"
rustflags = [
    "-Ctarget-feature=+v,+zkne,+zknd,+zvkned"
]

Then just cargo test --target riscv64gc-unknown-linux-gnu.

You may also need to set QEMU_LD_PREFIX or append, e.g., -L /usr/riscv64-linux-gnu, to the QEMU args depending on the system. Arch Linux QEMU needs this but IIRC Debian/Ubuntu does not. And you may also need to have the cross-toolchain installed for the linker and sysroot to actually be available, e.g., riscv64-linux-gnu-gcc on Arch.

Also note that RISC-V has an annoyingly horrible (IMO) handling of misaligned loads, in sha2 I had to work around this using this hack. I am not familiar with RVV enough, but I heard it also has some issues with processing misaligned data.

I know that misaligned loads can be an issue on RISC-V but haven't actually had to deal with that directly much in practice.

I haven't done anything to specifically handle that in this implementation but it is something to consider. I did see your post on reddit and the riscv-isa-manual thread about poor performance of misaligned loads.

I would expect the situation is probably indeed similar for RVV.

It may be that we would need to implement similar workarounds here.

Do you know how much blocks is processed in parallel by existing hardware with support of the vector crypto extension?

No. I need to look into this again. I think the last time I checked there wasn't much information available because hardware implementations were rare.

Have you inspected the resulting assembly for the scalar crypto code? It would be nice to have godbolt links for these.

I haven't looked at the generated code in detail yet. Anything specific you are looking for in that, like performance around misaligned loads or something else?

For godbolt, I think I'd have to factor out some part of the algorithms and make it self-contained to be able to compile on godbolt. Is that what you are suggesting?

It also would be better to create a separate PR for the scalar crypto support. We probably will be able to merge it faster.

Okay. I can do that.

@silvanshade
Copy link
Contributor Author

I think you should fall back to the soft key expansion here, we certainly should not remove Aes192 from public exports.

Are you suggesting to use the fixslice software implementation for key expansion? The reason I didn't use that already is because I wasn't sure how to convert from the fixslice key-expansion representation back to the standard representation.

If that's what you're suggesting, can you give me an example of how to do that?

If you're instead suggesting a new software implementation, I didn't do that because I assumed we would want a side-channel resistant implementation (like the fixslice) and wasn't sure what would be a good approach for that here.

@newpavlov
Copy link
Member

I meant that we should test the new backends as part of our CI, you can see the sha2 workflow for reference.

For godbolt, I think I'd have to factor out some part of the algorithms and make it self-contained to be able to compile on godbolt. Is that what you are suggesting?

Yeah, just factor out encrypt/decrypt functions like I did in this PR. It would allow us to see any potential problems (such as handling of misaligned pointers) and to see how number of blocks changes the generated code.

Are you suggesting to use the fixslice software implementation for key expansion?

As a starting point we could use the bitsliced implementation of Aes192 as-is without modifying it.

@silvanshade
Copy link
Contributor Author

silvanshade commented Jul 23, 2025

Also note that RISC-V has an annoyingly horrible (IMO) handling of misaligned loads, in sha2 I had to work around this using this hack. I am not familiar with RVV enough, but I heard it also has some issues with processing misaligned data.

I was thinking about alignment more, trying to recall my stance on the issue when I first implemented these algorithms.

I think in my earliest implementation, I did actually perform some sort of alignment correction like what you refer to. But ultimately I decided not to include that, for a few reasons:

  1. It wasn't necessary (on Linux at least)
  2. It added complexity (more surface area for bugs)
  3. It potentially hid issues from user with their own code

Regarding (3), it's unfortunate the user needs to care about this as much as they probably do, but I sort of feel that this is a platform hazard they ought to be aware of and that we might do more harm than good in trying to work around.

Maybe the best option here is to provide a stricter subset of the API that requires correct alignment and throws an error if the user passes in misaligned data. Then we just do our best to ensure we preserve correct alignment in our own code, which might require being more specific about it in our data structures and functions. (I don't know how much of an issue this would be for us to address).

Having said that, if you feel strongly enough that we should still try to work around alignment issues, I won't argue against that.

This assumes by the way that we aren't penalized by using calls for unaligned loads when the data is actually correctly aligned already. I haven't read through those posts of yours in detail yet. If that is in fact the case, that performance suffers just by trying to be defensive about it, then that changes things and I guess we are stuck trying to correct for it.

EDIT: Maybe we should also look to see whether OpenSSL does anything about this.

@newpavlov
Copy link
Member

newpavlov commented Jul 23, 2025

It wasn't necessary (on Linux at least)

Unfortunately, it's necessary on Linux as well despite the fact that it guarantees that misaligned operations always work in user space, unless you manually pass -mno-strict-align. The problem is that this may be achieved by trapping, which can be extremely slow. So compilers have to be conservative and generate dumb code to emulate misaligned word loads with aligned byte loads. The same applies to the Zicclsm extension (again, really stupid IMO). You can see it in this link (note that the snippet is compiled for riscv64gc-unknown-linux-gnu).

Maybe the best option here is to provide a stricter subset of the API that requires correct alignment and throws an error if the user passes in misaligned data

It was suggested, but it's unclear how to design such API without making it extremely inconvenient. Varying the required alignment depending on target/backend would be also quite bad.

This assumes by the way that we aren't penalized by using calls for unaligned loads when the data is actually correctly aligned already.

Without the hack we are always penalized since the compiler generates byte loads which then get stitched together. With the hack the penalty on loads is only in adding one additional branch, which can be moved outside of block processing loop and may be even eliminated completely by the compiler if it sees that input buffer is sufficiently aligned. With stores it's a bit more complicated since the branch may prevent the compiler from applying some optimizations when combined with other crates (like keeping block data in registers).

@newpavlov
Copy link
Member

newpavlov commented Jul 23, 2025

I opened LLVM issue with suggestion to implement the branching hack as an LLVM optimization. I am really close to just saying that it's a compiler/ISA problem and do not deal with it in our code.

@silvanshade silvanshade force-pushed the riscv branch 5 times, most recently from 30f17c6 to 627354a Compare July 26, 2025 21:21
@silvanshade silvanshade changed the title Add AES support for RISC-V: RV64 Scalar and RVV Add AES support for RISC-V: RV64 vector Jul 26, 2025
@silvanshade
Copy link
Contributor Author

I added the CI tests and also split the scalar support out into a separate PR:

@silvanshade silvanshade force-pushed the riscv branch 2 times, most recently from 9eef2a9 to 4d1d42d Compare July 27, 2025 17:05
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants