Add AES support for RISC-V: RV64 vector #492
Thank you! I would really like for this code to be tested somehow. Do you know the status of RVV support in QEMU? IIRC the scalar crypto extension should be properly supported by it. Also note that RISC-V has an annoyingly horrible (IMO) handling of misaligned loads, in
I think you should fall back to the soft key expansion here, we certainly should not remove
Do you know how many blocks are processed in parallel by existing hardware that supports the vector crypto extension? Have you inspected the resulting assembly for the scalar crypto code? It would be nice to have godbolt links for these. I would expect this number to vary between AES variants and RV32/64. |
It also would be better to create a separate PR for the scalar crypto support. We probably will be able to merge it faster. |
Yes. I have tested it in QEMU (both scalar and vector) and it passes all the tests. The main issue is that I don't have a point of reference for how much faster it is than software, since I don't have any RISC-V hardware with crypto extensions. You should be able to run it with the following:
# rv64 only
[target.riscv64gc-unknown-linux-gnu]
runner = "qemu-riscv64-static -cpu rv64,zkne=true,zknd=true"
linker = "riscv64-linux-gnu-gcc"
rustflags = [
"-Ctarget-feature=+zkne,+zknd"
]
# rv64 + rvv
[target.riscv64gc-unknown-linux-gnu]
runner = "qemu-riscv64-static -cpu rv64,v=true,vext_spec=v1.0,zkne=true,zknd=true,zvkned=true"
linker = "riscv64-linux-gnu-gcc"
rustflags = [
"-Ctarget-feature=+v,+zkne,+zknd,+zvkned"
]
Then just You may also need to set
I know that misaligned loads can be an issue on RISC-V but haven't actually had to deal with that directly much in practice. I haven't done anything to specifically handle that in this implementation but it is something to consider. I did see your post on reddit and the I would expect the situation is probably indeed similar for RVV. It may be that we would need to implement similar workarounds here.
No. I need to look into this again. I think the last time I checked there wasn't much information available because hardware implementations were rare.
I haven't looked at the generated code in detail yet. Anything specific you are looking for in that, like performance around misaligned loads or something else? For godbolt, I think I'd have to factor out some part of the algorithms and make it self-contained to be able to compile on godbolt. Is that what you are suggesting?
Okay. I can do that. |
Are you suggesting to use the fixslice software implementation for key expansion? The reason I didn't use that already is because I wasn't sure how to convert from the fixslice key-expansion representation back to the standard representation. If that's what you're suggesting, can you give me an example of how to do that? If you're instead suggesting a new software implementation, I didn't do that because I assumed we would want a side-channel resistant implementation (like the fixslice) and wasn't sure what would be a good approach for that here. |
I meant that we should test the new backends as part of our CI, you can see the
Yeah, just factor out the encrypt/decrypt functions like I did in this PR. It would allow us to see any potential problems (such as handling of misaligned pointers) and to see how the number of blocks changes the generated code.
As a starting point we could use the bitsliced implementation of |
I was thinking about alignment more, trying to recall my stance on the issue when I first implemented these algorithms. I think in my earliest implementation, I did actually perform some sort of alignment correction like what you refer to. But ultimately I decided not to include that, for a few reasons:
Regarding (3), it's unfortunate that the user needs to care about this as much as they probably do, but I feel that this is a platform hazard they ought to be aware of, and that we might do more harm than good in trying to work around it. Maybe the best option here is to provide a stricter subset of the API that requires correct alignment and throws an error if the user passes in misaligned data. Then we just do our best to preserve correct alignment in our own code, which might require being more specific about it in our data structures and functions. (I don't know how much of an issue this would be for us to address.)
Having said that, if you feel strongly enough that we should still try to work around alignment issues, I won't argue against that. This assumes, by the way, that we aren't penalized for using unaligned-load calls when the data is actually correctly aligned already. I haven't read through those posts of yours in detail yet. If that is in fact the case, and performance suffers just from trying to be defensive about it, then that changes things and I guess we are stuck trying to correct for it.
EDIT: Maybe we should also look at whether OpenSSL does anything about this. |
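To make the "stricter subset of the API" idea concrete, here is a minimal sketch of an alignment check that reports an error for misaligned input. The names `MisalignedError` and `check_aligned` are hypothetical and are not part of this PR or of the `aes` crate; the sketch only shows the kind of check such an API would need at its boundary.

```rust
/// Hypothetical error type, for illustration only.
#[derive(Debug, PartialEq, Eq)]
pub struct MisalignedError {
    /// Alignment (in bytes) that the caller was required to provide.
    pub required: usize,
    /// How far the buffer start is past the previous aligned address.
    pub offset: usize,
}

/// Hypothetical check: succeeds only if `buf` starts at an address that is
/// a multiple of `required`, which must be a power of two.
pub fn check_aligned(buf: &[u8], required: usize) -> Result<(), MisalignedError> {
    assert!(required.is_power_of_two(), "alignment must be a power of two");
    // For a power-of-two alignment, the low bits of the address give the offset.
    let offset = (buf.as_ptr() as usize) & (required - 1);
    if offset == 0 {
        Ok(())
    } else {
        Err(MisalignedError { required, offset })
    }
}
```

A caller would run the check once per buffer and then use aligned loads for every block inside it. The sketch does not address the harder question of how to expose a backend-dependent alignment requirement conveniently.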
Unfortunately, it's necessary on Linux as well despite the fact that it guarantees that misaligned operations always work in user space, unless you manually pass
It was suggested, but it's unclear how to design such an API without making it extremely inconvenient. Varying the required alignment depending on the target/backend would also be quite bad.
Without the hack we are always penalized, since the compiler generates byte loads which then get stitched together. With the hack, the penalty on loads is only one additional branch, which can be moved outside of the block processing loop and may even be eliminated completely by the compiler if it sees that the input buffer is sufficiently aligned. With stores it's a bit more complicated, since the branch may prevent the compiler from applying some optimizations when combined with other crates (like keeping block data in registers). |
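The branching idea described above can be sketched roughly as follows. This is a simplified illustration and not the code used in this PR or in any RustCrypto crate: the function name `load_block` is hypothetical, and the slow path here simply assembles the words from bytes instead of doing anything more elaborate.

```rust
use core::mem::align_of;

/// Hypothetical helper: load a 16-byte block as two little-endian `u64` words,
/// branching once on the alignment of the input pointer.
pub fn load_block(block: &[u8]) -> [u64; 2] {
    assert_eq!(block.len(), 16, "expected exactly one 16-byte block");
    let ptr = block.as_ptr();
    if (ptr as usize) % align_of::<u64>() == 0 {
        // Fast path: the pointer is 8-byte aligned, so word loads are valid.
        // SAFETY: `ptr` is valid for reads of 16 bytes (length checked above)
        // and was just checked to be aligned for `u64`.
        let words = unsafe { ptr.cast::<[u64; 2]>().read() };
        [u64::from_le(words[0]), u64::from_le(words[1])]
    } else {
        // Slow path: build each word from individual bytes, which is what the
        // compiler would otherwise emit for every load on targets without
        // guaranteed fast misaligned access.
        let mut lo = [0u8; 8];
        let mut hi = [0u8; 8];
        lo.copy_from_slice(&block[..8]);
        hi.copy_from_slice(&block[8..]);
        [u64::from_le_bytes(lo), u64::from_le_bytes(hi)]
    }
}
```

When many blocks come from one contiguous buffer, the alignment test is the same for every block, which is why the branch can in principle be hoisted out of the processing loop.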
I opened an LLVM issue with a suggestion to implement the branching hack as an LLVM optimization. I am really close to just saying that it's a compiler/ISA problem and not dealing with it in our code. |
I added the CI tests and also split the scalar support out into a separate PR: |
This PR adds AES support for RV64 scalar and RVV.
Some caveats:
The implementation uses a parallel block count (par blocks) of 8 for scalar and 64 for RVV. I don't have any capable hardware to test this with and am not sure what the ideal configuration should be. The RVV code is written in VLA (vector-length-agnostic) style, so the par block size can easily be adapted to anything. In the long run it would probably be better to provide an API that can take advantage of this directly.
If RVV is selected for the backend and scalar crypto is not also supported (no idea how common this is in practice), then AES-192 support is disabled. This is because RVV crypto does not provide key expansion instructions for AES-192, only for AES-128 and AES-256. If scalar crypto is available, we can fall back to the scalar implementation just for key expansion for AES-192.
It would be possible to implement something special for that case instead of completely disabling it if someone wants to take that on. Maybe the fixslice implementation can be adapted somehow.
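The AES-192 caveat above amounts to a small decision table, sketched here for the case where the RVV backend has already been selected. The types `KeySize` and `KeyExpansion` and the function `rvv_key_expansion` are hypothetical names for illustration; the PR itself may make this choice at compile time via target features rather than through a function like this.

```rust
/// Hypothetical enum naming the three AES key sizes.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum KeySize {
    Aes128,
    Aes192,
    Aes256,
}

/// Hypothetical enum naming which instruction set performs key expansion.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum KeyExpansion {
    Vector,
    Scalar,
}

/// Assuming the RVV backend is selected, choose how to expand the key.
/// Returns `None` when the key size is unsupported in that configuration.
pub fn rvv_key_expansion(size: KeySize, has_scalar_crypto: bool) -> Option<KeyExpansion> {
    match size {
        // Vector crypto has key expansion instructions for these two sizes.
        KeySize::Aes128 | KeySize::Aes256 => Some(KeyExpansion::Vector),
        // No vector key expansion for AES-192: use scalar crypto if present.
        KeySize::Aes192 if has_scalar_crypto => Some(KeyExpansion::Scalar),
        // Otherwise AES-192 is disabled, as described in the caveat.
        KeySize::Aes192 => None,
    }
}
```

A software key expansion, as suggested earlier in the thread, would replace the final `None` arm with a third variant.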
Some remaining work: