Add AES support for RISC-V: RV64 scalar #493
Conversation
Force-pushed from b743abb to 029982c
I don't quite like the implementation in its current form. I will need to read the code a bit more thoroughly to give concrete suggestions. You may have written it while keeping RVV support in mind, but I think we should make a clear scalar implementation first. Potential generalization should be done in later PRs.
Force-pushed from 029982c to 55facb2
Aside from vestiges in comments and conditionals referring to vector support, which you pointed out (and which I've since removed), nothing in the design of the scalar implementation was intentionally influenced by the vector implementation. I'm not sure what you're referring to regarding generalizations. The scalar implementation here is "idiomatic", for lack of a better description, and looks less straightforward than one might expect. That isn't because I wanted to write it this way; it reflects the nature of the RISC-V scalar crypto instructions. Because of their semantics, which stagger the cipher state across half-rounds, you end up with code that looks like this:

```rust
#[inline(always)]
pub(super) fn enc1_two_more(&mut self, k0: RoundKey, k1: RoundKey) {
    let mut n0;
    let mut n1;
    // AddRoundKey with k0, then one middle round: each aes64esm call
    // produces one 64-bit half of the next state from the full 128-bit
    // state, hence the swapped operands between the two calls.
    self.data[0] ^= k0[0];
    self.data[1] ^= k0[1];
    n0 = unsafe { aes64esm(self.data[0], self.data[1]) };
    n1 = unsafe { aes64esm(self.data[1], self.data[0]) };
    // AddRoundKey with k1, then a second middle round.
    n0 ^= k1[0];
    n1 ^= k1[1];
    self.data[0] = unsafe { aes64esm(n0, n1) };
    self.data[1] = unsafe { aes64esm(n1, n0) };
}
```

I don't think that code is very easy to understand at a glance without already being familiar with the semantics of the instructions. That's why I tried to abstract it away behind a clearer, although less direct, interface:

```rust
pub(super) struct CipherState1 {
    data: [u64; 2],
}

impl CipherState1 {
    /// Loads one block into the two 64-bit state halves.
    #[inline(always)]
    pub(super) fn load1(block: &Block) -> Self {
        let ptr = block.as_ptr().cast::<u64>();
        let s0 = unsafe { ptr.add(0).read_unaligned() };
        let s1 = unsafe { ptr.add(1).read_unaligned() };
        Self { data: [s0, s1] }
    }

    /// Stores the state back into a block.
    #[inline(always)]
    pub(super) fn save1(self, block: &mut Block) {
        let b0 = self.data[0].to_ne_bytes();
        let b1 = self.data[1].to_ne_bytes();
        block[00..08].copy_from_slice(&b0);
        block[08..16].copy_from_slice(&b1);
    }

    /// AddRoundKey.
    #[inline(always)]
    pub(super) fn xor1(&mut self, key: &RoundKey) {
        self.data[0] ^= key[0];
        self.data[1] ^= key[1];
    }

    /// Two encryption rounds (keys k0 then k1), both middle rounds.
    #[inline(always)]
    pub(super) fn enc1_two_more(&mut self, k0: RoundKey, k1: RoundKey) {
        let mut n0;
        let mut n1;
        self.data[0] ^= k0[0];
        self.data[1] ^= k0[1];
        n0 = unsafe { aes64esm(self.data[0], self.data[1]) };
        n1 = unsafe { aes64esm(self.data[1], self.data[0]) };
        n0 ^= k1[0];
        n1 ^= k1[1];
        self.data[0] = unsafe { aes64esm(n0, n1) };
        self.data[1] = unsafe { aes64esm(n1, n0) };
    }

    /// Two encryption rounds where the second is the final round
    /// (aes64es omits MixColumns).
    #[inline(always)]
    pub(super) fn enc1_two_last(&mut self, k0: RoundKey, k1: RoundKey) {
        let mut n0;
        let mut n1;
        self.data[0] ^= k0[0];
        self.data[1] ^= k0[1];
        n0 = unsafe { aes64esm(self.data[0], self.data[1]) };
        n1 = unsafe { aes64esm(self.data[1], self.data[0]) };
        n0 ^= k1[0];
        n1 ^= k1[1];
        self.data[0] = unsafe { aes64es(n0, n1) };
        self.data[1] = unsafe { aes64es(n1, n0) };
    }

    /// Two decryption rounds; note the keys are applied in reverse
    /// order (k1 first, then k0).
    #[inline(always)]
    pub(super) fn dec1_two_more(&mut self, k0: RoundKey, k1: RoundKey) {
        let mut n0;
        let mut n1;
        n0 = unsafe { aes64dsm(self.data[0], self.data[1]) };
        n1 = unsafe { aes64dsm(self.data[1], self.data[0]) };
        self.data[0] = n0 ^ k1[0];
        self.data[1] = n1 ^ k1[1];
        n0 = unsafe { aes64dsm(self.data[0], self.data[1]) };
        n1 = unsafe { aes64dsm(self.data[1], self.data[0]) };
        self.data[0] = n0 ^ k0[0];
        self.data[1] = n1 ^ k0[1];
    }

    /// Two decryption rounds where the second is the final round
    /// (aes64ds omits InvMixColumns).
    #[inline(always)]
    pub(super) fn dec1_two_last(&mut self, k0: RoundKey, k1: RoundKey) {
        let mut n0;
        let mut n1;
        n0 = unsafe { aes64dsm(self.data[0], self.data[1]) };
        n1 = unsafe { aes64dsm(self.data[1], self.data[0]) };
        self.data[0] = n0 ^ k1[0];
        self.data[1] = n1 ^ k1[1];
        n0 = unsafe { aes64ds(self.data[0], self.data[1]) };
        n1 = unsafe { aes64ds(self.data[1], self.data[0]) };
        self.data[0] = n0 ^ k0[0];
        self.data[1] = n1 ^ k0[1];
    }
}
```

The situation is similar for the key-expansion instructions. If this is what you were referring to regarding generalization, we could remove the abstraction and make the code more direct, but I think that would make it harder to understand and harder to maintain.
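To make the round accounting concrete, here is a minimal sketch (illustrative only, not code from this PR) of how these methods chain together for AES-128, assuming the expanded keys are available as a plain `[RoundKey; 11]`:

```rust
// Sketch only: full AES-128 block encryption via the interface above.
// Each enc1_two_* call covers two rounds, so the staggered half-rounds
// still add up to the usual 10 rounds and 11 round-key additions.
fn encrypt_block_aes128(keys: &[RoundKey; 11], block: &mut Block) {
    let mut state = CipherState1::load1(block);
    state.enc1_two_more(keys[0], keys[1]); // AddRoundKey(k0) + round 1, AddRoundKey(k1) + round 2
    state.enc1_two_more(keys[2], keys[3]); // rounds 3-4
    state.enc1_two_more(keys[4], keys[5]); // rounds 5-6
    state.enc1_two_more(keys[6], keys[7]); // rounds 7-8
    state.enc1_two_last(keys[8], keys[9]); // round 9 + final round 10 (no MixColumns)
    state.xor1(&keys[10]);                 // final AddRoundKey
    state.save1(block);
}
```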
Force-pushed from 55facb2 to d2afc69
Force-pushed from 8926add to 6185849
Force-pushed from 8930dbf to e216555
Regarding the optimal number of parallel blocks: I played a bit with this snippet, which implements simple ECB encryption on aligned buffers. I measured the number of operations that perform stack-based stores/loads inside the loop and got the following results:

Note that the results marked with an asterisk required manual unrolling of the loop. It looks like 8 blocks is a fine number for AES-128, but for AES-192 and AES-256 it's likely worth lowering it to 6 or 7. I think the compiler could do a slightly better job with its register allocation, but alas. On RV32 the number of parallel blocks should probably be 4 or smaller. In practice, on existing compilers the results will be much worse because of the horrible handling of misaligned loads/stores. I also wonder whether existing hardware (or hardware planned for production in the near future) would be able to benefit from such parallel block processing. If the hardware is not sufficiently advanced, then adding parallel block processing only increases overhead. In other words, it may be worth introducing a configuration flag which sets the parallel block number to 1 for this backend.
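To spell out what parallel block processing buys here: interleaving N blocks keeps N independent `aes64esm` dependency chains in flight, at a cost of two GPRs of live state per block. A rough sketch of the shape (illustrative only; `ParState` is a hypothetical name, and `aes64esm` is assumed to be the same binding used in the snippets above):

```rust
// Illustrative only: N blocks kept live as 2*N u64s (plus the current
// round-key pair), with each round step issued across all blocks so the
// per-block aes64esm chains are independent and can overlap.
struct ParState<const N: usize> {
    data: [[u64; 2]; N],
}

impl<const N: usize> ParState<N> {
    #[inline(always)]
    fn round(&mut self, key: [u64; 2]) {
        // AddRoundKey for every block first...
        for i in 0..N {
            self.data[i][0] ^= key[0];
            self.data[i][1] ^= key[1];
        }
        // ...then one middle round per block; the N pairs of aes64esm
        // calls have no cross-block dependencies.
        for i in 0..N {
            let (s0, s1) = (self.data[i][0], self.data[i][1]);
            self.data[i][0] = unsafe { aes64esm(s0, s1) };
            self.data[i][1] = unsafe { aes64esm(s1, s0) };
        }
    }
}
```

With N = 8 that is already 16 GPRs of block state before counting round keys and loop bookkeeping, which is consistent with the extra register pressure observed above for AES-192 and AES-256.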
e216555
to
bb383a4
Compare
That's useful to see, thanks.
That seems reasonable given the data. You're suggesting we use a different number of parallel blocks for each key size, then?
I don't know. It's still apparently hard to find information about existing or planned implementations. EDIT: Judging from this dump of RISC-V CPU features, apparently the only known cores that implement RVV crypto are the SiFive p470, SiFive p670, and Tenstorrent Ascalon D8, but they don't seem to support scalar crypto, if these feature sets are correct.
That would be fine, I think. Some of this can be changed later too, right? I mean, we don't necessarily have to figure out the optimal configuration right away, especially if there's no practical way to actually measure real performance.
I think as part of this PR it's worth lowering the number of blocks for AES-192 and AES-256 and manually unrolling the loop for the parallel encrypt/decrypt functions. For the latter we can do something like this:

```rust
pub(super) fn par_encrypt<const N: usize, M: ArraySize>(
    keys: &RoundKeys<N>,
    mut par_block: InOut<'_, '_, ParBlock<M>>,
) {
    // N is the total number of round keys: 11, 13, or 15 for
    // AES-128, AES-192, and AES-256 respectively.
    assert!(N == 11 || N == 13 || N == 15);
    let mut block = AlignedParBlock::load(par_block.get_in());
    // Split the keys into pairs, with exactly one key left over.
    let (rk_pairs, [last_rk]) = keys.as_chunks::<2>() else {
        unreachable!("round keys failed pattern check");
    };
    block.encrypt(&rk_pairs[0]);
    block.encrypt(&rk_pairs[1]);
    block.encrypt(&rk_pairs[2]);
    block.encrypt(&rk_pairs[3]);
    // These branches are resolved at monomorphization time,
    // so each key size gets a straight-line body.
    if N == 13 || N == 15 {
        block.encrypt(&rk_pairs[4]);
    }
    if N == 15 {
        block.encrypt(&rk_pairs[5]);
    }
    block.encrypt_last(&rk_pairs[N / 2 - 1]);
    block.xor(last_rk);
    block.save(par_block.get_out());
}
```

Addition of the non-parallel flag can be done in a separate PR.
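A rough sketch of what such a flag might look like (the cfg name and constant here are invented for illustration; nothing in this PR defines them):

```rust
// Hypothetical cfg-gated constant (names invented): building with
// `RUSTFLAGS='--cfg aes_riscv_no_parallel'` would collapse the backend
// to single-block processing on hardware that cannot overlap the
// independent aes64esm chains, avoiding the interleaving overhead.
#[cfg(aes_riscv_no_parallel)]
pub(super) const PAR_BLOCKS: usize = 1;
#[cfg(not(aes_riscv_no_parallel))]
pub(super) const PAR_BLOCKS: usize = 8;
```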
Also note that in the godbolt snippet above the compiler annoyingly, and absolutely unnecessarily, spills the round keys to the stack. It's a known problem, but there has been zero work on it in the years since it was reported, so I wouldn't bet on it being resolved anytime soon...
This PR implements AES support for RISC-V RV64 scalar crypto.
Related: