Conversation

silvanshade
Contributor

This PR implements AES support for RISC-V RV64 scalar crypto.

Related:

Member

@newpavlov newpavlov left a comment


I don't quite like the implementation in its current form. I will need to read the code a bit more thoroughly to give concrete suggestions. You may have written it while keeping RVV support in mind, but I think we should make a clear scalar implementation first. Potential generalization should be done in later PRs.

@silvanshade
Contributor Author

You may have written it while keeping RVV support in mind, but I think we should make a clear scalar implementation first. Potential generalization should be done in later PRs.

Aside from vestiges in comments and conditionals referring to vector support, which you pointed out (and which I've since removed), there isn't anything intentional in the design of the scalar implementation that is influenced by the vector implementation.

I'm not sure what you're referring to regarding generalizations.

The scalar implementation here is "idiomatic", for lack of a better description, yet it appears less straightforward than one might expect.

That isn't because I wanted to write it this way. It is a reflection of the nature of RISC-V scalar crypto instructions:

[quoted spec excerpt: semantics of the AES scalar crypto instructions]

Due to those semantics, where the state of the cipher is staggered by half-rounds in this way, you end up with code that looks like this:

#[inline(always)]
pub(super) fn enc1_two_more(&mut self, k0: RoundKey, k1: RoundKey) {
    let mut n0;
    let mut n1;
    self.data[0] ^= k0[0];
    self.data[1] ^= k0[1];
    n0 = unsafe { aes64esm(self.data[0], self.data[1]) };
    n1 = unsafe { aes64esm(self.data[1], self.data[0]) };
    n0 ^= k1[0];
    n1 ^= k1[1];
    self.data[0] = unsafe { aes64esm(n0, n1) };
    self.data[1] = unsafe { aes64esm(n1, n0) };
}

I don't think that code is very easy to understand at a glance without already being familiar with the semantics of the instructions.

That's why I tried to abstract it away into a clearer, although less direct interface:

pub(super) struct CipherState1 {
    data: [u64; 2],
}

impl CipherState1 {
    #[inline(always)]
    pub(super) fn load1(block: &Block) -> Self {
        let ptr = block.as_ptr().cast::<u64>();
        let s0 = unsafe { ptr.add(0).read_unaligned() };
        let s1 = unsafe { ptr.add(1).read_unaligned() };
        Self { data: [s0, s1] }
    }

    #[inline(always)]
    pub(super) fn save1(self, block: &mut Block) {
        let b0 = self.data[0].to_ne_bytes();
        let b1 = self.data[1].to_ne_bytes();
        block[00..08].copy_from_slice(&b0);
        block[08..16].copy_from_slice(&b1);
    }

    #[inline(always)]
    pub(super) fn xor1(&mut self, key: &RoundKey) {
        self.data[0] ^= key[0];
        self.data[1] ^= key[1];
    }

    #[inline(always)]
    pub(super) fn enc1_two_more(&mut self, k0: RoundKey, k1: RoundKey) {
        let mut n0;
        let mut n1;
        self.data[0] ^= k0[0];
        self.data[1] ^= k0[1];
        n0 = unsafe { aes64esm(self.data[0], self.data[1]) };
        n1 = unsafe { aes64esm(self.data[1], self.data[0]) };
        n0 ^= k1[0];
        n1 ^= k1[1];
        self.data[0] = unsafe { aes64esm(n0, n1) };
        self.data[1] = unsafe { aes64esm(n1, n0) };
    }

    #[inline(always)]
    pub(super) fn enc1_two_last(&mut self, k0: RoundKey, k1: RoundKey) {
        let mut n0;
        let mut n1;
        self.data[0] ^= k0[0];
        self.data[1] ^= k0[1];
        n0 = unsafe { aes64esm(self.data[0], self.data[1]) };
        n1 = unsafe { aes64esm(self.data[1], self.data[0]) };
        n0 ^= k1[0];
        n1 ^= k1[1];
        self.data[0] = unsafe { aes64es(n0, n1) };
        self.data[1] = unsafe { aes64es(n1, n0) };
    }

    #[inline(always)]
    pub(super) fn dec1_two_more(&mut self, k0: RoundKey, k1: RoundKey) {
        let mut n0;
        let mut n1;
        n0 = unsafe { aes64dsm(self.data[0], self.data[1]) };
        n1 = unsafe { aes64dsm(self.data[1], self.data[0]) };
        self.data[0] = n0 ^ k1[0];
        self.data[1] = n1 ^ k1[1];
        n0 = unsafe { aes64dsm(self.data[0], self.data[1]) };
        n1 = unsafe { aes64dsm(self.data[1], self.data[0]) };
        self.data[0] = n0 ^ k0[0];
        self.data[1] = n1 ^ k0[1];
    }

    #[inline(always)]
    pub(super) fn dec1_two_last(&mut self, k0: RoundKey, k1: RoundKey) {
        let mut n0;
        let mut n1;
        n0 = unsafe { aes64dsm(self.data[0], self.data[1]) };
        n1 = unsafe { aes64dsm(self.data[1], self.data[0]) };
        self.data[0] = n0 ^ k1[0];
        self.data[1] = n1 ^ k1[1];
        n0 = unsafe { aes64ds(self.data[0], self.data[1]) };
        n1 = unsafe { aes64ds(self.data[1], self.data[0]) };
        self.data[0] = n0 ^ k0[0];
        self.data[1] = n1 ^ k0[1];
    }
}

The situation is similar for the key-expansion instructions.

If this is what you were referring to regarding generalization, we could remove the abstraction and make the code more direct, but I think that would make it harder to understand and harder to maintain.
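For context, with the interface above a full AES-128 block encryption reduces to a short driver. This is a sketch of mine, not code from the PR: it assumes 11 round keys, the method names shown above, and a riscv64 target with the zkne extension; `encrypt1` is a hypothetical name.

```rust
// Sketch only: depends on the CipherState1 API above and RISC-V
// scalar crypto intrinsics, so it builds only for a riscv64 target.
#[inline(always)]
pub(super) fn encrypt1(keys: &RoundKeys<11>, block: &mut Block) {
    let mut state = CipherState1::load1(block);
    state.enc1_two_more(keys[0], keys[1]); // whitening + rounds 1-2
    state.enc1_two_more(keys[2], keys[3]); // rounds 3-4
    state.enc1_two_more(keys[4], keys[5]); // rounds 5-6
    state.enc1_two_more(keys[6], keys[7]); // rounds 7-8
    state.enc1_two_last(keys[8], keys[9]); // rounds 9-10 (last round skips MixColumns)
    state.xor1(&keys[10]);                 // final AddRoundKey
    state.save1(block);
}
```

The staggered half-round mechanics stay hidden inside the four methods, which is the maintainability argument: the driver reads as the textbook round schedule.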

silvanshade force-pushed the riscv-scalar branch 5 times, most recently from 8926add to 6185849 on July 29, 2025 at 18:48
silvanshade requested a review from newpavlov on July 29, 2025 at 18:53
silvanshade force-pushed the riscv-scalar branch 4 times, most recently from 8930dbf to e216555 on July 29, 2025 at 22:44
@newpavlov
Member

newpavlov commented Jul 29, 2025

Regarding the optimal number of parallel blocks: I played a bit with this snippet, which implements simple ECB encryption on aligned buffers. I measured the number of operations that perform stack-based stores/loads inside the loop and got the following results:

| Blocks | AES-128 | AES-192 | AES-256 |
|--------|---------|---------|---------|
| 1      | 0       | 6       | 10      |
| 2      | 4       | 10      | 11      |
| 3      | 5       | 12      | 13      |
| 4      | 9       | 14      | 16      |
| 5      | 10      | 14      | 18      |
| 6      | 13      | 18      | 22      |
| 7      | 15      | 19      | 26*     |
| 8      | 20      | 47*     | 59*     |
| 9      | 62      | 70*     | 90*     |

Note that the results marked with an asterisk required manual unrolling of the loop. It looks like 8 blocks is a fine number for AES-128, but for AES-192 and AES-256 it's likely worth lowering it to 6 or 7. I think the compiler could do a slightly better job with its register allocation, but alas.

On RV32 the number of parallel blocks should probably be 4 or smaller.
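For intuition on where the jump around 8-9 blocks could come from, here is a rough register-pressure model. The accounting (one temporary per state word for the staggered half-round outputs, four live round-key words) is my back-of-envelope assumption, not something measured in the PR:

```rust
// Rough estimate of GPR words live in the inner loop: each in-flight
// block needs its state words plus an equal number of temporaries for
// the staggered half-round outputs, and one round-key pair (four u64
// words on RV64) is live at any time. Numbers are illustrative only.
fn words_live(blocks: usize, state_words_per_block: usize) -> usize {
    blocks * state_words_per_block * 2 + 4
}

fn main() {
    // RV64 has 31 GPRs besides the hardwired zero register; sp/gp/tp/ra
    // are not freely allocatable, leaving roughly 27.
    const USABLE_GPRS: usize = 27;
    // RV64: 2 u64 state words per 128-bit block.
    assert!(words_live(5, 2) <= USABLE_GPRS); // 24 words: fits
    assert!(words_live(8, 2) > USABLE_GPRS);  // 36 words: must spill
    // RV32: 4 u32 state words per block, so pressure doubles, which is
    // consistent with capping RV32 at 4 blocks or fewer.
    assert!(words_live(4, 4) > USABLE_GPRS);
}
```

This model is crude (it ignores loop counters, pointers, and the compiler's ability to interleave key loads), but it agrees with the direction of the table: spills ramp up gradually and then sharply once the working set clearly exceeds the register file.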

In practice, on existing compilers the results will be much worse because of their horrible handling of misaligned loads/stores.

I also wonder if existing (or planned for production in the near future) hardware would be able to benefit from such parallel block processing. If hardware is not sufficiently advanced, then by adding parallel block processing we only increase overhead. In other words, it may be worth introducing a configuration flag which sets the parallel block number to 1 for this backend.
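A minimal sketch of such a flag, with a hypothetical cfg name (the real name and mechanism would presumably follow the crate's existing backend-selection flags):

```rust
// Hypothetical cfg name; sketch of capping the parallel block count.
// Building with `RUSTFLAGS='--cfg aes_riscv_single_block'` would make
// the backend process blocks strictly serially.
#[cfg(not(aes_riscv_single_block))]
pub(super) const PAR_BLOCKS: usize = 8;
#[cfg(aes_riscv_single_block)]
pub(super) const PAR_BLOCKS: usize = 1;
```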

@silvanshade
Contributor Author

silvanshade commented Jul 30, 2025

Regarding the optimal number of parallel blocks: I played a bit with this snippet, which implements simple ECB encryption on aligned buffers. I measured the number of operations that perform stack-based stores/loads inside the loop and got the following results:

That's useful to see, thanks.

Note that the results marked with an asterisk required manual unrolling of the loop. It looks like 8 blocks is a fine number for AES-128, but for AES-192 and AES-256 it's likely worth lowering it to 6 or 7. I think the compiler could do a slightly better job with its register allocation, but alas.

That seems reasonable given the data. Are you suggesting we use a different number of parallel blocks for each key size, then?

I also wonder if existing (or planned for production in the near future) hardware would be able to benefit from such parallel block processing.

I don't know. It's still apparently hard to find information about existing or planned implementations.

From this dump of RISC-V CPU features, the only {zknd, zkne} implementations LLVM is aware of are Syntacore SCR7 and XiangShan Nanhu. It doesn't seem like either of those is readily available, although the Nanhu may eventually arrive in a Milk-V product.

EDIT: And apparently the only known cores that implement RVV crypto are the SiFive p470, SiFive p670, and Tenstorrent Ascalon D8. But they don't seem to support scalar crypto, if these feature sets are correct.

In other words, it may be worth introducing a configuration flag which sets the parallel block number to 1 for this backend.

That would be fine I think.

Some of this can be changed later too, though, right? I mean, we don't necessarily have to figure out the optimal configuration right away, especially if there's no practical way to actually measure real performance.

@newpavlov
Member

newpavlov commented Jul 30, 2025

I think as part of this PR it's worth lowering the number of blocks for AES-192 and AES-256 and manually unrolling the loop for the parallel encrypt/decrypt functions. For the latter we can do something like this:

pub(super) fn par_encrypt<const N: usize, M: ArraySize>(
    keys: &RoundKeys<N>,
    mut par_block: InOut<'_, '_, ParBlock<M>>,
) {
    assert!(N == 11 || N == 13 || N == 15);
    let mut block = AlignedParBlock::load(par_block.get_in());
    let (rk_pairs, [last_rk]) = keys.as_chunks::<2>() else {
        unreachable!("round keys failed pattern check");
    };
    
    block.encrypt(&rk_pairs[0]);
    block.encrypt(&rk_pairs[1]);
    block.encrypt(&rk_pairs[2]);
    block.encrypt(&rk_pairs[3]);
    if N == 13 || N == 15 {
        block.encrypt(&rk_pairs[4]);
    }
    if N == 15 {
        block.encrypt(&rk_pairs[5]);
    }
    block.encrypt_last(&rk_pairs[N / 2 - 1]);
    
    block.xor(last_rk);
    block.save(par_block.get_out());
}
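As a quick sanity check of the indexing in this sketch (my check, not part of the PR), the `encrypt` / `encrypt_last` pair indices can be verified to consume every round key exactly once for N = 11, 13, 15:

```rust
// Mirrors the unconditional and conditional `encrypt` calls in the
// sketch: 4 pairs always, +1 for N >= 13, +1 more for N == 15.
fn middle_pairs(n: usize) -> usize {
    let mut pairs = 4;
    if n == 13 || n == 15 { pairs += 1; }
    if n == 15 { pairs += 1; }
    pairs
}

fn main() {
    for n in [11usize, 13, 15] {
        // `encrypt_last` uses the pair at index N / 2 - 1, i.e. exactly
        // the first pair not consumed by the `encrypt` calls.
        assert_eq!(n / 2 - 1, middle_pairs(n));
        // Total keys touched: 2 per pair (middle pairs plus the last
        // pair) plus the final whitening key from the as_chunks
        // remainder, which must equal N.
        assert_eq!(2 * (middle_pairs(n) + 1) + 1, n);
    }
}
```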

Addition of the non-parallel flag can be done in a separate PR.

@newpavlov
Member

Also note that in the godbolt snippet above the compiler annoyingly and absolutely unnecessarily spills round keys to the stack. It's a known problem, but there has been zero work on it in the years since it was reported, so I wouldn't bet on it being resolved anytime soon...
