I'm super confused. Let's back up a step. Most accelerators come in one of a few...

anon4242 · on Nov 24, 2019

Yes, my main experience is with 2) and these are pretty "modern" (as in recently released MCUs) that support AES-ECB (and maybe a few more in HW). These are not ARMv8 but Cortex-M level MCUs.

The problem and the point I'm trying to make is that a few platforms implement their ECB support in such a way to make it almost useless as a building block. They do not do it as processor instructions the way it's done in x86 (the right way IMHO) but instead it's implemented in a separate co-processor that you program in a similar manner as you setup a typical DMA-transfer. If you aim to encrypt 1KB or more the setup cost for this is negligible and you can get a comparatively good speed. However as we both agree there are very few cases (if any) where you _actually_ want to run ECB over 1KB blocks at a time. When you want to build something like CTR (or CBC), what you need is a fast way to ECB _a single_ AES block (i.e. 16 bytes). With this kind of solution the setting up of the co-processor eats up almost any gains won by doing ECB in HW compared to doing ECB in SW because the cost of the setup (it's I/O after all) comes close to the cost of a SW only ECB of 16 bytes.

debatem1 · on Nov 25, 2019

Hmm? With CTR you usually just want to fill a long buffer with the appropriate counters and then shove it all through the accelerator. The resulting stream can then be used until exhausted by whatever higher level primitive you're working with. Obviously there's a trade-off in sizing the buffer correctly, but dozens of blocks would be more typical than one.

anon4242 · on Nov 26, 2019

Yes, and that is what I said in my very first post in this thread. Still you need to handle the counters in SW, do the XORs in SW (unless you have some HW that does that for you as well) and then if you want CCM you need to solve CBCMAC (maybe you have CBC in HW but then there's the memory trade-off again). If you want GCM you need to do BigInt muls (Cortex-M MCUs do not support 128 bit muls). So either way you end up doing pretty substantial parts of it in SW which limits the usability.