Hacker News new | past | comments | ask | show | jobs | submit login

I'm super confused. Let's back up a step.

Most accelerators come in one of a few flavors:

1/ They implement the expensive parts of a primitive for you and let you chain them together. This is how AES-NI and the ARMv8 crypto extensions work. Performance for these is generally measured in terms of cycle latency, or with a reference piece of software in cycles per byte. Common values for cycles per byte are anywhere from about 0.2 to 30. Much higher than that and people will start to go look at software as an option. You tend to see these on beefy systems with out-of-order cores.

2/ They implement a primitive for you, eg AES-ECB or SHA256, or more rarely AES-GCM and similar. These can then be chained together as with the above to build even higher level primitives like AES-CTR or AES-CCM, or they can be used as-is. These are usually found on micros as additional selling points, and therefore show up just above the bottom of most manufacturers' product lines as an upsell. These are typically measured in something like MB/s throughput, and I assume they're what you're focused on.

3/ They implement a full protocol, like TLS, CCMP, or secure boot. These show up on things that might more properly deserve the term SoC rather than microcontroller, largely because they tend to be attached to high-speed I/O. They generally aren't measured for cryptographic performance but rather for the performance of the implemented protocol.

In my mind, all three of these are using crypto accelerators. Taken together it is extremely common that a part will have one or more of these, and I'm not sure if we're still disagreeing on one or both of those points.

Regarding ECB, I don't know what you mean. Almost nobody uses ECB alone (thank goodness). Even if they have an accelerator for it, it's usually used to implement something like CTR with some software to glue it together (maybe with then yet more glue to do GCM). In that way, those accelerators act like a just-barely-higher-level version of the first type-- and if what you have is the first type of course you'll do that no matter what. This is still an accelerated implementation, it's just not 100% done in the accelerator. Of course, if you're doing that you're very often encrypting more than a block at a time. And because it's quite rare that you will be performance bottlenecked on a small infrequent operation in any context, you generally only do the work to turn on the accelerators when you care about that.

Regarding working on MCUs, I agree there's a minimum size past which you don't get crypto primitives, but overall don't think characterizing those parts as modern SoCs is terribly accurate (which was my claim).

Regarding needing better building blocks than ECB for a protocol... well, no, not necessarily. AES-NI doesn't even give you a full AES primitive, and yet it's extremely widely used.




Yes, my main experience is with 2) and these are pretty "modern" (as in recently released MCUs) that support AES-ECB (and maybe a few more in HW). These are not ARMv8 but Cortex-M level MCUs.

The problem and the point I'm trying to make is that a few platforms implement their ECB support in such a way to make it almost useless as a building block. They do not do it as processor instructions the way it's done in x86 (the right way IMHO) but instead it's implemented in a separate co-processor that you program in a similar manner as you setup a typical DMA-transfer. If you aim to encrypt 1KB or more the setup cost for this is negligible and you can get a comparatively good speed. However as we both agree there are very few cases (if any) where you _actually_ want to run ECB over 1KB blocks at a time. When you want to build something like CTR (or CBC), what you need is a fast way to ECB _a single_ AES block (i.e. 16 bytes). With this kind of solution the setting up of the co-processor eats up almost any gains won by doing ECB in HW compared to doing ECB in SW because the cost of the setup (it's I/O after all) comes close to the cost of a SW only ECB of 16 bytes.


Hmm? With CTR you usually just want to fill a long buffer with the appropriate counters and then shove it all through the accelerator. The resulting stream can then be used until exhausted by whatever higher level primitive you're working with. Obviously there's a trade-off in sizing the buffer correctly, but dozens of blocks would be more typical than one.


Yes, and that is what I said in my very first post in this thread. Still you need to handle the counters in SW, do the XORs in SW (unless you have some HW that does that for you as well) and then if you want CCM you need to solve CBCMAC (maybe you have CBC in HW but then there's the memory trade-off again). If you want GCM you need to do BigInt muls (Cortex-M MCUs do not support 128 bit muls). So either way you end up doing pretty substantial parts of it in SW which limits the usability.




Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: