
Why don't they use HSMs instead? The whole point of that hardware is to prevent key material from leaking.



These guys [1] claim to have "the fastest payment HSM in the world, capable of processing over 20,000 transactions per second." I imagine the peak load for signing auth tokens for Microsoft accounts is way higher than that.

[1] https://www.futurex.com/download/excrypt-ssp-enterprise-v-2-...


So once an hour, each auth server requests a certificate (for a newly generated private key) from the HSM. It caches that for the hour and issues certificates for clients, signed with its private key, putting them in a token that includes the chain: the cert from the HSM and the cert from the auth server. Clients validate that no cert in the chain is expired.

That way, the HSM only needs to do one transaction per hour per auth server. If auth tokens need to be valid for 24 hours, then the certificates from the HSM need to be valid for about 25 hours (plus some leeway for refresh delays maybe).

If someone compromises the auth server and gets the private key (or gets into a position to request a cert from the HSM), that's still quite bad: they have up to 25 hours to exploit it. But if this is only one of many controls, it still provides significant defence in depth and cuts off certain types of attack - especially for APTs who might not have the TTPs to gain persistence in a highly secure auth-server environment, and who only briefly manage to gain access, or who get access to stale information, as in this case.
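
A minimal sketch of that flow, assuming Ed25519 signatures via Python's 'cryptography' package as a stand-in for the HSM and for a real X.509 chain (all names and the token layout here are made up):

    import json, time
    from cryptography.hazmat.primitives import serialization
    from cryptography.hazmat.primitives.asymmetric import ed25519

    TOKEN_TTL = 24 * 3600          # auth tokens valid for 24 hours
    INTERMEDIATE_TTL = 25 * 3600   # HSM-signed intermediate valid ~25 hours

    hsm_root_key = ed25519.Ed25519PrivateKey.generate()  # stand-in for the key kept in the HSM
    hsm_root_pub = hsm_root_key.public_key()

    def hsm_issue_intermediate(server_pub):
        # One HSM transaction per auth server per hour.
        payload = json.dumps({
            "server_key": server_pub.public_bytes(
                serialization.Encoding.Raw, serialization.PublicFormat.Raw).hex(),
            "exp": time.time() + INTERMEDIATE_TTL,
        }).encode()
        return payload, hsm_root_key.sign(payload)

    def server_issue_token(server_key, intermediate, claims):
        # Runs on the auth server; no HSM involved per request.
        payload = json.dumps({**claims, "exp": time.time() + TOKEN_TTL}).encode()
        return {"intermediate": intermediate, "payload": payload,
                "sig": server_key.sign(payload)}

    def client_validate(token, root_pub):
        inter_payload, inter_sig = token["intermediate"]
        root_pub.verify(inter_sig, inter_payload)          # link 1: HSM root -> intermediate
        inter = json.loads(inter_payload)
        if inter["exp"] < time.time():
            raise ValueError("intermediate expired")
        server_pub = ed25519.Ed25519PublicKey.from_public_bytes(
            bytes.fromhex(inter["server_key"]))
        server_pub.verify(token["sig"], token["payload"])  # link 2: intermediate -> token
        claims = json.loads(token["payload"])
        if claims["exp"] < time.time():
            raise ValueError("token expired")
        return claims

    # Hourly, on each auth server:
    server_key = ed25519.Ed25519PrivateKey.generate()
    intermediate = hsm_issue_intermediate(server_key.public_key())
    # Per request:
    token = server_issue_token(server_key, intermediate, {"sub": "someone@example.com"})
    client_validate(token, hsm_root_pub)

The HSM only ever sees the hourly hsm_issue_intermediate call; per-request signing and all validation happen outside it.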


Is there a reason why they couldn't split the load across multiple HSMs? For something so sensitive I would've expected a design where one or more root/master keys (held in HSMs) are periodically used to sign certificates for temporary keys (also held in HSMs). The HSMs with the temporary keys would handle the production traffic. As long as the verification process can validate a certificate chain, this design should let them scale to as many HSMs as are needed to handle the load...


HSMs are expensive, their performance is poor, and administration is a pain. They're almost certainly running many clusters of their auth servers around the world, and would need significant capacity at every location in case traffic shifts.

It's probably a better idea to pursue short-lived private keys rather than HSMs. If the timeline is accurate, the key was captured in a crash dump in 2021 and used for evil in 2023; monthly or quarterly rotation would have made the key useless long before the end of that two-year gap.
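
To make the rotation idea concrete, a rough sketch of a monthly schedule with a short overlap window, JWKS-style (hypothetical field names; the choice of Ed25519 keys is arbitrary):

    import time, uuid
    from cryptography.hazmat.primitives.asymmetric import ed25519

    ROTATION_PERIOD = 30 * 24 * 3600   # mint a new signing key monthly
    OVERLAP = 24 * 3600                # previous key stays valid for in-flight tokens

    def rotate(key_set):
        # Hypothetical scheduled job: add a fresh signing key, drop expired ones.
        now = time.time()
        new_entry = {"kid": uuid.uuid4().hex,
                     "not_after": now + ROTATION_PERIOD + OVERLAP,
                     "private": ed25519.Ed25519PrivateKey.generate()}
        return [k for k in key_set if k["not_after"] > now] + [new_entry]

    def published_keys(key_set):
        # What verifiers see: public halves of keys still inside their validity window.
        now = time.time()
        return {k["kid"]: k["private"].public_key()
                for k in key_set if k["not_after"] > now}

On that schedule, a key captured in a 2021 crash dump would have dropped out of published_keys long before 2023 - provided verifiers only accept keys from that set.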

A certificate chain is a little too long to include in access tokens, IMHO, but I don't know how Microsoft's auth systems work.


According to https://www.wiz.io/blog/storm-0558-compromised-microsoft-key...

The key expired in April 2021. Short-lived keys only work if you actually check for expiry, which it appears they weren't doing.
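
That check is the verifier's job. A trivial sketch of what it has to include (hypothetical metadata fields; the point is that the signing key's own validity window is enforced, not just the token's exp):

    import time

    def check_signing_key(key_meta, now=None):
        # key_meta is whatever the verifier knows about the key from the published
        # key set, e.g. {"kid": ..., "not_before": ..., "not_after": ...}
        now = now or time.time()
        if now < key_meta["not_before"]:
            raise ValueError("signing key not yet valid")
        if now > key_meta["not_after"]:
            raise ValueError("signing key expired")  # an April-2021 key gets rejected in 2023

    # Order matters: look up the key by kid, run check_signing_key, verify the
    # signature, and only then check the token's own exp claim.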


The way they described it, I assumed that the crash log they got was from an HSM?


As a sibling commenter mentioned, if an HSM dumps memory that contains private key material, that's a spectacularly bad HSM - and not one whose race condition MS would have been able to fix.

Given that MS were able to fix the race condition in the crashing system whose dump included the key, it was likely a long-lived intermediate key whose private half was held in memory (with an HSM-backed root key for chain-of-trust validation, assuming MS aren't completely stupid).

The challenge is the sheer scale these servers operate at in terms of crypto ops… it would melt most dedicated HSMs.


It would be a gross failure of an HSM to allow private key material to leak in any way.



