> Network is much much more reliable than it used to be.

But would anyone think of the bit flips?

> We've now determined that message corruption was the cause of the server-to-server communication problems. More specifically, we found that there were a handful of messages on Sunday morning that had a single bit corrupted such that the message was still intelligible, but the system state information was incorrect.

> We use MD5 checksums throughout the system, for example, to prevent, detect, and recover from corruption that can occur during receipt, storage, and retrieval of customers' objects. However, we didn't have the same protection in place to detect whether this particular internal state information had been corrupted. As a result, when the corruption occurred, we didn't detect it and it spread throughout the system causing the symptoms described above. We hadn't encountered server-to-server communication issues of this scale before and, as a result, it took some time during the event to diagnose and recover from it.
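To make the failure mode concrete, here is a minimal sketch of checksumming an internal state message so that a single flipped bit gets rejected instead of propagating. Everything in it is an assumption for illustration: the JSON message layout, the `free_slots` field, and the `wrap`/`unwrap`/`flip_bit` helpers are hypothetical and are not how AWS's internal protocol works. MD5 is used only because the postmortem mentions it; a CRC or any other digest would serve the same purpose here.

```python
import hashlib
import json


def wrap(state: dict) -> bytes:
    """Serialize state and prefix it with an MD5 hex digest (hypothetical format)."""
    payload = json.dumps(state, sort_keys=True).encode("utf-8")
    digest = hashlib.md5(payload).hexdigest().encode("ascii")
    return digest + b"\n" + payload


def unwrap(message: bytes) -> dict:
    """Verify the digest before trusting the payload; raise on any mismatch."""
    digest, _, payload = message.partition(b"\n")
    if hashlib.md5(payload).hexdigest().encode("ascii") != digest:
        raise ValueError("checksum mismatch: message corrupted")
    return json.loads(payload)


def flip_bit(data: bytes, bit_index: int) -> bytes:
    """Return a copy of data with exactly one bit inverted."""
    corrupted = bytearray(data)
    corrupted[bit_index // 8] ^= 1 << (bit_index % 8)
    return bytes(corrupted)


if __name__ == "__main__":
    state = {"node": "a", "free_slots": 4}

    # Without a checksum, one flipped bit turns "4" into "5" and still parses.
    raw = json.dumps(state, sort_keys=True).encode("utf-8")
    bad = flip_bit(raw, raw.index(b"4") * 8)
    print(json.loads(bad))

    # With a checksum, the same kind of corruption is caught on receipt.
    try:
        unwrap(flip_bit(wrap(state), 300))
    except ValueError as err:
        print(err)
```

The unchecked case is the nasty one: the corrupted message is still valid JSON with a plausible value, which matches the "still intelligible, but the system state information was incorrect" description in the quote.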

https://web.archive.org/web/20150726045623/http://status.aws...

See also: At scale, rare events aren't rare, https://news.ycombinator.com/item?id=14038044 (2017).