
There are three reasons why that approach tends to work well in scientific computing circles.

First off, individual computational errors are rarely important. Ex: simulating galactic evolution, as long as all the velocities stay reasonable, each individual calculation is fairly unimportant, and bounds checking is fairly inexpensive (there's a rough sketch of what I mean at the end of this comment).

Second, there is a minimal time constraint; losing 30 minutes of simulation time a day is a reasonable sacrifice for gaining efficiency in other areas.

Third, computational resources tend to be expensive, local, and fixed. Think Cray Jaguar, not all those spare CPU cycles running Folding@home.

However, if you're running VISA or World of Warcraft, then you get a different set of optimizations.
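
To make the bounds checking in the first point concrete, here is a rough C sketch of the kind of check I mean; the speed cap and the flat velocity arrays are made-up details for illustration, not anything from a particular code:

    #include <math.h>
    #include <stdio.h>

    /* Made-up cap: a few thousand km/s is already extreme for stars,
     * so anything above it (or non-finite) is treated as suspect. */
    #define MAX_SPEED_KMS 3.0e3

    /* Returns the number of particles whose speed is implausible. */
    size_t check_velocities(const double *vx, const double *vy,
                            const double *vz, size_t n) {
        size_t bad = 0;
        for (size_t i = 0; i < n; i++) {
            double speed = sqrt(vx[i]*vx[i] + vy[i]*vy[i] + vz[i]*vz[i]);
            if (!isfinite(speed) || speed > MAX_SPEED_KMS) {
                fprintf(stderr, "particle %zu: suspicious speed %g km/s\n",
                        i, speed);
                bad++;
            }
        }
        return bad;
    }

A pass like this touches each particle once per step, which is cheap next to the force calculation, and the caller can decide whether a flagged particle means redoing the step or rolling back.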




First off, individual computational errors are rarely important.

This is completely wrong. Arithmetic errors (or memory errors) tend not to just mess up insignificant bits. If one occurs in integer data, then your program will probably seg-fault (because integers are usually indices into arrays), and if it occurs in floating-point data, you are likely to produce either a tiny value (perhaps making a system singular) or a very large one. If you are solving an aerodynamic problem and compute a pressure of 10^80, then you might as well have a supernova on the leading edge of your wing. And flipping a minor bit in a structural dynamics simulation could easily be the difference between the building standing and falling.
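
The "very large value" failure mode is easy to demonstrate. Here is a small C sketch that flips a single exponent bit of an ordinary pressure value; the 1 atm starting value and the choice of which bit to flip are arbitrary, purely for illustration:

    #include <stdio.h>
    #include <string.h>
    #include <stdint.h>

    int main(void) {
        double pressure = 101325.0;             /* ~1 atm in pascals */
        uint64_t bits;
        memcpy(&bits, &pressure, sizeof bits);  /* view the IEEE-754 bits */
        bits ^= 1ULL << 60;                     /* flip one exponent bit */
        memcpy(&pressure, &bits, sizeof bits);
        printf("corrupted pressure: %e Pa\n", pressure);  /* ~1e82 Pa */
        return 0;
    }

One flipped bit turns atmospheric pressure into something on the order of 10^82 Pa, which is the kind of value that blows up the rest of the solve rather than hiding in the noise.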

I would argue that data mining is actually more tolerant of such undetected errors because they are more likely to remain local and may stand out as obviously erroneous. People are unlikely to die as the result of an arithmetic error in data mining.

Second, there is a minimal time constraint,

There is not usually a real-time requirement, though there are exceptions, e.g. http://spie.org/x30406.xml, or search for "real-time PDE-constrained optimization". But by and large, we are willing to wait for restart from a checkpoint rather than sacrifice a huge amount of performance to get continual uptime. If you need guaranteed uptime, then there is no choice but to run everything redundantly, and that still won't allow you to handle a nontrivial network partition gracefully. (It's not a database, but there is something like the CAP Theorem here.)
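
For what it's worth, the core of checkpoint/restart is simple; this is a hedged C sketch where the single-array state and raw binary layout are assumptions on my part (production codes use parallel I/O such as MPI-IO or HDF5, and write to a temporary file followed by an atomic rename so a crash mid-write cannot destroy the previous checkpoint):

    #include <stdio.h>
    #include <stdlib.h>

    typedef struct {
        long    step;   /* which timestep the state belongs to */
        size_t  n;      /* number of doubles in data */
        double *data;   /* the simulation state itself */
    } State;

    /* Write the whole state; returns 0 on success. */
    int write_checkpoint(const char *path, const State *s) {
        FILE *f = fopen(path, "wb");
        if (!f) return -1;
        int ok = fwrite(&s->step, sizeof s->step, 1, f) == 1
              && fwrite(&s->n,    sizeof s->n,    1, f) == 1
              && fwrite(s->data,  sizeof *s->data, s->n, f) == s->n;
        return (fclose(f) == 0 && ok) ? 0 : -1;
    }

    /* Load the saved state; returns 0 on success, -1 if there is
     * no usable checkpoint and the run must start from scratch. */
    int read_checkpoint(const char *path, State *s) {
        FILE *f = fopen(path, "rb");
        if (!f) return -1;
        int ok = fread(&s->step, sizeof s->step, 1, f) == 1
              && fread(&s->n,    sizeof s->n,    1, f) == 1;
        if (ok) {
            s->data = malloc(s->n * sizeof *s->data);
            ok = s->data && fread(s->data, sizeof *s->data, s->n, f) == s->n;
        }
        if (!ok) { free(s->data); s->data = NULL; }
        fclose(f);
        return ok ? 0 : -1;
    }

The checkpoint interval is then just a knob trading lost work on a failure against I/O overhead during normal running.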


For the most part you can keep things reasonable with bounds checking. If the pressure in some area is 10x that of the areas around it, then there was a mistake in the calculation. If you're simulating weather patterns on Earth over time, having a single square mile 10 degrees above what it should be is not going to do horrible things to the model. Clearly there are tipping points, but if it's that close to a tipping point the outcome is fairly random anyway.
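
That "10x its surroundings" test is cheap to write down too. Here is a hedged C sketch over a 1-D row of cells, where the factor of 10 is the rule of thumb from above and everything else is made up:

    #include <stdio.h>

    #define SUSPECT_RATIO 10.0

    /* Flag any interior cell whose pressure is more than SUSPECT_RATIO
     * times the mean of its two neighbours; returns how many were flagged. */
    size_t flag_pressure_outliers(const double *p, size_t n) {
        size_t flagged = 0;
        for (size_t i = 1; i + 1 < n; i++) {
            double neighbours = 0.5 * (p[i - 1] + p[i + 1]);
            if (neighbours > 0.0 && p[i] > SUSPECT_RATIO * neighbours) {
                fprintf(stderr, "cell %zu: %g vs neighbour mean %g\n",
                        i, p[i], neighbours);
                flagged++;
            }
        }
        return flagged;
    }

In a real weather or CFD code the same idea runs over a 2-D or 3-D grid, but it is still a single pass over data the solver was touching anyway.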

Anyway, if you could not do this and accuracy is important, then you really would need to double-check every calculation, because there is no other way to tell whether you had made a mistake.



