
In many ways PageRank is a special case where you need a lot of resources for a short period of time. In such cases you could use PMI, but that's not the hard part of scaling for Google.

From what I have read, Google treats its global compute infrastructure in a reasonably abstract fashion, to the point where production systems share resources across datacenters so that losing an entire datacenter has minimal impact. Thus, when Google people talk about "scale" they often mean getting things to work for long periods of time, efficiently, on flaky hardware spread across the world, which PMI does not do.




You use "PMI" consistently, but I assume you mean "MPI". MPI is not a silver bullet, and if your problem can be reasonably solved with nearly asynchronous algorithms such that fault tolerance is a bigger problem than network performance, then MPI is probably not a good choice. But this is a rather narrow definition of "scale" (and much different from the standard definition over the last several decades of scientific computing), and for more tightly coupled problems, MPI can beat these alternative paradigms by huge margins.
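To make "tightly coupled" concrete, here's a minimal sketch (C with MPI; the grid size and update rule are made up for illustration) of the pattern in question: each rank owns a slab of a grid and has to exchange ghost cells with its neighbors on every single iteration, so one slow or dead rank stalls the entire job.

    /* Minimal sketch of a tightly coupled MPI pattern: each rank owns a
       slab of a 1-D grid and exchanges ghost cells with its neighbors
       every iteration. Sizes and the update rule are illustrative only.
       Build with: mpicc halo.c && mpirun -n 4 ./a.out */
    #include <mpi.h>
    #include <stdlib.h>

    #define N 1024                       /* local cells per rank (made up) */

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int rank, size;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        /* u[0] and u[N+1] are ghost cells filled from the neighbors. */
        double *u = calloc(N + 2, sizeof(double));
        for (int i = 1; i <= N; i++) u[i] = (double)rank;
        int left  = (rank == 0)        ? MPI_PROC_NULL : rank - 1;
        int right = (rank == size - 1) ? MPI_PROC_NULL : rank + 1;

        for (int step = 0; step < 100; step++) {
            /* Every rank must hear from its neighbors before it can
               advance; there is nothing useful to do while waiting. */
            MPI_Sendrecv(&u[1], 1, MPI_DOUBLE, left,  0,
                         &u[N + 1], 1, MPI_DOUBLE, right, 0,
                         MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Sendrecv(&u[N], 1, MPI_DOUBLE, right, 1,
                         &u[0], 1, MPI_DOUBLE, left,  1,
                         MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            for (int i = 1; i <= N; i++)
                u[i] = 0.5 * (u[i - 1] + u[i + 1]);   /* toy smoothing update */
        }

        free(u);
        MPI_Finalize();
        return 0;
    }

With thousands of ranks doing this many times per second, latency and synchronization dominate, which is exactly where MPI earns its keep and where the loosely coupled, fault-tolerant frameworks give that performance back.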


The more cores you have, the more important dealing with failure becomes. Multi-CPU supercomputers have been dealing with these issues for a while: 2-3 CPU failures per day are fairly common, and losing a rack in the middle of a job is all too common.


Yes, but the algorithms are such that failure cannot be made strictly local. For example, when solving a stiff PDE, there is unavoidable and quite complex global communication at every time step (indeed, within the nonlinear solves required in each time step). There is no hope of "continuous local checkpointing" or the like, because the program state is changing to the tune of gigabytes per second per node.
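As a sketch of what "global at every step" means, here is a toy implicit time loop (the arithmetic is meaningless; only the communication structure matters): every Newton iteration inside every time step needs a global residual norm, i.e. an MPI_Allreduce over all ranks, before anyone can proceed.

    /* Toy structure of an implicit PDE time stepper: the residual norm
       needed for the Newton convergence test is a global quantity, so
       every rank synchronizes inside every time step. */
    #include <mpi.h>
    #include <math.h>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        double u = 1.0;                          /* toy local state */
        for (int step = 0; step < 10; step++) {
            for (int newton = 0; newton < 50; newton++) {
                double local = u * u, global;    /* toy local residual contribution */
                /* No rank can decide whether the solve has converged
                   without hearing from every other rank. */
                MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM,
                              MPI_COMM_WORLD);
                if (sqrt(global) < 1e-8) break;
                u *= 0.5;                        /* toy "Newton update" */
            }
        }
        MPI_Finalize();
        return 0;
    }

The real linear and nonlinear solves hide far more communication than this, but the shape is the same: the interesting state lives on every node and changes every step.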

The most common response is to perform global checkpoints often enough that a hardware failure is not too expensive, but infrequently enough that program performance doesn't suffer too much. Bear in mind that our jobs are not normally hitting disk heavily, so frequent checkpointing could easily cost an order of magnitude in run time. When a hardware failure occurs, the entire global job is killed and restarted from the last checkpoint. There are libraries (e.g. BLCR) that integrate with MPI implementations and resource managers to handle checkpoint, restart, and migration.
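For the "often enough, but not too often" trade-off, a common first-order rule of thumb is the Young/Daly approximation: checkpoint roughly every sqrt(2*C*M) seconds, where C is the time to write one global checkpoint and M is the machine's mean time between failures. A back-of-the-envelope sketch with made-up numbers:

    /* Young/Daly first-order estimate of the checkpoint interval.
       C and M below are invented for illustration. */
    #include <math.h>
    #include <stdio.h>

    int main(void) {
        double C = 10.0 * 60.0;      /* 10 minutes to dump state to disk */
        double M = 8.0 * 3600.0;     /* one failure every 8 hours */
        double t_opt = sqrt(2.0 * C * M);
        printf("checkpoint roughly every %.0f minutes\n", t_opt / 60.0);
        /* ~98 minutes with these numbers; more failures or slower disks
           pull the interval down and the overhead up. */
        return 0;
    }

As the failure rate climbs or the checkpoint write gets slower, the interval shrinks until the machine spends most of its time writing checkpoints, which is the scenario described below.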

Note that it is not possible to locally reconstruct the state of the process that failed, because there just isn't enough information unless you store all incoming messages, which can easily be a continuous gigabyte per second per node. Even if you had a fantastic storage system that gave you ring storage with bandwidth similar to the network, local reconstruction wouldn't do much good, because the rest of the simulation could not continue until the replacement node had caught up, so the whole machine would sit idle during this time. If you have another job waiting to run, you could at least get something done, but the resource management implications of such a strategy are not a clear win.

So if hardware failure becomes so frequent that global checkpointing is too costly, the only practical recourse is to compute everything redundantly. Nobody in scientific computing is willing to pay double to avoid occasionally having to restart from the last checkpoint, so this strategy has not taken off (though there are prototype implementations).


There are three reasons why that approach tends to work well in scientific computing circles.

First off individual computational errors are rarely important. For example, when simulating galactic evolution, as long as all the velocities stay reasonable, each individual calculation is fairly unimportant and bounds checking is fairly inexpensive.

Second, there is a minimal time constraint: losing 30 minutes of simulation time a day is a reasonable sacrifice for gaining efficiency in other areas.

Third, computational resources tend to be expensive, local, and fixed. Think Cray Jaguar, not all those spare CPU cycles running Folding@home.

However, if you’re running VISA or World of Warcraft then you get a different set of optimizations.


First off individual computational errors are rarely important.

This is completely wrong. Arithmetic errors (or memory errors) tend not to just mess up insignificant bits. If one occurs in integer data, your program will probably seg-fault (because integers are usually indices into arrays), and if it occurs in floating point data, you are likely to produce either a tiny value (perhaps making a system singular) or a very large one. If you are solving an aerodynamic problem and compute a pressure of 10^80, then you might as well have a supernova on the leading edge of your wing. And flipping a single bit in a structural dynamics simulation could easily be the difference between the building standing and falling.
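To see how little "insignificant" there is about a flipped bit, here is a tiny self-contained example: flip one exponent bit in a perfectly ordinary pressure value and it jumps by dozens of orders of magnitude.

    /* Flip a single exponent bit in a double: the value does not drift
       slightly, it explodes. */
    #include <stdio.h>
    #include <stdint.h>
    #include <string.h>

    int main(void) {
        double p = 101325.0;             /* sea-level pressure in Pa */
        uint64_t bits;
        memcpy(&bits, &p, sizeof bits);  /* type-pun without undefined behavior */
        bits ^= 1ULL << 60;              /* flip one bit in the exponent field */
        memcpy(&p, &bits, sizeof p);
        printf("%g\n", p);               /* on the order of 1e+82 */
        return 0;
    }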

I would argue that data mining is actually more tolerant of such undetected errors because they are more likely to remain local and may stand out as obviously erroneous. People are unlikely to die as the result of an arithmetic error in data mining.

Second, there is a minimal time constraint,

There is not usually a real-time requirement, though there are exceptions, e.g. http://spie.org/x30406.xml, or search for "real-time PDE-constrained optimization". But by and large, we are willing to wait for restart from a checkpoint rather than sacrifice a huge amount of performance to get continual uptime. If you need guaranteed uptime, then there is no choice but to run everything redundantly, and that still won't allow you to handle a nontrivial network partition gracefully. (It's not a database, but there is something like the CAP Theorem here.)


For the most part you can keep things reasonable with bounds checking. If the pressure in some area is 10x that of the areas around it, then there was a mistake in the calculation. If you're simulating weather patterns on Earth over time, having a single square mile 10 degrees above what it should be is not going to do horrible things to the model. Clearly there are tipping points, but if the system is that close to a tipping point the outcome is fairly random anyway.
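Something like the sketch below is what I have in mind by bounds checking (the field layout and the 10x threshold are just for illustration): flag a cell that is wildly out of line with its neighbors and replace it with their average, rather than computing everything twice.

    #include <math.h>
    #include <stdio.h>

    #define NX 8
    #define NY 8

    /* Replace any interior cell that is wildly out of line with its
       neighbors by their average: a cheap sanity check rather than a
       recomputation. */
    static void bounds_check(double field[NX][NY]) {
        for (int i = 1; i < NX - 1; i++)
            for (int j = 1; j < NY - 1; j++) {
                double avg = 0.25 * (field[i-1][j] + field[i+1][j] +
                                     field[i][j-1] + field[i][j+1]);
                if (fabs(avg) > 0.0 && fabs(field[i][j]) > 10.0 * fabs(avg))
                    field[i][j] = avg;   /* treat it as a glitch, not physics */
            }
    }

    int main(void) {
        double t[NX][NY];
        for (int i = 0; i < NX; i++)
            for (int j = 0; j < NY; j++)
                t[i][j] = 20.0;          /* a flat 20-degree field */
        t[3][4] = 3000.0;                /* one corrupted cell */
        bounds_check(t);
        printf("%.1f\n", t[3][4]);       /* back to 20.0 */
        return 0;
    }

The check costs a few flops per cell, which is nothing next to recomputing the whole step.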

Anyway, if you could not do this and accuracy were important, then you really would need to double-check every calculation, because there is no other way to tell whether you had made a mistake.



