U.S. To Build Two Flagship Supercomputers for National Labs (nvidia.com)
50 points by eslaught on Nov 14, 2014 | 22 comments



I spent some time at Livermore, and I work at Oak Ridge now. It's been interesting to see the difference in how the HPC assets are referred to, or not, and I think it reflects the culture and layouts of the labs.

At LLNL it was constant: everyone I interacted with had something to say about being at or near the #1 spot on the Top500 list (with Sequoia at the time), or advice on projects to try to get time on. I went on a tour of the facility too, and it was really neat to get some perspective on the physical aspects of it all.

I've been at ORNL since February, and haven't heard Titan mentioned once. Partially I think this is due to the more cohesive overall mission of LLNL vs. ORNL, but I think geography plays a role as well.

The labs have fairly similar numbers of employees, but with one major difference: ORNL is spread out over a pretty large area, while LLNL is a single square mile. Perhaps as a result of that, groups at ORNL feel a bit more insular. Heck, I've been to the "main" cafeteria once here, because I work on an edge of the campus, but at LLNL I went every day because it was easy to get to.

I wonder if a more compressed area results in better utilization of huge assets like HPC, because there are a lot more connections to be made between different departments. Then again, I also got the feeling that at LLNL it was a carrot they could dangle to draw people out of the valley. ORNL doesn't really have that same local competition for talent.

EDIT: And the article even specifically mentions the difference in lab missions by noting the new computers' uses: security vs. open science.


There was some discussion in Dewar's book "To the End of the Solar System"[1] about the different cultures at the national labs related to the development of nuclear thermal rockets. Something about the Los Alamos guys being all about experimentation, and blowing radioactive material out of the tail of the rocket, until an Oak Ridge director was brought in and mandated more modelling of internal behavior. I may have those labs mixed up; it's been a long time since I've read the book.

[1]http://books.google.com/books/about/To_the_End_of_the_Solar_...


The nuclear rocket stuff was in the 50s and 60s. With the shift to stockpile stewardship at the end of the Cold War, all of the labs became fairly focused on modeling. It also depends on what group you work for. I worked for one of the modeling groups, so we were obviously talking about it all the time. Many other groups did too. Some were more experiment-focused. Most of the physicists treated it as a third branch alongside the classic experimental/theoretical split in physics.


Why do nuclear stockpiles require HPC?


Since actually testing nuclear weapons is banned, the only way to verify (ha! verify with simulation) that the current nuclear stockpile is reliable is with simulation. In brief, the idea is to model the degradation of current warheads and how that affects their performance/reliability.

For example, the National Ignition Facility (the warp core in Into Darkness) was created partly to provide a source of fusion that could be used to verify the computer models used to simulate nuclear weapon stability. I.e., write a general enough model to capture what happens in the fusion bit of a nuke, apply the model to something like what the NIF does, and then actually test it in the NIF. If the NIF experimental outcomes agree with the model predictions, then we have higher confidence that the model's predictions about the actual nukes are useful.

http://en.wikipedia.org/wiki/Stockpile_stewardship


Really? The times I have visited ORNL, they've pretty much talked non-stop about all the fun things being done on Titan.

I am/was affiliated with CASL and used it for some of my research work. It's a nice system and we got amazing performance; it would have taken literally years to run the same calculations back on the cluster at my university. We didn't really take full advantage, though, as we weren't using the coprocessors. But I guess our mission is computational science, so maybe that's why we heard so much about it.

I just saw the most interesting presentation the other day. Tom Evans' group is apparently working on these nice new methods to solve linear systems via Monte Carlo sampling. It's one of the projects in pursuit of exascale systems. The Monte Carlo algorithms are interesting because, while they're a lot slower in terms of CPU hours, they do much better in terms of scaling and resiliency at those sorts of extremes.
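
For anyone curious what "solving linear systems via Monte Carlo sampling" can look like, here is a minimal sketch of the classic Ulam-von Neumann approach (not necessarily the method Evans' group is pursuing; the function name and parameters are made up for illustration). Because the random walks are independent, losing some of them only adds noise, which is where the scaling and resiliency appeal comes from.

    import numpy as np

    def mc_linear_solve(A, b, n_walks=2000, max_steps=25, seed=0):
        # Ulam-von Neumann Monte Carlo solve of A x = b.
        # Jacobi splitting: x = H x + f with H = I - D^-1 A and f = D^-1 b,
        # then estimate the Neumann series x = sum_k H^k f with random walks
        # over the matrix indices (requires the series to converge, e.g. A
        # diagonally dominant).
        rng = np.random.default_rng(seed)
        d = np.diag(A)
        H = np.eye(len(b)) - A / d[:, None]
        f = b / d
        absH = np.abs(H)
        row_sum = absH.sum(axis=1)
        n = len(b)
        x = np.zeros(n)
        for i in range(n):                    # estimate each component x_i
            acc = 0.0
            for _ in range(n_walks):
                state, weight = i, 1.0
                acc += f[state]               # k = 0 term of the series
                for _ in range(max_steps):
                    if row_sum[state] == 0.0:
                        break                 # no outgoing transitions
                    probs = absH[state] / row_sum[state]
                    nxt = rng.choice(n, p=probs)
                    # importance weight: we sample |H| but want H
                    weight *= H[state, nxt] / probs[nxt]
                    state = nxt
                    acc += weight * f[state]
            x[i] = acc / n_walks
        return x

    # Small diagonally dominant test problem.
    A = np.array([[4.0, 1.0, 0.0],
                  [1.0, 5.0, 2.0],
                  [0.0, 2.0, 6.0]])
    b = np.array([1.0, 2.0, 3.0])
    print(mc_linear_solve(A, b))      # noisy estimate
    print(np.linalg.solve(A, b))      # exact answer for comparison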


This is really exciting! (I'm actually running a molecular dynamics simulation on Titan right now.)

What I would like to see, though, is a return to increases in absolute processing speed. GPUs and more nodes are great for simulating larger systems of molecules and atoms, but they are actually worse for simulating longer timespans. For example, one of my projects was a small carbon crystallite of 136 atoms. I ran that one on my laptop because it would have taken just as long on Titan.

Problems like protein folding require a sequential series of operations where each step depends on the last one. Right now, the solution to this is purpose-built ASIC systems (like Anton), but that is a lot of money invested in a machine that can only be used for one purpose.
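
To make the sequential-dependence point concrete, here is a toy velocity-Verlet loop (a generic integrator sketch, not Anton's or any particular MD package's code; the harmonic "force field" is made up). Each iteration consumes the previous iteration's output, so extra nodes can only speed up the force evaluation inside a step, never the number of steps per second.

    import numpy as np

    def velocity_verlet(pos, vel, force_fn, mass, dt, n_steps):
        # Each iteration needs the positions/velocities produced by the
        # previous one, so the time loop itself is inherently serial.
        forces = force_fn(pos)
        for _ in range(n_steps):
            vel = vel + 0.5 * dt * forces / mass
            pos = pos + dt * vel
            forces = force_fn(pos)                # parallelizable across atoms
            vel = vel + 0.5 * dt * forces / mass  # but step k+1 must wait for step k
        return pos, vel

    # Hypothetical harmonic "force field" acting on 136 particles.
    rng = np.random.default_rng(0)
    pos = rng.normal(size=(136, 3))
    vel = np.zeros_like(pos)
    pos, vel = velocity_verlet(pos, vel, lambda x: -x, mass=1.0, dt=1e-3, n_steps=10_000)
    print(pos.shape, float(np.abs(vel).max()))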

Regardless, most of my work is size bound rather than time bound, so Summit will be great!


> GPUs and more nodes are great for simulating larger systems of molecules and atoms, but they are actually worse for simulating longer timespans.

For a number of reasons, that return to raw single-threaded speed is basically not going to happen. CPU single-threaded performance is still increasing, but slowly, and probably not by enough to satisfy your simulation needs.

In a lot of these cases, the only practical solution (assuming you can't spend the money on custom hardware) is to go back to the code and optimize the hell out of it. Partly this means clever low-level optimizations, but it might involve switching to programming models that help make better use of the hardware. For example, S3D, a combustion simulation which was one of the acceptance tests for Titan, runs about 2x faster under Legion (my research project) compared to the previous OpenACC code hand-tuned by Cray and NVIDIA engineers [1].

If that sounds interesting to you, feel free to contact me, and if you'll be at SC next week maybe we can meet up.

[1]: http://legion.stanford.edu/pdfs/legion-fields.pdf


The joke is that the current #1 spot is occupied by a broken machine that has never run at full capacity and spends most of its time OFF or hamstrung to save on power. Tianhe-2 was a huge mistake, poorly planned and poorly implemented.

The American national labs seem to have the experience and patience to implement new systems that work well enough and deliver a reasonable dollar-per-flop ratio.


Like lots of big programs in China, Tianhe-2 was probably more the fulfillment of a national prestige program than a serious scientific undertaking.


Ignorant question: Why are supercomputers still popular? I would have thought that their examples could equally be accomplished with a much more flexible array of less powerful nodes (see Google's search engine as an example).

That way instead of doing a "big bang" upgrade such as this, you just upgrade individual nodes as the technology allows and are almost always "current."

PS - I find their usage of the term "energy independence" hilarious. That was a term coined largely to justify fracking and other environmentally damaging practices in the US. I'm glad to see it has been over-used so much that third parties are now using it to justify other projects...


The general reason comes down to inter-node bandwidth. Things like Google's search or SETI are able to use standard interconnects (or the open internet, in SETI's case) because the problems, link counting and signal analysis, can be broken down into individual pieces that don't interact very much across sections of the computation.

Things like physics simulations, weather, etc., have a lot of interaction across any divisions you could draw to split the data between nodes. To work on these problems you need a faster interconnect between processors than Ethernet provides; InfiniBand seems a popular choice, but I'm no expert on the details of its architecture. These problems also need to move in relative lockstep, meaning loosely connected systems don't work as well, or at least don't provide any real advantage. Node homogeneity also makes these tightly coupled processing nodes easier to manage.

Occasionally there will be upgrades to a supercomputer, but generally they're run 24/7 until they fall so far behind the technology, in both speed and power consumption, that it becomes cheaper to replace the machine entirely than to attempt an upgrade that would swap out more than half the cluster's electronics anyway.


> The general reason comes down to inter-node bandwidth

And sometimes even more than bandwidth, it's inter-node latency, which is where InfiniBand really shines over Ethernet. Calculations with high levels of node-to-node dependency are pretty much the difference between supercomputers and the kind of stuff that large internet companies compute.

I hear (but don't personally know) that Google-style large data center installs are moving towards the Clos-style networks that have been popular in HPC for a long time. These network topologies give equal bandwidth between any pair of nodes, as well as nearly equal latency.


Not just Google. Amazon's "Enhanced networking" uses a feature in 10GbE that's been used in smaller HPC clusters for around 5-10 years now. And MS Azure has InfiniBand backing their highest-tier instance types.

Many datacenters are adopting HPC technologies to reach the scale they need.


> Why are supercomputers still popular?

Scaling.

A very easy way for me to explain it: imagine you could only simulate two molecules in a molecular dynamics simulation, and both had to have 100 atoms or fewer. There's plenty of opportunity there, sure. But what if you wanted to simulate an entire strand of RNA interacting with a protein? There are techniques and stable approximations to avoid the O(n^2) complexity of a molecular dynamics simulation, but next-generation hardware allows you to run the same simulations, with the same limitations, at a much bigger scale, or the previous-sized problem without as many assumptions. Edit: there are other considerations, of course; I am vastly oversimplifying with this example.
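
As an illustration of one such standard trick (a generic sketch, not necessarily the specific approximations any particular code uses): a short-range cutoff plus a cell list turns the all-pairs O(n^2) neighbor search into roughly O(n) at fixed density. The toy below just counts pairs within a cutoff both ways, so the two results can be checked against each other.

    import numpy as np

    def count_pairs_naive(pos, cutoff):
        # All-pairs search: O(n^2) distance checks.
        n = len(pos)
        return sum(np.linalg.norm(pos[i] - pos[j]) < cutoff
                   for i in range(n) for j in range(i + 1, n))

    def count_pairs_cell_list(pos, box, cutoff):
        # Bin atoms into cells at least `cutoff` wide, then only compare
        # against the 27 neighbouring cells: roughly O(n) at fixed density.
        n_cells = np.maximum((box // cutoff).astype(int), 1)
        size = box / n_cells
        cells = {}
        for idx, p in enumerate(pos):
            cells.setdefault(tuple((p // size).astype(int)), []).append(idx)
        count = 0
        for (cx, cy, cz), members in cells.items():
            for dx in (-1, 0, 1):
                for dy in (-1, 0, 1):
                    for dz in (-1, 0, 1):
                        for j in cells.get((cx + dx, cy + dy, cz + dz), []):
                            for i in members:
                                if i < j and np.linalg.norm(pos[i] - pos[j]) < cutoff:
                                    count += 1
        return count

    rng = np.random.default_rng(0)
    box = np.array([20.0, 20.0, 20.0])
    pos = rng.uniform(0.0, 20.0, size=(400, 3))
    print(count_pairs_naive(pos, 2.5), count_pairs_cell_list(pos, box, 2.5))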

Also, the more simulations you can run, the more uncertainties you can account for in Monte Carlo computations, so models can evolve to contain more detail and fewer assumptions.

adding response to: > That way instead of doing a "big bang" upgrade such as this, you just upgrade individual nodes as the technology allows and are almost always "current."

The problem is that programming for a supercomputer usually involves taking advantage of specific features of the hardware to get bleeding-edge performance. Mixing and matching hardware quickly makes deploying the code difficult, on top of the work of writing code that reliably squeezes every ounce out of every chip.


I think in most cases supercomputers are in fact "flexible arrays of less powerful nodes". It's just that you need to change the architecture as you scale up the number and type of nodes. Basically, you need fast data transfer and memory sharing between these nodes if you want to do anything but the most embarrassingly parallel problems.


Picture a hurricane in a movie. It's created by an artist and 'rendered' using lots of processing power. However, each rendered frame depends only on the artist's work, so you could render it backwards or farm out each frame to a separate computer.

Now, suppose you're a scientist who wants to simulate a hurricane. You create a model for the first step (second, minute, or nanosecond) and can send that to a computer, but what do you send to the next computer? It needs the results from the first step to run, and in that case you might as well just keep running the simulation on the same computer. Clearly, to use a cluster you're going to need to break your simulation into pieces and send each part to a separate computer. But now, after each step, all the nodes need to talk to the other nodes to keep the simulation in sync. Worse, the more chunks you separate the problem into, the more parts need to be talking to other parts.

This is also why supercomputers tend to have equal-strength nodes. Having one node finish before the others is pointless, as it needs results from the other nodes to continue. In the end you need not just massive bandwidth but also low latency. If a step takes 1/1000th of a second to run but then has to wait a full second before the next step can start, your simulation gets 1 step per second, not 1,000 steps per second.
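
A minimal sketch of what that per-step chatter looks like in practice, assuming mpi4py is available (the 1-D diffusion problem, constants, and file name are made up for illustration): each rank owns a slab of the domain plus two ghost cells, and every time step starts with an exchange of edge values with its neighbors before any computation can proceed.

    # Hypothetical file halo_demo.py; run with e.g.: mpirun -n 4 python halo_demo.py
    import numpy as np
    from mpi4py import MPI

    comm = MPI.COMM_WORLD
    rank, size = comm.Get_rank(), comm.Get_size()

    n_local = 1000                 # interior points owned by this rank
    u = np.zeros(n_local + 2)      # +2 ghost cells holding the neighbours' edges
    if rank == 0:
        u[1] = 1000.0              # a hot spot on the leftmost rank

    left = rank - 1 if rank > 0 else MPI.PROC_NULL
    right = rank + 1 if rank < size - 1 else MPI.PROC_NULL

    for step in range(500):
        # Every step, each rank must trade edge values with its neighbours
        # before it can update anything; this is where interconnect latency
        # sets the pace of the whole machine.
        comm.Sendrecv(u[1:2], dest=left, recvbuf=u[-1:], source=right)
        comm.Sendrecv(u[-2:-1], dest=right, recvbuf=u[0:1], source=left)
        u[1:-1] += 0.1 * (u[:-2] - 2 * u[1:-1] + u[2:])

    total = comm.reduce(float(u[1:-1].sum()), op=MPI.SUM, root=0)
    if rank == 0:
        print("total heat:", total)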

You also want to back up each node so that if one of them crashes you don't need to rerun the simulation from scratch.


In a sense, these are essentially arrays of nodes; they just happen to be built around Tesla GPUs (which you can think of as video cards without a video output). The advantage of these over a collection of individual machines is that the I/O between separate machines is much slower than the CPU/GPU links within these systems.


From the video in one of the links, it seems that it's still nodes containing both CPUs (POWER) and GPUs (Tesla), some number of each. But instead of PCIe, it uses a new high-speed bus (NVLink), which allows direct GPU-GPU communication and lets the GPU and CPU share memory at the same speeds (it doesn't rely on PCIe to connect the GPU to the CPU).


There is now also an RDMA protocol that understands GPU RAM.


See also: NVLink, which is getting rid of the PCI-E bus between the CPU and GPU: https://news.ycombinator.com/item?id=8609071


This is interesting to see, POWER8 and Volta together. I wonder if they'll push POWER8's memory bandwidth to match Volta's 1 TB/s by 2017 as well. It currently has 230 GB/s, I think.





