The speed of the interconnect really matters for most supercomputer problems. Without knowing the characteristics of that interconnect I would be hesitant to call 10,000 machines at Amazon a supercomputer.
I wonder how they'd count Folding@Home, then: 500K active clients out of 6M total, and only a fraction of those are actually running at any given point in time.
I highly doubt it is anything better than a shared gigabit Ethernet connection, which makes me doubt it has any chance at all of reaching TOP500 levels. I also doubt you could get enough of the HPC instances (where you are more or less promised a dedicated network if you launch enough of them) to reach 5k cores.
Not everyone gets spikes during the holidays. Amazon and other retailers get super busy at Christmas, for example, but it's the slowest time of the year for a lot of web apps.
That said, the last two Decembers we have had trouble launching new instances in us-east-1a, their default and hence most popular availability zone. We solved the problem just by switching some tasks to us-east-1b.
"In order to prevent an overloading of a single availability zone when everybody tries to run their instances in us-east-1a, Amazon has added a layer of indirection so that each account’s availability zones can map to different physical data center equivalents."
This is the beauty of aggregation: when you add many different types of users to the system, the load tends to balance out, because not everybody needs the same resource at exactly the same time. As long as the distribution of resource requirements is relatively even, the larger the system gets, the more stable it becomes.
Statistically, the standard deviation of a sum of independent demands is smaller than the sum of their individual standard deviations (it grows only as the square root of the sum of the variances).
These kinds of edge cases happen much more frequently when you have multiple smaller systems; a larger system helps by averaging them out. Will a larger system run out of resources if everyone requests them at the same time? Of course it will, but that is statistically less likely than with smaller systems.
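A quick way to see the pooling effect, as a toy simulation rather than anything based on Amazon's real numbers: draw independent demand curves for a lot of hypothetical customers and compare the relative variability of one customer against the pooled total (NumPy assumed).

    import numpy as np

    rng = np.random.default_rng(0)
    # Hypothetical peak demand for 1,000 independent customers (arbitrary "server" units).
    demand = rng.gamma(shape=2.0, scale=5.0, size=(100_000, 1_000))

    single = demand[:, 0]
    pooled = demand.sum(axis=1)

    # Coefficient of variation (std / mean): relative swings shrink roughly as 1/sqrt(N).
    print(single.std() / single.mean())   # ~0.71 for one customer
    print(pooled.std() / pooled.mean())   # ~0.02 for the pooled system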
EC2 sprang from the problem that Amazon had to buy a bunch of servers to handle the load around the holidays and these servers went underutilized during the rest of the year. So they decided to lease those resources.
When asked about what happens to EC2 during the holidays, the engineer basically replied that Amazon has priority.
The bottleneck in HPC is less often pure CPU horsepower, though; it is usually cache or memory bandwidth, or the interconnect.
I guess you might be able to build a system in the cloud to provide TOP500 level of performance, but it would be pretty hard even with the fancy EC2 HPC instances (http://aws.amazon.com/ec2/hpc-applications/).
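To put rough numbers on the "memory bandwidth, not CPU horsepower" point, here is a back-of-envelope roofline check. The peak-FLOP and bandwidth figures are made-up but plausible, not measurements of any EC2 instance.

    # Illustrative numbers only (not measured on EC2):
    peak_flops = 100e9    # ~100 GFLOP/s of raw compute per node
    mem_bw     = 20e9     # ~20 GB/s of memory bandwidth per node

    # A daxpy-style kernel (y = a*x + y) does 2 flops while moving ~24 bytes
    # (read x, read y, write y), so its arithmetic intensity is tiny:
    intensity  = 2 / 24   # flops per byte
    attainable = min(peak_flops, intensity * mem_bw)

    print(f"attainable: {attainable / 1e9:.2f} GFLOP/s out of {peak_flops / 1e9:.0f} peak")
    # -> ~1.7 GFLOP/s: the kernel is bandwidth-bound, so extra cores barely help.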
Thanks for pointing out the HPC instances that Amazon has. A few commenters were saying that it's not really a supercomputer without a fast interconnect. Yes, they have that! You just pay more for those instances.
In my experience Amazon did a pretty good job setting things up. It's fun to play around with the HPC instances; you can get some sweet performance out of them.
Amazon can. The article has no information about how the nodes were allocated. They could have hand-picked X racks of nodes that were all connected to the same switch, etc. You don't get that guarantee from AWS.
Fair enough, I guess.... they could have done many things.
Although they do not provide an answer, here are some links with additional info. I spent some time searching for more detail on the Top500 setup, but found little:
> its calculations were "embarrassingly parallel," with no communication between nodes
That's probably the only type of process that would work in the cloud. Most HPC applications require lots of communication between nodes, so I don't think I would call this a proper supercomputer.
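For anyone unfamiliar with the term, "embarrassingly parallel" just means each task needs zero communication with the others. A toy sketch, where the scoring function is a made-up stand-in rather than Genentech's actual workload:

    from multiprocessing import Pool

    def score_item(item_id):
        # Stand-in for one independent job: it needs no data from any other task,
        # so total throughput scales linearly with the number of workers/nodes.
        return item_id, sum(i * i for i in range(10_000))

    if __name__ == "__main__":
        with Pool() as pool:
            results = pool.map(score_item, range(1_000))
        print(len(results), "results computed with zero inter-task communication")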
I must say these are my favorite types of articles on HN. I also think these are the perfect use cases of cloud computing platforms such as AWS. Not sure why massively parallel and "embarrassingly parallel" computing intrigues me.
Am I missing something? If performance scales linearly, they have 1,000 computers internally (1/10 of 10,000), and the job was said to take 8 hours. That would only be 80 hours if they hadn't used this service.
This makes me believe someone is lying about something in this article.
Perhaps their internal capacity is already tied up in other tasks, so while they have 1000 cores internally, they can't all be monopolized for 80 hours for a single task like the AWS machines can.
You are misunderstanding. P=NP is a theoretical, in some ways even a philosophical question about the limits of logic and knowledge. Many problems that are NP complete have algorithms to find almost perfect answers in polynomial time. P=NP is about knowing if you can find perfect answers.
Then there are many problems out there which are not NP-complete and for which we are nowhere near finding fast, accurate solutions. The problem is not that logic prevents us, but that we're simply not clever enough yet.
What I'm trying to say, in a roundabout way, is that spinning up many cores will not help you find a perfect, fast solution to an NP-complete problem. And the fact that you have 10,000 cores is not, by itself, an indicator that a given problem is difficult to solve, regardless of its complexity class.
Example: a brute-force attack on an encryption algorithm that uses a 256-bit key would require trying out all possible keys, which is 2^256 ... and right now that would take far longer than the age of the universe to complete.
AND, most importantly, dividing that number by 10,000 (the number of computers in the article), or heck, let's be generous and say we have 1,000,000,000 computers ... would be absolutely meaningless.
It's simple really -- 2^256 / 1 billion computers =~ 2^226 -- and computing it still takes far longer than the age of our universe.
And let's say that with advances in technology you can have 70,000,000,000 computers (that's 70 billion computers, or a 700,000,000% increase over the number in our article). Never mind the energy required to power them, the storage capacity needed, or other such nonsense. So instead of 2^226, you now have 2^220 keys to go through per machine, an absolutely meaningless decrease, and it still takes far longer than the age of our universe.
As a fun exercise, try figuring out how many computers would be required to bring that number down to ~ 2^200 -- that would still take far longer than the age of our universe to compute ;)
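Spelling out the arithmetic above in a few lines; the keys-per-second rate is a deliberately generous assumption, not a real benchmark:

    from math import log2

    AGE_OF_UNIVERSE_S = 4.3e17    # ~13.7 billion years, in seconds
    KEYS_PER_SECOND   = 1e9       # generous: a billion keys tried per second per machine

    for machines in (10_000, 1_000_000_000, 70_000_000_000):
        keys_each = 2**256 / machines
        ages = keys_each / KEYS_PER_SECOND / AGE_OF_UNIVERSE_S
        print(f"{machines:>14,d} machines: ~2^{log2(keys_each):.0f} keys each, "
              f"{ages:.1e} universe-ages to finish")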
Humour is pretty much seen as noise on Hacker News. If a joke also has some insight about the article, it will get upvotes, but if it's just a joke it gets downvoted.
Seems like those guys knew nothing about HPC. Why didn't they run the LINPACK benchmark? It's essential for measuring any parallel computing system, even one of just two cores. Also, any first-year CS student knows that the most significant part of an HPC system is not the cores but the network. You need to connect hosts with InfiniBand or the like; regular Ethernet is futile because of its high latency, and you will waste 90% of your CPU cycles in data-exchange/synchronization wait loops. I bet they could have achieved far better results on just 1/3 of the cores, or even fewer.
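A rough sense of why latency dominates; the figures below are typical ballpark values, not measurements of Amazon's network. Every synchronizing round trip stalls a core for roughly latency times clock rate cycles:

    CLOCK_HZ = 3e9    # ~3 GHz core

    # Ballpark round-trip latencies (assumptions, not benchmarks):
    for name, latency_s in (("gigabit Ethernet", 100e-6), ("InfiniBand", 2e-6)):
        stalled_cycles = CLOCK_HZ * latency_s
        print(f"{name}: ~{stalled_cycles:,.0f} cycles stalled per synchronizing round trip")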
LINPACK is about as useful a benchmark as BogoMips.
""". Genentech benefited from the high number of cores because its calculations were "embarrassingly parallel," with no communication between nodes, so performance stats "scaled linearly with the number of cores," Corn said."""
Not that I think it entirely bursts your internet-tough-guy rant, but Amazon does offer cluster compute instances (http://aws.amazon.com/ec2/hpc-applications/) for exactly this purpose. Granted, they only have 10 Gigabit Ethernet, but it's not exactly like this is some failure of a cluster running all over a busy datacenter on 10 Mbit Ethernet.