Hi Peter, another ex-inktomi/ex-yahoo guy here. I worked on this infrastruture m...

0xbadcafebee · on Aug 27, 2013

What you're saying makes total sense in terms of complexity and time for turning out a whole new platform. But to my view it depends a lot on your application.

Was the app myrinet-specific? If so, I can understand increased difficulty in porting. But at the same time, in 1999 and 2000, people were already building real-time clusters on Linux Intel boxes with Myrinet. (I still don't know exactly what time his post was referencing) If Diego's point was that they didn't move to Linux because Gigabit wasn't cheap enough yet, why did they stick with the expensive Sun/Myrinet gear, when they could have used PC/Myrinet for cheaper? I must be missing something.

I can imagine your topology changing as you changed your algorithms or grew your infrastructure to work around the massive load. I think that's natural. My point was simply that making an attempt to understand your limitations and anticipate growth is completely within the realm of possibility. This doesn't sound unrealistic to me [based on working with HPC and web farms].

What I meant to say was "every single issue", as in, individual problems of scale, assumptions made about them, and how they affect your systems and end users. It's a broad paper that generically covers all the basic "pain points" of scaling both a network and the systems on it. You're going to have specific concerns not listed, but it points out all the categories you should look at. I believe it even went into datacenter design...