
Humans aren't perfect. Open source projects have missed unescaped spaces in directory paths that caused the deletion of /usr (Bumblebee), video games have forgotten to check the cwd and deleted vital Windows boot files (Eve Online), and operating systems have forgotten to check passwords (macOS).

Given this was 2010, I wouldn't be surprised if they had a less mature development process that didn't use things people take for granted today, like linters and pre-merge code reviews.

If you assume humans are perfect, maybe 95 times out of 100 it works out and 4 out of 100 it's a small oops, but 1 in 100 you get something embarrassing like this.




I was like "hold on, 2010 wasn't that long ago!" but thinking about it now, things like GitHub and its code review flow via pull requests only came onto my radar around 2012, and even then it took a long time before it was widely adopted.

Around that time I was in the middle of the Java ecosystem though, which had tons of tooling for validation, linting, verification, books full of best practices, etc.

Still a far cry from what is normal these days, but we did do things like unit tests, end-to-end tests and code reviews back then. But that was enterprise; I suspect the SF web companies weren't as enterprisey as the bank I worked for at the time.

It's ironic though; the old-fashioned companies I've worked for adopted a lot of the more fast-and-loose-feeling practices from SF: think extreme programming, agile, taking a chance on rewriting the whole back-end in a new language, things like that.


Completely disagree with the "humans are not perfect" approach. This was not a one-opportunity-only event. Every test case they did not run, every load simulation they did not perform, every chaos test they decided to overlook was an example of humans failing repeatedly. Massive websites had been running since the early nineties, and robust software practices have been a thing since the sixties.


Agreed, but this was not a solid company in the first place. Rather, Digg was (according to TFA) a failing startup losing money hand over fist that was launching v4 as a Hail Mary. They were so hard up that no hardware was available for hosting the new environment, so they started reusing v3 servers for v4 while the migration was ongoing.

They bet it all on this v4 thing, and lost.


Funny you should mention that there was no hardware available. Digg had a second cage full of servers in VA as part of a failed attempt to go multi-DC. If that project had ever worked out there would have been 50% more capacity. Those servers and network gear just sat there unused until they finally got liquidated for pennies on the dollar.


> Humans aren't perfect.

Humans are good at designing processes and procedures that compensate for their lack of perfection. In this case that capability apparently was not used.

Anyone can make a mistake and that's fine, but a company would be expected to have processes and enough eyes so that a bunch of other people would need to make the same mistake before it made its way into prod, at which point the likelihood of a mistake becomes really low...


I don't know Python, but I have a feeling things like this weren't as well known back in 2010, and the reason they're so well known now is from people doing them at sites like Digg and then telling everyone it's a really bad idea. There are so many things that I know now for programming that were pretty much unknown 10 years ago because no one had really encountered them in a large-scale setup.


Nah, this has been very well known for decades and is very easy to spot.


Yeah, this is covered in pretty much every style guide, tutorial, and reference, and it's an easy screener in an interview. I'm also surprised anyone using Python as more than a toy experiment had this issue, especially in such a critical service.


I've worked on a lot of Python as a hobbyist and student, much of it in real-world use - probably at least 100kloc - and I've never encountered this. Where is this information? I'm worried that I'm missing some other "obvious" things. I've of course run into Python's shallow vs. deep copies, but I don't remember default values being shared between invocations.
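For anyone who, like this commenter, hasn't run into it: the gotcha the thread is circling around is that Python evaluates a function's default argument values once, at definition time, so a mutable default (list, dict, set) is shared across every call that omits the argument. A minimal sketch with hypothetical names, not Digg's actual code:

```python
def add_item(item, bucket=[]):
    # The default list is created once, when `def` executes.
    # Every call that omits `bucket` appends to that same list.
    bucket.append(item)
    return bucket

first = add_item("a")
second = add_item("b")
print(second)  # ['a', 'b'] -- state leaked between invocations

# The idiomatic fix: use None as a sentinel and build a fresh
# list inside the body, so each call gets its own.
def add_item_fixed(item, bucket=None):
    if bucket is None:
        bucket = []
    bucket.append(item)
    return bucket

print(add_item_fixed("x"))  # ['x']
print(add_item_fixed("y"))  # ['y'] -- no sharing
```

In a long-running web process this is especially nasty, because the shared default accumulates data for the lifetime of the worker rather than being reset per request.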


If you do a search for "python common gotchas" it's almost certain to come up, usually pretty prominently.


Still... did no one with even moderate Python experience even glance at this very important endpoint at any stage during those 4 weeks? Like I say, this is the kind of thing that jumps out of the screen for an experienced developer.


In 2010? You'd be surprised, before the widespread adoption of git, at how many places the process was individual developers committing changes unsupervised. I'm not sure we even disagree that that's the problem - you feel a second person should have looked at it, I feel the process should require that. A code review might be something where they'd sit down with a selected piece of code they felt was risky, on a biweekly basis or something.


In 2010, access to prod over ftp:// might have still been a thing, even at Digg.


Forget code review. This is one of their most important endpoints. They had severe issues for 4 weeks. Did no one even glance over the code and spot this glaring screw-up?


Considering the context, most people were probably overloaded with work and a lot of changes must have been made with little to no oversight.



