
Agreed that's impressive debugging for this issue. But...

> 10 minutes to roll out the fix

That seems very slow to me. 30% of their downtime was because their deploy process is slow.




http://nickcraver.com/blog/2016/05/03/stack-overflow-how-we-...

FWIW, here's a write-up on their process.

Also, I imagine that the 10 minutes included dev and testing, not just the deployment part of "rolling out".


The post seemed to say that 14 minutes were spent writing the code to fix it, which I assumed included testing it, but yeah, not entirely clear.

Maybe "deploy" means the "Deploy" section of this article: http://highscalability.com/blog/2014/7/21/stackoverflow-upda...

Seems to target only being able to "deploy 5 times a day". I guess maybe the build time is the limiting factor.


Correct. 10 minutes was from check-in to all servers built out, including a dev and staging tier.


A 10-minute rollout to production is insanely fast. A rollout usually goes through build, test, staging, and then farms of production servers, with smoke tests at each stage along the way.


Rollout to thousands of servers on WordPress.com is typically less than 60 seconds. We optimize for reverting fast.

It's just interesting to me what the implications are of what folks optimize for, and that this is considered fast. We have very minimal deploy testing and optimize for being able to revert quickly when there are problems, because performance issues like this are very hard to predict. That probably means we create many smaller, short hiccups though (which generally are not a full site crash).



