Granted, I'm only on the fifth chapter currently, but this is the first IT-oriented book that I've genuinely had a hard time putting down. It's so exciting to be able to see so many components of such a successful software (and hardware) organization, and as they mention, to see the reasoning behind the decisions rather than just a dump of "here's what we decided."
This book is great. I highly recommend it to anyone architecting modern, cloud-based applications. At the very least it introduces a lot of concepts that many different open source projects implement. It'll give you the "why" when you're looking at the "how".
I'm also enjoying the book. I really do like it, although it's driven home the fact that I cannot get by (as a systems engineer) without low-level programming knowledge and experience.
It's a strange feeling, seeing the end of the line for skills that took a decade of slow grinding to acquire.
I'm not sure I agree that you need low-level programming knowledge and experience. The industry is huge, and there's a need for talent with a variety of skills.
If anything, I would argue that the need for low-level languages is disappearing for the general case. As "The Cloud" gets bigger and bigger, and the low-level things are handled more and more by service providers who abstract away more and more of the day-to-day, it becomes easier for you to focus on higher-level problems.
I do think that the days of being purely ops and needing nothing but shell scripting are going away though. You gotta pick up some python or similar at this point.
> I do think that the days of being purely ops and needing nothing but shell scripting are going away though. You gotta pick up some python or similar at this point.
Maybe 5-10 years down the line, but not this year or next.
I was feeling the exact same way about a year ago. The thing is, once you pick up a programming language, all of your other hard earned Ops skills are still very relevant and will become much more valuable.
Without learning some programming you will go the way of the dinosaur, but by learning a bit you can become significantly more effective than you were before.
I don't necessarily think we're at the end of the line for any skills, just perhaps at the beginning of the line (only time will tell) for a new type of split. Either way, existing knowledge seems not to be EOL, but rather complements skills that were traditionally kept separate.
Additionally, it always takes plenty of time for legacy systems and processes to become forgotten, so even for those who can't or simply don't want to adapt, I feel like there's plenty of work still out there at places where change must occur more slowly, or where long-term investments have been made.
I'm a freelancer who always winds up as "the ops person", and I'm just waiting for my copy of this book to come from Amazon. I watched a video of a talk about this book; sadly I can't find it on YouTube anymore. A HN comment linked to it last week. It was interesting, but I'm hoping the book will have a lot more meat.
A few thoughts:
> Google places a 50% cap on the amount of “ops” work for SREs: Upper bound. Actual amount of ops work is expected to be much lower
I didn't catch the "upper bound" part from the talk. Good to know! I really enjoy being a developer-who-does-ops. I wouldn't want to be a sysadmin, and 50% ops is probably my limit for happiness.
> I don’t really understand how this is an example of circumventing the dev/ops split
I felt the same way from the YouTube talk. I think there must be a lot behind the SRE role that makes it successful or not: culture, policies, who you hire, how you train, etc. Also I feel like the best sysadmins have been encouraging coding and automation for a long time, e.g. Thomas Limoncelli. But I've certainly been on the "dev" side of the dev-vs-sysadmin fight before, and it makes sense to be seeking ways to improve things.
> Error budget. 100% is the wrong reliability target for basically everything
I think I saw just this month that Google Apps uptime is 99.95%? Some major Google service. I remember in the early 2000s everyone cared about "5 9s", and I feel like for most of us that is just not worth the effort.
> Chubby was so reliable that teams were incorrectly assuming that it would never be down
This reminds me of Nygard's point in Release It! that your theoretical best SLA is the product of your dependencies' SLAs, e.g. 0.999 * 0.999 = 0.998. But in the world of microservices, this logic seems likely to make you underestimate your uptime.
Also I think Feynman's remarks about the Challenger accident apply here: if you are building a new product with, say, 5 microservices, you don't know the reliability of any of them yet. It's dubious to estimate low-frequency events based on "it hasn't happened yet."
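Nygard's rule of thumb is just a product over your hard dependencies. A quick sketch (the availability numbers are made up for illustration):

```python
# Naive serial-dependency model: a request fails if ANY hard
# dependency fails, so availabilities multiply. This is a ceiling,
# not an estimate -- mitigations (caching, fallbacks) can beat it.
def composite_availability(dependencies):
    result = 1.0
    for availability in dependencies:
        result *= availability
    return result

# Two three-nines dependencies already put you just under three nines.
print(round(composite_availability([0.999, 0.999]), 6))  # 0.998001
```

Each additional hard dependency only drags the ceiling down, which is why the number of things you *synchronously* depend on matters more than any single SLA.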
Thanks for sharing your notes. I'm envious you've already got a copy. :-)
Regarding "theoretical best", I think that is "in the absence of mitigations". I think you can build a service with a higher SLA than one of its dependencies, but only if you recognize that impedance mismatch and build in defenses.
As a contrived example, if you've got a microservice that provides data FOO about a request that isn't actually end-user critical, you can mitigate your dependency on it by allowing your top-level request to succeed even if the FOO data is missing. Or maybe you can paper over blips of unavailability with cached data.
But, yes, know what you depend on and how reliable they are, then see if you need to take more action than that if your target is higher than the computed target.
(Tedious disclaimer: my opinion only, not speaking for anybody else. I'm an SRE at Google)
Building reliable services out of unreliable dependencies is a part of what we do. At the lowest level, we're building services out of individual machines that have a relatively high rate of failure, and the same basic principles can be applied at every layer of the stack: make a bunch of copies, and make sure their failure modes are uncorrelated.
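The "make a bunch of copies with uncorrelated failures" arithmetic is worth spelling out: if each replica is independently up with probability p, then at least one of n replicas is up with probability 1 - (1 - p)^n. A toy illustration (independence is the big assumption; correlated failures break it):

```python
def replicated_availability(p, n):
    """Probability that at least one of n independent replicas is up."""
    return 1 - (1 - p) ** n

# Three individually unreliable machines (99% each) together get you
# close to six nines -- IF their failures are truly uncorrelated.
print(round(replicated_availability(0.99, 3), 6))  # 0.999999
```

Which is why so much of the effort goes into decorrelating failure modes (separate racks, power domains, regions) rather than into making any single machine more reliable.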
In the early 2000s all the cool people were running Sun E450s, mod_* and Oracle, which makes 5 9s a little more reasonable than we see today with multiple layers of things to go wrong on less-managed platforms. I'm not throwing shade on any particular technologies, I'm just saying that the deployment and configuration management side of things these days is more complicated.
I don’t really understand how this is an example of circumventing the dev/ops split.
My understanding, from talking to a friend who is an SRE, is that SREs are also part of the design process. The developers want resources, so they contact and work with SRE teams to make sure their project is both planned for in capacity and can be efficiently served. If it can't be served, maybe another component needs to be deployed that makes the data efficiently usable for the new app or feature (I'm unsure on this, but it sounded like it may have been implied).
That is, SRE teams become devs of certain components of the project, and work to support the project when in development. This should defeat some of the dev/ops split, because SREs also work on the same project, and are invested in its launch and success.
When it comes to alerting, yes. I've seen it tried many times by competent engineers. The problem is that once you get beyond toy examples into situations with even a mere 10k time series there's so much noise that you can't get any useful signal.
> We could really use something like Outalator, though.
I've not found anything like it yet, unfortunately. Hopefully someone will be inspired to write one by the book, there should be enough detail there to do it.
> the request was rejected because the error case should never happen.
I haven't run into this mindset much at my current job. But in general I think I've been able to lobby for "well, can we at least have a special case that would leave a breadcrumb behind if it does occur?" That way the investigation when it does inevitably occur is swift and there's less debate among ambiguous choices about how to change the design going forward.
I've also found fault injection testing as a great way for disproving statements about what "can never happen."
That said, I've seen the other extreme too -- checking pointers against NULL just prior to dereferencing at every opportunity up and down the stack. In these cases function/module authors succeed only in moving the eventual crash to somewhere far disconnected from the origin of the problem.
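The "breadcrumb" idea above is cheap to implement. A hedged sketch (the lookup function and log format are invented for illustration): instead of scattering null checks everywhere, check the invariant once where it's supposed to hold, log enough context to make the investigation swift, and fail loudly there rather than letting the bad value propagate and crash somewhere far from the origin.

```python
import logging

logger = logging.getLogger(__name__)

def lookup_user(user_id, index):
    """index maps user_id -> record; a missing id 'can never happen'."""
    record = index.get(user_id)
    if record is None:
        # The "impossible" case: leave a breadcrumb with context,
        # then fail here, close to the origin of the problem,
        # instead of returning None and crashing several frames away.
        logger.error("invariant violated: user_id %r not in index", user_id)
        raise KeyError(f"user {user_id!r} missing from index")
    return record
```

The point is the placement: one loud check at the invariant's boundary, not defensive checks at every dereference up and down the stack.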
What I find interesting about how Google/AWS/Netflix are set up is the line they draw between ops and devops. Development teams are expected to run their own services, but only after a while. SREs are there to help make the transition; they are the ops experts. I think there's an important lesson here for smaller shops and startups: don't throw out ops! Your in-house ops people are Google's SREs. Most devs I work with could not run services in production without significant help. Google, etc., have great structures in place to handle this. Smaller shops should take care.
I don't think the trend toward having devs handle all the ops is a good idea. Ops teams, for good reason, tend to be way more conservative than devs. Plus I don't think it is effective to constantly interrupt dev teams with operational issues.
Interrupting them is what pushes them to ship quality code rather than just throwing it over to ops and hoping for the best. It works quite well, with deployment velocities that would shock even normal devops shops.
Agreed, but there is definitely a trend towards pushing devs to think more about ops. We internally use chef/kitchen and get devs to run their code on kitchen unit tests before it can be considered ready for a PR. Previously, we only did unit tests, and it's a great improvement. But, yes, the bar is much higher from that to actually running in production.
In AWS, if anything, it's the other way around. Devs always run their own services at first, and ops teams start helping out once the service reaches some kind of scale where a dedicated ops team makes sense. (source: I work for AWS)
Unfortunately shipping at O'Reilly is $49 for my address in europe - literally more than the book + ebook bundle. Amazon on the other hand has the book for €30 and free shipping.
I always wonder why companies like O'Reilly think there is no decent market for them here, would have loved to order directly from them and get the ebook as well.
I didn't even work out what the O'Reilly shipping was, since it required me to log in and go right to the end of the ordering process.
I tend to avoid ordering books from Amazon too since they charge $5 per book plus $5 per order which usually makes them uncompetitive. Strangely enough I end up ordering most of my books via the Book Depository which is owned by Amazon anyway.
Bought this book last week and intend to get through it shortly. But, I'll also plug this excellent paper by James Hamilton (of AWS): "On Designing and Deploying Internet-scale Services" - https://www.usenix.org/legacy/event/lisa07/tech/full_papers/...
> I don’t really understand how this is an example of circumventing the dev/ops split. I can see how it’s true in one sense, but the example of stopping all releases because an error budget got hit doesn’t seem fundamentally different from the “sysadmin” example where teams push back against launches. It seems that SREs have more political capital to spend and that, in the specific examples given, the SREs might be more reasonable, but there’s no reason to think that sysadmins can’t be reasonable.
Seems we don't understand the point of SREs at all.
In a world where "ops" and "dev" are split and sysadmins occupy "ops", it is customary that system admins are not programmers, may not know how to program, do not venture into the VCS for the codebase, and may not even have rights to check in code. It would certainly be unusual to see check-ins to the codebase from the ops/sysadmin team.
This leads to the situation where you have 10 year old codebases running on 10 year old frameworks on 10 year old operating systems. The system admins are naturally tearing their f---ing hair out over this situation. The devs, however, have the platform on life support and are off writing new code for shiny systems because that's a lot more interesting and useful than keeping the old garbage on life support. No progress is made and the problem typically doesn't resolve itself until the old systems become sufficiently problematic that the devs rewrite the entire system.
If you have an SRE model it should never get quite this bad. The Devs will support the code in production until they're ready to hand it over for maintenance. When it is handed over the SREs get all the keys to the kingdom and have the rights, responsibility and ability to fix bugs in the software they're running.
If you have legacy codebases run entirely by ops people who don't have any ability to maintain the codebases then you aren't doing SRE.
This is one of the manifestations of the "chinese wall" between ops and dev (which is what "DevOps" and "SREs" are entirely antithetical to--it may be hard to define what those terms /are/ but it is pretty easy to define some patterns that they definitely /are not/). If "Ops" has to come begging to "Dev" to fix their software then you're not doing it right.
> When it is handed over the SREs get all the keys to the kingdom and have the rights, responsibility and ability to fix bugs in the software they're running.
Bug fixing is still the responsibility of the developers, which isn't to say that SRE won't help out at times but it's not their role.
> If you have legacy codebases run entirely by ops people who don't have any ability to maintain the codebases then you aren't doing SRE.
An SRE is not a maintenance engineer. Service ownership is always a partnership between SRE and developers. If there's no developers, then there's no SREs.
And in fact, it's common for SREs to hand obsolete services back to the dev team if they've been mostly phased out to the point where they're no longer the primary priority.
> First, I normally take pen and paper notes and then scan them in for posterity. Second, I normally don’t post my notes online, but I’ve been inspired to try this by Jamie Brandon’s notes on books he’s read. My handwritten notes are a series of bullet points, which may not translate well into markdown. One issue is that my markdown renderer doesn’t handle more than one level of nesting, so things will get artificially flattened. There are probably more issues. Let’s find out what they are! In case it’s not obvious, asides from me are in italics.
This is a problem that would be overwhelmingly solved by org-mode in Emacs. Writing the summary in org-mode means you're a 'C-c C-e' away from exporting to HTML or LaTeX.
I don't know anything about Google and its SREs. Do they really just hire software devs and expect them to be good at Ops? The opposite, hiring a sysadmin and then assigning them to a software team to develop a product, seems equally problematic.
They want a cross between a developer and a sysadmin.
When I was there they hired people with one background and hoped they would be capable of the other. This didn't work out for about half of them. Making things worse, they had no procedure to say, "OK, we hired someone we think is good, but this is clearly the wrong role for them." Which led to losing a lot of people who had potential.
Making it personal, I interviewed for a pure dev position then was offered a job as an SRE. It turns out that I don't make a good sysadmin. After I left, I was amazed at how many people I met knew of someone else who had had the same experience.
Ironically, recruiters see a year as a Google SRE on my resume and think I might want to join SRE teams at other companies. Most of that stopped after I changed my LinkedIn profile to say that I am never interested in SRE jobs.
Explaining the concretes would be a lot of unpleasant detail. The why is far more interesting.
I am very good at "tunnel vision". Taking a thread and following it in depth. Completely learning a system in depth so that anything that comes up, I know exactly where to go and what to do.
I am weak on "peripheral vision". Keeping track of 15 balls in the air, and operating with limited information about each. Being effective with limited context about each individual system because there isn't time to learn any of them in depth.
Tunnel vision is a virtue in a software developer. Peripheral vision is essential for a sysadmin.
Now imagine a person with poor peripheral vision trying to learn a system as big and complex as Google's. And being responsible to support 15 different pieces of software written by 15 different teams so that when anything went belly up with any of them, you can trouble-shoot and get it running again.
Nothing particularly bad happened, but I also wasn't accomplishing the job to the standards that Google wanted in that role. And I was not the only person on my team failing in that way.
Ideally it would have been my manager's job to say, "I recognize that this person isn't working out here, is there a better role for them?" That didn't happen. Several months later my manager got fired. I was privately told that my situation was a trigger, but I don't know details.
The fact that my situation is fairly common strongly suggests that the ultimate failure was organizational and systemic. I don't fault Google for hiring developers into their hybrid developer/sysadmin role. I do fault them for not having an explicit onramp/reconsideration process to mitigate the risk that they create by doing so.
Well that's just the thing - an organization that has a "system" or "process" for handling employees and their work is by design less than flexible. In a small to medium sized business, you might have 3 completely separate roles, and others might pitch in as needed. But at a large corporation, it's simpler and more efficient for them to have one person who does one job. When the peg no longer matches the hole, they are replaced or reassigned. Which is sad, because people with valuable skills are often underutilized. I feel like companies like Valve might have the right idea going forward.
The problem is that larger organizations actually need to have a defined process of some sort. There is a point beyond which individual judgment doesn't scale.
In general I believe that Google has a pretty good process. However every process has bugs. And I happen to have encountered one where they pick people who are qualified in one role into a more experimental one, and it only sometimes works out for them.
It's notoriously hard to hire for. They cherry-pick devs that are 85-99%[0] of the way to passing the Google SWE hiring bar, who in addition have ops experience.
The Google SRE head honcho says: [1]
>Fundamentally, it’s what happens when you ask a software engineer to design an operations function.
So I think the short answer to your question is yes, they hire software devs and train them [2], but require them to already have some ops/Linux internals knowledge.
They kinda do. Over time, the skills of a dev and SRE diverge a bit but at the core they are the same. I am a dev at Google but went through a program called Mission Control[1]. (It was a great learning experience and helped me do my job as a dev much better.)
As others have said, there is a pretty intensive training that you go through and obviously you learn a lot when you are handling things...
You're quite right. I'm an ex-Googler and SRE was the last thing I did at Google. As a first-and-foremost programmer I was bored out of my wits and frustrated by the fact that while I was writing repetitive monitoring rules actual development was happening somewhere else and I had little impact on the design of the system. Granted, a lot of folks over in SRE loved it but I would venture to say that most of them were much closer to being a sysadmin than programmer. Folks like myself tended to 1) move to SWE, 2) quit and 3) some endured and became managers over time.
Although I can't find the link now, I've read that there is a pretty intensive training process for SREs. This would make sense since Google's infrastructure is so unique.
From my limited experience, they want devs. The initial phone screen is all sysadminy questions (e.g. what commands show you I/O usage?). The second screen is a coding interview. I'm a sysadmin who works in bash every day. The coding questions are mostly related to log parsing. My interviewer didn't seem to even acknowledge that bash was a programming language and did not understand a simple grep-and-sort pipeline.
Edit: Sorry everyone! I totally got my LinkedIn and Google interviews mixed up. What I described above is my LinkedIn experience.
There are actually two different SRE roles: the one people are describing above where you are 85-99% of the way to SWE (Software Engineer), and you have sysadminy experience, and another one where you are 100+% of the SWE bar and optionally have sysadminy experience. The former is called SRE-SE (Systems Engineer), and the latter SRE-SWE.
SRE-SE interviews are super heavy on the sysadmin stuff usually, with less (but still significant) attention paid to SWE skills, whereas SRE-SWE interviews may not even have an SRE component (it's possible for candidates in the 'normal' SWE hiring pipeline to be shunted to SRE-SWE post-interview).
Yeah a lot of people don't understand this distinction. You have your pure SWEs who were hired that way who then were either picked for or switched to SRE-SWE. Then you have people who were recruited into SRE-SWE from the beginning. People in SWE and SRE-SWE job classes can freely move between them. Then finally you have people who were recruited as SRE-SysEng, or were recruited as SWEs and didn't quite make the cut. These folks have to do a transfer interview to jump to the SWE or SRE-SWE roles.
I'm an SRE-SE and regularly do phone interviews for SRE-SE candidates.
While I do tend to spend more of the interview time talking about sysadmin tools, operating systems, networking, databases, security and troubleshooting, I still expect candidates to have reasonably good coding chops.
The difference is that the coding questions tend to be more task-oriented or procedural (i.e. log processing, building automation pipelines, implementing standard unix cli tools, etc.), rather than the algorithmically challenging or math-oriented problems that we'd usually ask SWE candidates.
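A typical task of that flavor, roughly the Python equivalent of a `grep | sort | uniq -c` pipeline over a log (the log format here is invented for the sake of the example):

```python
from collections import Counter

def top_error_sources(log_lines, n=3):
    """Count ERROR lines per source host, most frequent first.

    Assumes a made-up log format: '<host> <level> <message>'.
    """
    counts = Counter()
    for line in log_lines:
        parts = line.split(None, 2)
        if len(parts) >= 2 and parts[1] == "ERROR":
            counts[parts[0]] += 1
    return counts.most_common(n)

log = [
    "web1 ERROR timeout talking to db",
    "web2 INFO request ok",
    "web1 ERROR timeout talking to db",
    "db1 ERROR disk full",
]
print(top_error_sources(log))  # [('web1', 2), ('db1', 1)]
```

Nothing algorithmically deep, but it shows whether a candidate can parse messy input, pick the right data structure, and handle the edge cases (short lines, missing fields) without being told to.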
Both the SE and SWE side SRE candidates need to be able to design and reason about large systems, making trade-offs between performance (especially latency), redundancy and cost.
Thanks. Is being well-versed in C a prerequisite for the role then? I'm imagining you need to be fluent in at least one statically compiled language or ???
In my interviews you code in whichever language you prefer. Some interviewers will ask you to use a specific language that's mentioned on your resume. In general I think that if you show strong coding skills in some language, it is believed that you won't have much trouble teaching yourself the languages your team uses (typically some subset of C++, Java, Python, Go, Borgmon).
Weirdly my experience with interviewing for an SRE role was the reverse: the phone screen was pretty much a straight programming problem (in a Google Doc, ugh) but then one of the on site interviews was very shell programming focused.
I guess to a certain extent it must be the luck of the draw - mass log analysis did crop up in my on-site interview as well, so that seems to be something of a theme, although it was more from a high level systems building POV for me.
I interviewed as SRE four months ago and I got a very typical programmer phone screen and interview, just exactly the kinds of whiteboard CS questions you expect Google might ask. (My bigger strength is on the programming side instead of the ops side, so I think perhaps they explicitly tailored the interview to measure that.)
That's interesting, when I interviewed (admittedly it was like 8 years ago), the first interview was standard CS questions and the second interview focused on deeper C++ questions (things like analyzing how a vtable is constructed).
I don't work for Google, but I did follow that career path - started as a developer, then moved into operations. I didn't immediately become an expert as soon as I made the transition - it was a slow process, and I'm still not what I'd consider the equal of a veteran sysadmin in terms of low-level system knowledge. However, I can hold my own in either discipline.
They look for someone that is strong at operating systems and networking knowledge AND someone that's a decently strong programmer. IOW, someone that can pass a low-level ops interview and a Google SWE interview.
This is a tough position to fill but I've worked with folks who are genuinely good at both, so they're out there.
> Candidates should be able to pass or nearly pass normal dev hiring bar, and may have some additional skills that are rare among devs (e.g., L1 - L3 networking or UNIX system internals).
Can you or someone else possibly say what the "normal dev hiring bar" is?
I've enjoyed what I've read so far in the book and I get it that there is an opportunity here for Google to market the SRE role b/c they are apparently hard to fill but why didn't they put a chapter in the book of what that dev skill set is?
Or else what dev skill are expected of someone working in this capacity.
Seems like lifting the shroud of mystery would do wonders for marketing a role that is hard to fill.
SRE-SWEs at Google are expected to be able to design and implement the same things that the developers they support are working on. If you are on Blah-SRE you would be expected to be capable of at the very least reading and understanding the code that Blah-Team writes, reviewing their code, etc. You might not _actually_ work on that code (it might be more important for you to be working on the automation, monitoring, or whatever) but you'd be qualified to do so. That's why the SRE-SWE interview is basically the SWE interview: graph coloring and complexity analysis and all that.
Good question :) SRE-written systems are never, or almost never, written in Java. However, many SRE teams support Java systems.
Literally nothing at Google except third-party code (like Linux kernel) is written in straight C. C++ is very common, though.
Python sucks and is useless. Most things that were written in Python are now written in Go.
I guess your question is about what happens if a C++ expert ends up on a Java team, but I don't really know. I happen to have landed in C++ world, which suited me.
True in my area as well. The only new things being written in Python are things that need to interop with other systems that have giant existing Python libraries.
I was an SRE who was most comfortable in C++ who changed ladders to a SWE team who was primarily Java. To be honest it was harder to learn all the team specific stuff than pick up Java.
The reason Python tends to suck is that not all of the really nice libraries get ported to it in any reasonable amount of time. (If you need one, you end up doing the work yourself.) Go sucks in new and totally different ways.
I imagine an SRE needs to be fluent in at least one statically typed, compiled language. Would Go fulfill that in terms of getting interviewed or hired, or does it need to be C/C++?