Granted, I'm only on the fifth chapter currently, but this is the first IT-oriented book that I've genuinely had a hard time putting down. It's so exciting to be able to see so many components of such a successful software (and hardware) organization, and as they mention, to see the reasoning behind the decisions rather than just a dump of "here's what we decided."
This book is great. I highly recommend it to anyone architecting modern, cloud-based applications. At the very least it introduces a lot of concepts that many different open source projects implement. It'll give you the "why" when you're looking at the "how".
I'm also enjoying the book. I really do like it, although it's driven home the fact that I cannot get by (as a systems engineer) without low-level programming knowledge and experience.
It's a strange feeling, seeing the end of the line for skills that took a decade of slow grinding to acquire.
I'm not sure I agree that you need low-level programming knowledge and experience. The industry is huge, and there's a need for talent with a variety of skills.
If anything, I would argue that the need for low-level languages is disappearing for the general case. As "The Cloud" gets bigger and bigger, and the low-level things are handled more and more by service providers who abstract away more and more of the day-to-day, it becomes easier for you to focus on higher-level problems.
I do think that the days of being purely ops and needing nothing but shell scripting are going away though. You gotta pick up some python or similar at this point.
> I do think that the days of being purely ops and needing nothing but shell scripting are going away though. You gotta pick up some python or similar at this point.
Maybe 5-10 years down the line, but not this year or next.
I was feeling the exact same way about a year ago. The thing is, once you pick up a programming language, all of your other hard earned Ops skills are still very relevant and will become much more valuable.
Without learning some programming you will go the way of the dinosaur, but by learning a bit you can become significantly more effective than you were before.
I don't necessarily think we're at the end of the line for any skills, just perhaps at the beginning of the line (only time will tell) for a new type of split. Either way, existing knowledge seems not to be EOL, but rather complements skills that were traditionally kept separate.
Additionally, it always takes plenty of time for legacy systems and processes to become forgotten, so even for those who can't or simply don't want to adapt, I feel like there's plenty of work still out there at places where change must occur more slowly, or where long-term investments have been made.
I'm a freelancer who always winds up as "the ops person", and I'm just waiting for my copy of this book to come from Amazon. I watched a video of a talk about this book; sadly I can't find it on YouTube anymore. A HN comment linked to it last week. It was interesting, but I'm hoping the book will have a lot more meat.
A few thoughts:
> Google places a 50% cap on the amount of “ops” work for SREs: Upper bound. Actual amount of ops work is expected to be much lower
I didn't catch the "upper bound" part from the talk. Good to know! I really enjoy being a developer-who-does-ops. I wouldn't want to be a sysadmin, and 50% ops is probably my limit for happiness.
> I don’t really understand how this is an example of circumventing the dev/ops split
I felt the same way from the YouTube talk. I think there must be a lot behind the SRE role that makes it successful or not: culture, policies, who you hire, how you train, etc. Also I feel like the best sysadmins have been encouraging coding and automation for a long time, e.g. Thomas Limoncelli. But I've certainly been on the "dev" side of the dev-vs-sysadmin fight before, and it makes sense to be seeking ways to improve things.
> Error budget. 100% is the wrong reliability target for basically everything
I think I saw just this month that Google Apps uptime is 99.95%? Some major Google service. I remember in the early 2000s everyone cared about "5 9s", and I feel like for most of us that is just not worth the effort.
> Chubby was so reliable that teams were incorrectly assuming that it would never be down
This reminds me of Nygard's point in Release It! that your theoretical best SLA is the product of your dependencies' SLAs, e.g. 0.999 * 0.999 = 0.998. But in the world of microservices, this logic seems likely to make you underestimate your uptime.
Also I think Feynman's remarks about the Challenger accident apply here: if you are building a new product with, say, 5 microservices, you don't know the reliability of any of them yet. It's dubious to estimate low-frequency events based on "it hasn't happened yet."
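Nygard's rule of thumb is just a product over your hard dependencies. A quick sketch (the availability numbers are made up for illustration):

```python
# Naive serial-dependency model: a request fails if ANY hard
# dependency fails, so availabilities multiply. This is a ceiling,
# not an estimate -- mitigations (caching, fallbacks) can beat it.
def composite_availability(dependencies):
    result = 1.0
    for availability in dependencies:
        result *= availability
    return result

# Two three-nines dependencies already put you just under three nines.
print(round(composite_availability([0.999, 0.999]), 6))  # 0.998001
```

Each additional hard dependency only drags the ceiling down, which is why the number of things you *synchronously* depend on matters more than any single SLA.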
Thanks for sharing your notes. I'm envious you've already got a copy. :-)
Regarding "theoretical best", I think that is "in the absence of mitigations". I think you can build a service with a higher SLA than one of its dependencies, but only if you recognize that impedance mismatch and build in defenses.
As a contrived example, if you've got a microservice that provides data FOO about a request that isn't actually end-user critical, you can mitigate your dependency on it by allowing your top-level request to succeed even if the FOO data is missing. Or maybe you can paper over blips of unavailability with cached data.
But, yes, know what you depend on and how reliable they are, then see if you need to take more action than that if your target is higher than the computed target.
(Tedious disclaimer: my opinion only, not speaking for anybody else. I'm an SRE at Google)
Building reliable services out of unreliable dependencies is a part of what we do. At the lowest level, we're building services out of individual machines that have a relatively high rate of failure, and the same basic principles can be applied at every layer of the stack: make a bunch of copies, and make sure their failure modes are uncorrelated.
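The "make a bunch of copies with uncorrelated failures" arithmetic is worth spelling out: if each replica is independently up with probability p, then at least one of n replicas is up with probability 1 - (1 - p)^n. A toy illustration (independence is the big assumption; correlated failures break it):

```python
def replicated_availability(p, n):
    """Probability that at least one of n independent replicas is up."""
    return 1 - (1 - p) ** n

# Three individually unreliable machines (99% each) together get you
# close to six nines -- IF their failures are truly uncorrelated.
print(round(replicated_availability(0.99, 3), 6))  # 0.999999
```

Which is why so much of the effort goes into decorrelating failure modes (separate racks, power domains, regions) rather than into making any single machine more reliable.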
In the early 2000s all the cool people were running Sun E450s, mod_* and Oracle, which makes 5 9s a little more reasonable than we see today with multiple layers of things to go wrong on less-managed platforms. I'm not throwing shade on any particular technologies, I'm just saying that the deployment and configuration management side of things these days is more complicated.
I don’t really understand how this is an example of circumventing the dev/ops split.
My understanding, from talking to a friend who is an SRE, is that SREs are also part of the design process. The developers want resources, so they contact and work with SRE teams to make sure their project is both planned for in capacity and can be efficiently served. If it can't be served, maybe another component needs to be deployed that makes the data efficiently usable for the new app or feature (I'm unsure on this, but it sounded like it may have been implied).
That is, SRE teams become devs of certain components of the project, and work to support the project when in development. This should defeat some of the dev/ops split, because SREs also work on the same project, and are invested in its launch and success.
When it comes to alerting, yes. I've seen it tried many times by competent engineers. The problem is that once you get beyond toy examples into situations with even a mere 10k time series there's so much noise that you can't get any useful signal.
> We could really use something like Outalator, though.
I've not found anything like it yet, unfortunately. Hopefully someone will be inspired to write one by the book, there should be enough detail there to do it.
> the request was rejected because the error case should never happen.
I haven't run into this mindset much at my current job. But in general I think I've been able to lobby for "well, can we at least have a special case that would leave a breadcrumb behind if it does occur?" That way the investigation when it does inevitably occur is swift and there's less debate among ambiguous choices about how to change the design going forward.
I've also found fault injection testing as a great way for disproving statements about what "can never happen."
That said, I've seen the other extreme too -- checking pointers against NULL just prior to dereferencing at every opportunity up and down the stack. In these cases function/module authors succeed only in moving the eventual crash to somewhere far disconnected from the origin of the problem.
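The "breadcrumb" idea above is cheap to implement. A hedged sketch (the lookup function and log format are invented for illustration): instead of scattering null checks everywhere, check the invariant once where it's supposed to hold, log enough context to make the investigation swift, and fail loudly there rather than letting the bad value propagate and crash somewhere far from the origin.

```python
import logging

logger = logging.getLogger(__name__)

def lookup_user(user_id, index):
    """index maps user_id -> record; a missing id 'can never happen'."""
    record = index.get(user_id)
    if record is None:
        # The "impossible" case: leave a breadcrumb with context,
        # then fail here, close to the origin of the problem,
        # instead of returning None and crashing several frames away.
        logger.error("invariant violated: user_id %r not in index", user_id)
        raise KeyError(f"user {user_id!r} missing from index")
    return record
```

The point is the placement: one loud check at the invariant's boundary, not defensive checks at every dereference up and down the stack.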
What I find interesting about how Google/AWS/Netflix are set up is the line they draw between ops and devops. Development teams are expected to run their own services, but only after a while. SREs are there to help make the transition; they are the ops experts. I think there's an important lesson here for smaller shops and startups: don't throw out ops! Your in-house ops people are Google's SREs. Most devs I work with could not run services in production without significant help. Google, etc., have great structures in place to handle this. Smaller shops should take care.
I don't think the trend toward having devs handle all the ops is a good idea. Ops teams, for good reason, tend to be way more conservative than devs. Plus I don't think it is effective to constantly interrupt dev teams with operational issues.
Interrupting them is what pushes them to ship quality code rather than just throwing it over to ops and hoping for the best. It works quite well, with deployment velocities that would shock even normal devops shops.
Agreed, but there is definitely a trend towards pushing devs to think more about ops. We internally use chef/kitchen and get devs to run their code on kitchen unit tests before it can be considered ready for a PR. Previously, we only did unit tests, and it's a great improvement. But, yes, the bar is much higher from that to actually running in production.
In AWS, if anything, it's the other way around. Devs always run their own services at first, and ops teams start helping out once the service reaches some kind of scale where a dedicated ops team makes sense. (source: I work for AWS)
Unfortunately shipping at O'Reilly is $49 for my address in europe - literally more than the book + ebook bundle. Amazon on the other hand has the book for €30 and free shipping.
I always wonder why companies like O'Reilly think there is no decent market for them here, would have loved to order directly from them and get the ebook as well.
I didn't even work out what the O'Reilly shipping was, since it required me to log in and go right to the end of the ordering process.
I tend to avoid ordering books from Amazon too since they charge $5 per book plus $5 per order which usually makes them uncompetitive. Strangely enough I end up ordering most of my books via the Book Depository which is owned by Amazon anyway.
Bought this book last week and intend to get through it shortly. But, I'll also plug this excellent paper by James Hamilton (of AWS): "On Designing and Deploying Internet-scale Services" - https://www.usenix.org/legacy/event/lisa07/tech/full_papers/...
> I don’t really understand how this is an example of circumventing the dev/ops split. I can see how it’s true in one sense, but the example of stopping all releases because an error budget got hit doesn’t seem fundamentally different from the “sysadmin” example where teams push back against launches. It seems that SREs have more political capital to spend and that, in the specific examples given, the SREs might be more reasonable, but there’s no reason to think that sysadmins can’t be reasonable.
Seems we don't understand the point of SREs at all.
In a world where "ops" and "dev" are split and sysadmins occupy "ops", it is customary that system admins are not programmers, may not know how to program, do not venture into the VCS for the codebase, and may not even have rights to check in code. It would certainly be unusual to see check-ins to the codebase from the ops/sysadmin team.
This leads to the situation where you have 10 year old codebases running on 10 year old frameworks on 10 year old operating systems. The system admins are naturally tearing their f---ing hair out over this situation. The devs, however, have the platform on life support and are off writing new code for shiny systems because that's a lot more interesting and useful than keeping the old garbage on life support. No progress is made and the problem typically doesn't resolve itself until the old systems become sufficiently problematic that the devs rewrite the entire system.
If you have an SRE model it should never get quite this bad. The Devs will support the code in production until they're ready to hand it over for maintenance. When it is handed over the SREs get all the keys to the kingdom and have the rights, responsibility and ability to fix bugs in the software they're running.
If you have legacy codebases run entirely by ops people who don't have any ability to maintain the codebases then you aren't doing SRE.
This is one of the manifestations of the "chinese wall" between ops and dev (which is what "DevOps" and "SREs" are entirely antithetical to--it may be hard to define what those terms /are/ but it is pretty easy to define some patterns that they definitely /are not/). If "Ops" has to come begging to "Dev" to fix their software then you're not doing it right.
> When it is handed over the SREs get all the keys to the kingdom and have the rights, responsibility and ability to fix bugs in the software they're running.
Bug fixing is still the responsibility of the developers, which isn't to say that SRE won't help out at times but it's not their role.
> If you have legacy codebases run entirely by ops people who don't have any ability to maintain the codebases then you aren't doing SRE.
An SRE is not a maintenance engineer. Service ownership is always a partnership between SRE and developers. If there's no developers, then there's no SREs.
And in fact, it's common for SREs to hand obsolete services back to the dev team if they've been mostly phased out to the point where they're no longer the primary priority.
> First, I normally take pen and paper notes and then scan them in for posterity. Second, I normally don’t post my notes online, but I’ve been inspired to try this by Jamie Brandon’s notes on books he’s read. My handwritten notes are a series of bullet points, which may not translate well into markdown. One issue is that my markdown renderer doesn’t handle more than one level of nesting, so things will get artificially flattened. There are probably more issues. Let’s find out what they are! In case it’s not obvious, asides from me are in italics.
This is a problem that would be overwhelmingly solved by org-mode in Emacs. Writing the summary in org-mode means you're a 'C-c C-e' away from exporting to HTML or LaTeX.
I don't know anything about Google and its SREs. Do they really just hire software devs and expect them to be good at Ops? The opposite, hiring a sysadmin and then assigning them to a software team to develop a product, seems equally problematic.
They want a cross between a developer and a sysadmin.
When I was there they hired people with one background and hoped they would be capable of the other. This didn't work out for about half of them. Making things worse, they had no procedure to say, "OK, we hired someone we think is good, but this is clearly the wrong role for them." Which led to losing a lot of people who had potential.
Making it personal, I interviewed for a pure dev position then was offered a job as an SRE. It turns out that I don't make a good sysadmin. After I left, I was amazed at how many people I met knew of someone else who had had the same experience.
Ironically, recruiters see a year as a Google SRE on my resume and think I might want to join SRE teams at other companies. Most of that stopped after I changed my LinkedIn profile to say that I am never interested in SRE jobs.
Explaining the concretes would be a lot of unpleasant detail. The why is far more interesting.
I am very good at "tunnel vision". Taking a thread and following it in depth. Completely learning a system in depth so that anything that comes up, I know exactly where to go and what to do.
I am weak on "peripheral vision". Keeping track of 15 balls in the air, and operating with limited information about each. Being effective with limited context about each individual system because there isn't time to learn any of them in depth.
Tunnel vision is a virtue in a software developer. Peripheral vision is essential for a sysadmin.
Now imagine a person with poor peripheral vision trying to learn a system as big and complex as Google's. And being responsible to support 15 different pieces of software written by 15 different teams so that when anything went belly up with any of them, you can trouble-shoot and get it running again.
Nothing particularly bad happened, but I also wasn't accomplishing the job to the standards that Google wanted in that role. And I was not the only person on my team failing in that way.
Ideally it would have been my manager's job to say, "I recognize that this person isn't working out here, is there a better role for them?" That didn't happen. Several months later my manager got fired. I was privately told that my situation was a trigger, but I don't know details.
The fact that my situation is fairly common strongly suggests that the ultimate failure was organizational and systemic. I don't fault Google for hiring developers into their hybrid developer/sysadmin role. I do fault them for not having an explicit onramp/reconsideration process to mitigate the risk that they create by doing so.
Well that's just the thing - an organization that has a "system" or "process" for handling employees and their work is by design less than flexible. In a small to medium sized business, you might have 3 completely separate roles, and others might pitch in as needed. But at a large corporation, it's simpler and more efficient for them to have one person who does one job. When the peg no longer matches the hole, they are replaced or reassigned. Which is sad, because people with valuable skills are often underutilized. I feel like companies like Valve might have the right idea going forward.
The problem is that larger organizations actually need to have a defined process of some sort. There is a point beyond which individual judgment doesn't scale.
In general I believe that Google has a pretty good process. However every process has bugs. And I happen to have encountered one where they pick people who are qualified in one role into a more experimental one, and it only sometimes works out for them.
It's notoriously hard to hire for. They cherry-pick devs that are 85-99%[0] of the way to passing the Google SWE hiring bar, who in addition have ops experience.
The Google SRE head honcho says: [1]
>Fundamentally, it’s what happens when you ask a software engineer to design an operations function.
So I think the short answer to your question is yes, they hire software devs and train them [2], but require them to already have some ops/Linux internals knowledge.
They kinda do. Over time, the skills of a dev and SRE diverge a bit but at the core they are the same. I am a dev at Google but went through a program called Mission Control[1]. (It was a great learning experience and helped me do my job as a dev much better.)
As others have said, there is a pretty intensive training that you go through and obviously you learn a lot when you are handling things...
You're quite right. I'm an ex-Googler and SRE was the last thing I did at Google. As a first-and-foremost programmer I was bored out of my wits and frustrated by the fact that while I was writing repetitive monitoring rules actual development was happening somewhere else and I had little impact on the design of the system. Granted, a lot of folks over in SRE loved it but I would venture to say that most of them were much closer to being a sysadmin than programmer. Folks like myself tended to 1) move to SWE, 2) quit and 3) some endured and became managers over time.
Although I can't find the link now, I've read that there is a pretty intensive training process for SREs. This would make sense since Google's infrastructure is so unique.
From my limited experience, they want devs. The initial phone screen is all sysadminy questions (e.g. what commands show you I/O usage?). The second screen is a coding interview. I'm a sysadmin who works in bash every day. The coding questions are mostly related to log parsing. My interviewer didn't seem to even acknowledge that bash was a programming language and did not understand a simple grep-and-sort pipeline.
Edit: Sorry everyone! I totally got my LinkedIn and Google interviews mixed up. What I described above is my LinkedIn experience.
There are actually two different SRE roles: the one people are describing above where you are 85-99% of the way to SWE (Software Engineer), and you have sysadminy experience, and another one where you are 100+% of the SWE bar and optionally have sysadminy experience. The former is called SRE-SE (Systems Engineer), and the latter SRE-SWE.
SRE-SE interviews are super heavy on the sysadmin stuff usually, with less (but still significant) attention paid to SWE skills, whereas SRE-SWE interviews may not even have an SRE component (it's possible for candidates in the 'normal' SWE hiring pipeline to be shunted to SRE-SWE post-interview).
Yeah a lot of people don't understand this distinction. You have your pure SWEs who were hired that way who then were either picked for or switched to SRE-SWE. Then you have people who were recruited into SRE-SWE from the beginning. People in SWE and SRE-SWE job classes can freely move between them. Then finally you have people who were recruited as SRE-SysEng, or were recruited as SWEs and didn't quite make the cut. These folks have to do a transfer interview to jump to the SWE or SRE-SWE roles.
I'm an SRE-SE and regularly do phone interviews for SRE-SE candidates.
While I do tend to spend more of the interview time talking about sysadmin tools, operating systems, networking, databases, security and troubleshooting, I still expect candidates to have reasonably good coding chops.
The difference is that the coding questions tend to be more task-oriented or procedural (i.e. log processing, building automation pipelines, implementing standard unix cli tools, etc.), rather than the algorithmically challenging or math-oriented problems that we'd usually ask SWE candidates.
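A typical task of that flavor, roughly the Python equivalent of a `grep | sort | uniq -c` pipeline over a log (the log format here is invented for the sake of the example):

```python
from collections import Counter

def top_error_sources(log_lines, n=3):
    """Count ERROR lines per source host, most frequent first.

    Assumes a made-up log format: '<host> <level> <message>'.
    """
    counts = Counter()
    for line in log_lines:
        parts = line.split(None, 2)
        if len(parts) >= 2 and parts[1] == "ERROR":
            counts[parts[0]] += 1
    return counts.most_common(n)

log = [
    "web1 ERROR timeout talking to db",
    "web2 INFO request ok",
    "web1 ERROR timeout talking to db",
    "db1 ERROR disk full",
]
print(top_error_sources(log))  # [('web1', 2), ('db1', 1)]
```

Nothing algorithmically deep, but it shows whether a candidate can parse messy input, pick the right data structure, and handle the edge cases (short lines, missing fields) without being told to.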
Both the SE and SWE side SRE candidates need to be able to design and reason about large systems, making trade-offs between performance (especially latency), redundancy and cost.
Thanks. Is being well-versed in C a prerequisite for the role then? I'm imagining you need to be fluent in at least one statically compiled language or ???
In my interviews you code in whichever language you prefer. Some interviewers will ask you to use a specific language that's mentioned on your resume. In general I think that if you show strong coding skills in some language, it is believed that you won't have much trouble teaching yourself the languages your team uses (typically some subset of C++, Java, Python, Go, Borgmon).
Weirdly my experience with interviewing for an SRE role was the reverse: the phone screen was pretty much a straight programming problem (in a Google Doc, ugh) but then one of the on site interviews was very shell programming focused.
I guess to a certain extent it must be the luck of the draw - mass log analysis did crop up in my on-site interview as well, so that seems to be something of a theme, although it was more from a high level systems building POV for me.
I interviewed as SRE four months ago and I got a very typical programmer phone screen and interview, just exactly the kinds of whiteboard CS questions you expect Google might ask. (My bigger strength is on the programming side instead of the ops side, so I think perhaps they explicitly tailored the interview to measure that.)
That's interesting, when I interviewed (admittedly it was like 8 years ago), the first interview was standard CS questions and the second interview focused on deeper C++ questions (things like analyzing how a vtable is constructed).
I don't work for Google, but I did follow that career path - started as a developer, then moved into operations. I didn't immediately become an expert as soon as I made the transition - it was a slow process, and I'm still not what I'd consider the equal of a veteran sysadmin in terms of low-level system knowledge. However, I can hold my own in either discipline.
They look for someone that is strong at operating systems and networking knowledge AND someone that's a decently strong programmer. IOW, someone that can pass a low-level ops interview and a Google SWE interview.
This is a tough position to fill but I've worked with folks who are genuinely good at both, so they're out there.
> Candidates should be able to pass or nearly pass normal dev hiring bar, and may have some additional skills that are rare among devs (e.g., L1 - L3 networking or UNIX system internals).
Can you or someone else possibly say what the "normal dev hiring bar" is?
I've enjoyed what I've read so far in the book and I get it that there is an opportunity here for Google to market the SRE role b/c they are apparently hard to fill but why didn't they put a chapter in the book of what that dev skill set is?
Or else what dev skill are expected of someone working in this capacity.
Seems like lifting the shroud of mystery would do wonders for marketing a role that is hard to fill.
SRE-SWEs at Google are expected to be able to design and implement the same things that the developers they support are working on. If you are on Blah-SRE you would be expected to be capable of at the very least reading and understanding the code that Blah-Team writes, reviewing their code, etc. You might not _actually_ work on that code (it might be more important for you to be working on the automation, monitoring, or whatever) but you'd be qualified to do so. That's why the SRE-SWE interview is basically the SWE interview: graph coloring and complexity analysis and all that.
Good question :) SRE-written systems are never, or almost never, written in Java. However, many SRE teams support Java systems.
Literally nothing at Google except third-party code (like Linux kernel) is written in straight C. C++ is very common, though.
Python sucks and is useless. Most things that were written in Python are now written in Go.
I guess your question is about what happens if a C++ expert ends up on a Java team, but I don't really know. I happen to have landed in C++ world, which suited me.
True in my area as well. The only new things being written in Python are things that need to interop with other systems that have giant existing Python libraries.
I was an SRE who was most comfortable in C++ who changed ladders to a SWE team who was primarily Java. To be honest it was harder to learn all the team specific stuff than pick up Java.
The reason Python tends to suck is that not all of the really nice libraries get ported to it in any reasonable amount of time. (If you need one, you end up doing the work yourself.) Go sucks in new and totally different ways.
I imagine an SRE needs to be fluent in at least one statically typed, compiled language. Would Go fulfill that in terms of getting interviewed or hired, or does it need to be C/C++?