When the oncall gets paged, an SLO should be in jeopardy in a way that requires immediate measures to be taken by a well-trained human as described in actionable terms in a linked playbook.
No SLO in jeopardy, or no immediate measure that needs to be taken? Don't page the oncall; send a low-priority ticket for the service owner to investigate the next business day.
Steps need to be taken, but they're mechanical in nature or otherwise don't give the SRE an opportunity to exercise their brain in an interesting fashion? Replace the alert with an automated handler that only pages the oncall if it encounters an exception.
No playbook, or the playbook consists of useless non-actionable items like, "This alert means the service is running out of frobs"? Write a playbook that explains what the oncall is expected to do when the service needs frobs.
Edit: A dead reply asks if I've ever experienced a novel incident. Of course. Say, for instance, a "This should never happen" error-level log is suddenly happening like crazy, for the first time ever. In that case, you page the oncall, they do their best to debug it, see if they can reach the SWE service owners, read through the code to see if it could be an indicator that SLOs are being violated (e.g., user data corruption) or might be violated soon, and then write a stub playbook to be fleshed out the next business day, probably alongside a code change to handle this situation without spamming the logs so much.
In a previous life as a full-stack Engineer at a startup, this was my white whale. The state of logging, monitoring, and alerting was such that signal quality was low, and only indirect observations of the system were possible since the logging was borderline useless. The result was multiple pages per night, with each one resulting in a scavenger hunt because signal was so low that it was nigh impossible to even identify what playbook to run.
For example, the web application crashing was logged as a DEBUG statement, but starting was logged at an ERROR level. This was clearly done at some point because DEBUG generated far too much log info w/millions of active users, but some Engineer wanted to know that the app started. Gross.
I solved this by doing a couple of things. The first was to define standards for log levels, for correlating log statements with each other for a given request, and for the amount of context a "proper" log message should provide.
For example, FATAL = there's no way anything can work properly. These are pretty rare, but incorrect configuration values were a common culprit. ERROR indicates something, possibly transient, going wrong. One every now and then is no big deal and can wait until later, but a rapid accumulation could mean something more serious is going on. INFO contained information about the state of the system, such as general measures of activity and other signals indicating the system was working as expected. Most of our metrics capture was instrumented based off these statements.
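A minimal sketch of what those standards can look like in practice, assuming a Python service; the JSON field names and the request-ID plumbing are illustrative, not the actual system described above:

```python
import contextvars
import json
import logging
import uuid

# Correlation ID for the current request, so every log line emitted while
# handling it can be joined together later.
request_id_var = contextvars.ContextVar("request_id", default="-")

class RequestContextFilter(logging.Filter):
    def filter(self, record):
        record.request_id = request_id_var.get()
        return True

class JsonFormatter(logging.Formatter):
    def format(self, record):
        return json.dumps({
            "level": record.levelname,        # FATAL/ERROR/INFO conventions above
            "request_id": record.request_id,  # correlate all lines for one request
            "module": record.name,
            "message": record.getMessage(),
        })

handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
handler.addFilter(RequestContextFilter())
logger = logging.getLogger("app")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

def handle_request():
    request_id_var.set(str(uuid.uuid4()))
    logger.info("request started")   # INFO: general activity, also feeds metrics
    try:
        pass  # real work would go here
    except ConnectionError:
        logger.error("transient failure talking to upstream", exc_info=True)

handle_request()
```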
We also rapidly evolved the quality of the messages themselves. For something like the aforementioned configuration error, the system initially just spat out an "Unexpected error" and a module name. The first improvement stated something like "invalid configuration value", and we finally ended up with a message that stated the value was incorrect, identified which configuration value was wrong, and included a code that referenced documentation and an escalation owner.
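Hypothetically, the final form of that kind of check might look something like this; the error code, setting name, and escalation details are made-up placeholders:

```python
class ConfigError(Exception):
    pass

def validate_max_connections(raw_value: str) -> int:
    """Fail with a message that names the bad value, the offending key,
    a documentation code, and an escalation owner."""
    try:
        value = int(raw_value)
        if value <= 0:
            raise ValueError
        return value
    except ValueError:
        # Evolution: "Unexpected error" -> "invalid configuration value" ->
        # a message that identifies the key, the value, and where to go next.
        raise ConfigError(
            f"CFG-042: configuration value 'max_connections' is invalid "
            f"(got {raw_value!r}, expected a positive integer). "
            f"See docs/runbooks/CFG-042.md; escalation owner: platform team."
        )

try:
    validate_max_connections("banana")
except ConfigError as err:
    print(err)
```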
When all was said and done, we'd reduced our downtime from hours per year to less than 5 minutes, eliminated over 95% of our pages, and reduced escalations to Engineering from several days per week to a level where it was hard to remember the last one.
As the head of Engineering, I had to fight an uphill battle against the product & sales team for almost a year to make all of this happen, but I was fully vindicated when we were acquired and our operational maturity was lauded during the due diligence process.
I'm going through something like this as a SWE at a startup. Lots of noise in our alerts and logging, so alert fatigue is a real problem. Do you have any advice on navigating this scenario (esp. negotiating with product to get monitoring and ops into a usable state)?
Thanks, this was a very enlightening read. Getting product on board with the labor involved in implementing this is going to be a different story though.
Ultimately, it's Product's job to decide how they want to balance reliability and feature-shipping speed. Work with them to define an SLO (like, in 99.995% of five-minute timeslices of any given month, 99% of all queries will complete within 250msec) and then graph how well you're doing when it comes to hitting it.
If you're failing to keep things above that line, Product either needs to accept lower reliability standards or invest engineering time in improving reliability. Again, it's Product's call to make. If they do want to invest in reliability, though, that's when you get to present your wish list, work out an agreement on its ranking, and find time to get the work done, even if it means slowing down the rate at which new features are shipped.
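A rough sketch of grading a month against that kind of SLO from per-request latency samples, just to make the graph concrete; the numbers mirror the example above and the data layout is an assumption:

```python
from collections import defaultdict

SLICE_SECONDS = 5 * 60      # five-minute timeslices
LATENCY_SLO_MS = 250        # per-request latency target
GOOD_FRACTION = 0.99        # 99% of queries in a slice must meet it
SLICE_TARGET = 0.99995      # 99.995% of slices must be good

def grade_month(samples):
    """samples: iterable of (unix_timestamp, latency_ms) for one month."""
    slices = defaultdict(list)
    for ts, latency_ms in samples:
        slices[int(ts) // SLICE_SECONDS].append(latency_ms)

    good = sum(
        1 for latencies in slices.values()
        if sum(l <= LATENCY_SLO_MS for l in latencies) / len(latencies) >= GOOD_FRACTION
    )
    achieved = good / len(slices) if slices else 1.0
    return achieved, achieved >= SLICE_TARGET

achieved, ok = grade_month([(0, 120), (10, 300), (305, 90)])
print(f"{achieved:.5%} of timeslices met the SLO; within target: {ok}")
```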
You may have luck if you frame it as an investment. Spend the time now to fix your alerts, add playbooks, and improve process, because you immediately start enjoying the benefits: less time spent on support means higher velocity. The longer you wait, the more engineering time you've wasted. It just takes a little patience up front, as well as product and engineering collaborating.
> When the oncall gets paged, an SLO should be in jeopardy in a way that requires immediate measures
> No SLO in jeopardy, or no immediate measure that needs to be taken?
A little contradictory here. Maybe your "or" should've been an "and". But anyways:
I can think of several scenarios in which no SLO is in jeopardy but you should still get paged. All of them boil down to "yes, it is important for the business, our users, our brand," but it's still not worth having an SLO around, because it's either 1) hard to measure or 2) too difficult to implement.
In an ideal world with infinite engineering resources on your team, you could have and _maintain_ an SLO for every part of the business that your system affects. In the real world, trade-offs need to be made and certain key SLOs prioritized.
The entire car alarm industry is a scam, promoted by Republican congressman Darrell Issa. It has seriously disrupted our lives in every way imaginable and has drowned out the beauty of nature. I can’t think of a single car that has been protected by a car alarm since they were invented. They are useless and should be banned for the health and safety of mitigating noise pollution.
> It has seriously disrupted our lives in every way imaginable
I assume this is one of those things that changes dramatically based on where you live—for me (western US), this statement seems almost comically exaggerated.
I live in a city with more than 2 million people, and I have to think hard to remember the last time I heard one go off.
It would take longer still to think of one that went off and wasn't some form of 'oh shit oh shit oh shit, wrong button' reaction from the person who accidentally set it off.
> It would take longer still to think of one that went off and wasn't some form of 'oh shit oh shit oh shit, wrong button' reaction from the person who accidentally set it off.
I think this is what makes it such a scam. The sheer number of false alarms desensitizes people to them.
All it takes is one neighbor moving in with a finicky alarm and a street-parking situation to ruin the peace and quiet of this amazingly rare statistical anomaly you reside in.
> I can’t think of a single car that has been protected by a car alarm since they were invented.
Many insurance companies offer lower premiums if you install a car alarm. So I guess they work at least a little, otherwise they wouldn't lower their premiums.
It may not actually stop a thief, but it may get a thief to choose a car that doesn't have an alarm. Or maybe it is just a correlation, but there is at least something.
Still, I think they should be made illegal, they are a nuisance, there are already laws against making excessive noise and car alarms should be included. And if they create an arms race, by getting thieves to prefer cars without alarms, that's even more reason to ban them.
>Many insurance companies offer lower premiums if you install a car alarm. So I guess they work at least a little, otherwise they wouldn't lower their premiums.
I can think of another reason.
They run their computations and decide the insurance can be priced at $100. But hey, what if we just increase it to $105 and then offer a $5 discount to people who have car alarms? We get extra money (an average of more than $100) and people think they are saving money. Who knows, they might even be getting some kind of kickback from the car alarm industry for promoting them.
Maybe I'm making shit up, but I've grown to hate insurance companies so much that it also makes perfect sense.
Only for the not-so-lucrative market of car owners without car alarms. Since all new cars are outfitted with one, I would guess this would be the n% least willing to spend on their cars.
If you change the requirements for an insurance policy you attract a different risk pool. Just like health insurance companies offering gym membership to find healthier customers.
For poor people whose ability to live depends on having a car, car alarms must be at least sort of useful to know if your car is being stolen at night. I’m sure they’re just a noisy inconvenience to the wealthy though.
Usually when they go off in my neighborhood the car is street-parked, and that's how they end up getting triggered (sometimes a loud motorcycle can even do it). So the owner could be well over a 10-minute walk away and out of earshot anyhow. As a result, most of the time when I hear a car alarm it rings until the thing shuts itself off automatically to save the car battery.
Hard to prove empirically either way, I think. Even if they do, it's then arguable if this is worth the noise pollution etc. Probably depends on the circumstances in any case, so any empirical result would be too specific to be of any use. Science!
A thief can disable your blaring car alarm in like 30 seconds. Enough time for anyone listening to go "wow someone must have hit the panic button in their pocket" and move on with their life.
Smoke (and fire alarms in general) are not a good example of a thing with high specificity. You perceive them that way, but what you see is the result of somebody getting paged about it and then checking (preferably physically, but also through e.g. CCTV) whether there really is an emergency and cancelling the alarm before its escalation timeout. Apparently, for a typical commercial building, false fire alarms are more or less a weekly occurrence.
Edit: in large-scale fire alarm systems there are also rules about combinations of triggered sensors that cause immediate escalation (if there is smoke and elevated temperature in two adjacent zones, it probably is not a false alarm, and so on; often it even takes into account the failure modes of the physical alarm loop wiring). This is an interesting idea for IT monitoring: page someone only when multiple metrics indicate an issue.
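Translated to monitoring, that might look like paging only when independent signals agree; the signals and thresholds below are invented purely for illustration:

```python
def should_page(error_rate, p99_latency_ms, healthy_replicas, total_replicas):
    """Page only when at least two independent signals say something is wrong,
    mirroring the 'two adjacent zones' rule in fire alarm panels."""
    signals = [
        error_rate > 0.05,                         # elevated 5xx rate
        p99_latency_ms > 1000,                     # tail latency blown out
        healthy_replicas < 0.5 * total_replicas,   # half the fleet unhealthy
    ]
    return sum(signals) >= 2

print(should_page(error_rate=0.08, p99_latency_ms=300, healthy_replicas=9, total_replicas=10))   # False: one signal
print(should_page(error_rate=0.08, p99_latency_ms=1500, healthy_replicas=9, total_replicas=10))  # True: two signals
```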
It was an interesting example and maybe deserved a few more caveats to actually serve the point. After all, we've all heard a fire alarm of some sort in the past year (if not the past month) but how many were actual fires? (Technically the author said smoke which helps but not really.)
Where I was expecting the author to go:
- Clearly was talking about residential smoke detectors, not commercial. That could have been explicit.
- Smoke detectors do have a high false-positive rate but almost always at the right time. A home smoke alarm going off while I'm cooking is quite different to a smoke alarm going off when I'm sleeping. To the author's point, there are very few false positives while I'm sleeping so when they happen, I'm getting up.
Speaking of the commercial context, I wonder what sort of businesses would get a lot of false alarms and how that varies across industries.
I have been _plagued_ by smoke alarms that treat a low battery as a sign of a fire. To the point that I am trained by their crying wolf that it is _always_ a false alarm. Particularly when I'm asleep and they've gone off.
I would actually prefer if rules mandated they could only have large capacitors and just NOT CARE if the power goes out.
Next would be to require that a sensor detect both an IR-hot area AND smoke before going off. I'm sick of bathroom steam sometimes setting them off too.
Finally, ONLY FOR AN EMERGENCY would the loud and annoying cry be allowed. Tests, low battery, anything not indicating a clear and immediate threat to life should get a low-noise, low-light indication. Maybe a 2-second low-quality sound clip that says 'bat' at a soft speaking volume, with a strobe at the end of the voice (when a human would be looking for the noise). Fog/steam/etc., e.g. a possible fire without detected heat but at a weak detection level, could also use the 'info' level of alert, not the DANGER level.
I’m at the point where a low battery in a smoke detector triggering a fire would be an actual upgrade. Have tried cheap, expensive, and multiple models. Currently have zero active and about 8 on a shelf. They randomly go off even with new batteries. My house isn’t on fire.
The post is somewhat incomplete without also discussing the cost of the wrong decision.
You obey the smoke alarm because the cost of ignoring the alarm when it is a true positive is potentially infinite (you die). You ignore the car alarm because (1) most likely it is a false positive but also (2) most likely it is somebody else's car.
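Put crudely into numbers (every probability and cost here is made up, purely to show the asymmetry):

```python
# Expected cost of ignoring an alarm = P(it's real) * cost if it's real.
# The smoke alarm is worth obeying even at a tiny hit rate because the
# downside is unbounded; the car alarm rarely clears that bar.

def expected_cost_of_ignoring(p_real, cost_if_real):
    return p_real * cost_if_real

smoke = expected_cost_of_ignoring(p_real=0.01, cost_if_real=10_000_000)  # your life/home
car   = expected_cost_of_ignoring(p_real=0.01, cost_if_real=500)         # someone else's deductible

cost_of_responding = 50  # getting up, checking, losing sleep

print("respond to smoke alarm:", smoke > cost_of_responding)  # True
print("respond to car alarm:  ", car > cost_of_responding)    # False
```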
I do like how the author presents the case for how damaging false-positives can be in SRE monitoring. But, FYI, it can get worse if these monitors are hooked to self-actuating feedback loops! I recently wrote about a production incident on the Heii On-Call blog, in the context of witnessing how Kubernetes liveness probes and CPU limits worked together to create a self-reinforcing CrashLoopBackOff. [1] Partially because the liveness probe thresholds (timeoutSeconds and failureThreshold fields) were too aggressive.
We have a similar message about setting monitoring thresholds in our documentation [2] because users have to explicitly specify a downtime timeout before they’re alerted about their website / API endpoint / cron job being down. The timeout / "grace period" is necessary because in many cases a failure is some transient network glitch which will fix itself before a human is alerted.
If you make the timeout too short, you’ll get lots of false positive alerts, and as the article says, your on-call engineers will be overwhelmed or just start ignoring the alerts.
If you make the timeout too long, it just takes that many minutes of downtime longer before you find out about it.
It may sound counterintuitive, but the latter is usually preferable. :)
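A toy version of that grace-period logic, assuming a periodic up/down check; the 5-minute figure is just an example, not our actual implementation:

```python
import time

GRACE_PERIOD_SECONDS = 5 * 60   # example: tolerate 5 minutes of failed checks

class DowntimeAlerter:
    """Only alert once a target has been failing continuously for the whole
    grace period, so transient network blips never page anyone."""
    def __init__(self, grace_period=GRACE_PERIOD_SECONDS):
        self.grace_period = grace_period
        self.failing_since = None

    def record_check(self, ok, now=None):
        now = time.time() if now is None else now
        if ok:
            self.failing_since = None        # recovered: reset the clock
            return False
        if self.failing_since is None:
            self.failing_since = now         # first failure: start the clock
        return (now - self.failing_since) >= self.grace_period

alerter = DowntimeAlerter()
print(alerter.record_check(ok=False, now=0))      # False: just started failing
print(alerter.record_check(ok=False, now=120))    # False: still inside the grace period
print(alerter.record_check(ok=False, now=301))    # True: down for over 5 minutes, alert
```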
So this is a pretty common cascading failure scenario. Even ignoring CPU limits, if your service gets slow when it's over capacity, this will almost always happen. Latency increases to the point where liveness probes fail, causing the size of the fleet to decrease because of liveness-induced restarts, causing the other replicas to experience more load, causing them to become slow enough to fail liveness probes, and soon enough, everything is dead.
Kubernetes can only do so much for you here. Liveness probes are designed to restart categorically broken software; for example, a case where a combination of two requests causes no further requests to be handled. Maybe that's rare enough that a simple restart is an improvement over a replica that times out all requests directed at it. (You can fortunately see this behavior in real-world scenarios. You can also architect your application to self-check, of course, but the common "if path == '/healthz' { response.WriteHeaders(200) }" isn't this.) Readiness probes can shed load, but only by shifting it onto the other replicas: they take this replica's endpoints out of the service until things calm down. If the system as a whole doesn't have enough capacity, then picking one replica and saying "you can rest for 5 minutes" is just going to cause the other replicas to become overloaded and the whole system to eventually fail.
There are other techniques here that work better.
Rate limiting is very common inside Big Tech; when a calling service induces too much load, it's told to simply go away via a fast path. That can prevent the thundering herd by allowing a % of requests to make progress, while other requests are rejected. Some progress is made while the system is degraded, and if there is spare capacity and a buffer, eventually the buffer is drained. (This post is too long to rant about buffering in distributed systems and what backpressure is, but if a buffer size of 1 can become full, then a buffer of any size can become full. So buffering is rarely a solution, but often the cause of outages.)
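A bare-bones token bucket along those lines; the per-caller bucketing, rates, and the 429 response are assumptions for illustration, not any particular vendor's implementation:

```python
import time

class TokenBucket:
    """Admit a bounded rate of requests per caller; everything else gets a
    cheap, fast rejection instead of queueing up and sinking the service."""
    def __init__(self, rate_per_sec, burst):
        self.rate = rate_per_sec
        self.capacity = burst
        self.tokens = burst
        self.last = time.monotonic()

    def allow(self):
        now = time.monotonic()
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False  # fast path: tell the caller to go away (e.g. HTTP 429)

buckets = {}  # one bucket per calling service

def handle(caller, work):
    bucket = buckets.setdefault(caller, TokenBucket(rate_per_sec=100, burst=20))
    if not bucket.allow():
        return "429 Too Many Requests"
    return work()

print(handle("reporting-batch-job", lambda: "200 OK"))
```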
Circuit breaking is also common, where when a significant fraction of requests end with 5xx (usually a timeout), the load balancer just fast-paths a 5xx response for that replica's share of requests. This actually reduces load on the system, allowing it to process some requests instead of becoming a fleet of replicas in CrashLoopBackoff.
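A stripped-down circuit breaker in the same spirit; the thresholds and cooldown are placeholder numbers:

```python
import time

class CircuitBreaker:
    """After enough consecutive failures, fail fast for a cooldown period
    instead of sending more traffic at a replica that is already drowning."""
    def __init__(self, failure_threshold=5, cooldown_seconds=30):
        self.failure_threshold = failure_threshold
        self.cooldown = cooldown_seconds
        self.failures = 0
        self.opened_at = None

    def call(self, fn):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.cooldown:
                # Fast path: reject without touching the struggling backend.
                raise RuntimeError("circuit open: failing fast")
            self.opened_at = None   # cooldown elapsed, let a probe request through
            self.failures = 0
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.monotonic()
            raise
        self.failures = 0
        return result

breaker = CircuitBreaker()
print(breaker.call(lambda: "200 OK"))
```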
CPU limits are another complicating factor, but not much of one. Every piece of software runs with a CPU limit; only a finite number of CPUs can fit in your data center, or the Universe for that matter. A common problem that people run into is multithreaded software that doesn't understand that it's CPU limited. This does not cause failures, but typically induces a weird tail latency. CPU limits are enforced at discrete intervals; every 100ms, you're allowed to use 1 CPU. But you're also allowed to use 10 CPUs every 10ms, and sit idle for 90ms. (The system will enforce this; you may want to do work on 10 CPUs, but you're going to sleep after that first 10ms burst.) Usually, your system can be architected with CPU limits in mind; for example, by setting something like GOMAXPROCS to the CPU limit instead of the number of physical CPUs, avoiding the ability to consume the time allotted before the accounting interval ends. But, these mistakes very rarely lead to cascading failure, just very confusing 99.9%-ile latency numbers when under load, and a request spans that forced-idle interval.
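One way to make a process size itself to its quota rather than the machine's core count, assuming a cgroup v2 container; the fallback behaviour here is an assumption:

```python
import os

def cpu_limit_from_cgroup(path="/sys/fs/cgroup/cpu.max"):
    """Return the effective CPU limit under cgroup v2, or the host CPU count
    if no quota is set. cpu.max holds '<quota_us> <period_us>' or 'max <period_us>'."""
    try:
        quota, period = open(path).read().split()
        if quota == "max":
            return os.cpu_count()
        return max(1, int(int(quota) / int(period)))
    except (FileNotFoundError, ValueError):
        return os.cpu_count()

# Size the worker pool to the quota (the same idea as setting GOMAXPROCS in Go),
# so the process never tries to burn 10 CPUs in the first 10ms of a 100ms period.
WORKERS = cpu_limit_from_cgroup()
print(f"running with {WORKERS} workers")
```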
Anyway, I have laid all of this foundation so I can get to my rant. There are a lot of "Kubernetes best practices" out there, and two that I have run into are that all applications must have a liveness probe, and that all applications must run at a Guaranteed QoS (and have cpu request == cpu limit != 0). These are interesting things to think about, but not a guaranteed way to enhance reliability (or lower cost). Your workload might be burstable, in which case a Burstable QoS might be exactly what you need; you trade reliability (a guarantee that all containers will be able to use a certain amount of CPU) for efficiency (you can dip into foobar service's CPU shares when barbaz needs to do a rare high-CPU activity). Liveness probes can be good too, where you have a single-threaded event loop that can get wedged accidentally, and restarting is the only way out. But, neither practice can be blindly applied to every workload that can be run in a container.
I think this article is missing the forest for the trees.
The article is about finding the appropriate sensitivity of alerts on some signal in order to maximize the predictive value.
But you should care more about the quality of the signals you are monitoring than about the sensitivity of your thresholds.
The article mentions load-average as an example signal, but to me, that's a poor signal to monitor. Instead, if your SLO is defined for error rate, alert on error rate.
Alerts on your SLO will have a high predictive value for predicting violations of your SLO, by definition. The tunable parameter here is the time window, not the threshold. E.g. if your error budget is defined for a 30d window, you may want alerts at the SLO threshold for 24h and 1h windows.
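A sketch of that multi-window idea, assuming you can already query the error rate over an arbitrary trailing window; the 1h/24h pairing follows the comment above, and the budget math is deliberately simplified:

```python
SLO_TARGET = 0.999                 # e.g. 99.9% success over a 30-day window
ERROR_BUDGET = 1 - SLO_TARGET      # 0.1% of requests may fail

def should_alert(error_rate_1h, error_rate_24h):
    """Page only when both the short and long windows are burning budget
    faster than the SLO allows: the 1h window catches it quickly, the 24h
    window confirms it isn't a blip."""
    return error_rate_1h > ERROR_BUDGET and error_rate_24h > ERROR_BUDGET

print(should_alert(error_rate_1h=0.01, error_rate_24h=0.0002))  # False: brief spike
print(should_alert(error_rate_1h=0.01, error_rate_24h=0.004))   # True: sustained burn
```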
> But you should care more about the quality of the signals you are monitoring than about the sensitivity of your thresholds.
This is so true. Case in point: Growatt inverters have, like every other inverter, a maximum voltage on the grid connection at which they will shut down. They're pretty trigger-happy about this and fail to take into account the resistance of the feed wire between the inverter and the (much lower impedance) grid hookup. As a result, even on cabling sized properly for the interconnect, they tend to trigger falsely well before the point where they should. The only way to avoid this problem is to either hack into the inverter somehow (which I've so far failed to do) or to use oversized cables (which isn't always an option).
The sensitivity is fantastic; the quality of the signal is hopeless. Obviously they err on the side of caution, but the margin is so ridiculously large that you end up losing a lot of usable power for no reason at all. At the very least it should allow a resistance for the interconnect to be specified, so that it can take into account the voltage drop across that wire, which at 10 A is appreciable even for short runs of fairly beefy cable.
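Back-of-the-envelope for how big that ignored drop can be; the cable length, cross-section, and grid voltages here are assumed numbers, not Growatt specifics:

```python
# Voltage drop across the feed wire: V_drop = I * R, with R covering both
# conductors (out and back).
RHO_COPPER = 0.0175           # ohm * mm^2 / m, resistivity of copper
length_m = 15                 # one-way cable run (assumed)
cross_section_mm2 = 2.5       # cable gauge (assumed)
current_a = 10

r_cable = RHO_COPPER * (2 * length_m) / cross_section_mm2
v_drop = current_a * r_cable

print(f"cable resistance: {r_cable:.3f} ohm, drop at {current_a} A: {v_drop:.2f} V")
# ~2 V of drop means the inverter can see ~255 V while the grid hookup is still
# around ~253 V, so a naive overvoltage cutoff trips even though the grid is fine.
```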
It's a constant pain of mine trying to get people to stop sending business-as-usual or "successfully completed $PROCESS" emails from our batch processes at work. They absolutely drown my inbox, so I'm forced to filter them, and then the actual failures get buried in the unchecked "batch spam" folders.
I had a boss who had an inbox with literally hundreds of thousands of unread emails. A good chunk of those emails were "success" messages from batch processes.
It's quite correct to send a "success" message when a batch process is completed successfully, but it's quite wrong to send that message to a human. It should be sent to a machine that should translate a missing success message into an error message/alert for humans to respond to.
For example, I have a set of nightly backup jobs. The last step of each backup process is to send a success message to my monitoring system. I only get a "Missing Backup" alert when the monitoring system detects that it didn't receive the success message it expected for a particular backup.
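A minimal version of that missing-success pattern; the job names and check-in intervals are just examples:

```python
import time

# Each job is expected to check in at least this often; the monitor alerts on
# the ABSENCE of a success message, never on the success messages themselves.
EXPECTED_INTERVAL = {"nightly-backup-db1": 26 * 3600, "nightly-backup-files": 26 * 3600}

last_success = {}   # job name -> unix timestamp of last success message

def record_success(job, now=None):
    last_success[job] = time.time() if now is None else now

def missing_jobs(now=None):
    now = time.time() if now is None else now
    overdue = []
    for job, interval in EXPECTED_INTERVAL.items():
        last = last_success.get(job)
        if last is None or now - last > interval:
            overdue.append(job)
    return overdue

record_success("nightly-backup-db1", now=0)
print(missing_jobs(now=3600))        # ['nightly-backup-files']: never checked in
print(missing_jobs(now=30 * 3600))   # both overdue -> raise a "Missing Backup" alert
```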
My old boss didn't seem to understand the concept that people don't generally notice missing messages. Or he was too lazy/incompetent to use a monitoring system that could translate gaps in successes into errors.
Even that is utterly unnecessary, because we use ControlM for basically all of the batch work in my area that I know of, and there's already automation that opens an Incident on a job failure that can flow into the whole on-call system! If a job or cycle is critical and needs to finish by a certain time, you can set up messages to go out at that time and everything.
My pet peeve is these $PROCESS notifications that go to slack channels. I worked at a company that had an #engineering_humans slack channel because we got chased out of #engineering by bots.
I'm fine if they go to THEIR OWN slack channel. Then I can mute or leave that channel.
Of course, it's a different problem if those notifications have a mix of actionable and non-actionable messages (e.g. both success and error messages). Then it's a signal/noise problem.
The one that pushes my buttons is alarms that have no docs attached, so when they go off at 2 AM, they just get muted until someone comes in and complains at 6 AM.
I need to sit down and go through the math again; I got lost in the middle somewhere. All I know is our alerts are now so noisy that they are useless.
Yes! This. This has happened to me at at least two previous companies I have worked at. Everybody sets up thresholds on every possible Datadog metric and alerts become useless. Avoiding that is part of the ethos of monitoring at my current company. We only set up alerts through https://heiioncall.com/ for things we are convinced you absolutely need to look at right now. Anything that is not that gets shoved to a slack channel (that I have long since muted).
I dunno, the article doesn't seem to want me to understand. It's just another "here's a random stats calculation you can't perform in your head; isn't English a bad way to describe this calculation?!?! Your intuition sucks when I don't explain myself....."
Both of these are exactly the kind of problem where our AI future is going to deliver cost effective modern alternatives. Primitive sensors wake up more sophisticated analyzers and use deep sensors (including video) to determine if there is a real problem.
Witness companies like Rivian triggering car alarms on aggressive behavior detected from ML on video. Don't even need to touch the car.
In any case, not all signals are the same. Most systems have a lot of interacting components, and what turns out to be dangerous is usually a combination of factors; in the end, what determines whether something really was dangerous is whether the system is doing what it should. You can put in some guessed thresholds, but you must check them against whether the system actually works.
And they should be actionable too, at least for alerts, as opposed to slow-day notifications or metrics that give context to perceived problems and could take the guessing out of the thresholds.
I'd like to know more about the chip designer who, perhaps unwittingly, created the alarm-filled soundscape of most American cities: https://youtu.be/tmCnleSBAIg. I would especially love to learn about the composition process that went into it.
Now, if you're annoyed by the false positive rate on your actual smoke alarms, go replace the one nearest your kitchen with a photoelectric type, not the standard ionization type that's cheaper, the default style installed, and ought to be illegal in homes (IMO).
There's been quite a bit of research done, generally easy to find if you look, that talks about the difference and tests them, but the short summary:
- Ionization type sensors detect the products of fast flaming combustion and "things cooking in the kitchen." Your oven, if a bit dirty, will reliably trip an ionization type. They are quick on the draw for this. The downside is that they're very, very poor at detecting the sort of slow, smoking, smoldering combustion that is associated with house fires that kill people in the middle of the night.
- The photoelectric type is very good at detecting smoke in the air - but it isn't nearly as prone to false triggers on ovens, a burner burning some spills off, etc.
They've been A/B tested in a wide variety of conditions, and in some cases, the ionization type is a bit quicker. In other cases, the ionization type is slower, by time ranges north of half an hour - I've seen some test reports where there was a 45 minute gap, while the photoelectric type was going off, before the ionization type fired!
In general, "rapid fires during the day" are somewhat destructive to property, but rarely kill people. If your kitchen catches on fire while you're cooking, it may burn the house down, but generally people are able to get out.
The fires that kill people are "slow starting fires during the night" - the sort that smolder for potentially hours, often slowly filling the house with toxic smoke, before actually bursting into open flames. On this sort of fire, the photoelectric type will fire long, long before the ionization type - in some cases, they get around to alarming quite literally "after the occupants are dead from the smoke."
Using smoke alarms as a way to talk about monitoring systems is nice, but in terms of actual smoke detectors, get at least a few photoelectric sorts in the main areas of your home.
Do not get the "combined sensor" sort, since these tend to be and-gated and the worst of both worlds.
> Full-scale fire tests are carried out to study the effectiveness of the various types of smoke detectors to provide an early warning of a fire. Both optical smoke detectors and ionization smoke detectors have been used. Alarm times are related to human tenability limits for toxic effects, visibility loss and heat stress. During smouldering fires it is only the optical detectors that provide satisfactory safety. With flaming fires the ionization detectors react before the optical ones. If a fire were started by a glowing cigarette, optical detectors are generally recommended. If not, the response time with these two types of detectors are so close that it is only in extreme cases that this difference between optical and ionization detectors would be critical in saving lives.
Where does the law require both types? I'm not aware of any housing codes specifically requiring photoelectric types, and any house I've looked at, including mine, came with purely ionization types. Though it's been a few years, and it may have changed recently - this is less of a niche concern lately.
As for dual sensors and gating... do you actually trust your life to "nobody will admit what algorithm they use"?
My house has all the smoke detectors wired together (they're on an AC circuit, with battery backup, with a signal line running between them all), so I have some photoelectric and some ionization, depending on where in the house they are.
Oh, that I happen to know something about: you're making stuff up. I've got NFPA 72 right here if you want to point out where this alleged requirement for a "multi-sensor" detector exists.
In fact there are specific requirements to use only single sensor type alarms - such as near cooking equipment.
You can go to the NFPA website, though, where the public-facing pages note that they are "recommended". They are not and have never been required.
Then call yourself a fool since you must ignore at least one recommendation as they are conflicting.
This is not even an official recommendation in the code, just some stupid public education NFPA website, which doesn’t carry the same weight and for good reason. If ionization alarms are dumb enough that the Europeans or anywhere else in the world including the IAFF don’t recommend them at all, I’m ok with just following that. Dual sensor alarms have been shown in real world testing to perform worse. I have seen no evidence they perform better, but I have seen the opposite.
Fire codes and electrical codes are as much driven by industry (both union labor and manufacturers) lobbying in the US as much as actual good evidence based practice. Someone has stuff to sell, that is all. About 20 years ago when the sensible big push was made to migrate to photoelectric alarms it wasn’t long after that a new money-making opportunity was seen by now selling these dual contraptions.
“ In June 2014, tests by the Northeastern Ohio Fire Prevention Association (NEOFPA) on residential smoke alarms were broadcast on Good Morning America program. The NEOFPA tests showed ionization smoke alarms were failing to activate in the early, smoldering stage of a fire. The combination ionization/photoelectric alarms failed to activate for an average of over 20 minutes after the stand-alone photoelectric smoke alarms. This vindicated the June 2006 official position of the Australasian Fire & Emergency Service Authorities Council (AFAC) and the October 2008 official position of the International Association of Fire Fighters (IAFF). Both the AFAC and the IAFF recommend photoelectric smoke alarms, but not combination ionization/photoelectric smoke alarms.”
From the IAFF:
Which one should you buy? The International Association of Firefighters (IAFF), the largest firefighter’s union in the US and Canada has adopted an official position recommending only Photoelectric Smoke Detectors and has stated that dual sensor alarms are no longer acceptable. The technology used in Ionization Smoke Detectors creates a delayed warning in smoldering fires which can lead to loss of life. Photoelectric Smoke Alarms are more effective at warning of smoke from smoldering fires and are less susceptible of nuisance alarms. The IAFF recommends replacing all ionization, dual sensor and unknown alarms with photoelectric smoke alarms.
Notably Iowa fire code had required dual sensor alarms and had to back pedal that last year. Apparently NFPA public outreach hasn’t gotten the memo.
> Either alone will detect less than half of all house fires.
This is nonsense. Ionization sensors may detect certain fires seconds earlier according to NIST testing, and those are not even the deadliest types of house fires. No reputable body would recommend PE sensors only if that were true.
> When presented with this tradeoff, the path of least resistance is to say “Let’s just keep the threshold lower. We’d rather get woken up when there’s nothing broken than sleep through a real problem.” And I can sympathize with that attitude. Undetected outages are embarrassing and harmful to your reputation. Surely it’s preferable to deal with a few late-night fire drills.
> It’s a trap.
> In the long run, false positives can — and will often — hurt you more than false negatives. Let’s learn about the base rate fallacy.
Not sure about anyone else, but speaking of alarms, this style of writing trips my "self-promoting snake-oil Internet bullshitter" alarm. It's like nails on a damn chalkboard, and if you're writing like this, you've already lost me; however, maybe I ought not be pointing that out, since signals are nice to have.
Incidentally, I wasn't sure which way the author was gonna go with the core analogy. My smoke alarms have false-alarmed probably 10x as much as my car alarm, even counting times one of us has hit the alarm button on the fob by accident. I've certainly never been so annoyed by my car alarm that I've ripped it out and stuck it in a freezer, as I have with a smoke alarm.
(If I were writing like the author I suppose that last part would have read:
"I've certainly never been so annoyed by my car alarm that I've ripped it out and stuck it in a chest freezer.
I have, with a smoke alarm."
Except also I'd have found a way to use "we" and "you" a bunch.)
I see a lot of this style of writing in articles submitted on HN. I think they are just trying to make the writing more lively, not trying to BS.
A trope of this style is “{interesting half story} but more on that later”.
I don’t think it is a big deal and I don’t see much self promotion here other than vanilla blogging, i.e. sounds like this person is knowledgeable let’s check their bio.
I'm not sure what you are responding to in the quoted text, but after reading the article I think I can assure you that the author isn't selling you anything more salacious than you would find in a more interesting introduction-to-probability-and-statistics lecture.
Short, choppy sentences, lots of second-person, dropping a "punch-line" sentence to its own paragraph like they're a fucking magician revealing the card you pulled earlier. It's some kind of cross between transparent rapport-building sales-psychology crap and setting off a fireworks display to celebrate your successfully assembling a PB&J.
Like listening to a used car salesman tell a mundane story about their morning commute.
But full of unearned and over-the-top dramatic pauses.
Nothing in the article is wrong, per se, but it all seems awfully disconnected from the realities I see in monitoring and alerting?
The end advice is right: you want to build the smoke detector, not the car alarm. But … getting that done, now that's the trick. If the org has car alarms, that's the same org whose PMs will not see the "impact" of that ticket to get the monitoring made ship shape, and that ticket will be backlog-icebox-graveyard'ed.
I've had to work on getting a number of "technical" non-technical roles to see that, no, the monitoring software cannot automatically generate¹ metrics around your application. Yes, you have to actually instrument the code to add those!
Then that gets combined with systems that are just … not the best at their job. Datadog has so many rough corners around statistics: graphs will alias, graphs will change shape depending on zoom, units are a PITA or nowhere to be seen, etc. Sumo has a god-awful UI (literally tabs inside tabs, and I can't copy the URL?!) and barely understands structured logging. Splunk is marginally better. Pagerduty permits only the simplest handling of events; don't limit me to a handful of tailored rules: rules are logic, logic is a function: give me WASM. And I want usable business-hours-only alerts².
Self-hosted systems are perpetually met with "that's not our core focus", but nobody ever seems to convert the cost of managed monitoring/alert systems into "number of FTE that could be hired to maintain a self-hosted system".
(Oddly, the example in the article is a car alarm. Load avg. is, IMO, a useless metric. Better to measure CPU consumption and IOPS consumption separately, or, probably better, more derivative stats around the things doing the IOPS/CPU; see the sketch at the end of this comment.)
¹Yes, most systems come with some collectors to get system-level stuff like CPU usage, etc. I mean metrics specific to your application.
²PD claims to support them, but in practice they don't work: alerts received off-hours don't alert, true … but they never alert once business hours resume, either! If you're in an org trying to dig itself out of a mess, you need them to not die in the low-prio pile.
(Ugh. Give me ACL systems in these systems that don't suck: PD locks the routing rules behind like "Admin", and security doesn't want to grant the rank and file "Admin", and so 80% of my devs have no idea how the system works because they're not allowed to see how the system works! Give me the ability to do a WMA business-days-only line for diurnal patterns! The list just goes on and on…)
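On the load-average point above, a sketch of measuring CPU and disk IOPS as separate signals with psutil; the sampling interval is arbitrary:

```python
import psutil  # third-party: pip install psutil

def sample(interval_s=10):
    """Report CPU utilization and disk IOPS as separate signals rather than
    the blended, queue-length-ish number that load average gives you."""
    io_before = psutil.disk_io_counters()
    cpu_pct = psutil.cpu_percent(interval=interval_s)   # blocks for the interval
    io_after = psutil.disk_io_counters()

    iops = ((io_after.read_count - io_before.read_count) +
            (io_after.write_count - io_before.write_count)) / interval_s
    return cpu_pct, iops

cpu, iops = sample(interval_s=5)
print(f"cpu: {cpu:.1f}%  disk: {iops:.0f} IOPS")
```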