More

pokoleo · 2025-03-18T03:24:17 1742268257

From my experience working on SaaS, and improving ops at large organizations, I've seen that "on-call culture" often exists inversely proportional to incentive alignment.

When engineers bear the full financial consequences of 3AM pages, they're more likely to make systems more resilient by adding graceful failure modes. When incident response becomes an organizational checkbox divorced from financial outcomes and planning, you get perpetual firefighting.

The most successful teams I've seen treat on-call like a leading indicator - every incident represents unpriced technical debt that should be systematically eliminated. Each alert becomes an investment opportunity rather than a burden to be rotated.

Big companies aren't missing the resources to fix this; they just don't have the aligned incentive structures that make fixing it rational for individuals involved.

The most rational thing to do as an individual on a bad rotation: quit or transfer.

DanHulton · 2025-03-18T05:03:14 1742274194

This assumes that the engineers in question get to choose how to allot their time, and are _allowed_ to spend time to add graceful failure modes. I cannot tell you how many stories I have heard of, and companies I have directly worked at, where this power is not granted to engineers, and they are instead directed to "stop working on technical debt, we'll make time to come back to that later". Of course, time is never found later, and the 3am pages continue because the people who DO choose how time is allocated are not the ones waking up at 3am to fix problems.

nijave · 2025-03-18T06:06:51 1742278011

Definitely an issue but I think there's a little room for push back. Work done outside normal working hours is automatically the highest priority, by definition. It's helpful to remind people of that.

If it's important enough to deserve a page, it's top priority work. The reverse is also true (if a page isn't top priority, disable the paging alert and stick it on a dashboard or periodic checklist)

whstl · 2025-03-18T09:14:55 1742289295

You're right, but it's still outrageous that engineers need to burn political capital in order to have proper sleep and avoid burnout.

nijave · 2025-03-18T20:46:18 1742330778

Agreed.

tacticus · 2025-03-18T03:54:13 1742270053

IMO it's when the incident response and readiness practice imposes a direct backpressure on feature delivery that you get the issues actually fixed and a resilient system.

if it's just the engineer while product and management see no real cost then people burn out and leave.

> The most successful teams I've seen treat on-call like a leading indicator - every incident represents unpriced technical debt that should be systematically eliminated. Each alert becomes an investment opportunity rather than a burden to be rotated.

100%

hinkley · 2025-03-18T16:29:03 1742315343

The people who do the real work don’t get raises and promotions because the annual review system punishes them for doing the right thing.

rednafi · 2025-03-18T04:12:52 1742271172

> When engineers bear the full financial consequences of 3AM pages, they're more likely to make systems more resilient by adding graceful failure modes.

Making engineers handle 3 AM issues caused by their code is one thing, but making them bear the financial consequences is another. That’s how you create a blame-game culture where everyone is afraid to deploy at the end of the day or touch anything they don’t fully understand.

cyberax · 2025-03-18T06:42:43 1742280163

"Financial consequences" probably mean "the success of the startup, so your options won't be worth less than the toilet paper", rather than "you'll pay for the downtime out of your salary".

mrkeen · 2025-03-18T07:39:55 1742283595

Engineers don't pick their work, management does.

A manager no longer needs to choose between system reliability and churning out new features with on-call:

The manager can get all the credit for pushing out new features during the day, and sleep well at night knowing that the engineers aren't.

procaryote · 2025-03-18T11:04:52 1742295892

At a lot of companies engineers are involved in picking the work. It's silly to hire competent problem solvers and treat them as unskilled workers needing micro-management.

Besides, if you set the on-call system up so people get free time the following day to compensate for waking up at night, the manager can't pretend there's no cost.

Bad management will fail on both of these of course, but there's no saving that beyond finding a better company.

whstl · 2025-03-18T12:51:32 1742302292

It is silly indeed but unfortunately this is what happens in companies that don’t have a good engineering culture.

northern-lights · 2025-03-18T07:03:47 1742281427

This assumes that the engineers who wrote the code that caused the 3 AM pages will still be around to suffer the consequences of the 3 AM pages. This is a lot of times, not true, especially in an environment which fostered moving around internally every now and then. Happens in at least one of the FAANGs.

meeshmuesh · 2025-03-18T13:28:28 1742304508

Sounds like Amazon

adrianN · 2025-03-18T05:55:13 1742277313

Minimizing 3am pages is good for engineers but it is not necessarily the best investment for the company. Beyond a certain scale it is probably not a good investment to try to get rid of all pages.

mook · 2025-03-18T07:15:12 1742282112

By that point wouldn't it start to make sense to have people across time zones so that it will be working hours somewhere?

pokoleo · on April 3, 2024

Needs (2015)

wilsonjholmes · on April 4, 2024

Noted.

wilsonjholmes · on April 4, 2024

Unfortunately, I forgot that title editing seems to time out... I am no longer able to edit title.

pokoleo · on Nov 20, 2023

Wouldn’t be the first. See also Ecuador, El Salvador, Panama, and Bonaire (to name a few)

thriftwy · on Nov 20, 2023

This starts to sound like a for-profit scheme of the USA.

Having problems with debt? Loan a LatAm country $100B to fuel their economy. Does wonders to balance sheets.

charlieyu1 · on Nov 20, 2023

Compared to China loaning to African countries to buy UN votes? It works wonders for China.

thriftwy · on Nov 20, 2023

China loaning USD to Africa is even better for the US. They get to consume the equivalent of that amount of cash - it will not be redeemed for some time now.

pokoleo · on Nov 8, 2023

The usual answer to a DDoS from some users is to add rate limiting, not “remove the service”

pokoleo · on May 19, 2023

I've been using Colemak for a little bit more than a decade now. Super happy with it.

I switched while interning at a ~failing startup. I was a Canadian in the US, and had forgotten to plan to do stuff over Thanksgiving weekend. I had nothing to do, so I switched to Colemak over the weekend. I spent the weekend doing typing training videos, then spent the remaining ~1mo of my co-op term working (almost) entirely in Colemak. I wouldn't switch back to qwerty without a really compelling reason.

Years later, I'm super happy. I can use QWERTY under duress, but rather not.

SkyPuncher · on May 19, 2023

I’m in basically the same situation. The minor annoyance of key bindings is worth it for significantly reduced hand strain and slightly faster typing speeds.

pokoleo · on March 2, 2023

It's a nice way to say "Hello there :)"

pokoleo · on Dec 20, 2022

Can you cite a source on the Japanese connection? I don't see anything obvious in their Wikipedia pages.

* https://en.wikipedia.org/wiki/The_Economist

* https://en.wikipedia.org/wiki/The_Economist_Group

I do see Exor (Agnelli), Rothschild, Cadbury, and Schroders families, but nothing clearly Japanese, nor Japanese far-right.

unity1001 · on Dec 21, 2022

Sorry, that was the Financial Times. Regardless of that tidbit, the other statements still hold true. Unfortunately.

pokoleo · on July 7, 2022

I once interned at an easily google-able Secondlife competitor. They fought against NSFW content for a long time, but then figured out how to fix it by:

1. Incentivizing users (with in-game currency) for reporting NSFW content, and 2. Restricting NSFW content to only people who bought an all-access pass (ID verified at time of purchase)

This opened up a new revenue stream for the company, and dealt with the NSFW content in one swoop.

happyopossum · on July 7, 2022

A) I don’t think #2 would be A good idea for a kids game, and B) kids will absolutely start to game #1 with shill accounts and you may well wind up increasing the amount of ‘evil stuff’ as kids bring it to the platform for the sole purpose of reporting it to get Robux/swag.

mock-possum · on July 8, 2022

AKA The Cobra Effect: https://en.wikipedia.org/wiki/Perverse_incentive#The_origina...

zkldi · on July 7, 2022

Sounds like it'd just incentivise people to make NSFW content on burners and then report it.

pokoleo · on April 30, 2022

Soft deletion always feels at odds with privacy-related "right to have data deleted" laws.

Would be super interested in a technical writeup on how they do this.

h1fra · on April 30, 2022

"Right to have data deleted" can be 'circumvented' if the data is critical part of the system or is needed for legal purpose (for example it can be mandatory to keep 1 year of IP logs and data associated with it)

In previous companies I have worked for, we did instant soft-delete, then hard anonymisation after 15-30days and then hard delete after a year. That means the data was not recoverable for customer but could still be recovered for legal purpose.

baskethead · on April 30, 2022

There's a time period before which you need to permanently delete the data. A soft delete will allow you to delete the data quickly and you can see what happens. If everything is okay you can then purge your database of all soft deleted data.

Jcowell · on April 30, 2022

It shouldn’t be. These laws at least have the nuance to understand that data can’t be immediately deleted from Backups and that in such instances where deletes are complicated the customer is notified.

jasonwatkinspdx · on April 30, 2022

IANAL but the laws have carve outs for backup retention, etc.

A simple technical solution is to store all data with per user encryption keys, and then just delete the key. This obviously doesn't let you prove to anyone else that you've deleted all copies of the key, but you can use it as a way to have higher confidence you don't inadvertently leak it.

notreallyserio · on April 30, 2022

Ideally they'd encrypt the customer content with a key provided by the customer and destroyed when the customer requests account deletion. The customer would still be able to use their key to decrypt backups that they get prior to the request. If the customer changes their mind, they just upload the key again (along with the backup, if necessary).

Of course, this means trusting Atlassian to actually delete the key on request, but there's not much reason for them not to.

stubish · on April 30, 2022

Restoring data from backup is the most common data recovery technique. Lots of information there to start from if you are interested in how data recovery relates to privacy laws.

pokoleo · on April 30, 2022

Will's a great communicator, and quite personable: I know this because I've worked with him for a few years. I don't think it's bravado, SRE is a common industry term.

XorNot · on April 30, 2022

So are all sorts of terms, but in the sciences at least the rule is still that you introduce the full term and its acronym the first time it is used, before switching to the acronym for the rest of the paper. Sure, everyone doing electrochemistry knows what EIS means, but you still write "electrochemical impedance spectroscopy (EIS)" before you go on.

You can adjust the strictness for audience and intent perhaps, but it's still sloppy not to.

mwt · on April 30, 2022

And on your blog you can choose to use common abbreviations like OS, BIOS, and GPS without spelling them out.