I wish posts like this would explore the relative savings rather than the absolute. On its own, that saving doesn't tell me much; taken to the extreme, you could just not run the service at all and save all the time. That's a tongue-in-cheek example, but in context: is this saving a big deal, or is it just engineering looking for small efficiencies to justify their time?
I'm the author of the post. You raise a good point about relative savings. Based on last week's data, our change reduced the task time by 40ms from an average of 3440ms, and this task runs 11 million times daily. This translates to a saving of about 1% on compute.
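For anyone who wants to check the arithmetic, here's a quick back-of-envelope using only the numbers above (assuming the 40ms saving applies uniformly across all runs):

    # All inputs are from the figures above.
    saved_ms_per_run = 40
    avg_ms_per_run = 3_440
    runs_per_day = 11_000_000

    relative_saving = saved_ms_per_run / avg_ms_per_run          # ~1.16%
    saved_s_per_day = saved_ms_per_run * runs_per_day / 1_000    # 440,000 s
    saved_days_per_day = saved_s_per_day / 86_400                # ~5.1 compute-days/day

    print(f"{relative_saving:.2%}, {saved_s_per_day:,.0f} s/day, "
          f"{saved_days_per_day:.1f} compute-days per day")

So the ~1% relative figure and an absolute figure of roughly 5 days of compute saved per day both follow from the same three inputs.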
> This translates to a saving of about 1% on compute.
Does this translate to any tangible savings? I'm not sure what the Checkly backend looks like, but if tasks run on a cluster of hosts rather than being invoked per-task, it seems hard to realize savings. Even per-task, 40 ms can only be realized on a service like Lambda that bills per millisecond; ECS's minimum billing unit is 1 second, afaik.
I think that's a flawed analysis. If you're running FaaS, then sure, you can fail to see benefit from small improvements in duration (AWS Lambda changed its billing resolution from 100 ms to 1 ms a few years back; before that, the faster Go services didn't save much money despite being quicker). But if you're running thousands of requests and speeding them all up, you should be able to realize tangible compute savings whatever your platform.
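To make the billing-granularity point concrete, here's a toy sketch (the round-up model is a simplification of per-invocation billing; the before/after durations are the author's numbers from upthread):

    import math

    def billed_ms(duration_ms: float, granularity_ms: float) -> float:
        """Round one invocation's duration up to the billing granularity."""
        return math.ceil(duration_ms / granularity_ms) * granularity_ms

    before, after = 3440, 3400  # ms, per the author's numbers

    # Per-millisecond billing (Lambda since late 2020): the 40 ms is billable.
    print(billed_ms(before, 1) - billed_ms(after, 1))        # -> 40

    # Per-second billing: both durations round up to 4000 ms, so the
    # shaved 40 ms never reaches the bill for any single invocation.
    print(billed_ms(before, 1000) - billed_ms(after, 1000))  # -> 0

On a shared autoscaled fleet the mechanism is different: the saving isn't realized per invocation but as freed capacity across all runs, which lets the autoscaler hold fewer instances.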
Help me understand, then. If this is being done on an autoscaling cluster, I can see it, but if you're just running everything on an always-on box, for instance, it's less clear to me.
edit: Do you have an affiliation with the blog? I ask because you have submitted several articles from Checkly in the past.
Hey, Checkly founder here. We've changed our infra quite a bit over the last ~1 year, but it's still mostly ephemeral compute. We actually started on AWS Lambda; we're now on a mix of AWS EC2 and EKS, all autoscaled per region (we run 20+ regions).
It seems tiny, but in aggregate this will have an impact on our COGS. You're correct that if we had a fixed fleet of instances, the impact would not have been very interesting.
But still, for a couple of hours spent, this saves us a few thousand dollars per year.
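As a purely illustrative back-of-envelope (the on-demand rate and the one-full-vCPU-per-task assumption are placeholders, not our actual COGS):

    # Hypothetical pricing; only the 40 ms x 11M/day figures come from the post.
    saved_s_per_day = 0.040 * 11_000_000                  # 440,000 compute-seconds/day
    vcpu_hours_per_year = saved_s_per_day / 3_600 * 365   # ~44,600 vCPU-hours
    assumed_usd_per_vcpu_hour = 0.05                      # ballpark on-demand rate
    print(f"${vcpu_hours_per_year * assumed_usd_per_vcpu_hour:,.0f} per year")
    # -> roughly $2,200/year if each task pins a full vCPU; scale to taste

That lands in the low thousands of dollars, consistent with the ballpark above.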
The units seem off in any case: 40 ms saved across 11 million daily runs is about 440,000 seconds, i.e. roughly 5 days of compute saved per day, which is actually much more impressive.
If we think about the business impact, we don't usually think of compute expenditure per day, so you might reasonably say the fix saves around five years of compute annually. Looks better in your promotion packet, too.
I often ask myself the same question. We have some user-facing queries that slow the frontend down. I've fixed some of the slowness, but it's definitely not a priority. I wonder how much speed improvements correlate with increased revenue from happier customers.
Hey, I work at Checkly and asked my coworker (who wrote the post) to give some more background on this. I can assure you, we're busy and this was not done for some vanity prize!
Not a problem, but the OP was asking about the savings!
I, for example, like to dig deeper into insights like relative vs. absolute savings to learn the approaches other engineers take! It's all about which metrics we should care about.
(I'll put this service on my list to try someday; it looks fantastic indeed.)