another factor to consider is that if you have a typical Prometheus `/metrics` endpoint that gets scraped every N seconds, there's a period in between the "final" scrape and the actual process exit where any recorded metrics won't get propagated. this may give you a false impression about whether there are any errors occurring during the shutdown sequence.
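one possible mitigation (my own assumption, not something every setup needs) is to keep `/metrics` up after the shutdown work is done and wait until one more scrape lands before exiting. below is a rough Go sketch of that idea. the `scrapeNotifier` type, the metric name, the port and the 20s cap are all made up for illustration, and the cap has to fit inside whatever termination grace period your orchestrator gives you.

```go
package main

import (
	"context"
	"fmt"
	"log"
	"net/http"
	"os/signal"
	"syscall"
	"time"
)

// scrapeNotifier (hypothetical) wraps a metrics handler and signals
// whenever a scrape finishes while someone is waiting for one.
type scrapeNotifier struct {
	next    http.Handler
	scraped chan struct{}
}

func newScrapeNotifier(next http.Handler) *scrapeNotifier {
	return &scrapeNotifier{next: next, scraped: make(chan struct{})}
}

func (s *scrapeNotifier) ServeHTTP(w http.ResponseWriter, r *http.Request) {
	s.next.ServeHTTP(w, r)
	s.notifyScrape()
}

// notifyScrape does a non-blocking send: the signal is dropped unless
// waitForScrape is currently blocked on the channel.
func (s *scrapeNotifier) notifyScrape() {
	select {
	case s.scraped <- struct{}{}:
	default:
	}
}

// waitForScrape reports whether a scrape finished before the timeout.
func (s *scrapeNotifier) waitForScrape(timeout time.Duration) bool {
	select {
	case <-s.scraped:
		return true
	case <-time.After(timeout):
		return false
	}
}

func main() {
	metrics := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprintln(w, "myapp_shutdown_errors_total 0") // hypothetical metric
	})
	notifier := newScrapeNotifier(metrics)

	mux := http.NewServeMux()
	mux.Handle("/metrics", notifier)
	srv := &http.Server{Addr: ":9100", Handler: mux}
	go func() {
		if err := srv.ListenAndServe(); err != nil && err != http.ErrServerClosed {
			log.Fatal(err)
		}
	}()

	ctx, stop := signal.NotifyContext(context.Background(), syscall.SIGTERM, syscall.SIGINT)
	defer stop()
	<-ctx.Done()

	// ... drain requests, close connections, record shutdown metrics ...

	// Assumes a 15s scrape interval, hence the 20s cap.
	if !notifier.waitForScrape(20 * time.Second) {
		log.Print("no scrape observed before timeout; final metrics may be lost")
	}
	_ = srv.Shutdown(context.Background())
}
```

a scrape that was already in flight when the wait started can still satisfy it, so this narrows the gap rather than closing it completely.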
it's also possible, if you're not careful, to lose the last few seconds of logs from when your service is shutting down. for example, if you write to a log file that is watched by a sidecar process such as Promtail or Vector, and on startup the service truncates and starts writing to that same path, you've got a race condition that can cause you to lose logs from the shutdown.
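as a rough illustration in Go (a sketch, not a complete fix for every sidecar race): opening the log with `O_APPEND` and without `O_TRUNC` means a restart never wipes the lines the previous process wrote. `openLog`, `writeLine`, `simulateRestart` and the file name are hypothetical names for this example, and each `writeLine` call stands in for one process lifetime.

```go
package main

import (
	"fmt"
	"log"
	"os"
	"path/filepath"
)

// openLog opens path for appending and creates it if it does not exist.
// It deliberately omits os.O_TRUNC (which os.Create would set), so lines
// written by a previous process stay in the file for the sidecar to read.
func openLog(path string) (*os.File, error) {
	return os.OpenFile(path, os.O_APPEND|os.O_CREATE|os.O_WRONLY, 0o644)
}

// writeLine stands in for one process lifetime: open, log a line, sync, close.
func writeLine(path, line string) error {
	f, err := openLog(path)
	if err != nil {
		return err
	}
	defer f.Close()
	if _, err := fmt.Fprintln(f, line); err != nil {
		return err
	}
	// Sync asks the OS to flush the write to disk before the process exits.
	return f.Sync()
}

// simulateRestart writes a shutdown line as the "old" process and a startup
// line as the "new" one, then returns the resulting file contents.
func simulateRestart() (string, error) {
	dir, err := os.MkdirTemp("", "applog")
	if err != nil {
		return "", err
	}
	defer os.RemoveAll(dir)

	path := filepath.Join(dir, "service.log") // hypothetical log path
	if err := writeLine(path, "old process: shutting down"); err != nil {
		return "", err
	}
	if err := writeLine(path, "new process: starting up"); err != nil {
		return "", err
	}
	data, err := os.ReadFile(path)
	return string(data), err
}

func main() {
	contents, err := simulateRestart()
	if err != nil {
		log.Fatal(err)
	}
	fmt.Print(contents) // the shutdown line survives the "restart"
}
```

if the startup path used `os.Create` instead, the file would be truncated and the shutdown line would be gone unless the sidecar had already read it.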
JFYI, I'm doing exactly this (and more) in a platform library; it covers the issues I've run into over the 8+ years I've been working on high-load Go apps. During that time, developing and improving the platform and rollouts was a hobby of mine at every company :)
It covers (or will cover) things like "sync the logs"/"wait for ingresses to catch up with the liveness handler"/etc.
The docs are sparse and some things aren't covered yet; however I'm planning to do the first release once I'm back from a holiday.
In the end, this will be a meta-platform (carefully crafted building blocks), and a reference platform library, covering a typical k8s/otel/grpc+http infrastructure.
That’s an artifact of Google’s original Borgmon design. FWIW, in a “v2” system at Google they tried switching to push-only and it went sideways, so they settled on a sort of hybrid pull-push streaming API.
Prometheus doesn't necessarily lock you into the "pull" model, see [0].
however, there are some benefits to the pull model, which is why I think Prometheus does it by default.
with a push model, your service needs to spawn a background thread/goroutine/whatever that pushes metrics on a given interval.
if that background thread crashes or hangs, metrics from that service instance stop getting reported. how do you detect that, and fire an alert about it happening?
"cloud-native" gets thrown around as a buzzword, but this is an example where it's actually meaningful. Prometheus assumes that whatever service you're trying to monitor, you're probably already registering each instance in a service-discovery system of some kind, so that other things (such as a load-balancer) know where to find it.
you tell Prometheus how to query that service-discovery system (Kubernetes, for example [1]) and it will automatically discover all your service instances, and start scraping their /metrics endpoints.
this provides an elegant solution to the "how do you monitor a service that is up and running, except its metrics-reporting thread has crashed?" problem. if it's up and running, it should be registered for service-discovery, and Prometheus can trivially record (this is the `up` metric) if it discovers a service but it's not responding to /metrics requests.
and this greatly simplifies the client-side metrics implementation, because you don't need a separate metrics thread in your service. you don't need to ensure it runs forever and never hangs and always retries and all that. you just need to implement a single HTTP GET endpoint, and have it return text in a format simple enough that you can sprintf it yourself if you need to.
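to make the "sprintf it yourself" point concrete, here's a minimal Go sketch using only the standard library. the metric name `myapp_requests_total`, the `renderMetrics` helper and the port are invented for the example; the output follows the Prometheus text exposition format (HELP and TYPE comment lines, then `name value`).

```go
package main

import (
	"fmt"
	"log"
	"net/http"
	"sync/atomic"
)

// requestsTotal is a hypothetical counter that the rest of the service
// would increment; nothing here runs in a background goroutine.
var requestsTotal atomic.Int64

// renderMetrics formats one counter in the Prometheus text exposition format.
func renderMetrics(total int64) string {
	return fmt.Sprintf(
		"# HELP myapp_requests_total Requests handled since process start.\n"+
			"# TYPE myapp_requests_total counter\n"+
			"myapp_requests_total %d\n",
		total)
}

func main() {
	http.HandleFunc("/metrics", func(w http.ResponseWriter, r *http.Request) {
		w.Header().Set("Content-Type", "text/plain; version=0.0.4; charset=utf-8")
		fmt.Fprint(w, renderMetrics(requestsTotal.Load()))
	})
	log.Fatal(http.ListenAndServe(":8080", nil)) // hypothetical port
}
```

in a real service you'd more likely use a Prometheus client library, but the point stands: the service only answers GET requests, and all the scheduling and retry logic lives on the scraper side.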
for a more theoretical understanding, you can also look at it in terms of the "supervision trees" popularized by Erlang. parents monitor their children, by pulling status from them. children are not responsible for pushing status reports to their parents (or siblings). with the push model, you have a supervision graph instead of a supervision tree, with all the added complexity that entails.
Great answer. I managed metrics systems way back (Cacti, Nagios, Graphite, KairosDB) and one thing that always sucked about push-based metrics was coping with a variable volume of data coming from an uncontrollable number of sources. Scaling was a massive headache. "Scraping" helps to solve this by splitting duty across a number of "scrapers" that autodiscover sources. And by placing limits on how much each scraper will pull from any given metrics source, you can effectively protect the system from overload. Obviously this comes at the expense of dropping metrics from noisy sources, but as the metrics owner I say "too bad, your fault, fix your metrics". Back in the old days you had to accept whatever came in through the fire hose.
Is it me, or are observability stacks kind of ridiculous? Logs, metrics, and traces, each with their own databases, sidecars, visualization stacks. Language-specific integration libraries written by whoever felt like it. MASSIVE cloud bills.
Then, after you go through all that effort, most of the data is utterly ignored, and rarely are the business insights much better than the trailer park version: ssh'ing into a box and grepping a log file to find the error output.
Like we put so much effort into this ecosystem but I don't think it has paid us back with any significant increase in uptime, performance, or ergonomics.
I can say that going from a place that had all of that observability tooling set up to one that was at the "ssh'ing into a box and grepping a log" stage, you best believe I missed company A immensely. Even knowing which box to ssh into, which log file to grep, and which magic words to search for was nigh impossible if you weren't the dev that set up the machine and wrote the bug in the first place.
I completely agree with you, but I also think that, like many aspects of "tech", certain segments of it have been monopolised and turned into profit generators for certain organisations. DevOps, Agile/Scrum, Observability, and Kubernetes are all examples of this.
This dilutes the good and helpful stuff with marketing bullshit.
Grafana seemingly inventing new time series databases and engines every few months makes it absolutely painful to try to keep up to date in order to make informed decisions.
So much so I've started using rrdtool/smokeping again.
You might look into https://openobserve.ai/ - you can self host it and it's a single binary that ingests logs/metrics/traces. I've found it useful for my side projects.
if you're working on a system simple enough that "SSH to the box and grep the log file" works, then by all means have at it.
but many systems are more complicated than that. the observability ecosystem exists for a reason, there is a real problem that it's solving.
for example, your app might outgrow running on a single box. now you need to SSH into N different hosts and grep the log file from all of them. or you invent your own version of log-shipping with a shell script that does SCP in a loop.
going a step further, you might put those boxes into an auto-scaling group so that they would scale up and down automatically based on demand. now you really want some form of automatic log-shipping, or every time a host in the ASG gets terminated, you're throwing away the logs of whatever traffic it served during its lifetime.
or, maybe you notice a performance regression and narrow it down to one particular API endpoint being slow. often it's helpful to be able to graph the response duration of that endpoint over time. has it been slowing down gradually, or did the response time increase suddenly? if it was a sudden increase, what else happened around the same time? maybe a code deployment, maybe a database configuration change, etc.
perhaps the service you operate isn't standalone, but instead interacts with services written by other teams at your company. when something goes wrong with the system as a whole, how do you go about root-causing the problem? how do you trace the lifecycle of a request or operation through all those different systems?
when something goes wrong, you SSH to the box and look at the log file...but how do you know something went wrong to begin with? do you rely solely on user complaints hitting your support@ email? or do you have monitoring rules that will proactively notify you if a "huh, that should never happen" thing is happening?