Hacker News

Prometheus doesn't necessarily lock you into the "pull" model, see [0].

however, there are some benefits to the pull model, which is why I think Prometheus does it by default.

with a push model, your service needs to spawn a background thread/goroutine/whatever that pushes metrics on a given interval.

if that background thread crashes or hangs, metrics from that service instance stop getting reported. how do you detect that, and fire an alert about it happening?
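as a concrete sketch of the pattern (Python here, with a hypothetical `push_metrics()` stand-in for the real HTTP POST to a gateway), this is the background loop every pushing service carries around, and the single point of failure the pull model avoids:

```python
import threading
import time

def push_metrics(payload: dict) -> None:
    # stand-in for a real push (e.g. an HTTP POST to a push gateway);
    # here it just counts how many pushes happened
    push_metrics.calls += 1

push_metrics.calls = 0

def push_loop(interval: float, stop: threading.Event) -> None:
    # the background loop every pushing service needs.
    # if this thread crashes or hangs, metrics silently stop flowing,
    # and nothing on the server side notices by itself.
    while not stop.is_set():
        push_metrics({"requests_total": 42})
        stop.wait(interval)

stop = threading.Event()
t = threading.Thread(target=push_loop, args=(0.05, stop), daemon=True)
t.start()
time.sleep(0.2)   # let a few pushes happen
stop.set()
t.join()
```

note that detecting "this loop died" requires yet more machinery (a watchdog, a staleness alert on the receiving end), which is exactly the problem described above.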

"cloud-native" gets thrown around as a buzzword, but this is an example where it's actually meaningful. Prometheus assumes that whatever service you're trying to monitor, you're probably already registering each instance in a service-discovery system of some kind, so that other things (such as a load-balancer) know where to find it.

you tell Prometheus how to query that service-discovery system (Kubernetes, for example [1]) and it will automatically discover all your service instances, and start scraping their /metrics endpoints.
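for a flavor of what that looks like, here is a sketch of a scrape config using Kubernetes service discovery (the opt-in annotation convention is common but not mandatory; adapt to your cluster):

```yaml
scrape_configs:
  - job_name: kubernetes-pods
    kubernetes_sd_configs:
      - role: pod            # discover every pod the API server knows about
    relabel_configs:
      # only scrape pods that opt in via an annotation (a common convention)
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: "true"
```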

this provides an elegant solution to the "how do you monitor a service that is up and running, except its metrics-reporting thread has crashed?" problem. if a service is up and running, it should be registered for service-discovery, and Prometheus trivially records when a discovered instance stops responding to /metrics requests (this is the `up` metric).
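alerting on that condition is then a one-liner on the `up` metric; a sketch in Prometheus's own alerting-rule syntax (names and thresholds are placeholders):

```yaml
groups:
  - name: availability
    rules:
      - alert: InstanceDown
        expr: up == 0        # discovered, but /metrics isn't answering
        for: 5m
        annotations:
          summary: "{{ $labels.instance }} is not responding to /metrics"
```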

and this greatly simplifies the client-side metrics implementation, because you don't need a separate metrics thread in your service. you don't need to ensure it runs forever and never hangs and always retries and all that. you just need to implement a single HTTP GET endpoint, and have it return text in a format simple enough that you can sprintf it yourself if you need to.
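to make the "sprintf it yourself" point concrete, here is a minimal sketch of a /metrics endpoint using only the Python standard library (the metric name is a placeholder; a real service would use a client library, but nothing forces you to):

```python
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

REQUESTS_TOTAL = 7  # placeholder counter your service would maintain

class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/metrics":
            self.send_error(404)
            return
        # Prometheus exposition format: one "name{labels} value" per line
        body = (
            "# TYPE http_requests_total counter\n"
            f'http_requests_total{{path="/"}} {REQUESTS_TOTAL}\n'
        ).encode()
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # silence per-request logging

# serve on an ephemeral port and scrape ourselves once, as Prometheus would
server = HTTPServer(("127.0.0.1", 0), MetricsHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()
url = f"http://127.0.0.1:{server.server_address[1]}/metrics"
text = urllib.request.urlopen(url).read().decode()
server.shutdown()
```

no background thread, no retry logic; the scraper drives the schedule.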

for a more theoretical understanding, you can also look at it in terms of the "supervision trees" popularized by Erlang. parents monitor their children, by pulling status from them. children are not responsible for pushing status reports to their parents (or siblings). with the push model, you have a supervision graph instead of a supervision tree, with all the added complexity that entails.

0: https://prometheus.io/docs/instrumenting/pushing/

1: https://prometheus.io/docs/prometheus/latest/configuration/c...

Great answer. I managed metrics systems way back (Cacti, Nagios, Graphite, KairosDB) and one thing that always sucked about push-based metrics was coping with a variable volume of data coming from an uncontrollable number of sources. Scaling was a massive headache. "Scraping" helps solve this by splitting duty across a number of "scrapers" that autodiscover sources. And by placing limits on how much it will scrape from any given metrics source, you can effectively protect the system from overload. Obviously this comes at the expense of dropping metrics from noisy sources, but as the metrics owner I say "too bad, your fault, fix your metrics". Back in the old days you had to accept whatever came in through the fire hose.

Thanks for writing this out; very insightful!


