if you're working on a system simple enough that "SSH to the box and grep the lo...

if you're working on a system simple enough that "SSH to the box and grep the log file" works, then by all means have at it.

but many systems are more complicated than that. the observability ecosystem exists for a reason, there is a real problem that it's solving.

for example, your app might outgrow running on a single box. now you need to SSH into N different hosts and grep the log file from all of them. or you invent your own version of log-shipping with a shell script that does SCP in a loop.

going a step further, you might put those boxes into an auto-scaling group so that they would scale up and down automatically based on demand. now you really want some form of automatic log-shipping, or every time a host in the ASG gets terminated, you're throwing away the logs of whatever traffic it served during its lifetime.

or, maybe you notice a performance regression and narrow it down to one particular API endpoint being slow. often it's helpful to be able to graph the response duration of that endpoint over time. has it been slowing down gradually, or did the response time increase suddenly? if it was a sudden increase, what else happened around the same time? maybe a code deployment, maybe a database configuration change, etc.

perhaps the service you operate isn't standalone, but instead interacts with services written by other teams at your company. when something goes wrong with the system as a whole, how do you go about root-causing the problem? how do you trace the lifecycle of a request or operation through all those different systems?

when something goes wrong, you SSH to the box and look at the log file...but how do you know something went wrong to begin with? do you rely solely on user complaints hitting your support@ email? or do you have monitoring rules that will proactively notify you if a "huh, that should never happen" thing is happening?