> The sheer amount of observability data you can collect in wide events grows incredibly fast and most of it ends up never being read.
Yes! I know of at least 3 anecdotal "oh shit" stories w/ teams being chewed out by upper management when bills from SaaS observability tools run into the hundreds of thousands because of logging. Turns out that uploading a full stack dump on error can lead to TBs of data that, as you said, most likely no-one will look at ever again.
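To make the scale concrete, here's a back-of-the-envelope sketch (all numbers invented, but plausible for a mid-sized service):

```python
# Assumed numbers: a service doing 200 req/s with a 1% error rate,
# attaching a 500 KB stack dump to every error event.
req_per_s = 200
error_rate = 0.01
dump_bytes = 500 * 1024

bytes_per_day = req_per_s * error_rate * dump_bytes * 86_400
print(f"{bytes_per_day / 1e9:.1f} GB/day")  # ~88.5 GB/day, i.e. multi-TB per month
```

And that's before the SaaS vendor's per-GB ingest and retention pricing multiplies it.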
I agree with the broad point: as an industry we still fail to treat logging as a feature to be specified and tested like everything else. We use logging frameworks to indiscriminately and redundantly dump everything we can think of, instead of adopting a pattern of apps and libraries that produce thoughtful, structured event streams. It's too easy to just chuck another log.info in; having to consider the type and information content of an event results in lower volumes and higher quality of observability data.
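For example, contrast the two styles (the `emit` helper and all event/field names are hypothetical, just to illustrate):

```python
import json
import logging
import time

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("checkout")

# The easy path: scattered, redundant, unstructured lines
log.info("starting checkout")
log.info("user id is 42")
log.info("cart total computed")

# The alternative: one deliberate, structured event per unit of work,
# whose fields are a small schema you can review and test like any API
def emit(event: str, **fields):
    """Hypothetical helper: emit one JSON event per logical operation."""
    log.info(json.dumps({"event": event, "ts": time.time(), **fields}))

emit("checkout.completed", user_id=42, cart_total_cents=1999,
     item_count=3, duration_ms=87)
```

Three ad-hoc lines become one queryable event, and adding a field forces you to ask whether it carries information anyone will ever need.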
A small nitpick, but having loads of data that "most likely no-one will look at ever again" is OK to an extent, for the data that are there to diagnose incidents. It's not useful most of the time, until it's really, really useful. But it's a matter of degree, and dumping the same information redundantly is pointless and infuriating.
This is one reason why it's nice to create readable specs from telemetry, with traces/spans initiated from test drivers and passed through the stack (rather than trying to make natural language executable the way Cucumber does it; that's a lot of work and complexity for non-production code). Then our observability data get looked at many times before there's a production incident, in order to diagnose test failures. And hopefully the attributes we added to diagnose tests are also useful for similar diagnostics in prod.
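A minimal sketch of what I mean, using the OpenTelemetry Python API (the endpoint, attribute keys, and test name are made up, and exporter/SDK setup is omitted, so with only the API package installed these calls are no-ops):

```python
import requests
from opentelemetry import trace
from opentelemetry.propagate import inject

tracer = trace.get_tracer("spec-driver")

def test_checkout_happy_path():
    # The test driver starts the root span, so the whole request tree
    # hangs off a span named after the spec being exercised.
    with tracer.start_as_current_span("spec: checkout succeeds") as span:
        span.set_attribute("spec.id", "checkout-001")
        headers = {}
        inject(headers)  # propagate trace context into the stack under test
        resp = requests.post("http://localhost:8080/checkout",
                             json={"user_id": 42}, headers=headers)
        span.set_attribute("spec.http_status", resp.status_code)
        assert resp.status_code == 200
```

When the assertion fails, you debug from the same spans and attributes you'd reach for in prod, which keeps that instrumentation honest.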
I'm currently working with Coroot, an open source project trying to solve this problem of logs and other telemetry sources being too much for any team to reasonably parse manually. Data is collected automatically via eBPF, and Coroot provides root-cause-analysis (RCA) insights, with things like mapped incident timeframes, to surface anything that would otherwise be overlooked in the dumps.