Discussion about this post

RustyLion07

First of all, great work and a great outcome from a cost-savings standpoint.

What you built seems very specific to logs; I hope your engineering teams are not relying on logs alone, as that would be an outdated approach 😉

Are they able to correlate logs with other signals, such as real user transactions across services, via distributed traces and metrics? Most importantly, is that context surfaced automatically, or does it require manual digging through multiple tools?
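
To make the correlation question concrete, here is a minimal sketch of one common approach, assuming the OpenTelemetry Python API is available; the `checkout` logger name is purely illustrative. It stamps every log record with the active trace and span IDs so a backend can join log lines to the distributed trace they belong to:

```python
# Minimal sketch: inject the active trace/span IDs into log records so logs
# can be correlated with distributed traces. Assumes opentelemetry-api is
# installed; the "checkout" logger name is a hypothetical example.
import logging

from opentelemetry import trace


class TraceContextFilter(logging.Filter):
    """Attach the current trace_id/span_id (if any) to each log record."""

    def filter(self, record: logging.LogRecord) -> bool:
        ctx = trace.get_current_span().get_span_context()
        if ctx.is_valid:
            record.trace_id = format(ctx.trace_id, "032x")
            record.span_id = format(ctx.span_id, "016x")
        else:
            # No active span (e.g., tracing not configured): emit placeholders.
            record.trace_id = record.span_id = "-"
        return True


handler = logging.StreamHandler()
handler.setFormatter(
    logging.Formatter(
        "%(asctime)s %(levelname)s trace=%(trace_id)s span=%(span_id)s %(message)s"
    )
)
logger = logging.getLogger("checkout")
logger.addHandler(handler)
logger.addFilter(TraceContextFilter())
logger.setLevel(logging.INFO)

logger.info("order placed")  # now carries correlation IDs alongside the message
```

With IDs like these on every log line, the backend can pivot from a log entry to the full trace (and its metrics) instead of making engineers dig through separate tools.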

I often see observability equated with a single signal, such as logs, which limits the core capability of a true observability system.

From a business standpoint, cost-savings-focused goals for observing modern distributed architectures sound like a defensive approach. Shouldn't it be more about how we can enhance the user experience, increase developer productivity, or decrease Mean Time To Detection/Resolution (MTTD/R)?

The value delivered by such outcomes outweighs the annual cost of observability: simply avoiding an outage, recovering faster, saving hours or days of war rooms, or increasing market share by delivering a top-rated experience. The recent AWS outage is a great example: given the choice, they would rather invest heavily in observing the infrastructure than spend days recovering. The latter is far more expensive, not just in lost revenue but also in brand perception in a highly competitive market.
