Architecting Observability
May 2023 - Alex Alejandre

tl;dr: 10 construction workers stand and watch one man dig a hole. That’s the cloud today!

Peter Bourgon’s observability chart

Peter Bourgon’s chart here defines common vocabulary to reason about observability data. I will only use “logging” and “metrics”.

In Code

When containerized, telemetry goes to stdout/stderr so ops can route it. This decouples it from processors, avoiding vendor lock-in (so you can leave Datadog). It also centralizes the routing logic, so individual services don’t each handle it in their own haphazard way. Above all, the user should decide what to do with the logs.
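As a minimal sketch of the idea, here is a service that writes one JSON object per line to stdout using only Python’s standard library and leaves routing to the platform. The helper `build_logger`, the field names and the service name "checkout" are my own illustrative choices, not a standard.

```python
import json
import logging
import sys


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object on one line."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "ts": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "msg": record.getMessage(),
        }
        return json.dumps(payload)


def build_logger(name: str, stream=None) -> logging.Logger:
    """Hypothetical helper: the service only knows about stdout."""
    handler = logging.StreamHandler(stream if stream is not None else sys.stdout)
    handler.setFormatter(JsonFormatter())
    logger = logging.getLogger(name)
    logger.handlers = [handler]   # replace any previously attached handlers
    logger.setLevel(logging.INFO)
    logger.propagate = False      # avoid duplicate lines via the root logger
    return logger


if __name__ == "__main__":
    log = build_logger("checkout")  # hypothetical service name
    log.info("order placed")
```

Nothing in the service names a vendor. Swapping the processor is a change to whatever collects stdout, not to the code.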

In Business

Ignoring video streaming, logging operations represent the great majority (I’ve seen 80%) of generated data, requests, compute load etc. in a given (online) system.

Logs make sense when they save more (dev hours etc.) than they cost (including future maintenance risks). Spending $500 so a developer (paid $1000/day) solves a task in one day instead of two makes sense. (We developers are a stingy lot, generally unwilling to pay for any tooling, no matter the productivity increase.)

Logging is expensive. Vended logs are insanely expensive. Imagine millions of tiny requests, each of which outputs multiple lines of logs, each line packaged in a JSON object with its pod name, namespace etc. and streamed to CloudWatch, Splunk or Sumo Logic. N.b. Loki on EKS with an S3 backend is relatively cheap. Scalyr looks interesting too.
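To get a feel for how much the packaging adds, here is a rough sketch that compares a raw log line with the same line wrapped in a JSON envelope. The line, the pod name and the field layout are made-up examples. Real shippers attach different (often more) metadata.

```python
import json

# Hypothetical values: a tiny log line and the metadata a shipper might attach.
raw_line = "GET /healthz 200 3ms"
envelope = {
    "log": raw_line,
    "stream": "stdout",
    "kubernetes": {
        "pod_name": "checkout-7d9f8c6b5-x2k4q",
        "namespace_name": "production",
        "container_name": "checkout",
    },
}

raw_bytes = len(raw_line.encode("utf-8"))
shipped_bytes = len(json.dumps(envelope).encode("utf-8"))
overhead = shipped_bytes / raw_bytes  # how many times larger the shipped line is

print(f"raw: {raw_bytes} B, shipped: {shipped_bytes} B, x{overhead:.1f}")
```

When you pay per GB ingested, that multiplier applies to every line of every request.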

We must balance the following equation:

  (metric cost) = (eng. cost/hr) * (eng. hrs saved)

where

  (metric cost) = (hardware cost) + (maintenance hours) * (eng. cost/hr) + uncertainty + temptation to add features

The uncertainty lies in future hardware cost, maintenance hours, engineer salary, outages etc. Maintenance hours are fairly static. Hardware costs are reasonably linear.

To simplify this and remove uncertainty and temptation from the equation, SaaS providers charge higher rates per GB. At small scales, this makes sense (building such a pipeline for a company with 3 engineers won’t increase velocity enough), but it quickly stops making sense from a purely financial perspective.

The latter two terms, uncertainty and temptation, impact an organization in contradictory, difficult-to-quantify ways. Focusing on your core competency instead of spending time on tooling generates edge, but engaging with logs teaches engineers how to get the most out of them and spreads overall perspective into business matters.
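The quantifiable part of that balance can be sketched in a few lines. This leaves out the uncertainty and temptation terms, since they resist being put into numbers. The class name `LoggingCase` and all figures in the example are hypothetical, except the hourly rate, which assumes the $1000/day developer from earlier working an 8-hour day.

```python
from dataclasses import dataclass


@dataclass
class LoggingCase:
    hardware_cost: float      # dollars per period
    maintenance_hours: float  # engineer hours spent running the pipeline
    engineer_rate: float      # dollars per engineer hour
    hours_saved: float        # engineer hours the logs save in that period

    def cost(self) -> float:
        """Hardware plus the engineer time spent on maintenance."""
        return self.hardware_cost + self.maintenance_hours * self.engineer_rate

    def value(self) -> float:
        """Engineer time saved, priced at the same hourly rate."""
        return self.engineer_rate * self.hours_saved

    def net(self) -> float:
        """Positive means the logs pay for themselves."""
        return self.value() - self.cost()


# Hypothetical month: $2000 hardware, 10 maintenance hours, 40 hours saved.
case = LoggingCase(hardware_cost=2000, maintenance_hours=10,
                   engineer_rate=125, hours_saved=40)
print(case.cost(), case.value(), case.net())
```

Break-even is where `net()` hits zero. Everything the sketch omits is what a SaaS provider charges a premium to take off your hands.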

In 2022, Coinbase had 17 outages, totalling about 12 hours of downtime. The company’s revenue averages around $9M/day, based on their 2022 earnings.

In Gergely Orosz’s example, assuming Datadog cuts outages in half and helps mitigate them 50% faster, these 12 hours would have been 36 without it. The 24 hours avoided are one day of revenue, so Datadog saved them $9M, for $10M.
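The arithmetic behind those numbers, spelled out. This sketch reads "cuts outages in half" as twice the outages without Datadog and "50% faster" as each outage lasting 1.5x as long without it, which is the reading that yields 36 hours. It also assumes revenue is spread evenly across the day.

```python
hours_down_with = 12          # 2022 downtime, from the text
revenue_per_day = 9_000_000   # dollars, from the text

outage_multiplier = 2.0       # assumption: twice as many outages without the tool
duration_multiplier = 1.5     # assumption: each outage lasts 1.5x as long without it

hours_down_without = hours_down_with * outage_multiplier * duration_multiplier
hours_avoided = hours_down_without - hours_down_with
revenue_saved = hours_avoided / 24 * revenue_per_day  # even revenue across the day

print(hours_down_without, hours_avoided, revenue_saved)
```

If "50% faster" instead meant outages take half as long with the tool, the duration multiplier would be 2.0 and the savings larger. The conclusion is sensitive to how you read the assumptions.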

There are simpler ways though, e.g. aggregating everything into actionable metrics at the point of generation (like edge computing) before centralizing them, while saving historical log samples at different granularities. You can also process them yourself on prem, e.g. with Graylog and Elasticsearch clusters, 3 nodes each to start. Disk space is cheap; we can ingest and store 1000 GB for under $15. We don’t need to back them up or persist them for long. We don’t have to wait for $2M in spend before hiring a team.
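Aggregating at the point of generation could look roughly like this: instead of shipping every request line, the service keeps counters for a time window, retains a small random sample of raw lines, and emits one summary per flush. `WindowAggregator`, its fields and the 1% default sample rate are illustrative assumptions, not a reference design.

```python
import json
import random
from collections import Counter
from typing import Optional


class WindowAggregator:
    """Collapse per-request events into one summary line per flush."""

    def __init__(self, sample_rate: float = 0.01,
                 rng: Optional[random.Random] = None):
        self.sample_rate = sample_rate
        self.rng = rng if rng is not None else random.Random()
        self.reset()

    def reset(self) -> None:
        self.status_counts = Counter()
        self.latency_sum_ms = 0.0
        self.latency_max_ms = 0.0
        self.samples = []

    def observe(self, status: int, latency_ms: float, raw_line: str) -> None:
        self.status_counts[status] += 1
        self.latency_sum_ms += latency_ms
        self.latency_max_ms = max(self.latency_max_ms, latency_ms)
        # Keep only a small random sample of raw lines for later debugging.
        if self.rng.random() < self.sample_rate:
            self.samples.append(raw_line)

    def flush(self) -> str:
        """Return the window summary as one JSON line and start a new window."""
        total = sum(self.status_counts.values())
        summary = {
            "requests": total,
            "by_status": {str(k): v for k, v in sorted(self.status_counts.items())},
            "latency_avg_ms": self.latency_sum_ms / total if total else 0.0,
            "latency_max_ms": self.latency_max_ms,
            "samples": self.samples,
        }
        self.reset()
        return json.dumps(summary)
```

Call `observe` per request and `flush` on a timer (say, once a minute), printing the result to stdout. A million requests then cost one line plus a handful of samples rather than several million lines.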

I’ve only been talking about developers. But good logs can:

  • show product managers insights into user behavior
  • help operations diagnose production issues
  • detect security incidents and access patterns for security teams
  • generate KPIs for the board (ROI, forecast budget needs)
  • provide data for audits (compliance, data protection etc.)
  • shed light on cost patterns in resource utilization
  • increase developer velocity (information to fix bugs, spot bottlenecks etc.)

Turning passive logs into analytics, metrics, yea, decisions demands concerted effort and engagement. The juice can be worth the squeeze (if you can keep costs between 5-10%), but beware, for cargo culting haunts our industry.

P.S. I’ll build anyone an on prem solution for 1/10th of what they currently spend, and its yearly cost should be even less than that.