In a distributed system, a single user-visible failure is often a chain: a timeout leads to a retry, which creates extra load, which increases queueing, which increases tail latency, which triggers more timeouts. Observability is the discipline of making that chain visible.
This note focuses on a simple principle: instrument boundaries. Most debugging questions are boundary questions: what did the service receive, what did it decide, what did it send, and how long did each part take? If boundaries are instrumented consistently, “mystery incidents” become traceable sequences.
Quick takeaways
- Start with a request story. What is the unit of work? Where does it flow?
- Logs answer “what happened”. They need structure: ids, outcomes, and reasons.
- Metrics answer “how often”. They need good dimensions: outcome classes and dependency breakdowns.
- Traces answer “where the time went”. Spans should align with boundaries and dependency calls.
- Tail latency is the signal. p95/p99 often matter more than averages for correctness under timeouts.
Problem framing (why distributed incidents are hard)
Distributed failures are compositional. Each service behaves “reasonably” locally: it times out, it retries, it sheds load, it returns a fallback. Globally, those reasonable behaviours can amplify each other.
The goal of observability is not perfect visibility. It is the ability to answer a small set of questions quickly: Where did the request fail? Was it a retry? Which dependency dominated latency? Did the system violate an interface contract?
Key concepts (what each signal is for)
Structured logs
Logs are most useful when you can correlate events. That means consistent ids and stable fields. A practical baseline for an API boundary log entry:
```json
{
  "requestId": "…",
  "operation": "charge",
  "outcome": "success|retryable_error|fatal_error",
  "latency_ms": 123,
  "idempotencyKey": "…",
  "dependency": "payments-db",
  "errorClass": "timeout|validation|conflict"
}
```
The key is not the exact schema; it is that “outcome” and “errorClass” are explicit, not inferred from text.
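As a minimal sketch (standard-library Python only; the function name, defaults, and logging setup are illustrative rather than any particular framework's API), a boundary handler could emit that entry as a single JSON log line:

```python
import json
import logging
import uuid

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("api")

def log_boundary_event(operation: str, outcome: str, latency_ms: float,
                       error_class: str | None = None, dependency: str | None = None,
                       request_id: str | None = None, idempotency_key: str | None = None) -> None:
    # One structured line per boundary event, with outcome and errorClass
    # set explicitly by the handler rather than inferred from message text.
    entry = {
        "requestId": request_id or str(uuid.uuid4()),
        "operation": operation,
        "outcome": outcome,            # "success" | "retryable_error" | "fatal_error"
        "latency_ms": round(latency_ms, 1),
        "idempotencyKey": idempotency_key,
        "dependency": dependency,
        "errorClass": error_class,     # "timeout" | "validation" | "conflict" | None
    }
    logger.info(json.dumps(entry))

# Example: a charge that timed out against the payments database.
log_boundary_event("charge", "retryable_error", 1203.4,
                   error_class="timeout", dependency="payments-db")
```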
Metrics with meaningful dimensions
Metrics tell you rates and distributions. In distributed systems, the most valuable dimensions are often: outcome class, dependency, retry status, and queue depth. Without these, you can see that “errors increased” but not why.
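A minimal sketch of recording those dimensions, using an in-memory Counter as a stand-in for a real metrics client (Prometheus, StatsD, and so on); the label set is the point, not the storage:

```python
from collections import Counter

# Illustrative in-memory metric store keyed by label tuples.
request_counter = Counter()

def record_request(outcome: str, dependency: str, is_retry: bool) -> None:
    # The labels carry the dimensions that make the metric answerable:
    # outcome class, dependency, and whether this was a retry.
    attempt = "retry" if is_retry else "first_attempt"
    request_counter[(outcome, dependency, attempt)] += 1

# Example: a retried call to payments-db that timed out.
record_request("retryable_error", "payments-db", is_retry=True)
```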
Tracing and spans
Traces show latency composition. A good span structure matches your architecture: one span for the inbound request, child spans for dependency calls, and explicit spans for queueing or background steps. If you cannot see queueing as a span, you will often misdiagnose “slow database” when the real issue is backlog.
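A sketch of that span layout, with a hand-rolled context manager standing in for a real tracer such as OpenTelemetry; the span names and recording mechanism are illustrative:

```python
import time
from contextlib import contextmanager

spans = []  # collected span records; a real tracer would export these

@contextmanager
def span(name: str, parent: str | None = None):
    # Record the wall-clock duration of one named piece of work.
    start = time.monotonic()
    record = {"name": name, "parent": parent}
    try:
        yield record
    finally:
        record["duration_ms"] = (time.monotonic() - start) * 1000
        spans.append(record)

def handle_request() -> None:
    with span("inbound: POST /charge") as root:
        # Queueing gets a span of its own, so backlog shows up in the trace
        # instead of being blamed on the database.
        with span("queue wait", parent=root["name"]):
            time.sleep(0.01)   # stand-in for time spent waiting in a backlog
        with span("dependency: payments-db", parent=root["name"]):
            time.sleep(0.02)   # stand-in for the actual database call
        # Whatever remains in the root span is local computation.

handle_request()
```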
Practical checks (instrumentation checklist)
1) Define a stable unit of work
Pick an identifier that follows the request across services (trace id) and a local id that captures retries. Many debugging sessions become easy once you can answer: “is this the first attempt or a retry?”
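One sketch of this, assuming hypothetical header names: propagate a stable trace id on every hop, plus an attempt counter that distinguishes retries:

```python
import uuid

def outbound_headers(trace_id: str | None, attempt: int) -> dict[str, str]:
    # The trace id stays the same across every hop of the request;
    # the attempt counter distinguishes first attempts from retries.
    return {
        "X-Trace-Id": trace_id or str(uuid.uuid4()),  # illustrative header names
        "X-Attempt": str(attempt),                    # 1 = first attempt, >1 = retry
    }

# First attempt, then a retry of the same logical request.
first = outbound_headers(None, attempt=1)
retry = outbound_headers(first["X-Trace-Id"], attempt=2)
```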
2) Classify outcomes explicitly
Treat outcome classification as part of the interface contract: success, retryable error, fatal error. If clients need to retry safely, they need the correct classification.
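A sketch of an explicit classifier; the exception types here are placeholders for whatever the dependencies actually raise:

```python
class ValidationError(Exception): ...
class ConflictError(Exception): ...

def classify(exc: Exception) -> tuple[str, str]:
    # Map a low-level failure to the contract-level (outcome, errorClass) pair
    # that clients use to decide whether a retry is safe.
    if isinstance(exc, TimeoutError):
        return "retryable_error", "timeout"    # safe to retry, given an idempotency key
    if isinstance(exc, ConflictError):
        return "fatal_error", "conflict"       # retrying will not change the result
    if isinstance(exc, ValidationError):
        return "fatal_error", "validation"     # the caller must fix the request
    return "fatal_error", "unknown"            # default to not retrying blindly
```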
3) Break down latency at boundaries
For each service, track: total time, time waiting (queueing), time in dependencies, and time in local computation. This decomposition makes it possible to link performance symptoms to capacity causes.
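A sketch of that decomposition using a monotonic clock; all names, including call_dependency, are illustrative:

```python
import time

def call_dependency(request):
    # Stand-in for a real dependency call (e.g. the payments database).
    time.sleep(0.02)
    return {"ok": True}

def handle(request, enqueue_time: float):
    dequeue_time = time.monotonic()
    queue_ms = (dequeue_time - enqueue_time) * 1000       # time waiting in the backlog

    dep_start = time.monotonic()
    result = call_dependency(request)
    dep_ms = (time.monotonic() - dep_start) * 1000        # time spent in dependencies

    total_ms = (time.monotonic() - enqueue_time) * 1000   # total as the caller experiences it
    local_ms = total_ms - queue_ms - dep_ms               # local computation is the remainder

    return result, {"total_ms": total_ms, "queue_ms": queue_ms,
                    "dependency_ms": dep_ms, "local_ms": local_ms}

# Example: the request sat in a queue for ~10 ms before being handled.
t0 = time.monotonic()
time.sleep(0.01)
_, breakdown = handle({"op": "charge"}, enqueue_time=t0)
```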
4) Track saturation signals
Saturation indicators include queue depth, concurrency limits, and retry rate. They help diagnose the feedback loop where retries create more load.
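A sketch of in-process gauges for those signals; the class and field names are illustrative, and a real service would export them as metrics:

```python
import threading

class SaturationGauges:
    # Illustrative in-process gauges for queue depth, concurrency, and retry rate.
    def __init__(self, concurrency_limit: int) -> None:
        self.lock = threading.Lock()
        self.queue_depth = 0          # requests waiting to be picked up
        self.in_flight = 0            # requests currently being handled
        self.concurrency_limit = concurrency_limit
        self.attempts = 0
        self.retries = 0

    def record_attempt(self, is_retry: bool) -> None:
        with self.lock:
            self.attempts += 1
            if is_retry:
                self.retries += 1

    def retry_rate(self) -> float:
        # Retries as a fraction of all attempts: the amplification signal.
        with self.lock:
            return self.retries / self.attempts if self.attempts else 0.0

gauges = SaturationGauges(concurrency_limit=64)
gauges.record_attempt(is_retry=False)
gauges.record_attempt(is_retry=True)   # retry_rate() is now 0.5
```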
Common pitfalls
- Relying on unstructured logs. Free-text logs are hard to query and correlate during incidents.
- Measuring only averages. Averages hide tail latency, which triggers timeouts and retries.
- Missing queueing visibility. Backlog is a common cause of latency; if you cannot see it, you misdiagnose.
- Not tagging retries. If you cannot distinguish first attempts from retries, you can’t reason about amplification.
- Over-instrumenting without intent. Collect signals that answer a question; otherwise they become noise.
Related notes
- Failure modes, retries, and idempotency. Retries form feedback loops; you need to measure them.
- Queueing basics and latency budgets. Queueing explains why small load changes can create large tail latency shifts.
- TTFB and origin latency. Breakdown helps distinguish network vs origin vs queueing.
- Specs, invariants, and contracts. Outcome classes and retry semantics are part of the contract.