Start here
A path from core ideas to engineering consequences.
- Time, clocks, and ordering: What “happened before” means and why clock time is not enough.
- Failure modes, retries, and idempotency: Why retries create duplicates and how idempotency makes behaviour predictable.
- Observability for distributed systems: Tracing, correlation, and turning incidents into evidence.
All notes in this topic
- Consensus without the hype: What consensus is (and isn’t), what it buys you, and the real cost drivers.
- Time, clocks, and ordering: Physical time vs logical time, causality, and why ordering decisions affect correctness.
- Failure modes, retries, and idempotency: Duplicate requests, partial failures, and designing operations so retry behaviour is safe.
- Observability for distributed systems: Signals that let you localise latency and failure across services.
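As a taste of the physical-versus-logical time distinction, here is a minimal sketch of a Lamport-style logical clock. The class and method names (`LamportClock`, `tick`, `receive`) are illustrative assumptions, not taken from the notes themselves. The point it shows: a receive is always stamped later than the matching send, whatever the wall clocks on either node say.

```python
from dataclasses import dataclass


@dataclass
class LamportClock:
    """Logical clock: a counter that tracks causality, not wall-clock time."""

    time: int = 0

    def tick(self) -> int:
        # Local event or message send: advance the counter by one.
        self.time += 1
        return self.time

    def receive(self, sender_time: int) -> int:
        # Message receive: jump past the sender's timestamp, then count the event.
        self.time = max(self.time, sender_time) + 1
        return self.time


# Hypothetical two-node exchange.
a, b = LamportClock(), LamportClock()
b.time = 10                        # assume node B has already seen more events
sent_at = b.tick()                 # B stamps the outgoing message
received_at = a.receive(sent_at)   # A's clock moves past the send timestamp
```

Logical timestamps like these give an ordering consistent with causality, but two events with unrelated timestamps may still be concurrent; the notes on ordering cover that distinction.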
Common pitfalls
- Treating time as a single clock: Wall-clock timestamps are useful, but they do not define causality under delay and drift.
- Assuming “retry” means “try again”: Retries change semantics: they create duplicates and can amplify load during incidents.
- Using consensus as a cure-all: Consensus solves a specific coordination problem; it does not remove the need for good data modelling.
- Debugging without correlation: Logs without request identifiers or traces often produce plausible narratives, not evidence.
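The retry pitfall can be made concrete with a small sketch. It assumes a hypothetical in-memory `PaymentService` that deduplicates by a client-supplied idempotency key; the names and the in-memory store are illustrative only, and a real service would need durable storage and some handling for requests still in flight.

```python
import uuid


class PaymentService:
    """Hypothetical server that deduplicates requests by idempotency key."""

    def __init__(self):
        self.results = {}          # idempotency key -> stored response
        self.charges_applied = 0   # counts side effects actually performed

    def charge(self, idempotency_key, amount_cents):
        # A retried request with a known key returns the stored response
        # instead of performing the side effect again.
        if idempotency_key in self.results:
            return self.results[idempotency_key]
        self.charges_applied += 1
        result = f"charged {amount_cents} (charge #{self.charges_applied})"
        self.results[idempotency_key] = result
        return result


service = PaymentService()
key = str(uuid.uuid4())            # generated once per logical operation
first = service.charge(key, 500)
retry = service.charge(key, 500)   # e.g. the client timed out and retried
```

The key is generated once per logical operation and reused on every retry, so the second call returns the stored response and the charge is applied only once. Generating a fresh key per attempt would defeat the purpose.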
Related topics
- Performance notes: Tail latency and queueing are often triggered by distributed retries and fan-out.
- Formal methods notes: Protocols and invariants provide a crisp language for distributed correctness.