Failure modes, retries, and idempotency - Distributed systems notes

Retries are a rational response to failure: if a request might have been lost, try again. The trap is that failures are often ambiguous: a client may not know whether the server never saw the request, saw it and crashed, or processed it and the response was lost. When you retry in the presence of ambiguity, you must assume duplication.

This note is about designing interfaces that survive retries. Idempotency is the core idea: an operation is idempotent if applying it multiple times produces the same result as applying it once. Idempotency turns “duplicate request” into a tolerable condition rather than a catastrophe.

Quick takeaways

Ambiguity is the reason retries are hard: The client cannot reliably know whether the effect happened.
Retries imply duplicates: If you retry, you must design for “at least once”.
Idempotency keys are a boundary tool: They allow a server to deduplicate and return a stable outcome.
Safe retries need a persistence story: Deduplication state must live long enough to cover retry windows.
Measure duplicate rates: It is operationally valuable to know how often retries happen and why.

Problem framing (failure modes that create ambiguity)

Many failures look identical from the client:

Request lost: Client sends; server never receives.
Response lost: Server processes; response never arrives back to client.
Server crash after effect: Server applies effect, then crashes before responding.
Timeout but late completion: Client times out and retries while the original request is still in flight.

In all of these cases, “retry” is a reasonable client behaviour. That is why interfaces must treat duplicate requests as normal.

Diagram: ambiguous failures lead to retries and potential duplicates

Key concepts (definitions + mini examples)

Idempotent vs non-idempotent

An idempotent operation yields the same end state when applied multiple times. Many reads are idempotent. Some writes can be made idempotent (set a value), while others are naturally non-idempotent (increment, charge).
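The difference can be shown with a toy example. The sketch below uses hypothetical helper names (`set_balance`, `increment_balance`) and a plain dictionary as the "account"; it applies each operation three times, as a retrying client might.

```python
def set_balance(account: dict, value: int) -> None:
    """Idempotent: the end state is the same however many times it runs."""
    account["balance"] = value


def increment_balance(account: dict, delta: int) -> None:
    """Non-idempotent: every extra application changes the end state again."""
    account["balance"] += delta


# Simulate a request that is delivered three times because of retries.
set_account = {"balance": 100}
for _ in range(3):
    set_balance(set_account, 150)

inc_account = {"balance": 100}
for _ in range(3):
    increment_balance(inc_account, 50)

print(set_account["balance"])  # 150: duplicates were harmless
print(inc_account["balance"])  # 250: duplicates were applied twice more
```

A single application of the increment would have left the balance at 150; the two duplicates are what push it to 250.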

A practical trick is to reframe non-idempotent actions as “create a uniquely identified intent” and then “apply that intent once”. That is the role of an idempotency key.

Idempotency keys

A client sends an idempotency key along with the request. The server records the key and the outcome. If the same key is seen again, the server returns the same outcome instead of applying the effect again.

Key design decisions:

Scope: Is the key unique per user? per endpoint? per tenant?
Retention: How long is dedup state stored? It should cover the maximum retry window.
Outcome storage: Store enough to return a stable result (success payload or error classification).
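The three decisions above can be sketched as a small in-memory store. Everything here is an assumption made for illustration: the class name `IdempotencyStore`, the choice to scope keys by (tenant, endpoint, key), and the one-hour retention. It is not safe for concurrent use, does not store error outcomes, and does not make the dedup record atomic with the effect.

```python
import time
from dataclasses import dataclass
from typing import Any, Callable, Dict, Tuple


@dataclass
class StoredOutcome:
    result: Any        # enough to return a stable response on a duplicate
    expires_at: float  # retention deadline for this dedup record


class IdempotencyStore:
    """Hypothetical in-memory dedup store, for illustration only."""

    def __init__(self, retention_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._retention = retention_seconds
        self._clock = clock
        self._outcomes: Dict[Tuple[str, str, str], StoredOutcome] = {}

    def execute(self, tenant: str, endpoint: str, key: str,
                effect: Callable[[], Any]) -> Any:
        # Scope: the same key under another tenant or endpoint is a different intent.
        scoped_key = (tenant, endpoint, key)
        now = self._clock()
        stored = self._outcomes.get(scoped_key)
        if stored is not None and stored.expires_at > now:
            return stored.result  # duplicate: return the recorded outcome
        result = effect()  # first time (or record expired): apply the effect
        self._outcomes[scoped_key] = StoredOutcome(result, now + self._retention)
        return result


charges = []


def charge() -> dict:
    charges.append(25)
    return {"status": "charged", "amount": 25}


store = IdempotencyStore(retention_seconds=3600)
first = store.execute("tenant-a", "POST /charges", "key-123", charge)
second = store.execute("tenant-a", "POST /charges", "key-123", charge)
print(first == second, charges)  # True [25]
```

Note what happens once a record expires: the same key is treated as new and the effect runs again, which is why retention has to outlast the longest retry window.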

Practical checks (design + operations checklist)

1) Classify every operation by retry-safety

For each write operation, answer: if the client retries, what happens? If the answer is “it depends”, you probably need an explicit idempotency strategy.

2) Make the retry contract explicit

Document the contract: whether duplicates are allowed, what key is used, and which error cases are safe to retry. This is a contract boundary, not just an implementation detail.
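On the client side, such a contract might look like the sketch below. The exception classes, the `call_with_retries` helper, and the `send(payload, key)` signature are all hypothetical. The two points it illustrates are that one key is generated per intent and reused on every attempt, and that only errors classified as retryable are retried.

```python
import time
import uuid
from typing import Any, Callable


class RetryableError(Exception):
    """Ambiguous or transient failure (for example a timeout)."""


class PermanentError(Exception):
    """Failure that a retry cannot fix (for example a validation error)."""


def call_with_retries(send: Callable[[Any, str], Any], payload: Any,
                      max_attempts: int = 4, base_delay: float = 0.1,
                      sleep: Callable[[float], None] = time.sleep) -> Any:
    key = str(uuid.uuid4())  # one key per intent, reused across all attempts
    for attempt in range(1, max_attempts + 1):
        try:
            return send(payload, key)
        except RetryableError:
            if attempt == max_attempts:
                raise  # retry budget exhausted
            sleep(base_delay * 2 ** (attempt - 1))  # exponential backoff
        # PermanentError is deliberately not caught: it propagates immediately.


seen_keys = []


def flaky_send(payload: Any, key: str) -> dict:
    seen_keys.append(key)
    if len(seen_keys) < 3:
        raise RetryableError("timeout")
    return {"ok": True, "key": key}


result = call_with_retries(flaky_send, {"amount": 25}, sleep=lambda s: None)
print(len(seen_keys), len(set(seen_keys)))  # 3 1
```

Generating a fresh key on each attempt would defeat the purpose: the server would see three distinct intents rather than one intent delivered three times.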

3) Persist deduplication state atomically with the effect

The typical bug is a non-atomic write path: the effect is applied but recording the key fails, so retries apply the effect again. The dedup record should be part of the same atomic commit as the effect, or derived from it in a way that cannot diverge.

4) Measure and alert on retries

Operationally, retry rates are signals. High retry rates often indicate timeouts, downstream instability, or overloaded queues. Tracking “dedup hits” can tell you how often the system relies on idempotency to remain correct.
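A minimal sketch of the counters involved, assuming the client reports an attempt number with each request (that convention, the metric names, and the `handle` function are hypothetical). A real system would export these to its metrics backend rather than keep them in a `Counter`.

```python
from collections import Counter
from typing import Any, Callable, Dict

metrics: Counter = Counter()
outcomes: Dict[str, Any] = {}


def handle(key: str, attempt: int, effect: Callable[[], Any]) -> Any:
    metrics["requests_total"] += 1
    if attempt > 1:
        metrics["retries_total"] += 1  # the client told us this is a retry
    if key in outcomes:
        metrics["dedup_hits_total"] += 1  # idempotency prevented a duplicate effect
        return outcomes[key]
    outcomes[key] = effect()
    return outcomes[key]


handle("k1", 1, lambda: "ok")
handle("k1", 2, lambda: "ok")  # retry of k1 after an ambiguous failure
handle("k2", 1, lambda: "ok")

dedup_ratio = metrics["dedup_hits_total"] / metrics["requests_total"]
print(dict(metrics))
```

Retries and dedup hits are counted separately because they answer different questions: a retry whose original request never arrived is not a dedup hit, while a dedup hit means the first attempt did take effect.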

Common pitfalls

Assuming “exactly once” delivery: In practice, you often get “at least once”; build for it.
Retrying non-retryable errors: Some failures are permanent; retries only amplify load and worsen incidents.
Dedup keys that collide: If two different intents can share a key, you can incorrectly merge them.
Retention that is too short: If dedup state expires before retries stop, duplicates reappear.
Idempotency that ignores side effects: If you send emails or publish events, you need a dedup story there too.

Related notes

Specs, invariants, and contracts: Idempotency is a contract boundary: assumptions + guarantees.
Time, clocks, and ordering: Timeouts and reorderings are common sources of retries and duplicates.
Observability for distributed systems: To debug retries, you need traces and error classification.
Performance regressions checklist: Retries can look like regressions; classify and measure them.