Caching is one of the few levers that can improve both latency and capacity at the same time. But the moment a system has more than one cache layer, you can no longer talk about “the cache” as a single thing. A request might be served from the browser cache, the CDN edge, a regional cache, an in-process memoisation layer, or straight from the origin with no cache involved at all.
This note builds a simple mental model: treat caching as a hierarchy. Each layer trades freshness for speed and isolation. The practical job is to decide (a) which responses can be cached, (b) how staleness is controlled, and (c) how to debug behaviour when hit ratio or correctness changes.
Quick takeaways
- Think in layers: Browser, edge, and origin caches have different invalidation and observability properties.
- Latency becomes bimodal: Hits are fast and stable; misses are slower and more variable. A hit-ratio shift changes p95/p99.
- Freshness is a policy: Staleness is not a bug by default; it is a consequence of the chosen cache semantics.
- Cache keys matter more than TTL: If the key doesn’t match the user-visible variation, you’ll see correctness bugs and low hit ratio.
- Cache misses amplify origin load: A small miss increase can create a large origin load increase, triggering queueing effects.
Problem framing (why it matters)
Performance work often starts with measuring a slow path and optimising it. Caching flips the approach: you try to avoid executing the slow path at all. That makes caching attractive, but also makes failures more subtle. When caching works, the origin sees fewer requests, so the system is stable. When caching degrades (key mismatch, TTL too low, invalidation floods), the origin suddenly receives a request surge and can collapse into high tail latency.
A cache hierarchy also introduces “two truths”: the user sees content produced by a cache, while the origin logs show something else. Without a clear model, teams misread these signals and accidentally cause regressions by changing cache keys or freshness rules without considering the distribution of traffic.
Key concepts (definitions + mini examples)
Hierarchy layers
A typical hierarchy for web content and APIs looks like:
client (browser/app) → edge/CDN → origin load balancer → service → datastore
“Cache” can exist at multiple points:
- Client cache: Fastest, but controlled by client rules and varies across clients.
- Edge/CDN cache: Good for global latency and protecting the origin; invalidation semantics vary.
- Service-side cache: In-memory or shared cache; offers precise control, but adds operational complexity (a minimal read-through sketch follows this list).
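To make the service-side layer concrete, here is a minimal read-through sketch in Python with a process-local layer in front of a shared one. Everything here is illustrative: `shared_cache` stands in for a networked store such as a Redis-style client, and `fetch_from_origin` for the real origin call.

```python
import time

# Illustrative stand-ins: a process-local dict, plus a dict that represents
# a shared networked cache (e.g. a Redis-style client).
local_cache = {}
shared_cache = {}

LOCAL_TTL = 5     # seconds; short, because invalidation rarely reaches per-process copies
SHARED_TTL = 60

def fetch_from_origin(key):
    # Placeholder for the real origin call (database query, upstream API, ...).
    return f"value-for-{key}"

def get(key):
    now = time.time()
    # Layer 1: in-process. No network hop, but every instance has its own copy.
    entry = local_cache.get(key)
    if entry and entry[1] > now:
        return entry[0]
    # Layer 2: shared cache. One network hop, consistent across instances.
    entry = shared_cache.get(key)
    if entry and entry[1] > now:
        local_cache[key] = (entry[0], now + LOCAL_TTL)
        return entry[0]
    # Miss at every layer: pay the full origin cost, then populate both layers.
    value = fetch_from_origin(key)
    shared_cache[key] = (value, now + SHARED_TTL)
    local_cache[key] = (value, now + LOCAL_TTL)
    return value
```

Keeping the local TTL much shorter than the shared TTL is a deliberate choice: invalidation can usually reach the shared store but not every process-local copy, so the local TTL bounds how long an instance can serve a value that nobody can evict.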
Hit ratio vs hit population
Hit ratio is usually reported as a single number, but what matters is which requests hit. A system can have a “good” overall hit ratio while critical endpoints miss, producing poor user experience.
In practice, track hit ratio by request class (endpoint, status code bucket, content type) and by region. This also helps detect cache-key mistakes, where a subset of traffic is forced to miss due to an accidental key variation.
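A minimal sketch of that breakdown, assuming your access logs carry an endpoint, a region, and a cache status per request (the field names here are placeholders for whatever your pipeline actually emits):

```python
from collections import defaultdict

def hit_ratio_by_class(records):
    # Aggregate hit ratio per (endpoint, region) pair instead of one global number.
    hits = defaultdict(int)
    total = defaultdict(int)
    for r in records:
        key = (r["endpoint"], r["region"])
        total[key] += 1
        if r["cache_status"] == "hit":
            hits[key] += 1
    return {k: hits[k] / total[k] for k in total}

records = [
    {"endpoint": "/api/products", "region": "eu", "cache_status": "hit"},
    {"endpoint": "/api/products", "region": "eu", "cache_status": "miss"},
    {"endpoint": "/api/cart",     "region": "eu", "cache_status": "miss"},
]
print(hit_ratio_by_class(records))
# {('/api/products', 'eu'): 0.5, ('/api/cart', 'eu'): 0.0}
```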
Freshness semantics
Every caching strategy chooses a freshness story. Three common ones are:
- Time-based (TTL): Content is valid until a deadline. Simple, but can serve stale data after changes.
- Validation-based: Caches revalidate using a lightweight check, reducing stale risk at the cost of extra round trips.
- Invalidate-on-write: On changes, invalidate the affected keys. Correctness can be good, but invalidation storms are a real failure mode.
The important point: “stale” is not inherently wrong. Stale is wrong only when it violates what users expect. That expectation should be part of the spec: which fields must be current, how quickly changes must propagate, and what happens during partial failures.
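In HTTP terms, the first two stories map onto standard Cache-Control mechanics, while invalidate-on-write lives outside the headers as an explicit purge. A sketch, with the purge client left as a hypothetical stand-in:

```python
# 1) Time-based (TTL): valid for 300 s, no revalidation until then.
ttl_headers = {"Cache-Control": "public, max-age=300"}

# 2) Validation-based: store it, but revalidate before each use; the origin
#    answers 304 Not Modified when the ETag still matches.
validation_headers = {
    "Cache-Control": "no-cache",
    "ETag": '"v42"',
}

# 3) Invalidate-on-write: long TTL plus an explicit purge on every change.
#    `cache_client.purge` is a hypothetical API; real CDNs expose purging
#    through their own endpoints, with varying per-key/per-tag semantics.
write_headers = {"Cache-Control": "public, max-age=86400"}

def on_product_update(cache_client, product_id):
    cache_client.purge(f"/api/products/{product_id}")
```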
Practical checks (steps/checklist)
1) Confirm where the response was served
When debugging, the first question is always: did this response come from the client cache, an edge cache, or the origin? If you do not have a consistent way to answer this, debugging will be slow and speculative.
- Check the distribution: Are there two clusters, fast hits vs slow misses?
- Check regional behaviour: Edge hit ratio can vary strongly by geography and by PoP.
- Check request class: API endpoints may have different cacheability and key structure.
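Client-cache hits never reach your servers, but for everything else a rough classifier over response headers is a useful starting point. The debug headers below are examples only (`X-Cache` and `CF-Cache-Status` are emitted by some CDNs; `Age` is standard): check what your edge actually sends.

```python
import urllib.request

def classify_serve_layer(url):
    # Rough heuristic based on common, provider-specific debug headers.
    resp = urllib.request.urlopen(url)
    headers = {k.lower(): v for k, v in resp.headers.items()}
    edge = headers.get("x-cache", "") + headers.get("cf-cache-status", "")
    if "hit" in edge.lower():
        return "edge hit"
    # A non-zero Age header means *some* shared cache held this response.
    if int(headers.get("age", "0") or 0) > 0:
        return "shared cache (Age > 0)"
    return "origin (or an uninstrumented cache)"

print(classify_serve_layer("https://example.com/"))
```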
2) Validate the cache key
Cache keys should reflect the user-visible variation. If content varies by language, device class, auth state, or query parameters, the key needs to include the right subset. Too few variations cause correctness bugs (wrong content for a user); too many destroy hit ratio.
A practical technique is to write down the “variation contract” as a short list: which request inputs are allowed to change the response. Then check whether the cache key aligns with that contract.
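The contract translates almost directly into code: an explicit allow-list per endpoint, with everything else excluded from the key on purpose. The field names below are illustrative.

```python
from urllib.parse import urlencode

# The "variation contract" as code: only these request inputs are allowed
# to change the response, so only these appear in the cache key.
VARIATION_CONTRACT = {
    "/api/products": ["lang", "device_class"],
}

def cache_key(endpoint, request_fields):
    allowed = VARIATION_CONTRACT[endpoint]
    # Canonical order, and an explicit marker for a missing field so it
    # cannot silently collide with an empty one.
    parts = {f: request_fields.get(f, "-") for f in sorted(allowed)}
    return endpoint + "?" + urlencode(parts)

# Same user-visible variation, different session: must map to the same key,
# or the session ID silently destroys the hit ratio.
a = cache_key("/api/products", {"lang": "en", "device_class": "mobile", "session_id": "abc"})
b = cache_key("/api/products", {"lang": "en", "device_class": "mobile", "session_id": "xyz"})
assert a == b
print(a)  # /api/products?device_class=mobile&lang=en
```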
3) Inspect miss amplification
Misses are not just slower; they create load. Suppose an edge cache has a 95% hit ratio and the origin receives 5% of requests. If that hit ratio drops to 90%, origin load doubles. The origin may move from “comfortable” to “queueing” with a surprisingly small hit-ratio change.
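The arithmetic is worth internalising, because the amplification grows as the baseline hit ratio improves:

```python
# Origin load is proportional to the miss rate, not the hit rate, so small
# hit-ratio changes are amplified.
def origin_load_multiplier(old_hit_ratio, new_hit_ratio):
    return (1 - new_hit_ratio) / (1 - old_hit_ratio)

print(round(origin_load_multiplier(0.95, 0.90), 2))  # 2.0 -- the example above
print(round(origin_load_multiplier(0.99, 0.95), 2))  # 5.0 -- higher baselines amplify harder
```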
When you see a performance regression, ask whether it could be a cache regression. In practice, a cache regression looks like: stable hit latency but rising miss frequency, followed by rising origin latency and tail spikes.
4) Decide on staleness tolerance explicitly
If a team cannot state the acceptable staleness window, caching decisions become ad-hoc. A useful minimal spec is:
“For endpoint X, data may be up to N seconds old; during partial failures, serve stale for up to M minutes.”
This simple statement allows you to evaluate TTL choices, “stale-while-revalidate” patterns, and invalidation strategies without conflating correctness with freshness.
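As a sketch, the N/M spec maps onto max-age plus the RFC 5861 extensions stale-while-revalidate and stale-if-error; CDN support for the extensions varies, so treat the mapping as illustrative rather than universal.

```python
def cache_control_for(n_seconds_fresh, m_minutes_stale_on_failure):
    # max-age covers the N-second freshness window; stale-if-error covers
    # the M-minute partial-failure window. The revalidation window is a
    # separate tuning choice, fixed here for simplicity.
    return (
        f"public, max-age={n_seconds_fresh}, "
        "stale-while-revalidate=30, "
        f"stale-if-error={m_minutes_stale_on_failure * 60}"
    )

# "Data may be up to 60 s old; during partial failures, serve stale for up to 10 minutes."
print(cache_control_for(60, 10))
# -> public, max-age=60, stale-while-revalidate=30, stale-if-error=600
```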
Common pitfalls
- Changing cache keys accidentally: New headers, query parameters, or normalisation changes can collapse hit ratio overnight.
- Over-variant keys: Including too much (e.g., session IDs) makes caching ineffective and adds operational noise.
- Under-variant keys: Serving the wrong personalised content is a correctness incident, not “a cache edge case”.
- Ignoring miss amplification: A small miss increase can trigger queueing and make the whole origin unstable.
- Debugging without a serve-layer signal: Without knowing where a response came from, every hypothesis remains ambiguous.
Related notes
- TTFB and origin latency: TTFB often shifts because cache hit ratio shifts.
- Queueing basics and latency budgets: Miss amplification frequently triggers queueing and tail spikes.
- Performance regressions checklist: A workflow that includes cache regressions as a first-class hypothesis.
- Failure modes, retries, and idempotency: Retries can create bursts that defeat caches and overload origins.