Many “mysterious” latency regressions are queueing regressions. A service might have the same code and the same average request cost, and yet p95 and p99 can jump dramatically. The missing variable is utilisation: as a system approaches its capacity, even small demand increases create waiting time.

This note does not aim to teach queueing theory as a formal subject. Instead it gives a practical vocabulary: what to measure, what patterns suggest queueing, and how to design a latency budget that remains a useful engineering tool rather than a spreadsheet artifact.

Quick takeaways

  • Queues are inevitable: if demand is bursty and capacity is finite, waiting appears somewhere, even if you hide it.
  • Tail latency is the symptom: queueing typically hits p95/p99 more than p50; mean latency can look normal while users suffer.
  • Utilisation is the control knob: running “hot” can be efficient, but it makes tail latency fragile.
  • Budgets must include variability: a budget based on steady-state averages breaks the moment traffic becomes bursty or retries occur.
  • Concurrency is not free: increasing parallelism can raise throughput, but it can also create contention and longer queues.

Problem framing (why it matters)

Users experience systems through the slowest part of the path. That matters because load is not uniform: traffic spikes, coordinated retries, cache misses, and background jobs all create bursty demand. If your service is designed to run close to its limit, a small burst pushes the system into a regime where waiting dominates work.

Queueing is also a cross-cutting explanation. It connects performance to distributed behaviour: retries and fan-out can amplify demand; cache misses can double origin load; and observability is needed to prove where the waiting occurs.

Key concepts (definitions + mini examples)

Service time vs waiting time

For a request, “latency” often includes two distinct pieces:

response time = waiting time + service time

Service time is the time the system actively works on your request (CPU, I/O, dependency calls). Waiting time is time spent queued because resources are busy. When utilisation is low, waiting is near zero and response time tracks service time. When utilisation is high, waiting dominates.
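
The split is easy to see in a toy single-worker queue. A minimal sketch (the timestamps and the 10 ms service time are illustrative, not a real instrumentation API):

    import queue
    import threading
    import time

    def worker(q, results):
        # Pull requests off the queue; record waiting vs service time separately.
        while True:
            item = q.get()
            if item is None:
                return
            enqueued_at, request_id = item
            started_at = time.monotonic()
            time.sleep(0.01)                        # stand-in for real service work
            finished_at = time.monotonic()
            results.append((request_id,
                            started_at - enqueued_at,   # waiting time
                            finished_at - started_at))  # service time

    q, results = queue.Queue(), []
    t = threading.Thread(target=worker, args=(q, results))
    t.start()
    for i in range(100):             # a burst arrives faster than one worker can serve
        q.put((time.monotonic(), i))
    q.put(None)
    t.join()
    _, waited, served = results[-1]
    print(f"last request: waited {waited:.2f}s for {served:.3f}s of service")

With one worker and a burst of 100 requests, the last request waits roughly a second for ten milliseconds of actual work: same service time, very different response time.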

Why p95/p99 jump first

Queueing interacts with variability. Even if average service time is small, a few slow requests occupy workers longer, forcing subsequent requests to wait. Those waits stack up. The median request might still be fast, but the tail sees the compounded waiting.

Diagram: as utilisation approaches capacity, waiting time rises sharply
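
The shape of that curve can be reproduced with the textbook M/M/1 result, where mean waiting time is S·ρ/(1−ρ) for mean service time S and utilisation ρ (a simplified model, assuming Poisson arrivals and exponential service times, but the shape carries over):

    # Mean wait in an M/M/1 queue: W_q = S * rho / (1 - rho),
    # where S is mean service time and rho is utilisation.
    S = 0.010                                   # 10 ms mean service time
    for rho in (0.50, 0.80, 0.90, 0.95, 0.99):
        wait = S * rho / (1 - rho)
        print(f"utilisation {rho:.0%}: mean wait {wait * 1000:6.0f} ms")

Going from 50% to 99% utilisation multiplies mean waiting roughly a hundredfold (10 ms to 990 ms here); the last few points of utilisation are where the cliff is.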

Latency budgets as contracts

A latency budget is a constraint on the end-to-end path: a rough allocation of how much time each stage is allowed to consume. A useful budget is not “a number to hit” but “a tool to make trade-offs explicit”.

Budgets are most useful when they are tied to a percentile (e.g. p95 or p99) and to a load condition (e.g. a given requests-per-second). Without that, the budget can be met only under unrealistically quiet conditions.
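
One way to keep a budget honest is to write it down as data, with its percentile and load condition attached. The stages and numbers below are invented for illustration:

    # An illustrative end-to-end budget, pinned to a percentile and a load level.
    BUDGET = {
        "percentile": 95,      # the percentile the budget is stated at
        "load_rps": 2000,      # the load condition under which it must hold
        "total_ms": 250,
        "stages_ms": {
            "edge_wait": 20,
            "auth": 15,
            "origin_compute": 120,
            "db_query": 70,
            "serialisation": 25,
        },
    }

    # The allocation should cover the whole path, not just the famous stages.
    assert sum(BUDGET["stages_ms"].values()) == BUDGET["total_ms"]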

Practical checks (steps/checklist)

1) Look for a queueing signature

A common queueing signature: p50 shifts a little, p95 and p99 shift a lot, and the effect becomes more visible during higher-traffic periods. Another signature is a clear correlation between concurrency (or CPU) and latency. A rough test for the first signature is sketched after the list below.

  • Compare p50 vs p95: queueing inflates the tail; uniformly added work inflates all percentiles similarly.
  • Correlate with load: if latency tracks QPS, concurrency, or CPU, queueing is likely involved.
  • Check retries and fan-out: a burst of retries can create transient overload and long queues.
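
A sketch of that test, assuming you can pull two windows of latency samples (a quiet baseline and the incident) out of your metrics store:

    def percentile(samples, p):
        # Nearest-rank percentile of a list of latency samples.
        s = sorted(samples)
        return s[min(len(s) - 1, max(0, round(p / 100 * len(s)) - 1))]

    def queueing_signature(baseline, incident):
        # The tail inflating much faster than the median is the classic queueing shape.
        p50_shift = percentile(incident, 50) / percentile(baseline, 50)
        p95_shift = percentile(incident, 95) / percentile(baseline, 95)
        return p95_shift > 2 * p50_shift, p50_shift, p95_shift

The 2x threshold is arbitrary; the point is to compare the shifts against each other, not against an absolute number.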

2) Find the queue location

Waiting appears where a resource is bounded: a worker pool, a DB connection pool, a CPU core set, an external rate limit. The fastest way to localise is to list bounded resources in the request path and check which is saturated during the incident.

Be careful with hidden queues. For example, a thread pool might accept tasks and queue them internally, so CPU can look “normal” while waiting time grows.
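
One way to expose such a hidden queue in Python's ThreadPoolExecutor is to timestamp tasks at submission and again when they actually start. A sketch (real instrumentation would export these as metrics rather than print them):

    import time
    from concurrent.futures import ThreadPoolExecutor

    def timed(fn):
        # Capture the submission time now; measure queue wait when the task starts.
        submitted = time.monotonic()
        def wrapper(*args):
            queue_wait = time.monotonic() - submitted
            return queue_wait, fn(*args)
        return wrapper

    def work(i):
        time.sleep(0.05)        # stand-in for real work
        return i

    pool = ThreadPoolExecutor(max_workers=2)            # the bounded resource
    futures = [pool.submit(timed(work), i) for i in range(20)]
    waits = [f.result()[0] for f in futures]
    print(f"max hidden queue wait: {max(waits) * 1000:.0f} ms")

Twenty 50 ms tasks on two workers finish in about half a second, and the last task spends roughly 450 ms in the executor's internal queue, invisible to CPU metrics.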

3) Decide whether to reduce demand or increase capacity

Queueing problems can be addressed from two sides:

  • Reduce demand: caching, batching, deduplication, shedding optional work, or cutting fan-out.
  • Increase effective capacity: optimise the slow path, add replicas, raise pool limits carefully, or isolate noisy neighbours.

The important point is that “increase concurrency” is not automatically “increase capacity”. If work is CPU-bound, more concurrency can increase context switching and worsen throughput.
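
A sketch of making the capacity bound explicit instead of letting an implicit queue grow: cap in-flight requests and shed the excess (the limit of 32 is illustrative; size it from measured capacity, not hope):

    import threading

    class ConcurrencyLimit:
        # Admit at most `limit` requests at once; reject rather than queue the rest.
        def __init__(self, limit):
            self._slots = threading.BoundedSemaphore(limit)

        def try_run(self, fn, *args):
            if not self._slots.acquire(blocking=False):
                return None         # shed: a fast failure instead of a long wait
            try:
                return fn(*args)
            finally:
                self._slots.release()

    limiter = ConcurrencyLimit(limit=32)

Shedding converts unbounded waiting into explicit, observable failures that callers can handle or retry with backoff.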

4) Make the budget operational

A budget only helps if it is connected to measurement. For each stage, decide what metric represents its budget usage (e.g. edge wait time, origin compute time, DB query latency) and ensure it is collected at the same percentile you care about.
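
Continuing the illustrative BUDGET mapping from the earlier sketch, the operational check is mechanical once each stage exports samples (the percentile helper is repeated so the sketch stands alone):

    def percentile(samples, p):
        # Nearest-rank percentile of a list of latency samples.
        s = sorted(samples)
        return s[min(len(s) - 1, max(0, round(p / 100 * len(s)) - 1))]

    def stage_over_budget(stage, samples_ms, budget=BUDGET):
        # Measure at the same percentile the budget is stated at.
        measured = percentile(samples_ms, budget["percentile"])
        allowed = budget["stages_ms"][stage]
        return measured > allowed, measured, allowed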

Common pitfalls

  • Optimising mean latency: queueing hides in the tail; mean latency can be stable while p99 doubles.
  • Running too hot: a system at high utilisation has little headroom for bursts and becomes fragile.
  • Increasing concurrency blindly: more parallelism can increase contention and reduce throughput.
  • Ignoring retry amplification: retries can turn a small slowdown into an overload event (quantified in the sketch after this list).
  • Budgets without load conditions: a budget that holds only at low traffic leads to unpleasant surprises in production.
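
The retry-amplification pitfall is easy to quantify: if each attempt fails (or times out) with probability f and clients make up to k extra attempts, each logical request generates 1 + f + … + f^k attempts of offered load:

    def amplification(f, k):
        # Expected attempts per logical request with up to k retries on failure.
        return sum(f ** i for i in range(k + 1))

    for f in (0.1, 0.5, 0.9):
        print(f"attempt failure rate {f:.0%}: offered load x{amplification(f, 2):.2f}")

At a 90% failure rate, two retries nearly triple offered load exactly when the system can least afford it, which is why retries need budgets and backoff.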

Related notes