“This refactor is safe” often means “the observable behaviour does not change”. That sounds simple until you ask: observable to whom? A user may observe response codes and content; an operator may observe latency and logs; another service may observe timing and retries. Behavioural equivalence is the discipline of defining what counts as an observation, then checking whether two systems are indistinguishable under those observations.

Formal methods provide several equivalence notions (trace equivalence, bisimulation, observational equivalence). You do not need the full taxonomy to use the idea. The practical goal is to avoid accidental changes: changing error timing, adding a new visible intermediate state, or breaking protocol expectations.

Quick takeaways

  • Equivalence depends on observation: if observers can see more, equivalence becomes harder to satisfy.
  • Traces are a simple starting point: compare sequences of externally visible events (requests, replies, errors).
  • Bisimulation is about step-by-step matching: if one system can take a step, the other can match it with an equivalent step.
  • Refinement is “allowed to be more deterministic”: a refinement preserves promised behaviour while restricting or resolving choices.
  • Hidden steps matter only if they leak: internal states are fine unless they become externally observable (timing, errors, outputs).

Problem framing (why it matters)

Equivalence is the conceptual bridge between specification and implementation change. It lets you justify transformations: optimisations, protocol refactors, caching layers, or concurrency changes. Without an equivalence notion, “safe change” reduces to hope.

The subtlety is that many systems have non-functional observations: latency, ordering, retries. If a client uses timeouts, then timing becomes part of behaviour. That is why equivalence discussions often belong in design reviews for distributed and performance-critical systems.

Key concepts (definitions + mini examples)

Observation model

Choose what the observer can see. For a simple API, the observation model might be: request → response code + response body. For a protocol, it might include message order. For a performance-sensitive client, it might include response time relative to a timeout.

Two systems are equivalent if no observer (within the model) can distinguish them.

Diagram: two implementations can be equivalent under one observation model and not under another
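The diagram’s claim can be sketched directly. Below is a minimal illustration (all names and latencies are made up): two implementations that agree under a functional observation model but disagree once the observer can see whether a response beat its timeout.

```python
# Sketch: the same two implementations compared under two observation
# models. impl_a, impl_b and their latencies are hypothetical.

def impl_a(request):
    return {"code": 200, "body": "ok", "latency_ms": 5}

def impl_b(request):
    # Same functional result, but slower (e.g. a cache-miss path).
    return {"code": 200, "body": "ok", "latency_ms": 250}

def observe_functional(response):
    # Observer sees only response code and body.
    return (response["code"], response["body"])

def observe_with_timeout(response, timeout_ms=100):
    # Observer also sees whether the response arrived within its timeout.
    return (response["code"], response["body"],
            response["latency_ms"] <= timeout_ms)

def equivalent(f, g, observe, requests):
    # Equivalent iff no request lets this observer tell f and g apart.
    return all(observe(f(r)) == observe(g(r)) for r in requests)

requests = ["GET /x"]
print(equivalent(impl_a, impl_b, observe_functional, requests))    # True
print(equivalent(impl_a, impl_b, observe_with_timeout, requests))  # False
```

The only thing that changed between the two checks is the observation model; the implementations themselves are untouched.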

Trace equivalence (simple view)

A trace is a sequence of visible events. If two systems have the same set of possible traces, they are trace equivalent. This is a simple criterion that often works well for request/response behaviour.

Trace equivalence can miss some distinctions: two systems can have exactly the same traces even though one commits to an internal choice earlier than the other, which matters once an observer can interact with the system mid-run. If you need to reason about branching structure and interaction, you may want a stronger notion such as bisimulation.
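As a sketch, trace equivalence reduces to set equality over the traces each system can produce. The systems below are modelled simply as sets of possible traces; the retry scenario is illustrative.

```python
# Sketch: trace equivalence as set equality over visible event sequences.

def trace_equivalent(system_a, system_b):
    # Two systems are trace equivalent if they admit exactly the same
    # set of visible traces.
    return set(system_a) == set(system_b)

# A service that retries once on failure, before and after a refactor.
before = {
    ("request", "ok"),
    ("request", "error", "retry", "ok"),
}
after = {
    ("request", "ok"),
    ("request", "error", "retry", "ok"),
}
print(trace_equivalent(before, after))  # True

# A refactor that leaks a new visible outcome breaks equivalence.
after_leaky = after | {("request", "error", "retry", "error")}
print(trace_equivalent(before, after_leaky))  # False
```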

Bisimulation (step-by-step matching)

Bisimulation is a stronger idea: whenever one system can make a visible step, the other system can make a matching step to a related state. Intuitively, the two systems can “play the same game” move by move.

In practice, you can use the intuition without the formal definition: if a refactor introduces a new intermediate externally visible state, it often breaks step-by-step matching.
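For small systems the step-by-step matching can be checked mechanically. The sketch below runs a naive bisimulation check over labelled transition systems given as `{state: {label: {successors}}}`; the two services and their states are invented for illustration.

```python
# Sketch: naive bisimulation check by fixed-point refinement of the
# full relation between the two state spaces.

def bisimilar(lts_a, lts_b, start_a, start_b):
    relation = {(p, q) for p in lts_a for q in lts_b}

    def matches(p, q, rel):
        # Every labelled step of p must be matched by q into a related
        # state, and vice versa.
        for label, succs in lts_a[p].items():
            for p2 in succs:
                if not any((p2, q2) in rel
                           for q2 in lts_b[q].get(label, ())):
                    return False
        for label, succs in lts_b[q].items():
            for q2 in succs:
                if not any((p2, q2) in rel
                           for p2 in lts_a[p].get(label, ())):
                    return False
        return True

    changed = True
    while changed:
        changed = False
        for pair in list(relation):
            if not matches(*pair, relation):
                relation.discard(pair)
                changed = True
    return (start_a, start_b) in relation

# Original service: request then respond.
a = {"s0": {"req": {"s1"}}, "s1": {"resp": {"s0"}}}
# Refactor adds an externally visible "pending" step.
b = {"t0": {"req": {"t1"}},
     "t1": {"pending": {"t2"}},
     "t2": {"resp": {"t0"}}}

print(bisimilar(a, a, "s0", "s0"))  # True
print(bisimilar(a, b, "s0", "t0"))  # False
```

This mirrors the intuition in the text: the refactor fails the check precisely because it introduced a new externally visible intermediate step, not because anything internal changed.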

Practical checks (steps/checklist)

1) Write down the observer

Before changing an implementation, list the observers: user, other services, monitoring, operators. For each, list what can be observed. This prevents “surprising” behaviour changes that were only surprising because the observation model was implicit.

2) Enumerate visible events

Make a small table of visible events per request: success responses, error codes, retry semantics, and any exposed ordering constraints. If you cannot enumerate them, the interface may be underspecified.
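Such a table can live next to the code as plain data. Everything below (the endpoint, the codes, the semantics) is illustrative, not a prescribed schema.

```python
# Sketch: a minimal visible-event inventory for one endpoint.

visible_events = {
    "POST /orders": {
        "success": ["201 Created + order body"],
        "errors": ["400 invalid payload", "409 duplicate", "503 overloaded"],
        "retry_semantics": "idempotent via client-supplied key",
        "ordering": "responses may interleave across connections",
    },
}

# A change is a candidate for equivalence only if it leaves every entry
# in this inventory unchanged (or the difference is documented).
for endpoint, events in visible_events.items():
    print(endpoint, sorted(events))
```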

3) Check timing-sensitive clients explicitly

If any client uses timeouts, then latency is part of behaviour. A change that keeps functional outputs but doubles p99 latency may break equivalence from the client’s point of view.
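One concrete check, under the assumption that the client’s only timing observation is “did the response beat my timeout”: compare p99 latency against the timeout before and after the change. The samples and the 100 ms timeout below are made up.

```python
# Sketch: latency as an observable, via p99 against a client timeout.

def p99(samples_ms):
    ordered = sorted(samples_ms)
    index = min(len(ordered) - 1, int(0.99 * len(ordered)))
    return ordered[index]

def equivalent_for_timeout_client(before_ms, after_ms, timeout_ms):
    # Behaviour changes for this client if the change moves p99
    # across the timeout boundary.
    return (p99(before_ms) <= timeout_ms) == (p99(after_ms) <= timeout_ms)

before = [10] * 99 + [80]    # p99 = 80 ms
after = [10] * 99 + [160]    # functionally identical, slower tail
print(equivalent_for_timeout_client(before, after, timeout_ms=100))  # False
```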

4) Use refinement language

Many safe changes are refinements: they reduce nondeterminism while preserving promised outcomes. For example, replacing “either of two caches may answer” with “prefer edge cache, fall back to origin” can be a refinement if it does not change visible output.
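The cache example can be checked as trace inclusion: the implementation refines the specification if every trace it can produce is one the specification already allowed. The traces below are illustrative.

```python
# Sketch: refinement as trace inclusion. The spec may answer from
# either cache; a refinement resolves the choice without adding
# anything new.

def refines(impl_traces, spec_traces):
    # impl refines spec if impl's traces are a subset of spec's:
    # it may be more deterministic, never more permissive.
    return set(impl_traces) <= set(spec_traces)

spec = {
    ("req", "answer-from-edge"),
    ("req", "answer-from-origin"),
}
# "Prefer edge cache" removes a choice but adds no new visible outcome.
impl = {("req", "answer-from-edge")}
print(refines(impl, spec))  # True

# A change that introduces a new visible outcome is not a refinement.
impl_bad = impl | {("req", "answer-stale")}
print(refines(impl_bad, spec))  # False
```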

Common pitfalls

  • Equating “same outputs” with “same behaviour”: ordering, errors, and timing can be observable.
  • Forgetting failure behaviour: a refactor that changes retry/timeout outcomes changes behaviour for clients.
  • Using an observation model that is too strong: if you require equivalence on internal details, you forbid useful optimisations.
  • Using an observation model that is too weak: if you ignore key signals, you declare unsafe changes “equivalent”.
  • Not documenting the chosen notion: teams need a shared language for what “safe change” means.

Related notes