Autonomous AI agents are entering production systems at a pace that outstrips the tooling available to observe and govern them. An agent that can browse the web, call APIs, write code, and make sequential decisions based on intermediate results is not a service in the traditional sense. It is a non-deterministic actor with a dynamic execution plan, and it introduces failure modes that existing observability and governance infrastructure was not designed to handle.
This is a continuation of the challenges explored in the context of multi-agent architectures and hybrid AI pipelines, but with a sharper focus on what happens once agents operate with meaningful autonomy. The gap between deploying an agent and operating it safely is where most organisations underinvest. The cost of that underinvestment surfaces as incidents that are difficult to diagnose, expensive to remediate, and hard to prevent from recurring.
The sections below cover what distinguishes agents from conventional services, the new failure modes they introduce, the observability requirements specific to agent systems, governance patterns that constrain agent behaviour without eliminating its value, the challenges of audit trails and reproducibility, and practical recommendations for teams deploying agents today.
What makes agents different from services
A conventional service receives a request, executes a deterministic (or near-deterministic) code path, and returns a response. The set of possible behaviours is bounded by the code. Observability tools are designed around this model: trace a request, measure latency at each hop, log structured events, alert on error rates.
An agent differs in several ways that matter operationally.
- **Non-determinism is intrinsic.** The same input may produce different reasoning paths and different outputs on successive invocations. This is not a bug. It is the mechanism by which agents handle novel situations. But it means that replaying an input does not reproduce the behaviour, which undermines standard debugging workflows.
- **Dynamic tool use.** An agent decides at runtime which tools to invoke and in what order. A code-generation agent might call a search API, then a file-read tool, then a code-execution sandbox, then search again. The execution graph is not known in advance. It is constructed during the run.
- **Cascading decisions.** Each step in an agent's reasoning depends on the results of previous steps. An error or hallucination early in the chain can propagate, causing the agent to pursue a line of reasoning that diverges further from useful output with each step. Services do not compound errors this way because they do not chain decisions internally.
- **Context window as a resource.** An agent's context window is a finite resource that fills as the agent works. Unlike memory or CPU, which can be provisioned elastically, the context window has a hard limit. When it fills, the agent either loses information or fails. This is a resource exhaustion mode with no direct analogue in service architectures.
These differences do not make agents unmanageable. They make agents a different category of component that requires purpose-built operational infrastructure.
New failure modes
Agent systems exhibit failure modes that are absent or rare in conventional distributed systems. Understanding them is a prerequisite for building effective observability and governance.
- **Hallucination loops.** The agent generates a plausible but incorrect intermediate result, then reasons further based on that result, reinforcing the error. In a multi-step workflow, this can produce outputs that are internally consistent but factually wrong. The loop is hard to detect without external validation at intermediate steps.
- **Unbounded tool invocation.** An agent tasked with "find the answer" may call a search tool dozens of times, each time slightly rephrasing the query, without recognising that the information is not available. Without a call budget, this burns tokens and time. In a system with real-world side effects (sending messages, creating records), unbounded invocation creates real damage.
- **Context window exhaustion.** A long-running agent accumulates tool outputs, intermediate reasoning, and error messages in its context. Eventually the window fills. The agent may silently drop early context, leading to incoherent decisions, or it may fail outright. Either outcome is difficult to diagnose after the fact because the context at the point of failure is often not fully preserved in logs.
- **Prompt injection via tool output.** An agent that reads external content (web pages, database records, API responses) is exposed to adversarial input. A malicious payload in a tool output can redirect the agent's behaviour. This is an injection attack, structurally similar to SQL injection, but targeting the reasoning layer rather than a query parser.
- **Capability drift.** When the underlying model is updated (by the provider, not by the operator), the agent's behaviour may change. A prompt that worked reliably with one model version may produce different tool-call patterns or different output formats with the next. This is a silent regression that standard integration tests may not catch because the tests themselves are evaluated against probabilistic criteria.
These failure modes share a property: they are difficult to detect from outside the agent. The system appears to be working (the agent is responding, tools are being called) until the final output reveals a problem. This is why observability must be deeper for agents than for services.
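The unbounded-invocation mode in particular can be caught cheaply from inside the agent loop. A minimal sketch, assuming every tool call is routed through a guard before execution; `ToolCallGuard`, its thresholds, and the string-similarity heuristic are illustrative choices, not a standard API:

```python
# Sketch: bounding tool invocation and flagging rephrased-query loops.
# Thresholds and the similarity heuristic are illustrative assumptions.
from difflib import SequenceMatcher


class ToolCallGuard:
    """Caps per-tool call counts and flags near-duplicate arguments,
    a cheap signal that an agent is looping on rephrased queries."""

    def __init__(self, max_calls_per_tool=10, similarity_threshold=0.85):
        self.max_calls_per_tool = max_calls_per_tool
        self.similarity_threshold = similarity_threshold
        self.history = {}  # tool name -> list of argument strings seen so far

    def check(self, tool_name, args_text):
        calls = self.history.setdefault(tool_name, [])
        if len(calls) >= self.max_calls_per_tool:
            return "deny: call budget exhausted"
        for previous in calls:
            # High textual similarity to an earlier call suggests a loop.
            if SequenceMatcher(None, previous, args_text).ratio() >= self.similarity_threshold:
                calls.append(args_text)
                return "warn: near-duplicate of an earlier call"
        calls.append(args_text)
        return "allow"


guard = ToolCallGuard(max_calls_per_tool=3)
print(guard.check("search", "capital of France"))    # allow
print(guard.check("search", "capital of France?"))   # warn: near-duplicate of an earlier call
```

A "warn" result might feed a metric rather than block the call outright; the hard "deny" at the cap is what bounds the damage.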
Observability requirements for agent systems
Standard distributed systems observability provides traces, metrics, and logs. Agent systems need all of these, plus additional signal types that capture the reasoning process.
- **Decision traces.** A record of each reasoning step: the input to the LLM, the output, the tool call decision, the tool result, and the next input. This is more granular than a distributed trace span. Each LLM invocation within an agent's run should be a distinct event with its full input and output preserved.
- **Tool invocation logs.** Every tool call must be logged with its arguments, return value, latency, and cost. This serves both debugging and governance. If an agent made an unexpected API call, the log must show exactly what it called and why (the preceding reasoning).
- **Token and cost tracking.** Each LLM call consumes tokens, and tokens cost money. Per-step token tracking enables cost attribution to specific user requests, specific agents, and specific reasoning paths. It also provides early warning for runaway loops: a sudden spike in token consumption within a single run is a strong signal that something has gone wrong.
- **Context window utilisation.** Tracking how full the context window is at each step helps detect exhaustion before it causes failures. A dashboard showing context utilisation over the course of a run reveals whether the agent's workflow is sustainable or whether it will reliably hit the limit on longer tasks.
- **Outcome validation.** Where possible, the agent's output should be validated against structural or semantic criteria before being returned to the user or passed to the next system. This is not traditional observability but it serves the same purpose: catching problems before they propagate.
The ordering of events matters. Reconstructing an agent's decision sequence requires accurate timestamps and causal ordering. If the observability pipeline reorders events, the decision trace becomes misleading.
Governance patterns
Governance for agent systems means constraining what an agent can do without eliminating the flexibility that makes agents useful. The balance is delicate. Too little governance and the agent causes incidents. Too much and it is a rigid workflow engine with extra latency.
- **Approval gates.** Before executing high-impact actions (sending an external communication, modifying a production database, committing code), the agent pauses and requests human approval. This is a synchronous governance checkpoint. It adds latency but prevents irreversible errors. The design challenge is classifying which actions are high-impact. Agents that dynamically compose tool calls may produce high-impact combinations from individually low-impact tools.
- **Budget limits.** Hard caps on token consumption, tool invocations, wall-clock time, and monetary cost per run. When a limit is reached, the agent is stopped and the partial result is returned with an explanation. Budget limits are the simplest and most effective governance mechanism. They bound the blast radius of every failure mode listed above.
- **Scope constraints.** Restricting which tools an agent can access, which data it can read, and which actions it can take. This is the principle of least privilege applied to agents. A summarisation agent should not have access to a code-execution sandbox. A data-retrieval agent should not have write access to the database. Tool access should be granted per agent, not globally.
- **Output validation policies.** Automated checks on agent outputs before they are delivered. These can be structural (JSON schema validation), semantic (does the output answer the question?), or policy-based (does the output contain personally identifiable information that should have been redacted?). Validation policies act as a final defence layer.
- **Kill switches.** The ability to immediately halt a running agent or disable a class of agents across the system. This is the operational equivalent of a circuit breaker, but applied at the agent level. It must be fast (seconds, not minutes) and must not depend on the agent cooperating with the shutdown request.
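The budget-limit pattern can be sketched as a small object the agent loop consults before each step. The class names and default caps below are illustrative, not a standard library:

```python
# Sketch of per-run budget enforcement. Default caps are illustrative.
import time


class BudgetExceeded(Exception):
    """Raised when any hard cap is hit; the loop catches this, stops the
    agent, and returns the partial result with an explanation."""


class RunBudget:
    """Hard caps on tokens, tool calls, and wall-clock time for one run."""

    def __init__(self, max_tokens=50_000, max_tool_calls=25, max_seconds=120):
        self.max_tokens = max_tokens
        self.max_tool_calls = max_tool_calls
        self.max_seconds = max_seconds
        self.tokens_used = 0
        self.tool_calls = 0
        self.started = time.monotonic()

    def charge_tokens(self, n: int):
        self.tokens_used += n
        if self.tokens_used > self.max_tokens:
            raise BudgetExceeded(f"token budget exhausted ({self.tokens_used})")

    def charge_tool_call(self):
        self.tool_calls += 1
        if self.tool_calls > self.max_tool_calls:
            raise BudgetExceeded("tool-call budget exhausted")

    def check_clock(self):
        if time.monotonic() - self.started > self.max_seconds:
            raise BudgetExceeded("wall-clock budget exhausted")
```

The agent loop calls `charge_tokens` after each LLM response, `charge_tool_call` before each tool invocation, and `check_clock` at the top of each iteration; a monetary cap would follow the same shape with per-model token prices.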
These patterns are not novel in distributed systems. Rate limiting, circuit breakers, access control, and input validation are standard practice. The difference is that agents require these mechanisms to be applied to the reasoning layer, not just the network and data layers.
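Scope constraints at the reasoning layer have a natural shape: a registry that grants tools per agent, so that an ungranted tool is simply never exposed to the agent's planner rather than being rejected at call time. The registry API and the agent and tool names below are hypothetical:

```python
# Sketch of least-privilege tool access, granted per agent.
# The registry API and the agent/tool names are hypothetical.


class ToolRegistry:
    def __init__(self):
        self._tools = {}    # tool name -> callable
        self._grants = {}   # agent name -> set of granted tool names

    def register(self, name, fn):
        self._tools[name] = fn

    def grant(self, agent, tool_name):
        self._grants.setdefault(agent, set()).add(tool_name)

    def tools_for(self, agent):
        """Only granted tools are exposed to the agent's planner;
        everything else is invisible, not merely forbidden."""
        return {n: self._tools[n] for n in self._grants.get(agent, set())}


registry = ToolRegistry()
registry.register("search", lambda q: f"results for {q}")
registry.register("run_code", lambda src: "sandbox output")
registry.grant("summariser", "search")   # no code execution for this agent

print(sorted(registry.tools_for("summariser")))   # ['search']
```

Reviewing the contents of `_grants` at deployment time is then the governance step: every grant is an explicit, auditable decision.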
Audit trails and reproducibility
Regulated industries require audit trails. If an agent made a decision that affected a customer, a regulator may ask why. Answering that question requires reconstructing the agent's reasoning at the time of the decision, with the exact inputs, model version, prompt version, and tool outputs that were in play.
This is hard for several reasons. The model itself is opaque. You can record the input and output but not the internal computation. The reasoning chain is emergent, constructed at runtime, and may include steps that the agent itself did not explicitly articulate. Tool outputs may have changed since the decision was made (a web page was updated, a database record was modified). And non-determinism means that even with identical inputs and model version, you may not reproduce the exact same output.
Practical approaches to managing this:
- **Record everything at the boundary.** Every LLM call, every tool invocation, every intermediate result. Storage costs are real but manageable compared to the cost of an audit failure. Use immutable storage for audit-critical records.
- **Version all components.** Model version, prompt version, tool definitions, and system prompt must be recorded with each run. If any of these changed between the decision and the audit, the discrepancy must be flagged.
- **Accept probabilistic reproducibility.** You may not reproduce the exact output, but you can demonstrate that the system was configured correctly, the inputs were valid, the reasoning chain was plausible, and the governance constraints were active. This is the standard that most regulators will hold you to, because they understand that stochastic systems do not behave identically on replay.
- **Build reproducibility tests that verify statistical properties rather than exact outputs.** If the agent should classify a document as "sensitive" at least 95% of the time, test that. If it should never call a particular tool for a particular input class, test that as a hard constraint.
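A statistical reproducibility test along these lines might look like the following. Here `classify()` is a hypothetical stub standing in for a real agent call, and the trial count is an illustrative trade-off between test cost and statistical power; the 95% threshold comes from the example above:

```python
# Sketch of a reproducibility test over a statistical property rather
# than an exact output. classify() is a hypothetical stub for an agent.
import random


def classify(document: str) -> str:
    """Stand-in for a stochastic agent call: labels the document
    'sensitive' most of the time, 'public' occasionally."""
    return "sensitive" if random.random() < 0.97 else "public"


def test_sensitivity_rate(trials: int = 200, required_rate: float = 0.95) -> bool:
    """Run the same input many times and check the observed rate,
    rather than asserting on any single stochastic output."""
    hits = sum(classify("employee salary data") == "sensitive"
               for _ in range(trials))
    return hits / trials >= required_rate
```

The hard-constraint case ("never call tool X for input class Y") needs no statistics: a single violation across the trials fails the test.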
Practical recommendations
For teams deploying agentic AI systems today, the following practices reduce risk without requiring bespoke infrastructure.
- **Start with budget limits.** Before investing in sophisticated governance, set hard caps on tokens, tool calls, and cost per run. This single mechanism prevents the worst failure modes (unbounded loops, runaway costs) and buys time to build more nuanced controls.
- **Instrument the reasoning layer.** Add observability at the LLM call level, not just the service level. Every call to the model should emit a structured event with input, output, token count, and latency. This is the foundation for debugging, cost attribution, and audit trails.
- **Treat prompts as code.** Version prompts in source control. Review changes. Test prompt updates against a regression suite before deploying. Prompt changes are the most common cause of agent behaviour changes, and they should receive the same rigour as code changes.
- **Isolate tool access per agent.** Apply least privilege. Each agent gets access to exactly the tools it needs. Review tool grants as part of the deployment process. If an agent can reach an external API, someone should have explicitly approved that access.
- **Plan for model upgrades.** When the model provider ships a new version, test your agents against it in staging before adopting it in production. Monitor for behavioural drift in the first days after an upgrade. Pin model versions where the provider allows it.
- **Build incident response playbooks.** Agent incidents are unfamiliar territory for most operations teams. Write playbooks that cover the agent-specific failure modes: how to detect a hallucination loop, how to interpret a decision trace, how to roll back an agent to a previous prompt version, how to invoke the kill switch.
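Treating prompts as code and recording component versions both reduce to attaching identifiers to every run. A minimal sketch using a content hash as the prompt version; the manifest shape and the model name are illustrative assumptions:

```python
# Sketch of prompt versioning via content hash, recorded per run so an
# audit can prove which prompt text was live. Manifest shape is illustrative.
import hashlib
import json


def prompt_version(prompt_text: str) -> str:
    """Short content hash used as the prompt version identifier:
    any edit to the prompt, however small, yields a new version."""
    return hashlib.sha256(prompt_text.encode("utf-8")).hexdigest()[:12]


def run_manifest(prompt_text: str, model: str) -> str:
    """Metadata to record (immutably) alongside every run's decision trace."""
    return json.dumps({
        "model": model,  # pin and record the exact model version where possible
        "prompt_version": prompt_version(prompt_text),
    }, sort_keys=True)


manifest = run_manifest("You are a careful summariser.", "example-model-2025-01")
```

A regression suite can then gate deploys on the hash: a changed `prompt_version` requires the suite to pass before the new prompt reaches production.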
Agentic AI is not a departure from distributed systems principles. It is an intensification of them. The same concerns about failure boundaries, observability, causal ordering, and formal specification of expected behaviour apply, with the added challenge that the components are probabilistic. Teams that approach agent deployment with the same discipline they apply to any other distributed system will be well positioned. Teams that treat agents as a different category, exempt from operational rigour, will learn the same lessons the hard way.
Related notes
- Observability for distributed systems: signals that help localise latency and failure across services.
- Time, clocks, and ordering: why ordering matters when reconstructing agent decision sequences.
- From monolithic to multi-agent: decomposition patterns and failure boundaries for task-specific agents.