AI code generation is a production reality. Language models write functions, generate tests, scaffold entire modules, and complete complex refactoring tasks in seconds. The speed is extraordinary. The correctness is not. Generated code compiles, passes surface-level inspection, and often works for the common case. But it also introduces subtle bugs in edge cases, mishandles concurrency, breaks security assumptions, and violates protocol contracts in ways that are difficult to catch by reading the output.
This is not a criticism of the tools. It is a structural property of the generation process. A language model produces code by statistical continuation: given the prompt and context, what token sequence is most likely? That objective correlates with correctness but does not guarantee it. The model has no concept of an invariant, no model of memory safety, no understanding of protocol state machines. It has patterns learned from training data, and those patterns are remarkably useful but fundamentally different from the reasoning needed to ensure correctness.
Formal specifications offer a way to close this gap. A specification states what the code must do, in terms precise enough to be checked mechanically. When generated code has a specification to conform to, the gap between "looks right" and "is right" can be narrowed, sometimes eliminated. This article explores where generated code fails, how specifications help, and what practical workflows emerge from combining the two. It builds on the foundations of specifications and contracts and connects to the broader trend of LLM-assisted formal verification.
The tension between speed and correctness
The appeal of AI code generation is obvious. A task that takes a developer thirty minutes takes a language model fifteen seconds. For boilerplate, CRUD operations, and well-documented API integrations, the output is often correct or close enough. Development velocity goes up.
The problem is that velocity and correctness pull in different directions. Speed encourages accepting output with minimal review. Correctness demands careful examination of edge cases, error handling, and assumptions. In practice, the review step is compressed. A developer scans the generated code, confirms it handles the obvious case, and moves on. The edge cases surface later as bugs in production.
What makes AI generation different from copy-paste coding is scale. A developer using AI generation might introduce dozens of unreviewed functions in a single session. The error surface grows proportionally. The question is not whether to use AI code generation, but what infrastructure and practice can restore correctness confidence in a high-velocity workflow.
Where generated code fails
Understanding the failure modes of generated code is a prerequisite for building effective guardrails. The failures cluster into predictable categories.
- **Edge cases and boundary conditions.** Generated code typically handles the common case well because the common case dominates the training data. Empty inputs, maximum-size inputs, negative values, Unicode edge cases, and off-by-one errors in loops are systematically underrepresented in the model's learned patterns. A sorting function might work perfectly for typical arrays but fail on empty arrays or arrays with duplicate elements (a sketch after this list makes the pattern concrete).
- **Concurrency and ordering.** Code that involves threads, locks, async operations, or shared mutable state is particularly error-prone when generated. The model may produce code that works under sequential execution but has race conditions under concurrent access. These bugs are hard to find by testing because they depend on timing, and they are hard to find by review because concurrent reasoning is cognitively expensive.
- **Security assumptions.** Generated code may not sanitise inputs, may use insecure defaults, or may handle authentication tokens carelessly. The model learned from a training corpus that includes both secure and insecure code, and it has no preference for one over the other unless the prompt explicitly requests security considerations. Injection vulnerabilities, improper access checks, and information leakage are common failure classes.
- **Protocol conformance.** When generated code implements a communication protocol or API integration, it may violate ordering constraints, omit required fields, or handle error responses incorrectly. A function that calls an API might not check the response status, might not handle rate limiting, or might not follow the required retry semantics. The model generates code that calls the API, not code that conforms to the API's [protocol contract](/notes/formal-methods/session-types-protocols/).
- **Implicit assumptions.** Generated code often embeds assumptions that are not stated in the output. A function might assume its input is sorted, or that a file exists, or that a network call will succeed. These assumptions may hold in the developer's test environment but fail in production. Because the assumptions are implicit, they are invisible during review.
These failure modes share a common characteristic: they involve properties that the model was not optimising for. The model optimises for plausible next tokens given the context. Properties like memory safety, protocol conformance, and concurrency correctness are emergent (or not) from the token sequence, not explicit objectives.
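As a concrete illustration of the first and last categories, consider a helper of the kind a model plausibly produces. The function below is illustrative rather than taken from any real generation session; it works for typical inputs and silently misbehaves at the boundary.

```typescript
// Plausible "generated" code: correct for typical inputs, wrong at the boundary.
// average([]) returns NaN because a sum of 0 is divided by a length of 0, and the
// implicit "non-empty input" assumption is nowhere visible in the code itself.
function average(xs: number[]): number {
  return xs.reduce((sum, x) => sum + x, 0) / xs.length;
}

// The same function with the assumption promoted to an explicit, checked precondition.
function averageChecked(xs: number[]): number {
  if (xs.length === 0) {
    throw new RangeError("average requires a non-empty array");
  }
  return xs.reduce((sum, x) => sum + x, 0) / xs.length;
}
```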
Specifications as guardrails
A formal specification makes expected behaviour explicit. Instead of relying on generated code to implicitly satisfy requirements, you state the requirements in a form that can be checked.
At the simplest level, this means type signatures. In a language with a rich type system (Rust, Haskell, TypeScript in strict mode), a correct type signature rules out large classes of errors mechanically. Beyond types, preconditions and postconditions state what must be true before and after a function executes. Invariants state what must always be true about a data structure or system state: a balance never negative, a queue always FIFO, a session always in a defined state.
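A minimal sketch of what such contracts can look like when encoded as runtime checks in TypeScript; the `withdraw` function, its `Account` type, and the local `assert` helper are illustrative, not a prescribed API.

```typescript
// Invariant:      an account balance is never negative.
// Precondition:   the amount is positive and does not exceed the balance.
// Postcondition:  the new balance equals the old balance minus the amount.
interface Account {
  balance: number;
}

function assert(condition: boolean, message: string): asserts condition {
  if (!condition) throw new Error(message);
}

function withdraw(account: Account, amount: number): Account {
  // Preconditions: checked before any work is done.
  assert(amount > 0, "precondition violated: amount must be positive");
  assert(amount <= account.balance, "precondition violated: insufficient funds");

  const next: Account = { balance: account.balance - amount };

  // Postcondition and invariant: checked before the result escapes.
  assert(next.balance === account.balance - amount, "postcondition violated");
  assert(next.balance >= 0, "invariant violated: negative balance");
  return next;
}
```

Generated code that violates a contract like this fails at the assertion, not at some later point in the system, which is exactly the kind of mechanical signal the rest of this article leans on.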
Protocol types constrain the sequence of messages in a communication. If generated code implements a client for a typed protocol, the session type ensures messages are sent and received in the correct order with the correct payload types.
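TypeScript cannot express full session types, but a typestate-style encoding captures part of the idea. The sketch below, with hypothetical `Idle` and `AwaitingReply` states, makes an out-of-order call a compile-time error rather than a runtime surprise.

```typescript
// Each protocol state is a distinct type, and each operation consumes one state and
// returns the next, so the compiler rejects messages sent in the wrong order.
type Idle = { tag: "idle" };
type AwaitingReply = { tag: "awaitingReply"; request: string };

function send(conn: Idle, request: string): AwaitingReply {
  return { tag: "awaitingReply", request };
}

function receive(conn: AwaitingReply): { reply: string; next: Idle } {
  return { reply: `ack:${conn.request}`, next: { tag: "idle" } };
}

const start: Idle = { tag: "idle" };
const pending = send(start, "GET /status");
const { reply, next } = receive(pending);
console.log(reply, next.tag);

// receive(start);          // compile error: cannot receive before sending
// send(pending, "again");  // compile error: cannot send while a reply is pending
```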
The common thread is that specifications give generated code something external to conform to. This shifts the burden from human review (expensive, error-prone for subtle issues) to mechanical checking (cheap, reliable for properties that can be formally stated).
Spec-guided generation
The specification does not have to be written after the code is generated. It can guide the generation process itself.
In a spec-guided workflow, the developer writes the specification first: the function signature, preconditions, postconditions, invariants. This specification is included in the prompt to the language model. The model generates code that attempts to satisfy it. The specification is then checked mechanically.
This works particularly well with languages like Dafny, where specifications are part of the language syntax. The developer writes a method signature with `requires` and `ensures` clauses, the model generates the body, and the verifier checks conformance. If verification fails, the verifier's output can be fed back to the model for refinement.
In languages without built-in verification, a similar workflow uses property-based testing. The developer writes QuickCheck-style properties before generating the implementation. The model generates the code, and property tests check it against thousands of random inputs. This is weaker than formal verification but stronger than example-based testing.
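A sketch of that workflow in TypeScript, using the fast-check library as one possible property-testing tool; the `dedupe` function stands in for whatever the model is asked to generate, and the properties are written before the body exists.

```typescript
import fc from "fast-check";

// Specification, written first: properties that must hold for every input.
//   1. Every element of the input appears in the output.
//   2. The output contains no duplicates.

// Candidate implementation: imagine this body is model-generated.
function dedupe(xs: number[]): number[] {
  return Array.from(new Set(xs));
}

// Mechanical check against a large number of random inputs.
fc.assert(
  fc.property(fc.array(fc.integer()), (xs) => {
    const out = dedupe(xs);
    const everyInputKept = xs.every((x) => out.includes(x));
    const noDuplicates = new Set(out).size === out.length;
    return everyInputKept && noDuplicates;
  })
);
```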
The key insight is that writing the specification first changes the developer's role. Instead of reviewing generated code to see if it is correct, the developer defines what correct means and lets tools check conformance.
Testing vs verification
Testing generated code and verifying generated code are different activities with different guarantees.
Testing checks code against a finite set of inputs. For generated code, the risk is that test inputs do not cover the edge cases where the code fails. The model and the developer may share the same blind spots.
Property-based testing improves coverage by generating inputs randomly. It catches a broader range of bugs than example-based testing, but it is still sampling, not proving.
Formal verification proves that code satisfies its specification for all inputs. If the proof checks, the guarantee is absolute, modulo trust in the proof checker and the specification.
In practice, the three approaches are complementary. Example-based tests catch regressions and serve as documentation. Property-based tests catch edge cases. Formal verification eliminates doubt for the properties that matter most. For generated code, the most cost-effective approach is often: generate, run property-based tests, then apply formal verification to the critical subset.
A practical workflow
Bringing these ideas together suggests a workflow for teams that use AI code generation and care about correctness.
Start with specifications. Before generating code, write type signatures, preconditions, postconditions, invariants, and protocol constraints. They can be type annotations, doc comments, property test definitions, or contract assertions. The important thing is that they are explicit and checkable.
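In a TypeScript codebase this step can be as lightweight as a stub whose contract is spelled out before any body exists. A hypothetical example of the artifact this step produces, and that the next step pastes into the prompt:

```typescript
/**
 * Splits `xs` into consecutive chunks of length `size`.
 *
 * Precondition:   size >= 1
 * Postcondition:  concatenating the returned chunks reproduces `xs` exactly
 * Postcondition:  every chunk except possibly the last has exactly `size` elements
 */
function chunk<T>(xs: T[], size: number): T[][] {
  throw new Error("unimplemented: body to be generated against the contract above");
}
```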
Generate with specifications in context. Include the specification in the prompt. The model produces better code when it has clear constraints.
Check mechanically. Run the type checker, contract checker, property-based tests, and (where available) the formal verifier. Treat any failure as a rejection. Feed failures back to the model and regenerate rather than manually fixing generated code.
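One way to make the reject-and-regenerate loop concrete, again assuming fast-check (whose `fc.check` reports a failing counterexample instead of throwing), is to turn the first counterexample directly into feedback for the next prompt:

```typescript
import fc from "fast-check";

// Hypothetical model-generated candidate with a boundary bug: the upper bound is
// never applied, so values above `hi` pass through unchanged.
function clamp(x: number, lo: number, hi: number): number {
  return x < lo ? lo : x;
}

const result = fc.check(
  fc.property(
    fc.integer(),
    fc.integer({ min: -100, max: 0 }),
    fc.integer({ min: 1, max: 100 }),
    (x, lo, hi) => {
      const y = clamp(x, lo, hi);
      return y >= lo && y <= hi;
    }
  )
);

if (result.failed) {
  // The counterexample goes into the regeneration prompt rather than a manual patch.
  console.log("Property failed on input:", JSON.stringify(result.counterexample));
}
```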
Review specifications, not implementations. The developer's primary review effort should be on the specification. Is the precondition strong enough? Does the postcondition cover edge cases? A correct specification checked against correct code gives high confidence. A weak specification gives false confidence.
Invest verification proportionally. Utility functions and configuration code can be covered by conventional testing. Business logic, security-critical code, and protocol implementations warrant stronger guarantees. Use specs, invariants, and contracts to delineate the boundary between "tested" and "verified" code.
This workflow is not theoretical. Teams using Dafny, Lean 4, and TypeScript with strict mode are already practising variants of it. The AI provides velocity. The specification provides direction. The mechanical checks provide confidence. The gap that remains is the properties no one thought to specify. The best mitigation is a culture of specification writing: treat every function's contract as a first-class artefact and expand it as you discover new failure modes. The AI generates the code. The engineer defines what correct means.
Related notes
- Specs, invariants, and contracts: The specification layer that gives generated code something to conform to.
- Session types and protocols: Protocol-level specifications applicable to generated communication code.
- Formal verification goes mainstream: LLM-driven proof techniques relevant to verifying generated code.