Formal verification has been the gold standard for software correctness for decades, and for most of that time it has been confined to a narrow band of critical systems: avionics, cryptographic libraries, microprocessor designs, safety-critical embedded controllers. The reason was not a lack of interest from the broader engineering community. It was cost. Writing a formal proof required deep expertise in proof theory, fluency in a proof assistant's tactics language, and the patience to spend weeks on a single function. That cost equation is changing.
Large language models can now generate proof sketches in Lean 4, complete partial proofs in Coq, and translate informal English specifications into formal ones with enough accuracy to be useful as starting points. The term "vericoding" has emerged to describe a workflow where code and proof are developed together, with LLMs handling the mechanical proof engineering that previously consumed most of the effort. This is not a solved problem. LLMs hallucinate proofs, miss edge cases, and sometimes produce formally valid but semantically vacuous statements. But the trajectory is clear: the barrier to entry for formal verification is dropping, and the engineering practices around it are evolving.
This connects directly to the specification and contract-based thinking that underpins reliable systems design. It also intersects with the protocol verification questions raised by multi-agent systems. The sections below cover why verification was hard, what LLMs change, what vericoding looks like in practice, the current tool landscape, the limitations that remain, and what this means for engineering teams.
Why verification was hard
The difficulty of formal verification was never primarily intellectual. The underlying ideas (expressing properties in logic and proving that programs satisfy them) are well understood and taught in graduate courses worldwide. The difficulty was practical.
Proof assistants like Coq, Isabelle/HOL, and Lean require the user to guide the proof at a level of detail that has no analogue in normal programming. A simple property like "this sorting function returns a permutation of its input that is in non-decreasing order" might take a few lines to state but hundreds of lines to prove. The proof involves lemmas about list structure, induction principles, and arithmetic facts that are obvious to a human but must be spelled out for the machine.
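To make the asymmetry concrete, here is a minimal Lean 4 sketch, assuming Mathlib's List.Sorted, the ~ permutation notation, and insertionSort; mySort is a hypothetical stand-in. The statement fits in a few lines, while the sorry placeholders mark where the induction and lemma work would go.

```lean
import Mathlib.Data.List.Sort

-- Stand-in implementation so the spec has something to talk about.
def mySort (l : List Nat) : List Nat := l.insertionSort (· ≤ ·)

-- Stating the property is short; proving it from scratch is not.
theorem mySort_sorted (l : List Nat) : (mySort l).Sorted (· ≤ ·) := by
  sorry -- induction on l plus lemmas about ordered insertion

theorem mySort_perm (l : List Nat) : mySort l ~ l := by
  sorry -- permutation lemmas, spelled out step by step
```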
- Tactic fluency: Each proof assistant has its own tactic language for constructing proofs step by step. Learning these languages is a significant investment, comparable to learning a new programming paradigm. Coq's Ltac, Lean's tactic mode, and Isabelle's Isar each have distinct idioms and limitations. Expertise in one does not transfer cleanly to another.
- Library bootstrapping: A verified project needs verified libraries. If you want to prove properties about your data structures, you need formalised definitions and lemmas about the underlying mathematical objects. Building or finding these libraries was a major time sink, though projects like Mathlib for Lean 4 have improved this substantially.
- Specification effort: Before you can verify anything, you need a formal specification of what "correct" means. Writing good specifications is itself hard. An overly weak specification can be trivially satisfied by a broken implementation (see the sketch after this list). An overly strong specification can be impossible to implement efficiently. Getting the specification right often requires as much expertise as writing the proof.
- Maintenance burden: Proofs are brittle. A small change to the implementation can break a proof in ways that require significant rework. In an industrial codebase where code changes daily, maintaining proofs alongside code was seen as prohibitively expensive outside high-value targets like cryptographic primitives.
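A minimal sketch of the weak-specification trap from the list above, in plain Lean 4 with hypothetical names: the theorem below is true of an implementation that does no sorting at all, so "verified" here guarantees almost nothing.

```lean
-- brokenSort preserves length, which is all this weak spec asks for.
def brokenSort (l : List Nat) : List Nat := l

-- Trivially satisfied: the spec never mentions ordering or permutation.
theorem weak_spec (l : List Nat) : (brokenSort l).length = l.length := rfl
```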
These costs meant that verification was reserved for code where the cost of failure exceeded the cost of proof. For most software, testing and code review were "good enough" in economic terms, even if they provided weaker guarantees.
What LLMs change
LLMs trained on mathematical text and proof assistant code can perform several tasks that directly address the bottlenecks above.
The most immediately useful capability is tactic suggestion. Given a proof state, an LLM can suggest the next tactic step with reasonable accuracy, analogous to code completion but for proofs. The user guides the overall strategy while the mechanical steps of applying lemmas and handling routine cases are automated.
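A toy example of that interaction, using only core Lean 4: at each step the predictor sees the current goal and proposes the next tactic, here an induction followed by a rewrite chain.

```lean
-- Each tactic below is the kind of single step a predictor would suggest
-- from the visible proof state.
example (xs : List Nat) : xs ++ [] = xs := by
  induction xs with
  | nil => rfl                                -- goal: [] ++ [] = []
  | cons x xs ih => rw [List.cons_append, ih] -- peel the head, apply the IH
```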
Beyond single steps, LLMs generate multi-step proof sketches. Given a theorem statement and context, a model can produce a draft proof that is sometimes complete and correct, sometimes structurally sound but with gaps. A complete proof saves all the manual work. A proof sketch with gaps gives the user a structure to fill in rather than a blank page to start from.
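A sketch of what such a draft can look like, in core Lean 4 with a hypothetical theorem name: the rewrite structure is sound, and a sorry marks the remaining gap, which here closes with Nat.add_comm but in real developments is where the human effort concentrates.

```lean
-- A model-produced skeleton: the rewrites are right, one gap remains.
theorem length_append_comm (xs ys : List Nat) :
    (xs ++ ys).length = (ys ++ xs).length := by
  rw [List.length_append, List.length_append]
  sorry -- remaining goal: xs.length + ys.length = ys.length + xs.length
```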
Translation from natural language to formal specification is another high-value application. An engineer who can describe a property in English ("this function never returns a negative balance", "this protocol always terminates within three message rounds") can get a draft formal statement that they then review and refine. This lowers the specification barrier considerably.
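A sketch of that translation in Lean 4, with hypothetical names: the English claim "this function never returns a negative balance" becomes a theorem the engineer can review.

```lean
-- Hypothetical account operation: refuse withdrawals that exceed the balance.
def withdraw (balance amount : Int) : Int :=
  if amount ≤ balance then balance - amount else balance

-- Draft formalisation of "never returns a negative balance".
theorem withdraw_nonneg (balance amount : Int) (h : 0 ≤ balance) :
    0 ≤ withdraw balance amount := by
  unfold withdraw
  split <;> omega -- one linear-arithmetic goal per branch
```

Note what review would catch here: the draft proves non-negativity but says nothing about, for instance, rejecting a negative amount. The formal statement is a starting point, not the final word.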
What LLMs do not change is the fundamental requirement for soundness. A proof assistant checks every proof step against its kernel, and that kernel is trusted because it is small and well-audited. LLMs generate candidate proofs that the assistant then checks. If the proof checks, it is as trustworthy as any other machine-verified proof, regardless of how it was generated. If it does not check, it is rejected. This architecture (LLM as proof generator, proof assistant as checker) preserves the soundness guarantee while leveraging the LLM's ability to handle tedious mechanical work.
Vericoding: code and proof together
The term "vericoding" refers to a development workflow where the programmer writes code and its formal specification simultaneously, with LLM assistance for both. The code and the proof co-evolve rather than the proof being an afterthought.
In practice this means working in a language or framework that supports verified programming. Lean 4 is both a programming language and a proof assistant. F* supports verified programming with extraction to low-level code. Dafny integrates pre/post conditions and loop invariants directly into the code syntax. In each case, the verification conditions are part of the source file, not a separate artefact.
A vericoding workflow proceeds as follows. The developer writes a function signature with its specification, writes the body, invokes the verifier, and uses LLM assistance to discharge the resulting proof obligations. If the LLM cannot complete a proof, the developer intervenes or revises the specification. Verification becomes incremental: you verify each function as you write it, the way you might run tests after each function. The LLM reduces friction enough that this fits into a normal development rhythm.
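A minimal example of that rhythm in Lean 4, with hypothetical names: the specification is carried by the return type, so each branch of the body raises a proof obligation, and the two by omega steps are exactly the obligations an LLM assistant would be asked to discharge.

```lean
-- The subtype { y : Int // 0 ≤ y } makes the spec part of the signature:
-- the function cannot even be defined without a proof at every branch.
def clampNonneg (x : Int) : { y : Int // 0 ≤ y } :=
  if h : x < 0 then
    ⟨0, by omega⟩ -- obligation: 0 ≤ 0
  else
    ⟨x, by omega⟩ -- obligation: 0 ≤ x, discharged using h : ¬x < 0
```

When the body changes, the obligations change with it, and the same assist-then-check loop repeats.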
This is aspirational for most production codebases today. But for new code in domains where correctness matters (financial calculations, access control logic, serialisation and deserialisation, protocol implementations), vericoding is becoming practical. The key insight is that you do not need to verify an entire system. You verify the critical core and use conventional testing for the rest. This is the same strategy that the model checking community has long advocated: apply the strongest techniques where the risk is highest.
Current tools and research
The tool landscape for LLM-assisted verification is evolving rapidly. Several efforts deserve attention.
- AlphaProof: DeepMind's system demonstrated that LLMs combined with reinforcement learning can solve competition-level mathematical problems by generating Lean 4 proofs. While aimed at mathematics rather than software, the techniques transfer: the same proof-generation capabilities apply to verifying software properties expressed in Lean.
- LLM-based tactic predictors: Several research groups have trained models specifically for tactic prediction in Coq and Lean. These models are smaller and cheaper than general-purpose LLMs, and they can be run locally, which matters for organisations that cannot send proprietary code to external APIs. Proverbot9001 and Tactician for Coq are examples.
- Copilot for proof assistants: IDE integrations that provide inline proof suggestions are emerging for Lean 4 and Coq. These bring the familiar code-completion experience to proof development, making it accessible to developers who are not proof engineering specialists.
- Verified-by-construction code generation: Research into generating code that comes with a proof of correctness is advancing. The workflow is: provide a specification, let the LLM generate both code and proof, check the proof mechanically. If the proof checks, the code is guaranteed correct with respect to the specification. Early results show this working for small functions and algorithms (see the sketch after this list).
- Dafny and spec-first workflows: Dafny's approach, writing specifications as part of the code, aligns naturally with LLM assistance. The specifications serve as prompts, and the LLM generates both the implementation and the verification annotations. Amazon's use of Dafny for verified components in AWS has demonstrated the industrial viability of this approach.
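The verified-by-construction loop from the list above, in miniature and with hypothetical names: the specification (length is preserved) is fixed in advance, and both the definition and the proof are the kind of artefact a model would return for mechanical checking.

```lean
-- Candidate implementation a model might generate for the given spec.
def doubleAll (l : List Nat) : List Nat := l.map (2 * ·)

-- Candidate proof: if this checks, the code meets the spec, full stop.
theorem doubleAll_length (l : List Nat) : (doubleAll l).length = l.length := by
  simp [doubleAll]
```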
Limitations and trust
The enthusiasm around LLM-assisted verification must be tempered by clear-eyed recognition of what can go wrong.
LLMs hallucinate proofs. A model might generate a proof that looks plausible but contains a step that does not type-check. The proof assistant catches this, so soundness is not compromised, but it wastes time. In the worst case, the developer trusts the output without running the checker.
A subtler risk is vacuous proofs. An LLM might "prove" a property by weakening it until it is trivially true, or by introducing a contradictory assumption. The proof assistant accepts such proofs because they are technically valid. Detecting vacuity requires the developer to review assumptions and conclusions.
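A minimal example of vacuity in core Lean 4, with hypothetical names: the hypothesis below is unsatisfiable, so the theorem constrains badSort not at all, yet the kernel accepts the proof as valid.

```lean
-- badSort is plainly wrong, but the "spec" never applies to any real input.
def badSort (_l : List Nat) : List Nat := []

-- No list has negative length, so this holds vacuously for any definition.
theorem badSort_spec (l : List Nat) (h : l.length < 0) :
    badSort l = l :=
  absurd h (Nat.not_lt_zero _)
```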
Specification quality remains a human responsibility. If the natural language description is incomplete, the formal spec will be too. The classic problem of writing the right specification is not solved by automation.
Trust in the tool chain is another consideration. The proof assistant's kernel is small and auditable, but the LLM and translation layers around it are large and opaque. If a bug in those layers causes the theorem that is actually checked to differ from the theorem the developer intended, the machine-checked guarantee attaches to the wrong statement and is void in practice.
What this means for practice
For engineering teams, the practical implications are incremental rather than transformative.
First, formal verification is becoming viable for a broader set of code. If your team writes financial logic, protocol parsers, access control rules, or serialisation code, verified implementations are now within reach for new code where you can choose the language and tooling.
Second, the specification habit matters more than the verification tool. Even if you never run a formal verifier, writing precise specifications, invariants, and contracts makes your code better. LLMs make it easier to formalise those specifications. Start with specifications. Verification will follow.
Third, treat LLM-generated proofs as drafts. Always run the checker. Always review the specification. Always look for vacuity.
Fourth, the economics are shifting. The cost of verified implementations is dropping, and the cost of correctness failures is rising. The crossover point, where verification is cheaper than the expected cost of bugs, is moving down the criticality spectrum. Code that was not worth verifying five years ago may be worth verifying today. LLMs are the catalyst making this shift happen faster than most practitioners expected.
Related notes
- Model checking primer: State exploration and property checking as a verification foundation.
- Specs, invariants, and contracts: The specification layer that LLM-driven verification builds on.
- MCP, A2A and ACP protocols: Emerging protocols that could benefit from automated formal verification.