“Testing gives you confidence. Proof gives you a guarantee.”
Problem. Code can be tested, but testing never exhausts all inputs—every passing test suite still leaves room for subtle logical errors and specification violations. Formal verification closes this gap by constructing machine-checkable proofs that an implementation satisfies its specification for all possible inputs. We study this problem in Lean 4, a modern interactive theorem prover, where the goal is to automatically generate such proofs for realistic programs.
Method. We propose a hierarchical proof search framework built around two stages: recursive lemma decomposition, which iteratively breaks a complex verification goal into simpler, independently provable subgoals until each is tractable, and iterative lemma completion, which solves each leaf subgoal through proof search with compiler feedback. A principled decomposition score drives both stages—guiding candidate selection at inference and serving as a dense reward signal during training.
Model & Release. We train a single unified model, Göedel-Code-Prover-8B, for both decomposition and completion via supervised initialization followed by hybrid reinforcement learning. We publicly release the model weights and an evaluation framework to support future research in automated code verification.
Results. Göedel-Code-Prover-8B achieves proof success rates of 68.8%, 54.0%, and 62.3% across three Lean 4 code verification benchmarks, for an overall rate of 62.0%—a 2.6× improvement over the strongest baseline. Among frontier reasoning models, the best performer (GPT-5.3-Codex) reaches only 18.5%; among neural provers, DeepSeek-Prover-V2-671B and Goedel-Prover-V2-32B achieve 21.8% and 21.5% respectively. With only 8B parameters, our model surpasses all of them, including provers 4–84× its size.
Note: Baselines use parallel whole-proof generation (Pass@128), while our framework employs search-based hierarchical inference—a fundamentally different inference paradigm enabled by our trained decomposition and completion pipeline. This comparison reflects end-to-end system-level performance. Even scaling open-source provers to significantly larger budgets shows diminishing returns (see paper for details), suggesting that the gap is not merely a matter of compute allocation.
This example asks the program to solve a simple task: given a list in which one integer appears exactly once and every other integer appears exactly twice, return the unique integer. For instance, given [2, 1, 3, 2, 3] the answer is 1. The verification goal is stronger than testing on a few examples: we want a machine-checkable proof that for every valid input list, the implementation always returns the correct answer.
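To make the verification goal concrete, the task can be rendered in Lean 4 roughly as follows. This is an illustrative sketch only: the names `findUnique` and `findUnique_spec`, the precondition phrasing, and the filter-based implementation are hypothetical, not the benchmark's actual formalization, and the proof body is left as `sorry`.

```lean
-- Hypothetical formalization of the unique-integer task (names illustrative).
-- Implementation: keep the elements that occur exactly once, take the head.
def findUnique (l : List Int) : Int :=
  (l.filter (fun x => l.count x == 1)).headD 0

-- Specification: if u occurs exactly once and every other element of l
-- occurs exactly twice, the implementation returns u for *every* such list.
theorem findUnique_spec (l : List Int) (u : Int)
    (huniq : l.count u = 1)
    (hdup : ∀ x ∈ l, x ≠ u → l.count x = 2) :
    findUnique l = u := by
  sorry  -- a real proof would be produced by the prover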
Recursive Lemma Decomposition. Given a verification goal, the model recursively proposes sets of sub-lemmas together with constructive proof sketches showing that they imply the original goal, continuing until each sub-lemma is tractable. We score each decomposition on two axes: correctness (sub-lemmas must survive QuickCheck random testing—any counterexample discards the candidate) and effectiveness (measured via an operator-footprint metric that rewards decompositions that genuinely simplify the goal). This score doubles as the RL reward during training and the ranking criterion at inference.
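The two-axis score can be sketched as follows. This is a minimal illustration under stated assumptions: `random_test` stands in for QuickCheck-style falsification of a decidable goal, and `footprint` is a crude substring-based stand-in for the paper's operator-footprint metric; all names and the exact scoring formula are hypothetical.

```python
import random

def random_test(sublemma, n_trials=200):
    """Correctness axis: try to falsify a candidate sub-lemma on random
    integer lists. `sublemma` is a Python predicate standing in for a
    decidable Lean goal; any counterexample discards the decomposition."""
    for _ in range(n_trials):
        xs = [random.randint(-50, 50) for _ in range(random.randint(0, 8))]
        if not sublemma(xs):
            return False  # counterexample found
    return True

OPERATORS = {"+", "-", "*", "/", "%", "if", "match", "fold", "map", "filter"}

def footprint(goal_text):
    """Effectiveness axis: count distinct operators appearing in a goal,
    a crude stand-in for the operator-footprint metric."""
    return sum(1 for op in OPERATORS if op in goal_text)

def decomposition_score(goal_text, sublemmas):
    """Score a candidate decomposition: 0 if any sub-lemma is falsified;
    otherwise reward how much the hardest sub-goal shrinks the footprint."""
    if not all(random_test(pred) for _, pred in sublemmas):
        return 0.0
    parent = footprint(goal_text)
    worst_child = max(footprint(text) for text, _ in sublemmas)
    return max(0.0, (parent - worst_child) / max(parent, 1))
```

A decomposition whose sub-goals each mention only one operator scores well against a four-operator parent goal, while any falsified sub-lemma zeroes the candidate out.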
Iterative Lemma Completion. Each surviving sub-lemma is proved independently. The model generates a candidate proof and submits it to the Lean 4 compiler; if it fails, the compiler's error message—type mismatch, unresolved goal, and so on—is fed back to the model for revision. This human-like edit–compile–fix loop continues until the proof passes or a budget is exhausted, and the same loop is replayed during both SFT and RL training.
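The edit–compile–fix loop reduces to a short control structure. In this sketch, `generate`, `compile_lean`, and `revise` are injected stand-ins for the model and the Lean 4 compiler; their interfaces are hypothetical, not the system's actual API.

```python
def repair_loop(goal, generate, compile_lean, revise, budget=8):
    """Iterative lemma completion: propose a proof, compile it, and feed
    the compiler's error message back to the model until the proof passes
    or the attempt budget is exhausted."""
    proof = generate(goal)
    for _ in range(budget):
        ok, error = compile_lean(goal, proof)
        if ok:
            return proof                        # machine-checked: done
        proof = revise(goal, proof, error)      # condition on the error
    return None                                 # budget exhausted
```

The same loop structure is replayed during both SFT and RL training, so the model sees the identical feedback distribution at train and inference time.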
Training Göedel-Code-Prover-8B proceeds in two stages. We first warm up the model with supervised fine-tuning on frontier-model-generated decomposition and completion trajectories, giving it a strong behavioral prior. We then switch to hybrid reinforcement learning: GRPO uses the decomposition score as a dense reward to push the model toward better proof decompositions, while high-quality completion trajectories are replayed as SFT signal to keep proof generation stable. Both gradients update the same shared policy. Sub-lemmas discovered during online rollouts are fed back into the problem set, creating a self-generated curriculum that grows harder as the model improves. During RL, a policy-first strategy is employed for completion: the model first attempts each lemma itself, and a frontier model is invoked only when it fails within a fixed attempt budget, ensuring that the policy learns from its own successful proofs rather than relying solely on external demonstrations.
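The policy-first routing used during RL completion can be sketched as a simple fallback rule. Function names and the return convention here are hypothetical; the point is that only the policy's own successes are tagged for on-policy replay.

```python
def policy_first_completion(lemma, policy_prove, frontier_prove, attempts=4):
    """Policy-first completion: the 8B policy gets a fixed attempt budget
    before a frontier model is consulted. The returned tag lets training
    distinguish on-policy proofs from external demonstrations."""
    for _ in range(attempts):
        proof = policy_prove(lemma)
        if proof is not None:
            return proof, "policy"        # replay as on-policy signal
    proof = frontier_prove(lemma)         # fallback demonstration only
    return proof, "frontier"
```

This keeps frontier-model calls (and their cost) confined to lemmas the policy genuinely cannot solve within budget.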
Göedel-Code-Prover-8B uses a single unified model for both stages at inference time. Starting from the original theorem, it performs iterative recursive decomposition: for each currently unsolved goal, the model generates N parallel decomposition candidates, filters them with QuickCheck random testing, ranks survivors by the decomposition score, and recursively expands the best decomposition until a proof tree of manageable sub-lemmas is obtained or a budget is reached. Each resulting leaf sub-lemma is then proved independently via an iterative edit–compile–fix loop with the Lean 4 compiler, and the resulting proofs are assembled into a final verified proof of the original theorem.
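The inference-time search reduces to a recursive best-candidate expansion. This is a structural sketch under stated assumptions: `decompose`, `score`, `tractable`, and `prove` are hypothetical stand-ins for the unified model, the QuickCheck-filtered decomposition score, the tractability check, and the edit–compile–fix completion loop.

```python
def hierarchical_search(goal, decompose, score, tractable, prove,
                        n=4, depth=3):
    """Generate n decomposition candidates, keep the best-scoring survivor,
    recurse until sub-goals are tractable (or depth is exhausted), then
    prove each leaf independently and return the collected leaf proofs."""
    if tractable(goal) or depth == 0:
        return [prove(goal)]                  # leaf: completion loop
    candidates = [decompose(goal) for _ in range(n)]
    scored = [(score(goal, c), c) for c in candidates if c]
    if not scored:
        return [prove(goal)]                  # no valid split: prove directly
    _, best = max(scored, key=lambda t: t[0])
    proofs = []
    for sub in best:                          # expand best decomposition
        proofs.extend(hierarchical_search(sub, decompose, score,
                                          tractable, prove, n, depth - 1))
    return proofs
```

Assembling the leaf proofs back into a proof of the original theorem uses the constructive sketches recorded at each decomposition step, which this sketch omits.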
We evaluate on three Lean 4 code verification benchmarks: Verina (189 problems), Clever (161 problems), and AlgoVeri (77 problems), totaling 427 verification tasks. Baselines include frontier reasoning models (Claude-Opus-4.6, Gemini-3-Flash, GPT-5.2-Pro, GPT-5.3-Codex, Deepseek-V3.2-Speciale) and neural provers (Kimina-Prover-72B, DeepSeek-Prover-V2-671B, Goedel-Prover-V2-32B, BFS-Prover-V2-32B).
Göedel-Code-Prover-8B achieves proof success rates of 68.8% on Verina, 54.0% on Clever, and 62.3% on AlgoVeri, yielding an overall rate of 62.0% across all 427 tasks, a 2.6× improvement over the strongest baseline. Notably, this is achieved with an 8B model, surpassing neural provers 4–84× its size (32B–671B). In addition, the built-in QuickCheck mechanism disproves 23, 10, and 14 problems on Verina, Clever, and AlgoVeri respectively by producing concrete counterexamples.
Decomposition score vs. proof success rate on Verina. The score reliably separates provable from unprovable instances (AUROC = 0.903).
Statistics of proved problems: mean and standard deviation for lemma count and proof length.
| Statistics | Verina | Clever | AlgoVeri | Total |
|---|---|---|---|---|
| Lemma count (mean) | 17.02 | 12.13 | 8.48 | 14.38 |
| Lemma count (std) | 11.79 | 10.57 | 10.97 | 11.97 |
| Proof length in lines (mean) | 167.11 | 137.78 | 129.87 | 154.28 |
| Proof length in lines (std) | 108.50 | 97.79 | 83.03 | 103.11 |
On average, each verified problem requires 8–17 auxiliary lemmas and over 130 lines of proof code, with the longest exceeding 680 lines. These statistics confirm that successfully verified problems demand deep, multi-step decomposition rather than simple tactic scripts.
By first decomposing verification goals into simpler sub-lemmas, the completion stage enjoys significantly improved inference-time scaling: proof success rate grows steadily with more iterations and larger pass@k across all benchmarks, with no sign of saturation. Decomposition itself also benefits from additional compute: with more iterations, the complexity of the remaining sub-goals steadily decreases, yielding progressively simpler targets for completion. This confirms that effective decomposition is the key enabler: it transforms intractable monolithic proofs into independently solvable pieces, allowing completion to scale efficiently with additional compute.
Website template adapted from Nerfies (CC BY-SA 4.0).