AxDafny: Agentic Verified Code Generation in Dafny

2026-06-30Artificial Intelligence

Artificial Intelligence
AI summary

The authors focus on improving how a computer model writes code in Dafny, a language that needs both the code and its proof of correctness. They created AxDafny, a system that fixes and improves code by adding necessary checks until it works properly. They also developed a big set of 250 test problems called LCB-Pro-Dafny to measure how well systems do at this task. Their method outperforms previous approaches in making verified, correct Dafny programs. They also point out that passing verification and running correct tests are different ways to check the code.

Dafnyagentic code generationformal verificationproof artifactsinvariantstermination argumentsbenchmarkverification successcode repairGPT-5.5
Authors
Benjamin Breen, Austin Letson, Borja Requena Pozo, Leopoldo Sarra
Abstract
We study agentic code generation in Dafny, where a model must generate both executable code and the proof artifacts for verification. We present AxDafny, a verifier-guided repair framework that iteratively generates implementations, invariants, assertions, and termination arguments. We also introduce LiveCodeBench-Pro-Dafny (LCB-Pro-Dafny), a benchmark of 250 competition-style programming problems translated into Dafny with formal specifications and a verifier-based evaluation harness. On LCB-Pro-Dafny, AxDafny substantially improves verification success over baseline GPT-5.5 performance. On DafnyBench, AxDafny achieves 92.7\% verification success, outperforming the strongest previously reported proof-hint baseline by 6.5 percentage points. Lastly, we show that verification success and runtime test performance measure different aspects of generated code.