The Faithfulness Gap: Certifying Semantic Equivalence Between Natural-Language and Formal Mathematical Statements

2026-06-15Artificial Intelligence

Artificial IntelligenceMachine Learning
AI summary

The authors address a challenge in autoformalization, which is turning math written in normal language into exact formal statements checked by computers. The problem is not just translating fluently but making sure the formal version truly matches the intended meaning. They propose Bidirectional Provability Fingerprinting (BPF), a method that checks faithfulness by comparing the logical consequences of the formal statement with probes based on the original text. They introduce new tools like counterfactual probes to catch specific errors, a continuous faithfulness score, and ways to guide the autoformalizer to avoid mistakes. Their method shows much better detection of mismatches than previous checks and helps reduce errors in formalization.

autoformalizationformal proof assistantsfaithfulnessBidirectional Provability Fingerprintingcounterfactual probesformal verificationproof assistantsLean4drift detectionnatural language processing
Authors
Noor Islam S. Mohammad, Tamim Sheikh
Abstract
Autoformalization, translating natural-language mathematics into formal proof assistants, is bottlenecked not by translation fluency but by \emph{faithfulness}: a formal statement can typecheck and be provable, yet still encode a different theorem than the source intended. We introduce \emph{Bidirectional Provability Fingerprinting} (\bpf{}), a framework that certifies faithfulness by characterizing each candidate through its forward and backward consequence neighborhoods in the ambient theory and matching these against probes derived from the natural-language statement. We further introduce four novel components: (i) \emph{Counterfactual Probe Generation} (\cpg{}), a contrastive procedure that synthesizes probes targeting specific drift directions; (ii) the \emph{Equivalence Spectrum}, a continuous faithfulness score that replaces brittle binary verdicts; (iii) \emph{Adaptive Probe Budget Allocation} (\apba{}), an information-theoretic budget router; and (iv) \emph{Faithfulness-Guided Decoding} (\fgd{}), which uses \bpf{} signals as a reward during autoformalization. We prove a \emph{drift detection theorem} and a \emph{PAC-faithfulness} result establishing that the equivalence class of a natural language statement is learnable from $\mathcal{O}(\log(1/δ)/\varepsilon)$ probes under mild assumptions. We release \driftbench{}, a benchmark of $2{,}183$ NL/Lean~4 pairs with controlled drift labels across six subfields of mathlib4. \bpf{}\,+\,\cpg{} detects $89.6\%$ of drifted formalizations at a $3.0\%$ false-positive rate-against $41.2\%$ for typecheck and $63.3\%$ for LLM-judge baselines, and \fgd{} reduces the rate at which a state-of-the-art autoformalizer emits drifted statements by $47\%$. https://pmlrbd.github.io/BPF/