Reasoning without Gold Standards: A Proxy-Judge Theory of Autoformalization
2026-06-08 • Computation and Language
Computation and Language
AI summaryⓘ
The authors address the challenge of evaluating complex reasoning tasks where exact answers are hard to verify, using autoformalization as an example. They propose a new method that judges outputs based on multiple detailed properties rather than a single exact match, allowing the system to identify and fix specific mistakes iteratively. This approach improves performance on several reasoning benchmarks and offers a way to refine outputs even when exact answers are unavailable. Their method also provides theoretical insights into how the refinement process converges over time despite some noise in judgments.
autoformalizationcomplex reasoningproxy judgmentformal verificationiterative refinementbenchmark evaluationintrinsic gapinformation retrievalnoise tolerancestructured proxies
Authors
Lei Xu, Xin Quan, André Freitas
Abstract
Complex reasoning tasks increasingly require systems to produce outputs whose correctness cannot be judged by exact match against a single reference. Autoformalization (AF) is a representative example; it asks a model to translate informal mathematical or logical reasoning into a formally checkable object, yet expert-validated formalizations do not scale beyond toy cases and a single informal argument can admit many valid formal renderings. Progress therefore depends on whether partial, structured proxies can substitute for exact references. We introduce a reference-free proxy-judge framework for AF that replaces gold-standard matching with a vector of per-axis property checks. The framework organizes the proxy along three structural scopes that cover global properties of the elicited object, per-module properties internal to its sub-components, and cross-domain properties that re-align it to the informal source, and aggregates each axis into a verdict vector. The vector drives a reflective refinement loop in which a violated coordinate routes the controller to a matching repair target, so each iteration changes only what is judged wrong. Under bounded judge noise, the expected intrinsic gap contracts geometrically to a noise-dependent plateau. Across seven formalization backbones on miniF2F, ProofNet, e-SNLI, and ProntoQA, refinement consistently lifts Pass Rate over the single-shot ICL baseline, and the per-axis proxy outperforms a matched scalar proxy on benchmarks where the baseline has room to improve. Structured proxy judgments therefore provide both a practical refinement signal and a theoretical handle on convergence when exact references are unavailable.