VeriEvol: Scaling Multimodal Mathematical Reasoning via Verifiable Evol-Instruct

2026-06-22 • Artificial Intelligence

Artificial Intelligence • Computation and Language • Computer Vision and Pattern Recognition • Machine Learning

AI summary

The authors propose VeriEvol, a method to improve learning systems that solve visual math problems by focusing not just on making questions harder but also on ensuring answers are trustworthy. They separate the process into two parts: creating tougher questions from simpler ones and verifying answers with careful checks to avoid mistakes. This approach creates larger and more reliable data sets that improve performance on visual math benchmarks over a baseline trained without evolved or verified data. They also provide all their materials so others can continue improving and checking their work.

reinforcement learning, visual mathematical reasoning, data verification, prompt difficulty, hypothesis testing, evolutionary algorithms, offline verification, GRPO, self-supervised learning, data scaling

Authors

Haoling Li, Kai Zheng, Jie Wu, Can Xu, Qingfeng Sun, Han Hu, Yujiu Yang

Abstract

Scaling reinforcement learning for visual mathematical reasoning requires more than generating harder questions: as data volume grows, the reward labels themselves must remain reliable. Yet existing data pipelines scale supervision while trusting the labeller, and policy-side methods assume the underlying answers are already correct. We instead treat scaling as a verifiable data-construction problem and decouple two axes before any policy update: prompt difficulty, expanded by route-specific evolution operators, and answer reliability, enforced by offline hypothesis-test falsification. We instantiate this as VeriEvol, an iterative framework with two extensible components: a type-aware evolution module that rewrites low-difficulty image-question seeds into harder, image-grounded prompts; and HTV-Agent, a verifier that accepts an answer only after multi-source counter-evidence has failed to refute it. The resulting verified data scales in volume, extends by adding evolution routes or verifier channels, and plugs directly into existing GRPO-style RL recipes. On a five-benchmark visual-math suite, scaling evolved SFT data from 10K to 250K samples raises the mean accuracy from 35.42 to 54.73; then, with backbone, SFT initialization, and GRPO recipe held fixed, VeriEvol adds a cumulative +3.88 over an un-evolved RL baseline, of which +1.82 comes from evolved prompts and +2.06 from the HTV-Agent verifier. We release the prompts, data, models, code, and the full verifier trace of every sample, so that downstream work can scale and audit the pipeline rather than only inspect its outputs.
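The acceptance rule described for HTV-Agent (keep an answer only if every source of counter-evidence fails to refute it) can be illustrated with a minimal sketch. This is not the authors' implementation: the names `Sample`, `verify`, `recompute_channel`, and `numeric_channel` are hypothetical, and the toy channels check simple integer addition rather than image-grounded math problems.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Sequence, Tuple


@dataclass(frozen=True)
class Sample:
    """Hypothetical stand-in for one question/answer pair."""
    question: str
    answer: str


# A channel tries to refute the labelled answer. It returns a short
# description of the counter-evidence, or None if it found none.
Channel = Callable[[Sample], Optional[str]]
Trace = List[Tuple[str, Optional[str]]]


def verify(sample: Sample, channels: Sequence[Tuple[str, Channel]]) -> Tuple[bool, Trace]:
    """Accept the answer only if no channel refutes it.

    Every channel is run, even after a refutation, so that the trace
    records the outcome of each check.
    """
    if not channels:
        # With no channels, nothing could ever be refuted, so refuse to decide.
        raise ValueError("at least one verifier channel is required")
    trace: Trace = [(name, channel(sample)) for name, channel in channels]
    accepted = all(refutation is None for _, refutation in trace)
    return accepted, trace


def numeric_channel(sample: Sample) -> Optional[str]:
    """Toy channel: refute answers that are not integers."""
    try:
        int(sample.answer)
    except ValueError:
        return "answer is not an integer"
    return None


def recompute_channel(sample: Sample) -> Optional[str]:
    """Toy channel: recompute 'a + b' and compare with the label."""
    left, right = sample.question.split("+")
    expected = int(left) + int(right)
    if sample.answer.strip() != str(expected):
        return f"recomputed {expected}, label says {sample.answer}"
    return None


CHANNELS = [("numeric", numeric_channel), ("recompute", recompute_channel)]

good, good_trace = verify(Sample("3 + 4", "7"), CHANNELS)
bad, bad_trace = verify(Sample("3 + 4", "8"), CHANNELS)
```

In this sketch, adding a verifier channel amounts to appending another `(name, function)` pair to `CHANNELS`, which loosely mirrors the extensibility the abstract attributes to the verifier. The returned trace keeps the result of every channel so that a rejected or accepted sample can be audited afterwards.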
