OmniVerifier-M1: Multimodal Meta-Verifier with Explicit Structured Recalibration

2026-05-27 • Computation and Language

Computation and Language • Artificial Intelligence • Computer Vision and Pattern Recognition • Machine Learning

AI summary

The authors study how to check more reliably whether the visual outputs of large multimodal models are correct. They find that symbolic rationales, such as bounding boxes on an image, work better than textual explanations, because they can be scored with simple rules instead of a separate judge model. They also show that training the verifier with separate reinforcement learning objectives for the correct/incorrect judgment and for the rationale works better than optimizing a single joint reward. Using these ideas, they build OmniVerifier-M1, a verifier that localizes errors in images at a fine-grained level, and M1-TTS, a generation system that uses the verifier to correct individual image regions.

multimodal models, meta-verification, reinforcement learning, symbolic outputs, bounding boxes, rule-based rewards, foundation models, error localization, agentic generation, self-correction

Authors

Xinchen Zhang, Bowei Liu, Jiale Liu, Chufan Shi, Yizhen Zhang, Junhong Liu, Youliang Zhang, Zhiheng Li, Yujiu Yang, Ling Yang

Abstract

Visual outcomes are increasingly central to multimodal large language models, making reliable and fine-grained verification essential for scaling generalist foundation models. In this work, we investigate multimodal meta-verification, which leverages verifier-generated rationales rather than decision-only signals, and explore how to effectively incorporate meta-verification feedback into multimodal verifier training. We identify two key findings. First, symbolic verifier outputs (e.g., bounding boxes) outperform textual explanations as meta-verification rationales, enabling efficient rule-based reinforcement learning rewards while avoiding reliance on model-based rewards from auxiliary judge models. Second, decoupling reinforcement learning objectives for binary judgment and meta-verification substantially outperforms joint reward optimization, due to intrinsic differences in output structure and learning dynamics. Based on these insights, we train OmniVerifier-M1, a generalist visual verifier leveraging symbolic meta-verification and decoupled reinforcement learning. OmniVerifier-M1 provides robust verification and fine-grained error localization, and further enables M1-TTS, a verifier-driven agentic generation system achieving dynamic region-level self-correction. This approach paves the way for more reliable, interpretable, and fine-grained multimodal verification, supporting safer and more controllable foundation model deployment.
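
To make the idea of a rule-based reward over symbolic outputs concrete, here is a minimal illustrative sketch. It is not the authors' implementation: the abstract does not specify the reward formula, so the intersection-over-union (IoU) criterion, the 0.5 threshold, and the function names `iou`, `judgment_reward`, and `localization_reward` are all assumptions made for illustration. The sketch only shows why a bounding box can be scored by a fixed rule with no auxiliary judge model, and how the binary judgment and the localization rationale can be given separate reward signals.

```python
from typing import Tuple

# Hypothetical box format (an assumption): (x1, y1, x2, y2) in pixel coordinates.
Box = Tuple[float, float, float, float]


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = max(0.0, a[2] - a[0]) * max(0.0, a[3] - a[1])
    area_b = max(0.0, b[2] - b[0]) * max(0.0, b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def judgment_reward(pred_correct: bool, gold_correct: bool) -> float:
    """Reward for the binary judgment alone (assumed 0/1 exact match)."""
    return 1.0 if pred_correct == gold_correct else 0.0


def localization_reward(pred: Box, gold: Box, threshold: float = 0.5) -> float:
    """Reward for the symbolic rationale alone.

    Assumption: the predicted error region counts as correct when its IoU
    with the annotated region reaches `threshold`.
    """
    return 1.0 if iou(pred, gold) >= threshold else 0.0


if __name__ == "__main__":
    gold_box = (10.0, 10.0, 50.0, 50.0)
    pred_box = (12.0, 12.0, 50.0, 52.0)
    # The two signals are computed independently, so each could drive its
    # own training objective rather than being summed into one joint reward.
    print(judgment_reward(True, True))
    print(localization_reward(pred_box, gold_box))
```

Both functions are deterministic and cheap to evaluate, which is the practical appeal of symbolic rationales; a textual explanation would instead need a learned or model-based scorer.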
