CFPO: Counterfactual Policy Optimization for Multimodal Reasoning

2026-06-22, Computer Vision and Pattern Recognition

Computer Vision and Pattern Recognition, Computation and Language
AI summary

The authors observe that large vision-language models often fail by ignoring the image and leaning on language priors when answering questions. To address this, they propose CounterFactual Policy Optimization (CFPO), which pushes the model to ground its reasoning in the image by contrasting its predictions on the original input with its predictions when critical visual cues are suppressed. The method plugs into standard RL algorithms such as GRPO and DAPO without external reward models or additional supervision. Reported experiments show gains of 3.17%-6.25% over standard RL baselines and 1.32%-2.13% over the perception-aware method PAPO.

Large Vision-Language Models, Reinforcement Learning, Counterfactual Reasoning, Causal Consistency, Policy Optimization, Cross-modal Learning, Chain-of-thought Reasoning, Visual Grounding, GRPO, DAPO
Authors
Zhangyuan Yu, Wanran Sun, Guangjing Yang, Xiaohu Wu, Qicheng Lao
Abstract
Large Vision-Language Models (LVLMs) have demonstrated remarkable capabilities in multimodal reasoning. However, prevailing reinforcement learning (RL) paradigms lack explicit counterfactual enhancement and causal learning mechanisms. This fundamental deficiency results in severe grounding failures, manifesting as a tendency to ignore visual evidence in favor of language priors or to exhibit hallucination drift during long chain-of-thought reasoning. To address this root cause, we propose CounterFactual Policy Optimization (CFPO), a novel framework that enforces causal consistency between visual perception and textual reasoning. CFPO introduces a cross-modal counterfactual enhancement mechanism, which regularizes the policy by maximizing the discrepancy between the model's predictions and those from a counterfactual state where critical visual cues are suppressed. This approach seamlessly integrates with standard algorithms like GRPO and DAPO without requiring external reward models or additional supervision. Extensive experiments demonstrate that CFPO significantly improves reasoning fidelity, achieving consistent gains of 3.17%-6.25% over standard RL baselines and 1.32%-2.13% over the state-of-the-art perception-aware method (PAPO). Code is available at https://github.com/Raven-July/CFPO.
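The regularization idea described in the abstract can be illustrated with a small sketch. The abstract does not give the exact objective, so everything below is an assumption made for illustration: the discrepancy is measured as a per-token KL divergence, the weight `beta` and the function name `counterfactual_regularized_loss` are hypothetical, and the logits are toy values rather than outputs of a real LVLM. The authors' actual formulation may differ; see their repository for the real implementation.

```python
import math


def softmax(logits):
    """Convert a list of logits into a probability distribution."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def counterfactual_regularized_loss(policy_loss, logits_full, logits_cf, beta=0.1):
    """Subtract a discrepancy bonus from a base RL loss (illustrative only).

    policy_loss: scalar loss from a base algorithm such as GRPO or DAPO.
    logits_full: per-token logits computed with the original image.
    logits_cf:   per-token logits with critical visual cues suppressed.
    beta:        hypothetical weight on the discrepancy term.
    """
    if not logits_full or len(logits_full) != len(logits_cf):
        raise ValueError("logit sequences must be non-empty and equal length")
    per_token = [
        kl_divergence(softmax(a), softmax(b))
        for a, b in zip(logits_full, logits_cf)
    ]
    discrepancy = sum(per_token) / len(per_token)
    # Minimizing this loss maximizes the discrepancy (assumed KL form).
    return policy_loss - beta * discrepancy, discrepancy


# Toy example: the first token depends on the image, the second does not.
logits_full = [[2.0, 0.5, -1.0], [0.1, 3.0, 0.2]]
logits_cf = [[0.3, 0.4, 0.2], [0.1, 3.0, 0.2]]
loss, disc = counterfactual_regularized_loss(1.0, logits_full, logits_cf)
```

In this toy setup, a policy whose predictions do not change when the visual cues are suppressed gets a discrepancy of zero and therefore no reduction in loss, which is the behavior the regularizer is meant to discourage.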