Entropy Is Not Enough: Unlocking Effective Reinforcement Learning for Visual Reasoning via Vision-Anchored Token Selection

2026-06-02 | Artificial Intelligence

AI summary

The authors study whether token-level entropy (a measure of uncertainty), a common signal for credit assignment when training text-only models with reinforcement learning, still works for reasoning about images. They find that it breaks down: some vision-sensitive tokens naturally have low entropy, so an entropy-only criterion ignores them. To fix this, they introduce VEPO, which multiplies a token's visual sensitivity by its entropy to decide which tokens receive gradient credit. In their experiments, VEPO outperforms the entropy-only baseline by 2.28 points at the 7B scale and 3.15 points at the 3B scale.

token entropy, reinforcement learning, visual reasoning, policy optimization, multimodal learning, credit assignment, perceptual grounding, semantic reasoning
Authors
Senjie Jin, Peixin Wang, Boyang Liu, Xiaoran Fan, Shuo Li, Zhiheng Xi, Jiazheng Zhang, Yuhao Zhou, Tao Gui, Qi Zhang, Xuanjing Huang
Abstract
While token-level entropy is commonly recognized as effective for credit assignment in text-only reinforcement learning with verifiable rewards (RLVR), it remains unclear whether this mechanism still holds in visual reasoning. Our controlled study shows that this mechanism collapses in visual reasoning due to the omission of vision-sensitive tokens with naturally low entropy. Although existing multimodal RL methods increasingly acknowledge the importance of visual perception, they struggle to satisfy the inherent demand for interleaving precise perceptual grounding with semantic reasoning, either lacking systematic visual measurements or overlooking that token entropy primarily drives semantic exploration. To address this, we introduce VEPO (Vision-Entropy token-selection for Policy Optimization), an effective RL framework explicitly integrating visual sensitivity with token entropy via a principled multiplicative coupling, where VEPO redirects gradient credit toward tokens which are simultaneously visually grounded and highly informative. Extensive experiments demonstrate VEPO's leading performance, significantly outperforming the entropy-only baseline by 2.28 points at 7B-scale and 3.15 points at 3B-scale. Ablations further substantiate the soundness of our method.
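The multiplicative coupling described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not say how visual sensitivity is measured, so the code assumes a stand-in proxy (the KL divergence between the next-token distribution computed with the image and the one computed without it). The function names, the `keep_ratio` parameter, and the top-k selection rule are likewise hypothetical choices made only for illustration.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(probs):
    """Shannon entropy (in nats) of a probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q), with q clamped away from zero."""
    return sum(pi * math.log(pi / max(qi, eps)) for pi, qi in zip(p, q) if pi > 0.0)

def select_tokens(logits_with_image, logits_without_image, keep_ratio=0.2):
    """Score each token as (visual sensitivity) * (entropy) and keep the top fraction.

    ASSUMPTION: visual sensitivity is approximated by the KL divergence between
    the distribution with the image and the distribution without it. The
    abstract does not specify the actual measurement.
    """
    scores = []
    for with_img, without_img in zip(logits_with_image, logits_without_image):
        p = softmax(with_img)
        q = softmax(without_img)
        scores.append(kl_divergence(p, q) * entropy(p))  # multiplicative coupling
    k = max(1, math.ceil(keep_ratio * len(scores)))
    top = set(sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k])
    mask = [1.0 if i in top else 0.0 for i in range(len(scores))]
    return scores, mask

def masked_pg_loss(logprobs, advantages, mask):
    """Policy-gradient loss averaged over the selected tokens only."""
    kept = sum(mask)
    return -sum(m * a * lp for m, a, lp in zip(mask, advantages, logprobs)) / kept

# Toy example: 4 generated tokens over a 3-word vocabulary.
# Tokens 1 and 2 are unaffected by the image; tokens 0 and 3 change when it is removed.
logits_with = [[2.0, 0.0, 0.0], [0.1, 0.0, 0.0], [5.0, 0.0, 0.0], [1.0, 1.0, 0.0]]
logits_without = [[0.0, 2.0, 0.0], [0.1, 0.0, 0.0], [5.0, 0.0, 0.0], [0.0, 0.0, 1.0]]
scores, mask = select_tokens(logits_with, logits_without, keep_ratio=0.5)
loss = masked_pg_loss([-1.0] * 4, [1.0] * 4, mask)
```

In this toy setup, token 1 has the highest entropy but zero sensitivity under the assumed proxy, so the product scores it at zero and it receives no gradient credit, whereas an entropy-only rule would rank it first.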