Clearer Sight, Fewer Lies: Oriented Pickup Preference Optimization for Multimodal Hallucination Mitigation

2026-06-29 • Computer Vision and Pattern Recognition

AI summary

The authors observe that multimodal large language models often hallucinate because they fall back on language priors instead of trusting the visual evidence. They find that when the query-relevant image regions the model already attends to are explicitly strengthened, its answers become more accurate. Building on this, they propose OPPO, a training objective that teaches the model to prefer the same faithful answer more strongly as the visual evidence supporting it gets stronger. The method adds span-level and token-level regularization to stabilize training and comes with a theoretical analysis; experiments on hallucination and general-purpose benchmarks show gains over baseline methods.

Multimodal Large Language Models, Hallucination, Visual Evidence, Attention Mechanism, Alignment Objective, Training Regularization, Visual Sensitivity, Fine-tuning, Model Calibration

Authors

Xin Zou, Haolin Deng, Yibo Yan, Shuliang Liu, Zhiwei Jin, Chen Chen, Haonan Lu, Xuming Hu

Abstract

Multimodal Large Language Models (MLLMs) are prone to hallucination because their generation preferences are insufficiently calibrated to visual evidence, causing them to fall back on linguistic priors rather than faithful grounding. In this work, we start from an empirical observation: when query-relevant visual evidence is explicitly strengthened using the model's own attention, generation becomes more accurate. This suggests that many failures arise not solely from missing perception, but from an insufficient tendency to trust the evidence the model has already attended to. Motivated by this finding, we propose Oriented Pickup Preference Optimization (OPPO), an evidence-aware alignment objective that learns preferences over the strength of visual evidence rather than only over response quality. Concretely, OPPO contrasts the same faithful response under stronger-evidence, anchored, and weaker-evidence views, turning naive visual preference into ordered visual-evidence alignment. We further combine this objective with fine-grained span-level and token-level regularization to stabilize training. In addition, we provide a theoretical analysis showing that ordered evidence margins induce a positive lower bound on local visual sensitivity. Extensive evaluations across hallucination and general-purpose benchmarks demonstrate that OPPO consistently outperforms baseline methods.
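
To make the idea of ordered visual-evidence alignment more concrete, the sketch below shows one way such an objective could look. It is not the OPPO loss from the paper, whose exact form the abstract does not give. It assumes a DPO-style implicit reward (policy log-probability minus reference log-probability, scaled by a hypothetical `beta`) and enforces the ordering stronger > anchored > weaker with two adjacent pairwise terms. The function and argument names are illustrative only, and the span-level and token-level regularizers are omitted.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def ordered_evidence_loss(
    logp_strong: float,
    logp_anchor: float,
    logp_weak: float,
    ref_strong: float,
    ref_anchor: float,
    ref_weak: float,
    beta: float = 0.1,  # hypothetical temperature, as in DPO-style losses
) -> float:
    """Illustrative ordered-margin loss for ONE faithful response.

    Each logp_* is the policy's log-probability of the same response under
    a stronger-, anchored-, or weaker-evidence view of the image; each ref_*
    is the matching log-probability under a frozen reference model.
    """
    # Assumed DPO-style implicit rewards for each evidence view.
    r_strong = beta * (logp_strong - ref_strong)
    r_anchor = beta * (logp_anchor - ref_anchor)
    r_weak = beta * (logp_weak - ref_weak)

    # Two adjacent pairwise terms push r_strong > r_anchor > r_weak.
    return -(log_sigmoid(r_strong - r_anchor) + log_sigmoid(r_anchor - r_weak))


# The response is more likely under stronger evidence: ordering respected.
ordered = ordered_evidence_loss(-10.0, -12.0, -15.0, -12.0, -12.0, -12.0)
# The response is more likely under weaker evidence: ordering violated.
reversed_order = ordered_evidence_loss(-15.0, -12.0, -10.0, -12.0, -12.0, -12.0)
print(ordered < reversed_order)
```

Under these assumptions the loss is smaller when the response's likelihood rises with evidence strength and larger when the ordering is violated, which is the qualitative behavior the abstract describes. In a real training setup the scalar log-probabilities would be replaced by batched sequence log-probabilities from the model.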
