See Only When Needed: Context-Aware Attention Intervention for Mitigating Hallucinations in LVLMs
2026-06-29 • Computer Vision and Pattern Recognition
AI summary
The authors address object hallucination in vision-language models, where the model describes objects that are not present in the image. They propose Context-aware Attention Intervention (CAI), a method that decides both where the model should attend in the image and when to intervene during text generation, acting only on tokens the model is uncertain about. This strengthens the link between generated words and the matching image regions without hurting the natural flow of the generated text. Their tests on several models and benchmarks show that CAI reduces hallucinations more than previous methods, and that it can optionally be combined with contrastive decoding.
Large Vision-Language Models, Object Hallucination, Attention Mechanism, Visual Grounding, Context-aware Attention, Entropy Gating, Inference-time Intervention, Token-specific Relevance, Contrastive Decoding, KL-divergence
Authors
Yuqing Lei, Wenbo Lyu, Yingjun Du, Xiantong Zhen, Cees G. M. Snoek, Ling Shao
Abstract
Large Vision-Language Models (LVLMs) excel at multimodal tasks but remain prone to object hallucinations. Prior training-free remedies often strengthen visual signals uniformly, which may also amplify irrelevant regions and introduce spurious evidence, harming fluency. We propose Context-aware Attention Intervention (CAI), a training-free inference-time mechanism that enforces a "see only when needed" principle via two-axis selectivity: where to look and when to intervene. At each decoding step, CAI derives token-specific visual relevance from early-layer representations to localize semantically aligned regions, and applies a conservative, entropy- and depth-gated attention tilt only for uncertainty-spiking tokens in deeper layers where visual grounding degrades, leaving confident tokens and irrelevant regions largely unchanged. This targeted intervention strengthens visual grounding while preserving linguistic fluency, and it yields consistent improvements even without contrastive decoding, which remains optional as an auxiliary bias-suppression module. Extensive experiments across multiple LVLM backbones and benchmarks show that CAI achieves state-of-the-art hallucination mitigation, and our analysis characterizes CAI as a KL-minimal attention reweighting with bounded interference under inactive gates or small tilts. Code is available at https://github.com/Iris1946/CAI.
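The gating idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract gives no formulas, so the function names, the threshold values, the hard on/off gate, and the exponential form of the tilt are all assumptions made for illustration. The exponential form is used because exponential tilting is a standard way to shift a distribution toward higher expected relevance while staying as close as possible to the original in KL divergence. The relevance scores are taken as given here, whereas the paper derives them from early-layer representations.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of a probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def tilt_attention(attn, relevance, strength):
    """Exponentially tilt attention toward relevant positions.

    Computes q_i proportional to attn_i * exp(strength * relevance_i)
    and renormalizes so the result sums to 1. With strength = 0 the
    attention is returned unchanged.
    """
    weights = [a * math.exp(strength * r) for a, r in zip(attn, relevance)]
    total = sum(weights)
    return [w / total for w in weights]


def gated_tilt(attn, relevance, next_token_probs, layer,
               entropy_threshold=1.0, min_layer=16, strength=0.5):
    """Tilt attention only when both gates are open (hypothetical gates).

    Entropy gate: the next-token distribution is uncertain.
    Depth gate:   the current layer index is at least min_layer.
    Otherwise the attention row is returned as an unchanged copy.
    """
    uncertain = entropy(next_token_probs) > entropy_threshold
    deep = layer >= min_layer
    if not (uncertain and deep):
        return list(attn)
    return tilt_attention(attn, relevance, strength)


# Toy example: four attended positions, only the first is relevant.
attn = [0.25, 0.25, 0.25, 0.25]
relevance = [1.0, 0.0, 0.0, 0.0]
confident = [0.97, 0.01, 0.01, 0.01]   # low entropy, gate stays closed
uncertain = [0.25, 0.25, 0.25, 0.25]   # high entropy, gate can open

unchanged = gated_tilt(attn, relevance, confident, layer=20)
shallow = gated_tilt(attn, relevance, uncertain, layer=4)
tilted = gated_tilt(attn, relevance, uncertain, layer=20)
```

In this toy run, `unchanged` and `shallow` equal the input attention because one of the two gates is closed, while `tilted` moves mass toward the first position and still sums to 1. A real intervention would operate on per-head attention tensors inside the model and would likely use a softer gate, which the abstract's "conservative" wording suggests but does not specify.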