Look Again Before You Abstain: Budgeted Conformal Evidence Acquisition for Reliable Vision-Language Model

2026-06-15

Computer Vision and Pattern Recognition
AI summary

The authors study how large vision-language models (LVLMs) often make visual claims that the image does not support, which is called hallucination. To avoid errors, a model can abstain from answering when unsure, but this leaves many claims unanswered. They propose Budgeted Conformal Evidence Acquisition (BCEA), which lets the model answer, abstain, or gather extra visual evidence (for example by zooming in or cropping), all within a limited compute budget. They find that recalibrating on scores obtained after acquiring extra evidence keeps the error rate at the target level and increases how often the model answers. The method is evaluated on the POPE benchmark and COCO-constructed claims with four open VLMs.

Large Vision-Language Models, Hallucination, Selective Prediction, Conformal Prediction, Calibration, Abstention, Visual Evidence Acquisition, POPE Benchmark, COCO Dataset, Spatial-Relation Claims
Authors
Jian Xu, Delu Zeng, John Paisley, Qibin Zhao
Abstract
Large vision-language models (LVLMs) hallucinate: they assert visual details that the image does not support. A principled remedy is selective prediction with a distribution-free guarantee: verify each claim and abstain when the claim is not grounded, so that the hallucination rate among asserted claims is provably bounded. We show, however, that this guarantee is bought at a brutal price: to keep the hallucination rate below $5\%$ on a balanced object-existence benchmark, a state-of-the-art conformal filter must abstain on more than $80\%$ of claims. We argue that abstention is wasteful when more visual evidence is cheaply available, and introduce Budgeted Conformal Evidence Acquisition (BCEA), which replaces the binary answer/abstain decision with a three-way choice: answer, abstain, or acquire additional visual evidence by re-examining the image (zooming, cropping, or applying a claim-specific intervention) under a bounded compute budget. We make two observations. First, acquisition that is plugged naively into a calibrated filter breaks the statistical guarantee -- realized risk overshoots the target by up to $17$ points -- because the acquisition step destroys the exchangeability that conformal calibration relies on. Second, folding the entire acquisition policy into the score function and re-calibrating on post-acquisition scores \emph{restores} the finite-sample guarantee while still recovering coverage. BCEA further uses structured, claim-type-specific interventions. Across the POPE benchmark and COCO-constructed existence and spatial-relation claims, on four open VLMs, BCEA controls the hallucination rate at the target level and consistently improves coverage over a guaranteed-abstention baseline.
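The second observation, calibrating on the scores produced after the whole acquisition policy has run, can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not specify the verification score, the acquisition rule, or the exact risk being controlled. The names `conformal_threshold`, `policy_score`, `decide`, the "uncertain band" trigger, and the choice to replace the score with the re-examined one are all hypothetical. The threshold here is a standard class-conditional split-conformal quantile, which bounds the probability of answering on a false claim; that is a simpler guarantee than the bound on the hallucination rate among asserted claims described above.

```python
import math
from typing import Callable, Sequence, Tuple


def conformal_threshold(false_claim_scores: Sequence[float], alpha: float) -> float:
    """Split-conformal threshold from calibration claims known to be false.

    If a false test claim is exchangeable with the calibration claims,
    the probability that its score exceeds the returned value is <= alpha.
    """
    ordered = sorted(false_claim_scores)
    n = len(ordered)
    rank = math.ceil((n + 1) * (1 - alpha))  # conformal quantile rank
    if rank > n:
        return math.inf  # too few calibration points: never answer
    return ordered[rank - 1]


def policy_score(
    first_look: float,
    look_again: Callable[[], float],
    band: Tuple[float, float],
    budget: int,
) -> float:
    """Final score after running the full (hypothetical) acquisition policy.

    While the score sits inside the uncertain band and budget remains,
    re-examine the image (e.g. a zoomed crop) and take the new score.
    """
    low, high = band
    score = first_look
    for _ in range(budget):
        if not (low <= score <= high):
            break  # confident enough either way: stop acquiring
        score = look_again()
    return score


def decide(final_score: float, threshold: float) -> str:
    """Answer only when the post-acquisition score clears the threshold."""
    return "answer" if final_score > threshold else "abstain"


# Calibration: every calibration claim is passed through the SAME policy,
# and the threshold is computed from those post-acquisition scores.
calibration_final_scores = [0.7, 0.1, 0.3, 0.2, 0.5, 0.4, 0.6]  # made-up values
threshold = conformal_threshold(calibration_final_scores, alpha=0.25)

# Test time: run the policy, then apply the calibrated threshold.
final = policy_score(0.5, look_again=lambda: 0.9, band=(0.4, 0.6), budget=2)
print(decide(final, threshold))
```

The design point is that calibration and test claims go through an identical pipeline, so their final scores remain exchangeable. Reusing a threshold calibrated on first-look scores after acquisition has changed the test scores is the naive plug-in that the abstract reports as breaking the guarantee.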