Lost at the End: Primacy Bias in Multimodal Retrieval-Augmented Question Answering

2026-06-15 • Computation and Language

Computation and Language • Artificial Intelligence • Computer Vision and Pattern Recognition

AI summary

The authors study how vision-language systems that answer questions from an image plus retrieved text passages depend on where the relevant passage sits in the prompt. Previous research found that pure-text models tend to lose information in the middle of long contexts. In vision-language models, the authors instead find that the relevant passage helps much more when it is placed first among the retrieved passages than when it is placed last. They trace the effect to the first prompt slot of the instruction-tuned reader and show that retrieval-side fixes on a frozen reader do not close the gap, so closing it requires changes on the model's side. They release their controlled testing protocol to help evaluate such changes.

Knowledge-based Visual Question Answering • Vision-language Models • Long-context Language Models • Lost-in-the-middle Effect • Prompt Position Bias • Multimodal Models • Recall@k • Wikipedia-scale Knowledge Base • Instruction-tuned Reader

Authors

Jieyuan Liu, Jianyang Gu, Shijie Chen, Jefferson Chen, Zhen Wang

Abstract

Knowledge-based visual question answering (KB-VQA) lets vision-language systems answer questions that exceed their parametric knowledge by conditioning a reader on passages retrieved from a Wikipedia-scale knowledge base. In pure-text long-context LLMs, retrieved-context use follows the U-shaped "lost-in-the-middle" effect of Liu et al. (2024): information at the start and end of the context is used, while the middle is lost. Whether this transfers to deployed multimodal KB-VQA is open. To close this gap, we design the first controlled probe of reader-side position dependence in multimodal KB-VQA: a gold-position protocol in which only the gold passage's prompt slot varies within each question. We run it on three open-source 7B/8B VLM readers and two KB-VQA benchmarks at k up to 20. The shape flips from U to primacy: gold-at-first beats gold-at-last by 16 to 26 points on every reader-by-benchmark cell, an effect we call "Lost at the End". Three targeted ablations narrow the cause: a text-only control shows the multimodal setting amplifies an already-present text-mode primacy 2.2 to 4.5 times, and image-position and distractor-shuffle ablations together pin the locus to prompt slot 0 of the instruction-tuned reader. On a frozen reader, three retrieval-side fixes (MMR, oracle reranking, rank-based reordering) all leave the gap intact (no separable improvement). Our findings indicate that recall@k is the wrong metric for deployed KB-VQA and that closing the gap requires reader-side intervention; we release our protocol as a controlled instrument for evaluating such interventions.
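The gold-position protocol described in the abstract can be illustrated with a minimal sketch. This is not the authors' released code: the helper names (`place_gold`, `accuracy_by_slot`, `primacy_gap`), the example dictionary layout, and the `is_correct` callback (a stand-in for running a reader on the prompt and scoring its answer) are all assumptions made for illustration. The one property it tries to capture is that, within a question, the distractors and their order stay fixed and only the gold passage's slot changes.

```python
from typing import Callable, Dict, List, Sequence


def place_gold(gold: str, distractors: Sequence[str], slot: int) -> List[str]:
    """Return the passage list with `gold` at index `slot`.

    The distractors keep their relative order, so the gold slot is the
    only thing that differs between two prompts for the same question.
    """
    if not 0 <= slot <= len(distractors):
        raise ValueError("slot must lie in [0, len(distractors)]")
    passages = list(distractors)
    passages.insert(slot, gold)
    return passages


def accuracy_by_slot(
    examples: Sequence[dict],
    is_correct: Callable[[str, List[str]], bool],
    slots: Sequence[int],
) -> Dict[int, float]:
    """Accuracy per gold slot.

    Each example is assumed (hypothetically) to be a dict with keys
    'question', 'gold', and 'distractors'. `is_correct` is a placeholder
    for "run the reader on this question and passage order, then score".
    """
    results: Dict[int, float] = {}
    for slot in slots:
        hits = 0
        for ex in examples:
            passages = place_gold(ex["gold"], ex["distractors"], slot)
            hits += bool(is_correct(ex["question"], passages))
        results[slot] = hits / len(examples)
    return results


def primacy_gap(acc: Dict[int, float]) -> float:
    """Gold-at-first minus gold-at-last accuracy, in percentage points."""
    first, last = min(acc), max(acc)
    return 100.0 * (acc[first] - acc[last])
```

In use, `slots` would typically include 0 and k-1 (plus any intermediate positions of interest), and `is_correct` would wrap the actual VLM reader, the image, and the benchmark's answer-matching rule. A positive `primacy_gap` corresponds to the gold-at-first advantage the abstract reports.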
