From Accuracy to Visual Dependence: Auditing and Filtering Modality Collapse in Traffic VideoQA
2026-06-29 • Computer Vision and Pattern Recognition
AI summary
The authors studied how well Video Question Answering (VideoQA) models actually use video content to answer questions about traffic accidents. They found that some models do just as well or better when they only see text and no video, suggesting that the models rely on shortcuts in the text rather than visual cues. To measure this, they created new ways to check how much the models depend on the video and to identify questions that encourage shortcuts. Their work shows it is important to test whether models truly understand visual information, especially in safety-related tasks.
Video Question Answering, Vision-Language Models, visual grounding, textual shortcuts, benchmark evaluation, MM-AU dataset, Blind Gap, Visual Gain, Shortcut Score
Authors
Sena Korkut, María Alejandra Bravo Sarmiento, Sanghwan Kim, Zeynep Akata
Abstract
High benchmark accuracy does not guarantee genuine use of visual evidence. We study this problem in traffic accident Video Question Answering (VideoQA), where correct answers should depend on scene-specific visual evidence but may instead be inferred from textual shortcuts. Through an audit of four public benchmarks, we find that several recent open-weight Vision-Language Models (VLMs) perform competitively, and sometimes better, without video input. On the MM-AU benchmark, removing video consistently improves accuracy, and adding more frames further degrades performance. To quantify visual dependence, we introduce two dataset-level diagnostics: Blind Gap, measuring above-chance text-only performance, and Visual Gain, measuring the marginal benefit of adding video. We further propose an instance-level Shortcut Score that combines text-only confidence with visual necessity signals, enabling continuous, training-free filtering of shortcut-prone questions. The resulting subsets reduce shortcut bias and improve visual grounding. Our findings reveal large differences in grounding quality across benchmarks and show that visually grounded evaluation, not just high accuracy, is essential in safety-critical VideoQA.
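The two dataset-level diagnostics can be illustrated with a minimal Python sketch. The abstract describes Blind Gap as above-chance text-only performance and Visual Gain as the marginal benefit of adding video, but it does not give formulas, so the simple differences below are an assumed reading rather than the authors' exact definitions. The function names, the uniform-chance baseline for multiple-choice questions, and the toy predictions are all hypothetical. The Shortcut Score is left out because the abstract does not say how its signals are combined.

```python
from typing import Sequence


def accuracy(predictions: Sequence[str], answers: Sequence[str]) -> float:
    """Fraction of predictions that exactly match the reference answers."""
    if len(predictions) != len(answers) or len(answers) == 0:
        raise ValueError("predictions and answers must be non-empty and equal in length")
    correct = sum(p == a for p, a in zip(predictions, answers))
    return correct / len(answers)


def blind_gap(text_only_acc: float, num_options: int) -> float:
    """Assumed form: text-only accuracy minus uniform chance (1 / num_options)."""
    if num_options < 1:
        raise ValueError("num_options must be at least 1")
    return text_only_acc - 1.0 / num_options


def visual_gain(video_acc: float, text_only_acc: float) -> float:
    """Assumed form: accuracy with video minus accuracy without video."""
    return video_acc - text_only_acc


# Toy data (hypothetical): four 4-way multiple-choice questions.
answers = ["A", "B", "C", "D"]
text_only_preds = ["A", "B", "C", "A"]   # model sees the question only
with_video_preds = ["A", "B", "D", "A"]  # model sees question and video

text_only_acc = accuracy(text_only_preds, answers)
video_acc = accuracy(with_video_preds, answers)

gap = blind_gap(text_only_acc, num_options=4)
gain = visual_gain(video_acc, text_only_acc)
print(f"Blind Gap: {gap:.2f}, Visual Gain: {gain:.2f}")
```

Under this reading, a large positive Blind Gap together with a Visual Gain at or below zero matches the pattern the abstract reports for MM-AU, where removing the video improves accuracy.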