Counterfactual Reasoning for Fine-Grained Evidence Disentanglement in VideoQA

2026-06-08 • Computer Vision and Pattern Recognition

Computer Vision and Pattern Recognition, Machine Learning

AI summary

The authors explain that video question-answering systems sometimes make mistakes because they rely on misleading clues rather than the visual evidence that actually determines the answer. To fix this, they created a method called CREDiT that separates the real visual reasons from unrelated distractions in videos. Their approach models cause and effect with a structural causal model and tests 'what if' scenarios at the feature level to focus on the important details. Tests show that this method helps the system give more accurate and reliable answers in both general videos and complex sports videos.

VideoQA, multimodal models, causal reasoning, counterfactual reasoning, feature-level intervention, structural causal model, evidence disentanglement, cross-modality representation, causal inference, confounders

Authors

Zhou Du, Hamid Krim, Xiao Wu, Zhaoquan Yuan, Liangwei Li, Keisuke Fujii

Abstract

Recent advances in video multimodal models have significantly improved VideoQA performance. However, these systems often rely on spurious statistical correlations rather than answer-relevant causal evidence, resulting in unfaithful and brittle reasoning, especially in complex real-world scenarios. Existing methods rely on cross-modality correlations, on costly curated training resources, or on insufficient causal assumptions and constraints, and they typically operate at the time-interval level. As a result, they fail to explicitly disentangle causal visual cues from confounders and provide only limited fine-grained evidence localization. To address this issue, we propose a Counterfactual Reasoning framework for fine-grained Evidence Disentanglement (CREDiT). CREDiT formulates the VideoQA process using a structural causal model and learns cross-modality representations that are explicitly decomposed into causal and non-causal components under independence and minimality constraints. To facilitate faithful disentanglement, we introduce feature-level causal interventions and construct counterfactual inputs that approximate causal effects while suppressing non-causal correlations. Extensive experiments on NExT-GQA, SportsQA, and SPORTU-video demonstrate that CREDiT consistently improves answer accuracy and reasoning reliability across both generic and complex sports scenarios, leading to more trustworthy VideoQA systems.
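
The abstract does not give implementation details, so the following is only a toy sketch of the general idea of splitting a feature vector into causal and non-causal parts and building a counterfactual input by feature-level intervention. It is not the CREDiT implementation. Everything in it is an assumption made for illustration: the mask-based split, the function names (`split_features`, `intervene`, `minimality_penalty`, `independence_penalty`), the L1 form of the minimality term, and the use of squared cross-covariance as a stand-in for an independence constraint (zero cross-covariance only implies decorrelation, which is weaker than independence).

```python
# Toy sketch only: all names and formulas here are hypothetical,
# not taken from the CREDiT paper.

def split_features(z, mask):
    """Split feature vector z into causal and non-causal parts using a mask in [0, 1]."""
    causal = [m * v for m, v in zip(mask, z)]
    non_causal = [(1.0 - m) * v for m, v in zip(mask, z)]
    return causal, non_causal

def intervene(z, z_other, mask):
    """Feature-level intervention: keep z's causal part and swap in the
    non-causal part of another sample, giving a counterfactual input."""
    causal, _ = split_features(z, mask)
    _, non_causal_other = split_features(z_other, mask)
    return [c + n for c, n in zip(causal, non_causal_other)]

def minimality_penalty(mask):
    """Assumed minimality term: mean absolute mask value (smaller = fewer causal dims)."""
    return sum(abs(m) for m in mask) / len(mask)

def independence_penalty(causal_batch, non_causal_batch):
    """Assumed independence proxy: mean squared cross-covariance between every
    causal dimension and every non-causal dimension over the batch."""
    n = len(causal_batch)
    d = len(causal_batch[0])
    mean_c = [sum(row[i] for row in causal_batch) / n for i in range(d)]
    mean_s = [sum(row[j] for row in non_causal_batch) / n for j in range(d)]
    total = 0.0
    for i in range(d):
        for j in range(d):
            cov = sum(
                (causal_batch[k][i] - mean_c[i]) * (non_causal_batch[k][j] - mean_s[j])
                for k in range(n)
            ) / n
            total += cov * cov
    return total / (d * d)

# Usage: the first two dimensions are treated as causal, the last two as non-causal.
z = [1.0, 2.0, 3.0, 4.0]
z_other = [10.0, 20.0, 30.0, 40.0]
mask = [1.0, 1.0, 0.0, 0.0]
z_cf = intervene(z, z_other, mask)  # causal dims from z, non-causal dims from z_other
```

In a setup like this, a model whose answer stays the same for `z` and `z_cf` is relying on the causal part only, which is the behavior the counterfactual inputs described above are meant to encourage.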
