Chains That See, Answers That Don't: A Multi-Aspect Evaluation Recipe for Forced Chain-of-Thought on Video-MME
2026-06-22 • Computer Vision and Pattern Recognition
Computer Vision and Pattern Recognition, Machine Learning
AI summary
The authors tested whether forcing a vision-language model to explain its reasoning step by step (forced chain-of-thought, or CoT) helps it answer video questions more accurately. They used three probes: paired accuracy comparisons with and without explanations, a check of whether the reasoning changes when the input video is swapped, and a visual-degradation ladder. The reasoning chains do depend on the video input, but forced CoT did not improve multiple-choice accuracy, and on the smaller 7B model it slightly reduced it. The findings are specific to the model and dataset studied (Qwen2.5-VL on Video-MME subsets), and the authors plan to release the raw responses and a recomputation script so the numbers can be re-derived.
vision-language model, video question answering, chain-of-thought (CoT), multiple-choice accuracy, diagnostic evaluation, video conditioning, Qwen2.5-VL, Video-MME dataset, regex scoring, statistical evaluation
Authors
Zhichao Fan, Yanhang Li, Zexin Zhuang
Abstract
Forced chain-of-thought (CoT) is widely assumed to make vision-language models more reliable on video question answering. We propose a small three-probe evaluation recipe to test that assumption: paired accuracy across direct, CoT, answer-first, and no-video conditions; a counterfactual video-swap diagnostic over the CoT chains; and a four-rung visual-degradation ladder. Each probe is reported under both a strict and a permissive regex scorer, with multiplicity correction over a manuscript-declared primary family. Applied to Qwen2.5-VL on Video-MME subsets, the recipe returns a two-part finding. The CoT chains are strongly video-conditioned: swapping the input video collapses chain overlap and flips most final letters, the opposite of what a "boilerplate-chain" null would predict. Yet on the same data, forced CoT does not improve MCQ accuracy, and on the smaller 7B model it produces a small but statistically supported drop under a post-hoc primary scorer choice. We do not claim this generalizes beyond the Qwen2.5-VL / Video-MME instantiation; the raw responses and a single recomputation script will be released with the supplementary material so every number can be re-derived.
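To make the ingredients of the recipe concrete, the following is a minimal Python sketch of four pieces the abstract names: a strict and a permissive regex scorer, a chain-overlap measure for the video-swap diagnostic, a paired accuracy test, and a multiplicity correction. Everything specific here is an assumption rather than the authors' method: the abstract does not give the regex patterns, the overlap metric, the paired test, or the correction procedure, so the sketch substitutes hypothetical patterns, token-level Jaccard overlap, an exact McNemar test, and Holm's step-down procedure. All function names are illustrative.

```python
import re
from math import comb

# Hypothetical patterns: the paper's actual regexes are not given in the abstract.
# Strict: the response must contain a line of the form "Answer: B" or "Final answer: (B)".
STRICT = re.compile(
    r"^\s*(?:Answer|Final answer)\s*:\s*\(?([A-D])\)?\s*\.?\s*$",
    re.IGNORECASE | re.MULTILINE,
)
# Permissive: any standalone capital letter A-D counts; the last one wins.
PERMISSIVE = re.compile(r"\b([A-D])\b")


def strict_score(response):
    """Return the letter from the last well-formed answer line, or None."""
    matches = STRICT.findall(response)
    return matches[-1].upper() if matches else None


def permissive_score(response):
    """Return the last standalone A-D letter anywhere in the response, or None."""
    matches = PERMISSIVE.findall(response)
    return matches[-1] if matches else None


def jaccard_overlap(chain_a, chain_b):
    """Token-set Jaccard overlap between two chains (an assumed overlap metric)."""
    a, b = set(chain_a.lower().split()), set(chain_b.lower().split())
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def mcnemar_exact(correct_x, correct_y):
    """Two-sided exact McNemar p-value for paired per-item correctness flags."""
    b = sum(1 for x, y in zip(correct_x, correct_y) if x and not y)
    c = sum(1 for x, y in zip(correct_x, correct_y) if y and not x)
    n = b + c
    if n == 0:
        return 1.0  # no discordant pairs, so no evidence of a difference
    tail = sum(comb(n, k) for k in range(min(b, c) + 1)) * 0.5 ** n
    return min(1.0, 2.0 * tail)


def holm_reject(p_values, alpha=0.05):
    """Holm step-down procedure: reject flags in the original order of p_values."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, idx in enumerate(order):
        if p_values[idx] > alpha / (m - rank):
            break  # once one hypothesis survives, all larger p-values survive too
        reject[idx] = True
    return reject
```

A response such as "I think it is B" illustrates why reporting both scorers matters: the permissive scorer extracts B, while the strict scorer returns no answer because there is no well-formed answer line. Under this sketch, a drop in accuracy that appears under only one scorer would be a signal about answer formatting rather than about reasoning.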