AI summary
The authors studied how Chain-of-Thought (CoT) prompting, which asks a model to explain its reasoning step by step, behaves in multimodal large language models that process both images and text. They found that CoT often makes these models perform worse than direct prompting, because the models either commit to an answer too early or make limited use of the visual tokens while generating the rationale. Fine-tuning on CoT data mitigates these problems only partially, and it often makes models rely more on textual priors and less on the image. To address this, the authors propose Attentive-CoT (Att-CoT), an attention-guided fine-tuning objective that encourages models to delay answer commitment and keep attending to visual tokens throughout the rationale. Across three visual reasoning benchmarks and six models, Att-CoT improved CoT performance over standard fine-tuning without any architectural changes.
Keywords
Chain-of-Thought prompting, Multimodal Large Language Models, Visual reasoning, Fine-tuning, Attention mechanism, Step-wise reasoning, Visual-token access, Supervised Fine-Tuning (SFT), Premature answer commitment, Counterfactual dependence
Authors
Sanchit Sinha, Guangzhi Xiong, Bohan Liu, Zhenghao He, Aidong Zhang
Abstract
The effectiveness of Chain-of-Thought (CoT) prompting in Multimodal Large Language Models (MLLMs) remains uncertain: across several visual reasoning benchmarks, CoT prompting often degrades performance compared to direct prompting. In this paper, we provide a systematic analysis of CoT behavior in three modern MLLM families across model scales on datasets requiring step-wise visual evidence. Our analysis identifies two recurring failure modes: premature answer commitment and limited direct visual-token access during rationale generation. We further find that standard CoT-style Supervised Fine-Tuning (CoT-SFT) can mitigate these issues only partially, while often increasing reliance on textual priors and reducing counterfactual visual dependence. Motivated by these findings, we propose Attentive-CoT (Att-CoT), an attention-guided fine-tuning objective that encourages CoT trajectories to delay answer commitment while maintaining sustained visual-token access. Att-CoT can be plugged into any CoT-SFT training run without architectural changes. Experiments on three visual reasoning benchmarks across six MLLMs show that Att-CoT enhances CoT performance over standard fine-tuning.
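The abstract does not specify the form of the Att-CoT objective, so the following is only an illustrative sketch of what "sustained visual-token access" could mean in practice, not the authors' method. It assumes that, for each generated rationale token, we have one row of attention weights over the context and a list of positions holding visual tokens. The function names, the hinge-style penalty, the `target` share, and the weight `lam` are all hypothetical choices made for this example.

```python
from typing import List, Sequence


def visual_attention_mass(
    attn_rows: Sequence[Sequence[float]],
    visual_idx: Sequence[int],
) -> List[float]:
    """Share of each generated token's attention that lands on visual tokens.

    attn_rows[t][j] is the attention weight from generated token t to
    context position j; visual_idx lists the positions of visual tokens.
    """
    visual = set(visual_idx)
    masses = []
    for row in attn_rows:
        total = sum(row)
        if total <= 0:
            raise ValueError("each attention row must have positive total mass")
        on_visual = sum(w for j, w in enumerate(row) if j in visual)
        masses.append(on_visual / total)
    return masses


def visual_access_penalty(masses: Sequence[float], target: float = 0.3) -> float:
    """Hypothetical hinge penalty: mean shortfall of visual mass below `target`."""
    return sum(max(0.0, target - m) for m in masses) / len(masses)


def regularized_loss(
    sft_loss: float,
    masses: Sequence[float],
    lam: float = 0.1,
    target: float = 0.3,
) -> float:
    """Assumed combination: standard CoT-SFT loss plus a weighted penalty."""
    return sft_loss + lam * visual_access_penalty(masses, target)


# Toy example: 3 rationale tokens attending over 4 context positions,
# where positions 0 and 1 are (by assumption) the visual tokens.
rows = [
    [0.40, 0.20, 0.30, 0.10],  # early token: mostly looks at the image
    [0.10, 0.10, 0.50, 0.30],  # later token: drifts toward text
    [0.05, 0.05, 0.60, 0.30],  # final token: almost ignores the image
]
masses = visual_attention_mass(rows, visual_idx=[0, 1])
penalty = visual_access_penalty(masses, target=0.3)
loss = regularized_loss(sft_loss=2.0, masses=masses, lam=0.5, target=0.3)
```

In this toy trajectory the visual share falls from roughly 0.6 to roughly 0.1 as the rationale proceeds, which is the kind of drift the second failure mode describes; the penalty is nonzero only for the tokens that fall below the chosen target. Because such a term is added to the loss rather than to the model, a regularizer of this kind would be compatible with the abstract's claim that no architectural changes are needed.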