Separate First, Fuse Later: Mitigating Cross-Modal Interference in Audio-Visual LLMs Reasoning with Modality-Specific Chain-of-Thought

2026-05-11

Artificial Intelligence · Sound
AI summary

The authors study how combining sound and vision sometimes confuses AI models, causing wrong answers because input from one modality misleads the interpretation of the other. To fix this, they propose a method called Separate First, Fuse Later (SFFL), which reasons over audio and visual information separately before merging the evidence. They also train the model to recognize which modality is more helpful for each question. Their approach improves accuracy and makes the model less prone to mistakes when mixing information from sound and images.

audio-visual question answering, cross-modal interference, large language models, chain-of-thought reasoning, reinforcement learning, modality-specific reasoning, evidence fusion, hallucination in AI, modality preference
Authors
Xuanchen Li, Yuheng Lu, Chenrui Cui, Tianrui Wang, Zikang Huang, Yu Jiang, Long Zhou, Longbiao Wang, Jianwu Dang
Abstract
Audio and vision provide complementary evidence for audio-visual question answering, yet current audio-visual large language models may suffer from cross-modal interference: information from one modality misguides the interpretation of another, thereby inducing hallucinations. We attribute this issue to uncontrolled cross-modal interactions during intermediate reasoning. To mitigate this, we propose Separate First, Fuse Later (SFFL), an audio-visual reasoning framework designed to reduce cross-modal interference. SFFL enforces modality-specific chain-of-thought reasoning, producing separate audio and visual reasoning traces and then integrating their evidence to answer. We construct modality-preference labels via a data pipeline under different modality input settings. We use these labels as an auxiliary reward in reinforcement learning to encourage an instance-dependent preference for modality cues when answering. We further introduce a modality-specific reasoning mechanism that preserves modality isolation during the separated reasoning stage while enabling full access to cross-modal information at the evidence fusion stage. Experiments demonstrate consistent improvements in both accuracy and robustness, yielding an average relative gain of 5.16% on general AVQA benchmarks and 11.17% on a cross-modal hallucination benchmark.
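To make the two-stage mechanism concrete, the sketch below shows one plausible way to realize it as an attention mask: during the separated reasoning stage each modality-specific chain-of-thought attends only to the question and its own modality's tokens, while the fusion stage regains full access to everything. This is not the authors' implementation; the segment names (`question`, `audio_cot`, `fusion`, etc.), their ordering, and the `build_sffl_mask` helper are assumptions for illustration only.

```python
import torch

def build_sffl_mask(seg_lens):
    """Build a boolean attention mask for one SFFL-style sequence.

    seg_lens: dict mapping each segment name to its token count, laid out
    in the order question -> audio -> video -> audio_cot -> video_cot -> fusion.
    Returns a (total, total) bool tensor where True means "may attend".
    """
    order = ["question", "audio", "video", "audio_cot", "video_cot", "fusion"]
    starts, pos = {}, 0
    for name in order:
        starts[name] = pos
        pos += seg_lens[name]
    total = pos

    # Which segments each segment is allowed to attend to.
    visible = {
        "question":  ["question"],
        "audio":     ["question", "audio"],
        "video":     ["question", "video"],
        # Separated reasoning stage: each chain-of-thought is isolated
        # to its own modality, so neither trace can misguide the other.
        "audio_cot": ["question", "audio", "audio_cot"],
        "video_cot": ["question", "video", "video_cot"],
        # Evidence fusion stage: full access to all cross-modal context.
        "fusion":    order,
    }

    mask = torch.zeros(total, total, dtype=torch.bool)
    for src, allowed in visible.items():
        s0, s1 = starts[src], starts[src] + seg_lens[src]
        for tgt in allowed:
            t0, t1 = starts[tgt], starts[tgt] + seg_lens[tgt]
            mask[s0:s1, t0:t1] = True

    # Intersect with a causal mask so generation stays autoregressive.
    causal = torch.tril(torch.ones(total, total, dtype=torch.bool))
    return mask & causal

# Example (hypothetical segment sizes):
# mask = build_sffl_mask({"question": 12, "audio": 64, "video": 256,
#                         "audio_cot": 48, "video_cot": 48, "fusion": 32})
```

On the reinforcement-learning side, the abstract only states that the modality-preference labels serve as an auxiliary reward; one natural (assumed) form would be a shaped reward r = r_answer + λ · r_pref, where r_pref scores agreement between the model's expressed modality reliance and the pipeline-derived label, and λ is a hypothetical weighting hyperparameter.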