Semantic Noise Reduction via Teacher-Guided Dual-Path Audio-Visual Representation Learning
2026-04-09 • Sound
Sound • Computer Vision and Pattern Recognition
AI summary
The authors found that training audio-visual models on two tasks in a single forward pass, matching sounds with images (contrastive alignment) and filling in missing parts (masked reconstruction), creates interference: the matching branch has to work from randomly visible patches that were chosen for reconstruction, not for alignment. To fix this, they created TG-DP, a method that separates the two tasks into distinct paths with their own masking and uses a teacher model to guide which tokens the matching path sees. This helps the model learn better connections between sounds and videos and improves zero-shot retrieval, that is, finding audio from video and vice versa without extra training. On AudioSet, R@1 rises from 35.2% to 37.4% for video-to-audio retrieval and from 27.9% to 37.1% for audio-to-video retrieval, and the method also reports state-of-the-art linear-probe results on AS20K and VGGSound.
contrastive alignment, masked reconstruction, audio-visual representation learning, cross-modal alignment, teacher model, zero-shot retrieval, AudioSet, linear-probe, multimodal pretraining, semantic robustness
Authors
Linge Wang, Yingying Chen, Bingke Zhu, Lu Zhou, Jinqiao Wang
Abstract
Recent advances in audio-visual representation learning have shown the value of combining contrastive alignment with masked reconstruction. However, jointly optimizing these objectives in a single forward pass forces the contrastive branch to rely on randomly visible patches designed for reconstruction rather than cross-modal alignment, introducing semantic noise and optimization interference. We propose TG-DP, a Teacher-Guided Dual-Path framework that decouples reconstruction and alignment into separate optimization paths. By disentangling the masking regimes of the two branches, TG-DP enables the contrastive pathway to use a visibility pattern better suited to cross-modal alignment. A teacher model further provides auxiliary guidance for organizing visible tokens in this branch, helping reduce interference and stabilize cross-modal representation learning. TG-DP achieves state-of-the-art performance in zero-shot retrieval. On AudioSet, it improves R@1 from 35.2% to 37.4% for video-to-audio retrieval and from 27.9% to 37.1% for audio-to-video retrieval. The learned representations also remain semantically robust, achieving state-of-the-art linear-probe performance on AS20K and VGGSound. Taken together, our results suggest that decoupling multimodal objectives and introducing teacher-guided structure into the contrastive pathway provide an effective framework for improving large-scale audio-visual pretraining. Code is available at https://github.com/wanglg20/TG-DP.
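For readers unfamiliar with the R@1 figures quoted above, the following is a minimal sketch of how Recall@K is commonly computed for paired cross-modal retrieval. It is not taken from the TG-DP code: the function names `l2_normalize` and `recall_at_k`, the use of cosine similarity, and the toy embeddings are all assumptions made for illustration, and the paper's exact evaluation protocol may differ.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def recall_at_k(query_embs, gallery_embs, k=1):
    """Fraction of queries whose paired gallery item (same index) is among
    the k most cosine-similar gallery items.

    Assumption for this sketch: query i and gallery item i form a true pair.
    """
    queries = [l2_normalize(q) for q in query_embs]
    gallery = [l2_normalize(g) for g in gallery_embs]
    hits = 0
    for i, q in enumerate(queries):
        # Dot product of unit vectors equals cosine similarity.
        sims = [sum(a * b for a, b in zip(q, g)) for g in gallery]
        ranked = sorted(range(len(gallery)), key=lambda j: sims[j], reverse=True)
        if i in ranked[:k]:
            hits += 1
    return hits / len(queries)


# Toy, hand-made embeddings (hypothetical, not model outputs).
video = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
audio = [[1.0, 0.1], [0.1, 1.0], [1.0, 0.2]]

v2a_r1 = recall_at_k(video, audio, k=1)  # video-to-audio retrieval
a2v_r1 = recall_at_k(audio, video, k=1)  # audio-to-video retrieval
```

Swapping the roles of query and gallery gives the two retrieval directions, which is why video-to-audio and audio-to-video scores can differ even on the same set of pairs.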