Conan-embedding-v3: Fusing Modality-Specific Models for Omni-Modal Embedding
2026-06-08 • Multimedia
Multimedia • Artificial Intelligence • Machine Learning
AI summary
The authors developed Conan-embedding-v3, a method for searching across text, images, videos, documents, and audio with a single model. They first train a separate specialist model for each data type and then merge the specialists into one backbone. The merge causes a sharp drop in audio retrieval, which the authors call Projector Drift: the audio projector stays calibrated to the audio-specialist backbone rather than the merged one. To fix this, they fine-tune the audio projector while keeping the merged backbone frozen, then run balanced training across all data types. The resulting unified model scores 74.9 on MMEB and 55.61 on the 30-task MAEB audio suite.
omni-modal retrieval • embedding space • modality specialists • Decoupled Specialist Fusion • projector • Projector Drift • Projector Recovery • fine-tuning • multi-modal rehearsal • retrieval backbone
Authors
Shiyu Li, Zhiyuan Hu, Yifan Wang, Peiming Li, Zheng Wei, Yang Tang
Abstract
Omni-modal retrieval promises a single embedding space for text, image, video, document, and audio inputs, but building such a unified retriever is difficult because these modalities differ in data distribution, architecture, and optimization dynamics. In this work, we present Conan-embedding-v3, a decouple-fuse-recover framework for omni-modal retrieval. Conan-embedding-v3 first trains modality specialists independently and then fuses their task vectors into a single dense backbone, a strategy we call Decoupled Specialist Fusion. We show that this fusion composes visual, video, and document retrieval capabilities, but also exposes a failure mode for projector-based modalities: when audio is attached through an external encoder and projector, fusing the backbone leaves the projector calibrated to the audio-specialist backbone, causing a large audio retrieval regression even though all audio-specific modules are copied unchanged. We call this failure Projector Drift. To repair it, Conan-embedding-v3 applies Projector Recovery (i.e., full-parameter fine-tuning of the projector while keeping the backbone frozen), followed by balanced multi-modal rehearsal. The resulting model supports all of these retrieval pathways in one backbone, achieving a score of 74.9 on MMEB and 55.61 on the 30-task MAEB audio suite.
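The fusion and recovery steps described in the abstract can be illustrated with a small toy sketch. It is not the authors' implementation: the abstract does not state the exact merge rule, so the code assumes plain task arithmetic (base weights plus a scaled sum of specialist-minus-base differences), and it models the backbone and the projector as single scalar gains purely to show why a copied projector can drift and how updating the projector alone can repair it. All names (`task_vector`, `fuse`, `embed`, `recover_projector`) and all numbers are hypothetical.

```python
from typing import Dict, List, Tuple

Weights = Dict[str, List[float]]


def task_vector(base: Weights, specialist: Weights) -> Weights:
    """Per-parameter difference: specialist weights minus shared base weights."""
    return {
        name: [s - b for s, b in zip(specialist[name], base[name])]
        for name in base
    }


def fuse(base: Weights, specialists: List[Weights], scale: float = 1.0) -> Weights:
    """Assumed merge rule: base + scale * sum of all task vectors."""
    fused = {name: list(values) for name, values in base.items()}
    for spec in specialists:
        tau = task_vector(base, spec)
        for name in fused:
            fused[name] = [f + scale * t for f, t in zip(fused[name], tau[name])]
    return fused


def embed(backbone_gain: float, projector_gain: float, feature: float) -> float:
    """Toy audio pathway: encoder feature -> projector -> backbone."""
    return backbone_gain * projector_gain * feature


def recover_projector(
    backbone_gain: float,
    projector_gain: float,
    pairs: List[Tuple[float, float]],
    lr: float = 0.1,
    steps: int = 200,
) -> float:
    """Gradient descent on mean squared error; only the projector is updated."""
    for _ in range(steps):
        grad = sum(
            2.0 * (embed(backbone_gain, projector_gain, x) - y) * backbone_gain * x
            for x, y in pairs
        ) / len(pairs)
        projector_gain -= lr * grad  # the backbone gain stays frozen
    return projector_gain


# Step 1 and 2: independent specialists, then task-vector fusion (toy weights).
base = {"layer.w": [1.0, 1.0]}
image_specialist = {"layer.w": [1.5, 1.0]}
audio_specialist = {"layer.w": [1.0, 0.5]}
fused = fuse(base, [image_specialist, audio_specialist])

# Projector Drift in the scalar toy: the projector was fit to a backbone gain
# of 2.0, so copying it unchanged onto a fused gain of 1.0 halves the output.
specialist_gain, fused_gain, copied_projector = 2.0, 1.0, 0.5
before_fusion = embed(specialist_gain, copied_projector, 1.0)
after_fusion = embed(fused_gain, copied_projector, 1.0)

# Step 3: Projector Recovery, refitting the projector against the frozen
# fused backbone on (feature, target embedding) pairs.
pairs = [(1.0, 1.0), (2.0, 2.0)]
recovered_projector = recover_projector(fused_gain, copied_projector, pairs)
after_recovery = embed(fused_gain, recovered_projector, 1.0)
```

In this toy, the fused weights keep both the image-specialist and the audio-specialist changes, yet the audio output shrinks after fusion because the projector still assumes the old backbone. Refitting only the projector restores the output without touching the fused weights, which is the intuition behind keeping the backbone frozen during recovery. The balanced multi-modal rehearsal stage is not modeled here.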