Fine-grained Fragment Retrieval in Multi-modal Long-form Dialogues
2026-06-03 • Computation and Language
Computation and Language, Computer Vision and Pattern Recognition
AI summary
The authors address the problem of finding meaningful parts of long conversations that mix text and images, rather than just single messages. They introduce Fine-grained Fragment Retrieval (FFR), which finds connected segments relevant to a topic either within one dialogue or across many dialogues. To do this, they create specialized models: F2RVLM for retrieval inside a single dialogue and FFRS, a two-step system for searching large dialogue collections efficiently. They also build new datasets to test these methods and show that their approaches work better than previous ones.
multi-modal dialogue, fragment retrieval, reinforcement learning, curriculum sampling, embedding model, dialogue corpus, semantic coherence, vector database, multi-objective rewards, information retrieval
Authors
Hanbo Bi, Zhiqiang Yuan, Chongyang Li, Qiwei Yan, Zexi Jia, Jiapei Zhang, Xiaoyue Duan, Yingchao Feng, Jinchao Zhang, Jie Zhou
Abstract
With the widespread adoption of multi-modal communication platforms, long-form dialogues interleaving text and images have become increasingly common. Users often need to retrieve coherent dialogue fragments related to specific topics, rather than isolated utterances. We propose Fine-grained Fragment Retrieval (FFR), which locates semantically relevant multi-utterance, multi-image fragments in multi-modal long-form dialogues. We explore two settings: (1) FFR within Single-Dialogue, retrieving fragments from a given dialogue; and (2) FFR within Dialogue Corpus, retrieving from a large-scale corpus for open-domain scenarios. For (1), we introduce F2RVLM, a generation-based retrieval model trained with reinforcement learning, using multi-objective rewards and difficulty-aware curriculum sampling to enhance fragment coherence. For (2), we develop FFRS, a two-stage system combining offline fragment-level indexing with online retrieval. Specifically, each dialogue is decomposed into minimal semantic fragments encoded by a Fragment Embedding Model (FEM) into a vector database; at inference, FEM rapidly recalls Top-K candidates, and F2RVLM performs fine-grained reasoning to identify the most relevant sub-content. To support FFR, we construct MLDR, the longest multi-modal dialogue retrieval dataset to date, and a WeChat-based real-world test set. Experiments on both benchmarks demonstrate that F2RVLM and FFRS consistently achieve superior performance across single-dialogue and corpus-level FFR.
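The two-stage flow described for FFRS (offline fragment-level indexing, then online Top-K recall followed by fine-grained selection) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the bag-of-words `embed` function is a toy stand-in for FEM, the in-memory list stands in for a vector database, fixed-size chunking stands in for decomposition into minimal semantic fragments, the term-overlap `refine` function stands in for F2RVLM's reasoning, and images appear only as text placeholders. None of the names below come from the paper.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def embed(text):
    """Toy stand-in for FEM (assumption): a sparse bag-of-words vector."""
    return Counter(tokenize(text))


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def build_index(dialogues, fragment_size=2):
    """Offline stage: chunk each dialogue into fragments and embed each one.

    Fixed-size chunking is a simplification; the paper describes
    decomposition into minimal semantic fragments.
    """
    index = []
    for dialogue_id, turns in dialogues.items():
        for start in range(0, len(turns), fragment_size):
            fragment = turns[start:start + fragment_size]
            index.append({
                "dialogue_id": dialogue_id,
                "start": start,
                "turns": fragment,
                "vector": embed(" ".join(fragment)),
            })
    return index


def recall_top_k(index, query, k=2):
    """Online stage 1: recall the k fragments most similar to the query."""
    query_vector = embed(query)
    ranked = sorted(index, key=lambda e: cosine(query_vector, e["vector"]),
                    reverse=True)
    return ranked[:k]


def refine(candidates, query):
    """Online stage 2 (hypothetical stand-in for F2RVLM): within each
    candidate, keep only the turns that share a term with the query."""
    query_terms = set(tokenize(query))
    results = []
    for entry in candidates:
        kept = [t for t in entry["turns"] if query_terms & set(tokenize(t))]
        if kept:
            results.append((entry["dialogue_id"], kept))
    return results


# Hypothetical toy corpus; "[image: ...]" is a text placeholder for an image.
dialogues = {
    "d1": [
        "Shall we hike the coastal trail on Saturday?",
        "Yes, here is a photo of the trail map. [image: trail_map.jpg]",
        "Did you finish the quarterly budget report?",
        "Not yet, the spreadsheet is half done.",
    ],
    "d2": [
        "I tried a new ramen place downtown.",
        "Looks great! [image: ramen.jpg]",
    ],
}

index = build_index(dialogues)
candidates = recall_top_k(index, "hike trail photo", k=2)
results = refine(candidates, "hike trail photo")
for dialogue_id, turns in results:
    print(dialogue_id, turns)
```

The split mirrors the design rationale in the abstract: the cheap embedding comparison narrows a large corpus to a few candidates, so the more expensive fine-grained step only runs on the Top-K fragments.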