Improving Long-Context Retrieval with Multi-Prefix Embedding

2026-06-22 | Information Retrieval

AI summary

The authors address efficient retrieval over long documents. They propose Multi-Prefix Embedding (MPE), which splits a document into chunks separated by EOS tokens, encodes the whole sequence in a single causal forward pass so that each chunk keeps the context of the text before it, and takes one embedding at each chunk boundary. This sits between single-vector methods, which lose fine-grained detail, and token-level multi-vector methods, which are costly to store. On MLDR-en, BrowseComp-Plus, and LongEmbed, MPE is competitive with or outperforms single-vector, independent-chunk, and multi-vector baselines. Because queries are matched against chunk-level embeddings, MPE also indicates which chunks of a document are most relevant to a query.

Long-context retrieval, Single-vector embeddings, Multi-vector embeddings, Multi-Prefix Embedding (MPE), Chunk-level MaxSim matching, EOS token, Causal forward pass, Document relevance labels, Source attribution
Authors
Zhenglin Yu, Xueguang Ma, Shengyao Zhuang, Zhichao Xu, Luyu Gao, Crystina Zhang, Jimmy Lin
Abstract
Long-context retrieval exposes a tension: single-vector embeddings lose fine-grained detail, while token-level multi-vector methods incur prohibitive storage. We propose Multi-Prefix Embedding (MPE), which partitions a document into chunks separated by EOS tokens, encodes the full sequence in a single causal forward pass, and extracts one embedding at each prefix boundary. MPE retains cross-chunk context, enables chunk-level MaxSim matching, and trains with only document-level relevance labels. Experiments on MLDR-en, BrowseComp-Plus, and LongEmbed show that MPE is competitive with or outperforms single-vector, independent-chunk, and multi-vector baselines, while providing a natural source attribution mechanism for locating evidence chunks.
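
The pipeline described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. The EOS id, the function names, and `toy_causal_encoder` (a cumulative mean over random token vectors, used only because each position depends solely on earlier tokens) are all hypothetical stand-ins for a real causal language model and tokenizer. The sketch shows three steps: inserting an EOS token after each chunk, reading one embedding at each EOS position from a single pass over the full sequence, and scoring a query by the maximum similarity over chunk embeddings, with the arg-max chunk serving as the attributed source.

```python
import numpy as np

EOS_ID = 0  # hypothetical EOS token id; a real tokenizer defines its own


def build_sequence(chunks, eos_id=EOS_ID):
    """Concatenate tokenized chunks, appending EOS after each one.

    Returns the full id sequence and the position of every EOS token,
    i.e. the prefix boundaries.
    """
    ids, boundaries = [], []
    for chunk in chunks:
        ids.extend(chunk)
        ids.append(eos_id)
        boundaries.append(len(ids) - 1)
    return ids, boundaries


def toy_causal_encoder(ids, dim=8, vocab=16, seed=0):
    """Stand-in for a causal LM (assumption, not the paper's model).

    Position t is the mean of the token vectors at positions 0..t,
    so it never sees later tokens.
    """
    rng = np.random.default_rng(seed)
    table = rng.normal(size=(vocab, dim))
    tok = table[np.asarray(ids)]
    return np.cumsum(tok, axis=0) / np.arange(1, len(ids) + 1)[:, None]


def prefix_embeddings(hidden_states, boundaries):
    """Take the hidden state at each EOS position and L2-normalize it."""
    embs = hidden_states[boundaries]
    return embs / np.linalg.norm(embs, axis=1, keepdims=True)


def maxsim(query_emb, chunk_embs):
    """Chunk-level MaxSim: document score is the best chunk similarity.

    Also returns the index of the best chunk, usable as source attribution.
    """
    sims = chunk_embs @ query_emb
    best = int(np.argmax(sims))
    return float(sims[best]), best


# Three toy chunks of token ids, encoded in one pass over the full sequence.
chunks = [[1, 2], [3], [4, 5, 6]]
ids, boundaries = build_sequence(chunks)
hidden = toy_causal_encoder(ids)
doc_embs = prefix_embeddings(hidden, boundaries)

# For illustration, use the second chunk's embedding as the query.
score, best_chunk = maxsim(doc_embs[1], doc_embs)
```

Under these assumptions the document is stored as one vector per chunk rather than one per token or one per document, and `best_chunk` identifies the prefix whose embedding best matches the query.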