Route Before Retrieve: Activating Latent Routing Abilities of LLMs for RAG vs. Long-Context Selection

2026-05-11 · Computation and Language
AI summary

The authors address the problem of efficiently choosing between two ways for large language models to handle long texts: retrieval-augmented generation (RAG) or using the full long context (LC). They propose a new method called Pre-Route that uses simple metadata to decide proactively which approach to use before generating an answer, making the process both explainable and cost-effective. Their experiments show that language models can be guided to make these routing decisions well, that this reasoning can be transferred to smaller models, and that their method outperforms existing alternatives.

large language models, context window, retrieval-augmented generation, long-context reasoning, routing framework, structured prompts, distillation, cost-effectiveness, metadata, multi-source reasoning
Authors
Yiwen Chen, Kuan Li, Fuzhen Zhuang, Deqing Wang, Zhao Zhang, Liwen Zhang, Yong Jiang, Shuai Wang, Minhao Cheng
Abstract
Recent advances in large language models (LLMs) have expanded the context window to beyond 128K tokens, enabling long-document understanding and multi-source reasoning. A key challenge, however, lies in choosing between retrieval-augmented generation (RAG) and long-context (LC) strategies: RAG is efficient but constrained by retrieval quality, while LC supports global reasoning at higher cost and with position sensitivity. Existing methods such as Self-Route adopt failure-driven fallback from RAG to LC, but remain passive, inefficient, and hard to interpret. We propose Pre-Route, a proactive routing framework that performs structured reasoning before answering. Using lightweight metadata (e.g., document type, length, initial snippet), Pre-Route enables task analysis, coverage estimation, and information-need prediction, producing explainable and cost-efficient routing decisions. Our study shows three key findings: (i) LLMs possess latent routing ability that can be reliably elicited with guidelines, allowing single-sample performance to approach that of multi-sample (Best-of-N) results; (ii) linear probes reveal that structured prompts sharpen the separability of the "optimal routing dimension" in representation space; and (iii) distillation transfers this reasoning structure to smaller models for lightweight deployment. Experiments on LaRA (in-domain) and LongBench-v2 (OOD) confirm that Pre-Route outperforms Always-RAG, Always-LC, and Self-Route baselines, achieving superior overall cost-effectiveness.
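The routing decision the abstract describes can be illustrated with a toy sketch. The class and heuristics below are assumptions for illustration only, not the paper's implementation: in Pre-Route the analogous reasoning (task analysis, coverage estimation, information-need prediction) is performed by an LLM over structured prompts, whereas here a few hand-written rules over the same kind of lightweight metadata stand in for it.

```python
from dataclasses import dataclass

@dataclass
class DocMetadata:
    # Lightweight metadata of the kind the abstract mentions:
    # document type, length, and an initial snippet.
    doc_type: str
    length_tokens: int
    snippet: str

def pre_route(question: str, meta: DocMetadata) -> str:
    """Toy proactive router: decide 'RAG' vs. 'LC' before answering.

    Hand-written rules stand in for the LLM's structured reasoning;
    all cue words and thresholds are illustrative assumptions.
    """
    # Questions that aggregate over the whole document tend to need
    # global reasoning, which retrieval may not cover.
    global_cues = ("overall", "summarize", "compare", "throughout", "how many")
    needs_global = any(cue in question.lower() for cue in global_cues)

    # A short document fits in the context window cheaply: just use LC.
    if meta.length_tokens <= 8_000:
        return "LC"
    # Long document but a localized information need: retrieval suffices.
    if not needs_global:
        return "RAG"
    # Long document and global reasoning: pay the long-context cost.
    return "LC"

print(pre_route("When was the contract signed?",
                DocMetadata("report", 120_000, "...")))   # localized need
print(pre_route("Summarize the themes throughout the book",
                DocMetadata("novel", 300_000, "...")))    # global need
```

The point of the sketch is the ordering: the route is chosen from metadata and the question alone, before any generation, in contrast to failure-driven schemes like Self-Route that first attempt RAG and fall back to LC.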