Know Before You Fetch: Calibrated Retrieval-Budget Allocation for Retrieval-Augmented Generation

2026-06-29

Information Retrieval, Computation and Language
AI summary

The authors studied a way to make retrieval-augmented generation (RAG) more efficient by deciding how much information to look up before answering a question. Instead of always retrieving a fixed number of passages, they calibrated the model's confidence to choose whether to answer directly, retrieve a small or large context, or not answer at all. They improved how well the model's uncertainty matched real correctness, which helped pick the right amount of information to retrieve. Their method worked well on several question-answering datasets and showed that careful confidence calibration can help balance accuracy, speed, and retrieval cost.

Retrieval-augmented generation, Calibration, Confidence estimation, Question answering, Sequence log-probability, Uncertainty, Latency, Retrieval budget, Adaptive retrieval, TriviaQA
Authors
Zhe Dong, Fang Qin, Manish Shah, Yicheng Wang
Abstract
Retrieval-augmented generation (RAG) typically retrieves a fixed number of passages for every query. This is wasteful when the reader already knows the answer, and it can be harmful when irrelevant or partially relevant passages distract the reader. We formulate adaptive RAG as calibrated retrieval-budget allocation: given a query, decide whether to answer closed-book, retrieve a compact context (k=1), retrieve a full context (k=5), or abstain. The contribution is a probability interface rather than a new raw uncertainty signal. We calibrate sequence log-probability and prefix-logit uncertainty signals into probabilities of correctness, then use these probabilities for graded context selection, selective abstention, and explicit latency/token trade-offs. Across core QA experiments on TriviaQA, Natural Questions, and MS MARCO, with auxiliary PopQA motivation and Qwen/Llama family checks, diagnostic out-of-fold calibration improves probability quality dramatically: for sequence log-probability, ECE drops from 0.275 to 0.062 on TriviaQA, 0.643 to 0.009 on NQ, and 0.711 to 0.031 on MS MARCO. Graded retrieval improves full-context and passage-budget frontiers for both our signal and TARG-style prefix entropy/margin, while retrieval-call AUC remains essentially tied with binary gating because k=1 is still a retrieval call. Held-out train/validation/test threshold experiments report deployable operating points. At matched-accuracy frontier operating points, a measured cost model reveals that gating is not universally faster: it increases latency by about 27% on Qwen3-8B but saves about 8% on Qwen3-32B. These results support a nuanced view of adaptive RAG: calibrated confidence is best understood as a reusable interface for allocating retrieval budget under task and system constraints.
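
To make the two central quantities concrete, here is a minimal Python sketch of (a) expected calibration error (ECE) with equal-width bins and (b) a threshold rule that maps a calibrated probability of correctness to a retrieval budget. This is an illustration, not the authors' implementation: the function names, the bin count, the threshold values (`t_direct`, `t_compact`, `t_abstain`), and the choice to drive the decision from a single closed-book probability are all assumptions. Only the budget levels (closed-book, k=1, k=5, abstain) come from the abstract.

```python
from typing import Sequence


def expected_calibration_error(
    probs: Sequence[float], correct: Sequence[int], n_bins: int = 10
) -> float:
    """Equal-width-bin ECE: the bin-weighted mean of |accuracy - confidence|."""
    if len(probs) != len(correct):
        raise ValueError("probs and correct must have the same length")
    if not probs:
        raise ValueError("need at least one prediction")

    # Group (probability, outcome) pairs by confidence bin.
    bins = [[] for _ in range(n_bins)]
    for p, c in zip(probs, correct):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 falls in the last bin
        bins[idx].append((p, c))

    total = len(probs)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_conf = sum(p for p, _ in members) / len(members)
        accuracy = sum(c for _, c in members) / len(members)
        ece += (len(members) / total) * abs(accuracy - mean_conf)
    return ece


def allocate_budget(
    p_correct: float, t_direct: float = 0.8, t_compact: float = 0.5
) -> int:
    """Map a calibrated closed-book probability to a passage budget k.

    Hypothetical rule: high confidence answers closed-book (k=0), medium
    confidence fetches a compact context (k=1), low confidence fetches
    the full context (k=5).
    """
    if p_correct >= t_direct:
        return 0
    if p_correct >= t_compact:
        return 1
    return 5


def should_abstain(p_correct_after: float, t_abstain: float = 0.3) -> bool:
    """Hypothetical abstention check on the calibrated probability of the final answer."""
    return p_correct_after < t_abstain
```

The thresholds only carry meaning if the input probabilities are calibrated, which is why the abstract frames calibration as the interface: a raw log-probability score would need its thresholds re-tuned for each dataset and model, whereas a calibrated probability can be compared directly against an accuracy target or a cost budget.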