RASER: Recoverability-Aware Selective Escalation Router for Multi-Hop Question Answering

2026-06-01 • Artificial Intelligence

AI summary

The authors found that many multi-hop questions can be answered correctly with a single round of retrieval, so running extra searches on every question wastes time and compute. They created RASER, a family of lightweight routers that decide whether to stop after one-shot retrieval or to escalate to more retrieval, without making an extra LLM call to decide. Across the reported tests, the routers stay competitive in accuracy while using fewer tokens than the compared systems. This helps make question-answering systems more efficient when budgets are limited.

multi-hop question answering, retrieval-augmented generation (RAG), large language models (LLMs), one-shot retrieval, iteration in retrieval, question decomposition, token cost, cost-accuracy tradeoff

Authors

Yuyang Li, Zihe Yan, Tobias Käfer

Abstract

Multi-hop question-answering systems often apply expensive retrieval to every question. They may decompose the question, run several retrieval rounds, or search through bridge entities before answering. All of these strategies rely on repeated LLM calls to rewrite or decompose the question, which adds token cost and makes them a poor fit when the LLM budget is tight. However, our analysis shows that many multi-hop questions are already answered correctly by a single one-shot RAG pass, so running extra retrieval on every question wastes budget. We introduce RASER (Recoverability-Aware Selective Escalation Router), a family of cheap routers built on one-shot RAG and six features derived from it. RASER-2 decides whether to stop or to escalate to the extra-retrieval action PRUNE. RASER-3 chooses among one-shot RAG, PRUNE, and the iterative-retrieval method IRCoT, using the same features plus an explicit cost-accuracy trade-off. Neither router makes an extra LLM call to decide. Across six LLMs and three multi-hop QA benchmarks, both routers stay competitive in F1 with state-of-the-art (SOTA) baselines while spending only 41-49% of the tokens used by always-prune, and fewer tokens than the iterative and decomposition retrieval baselines.
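The cost-accuracy trade-off behind a three-way router can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not specify the six features or the decision rule, so the sketch assumes that some upstream model has already turned the one-shot RAG features into a per-action F1 estimate, and it uses a simple linear penalty `lam` on token cost. The names `Estimate`, `route`, `lam`, and all numbers are hypothetical.

```python
from dataclasses import dataclass

# Action names follow the abstract; they are ordered from cheapest to most
# expensive so that ties resolve toward the cheaper action.
ACTIONS = ("one_shot", "prune", "ircot")


@dataclass(frozen=True)
class Estimate:
    expected_f1: float  # hypothetical predicted F1 for the action, in [0, 1]
    token_cost: float   # hypothetical expected token spend for the action


def route(estimates: dict, lam: float) -> str:
    """Return the action with the highest expected_f1 - lam * token_cost.

    Assumption: a linear penalty on tokens stands in for the paper's
    cost-accuracy trade-off, whose exact form the abstract does not give.
    """
    def utility(action: str) -> float:
        est = estimates[action]
        return est.expected_f1 - lam * est.token_cost

    # max() returns the first maximal element, so ties favor cheaper actions.
    return max(ACTIONS, key=utility)


# Made-up estimates for a single question, for illustration only.
example = {
    "one_shot": Estimate(expected_f1=0.60, token_cost=1000),
    "prune": Estimate(expected_f1=0.70, token_cost=2500),
    "ircot": Estimate(expected_f1=0.72, token_cost=6000),
}

cheap_budget_choice = route(example, lam=1e-4)   # strong cost penalty
balanced_choice = route(example, lam=2e-5)       # moderate cost penalty
accuracy_only_choice = route(example, lam=0.0)   # cost ignored
```

In this sketch, raising `lam` pushes the choice toward one-shot RAG and lowering it toward the more expensive actions. Because the decision is an arithmetic comparison over precomputed estimates, it needs no additional LLM call, which matches the property the abstract claims for both routers.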
