SeqRoute: Global Budget-Aware Sequential LLM Routing via Offline Reinforcement Learning

2026-05-25 • Machine Learning

Machine Learning, Artificial Intelligence

AI summary

The authors observe that LLM routing systems that treat each user query independently tend to exhaust their compute budget early in a session, leaving harder questions later in the session to weaker models. They propose SeqRoute, a method that plans resource use across a multi-turn session by learning when to save budget for tougher future queries. To obtain enough training data, they replay past sessions under many hypothetical budgets. Their experiments show that SeqRoute cuts cost by 6.0-73.5% while maintaining or improving answer quality, and that it runs out of budget in under 1% of sessions, less often than the baseline methods.

LLM routing, Markov decision process, offline reinforcement learning, conservative Q-learning, budget management, multi-turn sessions, behavior cloning, cost-quality tradeoff, hindsight budget relabeling, zero-shot optimization

Authors

Zhongling Xu, Shunan Zheng, Wei Wang

Abstract

Existing LLM routing frameworks treat queries as independent events, neglecting the sequential nature of real-world user sessions constrained by global computational budgets. This mismatch inevitably leads to budget bankruptcy: myopic routing policies exhaust resources on early interactions, forcing subsequent and often more complex queries onto inadequate models. We introduce SeqRoute, a framework that formulates multi-turn routing as a finite-horizon Markov Decision Process and solves it via offline reinforcement learning. By incorporating the remaining budget into the state space and training with Conservative Q-Learning (CQL), SeqRoute learns delayed gratification to strategically preserve resources for high-stakes turns later in the session. To overcome data starvation, we propose Hindsight Budget Relabeling (HBR). This technique retrospectively simulates historical trajectories under diverse hypothetical budgets, expanding 10,000 raw sessions into 2.38 million transitions enriched with critical bankruptcy signals. At deployment, a dynamic $λ$-sweep mechanism enables zero-shot navigation of the cost-quality Pareto frontier without retraining. Extensive evaluations demonstrate that SeqRoute reduces operational costs by 6.0-73.5% while maintaining or improving quality, and suppresses bankruptcy rates to under 1%, strictly dominating behavior cloning, budget-aware heuristics, and static baselines across the entire Pareto frontier.
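The relabeling idea in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not specify the state features, the reward, or how bankruptcy is penalized, so the `Turn` and `Transition` types, the three-component state, the `bankruptcy_penalty` parameter, and the rule that a session ends at the first unaffordable turn are all assumptions made for illustration. The sketch replays one logged session under several hypothetical budgets and emits one transition per turn, with the remaining budget fraction included in the state.

```python
from dataclasses import dataclass
from typing import List, Tuple

State = Tuple[float, float, float]  # (query feature, remaining budget fraction, turns-left fraction)

@dataclass
class Turn:
    feature: float   # hypothetical stand-in for a query representation
    action: int      # index of the model chosen in the logged session
    cost: float      # cost incurred by that model on this turn
    quality: float   # observed response quality for this turn

@dataclass
class Transition:
    state: State
    action: int
    reward: float
    next_state: State
    done: bool

def relabel_session(turns: List[Turn], budgets: List[float],
                    bankruptcy_penalty: float = 1.0) -> List[Transition]:
    """Replay one logged session under each hypothetical budget (assumed scheme)."""
    transitions: List[Transition] = []
    horizon = len(turns)
    for budget in budgets:
        if budget <= 0:
            raise ValueError("budgets must be positive")
        remaining = budget
        for t, turn in enumerate(turns):
            state = (turn.feature, remaining / budget, (horizon - t) / horizon)
            bankrupt = turn.cost > remaining
            if bankrupt:
                # Assumed bankruptcy signal: fixed penalty, budget drops to zero.
                reward, remaining_after = -bankruptcy_penalty, 0.0
            else:
                reward, remaining_after = turn.quality, remaining - turn.cost
            done = bankrupt or t == horizon - 1
            next_feature = 0.0 if done else turns[t + 1].feature
            next_state = (next_feature, remaining_after / budget,
                          (horizon - t - 1) / horizon)
            transitions.append(Transition(state, turn.action, reward, next_state, done))
            if bankrupt:
                break  # assumed: the relabeled episode ends at bankruptcy
            remaining = remaining_after
    return transitions

# Toy session with three turns of increasing cost (values are made up).
session = [Turn(0.1, 0, 1.0, 0.6), Turn(0.5, 1, 2.0, 0.8), Turn(0.9, 1, 3.0, 0.9)]
data = relabel_session(session, budgets=[2.5, 10.0])
```

Under the tight budget of 2.5 the toy session goes bankrupt on its second turn, while under the budget of 10.0 it completes, so the same log yields both a failure and a success trajectory. Transitions of this shape could then be passed to an offline RL learner such as CQL.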
