RLM-Cascade: Response-Level Speculative Decoding for Cost-Efficient LLM API Serving

2026-06-22 | Machine Learning

AI summary

The authors present RLM-Cascade, a proxy-layer system that lowers the cost of large language model (LLM) APIs by first generating a draft answer with a fast, inexpensive model and involving the more capable model only when needed. On a real-world agentic coding workload of 125 production requests, the draft was used for 88.8% of requests, API cost fell by 45.8% relative to calling the capable model directly, and median response time dropped from 3,698 ms to 2,026 ms. On a 20-task benchmark, answer quality matched or exceeded the capable model used alone (100% versus 95% pass rate). A rule-based complexity router decides when to skip the expensive model for simple turns, and schema-critical tool-selection turns bypass the draft pipeline. RLM-Cascade is deployed in production and released as open source.

speculative decoding, large language models (LLMs), proxy-layer system, API cost reduction, draft model, verify model, complexity router, code generation, latency reduction, open source deployment
Authors
Haifeng Wu, Srinivasan Manoharan, Fangbo Tu, Junhua Zhao, Jian Wan
Abstract
We present RLM-Cascade, a proxy-layer system that applies speculative decoding at the response level to reduce LLM API costs without requiring model architecture access or a shared vocabulary. A fast, inexpensive draft model generates a candidate response; a capable verify model then accepts or enhances that response, or is bypassed entirely, depending on the decision of a lightweight complexity router. On a real-world agentic coding workload (Claude Code), RLM-Cascade achieves a draft-use rate of 88.8% across 125 production requests, reducing API cost by 45.8% relative to a direct Opus baseline. Counter-intuitively, the proxy also reduces end-to-end latency: median response time is 2,026 ms versus 3,698 ms for Native Opus, a 1.83x speedup at p50, because the SKIPPED path (DeepSeek only, no Opus call) dominates the workload distribution. Quality matches or exceeds the Opus baseline: a 100% pass rate on a 20-task Code/Math/Instruct benchmark versus 95% for Native Opus (20 of 20 tasks versus 19 of 20). We further describe a rule-based complexity router that selects the SKIPPED path for simple agentic turns and a hybrid tool-call strategy that bypasses the speculative pipeline for schema-critical tool-selection turns. RLM-Cascade is deployed in production as an enterprise AI infrastructure component and published as open source with a live metrics dashboard and Prometheus endpoint.
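The control flow described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `Request`, `Result`, `Path`, `is_simple`, and `cascade` are hypothetical, the keyword and length rules stand in for whatever rules the real router uses, and sending tool-selection turns straight to the verify model is one plausible reading of "bypasses the speculative pipeline".

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable


class Path(Enum):
    SKIPPED = "skipped"    # draft only; the verify model is never called
    VERIFIED = "verified"  # draft reviewed (accepted or enhanced) by the verify model
    DIRECT = "direct"      # tool-selection turn; speculative pipeline bypassed


@dataclass
class Request:
    prompt: str
    expects_tool_call: bool = False  # hypothetical flag for schema-critical turns


@dataclass
class Result:
    text: str
    path: Path


# Illustrative rules only; the abstract does not specify the router's criteria.
COMPLEX_HINTS = ("refactor", "prove", "architecture", "concurrency")
MAX_SIMPLE_CHARS = 400


def is_simple(req: Request) -> bool:
    """Rule-based complexity check: short prompt with no 'hard' keywords."""
    text = req.prompt.lower()
    return len(text) <= MAX_SIMPLE_CHARS and not any(h in text for h in COMPLEX_HINTS)


def cascade(
    req: Request,
    draft: Callable[[str], str],   # cheap model, e.g. a DeepSeek client wrapper
    verify: Callable[[str], str],  # capable model, e.g. an Opus client wrapper
) -> Result:
    # Assumption: tool-selection turns go straight to the capable model.
    if req.expects_tool_call:
        return Result(verify(req.prompt), Path.DIRECT)

    candidate = draft(req.prompt)

    # Simple turns return the draft as-is (the SKIPPED path).
    if is_simple(req):
        return Result(candidate, Path.SKIPPED)

    # Otherwise the verify model sees the prompt plus the draft and returns
    # either the draft unchanged (accept) or an improved version (enhance).
    reviewed = verify(f"{req.prompt}\n\nDraft answer:\n{candidate}")
    return Result(reviewed, Path.VERIFIED)
```

Because `draft` and `verify` are plain callables, the same function can be exercised with stubs in tests and with real API clients in a proxy. The cost and latency gains reported in the abstract would come from how often the `SKIPPED` branch is taken, since that branch makes no call to the expensive model.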