AGENTSERVESIM: A Hardware-aware Simulator for Multi-Turn LLM Agent Serving
2026-06-08 • Computation and Language
Computation and Language, Artificial Intelligence
AI summary
The authors developed AGENTSERVESIM, a tool that simulates how multi-step AI models (LLM agents) work when they need to use external tools during conversations. Unlike existing simulators, it accurately models how these AI systems keep track of their memory and manage resources across multiple turns in a conversation. Their simulator can predict real system behavior closely, helping researchers test different setups without expensive hardware. This makes it easier to improve how these AI agents are served in real applications.
LLM agents, multi-turn conversations, KV-cache, simulation, tool invocation, program execution, hardware-aware simulator, resource scheduling, cache locality, session routing
Authors
Rakibul Hasan Rajib, Mengxin Zheng, Qian Lou
Abstract
Multi-turn LLM agents interleave model calls with external tool invocations, shifting serving from stateless request processing to stateful program execution. Serving these workloads requires scheduling, KV-cache management, and routing policies that use program-level context, including turn dependencies, tool-induced gaps, and reusable KV state. Evaluating such policies directly on real systems is costly, since each design point may require dedicated accelerator time across arrival rates, model scales, serving-instance counts, and memory hierarchies. Simulation offers a scalable alternative, but existing LLM serving simulators target stateless request-level workloads and therefore omit the core dynamics of agent serving: multi-turn program execution, cross-turn cache locality, and KV-cache residency during tool gaps. We present AGENTSERVESIM, a hardware-aware simulator for multi-turn LLM agent serving. AGENTSERVESIM evaluates serving policies at program granularity through composable modules: a Program Orchestrator preserves program identity and turn order, a Tool Simulator materializes tool-induced gaps, a Session-Aware Router maintains program-to-instance affinity for cache-aware dispatch, and a KV Residency Model tracks policy-defined KV placement across HBM, host DRAM/CXL, and eviction. Across real serving deployments and hardware configurations, AGENTSERVESIM reproduces real-system behavior within 6% error across key performance metrics while running entirely on commodity CPUs. These results show that AGENTSERVESIM enables controlled, repeatable exploration of agent-serving policies without requiring exhaustive deployment on costly accelerators.
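To make the module roles in the abstract concrete, the following is a minimal toy sketch of program-granularity simulation; it is not AGENTSERVESIM's code or API. Every name here (`KVResidency`, `SessionAwareRouter`, `simulate`), the LRU demotion policy, and the token-count capacities are assumptions chosen for illustration. The sketch also ignores model compute time and hardware timing entirely, so it only shows how tool gaps, program-to-instance affinity, and tiered KV residency interact to determine how much prior context must be recomputed.

```python
import heapq
from collections import OrderedDict
from dataclasses import dataclass, field


@dataclass
class KVResidency:
    """Hypothetical per-instance KV tracker: HBM -> host DRAM/CXL -> evicted."""
    hbm_capacity: int    # tokens that fit in HBM (assumed unit)
    host_capacity: int   # tokens that fit in host DRAM/CXL (assumed unit)
    hbm: OrderedDict = field(default_factory=OrderedDict)   # pid -> tokens, LRU first
    host: OrderedDict = field(default_factory=OrderedDict)  # pid -> tokens, LRU first

    def lookup(self, pid):
        if pid in self.hbm:
            return "hbm"
        if pid in self.host:
            return "host"
        return "evicted"

    def touch(self, pid, tokens):
        """Place pid's KV state in HBM, demoting least recently used programs."""
        self.hbm.pop(pid, None)
        self.host.pop(pid, None)
        self.hbm[pid] = tokens
        # Demote LRU programs to host memory, but never the one just served.
        while sum(self.hbm.values()) > self.hbm_capacity and len(self.hbm) > 1:
            victim, size = self.hbm.popitem(last=False)
            self.host[victim] = size
        # Drop LRU programs from host memory once it overflows.
        while sum(self.host.values()) > self.host_capacity and self.host:
            self.host.popitem(last=False)


class SessionAwareRouter:
    """Hypothetical router: pins each program to the instance chosen on its first turn."""

    def __init__(self, n_instances):
        self.affinity = {}
        self.load = [0] * n_instances  # number of programs pinned to each instance

    def route(self, pid):
        if pid not in self.affinity:
            inst = min(range(len(self.load)), key=self.load.__getitem__)
            self.affinity[pid] = inst
            self.load[inst] += 1
        return self.affinity[pid]


def simulate(programs, n_instances=1, hbm_capacity=1000, host_capacity=2000):
    """programs: {pid: [(new_tokens, tool_gap_seconds), ...]} in turn order.

    Returns {pid: prior-context tokens that had to be recomputed}.
    Model compute time is treated as zero; only tool gaps advance the clock.
    """
    router = SessionAwareRouter(n_instances)
    kv = [KVResidency(hbm_capacity, host_capacity) for _ in range(n_instances)]
    context = {pid: 0 for pid in programs}
    recomputed = {pid: 0 for pid in programs}
    events = [(0.0, pid, 0) for pid in programs]  # (ready_time, pid, turn_index)
    heapq.heapify(events)
    while events:
        now, pid, turn = heapq.heappop(events)
        new_tokens, tool_gap = programs[pid][turn]
        inst = router.route(pid)
        if kv[inst].lookup(pid) == "evicted":
            recomputed[pid] += context[pid]  # prior context must be prefilled again
        context[pid] += new_tokens
        kv[inst].touch(pid, context[pid])
        if turn + 1 < len(programs[pid]):
            # The next turn becomes ready only after the tool call returns.
            heapq.heappush(events, (now + tool_gap, pid, turn + 1))
    return recomputed
```

As a usage example under these assumptions, take two programs on one instance, `{"a": [(600, 5.0), (100, 0.0)], "b": [(600, 1.0), (100, 0.0)]}`, with `hbm_capacity=1000`. With `host_capacity=0`, program `a` loses its KV state during its long tool gap and recomputes 600 tokens on its second turn; with `host_capacity=2000`, the state is demoted to host memory instead and nothing is recomputed. A fuller model would also charge a transfer cost for reloading from the host tier, which this sketch omits.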