LiveServe: Interaction-Aware Serving for Real-Time Omni-Modal LLMs
2026-06-22 • Distributed, Parallel, and Cluster Computing
AI summary
The authors developed LiveServe, a serving system that keeps omni-modal language models responsive during live speech conversations in which users can interrupt at any time. Earlier systems wasted work by generating output far ahead of what the user had actually heard, and they evicted cached state that the next turn needed. LiveServe tracks playback progress, prioritizes sessions that are waiting for their first audio or are about to run out of buffered audio, and limits generation beyond what has been played. It also chooses which key-value (KV) cache entries to evict based on when they will next be used, and preloads likely-needed entries while the user is still speaking, which hides reload delays. In the authors' tests, LiveServe lowered P90 audio TTFP by 1.55× on average and raised completed-request throughput by 1.15× on average.
Omni-modal language models, speech-centric conversations, key-value (KV) cache, barge-in events, scheduler, audio playback, real-time serving, throughput, latency, vLLM
Authors
Xiangyu Zhi, Peiqi Yin, Sheng Guan, Chenguang Zheng, James Cheng, Xiao Yan
Abstract
Realtime omni-modal LMs support speech-centric conversations where users stream inputs, hear generated audio, and interrupt freely. Existing Omni-LM serving systems still rely on throughput-oriented LLM scheduling and LRU KV offloading. These policies ignore audio playback and multi-turn reuse: they may generate tokens far beyond what users hear, wasting work after barge-in, and evict KV state needed in the next turn. LiveServe is an interaction-aware serving system for realtime Omni-LM interaction. It exposes playback progress, speech activity, and barge-in events to the serving pipeline. The scheduler prioritizes first-audio and near-underrun sessions while limiting generation beyond the playback frontier. The KV manager uses next-use-aware eviction and preloads likely-needed KV during user speech to hide reload latency. On vLLM-Omni, LiveServe improves realtime serving across two Omni-LMs and mixed workloads. It lowers P90 audio TTFP by $1.55\times$ on average and up to $2.21\times$, while improving completed-request throughput by $1.15\times$ on average and up to $1.56\times$, and moves most KV reload work off the next-turn critical path.
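The two policies described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the `Session` fields, the thresholds `low_water_s` and `max_lead_s`, and the `next_use_s` estimate are all hypothetical names and values chosen for illustration. The sketch assumes each session reports how much audio has been generated and how much the client has played back.

```python
from dataclasses import dataclass


@dataclass
class Session:
    sid: str
    generated_s: float  # seconds of audio generated so far (assumed signal)
    played_s: float     # seconds of audio the client has played (assumed signal)
    next_use_s: float   # estimated seconds until this session's KV is needed again

    @property
    def buffer_s(self) -> float:
        # Audio generated but not yet heard: the lead over the playback frontier.
        return self.generated_s - self.played_s


def schedule(sessions, low_water_s=0.5, max_lead_s=3.0):
    """Order sessions for the next decode step.

    Tier 0: sessions that have produced no audio yet (first-audio latency).
    Tier 1: sessions whose buffered audio is below low_water_s (near underrun).
    Tier 2: everything else.
    Sessions already max_lead_s or more ahead of playback are skipped, so
    little work is wasted if the user barges in.
    """
    def tier(s: Session) -> int:
        if s.generated_s == 0.0:
            return 0
        if s.buffer_s < low_water_s:
            return 1
        return 2

    runnable = [s for s in sessions if s.buffer_s < max_lead_s]
    return sorted(runnable, key=lambda s: (tier(s), s.buffer_s))


def pick_eviction_victim(sessions):
    """Next-use-aware eviction: evict the KV whose next use is furthest away."""
    return max(sessions, key=lambda s: s.next_use_s)
```

With four sessions (one awaiting first audio, one with 0.2 s buffered, one with 1.5 s buffered, and one 4 s ahead of playback), `schedule` returns the first three in that order and skips the fourth. An LRU policy would instead rank sessions by last access time, which ignores that a session whose user is about to finish speaking will need its KV state again soon.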