LiveServe: Interaction-Aware Serving for Real-Time Omni-Modal LLMs

2026-06-22 • Distributed, Parallel, and Cluster Computing

AI summary

The authors developed LiveServe, a serving system that keeps omni-modal language models responsive during live speech conversations in which users can interrupt at any time. Earlier systems wasted work by generating output far beyond what the user had heard, and they evicted cached state that the next turn needed. LiveServe tracks how much audio the user has actually heard and prioritizes sessions that are waiting for their first audio or are about to run out of buffered audio. It also chooses what to evict based on when cached data will next be used and preloads that data while the user is speaking, which hides reload delay. In the authors' tests, LiveServe reduced audio response latency and increased completed-request throughput.

Omni-modal language models, speech-centric conversations, key-value (KV) cache, barge-in events, scheduler, audio playback, real-time serving, throughput, latency, vLLM

Authors

Xiangyu Zhi, Peiqi Yin, Sheng Guan, Chenguang Zheng, James Cheng, Xiao Yan

Abstract

Realtime omni-modal LMs support speech-centric conversations where users stream inputs, hear generated audio, and interrupt freely. Existing Omni-LM serving systems still rely on throughput-oriented LLM scheduling and LRU KV offloading. These policies ignore audio playback and multi-turn reuse: they may generate tokens far beyond what users hear, wasting work after barge-in, and evict KV state needed in the next turn. LiveServe is an interaction-aware serving system for realtime Omni-LM interaction. It exposes playback progress, speech activity, and barge-in events to the serving pipeline. The scheduler prioritizes first-audio and near-underrun sessions while limiting generation beyond the playback frontier. The KV manager uses next-use-aware eviction and preloads likely-needed KV during user speech to hide reload latency. On vLLM-Omni, LiveServe improves realtime serving across two Omni-LMs and mixed workloads. It lowers P90 audio TTFP by $1.55\times$ on average and up to $2.21\times$, while improving completed-request throughput by $1.15\times$ on average and up to $1.56\times$, and moves most KV reload work off the next-turn critical path.
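
The two policies named in the abstract can be illustrated with a minimal sketch. It is not LiveServe's implementation: the `Session` fields, the `pick_next` and `pick_victim` helpers, and the thresholds `underrun_s` and `max_lead_s` are all hypothetical names and values chosen for illustration. The sketch assumes each session reports how many seconds of audio have been generated and played, and that some predictor supplies an estimated next-use time for each session's KV state.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Session:
    """Hypothetical per-session state exposed to the scheduler."""
    sid: str
    generated_s: float       # seconds of audio generated so far
    played_s: float          # seconds of audio the user has heard (playback frontier)
    first_audio_sent: bool   # False until the first audio chunk has been delivered
    barged_in: bool = False  # True once the user has interrupted this response

    @property
    def buffered_s(self) -> float:
        # Audio generated but not yet played.
        return self.generated_s - self.played_s


def pick_next(sessions: List[Session],
              underrun_s: float = 0.5,
              max_lead_s: float = 3.0) -> Optional[Session]:
    """Pick the session to run next, or None if nothing should run.

    Assumed policy: first-audio sessions go first, then sessions close to
    underrun (least buffered audio first), then the rest. Sessions that were
    interrupted, or that are already max_lead_s ahead of playback, are skipped.
    """
    def rank(s: Session):
        if not s.first_audio_sent:
            return (0, 0.0)
        if s.buffered_s < underrun_s:
            return (1, s.buffered_s)
        return (2, s.buffered_s)

    runnable = [
        s for s in sessions
        if not s.barged_in
        and (not s.first_audio_sent or s.buffered_s < max_lead_s)
    ]
    return min(runnable, key=rank, default=None)


def pick_victim(next_use_s: Dict[str, float]) -> Optional[str]:
    """Next-use-aware eviction: evict the session whose KV state is
    predicted to be needed furthest in the future (predictions are assumed)."""
    if not next_use_s:
        return None
    return max(next_use_s, key=next_use_s.get)


sessions = [
    Session("a", generated_s=5.0, played_s=1.0, first_audio_sent=True),
    Session("b", generated_s=1.2, played_s=1.0, first_audio_sent=True),
    Session("c", generated_s=0.0, played_s=0.0, first_audio_sent=False),
]
chosen = pick_next(sessions)
victim = pick_victim({"a": 2.0, "b": 30.0, "c": 5.0})
```

Here session `c` is chosen because it has not yet produced audio, and session `a` is never scheduled while it stays 4 seconds ahead of playback. An LRU policy would ignore the predicted next-use times; this sketch evicts `b` because its state is expected to be idle longest.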
