How Should LLMs Listen While Speaking? A Study of User-Stream Routing in Full-Duplex Spoken Dialogue

2026-05-11

Computation and Language
AI summary

The authors explore how to make large language models listen and talk at the same time, which is hard because these models are built to extend a single input sequence and do not naturally accept new input while generating. They compare two ways to route the user's speech into the model: mixing it directly into the model's input (channel fusion) or keeping it separate as an external memory that the model consults through cross-attention. Mixing the user input directly gives better question answering but can corrupt the model's ongoing response when the user interrupts. The separate memory method keeps the ongoing response coherent but is weaker at answering questions. Their work highlights a central tradeoff in designing systems that can hear and respond simultaneously.

full-duplex spoken dialogue, large language models, cross-attention, channel fusion, semantic grounding, context corruption, user interruptions, spoken question answering, dialogue system architecture, external memory
Authors
Hui Lu, Xueyuan Chen, Huimeng Wang, Shuhai Peng, Shiyin Kang, Xixin Wu, Zhiyong Wu
Abstract
Full-duplex spoken dialogue requires a model to keep listening while generating its own spoken response. This is challenging for large language models (LLMs), which are designed to extend a single coherent sequence and do not naturally support user input arriving during generation. We argue that how the user stream is routed into the LLM is therefore a key architectural question for full-duplex modeling. To study this question, we extend a text-only LLM into a unified full-duplex spoken dialogue system and compare two routing strategies under a shared training pipeline: (i) channel fusion, which injects the user stream directly into the LLM input, and (ii) cross-attention routing, which keeps the user stream as external memory accessed through cross-attention adapters. Experiments on spoken question answering and full-duplex interaction benchmarks reveal a clear tradeoff. Channel fusion yields stronger semantic grounding and consistently better question-answering performance. However, under semantically overlapping conditions such as user interruptions, it is more vulnerable to context corruption: if the model fails to stop in time, the overlapping user stream can interfere with ongoing generation and lead to semantically incoherent continuations. Cross-attention routing underperforms on question answering, but better preserves the LLM generation context and is more robust to this failure mode. These results establish user-stream routing as a central design axis in full-duplex spoken dialogue and offer practical guidance on the tradeoff between semantic integration and context robustness. We provide a demo page for qualitative inspection.
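The two routing strategies in the abstract can be sketched in a few lines of PyTorch. This is a minimal illustration, not the authors' implementation: the module names, fusion projection, hidden size, and head count are all assumptions. The key structural difference it shows is that channel fusion alters the sequence fed to the LLM backbone, while cross-attention routing leaves the backbone's input (the agent stream) untouched and reads the user stream as external key/value memory through an adapter.

```python
# Hypothetical sketch of the two user-stream routing strategies; all names,
# shapes, and layer choices here are illustrative assumptions.
import torch
import torch.nn as nn

D = 32  # hidden size (illustrative)

class ChannelFusion(nn.Module):
    """Inject the user stream into the LLM input: time-aligned agent and user
    embeddings are fused per step, so the backbone sees both channels."""
    def __init__(self, d=D):
        super().__init__()
        self.fuse = nn.Linear(2 * d, d)  # hypothetical fusion projection

    def forward(self, agent_emb, user_emb):
        # agent_emb, user_emb: (batch, seq, d) aligned streams
        return self.fuse(torch.cat([agent_emb, user_emb], dim=-1))

class CrossAttentionRouting(nn.Module):
    """Keep the user stream as external memory: the backbone consumes only the
    agent stream, and an adapter cross-attends to the user stream."""
    def __init__(self, d=D, heads=4):
        super().__init__()
        self.xattn = nn.MultiheadAttention(d, heads, batch_first=True)
        self.norm = nn.LayerNorm(d)

    def forward(self, agent_hidden, user_memory):
        # Queries come from the agent stream; keys/values from the user stream.
        attended, _ = self.xattn(agent_hidden, user_memory, user_memory)
        # Residual connection: the LLM's own context is preserved and only
        # augmented, which relates to the robustness the paper reports.
        return self.norm(agent_hidden + attended)

agent = torch.randn(1, 10, D)  # agent (speaking) stream
user = torch.randn(1, 10, D)   # user (listening) stream

fused = ChannelFusion()(agent, user)
routed = CrossAttentionRouting()(agent, user)
```

Under this framing, the paper's tradeoff is intuitive: fusion mixes the user signal into every backbone state (strong grounding, but an overlapping interruption contaminates generation), while cross-attention gates the user signal through an adapter (weaker grounding, but the generation context stays intact).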