DuplexOmni: Real-Time Listening, Seeing, Thinking, and Speaking for Full-Duplex Interaction
2026-06-08 • Human-Computer Interaction
AI summary
The authors introduce DuplexOmni, a system for smooth, real-time conversation over audio, video, and text at once. The approach splits the work into two parts: an interaction layer that processes streaming inputs and generates responses in real time, and a thinking layer that handles deeper reasoning and tool use when needed. The authors also build a Writer-Director pipeline that constructs training data for continuous interaction. They report strong results on multiple public benchmarks and natural full-duplex interaction.
multimodal interaction, full-duplex, real-time processing, end-to-end system, complex reasoning, tool use, Writer-Director pipeline, continuous interaction, streaming inputs, benchmark evaluation
Authors
Muye Huang, Lingling Zhang, Xingyu Yu, Lei Shi, Zhanyu Ma, Jun Xu, Jiuchong Gao, Jinghua Hao, Renqing He, Jun Liu
Abstract
Human interaction is continuous, multimodal, and full-duplex by nature. Although recent omni models have made substantial progress in unified speech, vision, and text modeling, combining seamless real-time interaction with complex reasoning and tool use remains challenging. We present DuplexOmni, a method for real-time multimodal full-duplex interaction. DuplexOmni separates model capability into an interaction layer and a thinking layer, which collaborate asynchronously in parallel. The interaction layer is implemented by the DuplexOmni model, an end-to-end system that processes streaming audio and video inputs while generating text and speech responses in real time. The thinking layer is a pluggable module that provides complex reasoning and tool-use capabilities. To support this method, we further develop a Writer-Director pipeline for constructing continuous-interaction training data. Experiments show that DuplexOmni achieves strong performance on multiple public benchmarks and exhibits natural full-duplex interaction ability.
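The asynchronous split between the two layers can be illustrated with a small sketch. This is not the authors' implementation: the function names `interaction_layer` and `thinking_layer`, the rule that a chunk ending in "?" needs deeper reasoning, the text-only inputs, and the sleep durations are all assumptions made for illustration. It only shows the general pattern of a fast loop that keeps responding while a slower, pluggable task runs in the background.

```python
import asyncio


async def thinking_layer(query: str) -> str:
    """Hypothetical pluggable module standing in for slow reasoning or tool use."""
    await asyncio.sleep(0.05)  # simulated latency of reasoning / a tool call
    return f"[thought] answer to {query!r}"


async def interaction_layer(chunks: list[str]) -> list[str]:
    """Hypothetical fast loop: reacts to every input chunk without blocking."""
    outputs: list[str] = []
    pending: list[asyncio.Task] = []

    def flush_finished() -> None:
        # Deliver any thinking results that completed in the background.
        for task in [t for t in pending if t.done()]:
            pending.remove(task)
            outputs.append(task.result())

    for chunk in chunks:
        flush_finished()
        if chunk.endswith("?"):
            # Assumed routing rule: hand questions to the thinking layer,
            # but acknowledge immediately instead of waiting for the answer.
            pending.append(asyncio.create_task(thinking_layer(chunk)))
            outputs.append(f"[fast] let me check: {chunk}")
        else:
            outputs.append(f"[fast] heard: {chunk}")
        await asyncio.sleep(0.01)  # simulated gap before the next chunk arrives

    # End of stream: wait for any thinking tasks still in flight.
    outputs.extend(await asyncio.gather(*pending))
    return outputs
```

Calling `asyncio.run(interaction_layer(["hello", "what is 2+2?", "still here", "ok"]))` returns one fast response per chunk plus one delayed thinking result, with the acknowledgement of the question always appearing before its answer. A real system would stream audio and video rather than strings, and would decide what to delegate with a model rather than a punctuation check.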