Multi-Faceted Interactivity Alignment in Full-Duplex Speech Models

2026-06-09 • Computation and Language

AI summary

The authors improve full-duplex spoken dialogue models, which can listen and speak at the same time, as people do in real conversations. They observe that existing models are trained only to maximize token-level likelihood, which does not directly optimize interaction behaviors such as when to pause or take a turn. To address this, they apply reinforcement learning with rewards that each target one conversational skill: pause handling, turn-taking, backchanneling, and handling user interruptions. They apply the method to two models, Moshi and PersonaPlex, and report more natural interaction both in offline evaluation on pre-recorded audio and in real-time multi-turn dialogue.

full-duplex dialogue models, reinforcement learning, turn-taking, backchanneling, pause handling, spoken dialogue systems, reward functions, language models, interactive behavior

Authors

Atsumoto Ohashi, Neil Zeghidour, Alexandre Défossez, Eugene Kharitonov

Abstract

Full-duplex spoken dialogue models can listen and speak simultaneously, making them a promising architecture for natural conversation. However, current models are trained solely with supervised learning through token-level likelihood maximization, which does not directly optimize interaction-level behaviors, causing interactivity issues such as excessive silence and ill-timed turn-taking. Recent work has applied reinforcement learning (RL) to improve interactivity, but existing methods address only a limited set of interactive behaviors in their rewards. In this work, we propose a post-training alignment method that comprehensively improves the interactivity of full-duplex spoken dialogue models through RL. We address the four canonical axes of interactivity: pause handling, turn-taking, backchanneling, and user interruption. For each axis, we extract short audio segments from human conversation corpora and optimize the model with axis-specific reward functions. An extra LLM-based reward for response quality prevents semantic degradation. We apply our method to two open-source models, Moshi and PersonaPlex, demonstrating consistent improvements in interactivity on both offline evaluation with pre-recorded audio and real-time multi-turn dialogue evaluation.
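
As a rough illustration of how axis-specific rewards might be combined with a response-quality reward, here is a minimal Python sketch. It is not the authors' implementation: the abstract does not give the reward definitions, so the rollout fields (response_delay_s, model_speech_during_pause_s, quality), the two toy reward rules, the 1-second threshold, and the quality_weight parameter are all assumptions made for illustration. Only the four axis names come from the abstract.

```python
from typing import Callable, Dict

# The four interactivity axes named in the abstract.
AXES = ("pause_handling", "turn_taking", "backchanneling", "user_interruption")

# Hypothetical: summary statistics of one generated audio segment.
Rollout = Dict[str, float]
RewardFn = Callable[[Rollout], float]


def turn_taking_reward(rollout: Rollout) -> float:
    """Toy rule (assumption): reward a reply starting within 1 s of the user's turn end."""
    return 1.0 if 0.0 <= rollout["response_delay_s"] <= 1.0 else 0.0


def pause_handling_reward(rollout: Rollout) -> float:
    """Toy rule (assumption): reward staying silent while the user merely pauses."""
    return 1.0 if rollout["model_speech_during_pause_s"] == 0.0 else 0.0


def quality_reward(rollout: Rollout) -> float:
    """Stand-in for an LLM-based judge: the score is simply read from the rollout."""
    return rollout["quality"]


def total_reward(
    axis: str,
    rollout: Rollout,
    axis_rewards: Dict[str, RewardFn],
    quality_weight: float = 0.5,  # hypothetical weighting, not from the paper
) -> float:
    """Axis-specific reward plus a weighted quality term for one segment."""
    if axis not in AXES or axis not in axis_rewards:
        raise ValueError(f"no reward function registered for axis: {axis!r}")
    return axis_rewards[axis](rollout) + quality_weight * quality_reward(rollout)


# Only two of the four axes are sketched here.
axis_rewards = {
    "turn_taking": turn_taking_reward,
    "pause_handling": pause_handling_reward,
}
r = total_reward("turn_taking", {"response_delay_s": 0.4, "quality": 0.8}, axis_rewards)
print(r)
```

In this sketch each training segment carries the axis it was extracted for, so only the matching reward function is evaluated, while the quality term is added for every segment to discourage semantic degradation.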
