Fast When, Careful Who: Dual-Process Multiparty Turn-Taking with Diffusion Augmentation
2026-06-15 • Computation and Language
Computation and Language, Artificial Intelligence
AI summary
The authors address the challenge of managing speaking turns in conversations with more than two people, which is harder than in two-person talk because speakers overlap and change rapidly. They build a two-stage, audio-only system: a fast first stage proposes candidate moments when the current speaker might be finishing, and a lightweight second stage runs only at those moments to decide whether the speaker keeps the floor or it passes to someone else. They evaluate on real multiparty audio from the VoxConverse dataset and report better detection of speaker changes than a baseline. They also test a diffusion-based, label-preserving way of mixing in background audio as data augmentation, which further improves results.
turn-taking, spoken dialogue systems, multiparty conversation, VoxConverse dataset, audio processing, turn boundary detection, next-speaker prediction, data augmentation, diffusion-based augmentation
Authors
Rutherford A. Patamia, Ming Liu, Wei Luo, Favour Ekong, Akan Cosgun
Abstract
Reliable turn-taking is essential for spoken dialogue systems. However, most existing methods are designed for two-speaker interaction and struggle with realistic multiparty audio containing overlap and rapid speaker changes. We study multiparty turn-taking on the VoxConverse dataset and propose an audio-only two-stage pipeline that separates when to trigger a turn boundary from whether the floor is actually transferring. A fast trigger scans the audio and proposes candidate end-of-turn times, while a lightweight verifier runs only at those times to decide Hold or Shift and support next-speaker prediction. We report results in the full multiparty setting and a controlled dyadic top-2 projection for comparability. We also investigate diffusion-based, label-preserving background-audio mixing as a data augmentation strategy. Results show improved shift detection over a baseline, with further improvements from diffusion augmentation.
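The control flow of the two-stage design can be illustrated with a minimal Python sketch. It is not the authors' implementation: the energy-based trigger is a simple stand-in assumed here for the fast stage, the verifier is a hypothetical pluggable callable, and all names and thresholds (`fast_trigger`, `run_pipeline`, `threshold`, `min_silence_frames`, `context_s`) are illustrative assumptions. The point is only that the cheap stage scans every frame while the verifier is invoked once per candidate time.

```python
from typing import Callable, List, Tuple

import numpy as np


def fast_trigger(
    audio: np.ndarray,
    sr: int,
    frame_ms: int = 20,
    threshold: float = 1e-3,
    min_silence_frames: int = 5,
) -> List[float]:
    """Propose candidate end-of-turn times (seconds).

    Stand-in trigger (assumption): a candidate is the start of the first
    low-energy run of at least `min_silence_frames` frames after speech.
    """
    frame_len = int(sr * frame_ms / 1000)
    n_frames = len(audio) // frame_len
    frames = audio[: n_frames * frame_len].reshape(n_frames, frame_len)
    silent = (frames ** 2).mean(axis=1) < threshold

    candidates: List[float] = []
    run = 0
    seen_speech = False
    for i, is_silent in enumerate(silent):
        if is_silent:
            run += 1
            if run == min_silence_frames and seen_speech:
                start_frame = i - min_silence_frames + 1
                candidates.append(start_frame * frame_len / sr)
                seen_speech = False  # wait for new speech before re-triggering
        else:
            run = 0
            seen_speech = True
    return candidates


def run_pipeline(
    audio: np.ndarray,
    sr: int,
    verifier: Callable[[np.ndarray], str],
    context_s: float = 1.0,
) -> List[Tuple[float, str]]:
    """Call the (hypothetical) verifier only at trigger times.

    The verifier sees the audio preceding each candidate and returns
    "HOLD" or "SHIFT".
    """
    decisions: List[Tuple[float, str]] = []
    for t in fast_trigger(audio, sr):
        end = int(t * sr)
        start = max(0, end - int(context_s * sr))
        decisions.append((t, verifier(audio[start:end])))
    return decisions
```

In this sketch the cost of the verifier scales with the number of candidates rather than the number of frames, which is the property the abstract attributes to running the verifier "only at those times". A learned trigger and verifier would replace the two stand-ins without changing the loop.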