Fast When, Careful Who: Dual-Process Multiparty Turn-Taking with Diffusion Augmentation

2026-06-15 • Computation and Language

Computation and Language, Artificial Intelligence

AI summary

The authors address the challenge of managing speaking turns in conversations with more than two people, which is harder than in two-person talk because speakers overlap and change rapidly. They build a two-stage, audio-only system: a fast first stage proposes moments when the current speaker might be finishing, and a lightweight second stage runs only at those moments to decide whether the floor stays with the current speaker or passes to someone else. They evaluate on the VoxConverse multiparty audio dataset and report better detection of speaker shifts than a baseline. They also test diffusion-based background-audio mixing that leaves the labels unchanged as a data augmentation strategy, which further improves results.

turn-taking, spoken dialogue systems, multiparty conversation, VoxConverse dataset, audio processing, turn boundary detection, next-speaker prediction, data augmentation, diffusion-based augmentation

Authors

Rutherford A. Patamia, Ming Liu, Wei Luo, Favour Ekong, Akan Cosgun

Abstract

Reliable turn-taking is essential for spoken dialogue systems. However, most existing methods are designed for two-speaker interaction and struggle with realistic multiparty audio containing overlap and rapid speaker changes. We study multiparty turn-taking on the VoxConverse dataset and propose an audio-only two-stage pipeline that separates when to trigger a turn boundary from whether the floor is actually transferring. A fast trigger scans the audio and proposes candidate end-of-turn times, while a lightweight verifier runs only at those times to decide Hold or Shift and support next-speaker prediction. We report results in the full multiparty setting and a controlled dyadic top-2 projection for comparability. We also investigate diffusion-based, label-preserving background-audio mixing as a data augmentation strategy. Results show improved shift detection over a baseline, with further improvements from diffusion augmentation.
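The trigger-then-verify control flow described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the energy-threshold trigger, the function names `propose_candidates` and `run_pipeline`, the `Decision` type, and all parameter values are assumptions chosen for illustration, and the verifier is a stand-in callable where a trained model would go. The sketch only shows that the verifier is invoked at the proposed candidate times rather than on every frame.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass
class Decision:
    time_s: float  # candidate end-of-turn time, in seconds
    label: str     # "Hold" or "Shift"


def propose_candidates(
    energy: Sequence[float],
    frame_s: float,
    silence_thresh: float,
    min_silence_frames: int,
) -> List[float]:
    """Hypothetical fast trigger (an assumption, not the paper's trigger).

    Proposes one candidate per low-energy run, placed at the start of the
    run, once the run has lasted `min_silence_frames` frames.
    """
    candidates: List[float] = []
    run = 0
    for i, e in enumerate(energy):
        if e < silence_thresh:
            run += 1
            if run == min_silence_frames:
                start_frame = i - min_silence_frames + 1
                candidates.append(start_frame * frame_s)
        else:
            run = 0
    return candidates


def run_pipeline(
    energy: Sequence[float],
    frame_s: float,
    verifier: Callable[[float], bool],
    silence_thresh: float = 0.1,
    min_silence_frames: int = 3,
) -> List[Decision]:
    """Run the verifier only at trigger-proposed times.

    `verifier(t)` is a placeholder that returns True when the floor is
    judged to transfer at time t.
    """
    decisions: List[Decision] = []
    for t in propose_candidates(energy, frame_s, silence_thresh, min_silence_frames):
        label = "Shift" if verifier(t) else "Hold"
        decisions.append(Decision(time_s=t, label=label))
    return decisions


if __name__ == "__main__":
    # Toy per-frame energies at 0.5 s per frame; two pauses.
    toy_energy = [1, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1]
    # Toy verifier: treats any pause after 2 s as a floor transfer.
    for d in run_pipeline(toy_energy, 0.5, lambda t: t > 2.0):
        print(d)
```

With the toy input, the trigger proposes two candidate times and the verifier is called twice, rather than once for each of the twelve frames. This is the cost saving the two-stage split is meant to provide.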
