DialogueSidon: Recovering Full-Duplex Dialogue Tracks from In-the-Wild Dialogue Audio
2026-04-10 • Sound
AI summary
The authors present DialogueSidon, a model that takes a degraded single-channel recording of two people talking and recovers a clean, separate track for each speaker. It combines a variational autoencoder, which compresses self-supervised speech features into a compact latent space, with a diffusion-based predictor that estimates each speaker's latent representation from the mixture. Experiments on English, multilingual, and in-the-wild dialogue datasets show that DialogueSidon improves intelligibility and separation quality over a baseline while running much faster at inference.
full-duplex audio, monaural mixture, speech separation, variational autoencoder (VAE), self-supervised learning (SSL), latent space, diffusion model, dialogue audio, speaker separation
Authors
Wataru Nakata, Yuki Saito, Kazuki Yamauchi, Emiru Tsunoo, Hiroshi Saruwatari
Abstract
Full-duplex dialogue audio, in which each speaker is recorded on a separate track, is an important resource for spoken dialogue research, but is difficult to collect at scale. Most in-the-wild two-speaker dialogue is available only as degraded monaural mixtures, making it unsuitable for systems requiring clean speaker-wise signals. We propose DialogueSidon, a model for joint restoration and separation of degraded monaural two-speaker dialogue audio. DialogueSidon combines a variational autoencoder (VAE) that operates on speech self-supervised learning (SSL) model features, compressing them into a compact latent space, with a diffusion-based latent predictor that recovers speaker-wise latent representations from the degraded mixture. Experiments on English, multilingual, and in-the-wild dialogue datasets show that DialogueSidon substantially improves intelligibility and separation quality over a baseline, while also achieving much faster inference.
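The latent compression step mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the linear encoder, the feature and latent sizes, and the function names `vae_compress` and `kl_to_standard_normal` are all assumptions chosen for illustration. It shows only the generic VAE bottleneck (reparameterized sampling plus the KL term), applied frame by frame to stand-in SSL features; the diffusion-based latent predictor is not sketched here.

```python
import numpy as np


def vae_compress(features, w_mu, w_logvar, rng):
    """Map feature frames (T, D) to sampled latents (T, d).

    Hypothetical linear encoder; a real system would use a deeper network.
    """
    mu = features @ w_mu          # (T, d) posterior means
    logvar = features @ w_logvar  # (T, d) posterior log-variances
    eps = rng.standard_normal(mu.shape)
    # Reparameterization trick: z = mu + sigma * eps
    z = mu + np.exp(0.5 * logvar) * eps
    return z, mu, logvar


def kl_to_standard_normal(mu, logvar):
    """KL(N(mu, sigma^2) || N(0, I)), summed over latent dims, averaged over frames."""
    per_frame = 0.5 * np.sum(np.exp(logvar) + mu**2 - 1.0 - logvar, axis=-1)
    return float(np.mean(per_frame))


rng = np.random.default_rng(0)
T, D, d = 50, 1024, 64  # assumed sizes: frames, SSL feature dim, latent dim
features = rng.standard_normal((T, D))  # stand-in for SSL model features
w_mu = rng.standard_normal((D, d)) * 0.01
w_logvar = rng.standard_normal((D, d)) * 0.01

z, mu, logvar = vae_compress(features, w_mu, w_logvar, rng)
kl = kl_to_standard_normal(mu, logvar)
print(z.shape)
```

Under these assumptions, each 1024-dimensional frame is reduced to a 64-dimensional latent, which is the kind of compact space in which a predictor could then estimate one latent sequence per speaker.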