Reference-Driven Multi-Speaker Audio Scene Generation from In-the-Wild Priors

2026-06-17 · Sound

Sound, Artificial Intelligence, Computer Vision and Pattern Recognition
AI summary

The authors created ScenA, a system that generates multi-speaker audio scenes using a general audio model trained on real-world sounds, not just clean speech. Instead of tagging each speaker turn by turn, ScenA uses example voices and a natural language description of the whole scene to produce audio with overlapping talk, background noise, and emotions. They solved a problem where the model ignored the text prompt by changing the training noise levels, forcing it to pay attention to the scene description. Their tests show ScenA works better than previous systems at matching speakers and creating realistic multi-person audio. This approach moves away from scripted speech pipelines to more natural-sounding audio generation.

multi-speaker dialogue systems, flow-matching model, text-to-audio generation, speaker binding, noise schedule, positional encoding, ambient audio, CoVoMix2-Dialogue benchmark, paralinguistic events, reference shortcut
Authors
Michael Finkelson, Daniel Segal, Eitan Richardson, Shahar Armon, Nani Goldring, Poriya Panet, Nir Zabari, Benjamin Brazowski, Or Patashnik, Yoav HaCohen
Abstract
Existing multi-speaker dialogue systems bind speakers to utterances through structured supervision: per-turn tags, multi-stream transcriptions, or learnable speaker embeddings. These systems operate within speech-only pipelines that produce clean vocal sequences without the ambient texture of real conversations. We take a different approach. Our method, ScenA, conditions a text-to-audio flow-matching foundation model, pretrained on large-scale in-the-wild data, directly on multiple reference voices and a free-form natural language prompt that describes an entire multi-speaker audio scene. Building on such a foundation model lets us inherit its capacity for natural, non-studio audio (background noise, room acoustics, overlapping dialogue, and spontaneous paralinguistic events) while adding multi-speaker control without any per-turn structure. Concretely, reference latents are concatenated into the model's token sequence and distinguished by lightweight identity-aware positional encodings. However, we identify a critical obstacle to this approach: the Reference Shortcut. During training under standard noise schedules, the model can identify the matching reference by its acoustic similarity to the noisy target, bypassing the text prompt entirely. We address this with a high-noise-biased timestep distribution that forces the model to rely on the text prompt for speaker assignment. We evaluate ScenA on the CoVoMix2-Dialogue benchmark, showing that it outperforms existing multi-speaker systems on speaker-binding metrics while generating rich conversational audio with overlapping speech, emotional vocalizations, and ambient sound. Our results demonstrate the advantage of using a general-purpose audio model conditioned on a free-form scene description, rather than passing structured dialogue scripts through a speech-only pipeline.
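To make the idea of a high-noise-biased timestep distribution concrete, the following is a minimal sketch, not the authors' implementation. The abstract does not specify the distribution, so this sketch assumes a logit-normal sampler whose mean is shifted toward the noisy end. It also assumes a linear flow-matching interpolation in which t = 1 is pure noise. The function names and the shift value 1.5 are hypothetical and chosen only for illustration.

```python
import math
import random


def sample_timestep(rng, mean=0.0, std=1.0):
    """Draw a logit-normal timestep in (0, 1).

    Assumption for this sketch: t close to 1 means the target is mostly
    noise. A positive `mean` therefore biases sampling toward high noise.
    """
    z = rng.gauss(mean, std)
    return 1.0 / (1.0 + math.exp(-z))


def noisy_target(x0, noise, t):
    """Linear flow-matching interpolation: x_t = (1 - t) * x0 + t * noise."""
    return [(1.0 - t) * a + t * b for a, b in zip(x0, noise)]


def mean_timestep(mean, n=20000, seed=0):
    """Empirical average timestep under a given logit-normal mean."""
    rng = random.Random(seed)
    return sum(sample_timestep(rng, mean=mean) for _ in range(n)) / n


# Hypothetical comparison: an unshifted sampler vs. a high-noise-biased one.
baseline = mean_timestep(mean=0.0)
biased = mean_timestep(mean=1.5)

if __name__ == "__main__":
    print(f"average t, unshifted sampler: {baseline:.3f}")
    print(f"average t, high-noise-biased: {biased:.3f}")
```

Under these assumptions, the biased sampler spends more training steps on targets that are dominated by noise. At those steps the noisy target carries little acoustic evidence of which reference voice it matches, so the similarity cue behind the Reference Shortcut is weaker and the text prompt becomes the more informative signal for speaker assignment.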