Generating Synthetic Doctor-Patient Conversations for Long-form Audio Summarization
2026-04-07 • Sound • Artificial Intelligence
AI summary
The authors created a new synthetic dataset to help AI systems understand and work with long audio conversations, especially doctor-patient talks. They generated realistic simulated recordings with multiple speakers and background sounds, paired with summary notes in the clinical SOAP format. The dataset can be used both to train AI models and to measure how well they perform on this task. They also found that, with current models, breaking the problem into steps works better than trying to solve it all at once.
long-context audio reasoning • synthetic data generation • doctor-patient conversation • SOAP notes • multi-speaker audio synthesis • language models • audio evaluation • cascaded models • end-to-end models
Authors
Yanis Labrak, David Grünert, Séverin Baroudi, Jiyun Chun, Pawel Cyrta, Sergio Burdisso, Ahmed Hassoon, David Liu, Adam Rothschild, Reed Van Deusen, Petr Motlicek, Andrew Perrault, Ricard Marxer, Thomas Schaaf
Abstract
Long-context audio reasoning is underserved in both training data and evaluation. Existing benchmarks target short-context tasks, and the open-ended generation tasks most relevant to long-context reasoning pose well-known challenges for automatic evaluation. We propose a synthetic data generation pipeline designed to serve both as a training resource and as a controlled evaluation environment, and instantiate it for first-visit doctor-patient conversations with SOAP note generation as the task. The pipeline has three stages: (1) persona-driven dialogue generation; (2) multi-speaker audio synthesis with overlap/pause modeling, room acoustics, and sound events; and (3) LLM-based reference SOAP note production, all built entirely on open-weight models. We release 8,800 synthetic conversations with 1.3k hours of corresponding audio and reference notes. Evaluating current open-weight systems, we find that cascaded approaches still substantially outperform end-to-end models.
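The abstract describes the pipeline stages and the cascaded-versus-end-to-end comparison only at a high level. The sketch below is one illustrative reading of that structure, not the authors' code: every name, prompt string, and interface here (`generate_dialogue`, `tts`, `asr`, `audio_llm`, the "Speaker: utterance" line format) is a hypothetical placeholder.

```python
# Illustrative sketch of the three-stage pipeline and the two evaluated
# system styles. All names, prompt strings, and interfaces below are
# hypothetical placeholders, not the authors' implementation.
from dataclasses import dataclass


@dataclass
class Persona:
    role: str     # e.g. "doctor" or "patient"
    profile: str  # free-text description used to condition the dialogue LLM


@dataclass
class Turn:
    speaker: str
    text: str


def generate_dialogue(personas: list[Persona], llm) -> list[Turn]:
    """Stage 1: persona-driven dialogue generation with an open-weight LLM.

    `llm` is any callable mapping a prompt string to generated text; the
    prompt format and the "Speaker: utterance" output convention are
    assumptions made for this sketch.
    """
    prompt = "Write a first-visit doctor-patient conversation.\n" + "\n".join(
        f"{p.role}: {p.profile}" for p in personas
    )
    turns = []
    for line in llm(prompt).splitlines():
        speaker, _, text = line.partition(":")
        if text:
            turns.append(Turn(speaker.strip(), text.strip()))
    return turns


def synthesize_audio(turns, tts, overlap_model, room_sim, event_mixer):
    """Stage 2: multi-speaker audio synthesis.

    Each argument is a placeholder callable standing in for a real
    component: per-speaker TTS, overlap/pause timing, room-acoustics
    simulation, and background sound-event mixing.
    """
    clips = [tts(t.speaker, t.text) for t in turns]
    timeline = overlap_model(clips)   # lay out turns with pauses/overlaps
    timeline = room_sim(timeline)     # apply room acoustics
    return event_mixer(timeline)      # add background sound events


def write_reference_note(turns, llm) -> str:
    """Stage 3: LLM-based reference SOAP note production from the script."""
    transcript = "\n".join(f"{t.speaker}: {t.text}" for t in turns)
    return llm("Write a SOAP note for this conversation:\n" + transcript)


def cascaded_soap(audio, asr, llm) -> str:
    """Cascaded system: transcribe first, then summarize the transcript."""
    return llm("Write a SOAP note for this conversation:\n" + asr(audio))


def end_to_end_soap(audio, audio_llm) -> str:
    """End-to-end system: an audio-language model maps audio to the note."""
    return audio_llm(audio, "Write a SOAP note for this conversation.")
```

One way to read the final finding: the cascade first reduces long-form audio to text via transcription, turning SOAP note generation into a text-summarization problem that current open-weight LLMs handle well, whereas an end-to-end audio model must solve both steps at once.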