Unlocking In-Context Learning in Audio-Language Models from Decentralized Medical Audio

2026-06-22 • Machine Learning

Machine Learning • Sound

AI summary

The authors developed Federated Self-Contextualization (FSC), a method for diagnosing medical conditions from clinical sounds such as breathing or heartbeats using very few labeled examples. The approach groups similar audio recordings without needing many labels and uses a language model to diagnose a new sound by comparing it with known examples. The model was trained across multiple hospitals without sharing sensitive data and tested on respiratory and heart conditions, where it achieved higher accuracy than earlier audio-language methods. This suggests the method can work in settings with limited medical data.

Federated learning, Clinical audio diagnosis, Self-contextualization, Multimodal language models, Unsupervised clustering, In-context learning, Support-query pairs, Episodic training, Respiratory conditions, Cardiac conditions

Authors

Ran Piao, Tsai-Ning Wang, Martijn den Dekker, Linda Moonen, Hareld Kemps, Yuan Lu, Aaqib Saeed

Abstract

Clinical audio diagnosis in low-resource settings requires models that identify conditions from minimal examples without large annotated corpora. We propose Federated Self-Contextualization (FSC), a multimodal language model framework for in-context clinical audio diagnosis across federated hospital clients. FSC constructs pseudo-label episodes via unsupervised clustering of audio representations, bypassing scarce real diagnostic labels, and enables contextual reasoning from support-query pairs. Our progressive three-stage pipeline first aligns audio embeddings with the language model via caption-based pretraining, then adapts it for episodic in-context inference through federated optimization. At test time, given a small labeled support set, the model diagnoses an unseen query through multimodal reasoning. On held-out respiratory and cardiac conditions, FSC achieves 71.6% accuracy in 2-way 2-shot evaluation, outperforming audio-language baselines by over 9%.
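The pseudo-label episode idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not name the clustering algorithm, so plain k-means is assumed here, and the 2-D "embeddings" are synthetic stand-ins for real audio representations. The function names `kmeans` and `sample_episode` are hypothetical. The sketch clusters unlabeled embeddings, treats cluster IDs as pseudo-labels, and draws a 2-way 2-shot support set plus one query.

```python
import random
from collections import defaultdict


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iters=20, seed=0):
    """Plain k-means (an assumed stand-in for the unspecified clustering step).

    Returns one cluster index per point.
    """
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    assign = [0] * len(points)
    for _ in range(iters):
        # Assignment step: attach each point to its nearest centroid.
        for i, p in enumerate(points):
            assign[i] = min(range(k), key=lambda c: squared_distance(p, centroids[c]))
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [points[i] for i in range(len(points)) if assign[i] == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assign


def sample_episode(assign, n_way=2, k_shot=2, seed=0):
    """Build one episode from cluster assignments used as pseudo-labels.

    Returns (support, query): support is a list of (index, pseudo_label)
    pairs with k_shot items per class; query is a single (index, pseudo_label).
    """
    rng = random.Random(seed)
    by_cluster = defaultdict(list)
    for idx, c in enumerate(assign):
        by_cluster[c].append(idx)
    # A cluster needs k_shot support items plus one spare for a possible query.
    eligible = [c for c, idxs in by_cluster.items() if len(idxs) >= k_shot + 1]
    if len(eligible) < n_way:
        raise ValueError("not enough sufficiently large clusters for this episode")
    support, query_pool = [], []
    for pseudo_label, c in enumerate(rng.sample(eligible, n_way)):
        picked = rng.sample(by_cluster[c], k_shot + 1)
        support += [(i, pseudo_label) for i in picked[:k_shot]]
        query_pool.append((picked[k_shot], pseudo_label))
    return support, rng.choice(query_pool)


# Synthetic 2-D "embeddings": two well-separated blobs, 10 points each.
data_rng = random.Random(0)
embeddings = [[data_rng.gauss(0, 0.1), data_rng.gauss(0, 0.1)] for _ in range(10)]
embeddings += [[data_rng.gauss(5, 0.1), data_rng.gauss(5, 0.1)] for _ in range(10)]

assign = kmeans(embeddings, k=2)
support, query = sample_episode(assign, n_way=2, k_shot=2)
print("support:", support)
print("query:", query)
```

In a setup like the one the abstract describes, each hospital client would presumably build such episodes locally from its own unlabeled recordings, so no diagnostic labels or raw audio need to leave the site. At test time the pseudo-labels would be replaced by a small set of real labeled support examples.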
