Semi-Supervised Sound Event Detection with Conditional Mixup and Embedding-Level Contrastive Loss

2026-06-29 | Artificial Intelligence

AI summary

The authors address sound event detection when labeled data are scarce but unlabeled data are plentiful. They extend an earlier pseudo-label-based fine-tuning method, ATST-SED, with a contrastive loss applied at the embedding level, which lets the model learn more from unlabeled audio. Mixup plays different roles in the two training objectives: pseudo-label learning treats a mix as a composition of two clips, while contrastive learning treats it as a perturbation of one clip. The authors reconcile the two with 'conditional mixup,' which combines both kinds of mixing in a single framework. The resulting model reaches 0.645 PSDS1 and 0.822 PSDS2 on the DESED validation set.

sound event detection, pseudo-labeling, semi-supervised learning, contrastive learning, embedding-level loss, mixup, self-supervised learning, audio foundation models, DESED dataset
Authors
Nian Shao, Xian Li, Xiaofei Li
Abstract
Sound event detection (SED) is a core module for acoustic environmental analysis, yet its performance is often limited by scarce labeled data. Recent systems leverage large pretrained audio foundation models, but effective fine-tuning remains challenging because labeled data are limited while unlabeled data are abundant. A previous work, ATST-SED, addressed this problem with a pseudo-label-based semi-supervised fine-tuning framework. In this work, we further improve the framework by adopting an embedding-level self-supervised contrastive loss inspired by ATST-Frame pretraining. This contrastive objective better exploits unlabeled data during fine-tuning. One challenge is that mixup serves different roles in the two objectives: pseudo-label learning uses composition mixup, while contrastive learning treats mixup as a perturbation. To resolve this mismatch, we propose conditional mixup, which combines composition mixup and perturbation mixup in one semi-supervised framework and defines the corresponding embedding-level contrastive losses. The resulting model achieves 0.645 PSDS1 and 0.822 PSDS2 on the DESED validation set, establishing a new state of the art.
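The two roles of mixup named in the abstract can be illustrated with a minimal sketch. It assumes the usual convex-combination form of mixup on frame-level features and labels; the function names (`mix`, `composition_mixup`, `perturbation_mixup`), the toy shapes, and the fixed mixing weight are hypothetical and are not taken from the paper. The sketch does not implement conditional mixup itself, since the abstract does not state the rule that combines the two variants.

```python
# Illustrative sketch only: names, shapes, and the mixing weight are assumptions.

def mix(a, b, lam):
    """Element-wise convex combination of two equal-shape 2-D lists."""
    return [[lam * u + (1.0 - lam) * v for u, v in zip(row_a, row_b)]
            for row_a, row_b in zip(a, b)]

def composition_mixup(x_a, y_a, x_b, y_b, lam):
    """Mix inputs AND targets: the mixture is treated as a new clip
    that contains the events of both source clips."""
    return mix(x_a, x_b, lam), mix(y_a, y_b, lam)

def perturbation_mixup(x_a, y_a, x_b, lam):
    """Mix inputs only: clip b acts as interference, and the target
    remains that of clip a."""
    return mix(x_a, x_b, lam), y_a

# Toy data: 2 frames x 3 features, and 2 frames x 2 event classes.
x_a = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
x_b = [[0.0, 0.0, 4.0], [8.0, 1.0, 2.0]]
y_a = [[1.0, 0.0], [1.0, 0.0]]   # event class 0 active in clip a
y_b = [[0.0, 1.0], [0.0, 0.0]]   # event class 1 active in the first frame of clip b
lam = 0.75                        # hypothetical mixing weight

x_comp, y_comp = composition_mixup(x_a, y_a, x_b, y_b, lam)
x_pert, y_pert = perturbation_mixup(x_a, y_a, x_b, lam)

print(x_comp == x_pert)  # the mixed inputs are identical
print(y_comp)            # targets are blended
print(y_pert)            # targets are unchanged from clip a
```

Both variants produce the same mixed input and differ only in the target they pair with it, which is the mismatch the abstract describes between the pseudo-label objective and the contrastive objective.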