Scoring Backends Matter More Than Pooling: A Systematic Study of Training-Free Anomalous Sound Detection under Domain Shift

2026-06-17

Sound
AI summary

The authors compare ways of detecting anomalous machine sounds without any additional training, using a frozen pretrained audio encoder. They find that the choice of scoring method (backend) matters far more than how frame-level features are pooled over time when the recording conditions shift. No single backend is best for every machine type, but a label-free fusion of backends outperforms any fixed single backend. They also report that choosing a backend with a proxy validation task built from artificial outliers does not work, because every backend saturates on that task.

anomalous sound detection, domain shift, audio encoder, temporal pooling, nearest-neighbor, Mahalanobis distance, k-nearest neighbors, PCA, score fusion, unsupervised learning
Authors
Jingwen Zhou, Mingzhe Wang
Abstract
Training-free anomalous sound detection (ASD) scores a test clip against a memory bank of normal embeddings from a frozen pretrained audio encoder. Recent work attributes domain-shift robustness mainly to how frame-level features are pooled over time; the scoring backend applied on top of the pooled embedding has received far less systematic attention. Using a single frozen BEATs encoder on the DCASE 2023 Task 2 development set (all seven machine types), we cross four classical backends -- nearest-neighbor cosine distance, Mahalanobis distance, locally density-normalized kNN, and PCA-subspace reconstruction residual -- with three temporal poolings (mean, GeM, max). Switching the backend moves target-domain AUC by 13.8 points on average (up to 53.8), whereas switching the pooling moves it by only 3.2 points: in this training-free regime, the backend, not the pooling, dominates domain-shift robustness. No backend wins everywhere, but the machine-dependent pattern reproduces on the DCASE 2025 development data (fan, bearing). Exploiting this, we propose a label-free score fusion that z-normalizes each backend with its training-bank self-scores and takes the minimum; it reaches a harmonic-mean target AUC of 63.3% versus 64.4% for the per-machine oracle, surpassing every fixed single backend while preserving source-domain accuracy. We also report a negative result: selecting a backend by source-domain pseudo-validation with proxy outliers fails, because all backends saturate on the proxy task.
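
The label-free fusion described in the abstract (z-normalize each backend with its training-bank self-scores, then take the minimum) can be sketched as follows. This is a minimal illustration and not the authors' implementation. The function names, the ridge term, the leave-one-out handling of nearest-neighbor self-scores, and the synthetic embeddings are all assumptions made here; only two of the four backends are shown, and scores are assumed to be oriented so that higher means more anomalous.

```python
import numpy as np


def nn_cosine_scores(bank, queries, exclude_self=False):
    """Cosine distance from each query to its nearest bank embedding."""
    b = bank / np.linalg.norm(bank, axis=1, keepdims=True)
    q = queries / np.linalg.norm(queries, axis=1, keepdims=True)
    dist = 1.0 - q @ b.T
    if exclude_self:
        # Assumption: self-scores are computed leave-one-out, so each bank
        # item ignores its trivial zero-distance match with itself.
        np.fill_diagonal(dist, np.inf)
    return dist.min(axis=1)


def mahalanobis_scores(bank, queries, ridge=1e-3):
    """Mahalanobis distance to the bank mean (ridge value is an assumption)."""
    mu = bank.mean(axis=0)
    cov = np.cov(bank, rowvar=False) + ridge * np.eye(bank.shape[1])
    prec = np.linalg.inv(cov)
    d = queries - mu
    sq = np.einsum("ij,jk,ik->i", d, prec, d)
    return np.sqrt(np.maximum(sq, 0.0))


def fuse_min_z(test_scores, self_scores, eps=1e-8):
    """Z-normalize each backend with its bank self-scores, then take the min."""
    z = []
    for name, s in test_scores.items():
        ref = self_scores[name]
        z.append((s - ref.mean()) / (ref.std() + eps))
    return np.min(np.stack(z, axis=0), axis=0)


# Synthetic stand-ins for pooled embeddings (not real BEATs features).
rng = np.random.default_rng(0)
bank = rng.normal(size=(200, 16))
test = np.vstack([rng.normal(size=(5, 16)),
                  rng.normal(loc=4.0, size=(5, 16))])

test_scores = {
    "nn": nn_cosine_scores(bank, test),
    "maha": mahalanobis_scores(bank, test),
}
self_scores = {
    "nn": nn_cosine_scores(bank, bank, exclude_self=True),
    "maha": mahalanobis_scores(bank, bank),
}
fused = fuse_min_z(test_scores, self_scores)
```

Because the statistics used for normalization come only from the normal training bank, no anomaly labels are needed. Under the higher-is-more-anomalous convention assumed here, taking the minimum yields a high fused score only when every backend assigns the clip a high normalized score.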