Rethinking Weak Supervision in Anomaly Detection: A Comprehensive Benchmark

2026-05-25 • Machine Learning

Machine Learning, Artificial Intelligence

AI summary

The authors created WSADBench, a benchmark that compares weakly supervised anomaly detection (WSAD) methods under a single evaluation protocol. They tested 36 algorithms across 4 data modalities while varying label quantity, granularity, and quality. Their findings show that the different weak supervision scenarios are strongly correlated rather than independent, and that general classification methods and tabular foundation models overtake specialized WSAD methods once more labeled data is available or in out-of-distribution settings. They also found that unlabeled data helps inconsistently and less than refining the labels does, and that models respond asymmetrically to different kinds of label noise. WSADBench is released as open source, with code and datasets, to support future research.

Weakly Supervised Anomaly Detection, Label Scarcity, Label Noise, Out-of-Distribution (OOD), Tabular Foundation Models, Benchmarking, Label Quality, Anomaly Detection, Unlabeled Data, Supervision Levels

Authors

Xu Yao, Siyuan Zhou, Wu Zhenbo, Chaochuan Hou, Shuang Liang, Shiping Wang, Hailiang Huang, Songqiao Han, Minqi Jiang

Abstract

Weakly supervised anomaly detection (WSAD) has developed in three primary directions: incomplete, inexact, and inaccurate supervision. However, these directions remain isolated, lacking a unified framework to assess whether they address unique challenges or share fundamental mechanics. This paper introduces WSADBench, the first benchmark that unifies evaluation across distinct weakly supervised scenarios, benchmarking diverse approaches from specialized WSAD methods to advanced tabular foundation models. WSADBench establishes standardized protocols to evaluate 36 algorithms across 4 modalities by systematically varying label quantity, granularity, and quality, revealing the performance boundaries of various methods. Based on over 700K experiments, WSADBench reveals four critical insights: (i) Strong intrinsic correlations exist between these weak supervision scenarios, challenging the isolation of current research directions. (ii) Specialized WSAD algorithms excel only in extreme label-scarcity regimes but are quickly dominated by tabular foundation models and general classification methods as supervision increases or in OOD scenarios. (iii) Unlabeled data shows inconsistent utility across settings, with marginal gains compared to label refinement. (iv) Models exhibit asymmetric sensitivity to different types of label noise. We release WSADBench as an open-source benchmark with code and datasets to facilitate future WSAD research: https://github.com/SUFE-AILAB/WSADBench.
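
To make the three supervision types concrete, here is a minimal Python sketch of how each one could be simulated from a fully labeled binary vector (1 = anomaly, 0 = normal). It is an illustration under stated assumptions, not WSADBench's actual protocol or API: the function names, parameters, and toy labels are all hypothetical, and the mapping used here (incomplete = few labeled anomalies, inexact = bag-level labels, inaccurate = flipped labels) follows the common reading of these terms rather than anything specified in the abstract.

```python
import random


def incomplete_labels(y, n_labeled, seed=0):
    """Incomplete supervision (label quantity): keep the label of only
    n_labeled anomalies; every other instance becomes unlabeled (None)."""
    rng = random.Random(seed)
    anomaly_idx = [i for i, label in enumerate(y) if label == 1]
    keep = set(rng.sample(anomaly_idx, min(n_labeled, len(anomaly_idx))))
    return [1 if i in keep else None for i in range(len(y))]


def inexact_labels(y, bag_size):
    """Inexact supervision (label granularity): one coarse label per bag,
    positive if any instance inside the bag is anomalous."""
    return [max(y[i:i + bag_size]) for i in range(0, len(y), bag_size)]


def inaccurate_labels(y, miss_rate, false_alarm_rate, seed=0):
    """Inaccurate supervision (label quality): flip anomalies to normal with
    probability miss_rate and normals to anomalies with probability
    false_alarm_rate. Separate rates allow asymmetric noise."""
    rng = random.Random(seed)
    noisy = []
    for label in y:
        rate = miss_rate if label == 1 else false_alarm_rate
        noisy.append(1 - label if rng.random() < rate else label)
    return noisy


# Hypothetical toy ground truth: 12 instances, 3 anomalies.
y_true = [0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 1]

y_incomplete = incomplete_labels(y_true, n_labeled=2)
y_inexact = inexact_labels(y_true, bag_size=4)
y_inaccurate = inaccurate_labels(y_true, miss_rate=0.5, false_alarm_rate=0.0)
```

With the toy vector above, `y_inexact` is `[1, 0, 1]`: the first and third bags of four contain an anomaly and the second does not. Keeping `miss_rate` and `false_alarm_rate` as separate arguments makes it possible to vary missed anomalies and false alarms independently, which is one way to probe the asymmetric noise sensitivity described in insight (iv).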
