Dual-Granularity Orthogonal Disentanglement for Generalizable Audio Deepfake Detection
2026-06-15 • Sound
Sound, Artificial Intelligence
AI summary
The authors noticed that audio deepfake detectors often get confused by differences in speaker voices rather than recognizing fake sounds. To fix this, they created a method that separates speaker identity from fake sound features by making these two types of information independent in the model. Their approach gradually increases this separation without needing extra complex parts or unstable training tricks. Tests on various datasets showed their method improved detection accuracy, especially when tested on new datasets. Overall, the authors' approach helps deepfake detectors focus on fake sound clues instead of speaker voices.
audio deepfake detection, speaker identity, feature disentanglement, orthogonality constraint, cosine orthogonality, cross-covariance regularization, equal error rate, cross-dataset generalization
Authors
Zhuodong Liu, Hugen Lv, Xiangyu Li, Chunhong Yuan
Abstract
Audio deepfake detectors often fail to generalize across speakers because they learn speaker-identity features rather than synthesis artifacts, a failure mode known as implicit identity leakage. Existing methods address this problem at the cost of architectural complexity or training instability. This paper proposes a dual-granularity orthogonal disentanglement framework that enforces feature independence at two levels: sample-level cosine orthogonality captures directional decorrelation, while batch-level cross-covariance regularization eliminates linear correlations across embedding dimensions. A curriculum disentanglement schedule progressively strengthens the orthogonality constraint without auxiliary networks or adversarial dynamics. Experiments on the ASVspoof 2019 LA, ASVspoof 2021 DF, and In-the-Wild datasets show that the proposed method achieves equal error rates (EER) of 1.35%, 7.88%, and 21.58%, respectively, surpassing gradient reversal disentanglement by 2.60% absolute on cross-dataset transfer.
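The abstract does not give the loss formulas, so the following is only a minimal sketch of how the two orthogonality terms and a curriculum weight could be combined. The squared-cosine penalty, the mean-squared cross-covariance penalty, the linear ramp, the equal weighting of the two terms, and every function and parameter name are assumptions made for illustration, not the authors' implementation. Plain Python lists stand in for the speaker and artifact embedding batches; a real training loop would use an autodiff framework.

```python
import math


def cosine_orthogonality_loss(spk, art, eps=1e-8):
    """Sample-level term (assumed form): mean squared cosine similarity
    between each speaker embedding and its paired artifact embedding."""
    total = 0.0
    for s, a in zip(spk, art):
        dot = sum(x * y for x, y in zip(s, a))
        norm = math.sqrt(sum(x * x for x in s)) * math.sqrt(sum(y * y for y in a))
        total += (dot / (norm + eps)) ** 2
    return total / len(spk)


def cross_covariance_loss(spk, art):
    """Batch-level term (assumed form): mean squared entry of the
    cross-covariance matrix between the two embedding sets, using the
    biased (divide-by-n) covariance estimate."""
    n = len(spk)
    ds, da = len(spk[0]), len(art[0])
    mean_s = [sum(row[j] for row in spk) / n for j in range(ds)]
    mean_a = [sum(row[k] for row in art) / n for k in range(da)]
    total = 0.0
    for j in range(ds):
        for k in range(da):
            cov = sum(
                (spk[i][j] - mean_s[j]) * (art[i][k] - mean_a[k]) for i in range(n)
            ) / n
            total += cov ** 2
    return total / (ds * da)


def curriculum_weight(step, ramp_steps, max_weight=1.0):
    """Hypothetical schedule: linear ramp from 0 to max_weight over
    ramp_steps training steps, then held constant."""
    return max_weight * min(1.0, step / ramp_steps)


def disentanglement_loss(spk, art, step, ramp_steps, max_weight=1.0):
    """Curriculum-weighted sum of the two terms (equal weighting assumed)."""
    w = curriculum_weight(step, ramp_steps, max_weight)
    return w * (cosine_orthogonality_loss(spk, art) + cross_covariance_loss(spk, art))
```

In this sketch the returned value would be added to the detector's classification loss. Both terms are zero when the paired embeddings are orthogonal per sample and linearly uncorrelated across the batch, and the ramp keeps the penalty small early in training.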