Robust Spoofed Speech Detection via Temporal Pyramid Modeling

2026-06-15 • Computer Vision and Pattern Recognition

Computer Vision and Pattern Recognition • Artificial Intelligence • Sound

AI summary

The authors address the problem of detecting fake or spoofed speech, which is becoming harder with advanced methods that imitate real voices. They propose a new model called the Temporal Pyramid Adapter that looks at speech patterns at different time scales to spot subtle signs of fakeness. They combine this with self-supervised speech representations and test their method on several datasets, showing improved accuracy compared to earlier models. However, they note that the system still struggles when speech comes from different languages or domains, indicating more work is needed to handle these variations.

Spoofed speech detection • Temporal convolution • Multi-scale temporal modeling • Self-supervised learning • XLS-R • Mel spectrogram • EER (Equal Error Rate) • ASVspoof dataset • Voice conversion • Domain adaptation

Authors

Mahtab Masoudi Nezhad, Nima Karimian

Abstract

Spoofed speech detection is increasingly challenged by realistic synthesis, voice conversion, and replay attacks, with cross-dataset generalization remaining a major limitation. In this work, we propose a Temporal Pyramid Adapter that uses parallel temporal convolutions with varying receptive fields to capture multi-scale spoofing cues, ranging from local artifacts to global prosodic irregularities. We also integrate self-supervised XLS-R representations with front-end adapters, including Mel, Sinc, and a Temporal Pyramid design for multi-scale temporal modeling. The proposed model is evaluated across multiple benchmarks, including the ASVspoof 2017, ASVspoof 2021 (DF/LA), PartialSpoof, DiffSSD, and multilingual HQ-MPSD datasets. Experimental results demonstrate that the Temporal Pyramid model obtains an AUC of 99.24% and an EER of 3.87% on the PartialSpoof database, significantly outperforming the base model and several state-of-the-art baselines such as LCNN-BLSTM (9.87% EER) and TRACE (8.08% EER). Additionally, multilingual evaluations confirm that spoofing artifacts are independent of language. While self-supervised representations improve robustness, performance still degrades under domain and language shifts, highlighting the need for better adaptation and calibration strategies.
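
To make the idea of parallel temporal convolutions with varying receptive fields concrete, the following is a minimal NumPy sketch. It is not the authors' implementation: the function names (temporal_conv1d, receptive_field, temporal_pyramid), the kernel size, the dilation rates, the ReLU nonlinearity, the concatenation-based fusion, and the random weights and feature dimensions are all assumptions chosen for illustration. The abstract does not specify how the branches differ or how they are fused.

```python
import numpy as np

def temporal_conv1d(x, kernel, dilation=1):
    """'Same'-padded 1-D convolution along time.

    x:      (T, C_in) frame-level features
    kernel: (K, C_in, C_out) weights, K assumed odd
    """
    k, _, c_out = kernel.shape
    pad = dilation * (k - 1) // 2
    x_padded = np.pad(x, ((pad, pad), (0, 0)))
    t = x.shape[0]
    out = np.zeros((t, c_out))
    for i in range(k):
        # Tap i reads the input shifted by i * dilation frames.
        out += x_padded[i * dilation : i * dilation + t] @ kernel[i]
    return out

def receptive_field(kernel_size, dilation):
    """Number of input frames one output frame can see in a single branch."""
    return 1 + (kernel_size - 1) * dilation

def temporal_pyramid(x, kernels, dilations):
    """Run the branches in parallel and concatenate along the channel axis.

    Small dilations respond to local artifacts; large dilations cover
    longer spans such as prosodic patterns (assumed fusion: concatenation).
    """
    branches = [
        np.maximum(temporal_conv1d(x, w, d), 0.0)  # ReLU, an assumption
        for w, d in zip(kernels, dilations)
    ]
    return np.concatenate(branches, axis=1)

# Hypothetical input: 50 frames of 8-dim features standing in for
# self-supervised (e.g. XLS-R) frame embeddings.
rng = np.random.default_rng(0)
frames = rng.standard_normal((50, 8))

dilations = (1, 4, 16)                      # assumed pyramid levels
kernels = [rng.standard_normal((3, 8, 4)) * 0.1 for _ in dilations]

fused = temporal_pyramid(frames, kernels, dilations)
print(fused.shape)                                   # (50, 12)
print([receptive_field(3, d) for d in dilations])    # [3, 9, 33]
```

In this sketch each branch keeps the time resolution of the input, so the fused output can be passed to a frame-level or pooled classifier. Widening the receptive field through dilation rather than larger kernels keeps the parameter count per branch constant, which is one common design choice among several.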
