Stabilizing Short Duration Speaker Verification through Neural Re-scoring with Hybrid Enrollment

2026-06-15

Sound
AI summary

The authors address speaker verification when the test audio is very short, typically under three seconds. They build VoxPhrase, a large-scale corpus automatically segmented from VoxCeleb, to study the problem. They find that enrollment with matching phrases (text-dependent) is limited by the short duration and yields unstable speaker representations, whereas enrollment without matching phrases (text-independent) suffers from content mismatch but becomes more stable as the enrollment duration grows. They therefore propose a neural re-scoring framework that combines both enrollment types and compares them with the test utterance at the frame level via parallel cross-attention. Experiments on VoxPhrase show consistent improvements across multiple speaker models.

short-duration speaker verification, VoxCeleb dataset, text-dependent enrollment, text-independent enrollment, speaker representation, keyword spotting, neural re-scoring, cross-attention, speaker models, frame-level comparison
Authors
Zhiqi Ai, Han Cheng, Shiyi Mu, Zhiyong Chen, Yongjin Zhou, Shugong Xu
Abstract
Short-duration speaker verification (SDSV) is crucial for personalized keyword spotting, where test utterances are typically shorter than three seconds. Limited speech duration results in unstable speaker representations and increased sensitivity to noise and phoneme variations, thereby degrading performance. To investigate this issue, we construct VoxPhrase, a large-scale SDSV corpus automatically segmented from the VoxCeleb dataset. Our analysis shows that text-dependent (TD) enrollment is constrained by duration and yields unstable speaker representations. In contrast, although text-independent (TI) enrollment introduces content mismatch, its representations become more stable as the enrollment duration increases. Accordingly, we propose a hybrid-enrollment neural re-scoring framework that combines TD and TI enrollment and performs frame-level comparison via parallel cross-attention. Experiments on VoxPhrase demonstrate consistent improvements across multiple speaker models.
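To make the idea of frame-level comparison against two enrollment sources more concrete, the following is a minimal illustrative sketch, not the authors' implementation. It assumes frame-level features are already available as lists of float vectors, uses plain scaled dot-product attention with no learned projections, and fuses the two branches with a fixed weight. The names `cross_attend`, `branch_score`, `hybrid_score`, and `td_weight` are hypothetical; the paper's re-scoring network is learned and its architecture is not specified in the abstract.

```python
import math
import random


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cosine(a, b):
    # Cosine similarity; returns 0.0 if either vector has zero norm.
    na, nb = math.sqrt(dot(a, a)), math.sqrt(dot(b, b))
    return dot(a, b) / (na * nb) if na > 0 and nb > 0 else 0.0


def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(queries, keys):
    # For each query (test) frame, build an attention-weighted mix of
    # key (enrollment) frames. Assumption: no learned Q/K/V projections.
    scale = math.sqrt(len(queries[0]))
    attended = []
    for q in queries:
        weights = softmax([dot(q, k) / scale for k in keys])
        mixed = [sum(w * k[d] for w, k in zip(weights, keys))
                 for d in range(len(q))]
        attended.append(mixed)
    return attended


def branch_score(test_frames, enroll_frames):
    # Mean cosine similarity between each test frame and its attended
    # enrollment summary.
    attended = cross_attend(test_frames, enroll_frames)
    sims = [cosine(t, a) for t, a in zip(test_frames, attended)]
    return sum(sims) / len(sims)


def hybrid_score(test_frames, td_frames, ti_frames, td_weight=0.5):
    # Two parallel branches: text-dependent (TD) and text-independent (TI)
    # enrollment. Assumption: fixed linear fusion instead of a learned one.
    td = branch_score(test_frames, td_frames)
    ti = branch_score(test_frames, ti_frames)
    return td_weight * td + (1.0 - td_weight) * ti


# Usage with random stand-in features (4-dimensional frames).
rng = random.Random(0)
make_frames = lambda n: [[rng.gauss(0, 1) for _ in range(4)] for _ in range(n)]
score = hybrid_score(make_frames(5), make_frames(5), make_frames(20))
```

In this sketch the TI branch simply sees more enrollment frames than the TD branch, mirroring the abstract's point that TI enrollment can draw on longer audio; how the actual framework weighs the two sources is learned rather than fixed.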