Semi-Supervised Speech Confidence Detection using Pseudo-Labelling and Whisper Embeddings

2026-06-15 • Sound

Sound, Machine Learning

AI summary

The authors created a new way to tell how confident someone sounds when they speak, which is helpful for teachers giving feedback. They combined traditional speech clues like pitch and speed with advanced sound features from a tool called Whisper. To get more data to train their system, they used a method where the model creates extra labels to learn from. Their approach mixes all these details and reaches 75% accuracy in detecting confidence. This work helps improve tools that support learning and speaking skills.

speaker confidence, Whisper encoder, pseudo-labelling, speech features, co-attention mechanism, pitch, speech rate, disfluencies, stress patterns, personalised feedback

Authors

Adam Wynn, Jingyun Wang, Xiangyu Tan

Abstract

Understanding speaker confidence is crucial in educational settings, as it can enhance personalised feedback and improve learning outcomes. This study introduces a novel framework for detecting speaker confidence by integrating human-engineered features with embeddings from the Whisper encoder. To address data limitations, a pseudo-labelling technique is employed to expand the labelled dataset, allowing the model to learn from both human-annotated and model-generated labels. The framework combines traditional speech features (pitch, volume, rate of speech, and the presence of disfluencies and stress) with Whisper embeddings, fusing the two representations through a co-attention mechanism. It achieves an overall accuracy of 75%. This study contributes to advancing speech analysis, enabling applications that support personalised learning and speaking skill development.
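The pseudo-labelling step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the nearest-centroid classifier stands in for their co-attention model, the feature vectors stand in for the fused Whisper and hand-engineered features, and the function names and the 0.9 confidence threshold are assumptions chosen for illustration.

```python
import numpy as np


def fit_centroids(X, y, n_classes):
    """Stand-in classifier: one mean feature vector per class.

    Assumes every class has at least one labelled example.
    """
    return np.stack([X[y == c].mean(axis=0) for c in range(n_classes)])


def predict_proba(centroids, X):
    """Softmax over negative squared distances to each class centroid."""
    dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    logits = -dists
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    probs = np.exp(logits)
    return probs / probs.sum(axis=1, keepdims=True)


def pseudo_label(X_lab, y_lab, X_unlab, n_classes=2, threshold=0.9):
    """One round of pseudo-labelling (threshold value is an assumption).

    1. Fit on the human-annotated data.
    2. Predict on unlabelled data and keep only confident predictions.
    3. Refit on human-annotated plus model-generated labels.
    """
    centroids = fit_centroids(X_lab, y_lab, n_classes)
    probs = predict_proba(centroids, X_unlab)
    keep = probs.max(axis=1) >= threshold
    X_all = np.concatenate([X_lab, X_unlab[keep]])
    y_all = np.concatenate([y_lab, probs[keep].argmax(axis=1)])
    return fit_centroids(X_all, y_all, n_classes), keep
```

In this sketch, unlabelled samples whose top predicted probability falls below the threshold are simply left out of the expanded training set, which limits how much label noise the model-generated labels introduce.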
