Voice Biomarkers for Depression and Anxiety

2026-05-11 | Machine Learning

Machine Learning, Artificial Intelligence, Sound
AI summary

The authors worked on detecting depression and anxiety from speech using deep learning, rather than traditional methods that rely on hand-picked speech features. They trained their model on a large collection of speech samples from more than 23,000 people to learn patterns linked to mental health. The model identifies informative speech markers without depending on what is said, and it performs better when its output is combined with text features extracted from the audio. Tested on about 5,000 people, the model achieved a balanced 71% sensitivity and specificity. The authors also released their best-performing model publicly to help other researchers.
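For context, sensitivity and specificity are the recall rates on the screen-positive and screen-negative groups, and balanced accuracy is their mean. The sketch below shows how such figures can be computed for a binary screen; the function and variable names are illustrative and not taken from the paper.

```python
import numpy as np

def screening_metrics(y_true: np.ndarray, y_pred: np.ndarray):
    """Sensitivity, specificity, and balanced accuracy for a binary mental-health screen."""
    tp = np.sum((y_true == 1) & (y_pred == 1))  # screen-positive, predicted positive
    fn = np.sum((y_true == 1) & (y_pred == 0))  # screen-positive, missed
    tn = np.sum((y_true == 0) & (y_pred == 0))  # screen-negative, predicted negative
    fp = np.sum((y_true == 0) & (y_pred == 1))  # screen-negative, false alarm
    sensitivity = tp / (tp + fn)                # recall on the positive class
    specificity = tn / (tn + fp)                # recall on the negative class
    balanced_accuracy = (sensitivity + specificity) / 2
    return sensitivity, specificity, balanced_accuracy
```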

deep learning, speech signals, paralinguistic features, biomarkers, machine learning, depression detection, anxiety detection, sensitivity, specificity, lexical features
Authors
Oleksii Abramenko, Noah D. Stein, Colin Vaz
Abstract
Current approaches to detecting depression and anxiety from speech primarily rely on machine learning techniques that use hand-engineered paralinguistic features and related acoustic descriptors derived from time- and frequency-domain representations of speech signals. Applying deep learning methods directly to raw speech signals has the potential to produce biomarker representations with substantially greater predictive power. However, these approaches typically require large volumes of carefully annotated data to learn robust and clinically meaningful representations of the underlying biomarkers. In this paper, we describe our efforts toward developing a deep learning model trained on a large-scale proprietary dataset comprising ~65,000 utterances collected from more than 23,000 subjects representative of relevant United States demographics. We present the techniques employed and analyze their impact on model performance. Our results demonstrate that the proposed models can extract content-agnostic biomarker information, which, when combined with lexical features extracted from the audio, yields improved predictive performance in production settings. Our models are evaluated on ~5,000 unique subjects and achieve 71% sensitivity and specificity. To foster further research in mental health assessment from speech, we release the best-performing model described in this paper on HuggingFace.
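As a rough illustration of the approach described in the abstract, the sketch below assumes a pretrained speech encoder available on HuggingFace (the identifier used is a generic stand-in, not the model released with this paper) and transcripts obtained, for example, via ASR for the lexical features. It mean-pools the encoder's hidden states into a content-agnostic utterance embedding, concatenates it with lexical features, and fits a simple classifier; it is a minimal sketch under these assumptions, not the authors' implementation.

```python
import numpy as np
import torch
from transformers import AutoFeatureExtractor, AutoModel
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Stand-in encoder; substitute the model released by the authors on HuggingFace.
MODEL_ID = "facebook/wav2vec2-base"
extractor = AutoFeatureExtractor.from_pretrained(MODEL_ID)
encoder = AutoModel.from_pretrained(MODEL_ID).eval()

def acoustic_embedding(waveform: np.ndarray, sr: int = 16_000) -> np.ndarray:
    """Mean-pooled hidden states used as a content-agnostic utterance embedding."""
    inputs = extractor(waveform, sampling_rate=sr, return_tensors="pt")
    with torch.no_grad():
        hidden = encoder(**inputs).last_hidden_state   # shape: (1, frames, dim)
    return hidden.mean(dim=1).squeeze(0).numpy()

def fuse_and_fit(waveforms, transcripts, labels):
    """Concatenate acoustic embeddings with lexical (TF-IDF) features and fit a classifier."""
    acoustic = np.stack([acoustic_embedding(w) for w in waveforms])
    vectorizer = TfidfVectorizer(max_features=2_000)
    lexical = vectorizer.fit_transform(transcripts).toarray()
    features = np.hstack([acoustic, lexical])
    clf = LogisticRegression(max_iter=1_000).fit(features, labels)
    return clf, vectorizer
```

In practice, the released model and its training setup would replace the stand-in encoder and the simple fusion classifier shown here.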