What Makes Synthetic Speech Sound Sarcastic? A Prosody-Controlled Perception Study

2026-06-08

Sound
AI summary

The authors studied which prosodic cues help listeners recognize sarcasm. They used a neural text-to-speech system with prompt-based control to vary loudness, pitch variation, and speech rate independently, which is hard to do with natural recordings because these cues tend to change together. Human listeners relied mainly on loudness to identify sarcasm, while an audio-capable foundation model gave more weight to speech rate. The approach shows how controllable synthetic speech can reveal which prosodic features matter most for perceiving sarcasm.

prosody, sarcasm perception, text-to-speech (TTS), speech rate, pitch variation, loudness, acoustic cues, speech perception, neural networks, foundation model
Authors
Zhu Li, Shekhar Nayak, Matt Coler
Abstract
Prosody plays a central role in sarcasm perception, yet previous studies have relied on naturally produced speech that lacks fine-grained control over individual acoustic dimensions. As prosodic cues co-vary in natural data, isolating their independent contributions remains challenging. We introduce a controlled framework using neural text-to-speech (TTS) with prompt-based prosodic conditioning to manipulate speech rate, pitch variation, and loudness. An orthogonal stimulus set was constructed to enable causal testing of prosodic cue effects. Human listeners rated sarcasm and naturalness, and their judgments were compared with predictions from a foundation model capable of processing audio input. Results show that loudness primarily drives human sarcasm perception, whereas the model assigns greater weight to speech rate, leading to distinct cue-weighting patterns. This study shows how controllable neural TTS enables investigation of prosodic cue weighting in speech perception.
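
As an illustration of what an orthogonal stimulus set over the three manipulated cues could look like, the following minimal Python sketch enumerates a full factorial design and renders each condition as a text prompt. The three levels per cue, the names `LEVELS`, `CUES`, `build_conditions`, and `to_prompt`, and the prompt wording are all assumptions made for this example; the abstract does not specify the number of levels or the prompt format the authors used.

```python
from itertools import product

# Hypothetical levels: the abstract does not say how many levels each cue had.
LEVELS = ("low", "neutral", "high")
# The three cues named in the abstract.
CUES = ("speech_rate", "pitch_variation", "loudness")


def build_conditions(cues=CUES, levels=LEVELS):
    """Return every combination of cue levels (a full factorial design)."""
    return [dict(zip(cues, combo)) for combo in product(levels, repeat=len(cues))]


def to_prompt(condition):
    """Render one condition as a style prompt (illustrative wording only)."""
    parts = [f"{cue.replace('_', ' ')}: {level}" for cue, level in condition.items()]
    return "Speak with " + ", ".join(parts) + "."


conditions = build_conditions()
```

In a full factorial design every level of one cue is paired equally often with every level of the others, so the cues are uncorrelated across the stimulus set and the effect of each can be estimated separately from listener ratings.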