Mapping Whisper Representations to Human ECoG Responses with Interpretable Time-Resolved Neural Encoding

2026-06-01 | Human-Computer Interaction
AI summary

The authors studied how well the internal representations of Whisper, a speech AI model, predict brain activity recorded with ECoG while people listen to natural speech. They built a neural encoder that maps each Whisper layer to the brain signals over time, and found that the middle layers align best with cortical speech processing. Their temporally structured model predicts the recordings better than linear mappings from the same representations, and among the electrodes that are informative for encoding, phoneme categories are organized in an anatomically coherent way. Overall, the authors suggest that speech AI models like Whisper can help in studying how the brain processes spoken language over time.

Speech foundation models, Whisper, ECoG (electrocorticography), Neural encoder, Temporal model, Soft attention, Hierarchical processing, Phoneme, Brain decoding, Cortical speech processing
Authors
Matteo Ciferri, Tommaso Boccato, Michal Olak, Matteo Ferrante, Nicola Toschi
Abstract
Understanding how speech foundation models relate to human cortical activity is a key challenge for computational neuroscience. Here, we investigate how internal representations from Whisper predict intracranial ECoG responses during naturalistic speech perception. We introduce a time-resolved neural encoder that combines speech embeddings with a recurrent temporal model and soft attention, allowing us to examine layer-wise brain alignment. Intermediate Whisper layers provide the strongest correspondence with neural activity, supporting a hierarchical match between model representations and cortical speech processing. Comparisons with baselines show that high-resolution ECoG responses benefit from temporally structured modelling beyond linear mappings from the same speech representations. In addition, attention maps reveal temporally local alignment between speech embeddings and neural responses, while a phonemic interpretability analysis identifies anatomically coherent phoneme-category organization among encoding-informative electrodes. Together, these results suggest that speech foundation models offer a useful framework for studying time-resolved cortical speech representations.
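
As a rough illustration of the soft-attention step mentioned in the abstract, the sketch below weights a short window of embedding frames with a softmax and reads out one predicted electrode value. It is a toy under stated assumptions, not the authors' encoder: the function names, the scaled dot-product scoring, the fixed `query` vector (which in a recurrent model would plausibly come from the hidden state), and the linear `readout` are all hypothetical, and the recurrent temporal model itself is omitted.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def attend(frames, query):
    """Soft attention over a window of embedding frames.

    frames: list of d-dimensional embedding vectors (one per time step).
    query:  d-dimensional vector (assumption: stands in for a recurrent state).
    Returns the attention-weighted context vector and the weights.
    """
    d = len(query)
    scores = [dot(query, f) / math.sqrt(d) for f in frames]
    weights = softmax(scores)
    context = [sum(w * f[i] for w, f in zip(weights, frames)) for i in range(d)]
    return context, weights


def predict_electrode(frames, query, readout, bias=0.0):
    """Hypothetical linear readout from the context vector to one electrode."""
    context, weights = attend(frames, query)
    return dot(readout, context) + bias, weights


# Toy data: three 2-dimensional "embedding frames" (made up for illustration).
frames = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
query = [2.0, 0.0]
readout = [0.5, -0.5]

pred, weights = predict_electrode(frames, query, readout)
print(pred, weights)
```

The returned `weights` sum to one across the window, so inspecting them for each time step shows which frames drive the prediction; this is the kind of quantity an attention map visualizes.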