Reasoning LLM Improves Speaker Recognition in Long-form TV Dramas

2026-07-02 | Computation and Language

Computation and Language, Artificial Intelligence, Computer Vision and Pattern Recognition
AI summary

The authors address the challenge of identifying which character is speaking in long-form TV dramas, a step that is important for following complicated storylines. They created a large new dataset called DramaSR-532K, with 532K annotated dialogue lines from more than 900 characters. They also designed a new method, DramaSR-LRM, built on a large reasoning model that uses tools to gather audio, text, and visual evidence before deciding who is speaking. Their approach works especially well when the spoken lines are very short and sound alone isn’t enough. The authors will share their data and code publicly for others to use.

speaker recognition, long-form TV drama, multimodal learning, large-scale dataset, audio-visual integration, language models, contextual reasoning, dialogue attribution, acoustic biometrics, machine learning
Authors
Yuxuan Li, Lingxi Xie, Xinyue Huo, Jihao Qiu, Jiacheng Shao, Pengfei Chen, Jiannan Ge, Kaiwen Duan, Qi Tian
Abstract
Long-form TV dramas present a formidable challenge for comprehensive video understanding, where deciphering complex storylines often relies on speaker recognition, the task of accurately attributing each spoken utterance to its respective character. In this paper, we advance this field through two primary contributions. (1) We introduce DramaSR-532K, a large-scale benchmark comprising 532K annotated dialogue lines across more than 900 unique characters, which requires integrating auditory, linguistic, and visual cues for speaker recognition. (2) We propose DramaSR-LRM, a robust approach built upon a large reasoning model (LRM). DramaSR-LRM is designed to autonomously aggregate contextual evidence via multimodal tool-use, synthesizing diverse inputs to achieve high-fidelity attribution. Experimental results demonstrate that DramaSR-LRM significantly outperforms existing baselines, particularly on short utterances where acoustic biometrics are inherently unreliable. All the data and code will be made publicly available at the project page: https://www.github.com/198808xc/DramaSR-LRM.
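
To illustrate why short utterances benefit from non-acoustic evidence, here is a minimal sketch. It is not the authors' implementation: a simple weighted score fusion stands in for the reasoning model's tool-driven aggregation. The function names, the per-character score dictionaries, the linear weighting rule, and the 2-second threshold are all hypothetical assumptions made for illustration.

```python
from collections import defaultdict


def acoustic_weight(duration_s, full_trust_s=2.0):
    """Hypothetical rule: scale trust in voice evidence with utterance length.

    The 2-second default is an illustrative assumption, not a value from the paper.
    """
    return min(max(duration_s, 0.0) / full_trust_s, 1.0)


def attribute_speaker(duration_s, audio_scores, text_scores, visual_scores):
    """Fuse per-character scores from three cues and return the top candidate.

    Each *_scores argument maps a character name to a score (higher is better).
    Audio scores are down-weighted for short utterances; text and visual
    scores are added at full weight.
    """
    w_audio = acoustic_weight(duration_s)
    combined = defaultdict(float)
    for name, score in audio_scores.items():
        combined[name] += w_audio * score
    for scores in (text_scores, visual_scores):
        for name, score in scores.items():
            combined[name] += score
    return max(combined, key=combined.get)


# Toy scores for two hypothetical characters.
audio = {"A": 0.9, "B": 0.1}
text = {"A": 0.2, "B": 0.6}
visual = {"A": 0.3, "B": 0.5}

short_guess = attribute_speaker(0.4, audio, text, visual)  # voice barely counts
long_guess = attribute_speaker(3.0, audio, text, visual)   # voice fully counts
```

With these toy numbers, the same voice evidence decides the long utterance but is outweighed by dialogue context and on-screen cues for the short one, which is the regime the abstract highlights.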