SpeechEditBench: A Bilingual Multi-Attribute Benchmark for Instruction-Guided Speech Editing

2026-06-01

Sound
AI summary

The authors introduce SpeechEditBench, a bilingual benchmark that tests how well speech models follow instructions to change specified attributes of speech while leaving the others unchanged. It includes seven atomic editing tasks and harder compositional tasks that combine several edits in a single instruction. An anchor-based evaluation protocol separately measures whether the requested edit was made and whether the untargeted attributes were preserved. Across the models tested, none performs well on every task, closed-source models generally outperform open-source ones, and compositional editing remains very hard. The benchmark helps identify where current models struggle and guides improvements.

Speech Large Language Models, speech editing, instruction-guided editing, benchmark, multi-attribute editing, evaluation metrics, edit success, attribute preservation, compositional editing, closed-source models
Authors
Hanlin Zhang, Daxin Tan, Dehua Tao, Xiao Chen, Haochen Tan, Linqi Song
Abstract
Instruction-guided speech editing requires a model to modify specified speech attributes while preserving unrelated characteristics. Despite rapid progress in Speech Large Language Models (Speech LLMs), systematic evaluation of this capability remains challenging, as existing benchmarks are fragmented across isolated editing tasks. To bridge this gap, we introduce SpeechEditBench, a bilingual multi-attribute benchmark for instruction-guided speech editing. SpeechEditBench encompasses seven atomic editing tasks, as well as compositional editing tasks that integrate multiple operations within a single instruction. We propose an anchor-based evaluation protocol that separately assesses the edit success of target attributes and the preservation of untargeted attributes, leading to three metrics: target success, preservation success, and joint success. Using this benchmark, we evaluate mainstream Speech LLMs and specialized speech editing systems. The results reveal three key findings: (1) no single model performs well across all editing dimensions; (2) closed-source Speech LLMs generally outperform open-source models; (3) compositional editing remains highly challenging, with even the most advanced models struggling to achieve high joint success. SpeechEditBench provides a rigorous diagnostic framework to identify bottlenecks in Speech LLMs, thereby facilitating the development of next-generation Speech LLMs with more robust and precise instruction-guided editing capabilities. Data and code will be released upon acceptance.
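
To make the three metrics concrete, the sketch below shows one plausible way to aggregate per-sample judgments into target, preservation, and joint success rates. It is an illustration under stated assumptions, not the authors' implementation: the abstract does not define how each judgment is derived from the anchors, so the `EditResult` record, its two boolean fields, and the `success_rates` function are hypothetical, and joint success is assumed to mean that both conditions hold on the same sample.

```python
from dataclasses import dataclass


@dataclass
class EditResult:
    """Hypothetical per-sample judgment; not taken from the paper."""
    target_ok: bool      # assumed: all targeted attributes were edited as instructed
    preserved_ok: bool   # assumed: all untargeted attributes were left unchanged


def success_rates(results):
    """Aggregate per-sample judgments into three success rates.

    Assumption: joint success counts a sample only when the edit succeeded
    AND the untargeted attributes were preserved on that same sample.
    """
    n = len(results)
    if n == 0:
        raise ValueError("no results to score")
    target = sum(r.target_ok for r in results) / n
    preservation = sum(r.preserved_ok for r in results) / n
    joint = sum(r.target_ok and r.preserved_ok for r in results) / n
    return {"target": target, "preservation": preservation, "joint": joint}


# Four made-up samples, for illustration only.
demo = [
    EditResult(target_ok=True, preserved_ok=True),
    EditResult(target_ok=True, preserved_ok=False),
    EditResult(target_ok=False, preserved_ok=True),
    EditResult(target_ok=True, preserved_ok=True),
]
rates = success_rates(demo)
```

Under this reading, joint success can never exceed either of the other two rates, which is consistent with the finding that high joint success is hard to reach, particularly when a single instruction combines several edits.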