A Finetuned SpeechLLM for Joint Multi-Granular L2 Assessment and Natural-Language Rationales

2026-06-08

Computation and Language, Artificial Intelligence
AI summary

The authors fine-tuned a SpeechLLM to assess second-language (L2) speech at several levels of detail: sentence-level accuracy, fluency, and prosody, plus word- and phoneme-level accuracy. The model also explains its assessments in natural language. It was trained with a hybrid objective that combines supervised fine-tuning and Bounded Direct Preference Optimization, with the aim of improving both accuracy and explanation quality. On SpeechOcean762, the model performs as well as or better than models that focus on a single level of detail. The explanations are plausible for whole sentences, but they are less faithful at the word and phoneme level, where references are sparse and only weakly aligned with the labels.

Automated Speech Assessment, Second Language (L2) Proficiency, SpeechLLM, Multi-Granular Assessment, Supervised Fine-Tuning, Bounded Direct Preference Optimization, Ordinal Labels, Sentence-Level Evaluation, Word/Phoneme-Level Accuracy, Natural-Language Rationale
Authors
Aditya Kamlesh Parikh, Cristian Tejedor-Garcia, Catia Cucchiarini, Helmer Strik
Abstract
Automated L2 speech assessment can assign proficiency labels, but often lacks interpretability. We propose a rubric-guided SpeechLLM for multi-aspect, multi-granular assessment, trained with a hybrid objective combining supervised fine-tuning and Bounded Direct Preference Optimization. In a single response, the model jointly predicts sentence-level ordinal labels (accuracy, fluency, prosody) and word/phoneme-level accuracy, and generates a natural-language rationale. On SpeechOcean762, our approach matches or outperforms single-granularity models while remaining competitive with prior approaches. We analyze rationale reliability along two axes: self-consistency with model predictions and alignment with ground-truth labels, using sentiment consistency (plausibility) and mention-based agreement (faithfulness). Rationales are plausible at the sentence level, but faithfulness degrades at the word/phoneme level, where references are sparse and weakly aligned with token-level labels.
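
To make the idea of a hybrid objective concrete, the sketch below combines a supervised fine-tuning term with a preference term. The abstract does not give the form of the bounded variant or how the two terms are weighted, so this sketch swaps in the standard DPO loss and a hypothetical mixing weight `pref_weight`; the function names and default values are illustrative assumptions, not the authors' implementation.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def sft_loss(target_token_logps: list[float]) -> float:
    """Mean negative log-likelihood of the reference response tokens."""
    return -sum(target_token_logps) / len(target_token_logps)


def dpo_loss(pol_chosen: float, pol_rejected: float,
             ref_chosen: float, ref_rejected: float,
             beta: float = 0.1) -> float:
    """Standard DPO loss for one preference pair.

    Inputs are sequence log-probabilities under the policy (pol_*) and the
    frozen reference model (ref_*). This is NOT the bounded variant named
    in the abstract; it is a stand-in for illustration.
    """
    margin = (pol_chosen - ref_chosen) - (pol_rejected - ref_rejected)
    return -log_sigmoid(beta * margin)


def hybrid_loss(target_token_logps: list[float],
                pol_chosen: float, pol_rejected: float,
                ref_chosen: float, ref_rejected: float,
                beta: float = 0.1, pref_weight: float = 1.0) -> float:
    """SFT term plus a weighted preference term (pref_weight is hypothetical)."""
    preference = dpo_loss(pol_chosen, pol_rejected, ref_chosen, ref_rejected, beta)
    return sft_loss(target_token_logps) + pref_weight * preference
```

In this form, the preference term shrinks as the policy assigns relatively more probability to the preferred response than the reference model does, while the SFT term keeps the output close to the annotated labels and rationale.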