READ More than What You See: Reinforcement Learning for Accurate and Coherent Audio Description Generations

2026-06-22
Computer Vision and Pattern Recognition

AI summary

The authors present READ, a reinforcement-learning framework for training models to generate audio descriptions for blind and low-vision audiences. Rather than training only to predict the next word, READ optimizes each description as a whole, using rewards for how well it matches reference descriptions, its length, and its format, plus a coherence reward that encourages each description to fit the surrounding narrative. In experiments on MAD-Eval, CMD-AD, and TV-AD, READ outperforms earlier methods across a range of evaluation metrics. The results suggest that reward-based training is a promising way to produce accurate and coherent audio descriptions.

Audio Description, Reinforcement Learning, Multimodal Models, Sequence Optimization, Next-token Prediction, Narrative Coherence, MAD-Eval, CMD-AD, TV-AD, Reference Matching
Authors
Bo Fang, Xinyao Zhang, Yuxin Song, Hui Zhang, Hang Zhou, Antoni B. Chan
Abstract
Audio Description (AD) aims to generate concise narrations of essential visual content in audio-visual media for blind and low-vision audiences. Existing methods either rely on prompting off-the-shelf multimodal models, which often mismatch AD style, or partially optimize training-based systems with next-token prediction, which under-explores model capacity and biases generation toward generic expressions. We present READ, the first reinforcement-learning (RL) framework for training-based AD generation. READ formulates AD as sequence-level optimization with reference-matching, length, and format rewards, and further introduces a dedicated coherence reward under context-aware supervision to promote narratively coherent descriptions. Experiments on MAD-Eval, CMD-AD, and TV-AD show that READ substantially outperforms prior methods across diverse evaluation metrics. Our results highlight RL as a promising paradigm for accurate and coherent AD generation. Our code, models, and benchmark results will be publicly available.
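As a rough illustration of what a sequence-level reward built from reference-matching, length, format, and coherence terms might look like, here is a minimal Python sketch. The abstract does not specify how any of these rewards are computed or weighted, so everything below is an assumption made for illustration: the token-F1 stand-in for reference matching, the linear length penalty, the single-sentence format check, the pluggable `coherence_fn`, the weights, and all function names are hypothetical and are not the authors' method.

```python
from collections import Counter
from typing import Callable, Sequence, Tuple


def reference_reward(candidate: str, reference: str) -> float:
    """Token-level F1 between candidate and reference (assumed stand-in metric)."""
    cand, ref = candidate.lower().split(), reference.lower().split()
    if not cand or not ref:
        return 0.0
    # Multiset intersection counts each shared token at most min(count) times.
    overlap = sum((Counter(cand) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(cand), overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


def length_reward(candidate: str, reference: str) -> float:
    """1.0 when word counts match, decaying linearly to 0.0 (assumed form)."""
    n_cand, n_ref = len(candidate.split()), len(reference.split())
    if n_ref == 0:
        return 0.0
    return max(0.0, 1.0 - abs(n_cand - n_ref) / n_ref)


def format_reward(candidate: str) -> float:
    """1.0 if the output is one non-empty line ending in a period (assumed rule)."""
    text = candidate.strip()
    return 1.0 if text and "\n" not in text and text.endswith(".") else 0.0


def total_reward(
    candidate: str,
    reference: str,
    context: Sequence[str],
    coherence_fn: Callable[[Sequence[str], str], float],
    weights: Tuple[float, float, float, float] = (1.0, 0.5, 0.5, 0.5),
) -> float:
    """Weighted sum of the four reward terms; the weights are hypothetical."""
    w_ref, w_len, w_fmt, w_coh = weights
    return (
        w_ref * reference_reward(candidate, reference)
        + w_len * length_reward(candidate, reference)
        + w_fmt * format_reward(candidate)
        # coherence_fn scores the candidate against preceding descriptions;
        # in practice this could be a learned or model-based judge.
        + w_coh * coherence_fn(context, candidate)
    )
```

In an RL loop, a scalar like `total_reward` would be computed for each sampled description and fed to a policy-gradient style update. Keeping the coherence term as a callable reflects that it depends on context (the preceding descriptions) rather than on the reference alone.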