Can Vision Language Models Judge Action Quality? An Empirical Evaluation

2026-04-09 • Computer Vision and Pattern Recognition

Computer Vision and Pattern Recognition • Artificial Intelligence • Computation and Language

AI summary

The authors tested advanced Vision Language Models (VLMs) to see how well they can judge the quality of physical actions like sports movements or exercises. They found that these models only did a little better than guessing randomly and struggled especially with detailed movement quality. Even adding extra information or different ways of asking didn’t help consistently. The study points out that current VLMs have fundamental challenges in assessing fine details of actions, setting a baseline for future improvements.

Action Quality Assessment, Vision Language Models, fine-grained movement, prompting strategies, skeleton information, in-context learning, model biases, contrastive reformulation, physical therapy, sports coaching

Authors

Miguel Monte e Freitas, Rui Henriques, Ricardo Rei, Pedro Henrique Martins

Abstract

Action Quality Assessment (AQA) has broad applications in physical therapy, sports coaching, and competitive judging. Although Vision Language Models (VLMs) hold considerable promise for AQA, their actual performance in this domain remains largely uncharacterised. We present a comprehensive evaluation of state-of-the-art VLMs across activity domains (e.g. fitness, figure skating, diving), tasks, representations, and prompting strategies. Baseline results reveal that Gemini 3.1 Pro, Qwen3-VL and InternVL3.5 models perform only marginally above random chance, and although strategies such as incorporating skeleton information, grounding instructions, reasoning structures and in-context learning lead to isolated gains, none is consistently effective. Analysis of prediction distributions uncovers two systematic biases: a tendency to predict correct execution regardless of visual evidence, and a sensitivity to superficial linguistic framing. Reformulating tasks contrastively to mitigate these biases yields minimal improvement, suggesting that the models' limitations go beyond these biases and point to a fundamental difficulty with fine-grained movement quality assessment. Our findings establish a rigorous baseline for future VLM-based AQA research and provide an actionable outline of the failure modes that must be mitigated before reliable real-world deployment.
