Automated Essay Scoring and Language Certification: Assessing Generalizability, Agreement and Validity for French

2026-06-01 • Computation and Language

AI summary

The authors looked at how automated essay scoring systems are usually tested and found that these tests are often too simple. They improved a detailed evaluation framework called the argument-based validation framework (ABV) by adding new checks like fairness and error analysis. Using this improved framework, they tested eight different models on a large set of French exam essays scored by humans. Their work helps us better understand how well these computer systems score essays and points out where they might make mistakes.

Automated Essay Scoring, Argument-Based Validation, Fairness Analysis, Prediction Error, Model Agreement, Linguistic Features, High-Stakes Testing, French Language Processing

Authors

Rodrigo Wilkens, Rémi Cardon, Vincent Folny, Thomas François

Abstract

In Automated Essay Scoring (AES), benchmarking has fostered minimalist evaluation practices, in contrast with the broader recommendations of evaluation frameworks such as the argument-based validation framework (ABV), which argues for a multidimensional assessment of systems, especially in the context of high-stakes language tests. In this paper, we introduce an enhanced and more practical version of the ABV framework that incorporates fairness analysis, correlations with linguistic features, prediction error evaluation, and model agreement compared with human raters. Applying this framework to French AES, we compare eight model architectures on a corpus of 27k exam essays (two raters each) and a generalization corpus of 961 essays (at least nine raters each). Our analyses illustrate the benefits of applying the ABV framework to better understand the capabilities and pitfalls of AES models, while also advancing the state of the art for French AES.
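
As an illustration of what "model agreement compared with human raters" can look like in practice, the sketch below computes quadratic weighted kappa, a chance-corrected agreement statistic commonly reported in AES work. The abstract does not say which agreement statistic the authors use, so the choice of metric, the function name `quadratic_weighted_kappa`, and the integer score scale 0..num_levels-1 are all assumptions made for this example.

```python
from collections import Counter


def quadratic_weighted_kappa(human, model, num_levels):
    """Chance-corrected agreement between two lists of integer scores.

    Scores are assumed (hypothetically) to lie on an ordinal scale
    0..num_levels-1, with num_levels >= 2. This is an illustrative
    metric, not necessarily the one used in the paper.
    """
    if len(human) != len(model) or not human:
        raise ValueError("score lists must be non-empty and of equal length")
    n = len(human)
    observed = Counter(zip(human, model))  # joint counts of (human, model) pairs
    human_hist = Counter(human)            # marginal counts for the human rater
    model_hist = Counter(model)            # marginal counts for the model
    max_dist = (num_levels - 1) ** 2
    observed_cost = 0.0
    expected_cost = 0.0
    for i in range(num_levels):
        for j in range(num_levels):
            # Disagreements are penalized by squared distance on the scale.
            weight = (i - j) ** 2 / max_dist
            observed_cost += weight * observed[(i, j)]
            # Expected count if the two raters scored independently.
            expected_cost += weight * human_hist[i] * model_hist[j] / n
    if expected_cost == 0:
        return 1.0  # both raters used one identical level throughout
    return 1.0 - observed_cost / expected_cost
```

A value of 1 indicates perfect agreement, 0 indicates agreement no better than chance, and negative values indicate systematic disagreement. With several human raters per essay, as in the generalization corpus, the same function could be applied to human-human pairs to obtain a baseline against which model-human agreement is read.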
