LexRubric: A Rubric-Guided Diagnostic Benchmark for Open-Ended Legal Tasks
2026-06-08 • Computation and Language
AI summary
The authors introduce LexRubric, a benchmark for testing how well large language models (LLMs) handle open-ended legal questions in Chinese. It scores answers against detailed expert-written criteria, on tasks ranging from everyday legal consultation to judicial examination questions across 14 legal scenarios. The authors check the reliability of this scoring by comparing judge-model judgments with human judgments. An evaluation of 18 LLMs shows that current models still struggle with open-ended legal questions. The data is publicly available online.
Large Language Models, Legal AI, Benchmarking, Evaluation Rubric, Chinese Legal Tasks, Legal Consultation, Judicial Examination, Model Reliability, Diagnostic Analysis, Open-ended Questions
Authors
Yifan Chen, Haitao Li, Yiran Hu, Kaisong Song, Jun Lin, Yueyue Wu, Qingyao Ai, Min Zhang, Yiqun Liu
Abstract
As large language models (LLMs) are increasingly applied to real-world legal tasks, evaluating the reliability of their open-ended legal responses has become essential. These tasks require context-sensitive answers and allow little room for error, motivating fine-grained and diagnostic evaluation that can identify specific sources of response quality failures. We introduce LexRubric, a rubric-based benchmark for evaluating open-ended Chinese legal tasks. LexRubric contains 649 instances from legal consultation and judicial examination, which reflect both everyday legal needs and professional legal reasoning and cover 14 legal scenarios. It further includes 12,337 expert-written atomic scoring criteria organized under a unified six-dimensional framework, enabling accurate evaluation and diagnostic analysis across tasks and evaluation dimensions. To validate the reliability of the evaluation, we test multiple judge models and compare model-based judgments with human judgments. We further evaluate 18 recent general and legal-domain LLMs on LexRubric. Results show that different models exhibit distinct capability profiles, and that open-ended legal questions remain challenging for current LLMs. Data is available at: https://github.com/foggpoy/LexRubric.
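To make the rubric-based scoring concrete, the following is a minimal sketch of how per-criterion verdicts might be aggregated into per-dimension scores, and how judge-model verdicts might be compared with human verdicts. It is an illustration under stated assumptions, not the LexRubric implementation: the `Criterion` fields, the point weights, the binary verdicts, the placeholder dimension labels, and the simple agreement rate are all hypothetical, since the abstract specifies neither the six dimensions nor the aggregation or agreement measure.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Criterion:
    """One atomic scoring criterion (all field names are illustrative)."""
    dimension: str   # placeholder label; the six real dimensions are not named in the abstract
    points: float    # assumed weight of this criterion
    satisfied: bool  # assumed binary verdict from a judge for one response


def dimension_scores(criteria):
    """Return, for each dimension, the fraction of available points earned."""
    earned = defaultdict(float)
    available = defaultdict(float)
    for c in criteria:
        available[c.dimension] += c.points
        if c.satisfied:
            earned[c.dimension] += c.points
    return {d: earned[d] / available[d] for d in available if available[d] > 0}


def agreement_rate(judge_verdicts, human_verdicts):
    """Fraction of criteria on which the judge model and the human agree."""
    if len(judge_verdicts) != len(human_verdicts):
        raise ValueError("verdict lists must have the same length")
    if not judge_verdicts:
        raise ValueError("verdict lists must be non-empty")
    matches = sum(j == h for j, h in zip(judge_verdicts, human_verdicts))
    return matches / len(judge_verdicts)


if __name__ == "__main__":
    # Toy verdicts for a single response; "dim_a" and "dim_b" are placeholders.
    response_criteria = [
        Criterion("dim_a", 2.0, True),
        Criterion("dim_a", 2.0, False),
        Criterion("dim_b", 1.0, True),
    ]
    print(dimension_scores(response_criteria))
    print(agreement_rate([True, False, True, True], [True, True, True, True]))
```

Keeping scores separate per dimension, instead of collapsing them into one number, is what would allow the kind of diagnostic comparison of capability profiles that the abstract describes.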