Can LLM-as-a-Judge Reliably Verify Rubrics in Agentic Scenarios?

2026-06-29

Computation and Language
AI summary

The authors study how reliable large language models (LLMs) are when they act as judges that check complex outputs against detailed rubrics. They created a new benchmark, RuVerBench, with 2,458 examples from deep research and agentic coding tasks, each labeled by humans to indicate whether the output satisfies the rubric. Their tests show that even top LLMs still make a substantial number of mistakes, and that prompt design, batching, and majority voting each affect verification quality differently. This work helps in understanding and improving LLM-based scoring in complex agentic tasks.

Rubric-based scoring, Large language models (LLMs), Rubric verification, Agentic scenarios, Prompt design, Batching, Majority voting, Benchmark dataset, Meta-evaluation
Authors
Yangda Peng, Yunjia Qi, Hao Peng, Haotian Xia, Guanzhong He, Xintong Shi, Richeng Xuan, Songyuanyi Lu, Yixian Liu, Zhichao Hu, Yuhong Liu, Lei Hou, Bin Xu, Juanzi Li
Abstract
Rubric-based scoring has become a widely used paradigm in model evaluation, typically with LLM-as-a-Judge (LaaJ) for rubric scoring. However, the reliability of LaaJ for rubric scoring remains underexplored. This concern is especially pronounced in agentic scenarios, where long, complex outputs further challenge reliable scoring. To address this, we conduct a systematic meta-evaluation of LaaJ reliability for rubric verification. We introduce RuVerBench, the first benchmark for assessing LaaJ reliability in rubric verification for agentic scenarios. RuVerBench covers two prevalent agentic domains, deep research and agentic coding, with 2,458 instances, each containing a model-generated output, a rubric, and a human-annotated label indicating whether the output satisfies the rubric. Using RuVerBench, we evaluate numerous frontier LLMs and find that the most advanced models, despite strong overall performance, still exhibit substantial noise. We further analyze the impact of key LaaJ strategies, including prompt design, batching, and majority voting, on rubric verification. We find that weaker models are more sensitive to prompt variations, batched verification presents a trade-off between accuracy and efficiency, and majority voting yields effective but diminishing returns. We have released our dataset and code to facilitate future research: https://github.com/THU-KEG/RuVerBench.
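
To make the majority-voting setup concrete, here is a minimal sketch of how several sampled judge verdicts for one (output, rubric) pair might be aggregated and then compared against the human label. It is an illustration only and is not taken from the RuVerBench code: the function names, the tie-breaking rule, and the toy data are all assumptions made here, and the numbers are not results from the paper.

```python
from collections import Counter
from typing import Sequence


def majority_vote(verdicts: Sequence[bool]) -> bool:
    """Aggregate sampled judge verdicts for one (output, rubric) pair.

    Assumption: a tie resolves to False ("rubric not satisfied").
    """
    counts = Counter(verdicts)
    return counts[True] > counts[False]


def accuracy(predictions: Sequence[bool], labels: Sequence[bool]) -> float:
    """Fraction of predictions that agree with the human-annotated labels."""
    if len(predictions) != len(labels):
        raise ValueError("predictions and labels must have the same length")
    if not labels:
        raise ValueError("labels must not be empty")
    return sum(p == y for p, y in zip(predictions, labels)) / len(labels)


# Hypothetical toy data: three sampled judge verdicts per instance.
sampled_verdicts = [
    [True, True, False],
    [False, True, False],
    [True, False, True],
    [False, True, True],
]
# Hypothetical human labels (True = output satisfies the rubric).
human_labels = [True, False, True, True]

# Baseline: use only the first sampled verdict for each instance.
single_pass = [v[0] for v in sampled_verdicts]
# Majority voting over all sampled verdicts for each instance.
voted = [majority_vote(v) for v in sampled_verdicts]

print("single-pass accuracy:", accuracy(single_pass, human_labels))
print("majority-vote accuracy:", accuracy(voted, human_labels))
```

Using an odd number of samples avoids ties altogether. Each extra sample also costs one more judge call, which is one way to read the diminishing returns noted in the abstract.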