Benchmarking LLM-as-a-Judge for Long-Form Output Evaluation

2026-06-01

Computation and Language
AI summary

The authors studied how well large language models (LLMs) can judge long pieces of text, a task that places more complex, document-level demands on the judge than judging short outputs does. They created a new benchmark called LongJudgeBench to test different LLM judges across various real-world scenarios and judging protocols. Their findings show that current LLM judges are not consistently reliable for long-form evaluation, and that providing rubrics or references helps but is not always sufficient. The authors hope their work will help improve how LLMs judge longer text in the future.

large language models, long-form generation, evaluation benchmarking, LLM-as-a-judge, meta-evaluation, rubrics, context-aware judgment, human-aligned evaluation
Authors
Junjie Chen, Yuxi Dong, Haitao Li, Weihang Su, Yujia Zhou, Min Zhang, Yiqun Liu, Qinyao Ai
Abstract
As large language models (LLMs) are increasingly used for long-form generation, reliably evaluating long-form outputs has become a critical challenge. LLM-as-a-judge offers a scalable alternative to human evaluation, yet its reliability in long-form output evaluation remains underexamined: existing meta-evaluation benchmarks focus mainly on short-form outputs. Compared with short-form evaluation, long-form evaluation is not merely a matter of output length; it often requires judges to handle more complex document-level demands. In this work, we introduce LongJudgeBench, a comprehensive benchmark for evaluating LLM judges on long-form outputs across diverse real-world scenarios and judging protocols. We systematically evaluate a broad range of LLM judges, covering multiple base models and judging settings. Our results reveal a substantial reliability gap: current LLM judges remain unstable across scenarios, and rubrics or references are helpful but not always sufficient. We hope LongJudgeBench will support future research on more robust, context-aware, and human-aligned LLM-as-a-judge methods. Our code is available at https://anonymous.4open.science/r/LongJudgeBench-F782.