LegalCiteBench: Evaluating Citation Reliability in Legal Language Models
2026-05-11 • Computation and Language • Artificial Intelligence
AI summary
The authors created LegalCiteBench, a test of how accurately large language models (LLMs) can provide legal case citations without consulting outside sources. They found that even the best models struggle, often returning wrong or fabricated citations, which could cause serious problems in legal work. Using larger models or training on legal texts did little to fix these failures. The benchmark helps researchers understand when and why models fail at citing legal cases, and it explores ways to reduce confident but incorrect answers.
Large Language Models • Legal Citations • Case Law • Citation Retrieval • Citation Verification • Closed-Book Setting • Misleading Answer Rate • Legal Benchmark • Case Matching • Authority Generation
Authors
Sijia Chen, Hang Yin, Shunfan Zhou
Abstract
Large language models (LLMs) are increasingly integrated into legal drafting and research workflows, where incorrect citations or fabricated precedents can cause serious professional harm. Existing legal benchmarks largely emphasize statutory reasoning, contract understanding, or general legal question answering, but they do not directly study a central common-law failure mode: when asked to provide case authorities without external grounding, models may return plausible-looking but incorrect citations or cases. We introduce LegalCiteBench, a benchmark for studying closed-book citation recovery, citation verification, and case matching in legal language models. LegalCiteBench contains approximately 24K evaluation instances constructed from 1,000 real U.S. judicial opinions from the Case Law Access Project. The benchmark covers five citation-centric tasks: citation retrieval, citation completion, citation error detection, case matching, and case verification and correction. Across 21 LLMs, exact citation recovery remains highly challenging in this closed-book setting: even the strongest models score below 7/100 on citation retrieval and completion. Within the evaluated models, scale and legal-domain pretraining provide limited gains and do not resolve this difficulty. Models also frequently provide concrete but incorrect or low-overlap authorities under our evaluation protocol, with Misleading Answer Rates (MAR) exceeding 94% for 20 of 21 evaluated models on retrieval-heavy tasks. A prompt-only abstention experiment shows that explicit uncertainty instructions reduce some confident fabrication but do not improve citation correctness. LegalCiteBench is intended as a diagnostic framework for studying authority generation failures, verification behavior, and abstention when external grounding is absent, incomplete, or bypassed.
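To make the Misleading Answer Rate (MAR) concrete, the sketch below shows one plausible way such a metric could be computed: the share of non-abstaining answers that supply a concrete but incorrect citation. This is only an illustration under assumed conventions; the field names (`predicted`, `gold`, `abstained`) and the exact-string comparison are assumptions for this example, not the paper's actual evaluation protocol.

```python
# Sketch of a MAR-style metric: fraction of concrete (non-abstaining) answers
# whose citation does not match the gold citation. Field names and the
# exact-match comparison are illustrative assumptions, not the paper's schema.
from dataclasses import dataclass

@dataclass
class CitationAnswer:
    predicted: str | None   # citation string the model produced, or None
    gold: str               # reference citation from the source opinion
    abstained: bool         # whether the model explicitly declined to answer

def misleading_answer_rate(answers: list[CitationAnswer]) -> float:
    """Share of concrete (non-abstaining) answers whose citation is wrong."""
    concrete = [a for a in answers if not a.abstained and a.predicted]
    if not concrete:
        return 0.0
    wrong = sum(1 for a in concrete if a.predicted.strip() != a.gold.strip())
    return wrong / len(concrete)

if __name__ == "__main__":
    sample = [
        CitationAnswer("410 U.S. 113 (1973)", "410 U.S. 113 (1973)", False),
        CitationAnswer("347 U.S. 483 (1955)", "347 U.S. 483 (1954)", False),
        CitationAnswer(None, "531 U.S. 98 (2000)", True),
    ]
    print(f"MAR: {misleading_answer_rate(sample):.2%}")  # 50.00% of concrete answers
```

Under this reading, prompting a model to abstain when uncertain lowers MAR only if the withheld answers would otherwise have been wrong, which matches the paper's finding that abstention instructions reduce confident fabrication without improving citation correctness.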