Can LLMs Time Travel? Enhancing Temporal Consistency in Legal Agentic Search through Reinforcement Learning

2026-05-25 • Computation and Language

Computation and Language, Artificial Intelligence

AI summary

The authors point out that current legal language models often make mistakes because they ignore when a law applies, even though the applicable law depends on the date of each case. They find that these models are biased toward the law as it stood at their training cutoff and that search agents rarely include time constraints in their queries. To address this, they built LegalSearch-R1, a system trained with reinforcement learning that combines retrieval from a local statute database with web search while keeping the retrieved law consistent with each case's date. Their 7B-parameter system outperformed existing models on a benchmark of 13 legal tasks, especially on temporal consistency. The code and data are publicly available.

large language models, legal reasoning, temporal context, reinforcement learning, retrieval-augmented generation (RAG), statute, legal precedent, temporal bias, out-of-domain generalization, legal search agents

Authors

Wei Fan, Yining Zhou, Mufan Zhang, Yanbing Weng, Yiran HU, Tianshi Zheng, Baixuan Xu, Chunyang Li, Jianhui Yang, Haoran Li, Yangqiu Song

Abstract

While large language models (LLMs) augmented with agentic search capabilities show promise for legal reasoning, they overlook a fundamental constraint: the applicable law must match the temporal context of each case, because retroactive application of statutes violates core legal principles and leads to erroneous conclusions. Our observations reveal that current legal LLMs suffer from temporal bias anchored to their training cutoff, that search agents rarely incorporate temporal constraints into queries, and that web search alone cannot provide the precise statute and precedent citations that legal reasoning demands. To address these challenges, we propose LegalSearch-R1, an end-to-end reinforcement learning framework that pairs local statute RAG for precise article matching with online web search for broader legal knowledge, trained on temporally indexed data spanning multiple amendment periods to enforce temporal consistency. Extensive experiments on our benchmark covering 13 legal tasks demonstrate that our 7B-parameter agent outperforms state-of-the-art deep research frameworks and specialized legal LLMs by 12.9% to 29.8%, surpasses baselines by 57.7% to 80.3% on temporal consistency, and exhibits robust out-of-domain generalization. The code and data are available at https://github.com/AlexFanw/LegalSearch-R1.
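The temporal constraint described in the abstract can be illustrated with a small sketch: given several versions of a statute article, each with an effective date range, pick the one in force on the date of the case. This is not the LegalSearch-R1 implementation; the `StatuteVersion` class, the `version_in_force` function, the article identifier, and the dates and wording in `VERSIONS` are all hypothetical and invented for illustration.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class StatuteVersion:
    """One wording of an article, valid over a date range (hypothetical schema)."""
    article_id: str                      # illustrative identifier, e.g. "Art. 10"
    text: str
    effective_from: date                 # first day this wording is in force
    effective_to: Optional[date] = None  # last day in force; None means still in force


def version_in_force(versions, article_id, case_date):
    """Return the version of `article_id` in force on `case_date`, or None."""
    for v in versions:
        if v.article_id != article_id:
            continue
        started = v.effective_from <= case_date
        not_ended = v.effective_to is None or case_date <= v.effective_to
        if started and not_ended:
            return v
    return None


# Toy data: one article amended once (dates and wording are invented).
VERSIONS = [
    StatuteVersion("Art. 10", "original wording", date(2015, 1, 1), date(2020, 12, 31)),
    StatuteVersion("Art. 10", "amended wording", date(2021, 1, 1)),
]

if __name__ == "__main__":
    # A case dated 2018 should resolve to the pre-amendment wording.
    print(version_in_force(VERSIONS, "Art. 10", date(2018, 6, 1)).text)
```

In this toy setup, a case dated 2018 resolves to the original wording even though a newer amendment exists, which is the behavior a temporally consistent retriever would need. A lookup that always returned the latest version would apply the amendment retroactively.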
