Beyond Probabilistic Similarity: Structural, Temporal, and Causal Limitations of Retrieval-Augmented Generation in the Legal Domain

2026-06-08 • Artificial Intelligence

AI summary

The authors explain that problems in legal AI—like fake citations and outdated information—aren't just random errors but come from a mismatch between how these systems search for information and the complex, layered, and time-sensitive nature of legal knowledge. They identify three main blind spots in current retrieval methods: missing the law's hierarchical structure, ignoring changes over time, and lacking clear cause-and-effect links in legal decisions. Reviewing existing AI approaches, they find these issues are only partially addressed and propose a new framework focusing on clear, time-aware, and structured retrieval to better match legal reasoning. Their work mainly deals with figuring out which laws apply when and how, especially in legislative and constitutional contexts.

Retrieval-Augmented Generation • Legal AI • Hierarchical Structure • Diachronic Dynamism • Institutional Provenance • Ontological Commitment • Mereological Blindness • Bitemporal Correctness • Deterministic Protocols • Quaestio Juris

Authors

Hudson de Martim

Abstract

Retrieval-Augmented Generation (RAG) has become a standard architectural response to unreliability in legal AI, yet high-profile failures, including fabricated citations submitted to courts and anachronistic legal content presented as current, continue to appear across jurisdictions. We argue that these failures are not residual confabulations to be eliminated by scaling language models, but symptoms of an architectural mismatch between probabilistic retrieval and the hierarchical, temporal, and institutional structure of legal knowledge. We develop the argument in three moves. First, we articulate the ontological commitment of legal knowledge as a triad of properties derivable from classical legal theory: hierarchical and mereological structure, diachronic dynamism under operational closure, and causal traceability of institutional provenance grounded in the duty of justification. Second, we identify three corresponding pathologies of retrieval (mereological blindness, diachronic blindness, and causal opacity), each developed with an operational definition, a failure mechanism, a canonical example, and detection criteria for diagnostic use. Third, we review the state of the art through this lens, showing that existing approaches address these requirements unevenly and do not yet compose into a paradigm that treats them as co-constitutive. From this analysis we derive four architectural commitments that characterize the deterministic-by-design direction for legal retrieval: ontological primacy, event reification, bitemporal correctness, and deterministic interaction protocols. The framework concerns quaestio juris (which norms apply and in what state) rather than the downstream tasks that act on identified norms, and addresses legislative and constitutional retrieval primarily, with interpretive time as an explicit extension.
