Retrieval Improvements Do Not Guarantee Better Answers: A Study of RAG for AI Policy QA

2026-03-25

Computation and Language, Artificial Intelligence, Computers and Society, Information Retrieval, Machine Learning
AI summary

The authors examine how retrieval-augmented generation (RAG) systems can support the analysis of AI policy documents, which are difficult to work with because of dense legal language and evolving, overlapping rules. They evaluate their system on a large collection of AI policy documents, applying methods to improve both how the system retrieves evidence and how it generates answers. They find that improving the retrieval component does not always make the final answers more accurate; in some cases it even causes the system to confidently give wrong answers when no relevant information is available in the corpus. This shows that improving individual components of these systems does not guarantee that the whole system works better, especially in complex policy domains. The study offers practical guidance for building question-answering tools over changing regulations.

Retrieval-augmented generation (RAG), AI governance, Policy documents, ColBERT retriever, Contrastive learning, Direct Preference Optimization (DPO), Question answering, Hallucinations in AI, Regulatory frameworks, Domain adaptation
Authors
Saahil Mathur, Ryan David Rittner, Vedant Ajit Thakur, Daniel Stuart Schiff, Tunazzina Islam
Abstract
Retrieval-augmented generation (RAG) systems are increasingly used to analyze complex policy documents, but achieving sufficient reliability for expert usage remains challenging in domains characterized by dense legal language and evolving, overlapping regulatory frameworks. We study the application of RAG to AI governance and policy analysis using the AI Governance and Regulatory Archive (AGORA) corpus, a curated collection of 947 AI policy documents. Our system combines a ColBERT-based retriever fine-tuned with contrastive learning and a generator aligned to human preferences using Direct Preference Optimization (DPO). We construct synthetic queries and collect pairwise preferences to adapt the system to the policy domain. Through experiments evaluating retrieval quality, answer relevance, and faithfulness, we find that domain-specific fine-tuning improves retrieval metrics but does not consistently improve end-to-end question answering performance. In some cases, stronger retrieval counterintuitively leads to more confident hallucinations when relevant documents are absent from the corpus. These results highlight a key concern for those building policy-focused RAG systems: improvements to individual components do not necessarily translate to more reliable answers. Our findings provide practical insights for designing grounded question-answering systems over dynamic regulatory corpora.
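For readers unfamiliar with the components named above, the following is a minimal sketch of the standard objectives they are usually built on. The abstract does not specify the exact losses or hyperparameters used, so these are the common textbook formulations (ColBERT's late-interaction score, an in-batch contrastive loss, and the original DPO objective), not the authors' specific choices; the symbols $\tau$, $\beta$, and $\pi_{\text{ref}}$ are assumptions of this sketch. ColBERT scores a query $q$ against a document $d$ by summing, over query token embeddings $E_{q_i}$, the maximum similarity to any document token embedding $E_{d_j}$:

$$ s(q, d) = \sum_{i} \max_{j} \; E_{q_i} \cdot E_{d_j} $$

A contrastive fine-tuning loss over a positive document $d^+$ and a set of negatives $\mathcal{N}$ (e.g., in-batch negatives) then typically takes the form

$$ \mathcal{L}_{\text{con}} = -\log \frac{\exp\!\big(s(q, d^+)/\tau\big)}{\exp\!\big(s(q, d^+)/\tau\big) + \sum_{d^- \in \mathcal{N}} \exp\!\big(s(q, d^-)/\tau\big)} $$

and DPO aligns the generator $\pi_\theta$ to pairwise preferences $(y_w, y_l)$ for a prompt $x$, relative to a frozen reference policy $\pi_{\text{ref}}$, via

$$ \mathcal{L}_{\text{DPO}} = -\,\mathbb{E}_{(x,\, y_w,\, y_l)} \left[ \log \sigma\!\left( \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)} \right) \right] $$

Under these objectives, better retrieval scores and better preference alignment can each improve in isolation while the end-to-end system still hallucinates when no relevant document exists to retrieve, which is the failure mode the abstract highlights.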