Facet-Level Tracing of Evidence Uncertainty and Hallucination in RAG

2026-04-10 | Computation and Language
AI summary

The authors studied Retrieval-Augmented Generation (RAG), which tries to reduce hallucination by grounding answers in retrieved documents. They introduced a framework that breaks each question into small parts (facets) and checks, for each facet, whether sufficient evidence was retrieved and whether the answer is actually grounded in it. By comparing three inference modes (relying strictly on retrieved documents, combining them with the model's own knowledge, and using no retrieval at all), they found that hallucinations stem less from failing to retrieve the right evidence and more from failing to integrate it correctly during generation. Their facet-level analysis exposed recurring failure patterns that answer-level evaluation misses.

Retrieval-Augmented Generation, hallucination, question answering, evidence grounding, natural language inference, large language models, retrieval relevance, facet-level analysis, parametric knowledge
Authors
Passant Elchafei, Monorama Swain, Shahed Masoudian, Markus Schedl
Abstract
Retrieval-Augmented Generation (RAG) aims to reduce hallucination by grounding answers in retrieved evidence, yet hallucinated answers remain common even when relevant documents are available. Existing evaluations focus on answer-level or passage-level accuracy, offering limited insight into how evidence is used during generation. In this work, we introduce a facet-level diagnostics framework for QA that decomposes each input question into atomic reasoning facets. For each facet, we assess evidence sufficiency and grounding using a structured Facet × Chunk matrix that combines retrieval relevance with natural language inference-based faithfulness scores. To diagnose evidence usage, we analyze three controlled inference modes: Strict RAG, which enforces exclusive reliance on retrieved evidence; Soft RAG, which allows integration of retrieved evidence and parametric knowledge; and LLM-only generation without retrieval. Comparing these modes enables thorough analysis of retrieval-generation misalignment, defined as cases where relevant evidence is retrieved but not correctly integrated during generation. Across medical QA and HotpotQA, we evaluate three open-source and closed-source LLMs (GPT, Gemini, and LLaMA), providing interpretable diagnostics that reveal recurring facet-level failure modes, including evidence absence, evidence misalignment, and prior-driven overrides. Our results demonstrate that hallucinations in RAG systems are driven less by retrieval accuracy and more by how retrieved evidence is integrated during generation, with facet-level analysis exposing systematic evidence override and misalignment patterns that remain hidden under answer-level evaluation.
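
To make the facet-by-chunk idea concrete, the following is a minimal sketch of how one row of such a matrix might be turned into a facet-level label. It is not the authors' implementation: the abstract does not say how relevance and NLI scores are combined, so the `Cell` type, the `diagnose_facet` function, the 0.5 thresholds, and the `agrees_with_llm_only` flag are all hypothetical. Only the label names echo the failure modes listed in the abstract.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Cell:
    """One (facet, chunk) entry of a hypothetical facet-by-chunk matrix."""
    relevance: float   # assumed retrieval relevance of the chunk to the facet, in [0, 1]
    entailment: float  # assumed NLI entailment score of the answer's claim given the chunk, in [0, 1]


def diagnose_facet(
    row: List[Cell],
    agrees_with_llm_only: bool = False,
    rel_thresh: float = 0.5,  # assumed threshold, not taken from the paper
    ent_thresh: float = 0.5,  # assumed threshold, not taken from the paper
) -> str:
    """Label one facet from its row of cells (illustrative decision rule only)."""
    # Keep only the chunks that look relevant to this facet.
    relevant = [c for c in row if c.relevance >= rel_thresh]
    if not relevant:
        # Nothing relevant was retrieved for this facet.
        return "evidence_absence"
    if any(c.entailment >= ent_thresh for c in relevant):
        # At least one relevant chunk supports the generated claim.
        return "grounded"
    if agrees_with_llm_only:
        # Relevant evidence exists, but the answer matches the no-retrieval output instead.
        return "prior_driven_override"
    # Relevant evidence exists but was not integrated into the answer.
    return "evidence_misalignment"


# Hypothetical scores for three facets of one question.
matrix = {
    "facet_1": [Cell(0.9, 0.8), Cell(0.2, 0.1)],
    "facet_2": [Cell(0.1, 0.9)],
    "facet_3": [Cell(0.8, 0.2)],
}
labels = {facet: diagnose_facet(row) for facet, row in matrix.items()}
```

In this sketch the `agrees_with_llm_only` flag stands in for the cross-mode comparison: under the assumption used here, a facet whose answer is unsupported by relevant chunks yet matches the LLM-only output would be flagged as a prior-driven override rather than a plain misalignment.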