AuthTrace: Diagnosing Evidence Construction in Thematically Dense Single-Author Corpora

2026-05-25
Computation and Language

AI summary

The authors introduce AuthTrace, a diagnostic benchmark that compares different evidence-construction methods for question-answering systems on a single corpus and query set built from texts by one author. They find that evidence recall (how much of the gold evidence a system retrieves) predicts answer quality far better than evidence precision does. As fan-in grows, flat retrieval degrades about three times faster than systems that build structured evidence. Finally, giving a model the full text as context, with no evidence-construction step, fails uniformly, which indicates that systems need to construct evidence actively rather than rely on raw exposure to the corpus.

evidence construction, chunk retrieval, knowledge-graph traversal, thematic indexing, diagnostic benchmark, evidence recall, fan-in gradient, full-context prompting, question answering (QA) models
Authors
Xiaoqing Wu, Feifei Li, Haoliang Ming, Wenhui Que
Abstract
Evidence construction systems--chunk retrieval, agent memory, knowledge-graph traversal, and thematic indexing--are evaluated on separate benchmarks with incompatible corpora and metrics, making cross-paradigm diagnosis impossible. We introduce AuthTrace, the first diagnostic benchmark that places all major paradigms on a single corpus and query set by exploiting the dual nature of single-author collections. Built on thematically dense corpora where all texts share style, topic, and vocabulary, AuthTrace provides 2,099 instances with exhaustive gold evidence and a fan-in gradient as the primary diagnostic axis. Comparing eight systems across two QA models, we find that (1) evidence recall--not precision--is the dominant predictor of answer quality (r = 0.96); (2) fan-in exposes paradigm-specific collapse patterns, with flat retrieval degrading 3x faster than structured-evidence systems; and (3) full-context prompting fails uniformly, establishing evidence construction as a necessary capacity beyond raw corpus exposure.
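To make the recall-versus-precision distinction in finding (1) concrete, here is a minimal Python sketch of set-based evidence scoring for one instance. It is an illustration only: the abstract does not specify how AuthTrace computes these metrics, so the set-overlap definitions, the function name `evidence_scores`, and the passage IDs are all assumptions.

```python
from typing import Iterable, Tuple


def evidence_scores(retrieved: Iterable[str], gold: Iterable[str]) -> Tuple[float, float]:
    """Return (recall, precision) of retrieved evidence IDs against gold IDs.

    Assumption: evidence is scored by exact ID overlap; the benchmark's
    actual matching rule may differ.
    """
    retrieved_set, gold_set = set(retrieved), set(gold)
    hits = len(retrieved_set & gold_set)
    # Recall: share of the gold evidence that the system recovered.
    recall = hits / len(gold_set) if gold_set else 0.0
    # Precision: share of the system's evidence that is gold.
    precision = hits / len(retrieved_set) if retrieved_set else 0.0
    return recall, precision


# Hypothetical instance: four gold passages; the system returns five, three of them gold.
gold = ["p1", "p2", "p3", "p4"]
retrieved = ["p1", "p2", "p3", "x7", "x9"]
recall, precision = evidence_scores(retrieved, gold)
print(f"recall={recall:.2f} precision={precision:.2f}")  # recall=0.75 precision=0.60
```

Under these assumed definitions, a system can raise precision by returning fewer passages while still missing gold evidence, which is the kind of trade-off the recall finding speaks to.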