GRACE: Step-Level Benchmark for Faithful Reasoning over Context
2026-06-15 • Computation and Language
AI summary
The authors study whether the individual steps in a model's chain-of-thought reasoning stay faithful to the context it was given. They created GRACE, a human-annotated benchmark that marks each reasoning step as faithful or unfaithful and records the type of error. This helps identify exactly where and why a model goes wrong, not just whether the final answer is right. They also found that feeding this step-level signal into reinforcement learning improves both accuracy and reasoning reliability.
Chain-of-Thought prompting, faithfulness, hallucinations, reasoning errors, benchmark dataset, reinforcement learning, deductive reasoning, factual grounding, error taxonomy, context-grounded reasoning
Authors
Hoang Pham, Dong Le, Anh Tuan Luu
Abstract
Many reasoning tasks require models to reason over input context, from document-grounded question answering to rule-based deduction. Chain-of-Thought (CoT) prompting produces traces that appear transparent, yet individual steps can silently deviate from the source evidence, even when the final answer is correct. Existing methods detect hallucinations at the response level but fail to identify where in the chain a failure occurs or what type it is. We introduce GRACE, the first human-annotated step-level faithfulness benchmark with a data-driven error taxonomy for context-grounded textual reasoning. GRACE covers CoT traces from 10 models across 4 source datasets, with each step annotated for faithfulness, error category, and natural language explanation. A data-driven taxonomy, discovered bottom-up via unsupervised clustering, organizes failures into two tracks: GRACE-Inference (deductive errors) and GRACE-Grounding (factual grounding errors), with four categories each. The evaluation set is human-annotated and challenging by design. Our experiments reveal substantial headroom for current models. In addition, integrating step-level faithfulness signals into reinforcement learning pipelines improves both downstream accuracy and reasoning reliability.
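As a rough illustration of what a step-level annotation could look like, the following Python sketch models one CoT trace in which each step carries a faithfulness label, an error category, and an explanation. It is not the benchmark's actual schema or scoring code: the class and field names, the two helper functions, and the example trace are all assumptions made for illustration, and the category strings are placeholders because the abstract does not name the eight categories.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Optional


class Track(Enum):
    """The two taxonomy tracks named in the abstract."""
    INFERENCE = "GRACE-Inference"  # deductive errors
    GROUNDING = "GRACE-Grounding"  # factual grounding errors


@dataclass
class StepAnnotation:
    """Hypothetical record for one annotated reasoning step."""
    text: str
    faithful: bool
    track: Optional[Track] = None   # None when the step is faithful
    category: Optional[str] = None  # placeholder label, not a real category name
    explanation: str = ""


def first_unfaithful_step(steps: List[StepAnnotation]) -> Optional[int]:
    """Return the 0-based index of the first unfaithful step, or None."""
    for index, step in enumerate(steps):
        if not step.faithful:
            return index
    return None


def faithful_fraction(steps: List[StepAnnotation]) -> float:
    """Return the share of steps labeled faithful (0.0 for an empty trace)."""
    if not steps:
        return 0.0
    return sum(step.faithful for step in steps) / len(steps)


# Invented example trace: the second step misreads the context.
trace = [
    StepAnnotation("The contract was signed in 2019.", faithful=True),
    StepAnnotation(
        "It therefore expired in 2020.",
        faithful=False,
        track=Track.GROUNDING,
        category="placeholder-category",
        explanation="The context states a three-year term.",
    ),
    StepAnnotation("So the contract is no longer active.", faithful=True),
]

print(first_unfaithful_step(trace))
print(faithful_fraction(trace))
```

A response-level detector would flag this trace only as a whole. Step-level labels additionally locate the failure (the second step) and attach a track and category to it, which is the distinction the abstract draws.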