Detecting Unfaithful Chain-of-Thought via Circuit-Guided Internal-External Discrepancy

2026-05-25 · Artificial Intelligence

AI summary

The authors study how to tell whether the step-by-step reasoning generated by a large language model (LLM) reflects the computation the model actually performs. They note that existing detectors rely mainly on external signals from the generated text, such as plausibility or answer consistency, and overlook the model's internal computation. To address this, they propose CIE-Scorer, which traces compact sentence-level circuits inside the model, builds an internal and an external reasoning graph, and measures the discrepancy between the two with the Fused Gromov-Wasserstein distance. On four datasets from FaithCoT-Bench, the approach achieves state-of-the-art performance while reducing the cost of circuit construction, which suggests that model-internal evidence helps detect unfaithful reasoning.

Chain-of-thought reasoning, Large language models, Faithfulness, Circuit tracing, Mechanistic interpretability, Fused Gromov-Wasserstein distance, Reasoning traces, Internal computation, Unfaithfulness detection, Reasoning graphs
Authors
Xu Shen, Zhen Tan, Song Wang, Pingjun Hong, Rui Miao, Xin Wang, Tianlong Chen
Abstract
Chain-of-thought (CoT) reasoning improves the problem-solving ability of large language models (LLMs), but generated reasoning traces may not faithfully reflect the model's actual decision process. Existing CoT unfaithfulness detectors mainly rely on external signals from generated rationales, such as textual plausibility or answer consistency, while overlooking evidence from the model's internal computation. Although recent circuit tracing methods provide a way to obtain model-internal evidence by tracing how information flows through model components during reasoning, constructing full reasoning circuits for long CoTs is costly and difficult to scale. To address these challenges, we propose Circuit-guided Internal-External Discrepancy Scorer (CIE-Scorer), a framework for instance-level CoT unfaithfulness detection. The key idea is that faithful reasoning traces should align with the model's computational process, whereas unfaithful traces may diverge from it. CIE-Scorer efficiently traces compact sentence-level circuits from informative reasoning tokens, constructs internal and external reasoning graphs, and measures their discrepancy using Fused Gromov-Wasserstein distance. Experiments on four datasets from FaithCoT-Bench show that CIE-Scorer achieves state-of-the-art performance while reducing the cost of circuit construction, demonstrating the effectiveness of combining mechanistic interpretability signals with external reasoning traces for CoT unfaithfulness detection.
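
To make the graph-discrepancy idea concrete, the following is a minimal sketch of a Fused Gromov-Wasserstein (FGW) style score between two small reasoning graphs. It is not the authors' implementation: the function names (`fgw_cost`, `fgw_upper_bound`), the toy graphs, the 0/1 node-feature cost, and the trade-off weight `alpha` are all assumptions made for illustration. For simplicity it searches only over permutation couplings between equal-size graphs with uniform node weights, which yields an upper bound on the FGW distance; a practical solver (for example, one from an optimal transport library) would optimize over soft couplings instead.

```python
from itertools import permutations


def fgw_cost(feat_cost, C1, C2, coupling, alpha=0.5):
    """FGW objective for a fixed coupling, with square loss on structure.

    feat_cost[i][j]: cost of matching node i (graph 1) to node j (graph 2).
    C1, C2: structure matrices (here, adjacency matrices) of the two graphs.
    alpha: weight of the structure term; (1 - alpha) weights the feature term.
    """
    n, m = len(C1), len(C2)
    total = 0.0
    for i in range(n):
        for j in range(m):
            if coupling[i][j] == 0.0:
                continue
            for k in range(n):
                for l in range(m):
                    if coupling[k][l] == 0.0:
                        continue
                    struct = (C1[i][k] - C2[j][l]) ** 2
                    fused = (1 - alpha) * feat_cost[i][j] + alpha * struct
                    total += fused * coupling[i][j] * coupling[k][l]
    return total


def fgw_upper_bound(feat_cost, C1, C2, alpha=0.5):
    """Brute-force minimum of the FGW objective over permutation couplings.

    Assumes equal-size graphs and uniform node weights (illustrative only).
    """
    n = len(C1)
    assert len(C2) == n, "this sketch only handles equal-size graphs"
    best = float("inf")
    for perm in permutations(range(n)):
        coupling = [[0.0] * n for _ in range(n)]
        for i, j in enumerate(perm):
            coupling[i][j] = 1.0 / n
        best = min(best, fgw_cost(feat_cost, C1, C2, coupling, alpha))
    return best


# Hypothetical "internal" graph: a 3-step chain, step 0 -> step 1 -> step 2.
internal_chain = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
# Hypothetical "external" graphs: one matching the chain, one fully connected.
external_chain = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
external_dense = [[0, 1, 1], [1, 0, 1], [1, 1, 0]]
# Toy feature cost: 0 if node i and node j carry the same content, else 1.
same_content = [[0, 1, 1], [1, 0, 1], [1, 1, 0]]

aligned_score = fgw_upper_bound(same_content, internal_chain, external_chain)
divergent_score = fgw_upper_bound(same_content, internal_chain, external_dense)
print(aligned_score, divergent_score)
```

In this toy setup the external graph that mirrors the internal chain scores zero, while the graph with an extra edge scores higher, which matches the intuition in the abstract that a larger internal-external discrepancy signals a less faithful trace.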