IG-Lens: Exact Additive Probability Attribution Across Transformer Layers via Telescoping Integrated Gradients

2026-06-29

Machine Learning
AI summary

The authors study how to identify which layers of a decoder-only transformer produce the probability of a predicted token. Existing methods either report a per-layer probability estimate without an additive decomposition, or are additive only in logit space, before the softmax, which is not the probability itself. They propose IG-Lens, which applies Integrated Gradients along a single path through the hidden states with the softmax kept inside the path, so that each layer receives an exact share of the target-token probability. The per-layer attributions sum exactly to the total change in probability and can be computed in a single batched pass.

decoder-only transformers, token prediction, Integrated Gradients, softmax, logits, layer attribution, gradient methods, additivity, probability decomposition
Authors
Duc Anh Nguyen
Abstract
We ask a simple question about decoder-only transformers: \emph{between which two layers is the probability of a predicted token actually produced?} Existing layer-wise readout tools answer only approximately. The logit lens and its trained variant report a per-layer \emph{level} of probability but give no additive decomposition; their estimates are biased and non-monotone across depth. Direct Logit Attribution and related residual-stream methods are additive, but only in \emph{logit} space -- the softmax nonlinearity breaks additivity in probability space, precisely the quantity one usually cares about. Layer Conductance integrates gradients per layer, but attributes each to its own baseline and so does not sum to the total change in prediction. We introduce \textbf{IG-Lens}, a telescoping application of Integrated Gradients along a single path through the hidden states from a baseline to the final layer. Crediting each segment to the layer it terminates at yields a layer-wise attribution whose sum is \emph{exactly} the change in target probability, with the softmax inside the integration path rather than linearized away. Our default estimator credits each integration step its \emph{observed} change in target probability -- a prediction-aware reweighting in the spirit of IDGI -- rather than its raw gradient. Because the readout is a one-dimensional probability, this collapses each segment to a telescoping sum of endpoint values, so completeness holds exactly (to floating point) at \emph{any} step count, removing Riemann discretization error while suppressing steps that show gradient sensitivity without a change in output. We give the telescoping identity and its proof, verify completeness to floating point, and describe a single-pass batched implementation computing the full token-by-layer map without any backward call. Code: https://github.com/anhnda/IGLens.
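
The telescoping argument in the abstract can be illustrated with a small toy sketch. This is not the authors' implementation (see the linked repository for that); it is a minimal illustration under several assumptions that are not taken from the paper: the readout is a plain softmax over a random unembedding matrix `W` with no final layer norm, the baseline is an all-zeros vector, each segment is a straight line between consecutive hidden states, and the hidden states themselves are random. All names (`target_prob`, `layer_attributions`, and so on) are hypothetical. The observed-change estimator credits each step with its change in target probability, so each layer's sum collapses to the difference of the segment's endpoint values. A midpoint Riemann sum of raw gradients is included only for contrast.

```python
import math
import random

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def target_prob(h, W, target):
    # Toy readout (assumption): p(h) = softmax(W h)[target], no final layer norm.
    logits = [sum(w * x for w, x in zip(row, h)) for row in W]
    return softmax(logits)[target]

def grad_target_prob(h, W, target):
    # Analytic gradient of the toy readout: p_t * (W[t] - sum_v p_v * W[v]).
    logits = [sum(w * x for w, x in zip(row, h)) for row in W]
    probs = softmax(logits)
    mean_row = [sum(p * row[j] for p, row in zip(probs, W)) for j in range(len(h))]
    return [probs[target] * (W[target][j] - mean_row[j]) for j in range(len(h))]

def lerp(a, b, alpha):
    # Point on the straight line from a to b.
    return [x + alpha * (y - x) for x, y in zip(a, b)]

def layer_attributions(states, W, target, n_steps=4):
    """states[0] is the baseline; states[l] is the hidden state after layer l.

    Returns two lists with one entry per layer:
      observed: sum of per-step changes in target probability (forward passes only)
      riemann:  midpoint Riemann sum of gradient . displacement (for contrast)
    """
    observed, riemann = [], []
    for prev, curr in zip(states, states[1:]):
        delta = [c - p for p, c in zip(prev, curr)]
        obs, rie = 0.0, 0.0
        p_prev = target_prob(prev, W, target)
        for k in range(1, n_steps + 1):
            # Observed change over step k; these terms telescope within the segment.
            p_k = target_prob(lerp(prev, curr, k / n_steps), W, target)
            obs += p_k - p_prev
            p_prev = p_k
            # Raw-gradient estimate at the step midpoint.
            g = grad_target_prob(lerp(prev, curr, (k - 0.5) / n_steps), W, target)
            rie += sum(gi * di for gi, di in zip(g, delta)) / n_steps
        observed.append(obs)
        riemann.append(rie)
    return observed, riemann

# Hypothetical toy setup: random unembedding and random residual updates.
random.seed(0)
d_model, vocab, n_layers, target = 8, 5, 4, 2
W = [[random.gauss(0, 1) for _ in range(d_model)] for _ in range(vocab)]
states = [[0.0] * d_model]  # all-zeros baseline (assumption)
for _ in range(n_layers):
    states.append([x + random.gauss(0, 0.5) for x in states[-1]])

total_change = target_prob(states[-1], W, target) - target_prob(states[0], W, target)
for n in (1, 4, 64):
    observed, riemann = layer_attributions(states, W, target, n_steps=n)
    # Completeness gap of each estimator at step count n.
    print(n, abs(sum(observed) - total_change), abs(sum(riemann) - total_change))
```

In this sketch the observed-change attributions need only forward evaluations of the readout, and their sum matches the total change in target probability up to floating-point rounding at every step count, including a single step. The raw-gradient Riemann sum approaches the same total only as the step count grows.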