ADAG: Automatically Describing Attribution Graphs

2026-04-08 | Computation and Language

AI summary

The authors developed ADAG, a fully automated pipeline for describing attribution graphs, which trace the internal features of a language model that causally contribute to a specific output. Unlike previous circuit-tracing work, which relied on humans manually interpreting the role of each feature, ADAG uses attribution profiles that characterize each feature's functional role through its input and output gradient effects, then groups features using a new clustering method. It also employs an LLM explainer-simulator setup to generate and score natural-language explanations for these feature groups. They tested ADAG on circuit-tracing tasks previously analysed by humans and showed it recovers interpretable circuits, including steerable clusters responsible for a harmful-advice jailbreak in Llama 3.1 8B Instruct.

circuit tracing, language models, attribution profiles, gradient effects, clustering algorithm, feature attribution, LLM explainer, interpretability, Llama 3.1, model debugging
Authors
Aryaman Arora, Zhengxuan Wu, Jacob Steinhardt, Sarah Schwettmann
Abstract
In language model interpretability research, circuit tracing aims to identify which internal features causally contributed to a particular output and how they affected each other, with the goal of explaining the computations underlying some behaviour. However, all prior circuit tracing work has relied on ad-hoc human interpretation of the role that each feature in the circuit plays, via manual inspection of data artifacts such as the dataset examples the component activates on. We introduce ADAG, an end-to-end pipeline for describing these attribution graphs which is fully automated. To achieve this, we introduce attribution profiles which quantify the functional role of a feature via its input and output gradient effects. We then introduce a novel clustering algorithm for grouping features, and an LLM explainer-simulator setup which generates and scores natural-language explanations of the functional role of these feature groups. We run our system on known human-analysed circuit-tracing tasks and recover interpretable circuits, and further show ADAG can find steerable clusters which are responsible for a harmful-advice jailbreak in Llama 3.1 8B Instruct.
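
To make the idea of an attribution profile more concrete, here is a minimal toy sketch in Python. It is not the authors' implementation. It assumes a precomputed matrix `effects` in which `effects[i][j]` is the attribution (gradient effect) of node `i` on node `j`, and it builds each feature's profile by concatenating its incoming and outgoing effects. The abstract does not specify the clustering algorithm, so the greedy cosine-similarity grouping below is a simple stand-in, and the function names, the threshold, and the toy graph are all hypothetical.

```python
import math

def attribution_profile(effects, k):
    """Toy profile for node k: incoming effects (column k) followed by
    outgoing effects (row k), L2-normalised. Hypothetical construction."""
    incoming = [row[k] for row in effects]
    outgoing = list(effects[k])
    vec = incoming + outgoing
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm > 0 else vec

def cosine(u, v):
    """Cosine similarity of two vectors that are already unit length."""
    return sum(a * b for a, b in zip(u, v))

def greedy_cluster(effects, feature_ids, threshold=0.8):
    """Stand-in clustering (NOT the paper's algorithm): a feature joins the
    first cluster whose first member has a similar enough profile."""
    profiles = {k: attribution_profile(effects, k) for k in feature_ids}
    clusters = []
    for k in feature_ids:
        for cluster in clusters:
            if cosine(profiles[k], profiles[cluster[0]]) >= threshold:
                cluster.append(k)
                break
        else:
            clusters.append([k])
    return clusters

# Toy graph: nodes 0-1 are inputs, 2-4 are features, 5 is the output logit.
n = 6
effects = [[0.0] * n for _ in range(n)]
effects[0][2] = 1.0   # input 0 drives feature 2
effects[0][3] = 0.9   # input 0 also drives feature 3
effects[1][4] = 0.8   # input 1 drives feature 4
effects[2][5] = 0.5   # features 2 and 3 both promote the output
effects[3][5] = 0.6
effects[4][5] = -0.7  # feature 4 suppresses the output

clusters = greedy_cluster(effects, feature_ids=[2, 3, 4])
print(clusters)
```

In this toy graph, features 2 and 3 read from the same input and push the output in the same direction, so their profiles are nearly parallel and they are grouped together, while feature 4 has a different source and an opposite effect and ends up in its own group. In a pipeline like the one the abstract describes, each such group would then be passed to an LLM explainer and scored by a simulator; that step is omitted here.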