Closure-Validated Circuit Discovery in Attention Heads: Co-activation Proposes, Ablation Disposes

2026-06-08

Machine Learning, Artificial Intelligence
AI summary

The authors ask whether groups of attention heads in large language models, found by clustering heads that activate together, are genuine functional circuits. They test this by ablating each group and comparing the per-example damage to that caused by ablating matched random groups. In two dense models (Pythia 1B, OLMo 1B), the groups pass this test. In a Mixture-of-Experts model (OLMoE-1B-7B), the clustering signal is statistically real, but ablating the groups improves loss instead of degrading it. The authors argue that co-activation clustering only proposes candidate circuits, and that ablation tests are needed to confirm them.

interpretability, attention heads, clustering, co-activation, ablation, closure test, Mixture-of-Experts, language models, circuit analysis
Authors
Yongzhong Xu
Abstract
Interpretability increasingly treats groups of components, not individual units, as the basic object, and proposes to find them by clustering co-activation statistics. We ask whether such a cheap signal actually identifies an attention-head circuit. Adapting a sparse-autoencoder clustering recipe to attention heads -- but validating by causal ablation rather than reconstruction -- we cluster heads and then run a closure test: ablate the discovered community and compare per-example damage to matched-random controls. Across two dense 1B-scale models (Pythia 1B, OLMo 1B) and two input distributions, the communities pass closure. In a Mixture-of-Experts model (OLMoE-1B-7B), route-conditional clustering recovers a statistically real signal that nonetheless does not survive closure -- ablation improves loss, the wrong direction. Extending closure across training, attention-target selectivity and participation ratio decouple from function in both directions. We conclude that a cheap signal is a circuit proposal, not a confirmed circuit; closure is what separates them.
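
The closure test described in the abstract can be illustrated with a minimal sketch. This is not the paper's implementation: the abstract does not specify the test statistic or how controls are matched, so the sketch assumes size-matched random head sets drawn from outside the community, mean per-example loss increase as the damage measure, and an empirical p-value with a 0.05 threshold. The names `closure_test` and `toy_loss` are hypothetical, and `toy_loss` is a stand-in for a real model forward pass with heads masked out.

```python
import numpy as np


def closure_test(loss_fn, n_heads, community, n_controls=200, seed=0):
    """Compare damage from ablating `community` with size-matched random controls.

    loss_fn(mask) must return a per-example loss array, where `mask` is a
    boolean array of length n_heads and True means the head is kept.
    Assumption: controls are sampled from heads outside the community.
    """
    rng = np.random.default_rng(seed)
    community = np.asarray(community)
    keep_all = np.ones(n_heads, dtype=bool)
    baseline = loss_fn(keep_all)

    def mean_damage(heads):
        mask = keep_all.copy()
        mask[heads] = False  # ablate the chosen heads
        return float((loss_fn(mask) - baseline).mean())

    community_damage = mean_damage(community)
    outside = np.setdiff1d(np.arange(n_heads), community)
    control_damage = np.array([
        mean_damage(rng.choice(outside, size=len(community), replace=False))
        for _ in range(n_controls)
    ])

    # Empirical p-value: share of controls at least as damaging as the community.
    p_value = (1 + np.sum(control_damage >= community_damage)) / (1 + n_controls)
    # Passing requires damage in the right direction (loss goes up) and
    # damage that exceeds what random head sets of the same size cause.
    passes = bool(community_damage > 0 and p_value < 0.05)
    return {
        "community_damage": community_damage,
        "control_mean": float(control_damage.mean()),
        "p_value": float(p_value),
        "passes": passes,
    }


def toy_loss(mask, n_examples=64):
    """Hypothetical toy model: heads 0-3 matter a lot, all others only slightly."""
    noise = np.random.default_rng(1).normal(0.0, 0.01, size=n_examples)
    cost = np.full(len(mask), 0.05)
    cost[:4] = 1.0
    return 2.0 + cost[~mask].sum() + noise


result = closure_test(toy_loss, n_heads=16, community=[0, 1, 2, 3])
print(result)
```

Under these assumptions, a community whose ablation lowers loss, as reported for the OLMoE-1B-7B case, fails the `community_damage > 0` check regardless of how statistically distinctive its co-activation signal is.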