Surrogate Fidelity: When Can Open LLMs Explain Closed Ones?
2026-06-30 • Machine Learning
AI summary
The authors study whether understanding open language models can help us interpret closed (private) models when we have only limited access to their outputs. They test how well conclusions drawn from open models hold for closed ones at three levels: what the models predict, why they make those predictions (attributions), and how they represent information internally. They find that while models often agree on the final answers, they tend to disagree about the reasons behind those answers. Insight into open models therefore does not reliably transfer to closed models, and matching predictions alone is not enough to understand their inner workings.
mechanistic interpretability, language models, log-odds, attribution, model representation, white-box signals, black-box input ablations, prediction fidelity, causal attribution, API access
Authors
Philippe Chlenski, Zachariah Carmichael, Ayush Warikoo, Chia-Tse Shao, Yingxiao Ye, Aobo Yang, Vivek Miglani, Nehal Bandi
Abstract
Mechanistic interpretability (MI) requires full access to model internals, yet the APIs for most widely deployed language models at best expose log-probabilities over output tokens. This creates a surrogate problem: when do measurements made on open models allow us to make claims about a closed model? We evaluate surrogate fidelity at the prediction, attribution, and representation levels. For binary classification tasks, log-odds provide an API-compatible scalar readout of the model's representation space, and leave-one-out attributions provide insight into model behavior. Across eleven models spanning four families (Llama, Qwen, GPT, and Gemini), we find that prediction fidelity substantially overstates attribution fidelity: models that agree on what the answer is often disagree on why. We document an access-validity inversion: white-box signals like attention patterns and perturbation magnitudes are highly stable across models but only weakly predictive of causal attributions, which black-box input ablations capture by design. Mechanistic insight does not automatically transfer to closed targets, and prediction-level agreement is insufficient to warrant such transfer. Code and results are available at https://github.com/facebookresearch/surrogate.
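To make the black-box quantities in the abstract concrete, here is a minimal sketch of a log-odds readout and leave-one-out attributions. It is not the authors' implementation. The two scorers (`toy_model_a`, `toy_model_b`) are hypothetical stand-ins for API calls that return log-probabilities, and the Pearson correlation is an assumed choice of agreement measure, used here only for illustration.

```python
import math
from typing import Callable, Dict, List, Sequence

Scorer = Callable[[Sequence[str]], float]


def log_odds(logp_pos: float, logp_neg: float) -> float:
    """Scalar readout for a binary task: log p(pos) - log p(neg).

    Both inputs are log-probabilities of the two label tokens, which is
    the kind of quantity a log-probability API could expose.
    """
    return logp_pos - logp_neg


def leave_one_out(tokens: Sequence[str], score: Scorer) -> List[float]:
    """Attribution of token i = score(full input) - score(input without token i)."""
    tokens = list(tokens)
    base = score(tokens)
    return [base - score(tokens[:i] + tokens[i + 1:]) for i in range(len(tokens))]


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation; returns 0.0 if either vector is constant."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / math.sqrt(vx * vy)


def make_toy_scorer(weights: Dict[str, float]) -> Scorer:
    """Hypothetical model: its log-odds are a sum of per-token weights."""
    return lambda tokens: sum(weights.get(t, 0.0) for t in tokens)


# Two hypothetical models that reach the same log-odds for different reasons.
toy_model_a = make_toy_scorer({"great": 2.0})
toy_model_b = make_toy_scorer({"great": 0.5, "movie": 1.5})

tokens = ["a", "great", "movie"]
score_a, score_b = toy_model_a(tokens), toy_model_b(tokens)
attr_a = leave_one_out(tokens, toy_model_a)
attr_b = leave_one_out(tokens, toy_model_b)

same_prediction = (score_a > 0) == (score_b > 0)
attribution_agreement = pearson(attr_a, attr_b)
```

In this toy case both scorers predict the positive class with the same log-odds, yet their leave-one-out attributions are negatively correlated. This is a small-scale picture of prediction-level agreement coexisting with attribution-level disagreement.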