AI summary
The authors developed a method called Universal Activation Verbalizer (UAV) that explains the hidden activations of language models by translating them into natural language. Unlike previous approaches, in which each model explained only its own activations, UAV uses a single shared decoder to explain activations from different models, including models from different families and of different sizes. UAV learns a small adapter that converts a model's activations into a format the shared decoder can read, and a further model can be supported by training only a new adapter. The authors found that UAV remains competitive with self-explanation baselines on classification, fact retrieval, and gist summarization, and that tuning the decoder mainly improves task behavior, while the adapter supplies the activation-grounded information that makes the explanations faithful.
Activation verbalization, Natural language processing, Neural activations, Transfer learning, Adapters, Decoder, LoRA, Cross-model explanation, Classification tasks, Fact retrieval
Authors
Haiyan Zhao, Zirui He, Guanchu Wang, Ali Payani, Yingcong Li, Mengnan Du
Abstract
Activation verbalization explains hidden representations in natural language, but existing methods are mostly limited to self-explanation, where each model explains only its own activations. We introduce Universal Activation Verbalizer (UAV), a framework that uses a shared decoder to explain activations from heterogeneous donor models. UAV learns a lightweight adapter that converts donor activations into soft tokens in the decoder's embedding space, and further supports adapter-only transfer by reusing a frozen decoder-side LoRA while training only a new adapter for another donor. Across classification, fact retrieval, and gist summarization, UAV remains competitive with strong self-explanation baselines while enabling cross-model verbalization across model families and scales. Ablations show that decoder-side tuning mainly improves task behavior, whereas the adapter provides the activation-grounded factual and semantic information needed for faithful explanations.
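To make the adapter idea concrete, the following is a minimal sketch of how donor activations of different widths could be mapped into one decoder's embedding space. It is an illustration under stated assumptions, not the authors' implementation: the abstract does not specify the adapter architecture, so the single linear map, the number of soft tokens, the choice to prepend soft tokens to prompt embeddings, and all names (`init_adapter`, `adapt`, `build_decoder_inputs`) are hypothetical.

```python
import numpy as np

def init_adapter(d_donor, d_dec, n_soft, seed=0):
    """Create a hypothetical linear adapter: one donor activation -> n_soft soft tokens."""
    rng = np.random.default_rng(seed)
    # Assumption: a single linear layer; the paper only says "lightweight adapter".
    W = rng.normal(0.0, 0.02, size=(d_donor, n_soft * d_dec))
    b = np.zeros(n_soft * d_dec)
    return {"W": W, "b": b, "n_soft": n_soft, "d_dec": d_dec}

def adapt(adapter, h):
    """Map a donor activation h of shape (d_donor,) to soft tokens of shape (n_soft, d_dec)."""
    flat = h @ adapter["W"] + adapter["b"]
    return flat.reshape(adapter["n_soft"], adapter["d_dec"])

def build_decoder_inputs(soft_tokens, prompt_embeds):
    """Assumption: soft tokens are prepended to the decoder's prompt embeddings."""
    return np.concatenate([soft_tokens, prompt_embeds], axis=0)

# Two donors with different hidden sizes share one decoder embedding width.
d_dec = 16
adapter_a = init_adapter(d_donor=32, d_dec=d_dec, n_soft=4, seed=0)
adapter_b = init_adapter(d_donor=48, d_dec=d_dec, n_soft=4, seed=1)

rng = np.random.default_rng(42)
prompt = rng.normal(size=(5, d_dec))  # stand-in for embedded prompt tokens

inputs_a = build_decoder_inputs(adapt(adapter_a, rng.normal(size=32)), prompt)
inputs_b = build_decoder_inputs(adapt(adapter_b, rng.normal(size=48)), prompt)
print(inputs_a.shape, inputs_b.shape)  # both are (9, 16)
```

In this sketch, only the adapter depends on the donor's hidden size, so the decoder sees inputs of the same shape regardless of which donor produced the activation. That is the property adapter-only transfer relies on: the decoder side, including its LoRA weights, can stay frozen while a new adapter is trained for another donor.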