Cross-Family Universality of Behavioral Axes via Anchor-Projected Representations

2026-05-11

Artificial Intelligence
AI summary

The authors propose a method for comparing and transferring behaviors between different language models, which normally have incompatible internals. They map each model's hidden representations into a shared space where common directions representing behaviors can be identified. By averaging these directions, they obtain a universal direction that can be applied to new models without extra training. Their experiments show the approach works well across several popular model families and tasks, indicating that some behavioral traits are shared and transferable. This makes it easier to understand and steer different models without starting from scratch each time.

large language models, hidden representations, tokenizer, instruction tuning, behavioral directions, anchor coordinate space, cross-model transfer, AUROC, refusal rate, model interpretability
Authors
Su-Hyeon Kim, Yo-Sub Han
Abstract
Large language models from different families use different hidden dimensions, tokenizers, and training procedures, making behavioral directions difficult to compare or transfer across models. We introduce an anchor-projection framework that maps hidden representations from each model into a shared anchor coordinate space (ACS). Behavioral directions extracted from source models are projected into ACS and averaged into a canonical direction. For a new model, the canonical direction is reconstructed into its native hidden space using only anchor activations, without fine-tuning or target-specific direction extraction. We evaluate five instruction-tuned model families and ten behavioral axes. We find that same-axis directions align tightly across the Llama-Qwen-Mistral-Phi (LQMP) cluster in ACS. This shared structure transfers to downstream tasks. For the aligned LQMP cluster, held-out targets achieve 0.83 ten-way detection accuracy and 0.95 mean binary AUROC, while canonical steering induces refusal-rate shifts of up to +0.46% under distribution shift. Sensitivity analyses show that two source models and small anchor pools already suffice to approximate transferable directions. Overall, ACS provides a novel perspective on cross-family interpretability, revealing that representation-level transfer remains robust across model families.
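The abstract stays at a high level, so the following is a minimal sketch of one plausible reading of the pipeline it describes: project each source model's behavioral direction into ACS via that model's anchor activations, average the resulting coordinates into a canonical direction, and reconstruct a target-space direction by least squares over the target model's anchor activations. All function names, the dot-product projection, and the least-squares reconstruction are illustrative assumptions, not the authors' released code.

```python
import numpy as np

def anchor_coords(direction, anchor_acts):
    """Project a model-specific direction into anchor coordinate space.

    direction:   (d,) behavioral direction in the model's hidden space.
    anchor_acts: (K, d) hidden activations for K shared anchor prompts.
    Returns a K-dim coordinate vector: the direction's similarity to
    each (row-normalized) anchor activation.
    """
    A = anchor_acts / np.linalg.norm(anchor_acts, axis=1, keepdims=True)
    v = direction / np.linalg.norm(direction)
    return A @ v  # shape (K,)

def canonical_direction(coords_per_model):
    """Average per-model ACS coordinates into one canonical direction."""
    c = np.mean(np.stack(coords_per_model), axis=0)
    return c / np.linalg.norm(c)

def reconstruct(canonical, target_anchor_acts):
    """Map the canonical ACS direction back into a target model's hidden
    space: solve A_t v ~= c for v by least squares, where A_t holds the
    target's (row-normalized) anchor activations."""
    A = target_anchor_acts / np.linalg.norm(target_anchor_acts, axis=1,
                                            keepdims=True)
    v, *_ = np.linalg.lstsq(A, canonical, rcond=None)
    return v / np.linalg.norm(v)

# Toy usage with random data (shapes only; real use would take a layer's
# activations over a shared anchor prompt set). Dimensions are made up.
rng = np.random.default_rng(0)
K = 64                                   # anchor pool size
src_dims, tgt_dim = [4096, 3584], 3072   # two sources, one held-out target
anchors = [rng.normal(size=(K, d)) for d in src_dims]
dirs = [rng.normal(size=d) for d in src_dims]
c = canonical_direction([anchor_coords(v, A) for v, A in zip(dirs, anchors)])
v_target = reconstruct(c, rng.normal(size=(K, tgt_dim)))  # (tgt_dim,)
```

Note that the reconstruction step uses only the target model's activations on the shared anchor prompts, which matches the abstract's claim that no fine-tuning or target-specific direction extraction is required; the abstract's finding that two source models and small anchor pools suffice is an empirical result of the paper, not something this sketch guarantees.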