From Refusal Geometry to Safety Geometry: Harmfulness--Refusal Coupling under Dynamic Adversarial Fine-Tuning
2026-06-15 • Cryptography and Security
AI summary
The authors study how language models learn to refuse harmful requests without losing the ability to answer benign ones. They introduce a protocol that measures whether a model recognizes harmfulness, whether it activates a refusal policy, and how strongly the two are coupled. Comparing SFT and R2D2 fine-tuning of Mistral-7B-v0.1, they find that early R2D2 checkpoints couple harmfulness and refusal tightly: the model is robust to attacks but also refuses safe prompts and loses benign utility. Later checkpoints loosen the coupling and recover some utility, but attacks succeed again. SFT also reaches low coupling yet remains less robust, which shows that weak coupling alone does not guarantee safety.
language models, safety alignment, harmfulness, refusal policy, instruction tuning, fine-tuning, Mistral-7B, R2D2, robustness, adversarial training
Authors
Wenhao Lan, Shan Li, Xinhua Lai, Meiqi Wu, Junbin Yang, Haihua Shen, Yijun Yang
Abstract
Safety alignment requires language models to refuse harmful requests without losing the ability to answer benign ones. Existing robustness evaluations, however, do not reveal whether a model has learned to recognize harmfulness, to activate a refusal policy, or to couple these two processes. We study this question with a dual safety-geometry protocol that measures harmfulness carriers, refusal carriers, and their coupling across aligned instruction-tuned anchors and matched Mistral-7B-v0.1 SFT/R2D2 training trajectories. The aligned anchors validate the protocol: refusal-side interventions reopen attack success more strongly than harmfulness-only interventions, while harmfulness and refusal carriers remain nearly orthogonal. Along the Mistral trajectory, R2D2 exhibits a high-coupling early phase with strong fixed-source robustness, saturated safe-prompt refusal, and collapsed benign utility. Later checkpoints move to a lower-coupling regime with partial utility recovery and reopened attack success. SFT provides an important contrast: it also reaches low coupling, but remains substantially less robust, showing that low coupling alone is not a safety guarantee. All-anchor diagnostics and sparse GCG/AutoDAN transfer experiments further show that H/R coupling is informative in the R2D2 regime, whereas SFT transfer is better summarized by drift or behavior-state measures. Causal sweeps support fixed-protocol sensitivity relative to matched unit-direction controls, but do not establish independent harmfulness and refusal pathways. These results frame harmfulness--refusal coupling as an operational diagnostic for safety-geometry dynamics under adversarial fine-tuning.
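The abstract does not say how harmfulness carriers, refusal carriers, or their coupling are computed. As a rough illustration only, the sketch below assumes each carrier is a single difference-of-means direction in activation space and reads coupling as the cosine similarity between the two directions, so a value near zero corresponds to the "nearly orthogonal" case. The function names, the synthetic activations, and the projection-based `ablate` intervention are all hypothetical stand-ins, not the authors' protocol.

```python
import numpy as np


def mean_difference_direction(pos, neg):
    """Unit vector from the mean of `neg` to the mean of `pos` (assumed carrier)."""
    d = pos.mean(axis=0) - neg.mean(axis=0)
    return d / np.linalg.norm(d)


def coupling(h_dir, r_dir):
    """Cosine similarity between two directions; near 0 means near-orthogonal."""
    denom = np.linalg.norm(h_dir) * np.linalg.norm(r_dir)
    return float(np.dot(h_dir, r_dir) / denom)


def ablate(acts, direction):
    """Remove the component of every activation row along `direction`."""
    u = direction / np.linalg.norm(direction)
    return acts - np.outer(acts @ u, u)


# Synthetic stand-in activations (hypothetical; a real run would read
# hidden states from the model at a chosen layer and token position).
rng = np.random.default_rng(0)
dim = 16
e_h = np.zeros(dim)
e_h[0] = 1.0  # axis along which "harmful" prompts are shifted
e_r = np.zeros(dim)
e_r[1] = 1.0  # axis along which "refused" prompts are shifted


def noise(n):
    return 0.1 * rng.normal(size=(n, dim))


harmful_acts = noise(50) + 3.0 * e_h
benign_acts = noise(50)
refused_acts = noise(50) + 3.0 * e_r
complied_acts = noise(50)

h_dir = mean_difference_direction(harmful_acts, benign_acts)
r_dir = mean_difference_direction(refused_acts, complied_acts)
hr_coupling = coupling(h_dir, r_dir)

# A refusal-side intervention in this toy setting: project out r_dir.
refused_without_r = ablate(refused_acts, r_dir)

print(f"H/R coupling (cosine): {hr_coupling:.3f}")
```

Because the two synthetic shifts lie on different axes, the toy coupling comes out close to zero. Tracking the same quantity across checkpoints is one way the high-coupling and low-coupling regimes described in the abstract could be read off, under the assumptions stated above.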