Adaptive and Explicit safe: Triggering Latent Safety Awareness in Large Reasoning Models

2026-06-15 · Artificial Intelligence

AI summary

The authors found that large reasoning models (LRMs) can recognize that a query is unsafe when they are shown the original query together with their own step-by-step reasoning. Building on this, they train the models to emit a safe tag and respond safely to unsafe queries while leaving answers to ordinary queries unchanged, using supervised fine-tuning followed by preference optimization on purely model-generated data. The method substantially lowers the success rate of harmful and jailbreak prompts, with almost no effect on general performance. This suggests that LRMs have a latent safety awareness that can be used to improve AI safety without heavy manual labeling.

Large Reasoning Models, Safety Alignment, Jailbreaks, Supervised Fine-Tuning, Direct Preference Optimization, Latent Safety Awareness, Attack Success Rate, Harmful Queries, AI Safety, Model-generated Data
Authors
Ke Miao, Jiaxin Li, Hongliang Chen, Yuke Hu, Zhan Qin
Abstract
While Large Reasoning Models (LRMs) excel at complex tasks, they remain highly vulnerable to sophisticated jailbreaks and direct harmful queries. To address this vulnerability, prior works depend heavily on external manual data annotation for safety alignment. However, we observe that LRMs can inherently identify safety risks when re-presented with the original queries alongside their own reasoning trajectories, a capability we term Latent Safety Awareness. To leverage this awareness, we first employ Supervised Fine-Tuning (SFT) so that, for unsafe queries, the model explicitly emits safe tags after its initial reasoning content, triggering safety analysis and guidance; for general queries, standard responses are preserved, which keeps the triggering adaptive. Subsequently, we apply Direct Preference Optimization (DPO) to further enhance the correctness and stability of the safety analysis and guidance. Notably, the responses required for both training stages are generated entirely by the models being optimized. With Safe Trigger SFT and DPO, experimental results demonstrate significant safety enhancement. For example, the Attack Success Rate (ASR) of DeepSeek-R1-Distill-Llama-8B drops, on average, by 24.65% and 36.72% on harmful and jailbreak benchmarks, respectively. Finally, our Safe Trigger method exerts almost no negative impact on general performance or user experience.
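To make the two quantities named in the abstract concrete, the following is a minimal sketch, not the authors' implementation. It shows the standard DPO loss for a single preference pair and a plain Attack Success Rate computation. The function names, the `beta` default, and the use of per-prompt boolean judgements are illustrative assumptions; the abstract does not specify how attack success is judged or which hyperparameters are used.

```python
import math


def _softplus(x: float) -> float:
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def dpo_loss(policy_chosen_logp: float, policy_rejected_logp: float,
             ref_chosen_logp: float, ref_rejected_logp: float,
             beta: float = 0.1) -> float:
    """Standard DPO loss for one (chosen, rejected) response pair.

    Inputs are sequence log-probabilities under the policy being trained
    and under a frozen reference model. beta=0.1 is an assumed default,
    not a value taken from the paper.
    """
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    logits = beta * (chosen_margin - rejected_margin)
    # -log(sigmoid(logits)) is the same as softplus(-logits)
    return _softplus(-logits)


def attack_success_rate(attack_succeeded: list[bool]) -> float:
    """Fraction of attack prompts judged to have elicited harmful output.

    How each judgement is produced (human or automated judge) is left
    open here; this helper only aggregates the results.
    """
    if not attack_succeeded:
        raise ValueError("need at least one judged prompt")
    return sum(attack_succeeded) / len(attack_succeeded)
```

In this sketch the chosen response would be the one with correct safety analysis and guidance and the rejected response a less correct one, both sampled from the model itself, as the abstract describes for the DPO stage. The loss shrinks as the policy assigns relatively more probability to the chosen response than the reference model does.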