SafeCtrl-RL: Inference-Time Adaptive Behaviour Control for LLM Dialogue via RL-Driven Prompt Optimisation
2026-05-25 • Computation and Language
Computation and Language, Artificial Intelligence
AI summary
The authors developed SafeCtrl-RL, a method to help large language models avoid unsafe or harmful responses without changing the model itself. Instead, their system adjusts the prompts the model sees during conversations, using a reinforcement learning agent that chooses each adjustment based on the dialogue context. This process helps the model 'unlearn' unsafe behaviors in real time as it generates replies. They tested this approach on different models and unsafe scenarios and found that it improved safety and response quality more than other prompt-based optimisation methods.
Large Language Models, Reinforcement Learning, Prompt Engineering, Inference-time Control, Behavioral Safety, Dialogue Systems, Sequential Decision Process, Model Unlearning, Response Quality, Unsafe Behavior Mitigation
Authors
Michael Orme, Yanchao Yu, Zhiyuan Tan
Abstract
Ensuring safe and contextually appropriate behaviour in Large Language Models (LLMs) remains a critical challenge for real-world deployment. We present SafeCtrl-RL, an inference-time behavioural control framework that enables adaptive safety regulation without model retraining or parameter modification. The method formulates dialogue generation as a sequential decision process, where a reinforcement learning agent dynamically selects prompt adjustment strategies based on contextual feedback. This allows unsafe behaviours to be suppressed through iterative refinement, which we conceptualise as inference-time behavioural unlearning. Evaluated across multiple LLMs and unsafe dialogue scenarios, SafeCtrl-RL consistently improves safety and response quality, outperforms existing prompt-based optimisation methods, and achieves favourable performance-efficiency trade-offs. Warning: This paper may contain examples of harmful language, and reader discretion is recommended.
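To make the sequential-decision framing concrete, the following is a minimal sketch of an agent that picks a prompt adjustment strategy from contextual feedback and learns from a reward signal. It is not the authors' implementation: the abstract does not specify the RL algorithm, the action set, the state representation, or the reward. Tabular Q-learning with epsilon-greedy exploration is swapped in for illustration, and the strategy names, the state labels, and the `toy_feedback` function (a stand-in for running an LLM and scoring its reply) are all hypothetical.

```python
import random
from collections import defaultdict

# Hypothetical prompt-adjustment strategies; the abstract does not list the real action set.
STRATEGIES = ["none", "add_safety_reminder", "rephrase_request", "add_refusal_guidance"]


class PromptStrategyAgent:
    """Tabular Q-learning over (state, strategy) pairs with epsilon-greedy exploration."""

    def __init__(self, actions, alpha=0.5, gamma=0.9, epsilon=0.1, seed=0):
        self.actions = list(actions)
        self.alpha = alpha        # learning rate
        self.gamma = gamma        # discount factor
        self.epsilon = epsilon    # exploration probability
        self.rng = random.Random(seed)
        self.q = defaultdict(float)  # (state, action) -> estimated value

    def select(self, state):
        # Explore with probability epsilon, otherwise take the highest-valued strategy.
        if self.rng.random() < self.epsilon:
            return self.rng.choice(self.actions)
        return max(self.actions, key=lambda a: self.q[(state, a)])

    def update(self, state, action, reward, next_state, done):
        # Standard one-step Q-learning update.
        best_next = 0.0 if done else max(self.q[(next_state, a)] for a in self.actions)
        target = reward + self.gamma * best_next
        self.q[(state, action)] += self.alpha * (target - self.q[(state, action)])


def toy_feedback(state, action):
    """Illustrative stand-in for 'apply strategy, query the LLM, score the reply'.

    Returns (reward, next_state, done). In this toy world only one strategy
    turns an unsafe reply into a safe one.
    """
    if state == "unsafe_reply" and action == "add_safety_reminder":
        return 1.0, "safe_reply", True
    return 0.0, "unsafe_reply", False


def train(episodes=200, max_turns=5):
    agent = PromptStrategyAgent(STRATEGIES)
    for _ in range(episodes):
        state = "unsafe_reply"
        for _ in range(max_turns):  # iterative refinement within one dialogue
            action = agent.select(state)
            reward, next_state, done = toy_feedback(state, action)
            agent.update(state, action, reward, next_state, done)
            state = next_state
            if done:
                break
    return agent


if __name__ == "__main__":
    agent = train()
    agent.epsilon = 0.0  # act greedily after training
    print(agent.select("unsafe_reply"))
```

The underlying model is never modified in this sketch; only the choice of prompt adjustment changes, which mirrors the inference-time framing in the abstract. A realistic system would replace `toy_feedback` with an actual model call plus a safety and quality scorer, and would likely need a richer state than a single label.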