SafeCtrl-RL: Inference-Time Adaptive Behaviour Control for LLM Dialogue via RL-Driven Prompt Optimisation

2026-05-25 • Computation and Language

Computation and Language, Artificial Intelligence

AI summary

The authors developed SafeCtrl-RL, a method that helps large language models avoid unsafe or harmful responses without changing the model itself. Instead, the system adjusts the prompts the model sees during a conversation, using a reinforcement learning agent that chooses each adjustment based on the dialogue context. This process helps the model 'unlearn' unsafe behaviours in real time as it generates replies. They tested the approach on several models and unsafe scenarios, finding that it improved safety and response quality more than other prompt-based methods.

Large Language Models, Reinforcement Learning, Prompt Engineering, Inference-time Control, Behavioral Safety, Dialogue Systems, Sequential Decision Process, Model Unlearning, Response Quality, Unsafe Behavior Mitigation

Authors

Michael Orme, Yanchao Yu, Zhiyuan Tan

Abstract

Ensuring safe and contextually appropriate behaviour in Large Language Models (LLMs) remains a critical challenge for real-world deployment. We present SafeCtrl-RL, an inference-time behavioural control framework that enables adaptive safety regulation without model retraining or parameter modification. The method formulates dialogue generation as a sequential decision process, where a reinforcement learning agent dynamically selects prompt adjustment strategies based on contextual feedback. This allows unsafe behaviours to be suppressed through iterative refinement, which we conceptualise as inference-time behavioural unlearning. Evaluated across multiple LLMs and unsafe dialogue scenarios, SafeCtrl-RL consistently improves safety and response quality, outperforms existing prompt-based optimisation methods, and achieves favourable performance-efficiency trade-offs. Warning: This paper may contain examples of harmful language, and reader discretion is recommended.
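As a rough illustration of what "a reinforcement learning agent dynamically selects prompt adjustment strategies based on contextual feedback" could look like, the sketch below uses a stateless epsilon-greedy bandit with optimistic initial values. This is a deliberate simplification of the sequential decision process described in the abstract, not the authors' algorithm. The strategy names, the `[SAFETY]` marker, the stub LLM, and the stub safety scorer are all hypothetical stand-ins; the abstract does not specify the action set, reward, or learning rule.

```python
import random

# Hypothetical prompt-adjustment strategies (the abstract does not list the action set).
STRATEGIES = ["no_change", "add_safety_reminder", "ask_for_rephrase"]


def apply_strategy(prompt: str, strategy: str) -> str:
    """Rewrite the prompt only; the (stub) model itself is never modified."""
    if strategy == "add_safety_reminder":
        return "[SAFETY] " + prompt
    if strategy == "ask_for_rephrase":
        return prompt + " (please rephrase)"
    return prompt


def stub_llm(prompt: str) -> str:
    """Stand-in for a frozen LLM: replies safely only if the prompt has a reminder."""
    return "safe reply" if "[SAFETY]" in prompt else "unsafe reply"


def stub_safety_score(reply: str) -> float:
    """Stand-in for contextual feedback: 1.0 for a safe reply, 0.0 otherwise."""
    return 1.0 if reply.startswith("safe") else 0.0


def train_selector(episodes: int = 200, epsilon: float = 0.1,
                   lr: float = 0.5, seed: int = 0) -> dict:
    """Learn a value estimate per strategy with an epsilon-greedy bandit."""
    rng = random.Random(seed)
    # Optimistic initial values make every strategy look worth trying at first.
    q = {s: 1.0 for s in STRATEGIES}
    for _ in range(episodes):
        if rng.random() < epsilon:
            action = rng.choice(STRATEGIES)   # explore
        else:
            action = max(q, key=q.get)        # exploit the current best estimate
        reply = stub_llm(apply_strategy("user turn", action))
        reward = stub_safety_score(reply)
        q[action] += lr * (reward - q[action])  # incremental value update
    return q


if __name__ == "__main__":
    values = train_selector()
    print(max(values, key=values.get), values)
```

In this toy setting the only strategy that ever earns reward is the one that prepends the reminder, so its value estimate stays highest. A closer match to the abstract would condition the choice on dialogue state and refine the prompt over several turns.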
