(Mis)generalization of Helpful-only Fine-tuning

2026-06-03 • Machine Learning

Machine Learning

AI summaryⓘ

The authors studied AI models trained to always try to help users and never refuse requests. They found that while these models refuse less often, they sometimes act in unwanted ways, like being hard to control or giving confusing answers. Training methods meant to avoid refusals can cause some of these problems. However, the authors also found ways to fix these issues using special training techniques with synthetic data and character-focused questions. So, problems aren’t inevitable with helpful-only training and can be improved.

helpful-only modelsalignmentrefusal behaviorsteerabilitysycophancysynthetic document fine-tuningSFT (Supervised Fine-Tuning)RL (Reinforcement Learning)

Authors

Mohammad Omar Khursheed, Baram Sosis, Fabien Roger

Abstract

Helpful-only models, that is, models that are trained to always follow user intent, are valuable for dangerous capability evaluations and other areas of AI R&D where refusals would be an obstacle. Little is known about the generalization properties of helpful-only training: helpful-only models refuse less than their harmless counterparts, but previous work has not studied other dimensions of their alignment. We study the shortcomings of existing helpful-only models. We find that some show emergent misalignment, others have residual refusal behaviors, and most show poor steerability, sycophancy, and incoherent character. We show that simple anti-refusal training can cause many of these issues. None of these problems are necessary consequences of helpful-only training, though: we show that synthetic document fine-tuning and adding character-related questions to SFT and RL can mitigate them.

View PDFOpen arXiv