Inoculation Adapters: Improved Selective Generalization of Capabilities with Fewer Surprising Backdoors

2026-06-29 · Artificial Intelligence

AI summary

The authors developed a method called inoculation adapters to keep AI models from picking up unwanted behaviors during fine-tuning. A small adapter is first trained on the unwanted traits. It is then attached, frozen, while a second task adapter is trained on data that shows both wanted and unwanted traits. At deployment the first adapter is discarded and only the task adapter is kept. Across six model families, the authors report that this suppresses unwanted behaviors more effectively than inoculation prompting and introduces fewer surprising backdoors under their probes. Retaining the wanted behaviors is not consistently better than with inoculation prompting and remains a challenge for both techniques.

inoculation prompting, inoculation adapters, emergent misalignment, LoRA, task adapter, undesired traits, optimization pressure, backdoors, model training, trait suppression
Authors
Maxime Riché, Daniel Tan, Vili Kohonen, Niels Warncke
Abstract
Inoculation prompting is a selective generalization technique used against emergent misalignment. We introduce inoculation adapters (IA), which similarly diminish the optimization pressure to learn undesired traits by strengthening those traits at training time. Inoculation adapters are LoRAs that are trained and used over three steps: 1) the IA is trained on undesired traits; 2) the IA is attached frozen while a separate task adapter is trained on data exhibiting both desired and undesired traits; 3) at deployment, the IA is discarded and only the task adapter is kept. We show, across six model families and several undesired traits, including emergent misalignment, that inoculation adapters are more effective than inoculation prompting at suppressing undesired traits while avoiding two of its drawbacks: inoculation adapters can suppress capabilities and traits that cannot be reliably elicited by a prompt, and they introduce fewer surprising backdoors under our probes. While undesired traits are better suppressed by inoculation adapters, the retention of desired traits is not consistently improved over inoculation prompting and remains a challenge for both techniques.
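
The three steps described in the abstract can be illustrated with a deliberately tiny sketch. This is not the authors' implementation: it assumes a rank-1 "LoRA" on a 2x2 linear map with a zero base weight, a single hypothetical input, plain gradient descent on squared error, and a made-up output convention in which index 0 is a desired-trait score and index 1 is an undesired-trait score. All names (`make_adapter`, `train`, `forward`) are invented for illustration.

```python
# Toy illustration only (assumed setup, not the paper's code):
# output index 0 = desired-trait score, index 1 = undesired-trait score.

def make_adapter():
    # Rank-1 update delta_W = outer(b, a); b starts at zero so the adapter is a no-op.
    return {"a": [0.5, 0.5], "b": [0.0, 0.0]}

def adapter_out(adapter, x):
    s = sum(ai * xi for ai, xi in zip(adapter["a"], x))
    return [bi * s for bi in adapter["b"]]

def forward(adapters, x):
    # The base weight is zero in this toy, so the output is the sum of adapter deltas.
    out = [0.0, 0.0]
    for ad in adapters:
        out = [o + d for o, d in zip(out, adapter_out(ad, x))]
    return out

def train(trainable, frozen, x, target, steps=300, lr=0.1):
    """Gradient descent on 0.5 * ||forward - target||^2, updating only `trainable`."""
    for _ in range(steps):
        y = forward([trainable] + frozen, x)
        err = [yi - ti for yi, ti in zip(y, target)]
        s = sum(ai * xi for ai, xi in zip(trainable["a"], x))
        g = sum(ei * bi for ei, bi in zip(err, trainable["b"]))
        new_b = [bi - lr * ei * s for bi, ei in zip(trainable["b"], err)]
        new_a = [ai - lr * g * xi for ai, xi in zip(trainable["a"], x)]
        trainable["a"], trainable["b"] = new_a, new_b
    return trainable

x = [1.0, 1.0]                 # single hypothetical input
undesired_only = [0.0, 1.0]    # data exhibiting only the undesired trait
mixed = [1.0, 1.0]             # data exhibiting both desired and undesired traits

# Step 1: train the inoculation adapter (IA) on the undesired trait.
ia = train(make_adapter(), [], x, undesired_only)
ia_snapshot = {"a": list(ia["a"]), "b": list(ia["b"])}

# Step 2: attach the IA frozen and train a separate task adapter on mixed data.
task = train(make_adapter(), [ia], x, mixed)

# Step 3: at deployment, discard the IA and keep only the task adapter.
deployed = forward([task], x)

# Baseline for comparison: a task adapter trained on mixed data with no IA attached.
baseline = forward([train(make_adapter(), [], x, mixed)], x)

print("deployed:", deployed)
print("baseline:", baseline)
```

In this linear toy the frozen IA already accounts for the undesired component of the target, so the task adapter receives almost no gradient toward it, and the deployed output scores high on the desired index and near zero on the undesired one, while the baseline scores high on both. This only illustrates the reduced optimization pressure the abstract describes; it says nothing about how well the effect holds in real language models.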