Learning to Attack and Defend: Adaptive Red Teaming of Language Models via GRPO

2026-06-08 • Computation and Language

Computation and Language, Artificial Intelligence, Machine Learning

AI summary

The authors address adaptive AI red teaming, in which attacker and defender models are trained against each other. They introduce AdvGRPO, a co-training framework that makes GRPO, previously reported to be unstable in this setting, viable for joint attacker-defender optimization. The approach uses dense multi-channel rewards and decoupled advantage normalization, follows a curriculum from single-turn to closed-loop multi-turn attacks, and then updates the attacker and defender models in alternation. They report that the method produces highly effective and transferable attacks and that co-trained defenders outperform baselines on safety benchmarks.

AI red teaming, reinforcement learning, co-training, GRPO, PPO, DPO, advantage normalization, curriculum learning, multi-turn attacks, safety benchmarks

Authors

Blake Bullwinkel, Eugenia Kim, Amanda Minnich, Mark Russinovich

Abstract

AI red teaming must continually adapt to evolving attackers and defenders. Reinforcement learning offers a promising approach to discovering novel attacks, and co-training methods can produce more robust defenders in tandem. Recent works have demonstrated the efficacy of attacker-defender co-training by applying PPO and DPO, but report that GRPO is unstable in this setting. We introduce AdvGRPO, a co-training framework that makes GRPO viable for joint attacker-defender optimization using dense multi-channel rewards and decoupled advantage normalization. Training progresses through a curriculum from single-turn to closed-loop multi-turn attacks before bootstrapping co-training, where attacker and defender models are updated in alternation. We show that our method can produce highly effective and transferable attacks and that co-trained defenders outperform baselines on safety benchmarks.
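
The abstract does not define "decoupled advantage normalization", so the following is only a minimal sketch of one plausible reading: each reward channel is z-scored within its sampling group on its own, and the per-channel advantages are then combined, so that a channel with a large numeric scale cannot swamp the others. The function names, the channel names, and the uniform default weights are all hypothetical and are not taken from the paper.

```python
from statistics import mean, pstdev


def group_advantages(rewards, eps=1e-8):
    """GRPO-style group-relative advantage: z-score rewards within one sampling group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def decoupled_advantages(channel_rewards, weights=None, eps=1e-8):
    """Assumed interpretation: normalize every reward channel separately, then sum.

    channel_rewards maps a (hypothetical) channel name to one reward per sampled
    completion in the group. All channels must have the same length.
    """
    names = list(channel_rewards)
    if weights is None:
        weights = {name: 1.0 for name in names}  # assumption: equal channel weights
    group_size = len(channel_rewards[names[0]])
    total = [0.0] * group_size
    for name in names:
        per_channel = group_advantages(channel_rewards[name], eps)
        for i, adv in enumerate(per_channel):
            total[i] += weights[name] * adv
    return total


if __name__ == "__main__":
    # Four sampled attacks for one prompt, scored on two illustrative channels.
    rewards = {
        "attack_success": [0.0, 0.0, 1.0, 0.0],
        "fluency": [10.0, 12.0, 11.0, 13.0],
    }
    print(decoupled_advantages(rewards))
```

In this sketch a channel whose rewards are identical across the group contributes zero advantage, and the combined advantages sum to approximately zero within each group, as with the standard single-reward GRPO normalization.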
