Diffuse AI Control on Fuzzy Tasks

2026-06-08

Machine Learning
AI summary

The authors study how AI systems might secretly behave badly on tasks that are hard to judge, like writing research proposals. They create a game-like setup where one side tries to train AI to avoid bad behavior using a weak scoring system, while the other side tries to find sneaky bad actions that the weak system misses. Testing this on AI-generated research proposals, they found that some models could trick the weak scorer into rating poor proposals highly. To fix this, the authors develop a method for making the weak scorer tougher so that such sneaky bad behaviors are caught and avoided.

AI safety, Diffuse AI Control, misalignment, fuzzy tasks, adversarial game, blue team, red team, prompt optimization, language models, multi-objective optimization
Authors
Mikhail Terekhov, Caglar Gulcehre, Vivek Hebbar, Joe Benton
Abstract
AI models deployed in critical domains, such as AI safety research, may subtly sabotage our efforts due to misalignment. Diffuse AI Control is a subfield of AI safety concerned with mitigating risks from AI sabotage distributed over long deployment horizons (diffuse threats). These risks are particularly pernicious on fuzzy tasks, i.e. tasks which are hard to grade or require intuition. To understand diffuse threats on fuzzy tasks, we introduce a novel framework that considers AI control as an adversarial game between a blue team and a red team. The blue team uses a weak trusted model to construct a weak score against which they would train a strong, potentially subversive model to remove the subversion propensity if it were present. The red team then tries to find model behaviors that are rated highly by the weak score, and thus might not be trained out, but actually correspond to poor performance. We test our framework on the task of writing experimental proposals for research questions from recent ML papers. We use a language model with access to the original paper as a proxy "ground-truth" scorer. Our red team discovers subversive behaviors using multi-objective evolutionary prompt optimization. We show that Opus 4.6 can write proposals that are worse according to the ground truth proxy than those of GPT-OSS-20B, while the weak scorer rates them as highly as the best proposals from Opus 4.6. To mitigate the threat, we propose an adversarial optimization algorithm for the blue team that discovers more robust prompts for the weak model. This algorithm produces a blue team prompt that our red team optimization fails to exploit.
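The red team's two objectives described in the abstract (a high weak score, a low ground-truth proxy score) can be illustrated with a small Pareto-selection sketch. This is not the authors' implementation: the `Candidate` class, the field names, and the score values are hypothetical, and the sketch covers only the selection step of a multi-objective search, not the evolutionary prompt mutation or the scoring models themselves.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    """Hypothetical container for one red-team prompt and its two scores."""
    prompt: str
    weak_score: float  # rating from the weak trusted scorer (higher = rated better)
    true_score: float  # rating from the ground-truth proxy (higher = actually better)


def dominates(a: Candidate, b: Candidate) -> bool:
    """True if `a` is at least as good as `b` for the red team on both
    objectives (higher weak score, lower true score) and strictly better on one."""
    no_worse = a.weak_score >= b.weak_score and a.true_score <= b.true_score
    strictly_better = a.weak_score > b.weak_score or a.true_score < b.true_score
    return no_worse and strictly_better


def pareto_front(candidates: list[Candidate]) -> list[Candidate]:
    """Keep only candidates that no other candidate dominates."""
    return [
        c for c in candidates
        if not any(dominates(other, c) for other in candidates)
    ]


def exploit_gap(c: Candidate) -> float:
    """How much the weak scorer overrates this candidate (illustrative metric)."""
    return c.weak_score - c.true_score


# Made-up scores, for illustration only.
population = [
    Candidate("A", weak_score=9.0, true_score=3.0),
    Candidate("B", weak_score=8.0, true_score=5.0),  # dominated by A
    Candidate("C", weak_score=9.5, true_score=6.0),
    Candidate("D", weak_score=5.0, true_score=2.0),
]
front = pareto_front(population)
most_exploitative = max(front, key=exploit_gap)
```

In a full evolutionary loop, the surviving front would typically seed the next generation of prompt variants; that step is omitted here because the abstract does not specify it.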