Reliable to Expressive: A Curriculum for Rubric-Following Safety Judges

2026-06-08 • Artificial Intelligence

AI summary

The authors study safety judges, models that check whether AI-generated content is safe, and find that they often fail when the rubrics they follow change or are rephrased. They propose training judges on varied, instance-specific rubrics, starting from clean fixed-rubric data and gradually introducing noisier dynamic-rubric data, so that the judges learn to apply whichever rubric they are given. Their 12B judge reports higher accuracy and more consistent results across three rubrics than the models it is compared against. Mixing varied rubrics into training without a schedule made cross-rubric consistency worse, which suggests that the staged curriculum is what drives the improvement.

safety judges, rubric, evaluation criteria, curriculum learning, prompt variations, model robustness, supervised fine-tuning, false negative rate, large language models, meta-evaluation

Authors

Yongtaek Lim, Hyeji Choi, Minwoo Kim

Abstract

Safety judges are increasingly deployed to evaluate model outputs against evolving criteria, yet recent meta-evaluation work shows they remain brittle under prompt and rubric variation, with false-negative-rate swings of up to 0.24 reported for stylistic perturbations alone. We argue that safety judgment is fundamentally a rubric-following problem: a robust judge must apply the given evaluation criteria consistently across rubric formulations rather than memorize one specific template. We propose a training strategy that combines (i) instance-conditioned dynamic rubrics, generated from prompt-response-label triples to expose the judge to the variability of evaluation criteria, and (ii) a reliable-to-expressive curriculum that begins with clean fixed-rubric supervision and progressively introduces noisier dynamic-rubric data. We evaluate on a single human-labeled set under three contrasting rubric prompts (HarmBench-style, ShieldGemma-style, and a domain-specific rubric). Our 12B curriculum judge achieves 94.12% to 94.88% accuracy across the three rubrics, a cross-rubric range of only 0.76 percentage points, and outperforms general-purpose LLMs, dedicated safety classifiers, and reasoning-oriented judges of up to 30B parameters in both peak accuracy and stability. An ablation shows that naively mixing dynamic rubrics into SFT widens the cross-rubric range (1.44 -> 3.60); only the curriculum schedule recovers and improves on the fixed-rubric baseline (range 0.76).
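To make the stability figure concrete, the following minimal Python sketch computes a cross-rubric range as the gap between the highest and lowest per-rubric accuracy, and illustrates one possible reliable-to-expressive data schedule. The abstract only states the endpoints 94.12 and 94.88, so the example uses just those two values. The function names, the linear ramp, and the `warmup_frac` and `max_frac` parameters are assumptions made for illustration; the abstract does not specify the authors' actual schedule.

```python
from typing import Dict


def cross_rubric_range(accuracy_by_rubric: Dict[str, float]) -> float:
    """Highest minus lowest accuracy (in percentage points) across rubric prompts."""
    values = list(accuracy_by_rubric.values())
    if not values:
        raise ValueError("need at least one rubric accuracy")
    return max(values) - min(values)


def dynamic_fraction(step: int, total_steps: int,
                     warmup_frac: float = 0.3, max_frac: float = 0.5) -> float:
    """Hypothetical curriculum: share of dynamic-rubric examples at a training step.

    Assumption, not the paper's schedule: train on fixed-rubric data only during
    a warm-up phase, then ramp the dynamic-rubric share linearly up to max_frac.
    """
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    warmup_steps = int(warmup_frac * total_steps)
    if step < warmup_steps:
        return 0.0  # reliable phase: clean fixed-rubric supervision only
    ramp = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min(max_frac, max_frac * ramp)  # expressive phase: add noisier data


# Only the extremes are reported in the abstract; the range depends on nothing else.
reported = {"lowest_rubric": 94.12, "highest_rubric": 94.88}
print(f"cross-rubric range: {cross_rubric_range(reported):.2f}")
```

A max-minus-min range is a coarse stability measure, but it matches how the abstract's 0.76 figure relates to the reported accuracy endpoints.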
