CANARY: Zero-Label Detection of Fine-Tuning Contamination in Language Models
2026-06-01 • Machine Learning
AI summary
The authors find that poisoning as little as 1% of fine-tuning data can implant harmful behavior that stays hidden in a model's internal states and does not appear in generated text until contamination exceeds 7.5%. They introduce CANARY, a tool that detects this hidden shift from two forward passes over an unlabeled prompt set, with no labeled data required. CANARY detects contamination at the 1% level, well before output-level methods fire, and reduces harmful behavior by suppressing a small number of contamination-specific features at inference time. It is evaluated across four model architectures and two training paradigms, and is aimed at making supply-chain attacks easier to find and fix.
poisoning attacks, fine-tuning, hidden-state geometry, autoencoder, semantic drift, AUROC, supply-chain contamination, latent behavior, inference-time mitigation, zero-label detection
Authors
Swapnil Parekh
Abstract
Adversaries can implant latent harmful behavior by poisoning as few as 1% of fine-tuning examples. The contamination is invisible to every output-level defense: harmful behavior lies dormant in the model's hidden-state geometry and does not appear in generated text until contamination exceeds 7.5%. We introduce CANARY (Contamination Auditor via Neural Activation Representation Yield), a zero-label checkpoint auditor that detects this hidden shift directly from two forward passes over an unlabeled prompt set. CANARY projects the hidden-state difference through a Sparse Autoencoder, filtering style noise to isolate meaningful semantic drift. It achieves AUROC = 1.000 at 1% contamination (95% CI = [0.997, 1.000]; Cohen's d = 3.28) across four model architectures and two training paradigms, 7.5x below where any output-level method fires, with zero false positives on benign fine-tuning and full robustness to style-matching and gradient-noise adaptive attacks. The same SAE feature basis drives a complete governance pipeline: SAE-filtered amplification surfaces latent harm at a 5x higher rate than standard generation; score-ranked prompts yield 4.2x red-teaming lift; and suppressing a handful of contamination-specific features at inference time reduces harm from 70% to 10% with no perplexity penalty. CANARY is the first zero-label framework to detect, verify, prioritize, and remediate supply-chain contamination from hidden states alone.
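The detection step described in the abstract can be illustrated with a minimal sketch. This is not the paper's implementation: it assumes the two forward passes are one over a reference checkpoint and one over the fine-tuned checkpoint, that the Sparse Autoencoder uses a plain ReLU encoder, and that "filtering style noise" means zeroing a known set of style-related feature indices. The names `sae_encode`, `drift_score`, `W_enc`, `b_enc`, and `style_features` are all hypothetical.

```python
import math

def sae_encode(x, W_enc, b_enc):
    """Assumed SAE encoder: f = ReLU(W_enc @ x + b_enc), in pure Python."""
    return [
        max(0.0, sum(w * xi for w, xi in zip(row, x)) + b)
        for row, b in zip(W_enc, b_enc)
    ]

def drift_score(h_base, h_tuned, W_enc, b_enc, style_features=()):
    """Mean L2 norm of the SAE-encoded hidden-state difference over prompts.

    h_base, h_tuned: one hidden-state vector per prompt, taken from the
    reference and fine-tuned checkpoints (assumption: same prompts, same layer).
    style_features: indices of SAE features treated as style noise and zeroed
    (hypothetical filtering rule, not taken from the paper).
    """
    style = set(style_features)
    norms = []
    for hb, ht in zip(h_base, h_tuned):
        diff = [t - b for t, b in zip(ht, hb)]      # hidden-state difference
        feats = sae_encode(diff, W_enc, b_enc)      # project through the SAE
        kept = [0.0 if i in style else f for i, f in enumerate(feats)]
        norms.append(math.sqrt(sum(f * f for f in kept)))
    return sum(norms) / len(norms)

# Toy example with a 3-feature identity "SAE" and a single prompt.
W_enc = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
b_enc = [0.0, 0.0, 0.0]
h_base = [[0.0, 0.0, 0.0]]
h_tuned = [[3.0, 4.0, 5.0]]

# Feature 2 is treated as style noise, so only features 0 and 1 contribute.
print(drift_score(h_base, h_tuned, W_enc, b_enc, style_features={2}))  # 5.0
```

In this sketch a checkpoint whose hidden states match the reference scores 0.0, and larger scores indicate more drift along the retained features. How scores are thresholded, and how the style features are identified, are left open here because the abstract does not specify them.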