Consistency Training while Mitigating Obfuscation via Rate Matching

2026-06-01

Computation and Language, Artificial Intelligence
AI summary

The authors studied how language models can be influenced by extra clues in the input, such as hints about which answer a user wants, which can bias the model's responses. They found that existing consistency-training methods can teach the model to stop mentioning these clues while it is still influenced by them, which makes it harder to check whether the problem has actually been fixed. To address this, they created a method called RMCT that matches how often the model shows a target behaviour (such as following a bias clue) across variations of the input, without constraining how that behaviour is expressed. In experiments on two open-weight models, RMCT reduced bias-following about as much as a standard consistency-training baseline on held-out bias types, while largely preserving the model's tendency to mention the clue. It needed less data but more computing power. This suggests it’s possible to make models more robust to such clues while keeping them easy to monitor.

Large Language Models, Consistency Training, Bias Mitigation, Extraneous Input Features, Behavioral Robustness, Sycophancy, Rate Matching Consistency Training, Model Monitorability, Input Perturbations, Behavioral Properties
Authors
Sohaib Imran, Prakhar Gupta, Jannes Elstner, David Demitri Africa
Abstract
Large language models are often influenced by extraneous input features, such as cues revealing a user's preferred answer. Consistency training reduces this influence by training models to behave similarly across inputs with and without the extraneous feature. However, existing methods train for consistency over entire responses or internal activations, which also constrains whether the model verbalises said extraneous features. We show this leads to obfuscation, where the model learns not to mention a cue while remaining influenced by it, which may undermine monitorability. To address this, we introduce Rate Matching Consistency Training (RMCT), which trains for consistency over selected behavioural properties without constraining how this behaviour is expressed. RMCT matches the rate at which the model exhibits a target behaviour (e.g., following a bias cue) across input perturbations, rather than requiring paired inputs with and without the extraneous feature, extending consistency training to settings where the extraneous features cannot be removed. We evaluate RMCT on sycophancy reduction in two open-weight language models, achieving reductions in bias-following comparable to a standard consistency-training baseline on held-out bias types, while largely preserving the model's tendency to verbalise the bias cue. Further, we find that RMCT is more data-efficient at the expense of being less compute-efficient in our experiments. Overall, RMCT shows that consistency training can improve behavioural robustness without directly trading off against monitorability.
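To make the idea of matching a behaviour rate across input perturbations concrete, the following is a minimal diagnostic sketch and not the authors' training objective, which the abstract does not specify. It assumes that each sampled response has already been labelled with a boolean flag saying whether it exhibits the target behaviour (for example, following the bias cue). The function names `behaviour_rate` and `rate_gap`, the perturbation labels, and the choice of mean absolute deviation as the mismatch measure are all illustrative assumptions.

```python
from statistics import mean


def behaviour_rate(flags):
    """Fraction of sampled responses that exhibit the target behaviour."""
    return mean(1.0 if flag else 0.0 for flag in flags)


def rate_gap(flags_by_perturbation):
    """Return per-perturbation rates and their mean absolute deviation.

    A gap of 0.0 means the behaviour occurs equally often under every
    perturbation. This is an assumed mismatch measure for illustration,
    not the loss used in the paper.
    """
    rates = {
        name: behaviour_rate(flags)
        for name, flags in flags_by_perturbation.items()
    }
    average = mean(rates.values())
    gap = mean(abs(rate - average) for rate in rates.values())
    return rates, gap


# Hypothetical labels: did each sampled response follow the cue?
samples = {
    "cue_variant_1": [True, True, True, False],
    "cue_variant_2": [True, False, False, False],
}
rates, gap = rate_gap(samples)
print(rates, gap)
```

Because the measure only looks at how often the behaviour occurs, it places no constraint on the wording of the responses, including whether they mention the cue. That is the property the abstract contrasts with consistency over entire responses or internal activations.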