When In-Distribution Gains Fail: Evaluating Weak-to-Strong Reward Models under Preference Shift

2026-05-25

Computation and Language, Machine Learning
AI summary

The authors study whether strong models trained on weak preference labels still perform well on different, unseen preference data. They find that such models can look good when tested on data similar to their training set but fail when applied to new preference datasets. The problem is that fine-tuning on weak labels can make the model rely too heavily on features of the source dataset instead of learning general preferences. To fix this, they propose Representation Anchoring, a regularizer that keeps the model's representations close to those of the pretrained model while still letting it adapt to the task. This approach improves transfer to new data while keeping performance on familiar data competitive.

weak-to-strong generalization, preference learning, zero-shot transfer, fine-tuning, representation anchoring, distribution shift, reward modeling, out-of-distribution, regularization, transfer learning
Authors
Khoi Le, Tri Cao, Phong Nguyen, Cong-Duy Nguyen, Anh Tuan Luu, Miao Chunyan, See-Kiong Ng, Thong Nguyen
Abstract
Weak-to-strong (W2S) generalization is a promising framework for scalable oversight, yet existing evaluations often test students under matched train-test distributions. We therefore study W2S preference learning under zero-shot distribution shift and find that strong students trained on weak preference labels can appear successful in-distribution while failing to transfer across preference datasets. We provide evidence for a representational failure mode in which weakly supervised fine-tuning can pull the strong model toward source-domain features instead of maintaining broadly transferable preference representations. To mitigate this, we propose Representation Anchoring (Anchor), a simple yet effective regularizer that constrains excessive drift from the pretrained strong model's representation space during fine-tuning, while still allowing task-relevant adaptation. Across preference domains, datasets, and model families, Anchor consistently improves out-of-distribution transfer while maintaining competitive in-distribution performance. Together, our evaluation protocol, transfer-aware metrics, and method expose hidden brittleness in current W2S reward modeling and provide a practical path toward more robust preference transfer.
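
The abstract does not give the exact form of the Anchor regularizer, so the following is only a minimal sketch of the general idea under stated assumptions: a standard Bradley-Terry pairwise preference loss plus a penalty on the squared distance between the fine-tuned model's features and the frozen pretrained model's features. The function names, the mean-squared-distance penalty, and the weight `lam` are all hypothetical choices for illustration and may differ from the paper's actual method.

```python
import math

def bradley_terry_loss(r_chosen, r_rejected):
    """Pairwise preference loss: -log(sigmoid(r_chosen - r_rejected))."""
    d = r_chosen - r_rejected
    # Numerically stable evaluation of log(1 + exp(-d)).
    if d > 0:
        return math.log1p(math.exp(-d))
    return -d + math.log1p(math.exp(d))

def anchor_penalty(h_current, h_pretrained):
    """Mean squared distance between fine-tuned and frozen pretrained features.

    ASSUMPTION: the paper may measure representation drift differently.
    """
    return sum((a - b) ** 2 for a, b in zip(h_current, h_pretrained)) / len(h_current)

def anchored_loss(r_chosen, r_rejected, h_current, h_pretrained, lam=0.1):
    """Preference loss plus a drift penalty weighted by lam (hypothetical)."""
    return bradley_terry_loss(r_chosen, r_rejected) + lam * anchor_penalty(h_current, h_pretrained)
```

With `lam = 0` this reduces to plain preference fine-tuning on the weak labels; larger values keep the student's features closer to the pretrained model's, trading some fit to the source data for less representational drift.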