DynaCF: Mitigating Shortcut Learning in Reward Models via Dynamic Counterfactual Sensitivity

2026-06-08 • Machine Learning

Machine Learning, Computation and Language

AI summary

The authors address a failure mode of reward models trained on pairwise comparisons: the models often latch onto easy shortcuts instead of learning true response quality. They introduce DynaCF, which measures each sample's shortcut sensitivity during training by applying perturbations that preserve meaning and checking whether the current model's preference margin shifts or flips. Samples with higher shortcut sensitivity are downweighted, so the model relies more on task-relevant quality signals. Their experiments show that this approach makes preference models more robust.

reward models, pairwise preferences, shortcut learning, dynamic reweighting, counterfactual perturbations, Bradley-Terry model, preference modeling, robustness, semantic preservation

Authors

Fengyuan Liu, Yongliang Miao, Zirui He, Yanguang Liu, Fei Sun, Mengnan Du

Abstract

Reward models trained from pairwise preferences often exploit superficial shortcut cues rather than learning true response quality. We propose DynaCF, a dynamic reweighting framework for mitigating shortcut learning in reward model training. Unlike static shortcut heuristics, DynaCF measures shortcut sensitivity online during optimization by applying semantics-preserving counterfactual perturbations and tracking the resulting margin shifts and preference flips under the current model. Samples with higher shortcut sensitivity are dynamically downweighted in the Bradley-Terry objective, encouraging the model to rely less on superficial patterns and more on task-relevant preference signals. Extensive experiments show that DynaCF consistently improves robustness in preference modeling.
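
The reweighting idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not give the sensitivity formula or the weighting function, so the function names, the additive combination of margin shift and flip rate, the exponential weight, and the parameters `flip_weight` and `temperature` are all assumptions made for illustration. Here a margin is the reward of the chosen response minus the reward of the rejected one, and the perturbed margins are assumed to come from scoring semantics-preserving rewrites of the same pair with the current model.

```python
import math


def bt_loss(margin):
    """Bradley-Terry negative log-likelihood, -log(sigmoid(margin)), in a stable form."""
    # Equivalent to log(1 + exp(-margin)) without overflow for large |margin|.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


def shortcut_sensitivity(margin, perturbed_margins, flip_weight=1.0):
    """Hypothetical score: mean absolute margin shift plus a weighted flip rate."""
    n = len(perturbed_margins)
    shift = sum(abs(m - margin) for m in perturbed_margins) / n
    # A flip means the perturbed pair is ranked the opposite way from the original.
    flips = sum(1 for m in perturbed_margins if (m > 0) != (margin > 0)) / n
    return shift + flip_weight * flips


def sample_weight(sensitivity, temperature=1.0):
    """Assumed weighting: higher sensitivity gives a smaller weight in (0, 1]."""
    return math.exp(-sensitivity / temperature)


def weighted_bt_loss(margins, perturbed, temperature=1.0):
    """Weight-normalized Bradley-Terry loss over a batch of preference pairs."""
    weights = [
        sample_weight(shortcut_sensitivity(m, p), temperature)
        for m, p in zip(margins, perturbed)
    ]
    losses = [bt_loss(m) for m in margins]
    total = sum(w * l for w, l in zip(weights, losses)) / sum(weights)
    return total, weights


# Two pairs with the same original margin: the first is stable under
# perturbation, the second shifts and flips, so it receives a smaller weight.
margins = [2.0, 2.0]
perturbed = [[2.0, 2.0], [-1.0, 0.5]]
loss, weights = weighted_bt_loss(margins, perturbed)
```

In an actual training loop the weights would presumably be recomputed as the model changes and treated as constants when differentiating the loss; the sketch only shows the arithmetic for one batch.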
