The Hidden Bias of Process Reward Models: PRISM for Rewarding the Right Reasoning

2026-06-08 | Machine Learning

AI summary

The authors find that Process Reward Models (PRMs), which score each step of a reasoning chain, carry a hidden bias caused by severely imbalanced step-level training data. Standard cross-entropy training amplifies this bias, so PRMs over-reward plausible but incorrect steps, and those false positives steer Best-of-N selection, guided decoding, and policy optimization toward flawed reasoning. To fix this, they propose PRISM, a training method that learns from direct step-to-step comparisons and hard negatives without requiring new human labels. PRISM reduces false positives and improves accuracy and robustness on reasoning tasks. The authors emphasize that trustworthy process supervision means rewarding the right reasoning for the right reasons, not just assigning high scores.

Process Reward Models (PRMs), credit assignment, cross-entropy training, false positives, contrastive learning, policy optimization, guided decoding, Best-of-N selection, curriculum learning, step-level feedback
Authors
Aakriti Agrawal, Souradip Chakraborty, Armin Saghafian, Nihal Sharma, Rizal Fathony, Nam H Nguyen, C. Bayan Bruss, Amrit Singh Bedi, Furong Huang
Abstract
Process Reward Models (PRMs) improve credit assignment for reasoning by providing step-level feedback. However, we identify a hidden bias in PRMs caused by severe imbalance in step-level training data. Standard cross-entropy training amplifies this bias, causing PRMs to overcredit plausible but incorrect steps and produce high false-positive rates. We show that these false positives have an asymmetric downstream effect: false negatives mainly slow exploration, whereas false positives actively steer Best-of-N selection, guided decoding, and policy optimization toward flawed reasoning. This suggests that PRM training should shift from pointwise label fitting to reliable relative comparisons. To address this, we propose PRISM (Precision Ranking for Improved Step Modeling), a policy-aware PRM training framework that learns from contrastive step-level comparisons and hard negatives generated by a temporal lookahead strategy, requiring no new human labels. We further use a difficulty-aware curriculum to optimize the contrastive step margin. Across PRMBench and ProcessBench, PRISM substantially reduces false positives (22% on PRMBench) and improves macro F1 over strong discriminative PRMs. When applied to policy optimization and search tasks, including guided decoding and Best-of-N selection, it consistently improves accuracy (up to 22% for guided decoding and 33% for Best-of-N) and robustness. More broadly, trustworthy process supervision is not just about assigning high rewards, but about rewarding the right reasoning for the right reasons.
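The contrast between pointwise label fitting and relative comparison, and the way a single false positive can flip Best-of-N selection, can be illustrated with a small sketch. This is not the PRISM objective, which the abstract does not spell out: the logistic margin loss, the min-over-steps aggregation, and every score below are assumptions chosen purely for illustration.

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def pointwise_bce(logit: float, label: int) -> float:
    """Cross-entropy on a single step, fit independently of any other step."""
    p = sigmoid(logit)
    return -(label * math.log(p) + (1 - label) * math.log(1.0 - p))


def pairwise_margin_loss(pos_logit: float, neg_logit: float, margin: float = 1.0) -> float:
    """Generic logistic ranking loss (assumed form, not taken from the paper).

    It is small only when the correct step outscores the incorrect step
    by at least `margin`.
    """
    return math.log1p(math.exp(-(pos_logit - neg_logit - margin)))


def best_of_n(candidates):
    """Select the candidate whose weakest step scores highest.

    Min-over-steps aggregation is an assumption; other aggregations exist.
    """
    return max(candidates, key=lambda c: min(c["step_scores"]))


# Hypothetical logits: a correct step and a plausible but incorrect step.
correct_logit, plausible_wrong_logit, clearly_wrong_logit = 2.0, 1.5, -2.0

# The pointwise loss on the correct step ignores how the wrong step is scored.
bce_on_correct = pointwise_bce(correct_logit, 1)

# The pairwise loss stays large while the wrong step scores almost as high.
loss_overcredited = pairwise_margin_loss(correct_logit, plausible_wrong_logit)
loss_separated = pairwise_margin_loss(correct_logit, clearly_wrong_logit)

# Hypothetical Best-of-N pool. In `biased`, the flawed chain's bad step is
# overcredited (false positive) while a correct step is underscored
# (false negative), so the flawed chain wins the selection.
biased = [
    {"name": "sound", "correct": True, "step_scores": [0.9, 0.6, 0.8]},
    {"name": "flawed", "correct": False, "step_scores": [0.9, 0.85, 0.9]},
]
# Same pool after the flawed step is scored low: the sound chain wins.
repaired = [
    {"name": "sound", "correct": True, "step_scores": [0.9, 0.6, 0.8]},
    {"name": "flawed", "correct": False, "step_scores": [0.9, 0.2, 0.9]},
]

picked_biased = best_of_n(biased)
picked_repaired = best_of_n(repaired)
```

In this toy setup the underscored correct step merely lowers the sound chain's score, while the overcredited flawed step changes which chain is selected, which mirrors the asymmetry the abstract describes.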