Themis: Training Robust Multilingual Code Reward Models for Flexible Multi-Criteria Scoring

2026-05-01

Software Engineering, Machine Learning
AI summary

The authors examine how reward models (RMs) can be improved for code generation beyond simply checking whether code runs correctly. They build a new benchmark, Themis-CodeRewardBench, to evaluate RMs on several quality criteria across multiple programming languages. Finding current models limited, they collect a large dataset of coding preferences and use it to train Themis-RM, a suite of reward models that score code along multiple criteria. Their results show that performance improves with larger models and more diverse training examples, highlighting the value of multi-criteria supervision when training RMs for code.

reward models, language models, code generation, functional correctness, multilingual, multi-criteria, benchmark, preference data, model scaling, cross-lingual transfer
Authors
Indraneil Paul, Goran Glavaš, Iryna Gurevych
Abstract
Reward models (RMs) have become an indispensable fixture of the language model (LM) post-training playbook, enabling policy alignment and test-time scaling. Research on the application of RMs in code generation, however, has been comparatively sparse, with existing work largely focusing on execution feedback. This choice constrains post-training to optimizing functional correctness over self-contained executable code. In this work, we examine the training and evaluation of multilingual, multi-criteria code RMs. To this end, we first compile Themis-CodeRewardBench, a benchmark to evaluate code RMs across five preference dimensions (i.e., criteria) and eight programming languages, on which we profile 50+ code, math, and general-purpose RMs. Observing the limited proficiency of current RMs beyond scoring for functional correctness, we develop Themis-CodePreference, the largest open-source collection of code preferences to date (more than 350k preference pairs), and use it to train Themis-RM, a suite of multilingual code reward models for flexible multi-criteria scoring, ranging in size from 600M to 32B parameters. Our experiments and ablations demonstrate positive scaling trends, strong cross-lingual transfer when training on diverse preferences, and the importance of multi-criteria training for reliable code reward modeling.
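The abstract describes training reward models on preference pairs for flexible multi-criteria scoring. It does not specify the training objective, but reward models are commonly trained with a pairwise Bradley-Terry loss, which here could be applied per criterion. A minimal sketch of that idea (all criterion names and scores are hypothetical, not taken from the paper):

```python
import math

def bradley_terry_loss(r_chosen: float, r_rejected: float) -> float:
    """Pairwise preference loss: -log sigmoid(r_chosen - r_rejected).
    The loss shrinks as the reward assigned to the preferred sample
    grows relative to the rejected one."""
    margin = r_chosen - r_rejected
    # log(1 + exp(-margin)), i.e. -log(sigmoid(margin))
    return math.log1p(math.exp(-margin))

# Hypothetical per-criterion rewards for one code preference pair.
# A multi-criteria RM emits one score per criterion, and each
# criterion contributes its own pairwise loss term.
criteria = ["functional_correctness", "readability", "efficiency"]
chosen = {"functional_correctness": 2.1, "readability": 0.8, "efficiency": 1.4}
rejected = {"functional_correctness": -0.3, "readability": 1.0, "efficiency": 0.2}

total_loss = sum(bradley_terry_loss(chosen[c], rejected[c]) for c in criteria)
```

Summing per-criterion loss terms is one simple way to let a single model learn several preference dimensions at once; at inference time, the per-criterion scores can then be weighted or queried individually for flexible scoring.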