Cheap Reward Hacking Detection

2026-06-08, Machine Learning

Machine Learning, Artificial Intelligence, Cryptography and Security
AI summary

The authors trained a small transformer encoder to map Terminal-Wrench trajectories to points on a unit sphere, where the distance between two points approximates the difference between their reward and metadata signals. A linear classifier (probe) on these points detects reward hacking with an AUC comparable to that of an LLM-as-judge baseline and a higher true positive rate at a 5% false positive rate, at roughly four orders of magnitude lower cost per trajectory. The encoder depends on the natural-language reasoning in its input: removing that reasoning at probe time drops AUC from 0.9467 to 0.6213. The detector therefore does not read behavior alone; it combines behavioral and language information.

transformer encoder, Terminal-Wrench trajectories, unit sphere embedding, reward hacking, linear probe, AUC, true positive rate, false positive rate, natural language reasoning, LLM-as-judge
Authors
Iván Belenky, Joaquín Itria, Steven Johns
Abstract
A small transformer encoder is trained to map Terminal-Wrench (TW) trajectories onto a unit sphere where embedding distance approximates the $L_1$ distance between reward and metadata signals. A linear probe on top of that embedding detects reward hacking on the cleaned test split with AUC $0.9467$ and TPR@5%FPR $0.8296$, matching the AUC of the TW sanitized LLM-as-judge ($0.9510$ on the cleaned split) and exceeding its TPR@5%FPR ($0.8296$ vs. $0.7130$ for the judge) under the same information condition, at roughly four orders of magnitude lower per-trajectory cost. The encoder is not a pure behavior reader: stripping natural-language reasoning from its input at probe time drops AUC to $0.6213$.
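The two headline metrics, AUC and TPR@5%FPR, can be computed from any detector's per-trajectory scores. The following is a minimal sketch in plain Python, not the authors' evaluation code: the function names `roc_auc` and `tpr_at_fpr`, the convention that a trajectory is flagged when its score is at or above the threshold, and the toy scores and labels are all assumptions made for illustration.

```python
from typing import Sequence


def _split(scores: Sequence[float], labels: Sequence[int]):
    """Separate scores into positives (label 1, hacked) and negatives (label 0)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    return pos, neg


def roc_auc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """AUC as the probability that a random positive outscores a random
    negative, with ties counted as one half (O(n^2), fine for a sketch)."""
    pos, neg = _split(scores, labels)
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def tpr_at_fpr(scores: Sequence[float], labels: Sequence[int],
               max_fpr: float = 0.05) -> float:
    """Highest true positive rate over all thresholds whose false positive
    rate does not exceed max_fpr. Assumed rule: flag if score >= threshold."""
    pos, neg = _split(scores, labels)
    best = 0.0  # a threshold above every score gives TPR 0 at FPR 0
    for t in set(scores):
        fpr = sum(n >= t for n in neg) / len(neg)
        if fpr <= max_fpr:
            tpr = sum(p >= t for p in pos) / len(pos)
            best = max(best, tpr)
    return best


# Toy data (hypothetical): higher score means "more likely reward hacking".
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]
labels = [1, 1, 0, 1, 0, 1, 0, 0]

print(roc_auc(scores, labels))           # 0.8125
print(tpr_at_fpr(scores, labels, 0.05))  # 0.5
```

On the toy data only four negatives exist, so an FPR of at most 5% means no false positives at all. That is why TPR at a low fixed FPR can separate two detectors whose AUCs are nearly equal, as in the probe and judge comparison above.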