Gradient-Guided Reward Optimization for Inference-time Alignment

2026-06-08 · Computation and Language

AI summary

The authors address the problem of keeping large language models reliable under distribution drift, that is, when they encounter new or different kinds of text. Existing inference-time methods such as Best-of-N and rejection sampling generate many samples and pick the best one by reward score, so their performance is bounded by the base model's generation quality and they can be misled by imperfect reward models. The authors propose Gradient-Guided Reward Optimization (GGRO), which monitors token-level entropy during decoding and, in high-uncertainty regions, injects nudging tokens derived from the gradients of an off-the-shelf reward model to steer the generation. Their experiments show that GGRO improves response quality on safety, helpfulness, and reasoning benchmarks and is more robust to reward hacking, with minimal computational overhead.

Large Language Models, Distribution Drift, Inference-time Adaptation, Reward Models, Gradient Guidance, Token-level Entropy, Sampling Methods, Reward Hacking, Decoding, Model Alignment
Authors
Hankun Lin, Ruqi Zhang
Abstract
Ensuring the reliability of Large Language Models (LLMs) under distribution drift requires inference-time adaptation. While inference-time alignment methods such as Best-of-$N$ and rejection sampling are widely used, they frame the task as a sampling-intensive, reward-guided search, leading to two key limitations: their performance is bounded by the base model's generation quality, and their reliance on imperfect reward models makes them vulnerable to reward hacking. To address these challenges, we introduce Gradient-Guided Reward Optimization (GGRO), a lightweight inference-time method that performs targeted, minimal intervention during decoding via gradient guidance. Specifically, GGRO monitors token-level entropy to identify high-uncertainty regions indicative of drift or misalignment. Upon detection, it responds by injecting nudging tokens, generated using gradient signals from an off-the-shelf reward model, to steer the generation trajectory rather than merely re-ranking samples. Experiments show that GGRO consistently improves inference-time alignment across safety, helpfulness, and reasoning benchmarks. It also increases coverage of high-quality responses and robustness to reward hacking, with minimal computational overhead. Code is available at https://github.com/lhk2004/GGRO.
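
The entropy-monitoring step described in the abstract can be illustrated with a minimal sketch. It only shows how token-level entropy could flag high-uncertainty decoding steps. The function names, the fixed threshold, and the toy logits are assumptions made for illustration and are not taken from the paper or its repository. The abstract does not specify how the nudging tokens are generated from reward-model gradients, so that part is not sketched here.

```python
import math
from typing import List, Sequence


def token_entropy(logits: Sequence[float]) -> float:
    """Shannon entropy (in nats) of the softmax distribution for one decoding step."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    probs = [e / z for e in exps]
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def high_uncertainty_steps(
    step_logits: Sequence[Sequence[float]], threshold: float
) -> List[int]:
    """Return indices of decoding steps whose entropy exceeds `threshold`.

    `threshold` is a hypothetical hyperparameter; the paper's actual
    detection criterion may differ.
    """
    return [
        t for t, logits in enumerate(step_logits)
        if token_entropy(logits) > threshold
    ]


# Toy example over a 4-token vocabulary (illustrative values only).
steps = [
    [10.0, 0.0, 0.0, 0.0],  # confident: one token dominates
    [0.0, 0.0, 0.0, 0.0],   # maximally uncertain: uniform distribution
    [5.0, 5.0, 0.0, 0.0],   # torn between two tokens
]
flagged = high_uncertainty_steps(steps, threshold=1.0)
```

In a full pipeline, the flagged steps would be the points at which an intervention such as GGRO's nudging tokens is applied, while low-entropy steps would be decoded as usual. This differs from Best-of-$N$, which scores complete samples only after generation has finished.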