Dual Advantage Fields

2026-06-02 • Machine Learning

Machine Learning, Artificial Intelligence, Robotics

AI summary

The authors focus on improving how offline goal-conditioned reinforcement learning picks actions. They introduce Dual Advantage Fields (DAF), a policy-extraction method that turns a bilinear dual value model into an estimate of each action's advantage with respect to the goal. DAF learns how actions displace the state in feature space and prefers actions whose displacement best aligns with the goal direction, which provides a local advantage signal. They report improved performance on OGBench locomotion, manipulation, and puzzle tasks, especially when the locally correct action differs from moving directly toward the goal.

reinforcement learning, goal-conditioned learning, offline learning, value function, advantage function, policy improvement, bilinear models, feature space, Bellman equation, local action selection

Authors

Alexey Zemtsov, Maxim Bobrin, Alexander Nikulin, Dmitry V. Dylov, Fakhri Karray, Vladislav Kurenkov, Martin Takáč, Arip Asadulaev

Abstract

Offline goal-conditioned reinforcement learning requires both long-horizon reachability estimates and local action comparisons. Dual goal representations provide value fields that capture global goal reachability, but they do not directly specify which action should be preferred at a given state. We propose Dual Advantage Fields, a policy-extraction method that turns a bilinear dual value model into a local advantage signal. Under bilinear dual parameterization, the goal embedding is the gradient of the value field with respect to the state representation. DAF learns an action-effect model that predicts the discounted feature displacement induced by an action and scores actions by the alignment between this displacement and the goal direction. In the realizable case, this score equals the goal-conditioned Bellman advantage, yielding a standard local policy-improvement guarantee. On OGBench locomotion, manipulation, and puzzle tasks, DAF improves aggregate RLiable metrics and performs strongly in settings where locally correct actions differ from direct movement toward the final goal.
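The scoring rule described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. It assumes a bilinear value of the form V(s, g) = phi(s) · psi(g), takes the discounted feature displacement to be gamma * E[phi(s')] - phi(s), and omits the reward term on the assumption that it does not depend on the action. All names (`phi_s`, `psi_g`, `phi_next`, `delta`) and the random features are hypothetical stand-ins for learned quantities.

```python
import numpy as np

rng = np.random.default_rng(0)

d = 4          # feature dimension (hypothetical)
n_actions = 3  # number of candidate actions (hypothetical)
gamma = 0.99   # discount factor (hypothetical value)

# Random stand-ins for learned representations.
phi_s = rng.normal(size=d)                  # phi(s): state representation
psi_g = rng.normal(size=d)                  # psi(g): goal embedding
phi_next = rng.normal(size=(n_actions, d))  # E[phi(s') | s, a], one row per action


def value(phi, psi):
    """Assumed bilinear dual value: V(s, g) = phi(s) . psi(g)."""
    return phi @ psi


# V is linear in phi(s), so its gradient with respect to phi(s) is psi(g):
# the goal embedding acts as the goal direction in feature space.
goal_direction = psi_g

# Realizable action-effect model (assumed form): discounted feature displacement.
delta = gamma * phi_next - phi_s  # shape (n_actions, d)

# Score each action by the alignment of its displacement with the goal direction.
scores = delta @ goal_direction  # shape (n_actions,)

# Reward-free Bellman advantage under the same value model, for comparison.
advantage = gamma * value(phi_next, psi_g) - value(phi_s, psi_g)

# Greedy action under the alignment score.
best_action = int(np.argmax(scores))
```

Under these assumptions `scores` and `advantage` coincide by linearity of the inner product, which is the sense in which the alignment score can stand in for the advantage when the action-effect model is exact. An action-independent reward would shift every entry by the same constant and leave the ranking of actions unchanged.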
