Direction-Conditioned Policies via Compositional Subgoal Scoring for Online Goal-Conditioned Reinforcement Learning

2026-06-15, Machine Learning

Machine Learning, Artificial Intelligence, Robotics
AI summary

The authors show that, to choose the best action for reaching a goal, an agent only needs the direction and distance toward that goal from its current state, not the raw goal itself. They propose Direction-Conditioned Policies (DCP), which splits goal-reaching into two steps: picking a previously visited intermediate state that points toward the goal, and moving in the direction of that state. Both parts are trained together and share one learned representation; at deployment the subgoal-picking step is dropped and the direction is computed straight to the final goal. They prove that direction alone is sufficient for the optimal action under control-affine dynamics, bound the mismatch between the actor's inputs at training and at deployment, and characterize when directional conditioning fails. Across nine environments, DCP improves over Contrastive RL on most final metrics, with the largest gains on manipulation and obstacle-interaction tasks.

Hamilton-Jacobi-Bellman theory, goal-conditioned reinforcement learning, InfoNCE representation, directional conditioning, control-affine dynamics, subgoal scoring, contrastive representation learning, value gradient, geodesic slack
Authors
Swaminathan S K, Damiya Gondha, Theyanesh Eswaramoorthy Rajahkrishnan, Aritra Hazra
Abstract
Hamilton-Jacobi-Bellman theory implies that the optimal goal-conditioned action depends on the goal only through the gradient of the goal-reaching distance at the current state, yet standard online GCRL still conditions the actor on the raw goal, a signal that is geometrically uninformative when the goal is far from the data distribution. We propose Direction-Conditioned Policies (DCP), a fully online method that decomposes goal-reaching into two components sharing one InfoNCE representation $ψ$: a subgoal-scoring step that selects a visited state $z_t$ aligned with the final goal $g$ in $ψ_g$, and a direction-conditioned actor that consumes the unit direction $d_t$ and magnitude $r_t$ from $ψ(s_t)$ to $ψ(z_t)$. The two components train jointly, factor cleanly at deployment (subgoal scoring is removed, while direction conditioning remains with $g$ in place of $z_t$), and admit independent modification at the same $(d_t,r_t)$ interface. We prove three results. First, direction sufficiency under HJB: the optimal action under control-affine dynamics depends on the goal only through the value gradient. Second, a quantitative bound showing that, under mild conditions on the learned representation and assuming the scoring rule returns an on-path $z_t$, the actor's conditioning inputs at training and at deployment coincide up to representation error and geodesic slack. Third, a controllable-subspace characterization of when directional conditioning fails. Across nine environments, DCP improves over Contrastive RL on most final metrics, with the largest gains on manipulation and obstacle-interaction tasks; a qualitative analysis of the learned $ψ$-distance landscape shows the contrastive representation behaves as an online quasimetric encoding environment topology, and the single failure case (AntSoccer) localizes to a learned-gradient pathology that the theory anticipates.
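
The $(d_t, r_t)$ interface described in the abstract can be illustrated with a minimal sketch. It assumes embeddings are plain lists of floats and that the magnitude is a Euclidean norm in $ψ$-space. The abstract does not specify the scoring rule, so the cosine-alignment rule below is a hypothetical stand-in, not the authors' rule, and the names `direction_and_magnitude`, `alignment_score`, and `select_subgoal` are illustrative only.

```python
import math


def direction_and_magnitude(psi_s, psi_target, eps=1e-8):
    """Return (d, r): unit direction and magnitude from psi_s to psi_target.

    Assumption: r is the Euclidean norm of the embedding difference.
    """
    delta = [t - s for s, t in zip(psi_s, psi_target)]
    r = math.sqrt(sum(x * x for x in delta))
    d = [x / max(r, eps) for x in delta]  # eps guards against a zero-length step
    return d, r


def alignment_score(psi_s, psi_z, psi_g):
    """Hypothetical scoring rule: cosine between the directions to z and to g."""
    d_z, _ = direction_and_magnitude(psi_s, psi_z)
    d_g, _ = direction_and_magnitude(psi_s, psi_g)
    return sum(a * b for a, b in zip(d_z, d_g))


def select_subgoal(psi_s, psi_candidates, psi_g):
    """Index of the visited-state embedding best aligned with the goal."""
    scores = [alignment_score(psi_s, z, psi_g) for z in psi_candidates]
    return max(range(len(scores)), key=scores.__getitem__)


# Toy 2-D embeddings (made-up values, for illustration only).
psi_s = [0.0, 0.0]                                    # psi(s_t)
psi_g = [4.0, 0.0]                                    # psi(g)
psi_candidates = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]  # psi of visited states

# Training: the actor is conditioned on the direction to the selected subgoal z_t.
idx = select_subgoal(psi_s, psi_candidates, psi_g)
d_train, r_train = direction_and_magnitude(psi_s, psi_candidates[idx])

# Deployment: subgoal scoring is dropped and g takes the place of z_t.
d_deploy, r_deploy = direction_and_magnitude(psi_s, psi_g)
```

In this toy case the selected subgoal lies on the straight line to the goal, so the training and deployment directions coincide and only the magnitude differs. This is the idealized on-path situation that the second theoretical result assumes of the scoring rule.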