Latent Actions from Factorized Transition Effects under Agent Ambiguity
2026-06-29 • Artificial Intelligence
AI summary
The authors address the problem that, in videos with many moving objects and distractors, it is hard to tell which changes are caused by the main actor's actions. They propose Observed Transition Factorization (OTF), which breaks each scene change into a sparse set of simpler, reusable parts called transition primitives. Building on this, their OTF-LAM method forms action-like latents from these primitives, and a decoder-free variant, OTF-LAM-Dino, predicts future states in a frozen DINOv2 representation space instead of decoding images. Their experiments show that the primitives transfer zero-shot across controlled carrier and morphology shifts and that downstream policies match or outperform baselines even when transitions are complex and ambiguous.
Latent Action Models, Observed Transition Factorization, transition primitives, inverse dynamics, forward dynamics, DINOv2, zero-shot transfer, policy learning, multi-object scenes, visual representation
Authors
Heejeong Nam, Chandradithya S Jonnalagadda, Harshit Aggarwal, Eric Xu, Randall Balestriero
Abstract
Latent Action Models (LAMs) learn action-like proxies from observation transitions. However, in multi-object or distractor-rich scenes, the observed visual effects mix agent motion with distractors, camera dynamics, and background changes, making the underlying action source ambiguous without supervision. Structuring this mixture as reusable transition effects provides an intermediate representation from which action-like latents can be formed more robustly. We introduce Observed Transition Factorization (OTF), which decomposes each transition into a sparse set of observed transition primitives. Using these primitives as the transition interface, we propose OTF-LAM, which abstracts motion primitives into action-like latents within the standard inverse-forward dynamics framework, and OTF-LAM-Dino, a decoder-free variant that predicts future states in a frozen DINOv2 representation space. Empirically, OTF primitives transfer zero-shot across controlled carrier and morphology shifts, demonstrating their reusability. Furthermore, downstream policy learning results match or outperform baselines under complex transition ambiguity.
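To make the inverse-forward loop mentioned in the abstract concrete, the following is a minimal toy sketch, not the authors' OTF or OTF-LAM implementation. It assumes low-dimensional vector states, a small hand-written dictionary of transition primitives (in the paper the primitives are learned from observations), and a simple top-k rule for sparsity. All names (`inverse_dynamics`, `forward_dynamics`, `PRIMITIVES`, `k`) are hypothetical and chosen only for illustration.

```python
from typing import List, Sequence

Vector = List[float]


def inverse_dynamics(s_t: Sequence[float], s_next: Sequence[float],
                     primitives: Sequence[Sequence[float]], k: int) -> Vector:
    """Hypothetical inverse model: code the observed change as a sparse
    combination of primitives by keeping the k largest projections."""
    delta = [b - a for a, b in zip(s_t, s_next)]
    # Projection of the observed change onto each primitive direction.
    coeffs = [sum(d * p for d, p in zip(delta, prim)) for prim in primitives]
    # Top-k sparsification (an assumption, not the paper's mechanism).
    order = sorted(range(len(coeffs)), key=lambda i: abs(coeffs[i]), reverse=True)
    keep = set(order[:k])
    return [c if i in keep else 0.0 for i, c in enumerate(coeffs)]


def forward_dynamics(s_t: Sequence[float], z: Sequence[float],
                     primitives: Sequence[Sequence[float]]) -> Vector:
    """Hypothetical forward model: apply the coded primitives to the state."""
    pred = list(s_t)
    for c, prim in zip(z, primitives):
        for j, p in enumerate(prim):
            pred[j] += c * p
    return pred


def mse(a: Sequence[float], b: Sequence[float]) -> float:
    """Mean squared error between two equally sized vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


# Assumed toy dictionary: two agent motions and one distractor drift.
PRIMITIVES = [
    [1.0, 0.0, 0.0, 0.0],  # agent moves along x
    [0.0, 1.0, 0.0, 0.0],  # agent moves along y
    [0.0, 0.0, 1.0, 0.0],  # distractor drifts
]

s_t = [0.0, 0.0, 0.0, 0.0]
s_next = [1.0, 0.0, 0.2, 0.0]  # agent motion mixed with a small distractor change

z_sparse = inverse_dynamics(s_t, s_next, PRIMITIVES, k=1)  # dominant effect only
z_full = inverse_dynamics(s_t, s_next, PRIMITIVES, k=2)    # agent + distractor
err_sparse = mse(forward_dynamics(s_t, z_sparse, PRIMITIVES), s_next)
err_full = mse(forward_dynamics(s_t, z_full, PRIMITIVES), s_next)
```

In this toy setting, `k=1` keeps only the dominant agent-motion primitive and leaves a small residual from the distractor, while `k=2` also codes the distractor drift and reconstructs the next state exactly. The point is only to illustrate how a factorized code can separate sources of change; how primitives are learned and abstracted into action-like latents is specific to the paper.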