Training and Evaluating Diffusion Policies with Long Context Lengths

2026-06-15 · Robotics

Robotics, Artificial Intelligence
AI summary

The authors studied how the length of the observation history (context) that a policy conditions on affects robot imitation learning from RGB observations. They found that naively increasing the context length is less brittle than prior work suggested, provided it is paired with a suitable conditioning method and denoising backbone (UNet+Cross-Attention). They also propose a training algorithm that jointly trains policies at multiple context lengths, which further reduces the sample complexity of long-context learning. Finally, they use their findings to re-evaluate previously proposed solutions to long-context imitation learning.

imitation learning, robotic manipulation, context length, policy conditioning, UNet, Cross-Attention, sample complexity, long-context learning, RGB observations, denoising backbone
Authors
Abhinav Agarwal, Adam Wei, Taylan Kargin, Michael Zeng, Cole Becker, Arif Kerem Dayi, Pablo Parrilo, Asuman Ozdaglar, Russ Tedrake
Abstract
Imitation learning has enabled highly dexterous robotic manipulation from RGB observations. Policies trained with these methods, however, typically condition robot actions on only a short history of observations. These policies cannot solve tasks that require memory and can get stuck repeatedly executing the same failing motions. In this work, we first benchmark policy performance as context length is incrementally increased from short to long, across a spectrum of tasks with varying local stability and memory requirements, and in multiple data regimes. To our knowledge, this is the first study to investigate context length in imitation learning at this level of detail. Our results challenge prior claims: naively scaling context length is not as brittle as advertised in the literature. With an appropriate conditioning method and denoising backbone (UNet+Cross-Attention), single-task policies achieve high success rates on many tasks in the usual data regime even with naive scaling. Next, we propose a training algorithm to jointly train policies at multiple context lengths, further reducing the sample complexity of long-context learning. Finally, we apply our findings to re-evaluate some previously proposed solutions to long-context imitation learning.
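
The abstract does not spell out how policies are trained at multiple context lengths, so the following is only an illustrative sketch and not the authors' algorithm. It assumes one simple way to expose a single policy to several context lengths: sample a length per training example and zero out observations older than that window. The function names (`sample_context_mask`, `apply_context_mask`), the tensor layout `(batch, time, feature)`, and the candidate lengths are all hypothetical.

```python
import numpy as np

def sample_context_mask(batch_size, max_context, context_lengths, rng):
    """Sample one context length per example and build a visibility mask.

    mask[b, t] is True for the most recent lengths[b] time steps, where
    t = 0 is the oldest step and t = max_context - 1 is the newest.
    """
    lengths = rng.choice(context_lengths, size=batch_size)
    steps = np.arange(max_context)
    mask = steps[None, :] >= (max_context - lengths)[:, None]
    return lengths, mask

def apply_context_mask(obs_history, mask):
    """Zero out observation features that fall outside each example's window."""
    return obs_history * mask[..., None]

# Hypothetical batch: 4 examples, 8 past steps, 16 features per step.
rng = np.random.default_rng(0)
obs = rng.normal(size=(4, 8, 16))
lengths, mask = sample_context_mask(4, 8, [1, 2, 4, 8], rng)
masked = apply_context_mask(obs, mask)
```

In a real training loop, `masked` (or the mask itself, passed to an attention layer) would be the conditioning input to the denoising network. Whether the paper uses masking, separate heads, or some other mechanism is not stated in the abstract.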