KEMO: Event-Driven Keyframe Memory for Long-Horizon Robot Manipulation with VLA Policies

2026-06-22

Robotics
AI summary

The authors address a difficulty robots face in long tasks: observations at different stages can look alike, yet the correct action depends on what has already been done. They created KEMO, a lightweight memory system that stores only important "keyframe" moments, the points where the task state changes significantly. KEMO uses both robot motion data and vision to pick these moments, feeds them to the policy as a compact memory, and gives extra weight during training to samples near them. In real-world two-arm tasks, KEMO raised aggregate task success by 23.6% and stage completion by 34.1% over a memory-free baseline, with ablations crediting both the event-driven choice of keyframes and the way the memory is fused with current visual features.

long-horizon manipulation, robot memory, keyframe selection, vision-language-action (VLA) models, cross-attention, gated residual fusion, event-driven sampling, task success rate, subtask completion
Authors
Yihan Zeng, Minghao Ye, Yiyuan Chen, Yide Shentu, Philipp Wu, Zike Yan, Zhongyu Li
Abstract
Long-horizon robot manipulation remains challenging because similar observations may occur at different execution stages, while the appropriate action depends on previously completed operations. Memory can address this ambiguity by enabling policies to infer task progress from execution history. However, existing memory-augmented approaches often either retain dense histories that require compression or rely primarily on recent context, which may discard earlier task-relevant events. In this work, we propose KEMO, a lightweight plug-in memory framework for VLA policies that automatically and selectively preserves keyframes associated with task-relevant state changes. KEMO combines robot kinematics with visual filtering to detect events, encodes the selected keyframes as compact, temporally ordered memory tokens, and integrates them with current visual features through cross-attention and gated residual fusion for VLA training. The detected events also define higher-weight training samples near critical transitions. We evaluate KEMO on real-world dual-arm manipulation tasks spanning 2 to 6 scored subtasks, with trajectory lengths ranging from 830 to 2846 execution steps (durations from 28 to 95 seconds). Compared with the memory-free baseline (e.g., $\pi_{0.5}$), KEMO improves aggregate Task Success Rate by 23.6\% and Stage Completion Rate by 34.1\%. Ablations show that event-driven keyframe selection outperforms uniform sampling and recent-frame retention, while the proposed gated fusion and keyframe-aligned loss weighting provide complementary gains.
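To make the event-driven selection and keyframe-aligned loss weighting concrete, the following is a minimal sketch and not the authors' implementation. The abstract does not specify the detection rule, so the sketch assumes a kinematic event is a gripper open/close toggle and that the visual filter drops an event whose frame feature is nearly identical to the last stored keyframe. The function names (`detect_keyframes`, `loss_weights`) and all parameters (`min_visual_change`, `radius`, `boost`) are hypothetical.

```python
from typing import List, Sequence


def detect_keyframes(
    gripper_open: Sequence[bool],
    frame_features: Sequence[Sequence[float]],
    min_visual_change: float = 0.1,
) -> List[int]:
    """Return the step indices kept as keyframes, in temporal order.

    Assumed rule (illustrative only): a candidate event is a gripper
    open/close toggle; it is kept only if its frame feature differs from
    the last stored keyframe by at least `min_visual_change` (mean
    absolute difference).
    """
    keyframes: List[int] = []
    for t in range(1, len(gripper_open)):
        if gripper_open[t] == gripper_open[t - 1]:
            continue  # no kinematic event at this step
        if keyframes:
            last = frame_features[keyframes[-1]]
            current = frame_features[t]
            diff = sum(abs(a - b) for a, b in zip(current, last)) / len(last)
            if diff < min_visual_change:
                continue  # visually redundant with the last keyframe
        keyframes.append(t)
    return keyframes


def loss_weights(
    num_steps: int,
    keyframes: Sequence[int],
    radius: int = 2,
    boost: float = 3.0,
) -> List[float]:
    """Per-step training weights: `boost` within `radius` steps of a
    keyframe, 1.0 elsewhere (an assumed weighting scheme)."""
    weights = [1.0] * num_steps
    for k in keyframes:
        for t in range(max(0, k - radius), min(num_steps, k + radius + 1)):
            weights[t] = boost
    return weights


# Toy trajectory: 8 steps, one-dimensional frame features.
gripper = [True, True, False, False, False, True, True, False]
features = [[0.0], [0.0], [0.5], [0.5], [0.5], [0.52], [0.52], [1.0]]

keyframes = detect_keyframes(gripper, features)  # [2, 7]
weights = loss_weights(len(gripper), keyframes, radius=1)
```

In this toy trajectory the toggle at step 5 is discarded because its feature is almost the same as the keyframe stored at step 2, so only steps 2 and 7 enter memory. In the described framework the stored keyframes would then be encoded as memory tokens and fused with current visual features, which this sketch does not cover.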