Chameleon: Episodic Memory for Long-Horizon Robotic Manipulation

2026-03-25Robotics

RoboticsArtificial IntelligenceComputer Vision and Pattern Recognition
AI summary

The authors explain that robots often need memory to handle situations where what they see isn't enough to know what to do next, because different past events can look the same. They created a system called Chameleon that stores detailed, geometry-based information from multiple senses to help the robot remember important details and make better decisions. They also made a new dataset using a real robot to test their method in tricky tasks where things look very similar but need different actions. Their approach helps robots perform more reliably and plan further ahead compared to other methods when the environment is confusing.

robotic manipulationmemory in roboticsperceptual aliasingMarkov decision processepisodic memorymultimodal tokensdifferentiable memory stackUR5e robotsequential manipulationdataset
Authors
Xinying Guo, Chenxi Jiang, Hyun Bin Kim, Ying Sun, Yang Xiao, Yuhang Han, Jianfei Yang
Abstract
Robotic manipulation often requires memory: occlusion and state changes can make decision-time observations perceptually aliased, making action selection non-Markovian at the observation level because the same observation may arise from different interaction histories. Most embodied agents implement memory via semantically compressed traces and similarity-based retrieval, which discards disambiguating fine-grained perceptual cues and can return perceptually similar but decision-irrelevant episodes. Inspired by human episodic memory, we propose Chameleon, which writes geometry-grounded multimodal tokens to preserve disambiguating context and produces goal-directed recall through a differentiable memory stack. We also introduce Camo-Dataset, a real-robot UR5e dataset spanning episodic recall, spatial tracking, and sequential manipulation under perceptual aliasing. Across tasks, Chameleon consistently improves decision reliability and long-horizon control over strong baselines in perceptually confusable settings.