SIR: Structured Image Representations for Explainable Robot Learning
2026-06-29 • Robotics
Robotics, Computer Vision and Pattern Recognition
AI summary
The authors developed a new way for robots to understand images by turning pictures into connected graphs that highlight important parts for tasks. Their method, called Structured Image Representations (SIR), picks out relevant objects and ignores distractions, making the robot's decisions easier to follow and explain. When tested, their approach did better than traditional image-based methods. They also used their graphs to find hidden problems in the training data, like misleading connections or biases.
Scene Graphs, Visual Embeddings, Robot Policy Learning, Graph Sparsification, Explainability, RoboCasa, Structured Image Representations, Spurious Correlations, Positional Biases
Authors
Paul Mattes, Jan Schwab, Jens Bosch, Nils Blank, Maximilian Xiling Li, Minh-Trung Tang, Moritz Haberland, Rudolf Lioutikov
Abstract
Existing robot policies based on learned visual embeddings lack explicit structure and are sensitive to visual distractions. As a result, the representations that drive their behaviour are often opaque, making their decision-making process difficult to interpret. To address this, we introduce Structured Image Representations (SIR), a method that leverages Scene Graphs (SGs) as an intermediate representation for robot policy learning. Our approach first constructs a fully connected graph, using image-derived features as initial node representations. Then, a module learns to sparsify this graph end-to-end, creating a task-relevant sub-graph that is passed to the action generation model. This process makes our model intrinsically explainable. Evaluations on RoboCasa show that our sparse graph policies outperform image-based baselines, with an average success rate of 19.5% vs. 14.81%. Most importantly, we show that the learned sparse graphs are a powerful tool for model analysis. By analysing when the model's sub-graph deviates from human expectation, such as by including distractor nodes or omitting key objects, we successfully uncover dataset biases, including spurious correlations and positional biases. Repository: https://github.com/intuitive-robots/SIR_Model
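The pipeline described in the abstract (fully connected graph, sparsification, comparison of the resulting sub-graph with human expectation) can be illustrated with a small toy sketch. This is not the authors' implementation: the bilinear edge score, the hard threshold, the identity matrix standing in for learned weights, the node names, and the feature values are all assumptions made for illustration. The abstract also does not say how the sparsification is trained end-to-end, so the sketch only shows the inference-time effect.

```python
import numpy as np

def fully_connected_adjacency(num_nodes):
    """Adjacency matrix of a fully connected graph without self-loops."""
    return np.ones((num_nodes, num_nodes)) - np.eye(num_nodes)

def sparsify(node_feats, adjacency, w, threshold=0.5):
    """Keep edges whose relevance score exceeds `threshold`.

    The score sigmoid(x_i^T W x_j) is a hypothetical stand-in for a learned
    sparsification module; W would normally be trained with the policy.
    """
    logits = node_feats @ w @ node_feats.T
    logits = 0.5 * (logits + logits.T)        # symmetrise for an undirected graph
    scores = 1.0 / (1.0 + np.exp(-logits))
    return adjacency * (scores > threshold)

def active_nodes(adjacency):
    """Indices of nodes that still have at least one edge."""
    return set(np.flatnonzero(adjacency.sum(axis=1) > 0).tolist())

def compare_to_expectation(kept, expected):
    """List nodes kept unexpectedly and expected nodes that were dropped."""
    return {
        "distractors": sorted(kept - expected),
        "omitted": sorted(expected - kept),
    }

# Toy scene: names and 2-D features are made up for this example.
names = ["gripper", "mug", "counter", "poster"]
feats = np.array([
    [2.0, 0.0],    # gripper
    [2.0, 0.5],    # mug
    [-2.0, 1.0],   # counter
    [1.5, 1.0],    # poster (a distractor that happens to score highly)
])
w = np.eye(2)      # assumption: identity in place of learned weights

full = fully_connected_adjacency(len(names))
sparse = sparsify(feats, full, w)
kept_names = {names[i] for i in active_nodes(sparse)}

# A human annotator might expect these objects to matter for the task.
expected_names = {"gripper", "mug", "counter"}
report = compare_to_expectation(kept_names, expected_names)
print(kept_names, report)
```

In this toy scene the sub-graph keeps the poster and drops the counter, which is the kind of deviation from human expectation the abstract describes as a cue for spurious correlations in the data.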