Bridging Semantics and Kinematics: A Modular Framework for Zero-Shot Robotic Manipulation

2026-06-22

Robotics
AI summary

The authors created a system that lets robots follow plain-language instructions to move objects, without any task-specific training. The system splits the job into three steps: seeing the objects, understanding the instruction, and moving the robot safely. It uses FastSAM segmentation and Set-of-Mark prompting to label each object with a visual marker, and a language model to turn commands into task configurations the robot can execute. In tests across two unfamiliar setups, the robot achieved a 62% end-to-end task success rate. This suggests the approach can help robots carry out complex jobs from plain-language instructions alone.

Zero-shot learning, Vision-Language Models (VLMs), Large Language Models (LLMs), Robotic manipulation, Visual perception, Semantic interpretation, Task execution, FastSAM, MoveIt Task Constructor (MTC), Spatial reasoning
Authors
Ali Alabbas, Dipshikha Das, Camillo Murgia, Sainul Ansary, Alaa Elkamash, Philip Long
Abstract
This paper presents a modular, training-free framework for zero-shot, language-guided robotic manipulation in semi-structured environments. The architecture bridges the gap between high-level reasoning and low-level kinematics by decomposing the vision-action pipeline into three stages: visual perception, semantic interpretation, and task execution. To overcome the spatial ambiguity and semantic hallucinations inherent in standard Vision-Language Models (VLMs), the perception module employs FastSAM and Set-of-Mark (SoM) prompting to dynamically generate grounded, alphanumeric visual anchors. The same foundation model then operates purely as a Large Language Model (LLM), acting as a semantic router that translates unconstrained human directives into verifiable, reconfigurable task configurations. Finally, a Task Orchestrator dynamically parses these configurations into MoveIt Task Constructor (MTC) tasks to generate collision-free trajectories. The framework is evaluated in two zero-shot experimental setups, unconstrained open-world sequential manipulation and dense relational spatial reasoning, and achieves a 62% end-to-end task success rate across both scenarios. This result demonstrates its capacity to execute complex physical actions without domain-specific training or manual coordinate programming.
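
The three-stage decomposition in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the `Mark` class, the `A1`-style labels, the JSON schema, the action vocabulary, and the stage names are all hypothetical, and the segmentation, language-model, and MoveIt calls are replaced by plain data so that the example runs on its own.

```python
import json
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Mark:
    """A visual anchor: an alphanumeric label tied to one segmented region."""
    label: str
    centroid_px: Tuple[int, int]  # pixel centre of the segment mask (assumed)


def assign_marks(centroids: List[Tuple[int, int]]) -> Dict[str, Mark]:
    """Stage 1 stand-in: label each segment centroid A1, A2, ... (hypothetical scheme)."""
    return {f"A{i}": Mark(f"A{i}", c) for i, c in enumerate(centroids, start=1)}


ALLOWED_ACTIONS = {"pick", "place"}  # assumed action vocabulary, not from the paper


def validate_config(raw_json: str, marks: Dict[str, Mark]) -> List[dict]:
    """Stage 2 stand-in: parse the router's JSON and reject ungrounded steps."""
    steps = json.loads(raw_json)["steps"]
    for step in steps:
        if step["action"] not in ALLOWED_ACTIONS:
            raise ValueError(f"unknown action: {step['action']}")
        if step["target"] not in marks:
            raise ValueError(f"ungrounded target: {step['target']}")
    return steps


def to_stage_plan(steps: List[dict], marks: Dict[str, Mark]) -> List[tuple]:
    """Stage 3 stand-in: expand validated steps into an ordered stage list.

    A real orchestrator would build motion-planning stages here; this
    version only emits (stage_name, label, centroid) tuples.
    """
    plan = []
    for step in steps:
        mark = marks[step["target"]]
        plan.append(("move_to", mark.label, mark.centroid_px))
        plan.append((step["action"], mark.label, mark.centroid_px))
    return plan


# Example run with made-up centroids and a made-up router response.
marks = assign_marks([(120, 80), (300, 210)])
router_output = (
    '{"steps": [{"action": "pick", "target": "A1"},'
    ' {"action": "place", "target": "A2"}]}'
)
plan = to_stage_plan(validate_config(router_output, marks), marks)
for stage in plan:
    print(stage)
```

Checking every step against the set of known marks before planning is one way to read "verifiable" in the abstract: a step that names a label absent from the scene is rejected before any motion is attempted.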