HeteroGenManip: Generalizable Manipulation For Heterogeneous Object Interactions

2026-05-11

Robotics, Artificial Intelligence
AI summary

The authors address the challenge of robots manipulating objects of different types by splitting the task into two stages: first choosing where to grasp the object, then planning how to move it. Their framework, HeteroGenManip, decouples these stages and routes objects to foundation models specialized for their category. This reduces pose uncertainty at grasp time and improves the execution of complex interactions. Compared with previous single-model approaches, the authors report an average 31% improvement in simulation and a 36.7% gain across four real-world tasks.

robotic manipulation, contact point localization, trajectory planning, foundation models, cross-attention mechanism, pose uncertainty, diffusion policy, generalization, category-specific modeling
Authors
Zhenhao Shen, Zeming Yang, Yue Chen, Yuran Wang, Shengqiang Xu, Mingleyang Li, Hao Dong, Ruihai Wu
Abstract
Generalizable manipulation involving cross-type object interactions is a critical yet challenging capability in robotics. To reliably accomplish such tasks, robots must address two fundamental challenges: "where to manipulate" (contact point localization) and "how to manipulate" (subsequent interaction trajectory planning). Existing foundation-model-based approaches often adopt end-to-end learning that obscures the distinction between these stages, exacerbating error accumulation in long-horizon tasks. Furthermore, they typically rely on a single uniform model, which fails to capture the diverse, category-specific features required for heterogeneous objects. To overcome these limitations, we propose HeteroGenManip, a task-conditioned, two-stage framework designed to decouple the initial grasp from complex interaction execution. First, a Foundation-Correspondence-Guided Grasp module leverages structural priors to align the initial contact state, thereby significantly reducing the pose uncertainty of grasping. Subsequently, a Multi-Foundation-Model Diffusion Policy (MFMDP) routes objects to category-specialized foundation models, integrating fine-grained geometric information with highly variable part features via a dual-stream cross-attention mechanism. Experimental evaluations demonstrate that HeteroGenManip achieves robust intra-category shape and pose generalization. The framework achieves an average 31% performance improvement in simulation tasks covering a broad range of object-type settings, alongside a 36.7% gain across four real-world tasks with different interaction types.
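
To make the "dual-stream cross-attention" idea in the abstract more concrete, the following is a minimal sketch in plain Python. It is not the authors' implementation: the abstract does not specify the architecture, so the function names (`cross_attend`, `dual_stream_fusion`), the single-head attention without learned projections, the residual connection, and the concatenation of the two streams are all assumptions made for illustration. Each stream (geometric tokens and part-feature tokens) queries the other, and the two updated streams are joined into one token sequence that a downstream policy could consume.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(queries, context):
    """Scaled dot-product attention: each query token attends over all context tokens.

    Assumption: single head, no learned projections, keys and values are the
    context tokens themselves.
    """
    dim = len(queries[0])
    scale = 1.0 / math.sqrt(dim)
    attended = []
    for q in queries:
        scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in context]
        weights = softmax(scores)
        attended.append(
            [sum(w * v[d] for w, v in zip(weights, context)) for d in range(dim)]
        )
    return attended


def dual_stream_fusion(geom_tokens, part_tokens):
    """Hypothetical fusion: each stream attends to the other, with a residual add.

    Returns the updated geometric tokens followed by the updated part tokens.
    """
    geom_att = cross_attend(geom_tokens, part_tokens)  # geometry queries parts
    part_att = cross_attend(part_tokens, geom_tokens)  # parts query geometry
    geom_out = [[g + a for g, a in zip(tok, att)] for tok, att in zip(geom_tokens, geom_att)]
    part_out = [[p + a for p, a in zip(tok, att)] for tok, att in zip(part_tokens, part_att)]
    return geom_out + part_out


# Usage with random stand-in features: 4 geometric tokens and 3 part tokens, dimension 8.
rng = random.Random(0)
geom = [[rng.uniform(-1, 1) for _ in range(8)] for _ in range(4)]
part = [[rng.uniform(-1, 1) for _ in range(8)] for _ in range(3)]
fused = dual_stream_fusion(geom, part)
print(len(fused), len(fused[0]))  # 7 tokens, each of dimension 8
```

In a real system the two token sets would come from the category-specialized foundation models the abstract mentions, and the attention would typically use learned query, key, and value projections with multiple heads; the sketch omits these to keep the data flow between the two streams visible.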