Instruct-Particulate: Scaling Feed-Forward 3D Object Articulation with Kinematic Control

2026-06-12 · Computer Vision and Pattern Recognition

Computer Vision and Pattern Recognition, Graphics, Robotics
AI summary

The authors present Instruct-Particulate, a model that splits a 3D mesh into movable parts and predicts how each part moves, guided by a kinematic specification that describes the parts, how they connect, and what type of joint links them. Because this specification lets the model target annotations of different granularity, it can be trained on a heterogeneous dataset of more than 150,000 articulated 3D objects, which extends existing public collections with 3D models partially labelled by vision-language models. At test time, the specification can be obtained automatically from vision-language models, so the method can be applied to any input mesh. Experiments show that the model generalizes better across object categories and to AI-generated meshes, which makes it possible to turn real-world images into articulated 3D assets via image-to-3D models.

3D mesh, articulated objects, kinematic specification, part segmentation, joint motion parameters, vision-language models, heterogeneous dataset, image-to-3D models, generalization, 3D reconstruction
Authors
Ruining Li, Yuxin Yao, Matt Zhou, Chuanxia Zheng, Christian Rupprecht, Joan Lasenby, Shangzhe Wu, Andrea Vedaldi
Abstract
Reconstructing articulated 3D objects is important for animation, gaming, and robotic simulations. Recent neural networks can estimate the articulated structure of 3D objects, but their generalization remains limited by the scarcity of annotated data for this task. To address this gap, we introduce Instruct-Particulate, a model that takes a 3D mesh together with a target kinematic specification, including part descriptions, connectivity, joint types, and optional point prompts, and predicts the corresponding kinematic part segmentation and joint motion parameters. The kinematic specification disambiguates the task and allows the model to target annotations of different granularity, thereby making it possible to use more abundant heterogeneous training data. At test time, the kinematic specification can be obtained automatically from large-scale vision-language models, so the model can be applied to any input mesh. To train our model at scale, we construct a heterogeneous dataset of more than 150,000 articulated 3D objects, extending existing publicly available collections with data obtained by partially labelling other 3D models (monolithic or already decomposed into parts) with kinematic labels by means of vision-language models. Experiments show that our model generalizes better across categories and to AI-generated meshes, enabling articulated asset reconstruction from real-world images via image-to-3D models.
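To make the idea of a kinematic specification concrete, the following is a minimal sketch of how such an input might be represented. It is not the authors' data format: the names `PartSpec`, `JOINT_TYPES`, and `validate_spec`, the joint vocabulary, and the assumption that connectivity forms a single rooted tree are all hypothetical choices made for illustration. Only the four ingredients (part descriptions, connectivity, joint types, optional point prompts) come from the abstract.

```python
from dataclasses import dataclass, field
from typing import Optional

# Hypothetical joint vocabulary; the abstract only mentions "joint types".
JOINT_TYPES = {"fixed", "revolute", "prismatic"}


@dataclass
class PartSpec:
    """One part of a kinematic specification (illustrative, not the paper's format)."""
    description: str                  # free-text part description
    parent: Optional[int] = None      # connectivity: index of the parent part, None for the root
    joint_type: str = "fixed"         # joint linking this part to its parent
    # Optional 3D point prompts on the mesh surface that belong to this part.
    point_prompts: list[tuple[float, float, float]] = field(default_factory=list)


def validate_spec(parts: list[PartSpec]) -> None:
    """Check that the parts form a single rooted tree (an assumption of this sketch)."""
    roots = [i for i, p in enumerate(parts) if p.parent is None]
    if len(roots) != 1:
        raise ValueError(f"expected exactly one root part, found {len(roots)}")

    # First pass: every joint type and parent index must be valid.
    for i, p in enumerate(parts):
        if p.joint_type not in JOINT_TYPES:
            raise ValueError(f"part {i}: unknown joint type {p.joint_type!r}")
        if p.parent is not None and not (0 <= p.parent < len(parts)):
            raise ValueError(f"part {i}: parent index {p.parent} out of range")

    # Second pass: walking up from any part must reach the root without a cycle.
    for i in range(len(parts)):
        seen = set()
        j: Optional[int] = i
        while j is not None:
            if j in seen:
                raise ValueError(f"part {i}: connectivity contains a cycle")
            seen.add(j)
            j = parts[j].parent


# Example: a cabinet with a hinged door and a sliding drawer.
cabinet = [
    PartSpec("cabinet body"),
    PartSpec("door", parent=0, joint_type="revolute"),
    PartSpec("drawer", parent=0, joint_type="prismatic",
             point_prompts=[(0.1, 0.2, 0.3)]),
]
validate_spec(cabinet)
```

In this sketch, a coarser or finer annotation of the same object would simply be a different list of parts, which is one way to picture how a specification could let a model target annotations of different granularity.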