SCAPO: Self-Supervised Category-Level Articulated Pose Estimation from a Single 3D Observation

2026-06-01 · Computer Vision and Pattern Recognition

AI summary

The authors present SCAPO, a method that infers how a 3D object with moving parts is structured and how it moves from a single RGB-D image. Unlike previous methods, SCAPO needs no ground-truth labels, multi-frame inputs, or CAD templates. It first aligns the object to a canonical pose, then estimates the rigid parts, the joints, and their articulation states by training itself to reconstruct the shape. Experiments on synthetic and real datasets show that SCAPO recovers part structure and joint parameters more accurately than other self-supervised methods.

3D object articulation, self-supervised learning, RGB-D observation, SE(3)-equivariance, vector-neuron autoencoder, canonical shape, rigid part segmentation, blend skinning, joint parameters, cycle reconstruction
Authors
Can Zhang, Gim Hee Lee
Abstract
Existing methods for category-level object articulation from a single 3D observation often rely on dense supervision, multi-frame inputs, or CAD templates, and still struggle to disentangle geometry from articulation or to recover explicit joint parameters. We propose SCAPO, a self-supervised framework that estimates canonical geometry, rigid part segmentation, and joint pivots, axes, and articulation states from a single RGB-D observation without ground-truth labels or category-specific models. SCAPO first uses an SE(3)-equivariant vector-neuron autoencoder to factor out global pose and align diverse instances into a shared canonical space. On this aligned shape, a joint-aware blend-skinning module then models part motion. We learn this representation through cycle reconstruction between observed and canonical shapes and through cross-space alignment with a learnable canonical template that decouples shared category geometry from instance-specific residual shape. Experiments on synthetic and real articulated-object datasets show that SCAPO recovers consistent part structure and accurate articulation parameters and outperforms all self-supervised baselines.
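To make the joint-aware blend-skinning idea concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes revolute joints only, each described by a pivot, an axis, and an angle, and it assumes the soft per-point part weights are already given (in the paper they would be predicted). The function names `rodrigues`, `revolute_transform`, and `blend_skinning` are hypothetical and chosen for illustration.

```python
import numpy as np


def rodrigues(axis, angle):
    """Rotation matrix for `angle` radians about the direction `axis`."""
    axis = np.asarray(axis, dtype=float)
    axis = axis / np.linalg.norm(axis)
    x, y, z = axis
    # Skew-symmetric cross-product matrix of the unit axis.
    K = np.array([[0.0, -z, y],
                  [z, 0.0, -x],
                  [-y, x, 0.0]])
    return np.eye(3) + np.sin(angle) * K + (1.0 - np.cos(angle)) * (K @ K)


def revolute_transform(points, pivot, axis, angle):
    """Rotate (N, 3) points about the line through `pivot` along `axis`."""
    R = rodrigues(axis, angle)
    # Row-vector convention: p' = R (p - pivot) + pivot.
    return (points - pivot) @ R.T + pivot


def blend_skinning(points, weights, pivots, axes, angles):
    """Linear blend skinning over P revolute joints (assumed joint model).

    points:  (N, 3) canonical points
    weights: (N, P) soft part assignments, each row summing to 1
    pivots:  (P, 3), axes: (P, 3), angles: (P,)
    """
    num_parts = weights.shape[1]
    # Move every point by every part's rigid motion: shape (N, P, 3).
    moved = np.stack(
        [revolute_transform(points, pivots[p], axes[p], angles[p])
         for p in range(num_parts)],
        axis=1,
    )
    # Blend the candidate positions with the per-point part weights.
    return np.einsum("np,npd->nd", weights, moved)


# Toy example: part 0 is a static base, part 1 is a lid hinged at (1, 0, 0)
# and rotating about the z-axis by 90 degrees.
points = np.array([[0.0, 0.0, 0.0],   # belongs to the base
                   [2.0, 0.0, 0.0]])  # belongs to the lid
weights = np.array([[1.0, 0.0],
                    [0.0, 1.0]])
pivots = np.array([[0.0, 0.0, 0.0],
                   [1.0, 0.0, 0.0]])
axes = np.array([[0.0, 0.0, 1.0],
                 [0.0, 0.0, 1.0]])
angles = np.array([0.0, np.pi / 2])

articulated = blend_skinning(points, weights, pivots, axes, angles)
restored = blend_skinning(articulated, weights, pivots, axes, -angles)
```

In this toy case the weights are one-hot, so applying the negated angles undoes the motion exactly. That round trip is the kind of consistency a cycle reconstruction objective between observed and canonical shapes could exploit. With genuinely soft weights the blend is no longer exactly invertible this way.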