PhyScene3D: Physically Consistent Interactive 3D Tabletop Scene Generation

2026-06-01 · Computer Vision and Pattern Recognition

AI summary

The authors address the problem of creating 3D scenes of tabletop objects that are physically plausible and can be loaded directly into robot simulators. They propose PhyScene3D, a method that builds scenes step by step, as a human would, placing each object relative to an anchor so that collisions are avoided. To cope with physically invalid examples in the training data and make generated scenes physically realistic, they introduce Physics-Aware Denoising Alignment, which adjusts scenes to obey physics while keeping their intended layout. Their approach produces more accurate and more physically valid scenes than previous methods.

3D scene generation, physics simulation, object placement, collision avoidance, anchored reasoning, axis-aligned bounding box (AABB), signed distance field (SDF), test-time optimization, denoising, robotic learning
Authors
Weixing Chen, Zhuoqian Feng, Yang Liu, Yexin Zhang, Yifan Wen, Yinghong Liao, Weichao Qiu, Guanbin Li, Liang Lin
Abstract
Generating physically consistent 3D tabletop scenes is a fundamental yet underexplored problem for interactive and generalist robotic learning. The challenge stems from dense object hierarchies and irregular affordances. Here, an interactive scene denotes a physically valid, collision-free environment directly loadable into physics simulators. Existing methods, ranging from decoupled symbolic solvers to end-to-end regression models, often suffer from error propagation or overfitting to noisy supervision containing widespread physical violations. To address these limitations, we introduce PhyScene3D, a framework that reformulates generation as a Human-Mimetic Constructive Process. The proposed Cognitive Topological Reasoning Chain (CTRC) factorizes scene synthesis into a sequential, anchor-conditioned process. It employs a 3D AABB-based placement scheme that imposes a strong structural inductive bias. To address imperfect supervision and physical infeasibility, we introduce Physics-Aware Denoising Alignment (PADA). It integrates a differentiable Signed Distance Field (SDF) with Test-Time Optimization (TTO) to project generated scenes onto a physics-feasible manifold while preserving semantic intent. Experiments demonstrate that PhyScene3D outperforms state-of-the-art approaches in both semantic accuracy and physical validity, achieving a 40% reduction in scene-wise collision rate relative to the human-annotated training data.
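To make the idea of projecting a scene onto a collision-free configuration more concrete, the sketch below resolves overlaps between 3D AABBs with a simple push-apart heuristic. It is only an illustration under stated assumptions: the abstract describes a differentiable SDF combined with test-time optimization, whereas this example swaps in a much cruder minimum-translation rule on boxes. The function names (`penetration`, `collision_volume`, `push_apart`), the `margin` parameter, and the choice to move objects only in the table plane are all hypothetical and not taken from the paper.

```python
import numpy as np

def penetration(center_a, size_a, center_b, size_b):
    """Per-axis penetration depth of two AABBs (0 on any axis where they are separated)."""
    half_extent_sum = (size_a + size_b) / 2.0
    return np.maximum(0.0, half_extent_sum - np.abs(center_a - center_b))

def collision_volume(centers, sizes):
    """Total overlap volume summed over all object pairs; 0 means collision-free."""
    total = 0.0
    for i in range(len(centers)):
        for j in range(i + 1, len(centers)):
            total += float(np.prod(penetration(centers[i], sizes[i], centers[j], sizes[j])))
    return total

def push_apart(centers, sizes, max_iters=50, margin=1e-3):
    """Illustrative stand-in for a physics-feasibility projection (assumption, not the paper's PADA).

    Each overlapping pair is separated along the table-plane axis (x or y) with the
    smallest penetration, so objects move as little as possible and keep their height.
    """
    centers = np.array(centers, dtype=float)  # work on a copy
    sizes = np.asarray(sizes, dtype=float)
    for _ in range(max_iters):
        moved = False
        for i in range(len(centers)):
            for j in range(i + 1, len(centers)):
                depth = penetration(centers[i], sizes[i], centers[j], sizes[j])
                if np.all(depth > 0.0):  # boxes overlap only if they overlap on every axis
                    axis = int(np.argmin(depth[:2]))  # restrict motion to x/y
                    direction = 1.0 if centers[i, axis] >= centers[j, axis] else -1.0
                    shift = direction * (depth[axis] + margin) / 2.0
                    centers[i, axis] += shift
                    centers[j, axis] -= shift
                    moved = True
        if not moved:
            break
    return centers

# Two unit cubes resting on the table (z = 0.5) that initially interpenetrate.
centers = np.array([[0.0, 0.0, 0.5], [0.5, 0.1, 0.5]])
sizes = np.ones((2, 3))
fixed = push_apart(centers, sizes)
```

In this toy case the two cubes are separated along x only, so their y and z coordinates, and hence most of the intended layout, are preserved. A gradient-based variant would instead minimize a penalty such as `collision_volume` plus a term that keeps objects close to their generated positions.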