Perceive-then-Plan: Layout-as-Policy for Monocular 3D Scene Layout Estimation
2026-05-25 • Computer Vision and Pattern Recognition
AI summary
The authors describe a method that builds a 3D scene layout from a single image by first detecting the objects and then improving the layout step by step until it is physically realistic. The system has two parts: a Perceiver that locates the objects in 3D, and a Planner that applies small adjustments such as moving, rotating, or resizing objects to fix mistakes. Rather than predicting the whole layout at once, the approach treats refinement as policy learning, in which the system learns which corrective action to take next. The resulting 3D layouts better match the image and follow real-world physical constraints, and they are also easier to edit afterwards.
3D scene layout, monocular image, vision-language models, Perceiver, policy learning, iterative refinement, physical plausibility, trajectory initialization, preference-based optimization, scene editing
Authors
Junwei Zhou, Yu-Wing Tai
Abstract
Building structured 3D scene layouts from a single image requires reconciling visual observations with physical and spatial constraints, a challenge that is difficult to address with direct prediction alone. In this work, we formulate monocular 3D layout estimation as a perceive-then-plan problem with vision-language models, where a Perceiver first grounds the 3D objects and then a Planner iteratively refines the scene hypothesis through actions that improve physical plausibility while preserving consistency with the input image. We propose Layout-as-Policy (LaP), which casts the planning stage as a policy learning problem: 3D layouts are represented as structured states, and refined via discrete actions such as translation, rotation, and rescaling. Starting from an observation-aligned initialization with the geometry-enhanced Perceiver, the LaP Planner is trained to produce action sequences that progressively resolve geometric inconsistencies and enforce realistic spatial relations. To enable effective learning, we combine supervised trajectory initialization with preference-based optimization, allowing the model to learn corrective behaviors without requiring explicit reward engineering. This formulation transforms layout estimation from a one-shot prediction task into an iterative refinement process, enabling better handling of global constraints and complex object interactions. Experiments demonstrate that our approach produces layouts that are more physically coherent and better aligned with visual observations, while naturally supporting downstream tasks such as scene editing and manipulation.
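To make the state-and-action formulation concrete, the following is a minimal Python sketch of a layout stored as a structured state and refined through discrete translate, rotate, and rescale actions. Everything in it is an illustrative assumption rather than the authors' implementation: the names `Box`, `Action`, `apply`, `penalty`, `candidates`, and `refine` are hypothetical, the plausibility score is a toy (axis-aligned overlap volume plus distance from the floor, ignoring yaw), and a greedy search stands in for the learned LaP Planner, which the abstract describes as a trained vision-language policy.

```python
from dataclasses import dataclass, replace
from itertools import combinations


@dataclass(frozen=True)
class Box:
    """Hypothetical per-object state: center, extents, and yaw."""
    x: float
    y: float
    z: float            # center in metres, z points up
    w: float
    d: float
    h: float            # extents along x, y, z
    yaw: float = 0.0    # rotation about z in degrees


@dataclass(frozen=True)
class Action:
    """Hypothetical discrete edit applied to one named object."""
    target: str
    kind: str           # "translate", "rotate", or "rescale"
    value: tuple


def apply(layout, action):
    """Return a new layout with the action applied; the input is not mutated."""
    b = layout[action.target]
    if action.kind == "translate":
        dx, dy, dz = action.value
        b = replace(b, x=b.x + dx, y=b.y + dy, z=b.z + dz)
    elif action.kind == "rotate":
        b = replace(b, yaw=(b.yaw + action.value[0]) % 360.0)
    elif action.kind == "rescale":
        s = action.value[0]
        b = replace(b, w=b.w * s, d=b.d * s, h=b.h * s)
    else:
        raise ValueError(f"unknown action kind: {action.kind}")
    return {**layout, action.target: b}


def _overlap_1d(c1, e1, c2, e2):
    """Length of the overlap between two centered 1D intervals."""
    return max(0.0, min(c1 + e1 / 2, c2 + e2 / 2) - max(c1 - e1 / 2, c2 - e2 / 2))


def penalty(layout):
    """Toy implausibility score (assumption): floor gap plus pairwise overlap.

    Yaw is ignored, so boxes are treated as axis-aligned.
    """
    total = sum(abs(b.z - b.h / 2) for b in layout.values())  # floor is z = 0
    for a, b in combinations(layout.values(), 2):
        total += (_overlap_1d(a.x, a.w, b.x, b.w)
                  * _overlap_1d(a.y, a.d, b.y, b.d)
                  * _overlap_1d(a.z, a.h, b.z, b.h))
    return total


def candidates(layout, step=0.1):
    """Enumerate unit translations only; a learned policy would also
    propose rotations and rescalings."""
    for name in layout:
        for axis in range(3):
            for sign in (-1, 1):
                delta = [0.0, 0.0, 0.0]
                delta[axis] = sign * step
                yield Action(name, "translate", tuple(delta))


def refine(layout, max_steps=20, eps=1e-12):
    """Greedy stand-in for the Planner: take the best action until none helps."""
    trajectory = []
    for _ in range(max_steps):
        current = penalty(layout)
        best = min(candidates(layout), key=lambda a: penalty(apply(layout, a)))
        improved = apply(layout, best)
        if penalty(improved) >= current - eps:
            break
        layout = improved
        trajectory.append(best)
    return layout, trajectory


if __name__ == "__main__":
    # A chair that floats 0.1 m above the floor and intersects a table.
    scene = {
        "table": Box(0.0, 0.0, 0.4, 1.0, 1.0, 0.8),
        "chair": Box(0.3, 0.0, 0.6, 0.5, 0.5, 1.0),
    }
    refined, actions = refine(scene)
    print(penalty(scene), penalty(refined))
    for a in actions:
        print(a)
```

Keeping the state immutable and returning the action trajectory alongside the final layout mirrors the iterative framing in the abstract: each intermediate layout is a valid state, so the same `apply` function could in principle serve user-driven scene editing as well as automatic refinement.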