AnyRecon: Arbitrary-View 3D Reconstruction with Video Diffusion Model

2026-04-21
Computer Vision and Pattern Recognition

AI summary

The authors address the challenge of creating 3D models from a few random photos, which is normally hard because existing methods don't handle many images well or keep the 3D shape consistent. They introduce AnyRecon, a system that can use any number of unordered pictures to build a 3D scene while maintaining accurate geometric details. Their approach stores all views in a special memory and uses a geometry-aware method to connect image generation and reconstruction, improving results for large or complex scenes. They also make the method faster and more scalable by optimizing the model's attention and diffusion steps. Tests show their system works well even with irregular inputs and big changes in camera angles.

Sparse-view 3D reconstruction · Diffusion models · Geometric consistency · Scene memory · Capture view cache · Geometry-aware conditioning · Diffusion distillation · Sparse attention · 3D geometric memory · Large viewpoint gaps
Authors
Yutian Chen, Shi Guo, Renbiao Jin, Tianshuo Yang, Xin Cai, Yawen Luo, Mingxin Yang, Mulin Yu, Linning Xu, Tianfan Xue
Abstract
Sparse-view 3D reconstruction is essential for modeling scenes from casual captures, but remains challenging for non-generative reconstruction methods. Existing diffusion-based approaches mitigate this issue by synthesizing novel views, but they often condition on only one or two capture frames, which restricts geometric consistency and limits scalability to large or diverse scenes. We propose AnyRecon, a scalable framework for reconstruction from arbitrary and unordered sparse inputs that preserves explicit geometric control while supporting flexible conditioning cardinality. To support long-range conditioning, our method constructs a persistent global scene memory via a prepended capture view cache, and removes temporal compression to maintain frame-level correspondence under large viewpoint changes. Beyond improving the generative model, we find that the interplay between generation and reconstruction is crucial for large-scale 3D scenes. We therefore introduce a geometry-aware conditioning strategy that couples generation and reconstruction through an explicit 3D geometric memory and geometry-driven capture-view retrieval. For efficiency, we combine 4-step diffusion distillation with context-window sparse attention to reduce the quadratic attention cost. Extensive experiments demonstrate robust and scalable reconstruction across irregular inputs, large viewpoint gaps, and long trajectories.
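The abstract does not specify how the context-window sparse attention is implemented, but the general idea of restricting each query to a local window of keys, and thereby replacing the quadratic attention cost with a cost linear in sequence length, can be sketched as follows. This is a generic NumPy illustration, not the authors' code; the function name `windowed_attention` and the symmetric window shape are assumptions for illustration only.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def windowed_attention(q, k, v, window):
    """Attention where query i only attends to keys j with |i - j| <= window.

    Cost is O(n * window * d) instead of the O(n^2 * d) of dense attention.
    q, k, v: (n, d) arrays; window: non-negative int. Returns an (n, d) array.
    (Illustrative sketch, not the AnyRecon implementation.)
    """
    n, d = q.shape
    out = np.zeros_like(v)
    for i in range(n):
        lo, hi = max(0, i - window), min(n, i + window + 1)
        scores = q[i] @ k[lo:hi].T / np.sqrt(d)  # scores over the local window
        out[i] = softmax(scores) @ v[lo:hi]      # weighted sum of local values
    return out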