Transformer-Based Inpainting for Real-Time 3D Streaming in Sparse Multi-Camera Setups
2026-03-05 • Computer Vision and Pattern Recognition; Graphics
AI summary
The authors address the problem of missing textures when streaming 3D scenes from multiple cameras in real-time, which can cause holes or incomplete surfaces in the images. They propose a new method that fills in these missing parts after the images are rendered, using a special type of AI model that looks at multiple views and time frames to keep things consistent and detailed. Their approach can work with any camera setup and is fast enough for real-time use. They tested it against other methods and found it gives a better balance of quality and speed.
3D streaming, multi-camera systems, texture inpainting, novel view rendering, transformer networks, spatio-temporal embeddings, real-time processing, hole filling, AR/VR applications
Authors
Leif Van Holland, Domenic Zingsheim, Mana Takhsha, Hannah Dröge, Patrick Stotko, Markus Plack, Reinhard Klein
Abstract
High-quality 3D streaming from multiple cameras is crucial for immersive experiences in many AR/VR applications. The limited number of views, often due to real-time constraints, leads to missing information and incomplete surfaces in the rendered images. Existing approaches typically rely on simple heuristics for hole filling, which can result in inconsistencies or visual artifacts. We propose to complete the missing textures with a novel, application-targeted inpainting method that operates as an image-based post-processing step after novel view rendering and is independent of the underlying scene representation. The method is designed as a standalone module compatible with any calibrated multi-camera system. To this end, we introduce a multi-view-aware, transformer-based network architecture that uses spatio-temporal embeddings to ensure consistency across frames while preserving fine details. Additionally, our resolution-independent design allows adaptation to different camera setups, while an adaptive patch selection strategy balances inference speed and quality, enabling real-time performance. We evaluate our approach against state-of-the-art inpainting techniques under the same real-time constraints and demonstrate that our model achieves the best trade-off between quality and speed, outperforming competitors on both image- and video-based metrics.
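The abstract mentions an adaptive patch selection strategy that trades off inference speed against quality. The paper's exact algorithm is not given here, so the following is only a minimal illustrative sketch of the general idea: tile the rendered frame's hole mask into patches, run the (expensive) inpainting network only on patches that actually contain missing pixels, and cap the number of patches per frame to bound latency. All names and parameters (`select_hole_patches`, `patch`, `max_patches`) are hypothetical.

```python
import numpy as np

def select_hole_patches(hole_mask, patch=32, max_patches=64):
    """Illustrative adaptive patch selection (an assumption, not the
    authors' method): return top-left corners of patches that contain
    holes, prioritized by hole area, capped at a per-frame budget."""
    H, W = hole_mask.shape
    candidates = []
    for y in range(0, H, patch):
        for x in range(0, W, patch):
            tile = hole_mask[y:y + patch, x:x + patch]
            n_missing = int(tile.sum())
            if n_missing > 0:
                candidates.append((y, x, n_missing))
    # Largest holes first, so the budget is spent where it matters most.
    candidates.sort(key=lambda c: -c[2])
    return [(y, x) for y, x, _ in candidates[:max_patches]]

# Example: a 128x128 frame with one 40x40 hole straddling four patches.
mask = np.zeros((128, 128), dtype=bool)
mask[50:90, 50:90] = True
patches = select_hole_patches(mask)
```

Only the selected patches would then be batched through the inpainting transformer; hole-free regions bypass the network entirely, which is what makes the real-time budget tractable for sparse camera setups.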