RealityBridge: Bridging Editable 3D Gaussian Splatting Driving Simulations and Real-World Videos

2026-06-15 • Computer Vision and Pattern Recognition

Computer Vision and Pattern Recognition, Artificial Intelligence

AI summary

The authors address the challenge of making simulated driving videos realistic enough for safety testing of self-driving cars. They focus on videos rendered with editable 3D Gaussian Splatting (3DGS), which after editing still show rendering artifacts, degraded foreground assets, inconsistent illumination, and temporal flickering. Their framework, RealityBridge, conditions on rendered videos, foreground masks, edge maps, and semantic masks, and uses a lightweight GateNet to allocate these conditions across backbone layers while preserving the scene's structure. They also construct targeted training data and combine autoregressive long-video training with reward-guided post-training to improve restoration quality and temporal stability and to suppress hallucinations. Experiments on internal and public driving datasets show that the method outperforms existing ones in artifact removal, illumination harmonization, and long-sequence temporal consistency.

3D Gaussian Splatting, Sim-to-Real gap, autonomous driving simulation, temporal consistency, rendering artifacts, foreground masks, semantic masks, autoregressive training, illumination harmonization, video restoration

Authors

Zhenhua Wu, Yun Pang, Mingkun Chang, Yuwei Ning, Liangzhi Wang, Yi Xiao, Guanbin Li

Abstract

Long-tail hazardous scenarios are essential for safety-oriented autonomous driving, yet they are difficult to collect and reproduce at scale. Editable 3D Gaussian Splatting (3DGS) simulation offers a promising alternative by reconstructing real driving scenes and supporting controllable scene editing. However, edited 3DGS-rendered videos still suffer from a significant Sim-to-Real gap, including rendering artifacts, degraded foreground assets, inconsistent illumination, and temporal flickering. Existing restoration and video generation methods are insufficient for this task, as they often fail to jointly repair 3DGS-specific artifacts, improve visual realism, and ensure temporal consistency. To fill this gap, we propose RealityBridge, a structure-preserving and asset-aware Sim-to-Real framework for edited 3DGS driving videos. RealityBridge uses multimodal controls, including rendered videos, foreground masks, edge maps, and semantic masks, together with a lightweight GateNet for adaptive condition allocation across backbone layers. We further construct targeted training data and introduce autoregressive long-video training with reward-guided post-training to improve restoration quality, temporal stability, and hallucination suppression. Extensive experiments on internal and public driving datasets show that RealityBridge outperforms existing methods in artifact removal, illumination harmonization, and long-sequence temporal consistency.
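
The abstract does not describe GateNet's architecture, so the following is only an illustrative sketch of what "adaptive condition allocation across backbone layers" could look like in its simplest form: each layer gets a softmax weight over the four control signals and mixes their features accordingly. The function names (`gate_weights`, `mix_conditions`), the logit values, and the toy feature vectors are all hypothetical and are not taken from the paper; in a real system the logits would be predicted by a learned network.

```python
import math

# The four control signals named in the abstract.
CONDITIONS = ["rendered_video", "foreground_mask", "edge_map", "semantic_mask"]

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def gate_weights(layer_logits):
    """Turn per-layer, per-condition logits into weights that sum to 1 per layer."""
    return [softmax(row) for row in layer_logits]

def mix_conditions(weights, features):
    """Weighted sum of per-condition feature vectors for a single layer."""
    dim = len(features[0])
    return [sum(w * f[i] for w, f in zip(weights, features)) for i in range(dim)]

# Hypothetical logits for a 3-layer backbone (assumed values, not from the paper).
layer_logits = [
    [2.0, 0.0, 0.0, 0.0],  # layer 0 leans on the rendered video
    [0.0, 0.0, 0.0, 0.0],  # layer 1 weights all conditions equally
    [0.0, 0.0, 1.0, 1.0],  # layer 2 leans on edge and semantic cues
]
weights = gate_weights(layer_logits)

# Toy 2-dimensional features, one vector per condition, in CONDITIONS order.
features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]
mixed = mix_conditions(weights[1], features)
```

In this toy setup the layer with equal logits averages the four condition features, while the other layers shift weight toward the conditions with larger logits. How RealityBridge actually computes and injects these weights is specified only in the full paper.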
