EPS3D: End-to-End Feed-Forward 3D Panoptic Segmentation
2026-06-08 • Computer Vision and Pattern Recognition
AI summary
The authors present EPS3D, a method that labels and identifies objects in 3D scenes from multi-view images in a single feed-forward pass, without extra preprocessing. Its training approach helps the model capture both what things are (semantics) and which parts belong together (instances), and lets these two kinds of information improve each other. This leads to more accurate and consistent 3D scene understanding. EPS3D is more accurate and faster than previous methods on two benchmarks, which is useful for applications such as robotic manipulation and 3D scene editing.
3D panoptic segmentation, open-vocabulary, end-to-end framework, multi-view images, distillation training, semantic features, instance features, mutual enhancement module, mIoU, 3D scene understanding
Authors
Runsong Zhu, Jiaxin Guo, Xiaoyang Guo, Zhengzhe Liu, Ka-Hei Hui, Wei Yin, Kai Chen, Wei Chen, Weiqiang Ren, Yunhui Liu, Pheng-Ann Heng, Chi-Wing Fu
Abstract
This paper introduces EPS3D, a new end-to-end feed-forward framework for open-vocabulary 3D panoptic segmentation. Unlike existing methods that rely on additional preprocessing, EPS3D uses an end-to-end architecture, trained with a distillation-based strategy on diverse 3D scenes, to predict 3D-aware semantic and instance features directly from multi-view images, which improves 3D consistency and avoids error accumulation. We further propose a mutual enhancement module that enforces the inherent consistency between semantics and instances: it aligns semantics within each instance (Ins2Sem) and refines instance features with semantic guidance (Sem2Ins), yielding more coherent 3D scene understanding. EPS3D outperforms state-of-the-art baselines on two benchmarks (e.g., +13% mIoU for semantics on Replica) while remaining efficient (e.g., 1 s per scene), supporting tasks such as robotic manipulation and 3D scene editing.
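To make the Ins2Sem idea and the mIoU metric more concrete, the toy sketch below pools per-point semantic scores within each instance and then measures mean intersection-over-union. It is not the authors' implementation: the abstract does not say how the alignment is computed, so mean pooling is an assumption, and the function names `align_semantics_within_instances` and `mean_iou`, as well as the five-point example data, are hypothetical.

```python
import numpy as np


def align_semantics_within_instances(sem_logits, instance_ids):
    """Replace each point's class scores with the mean over its instance.

    Assumption: mean pooling stands in for "aligning semantics within
    instances"; the paper's actual Ins2Sem operation may differ.
    sem_logits: (N, C) per-point class scores.
    instance_ids: (N,) integer instance label per point.
    """
    sem_logits = np.asarray(sem_logits, dtype=float)
    instance_ids = np.asarray(instance_ids)
    # Map arbitrary instance ids to 0..K-1 so they can index an accumulator.
    _, inverse = np.unique(instance_ids, return_inverse=True)
    inverse = inverse.reshape(-1)
    num_instances = inverse.max() + 1
    # Sum the scores of all points that share an instance, then average.
    sums = np.zeros((num_instances, sem_logits.shape[1]))
    np.add.at(sums, inverse, sem_logits)
    counts = np.bincount(inverse, minlength=num_instances)[:, None]
    # Broadcast each instance's mean scores back to its points.
    return (sums / counts)[inverse]


def mean_iou(pred, target, num_classes):
    """Mean IoU over the classes that occur in pred or target."""
    pred = np.asarray(pred)
    target = np.asarray(target)
    ious = []
    for c in range(num_classes):
        union = np.logical_or(pred == c, target == c).sum()
        if union == 0:
            continue  # class absent from both; skip it
        inter = np.logical_and(pred == c, target == c).sum()
        ious.append(inter / union)
    return float(np.mean(ious))


# Hypothetical toy data: five points, two classes, two instances.
sem_logits = np.array([[2.0, 1.0], [1.8, 1.2], [0.9, 1.0], [0.2, 1.5], [0.1, 1.9]])
instance_ids = np.array([7, 7, 7, 3, 3])
target = np.array([0, 0, 0, 1, 1])

raw_pred = sem_logits.argmax(axis=1)
aligned_pred = align_semantics_within_instances(sem_logits, instance_ids).argmax(axis=1)
raw_miou = mean_iou(raw_pred, target, num_classes=2)
aligned_miou = mean_iou(aligned_pred, target, num_classes=2)
```

In this toy case the third point is misclassified on its own but corrected once its scores are averaged with the other points of instance 7, so `aligned_miou` is higher than `raw_miou`. The reverse direction (Sem2Ins) is not sketched here because the abstract gives no detail on how semantic guidance refines the instance features.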