Few-Shot Incremental 3D Object Detection in Dynamic Indoor Environments

2026-04-09 • Computer Vision and Pattern Recognition

AI summary

The authors present FI3Det, an approach for detecting 3D objects in indoor scenes when only a few labeled examples of each new object category are available. During base training, they use vision-language models to find unlabeled, unknown objects and describe them with both 2D semantic features and class-agnostic 3D bounding boxes. FI3Det also re-weights noisy point- and box-level features and fuses 2D-based and 3D-based classification scores through a gating mechanism. The method was evaluated on two 3D datasets, ScanNet V2 and SUN RGB-D, where it improved on baseline methods for this problem.

Incremental 3D detection, Few-shot learning, Vision-language models, 3D object perception, 3D bounding boxes, Semantic features, Multimodal learning, Prototype imprinting, ScanNet V2, SUN RGB-D

Authors

Yun Zhu, Jianjun Qian, Jian Yang, Jin Xie, Na Zhao

Abstract

Incremental 3D object perception is a critical step toward embodied intelligence in dynamic indoor environments. However, existing incremental 3D detection methods rely on extensive annotations of novel classes for satisfactory performance. To address this limitation, we propose FI3Det, a Few-shot Incremental 3D Detection framework that enables efficient 3D perception with only a few novel samples by leveraging vision-language models (VLMs) to learn knowledge of unseen categories. FI3Det introduces a VLM-guided unknown object learning module in the base stage to enhance perception of unseen categories. Specifically, it employs VLMs to mine unknown objects and extract comprehensive representations, including 2D semantic features and class-agnostic 3D bounding boxes. To mitigate noise in these representations, a weighting mechanism re-weights the contributions of point- and box-level features based on their spatial locations and feature consistency within each box. Moreover, FI3Det includes a gated multimodal prototype imprinting module, in which category prototypes are constructed from aligned 2D semantic and 3D geometric features to compute classification scores, which are then fused via a multimodal gating mechanism for novel object detection. Because FI3Det is the first framework for few-shot incremental 3D object detection, we establish both batch and sequential evaluation settings on two datasets, ScanNet V2 and SUN RGB-D, where FI3Det achieves strong and consistent improvements over baseline methods. Code is available at https://github.com/zyrant/FI3Det.
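
The gated multimodal prototype imprinting step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract gives no formulas, so the sketch assumes prototypes are normalized means of few-shot features, scores are cosine similarities, and the gate is a fixed scalar (the actual gate is presumably computed by the model). All function names (`imprint_prototypes`, `cosine_scores`, `gated_fusion`) and the toy feature values are hypothetical.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def imprint_prototypes(features, labels):
    """Assumed imprinting rule: sum the normalized few-shot features
    of each class, then renormalize to get one prototype per class."""
    sums = {}
    for feat, label in zip(features, labels):
        feat = l2_normalize(feat)
        acc = sums.setdefault(label, [0.0] * len(feat))
        for i, x in enumerate(feat):
            acc[i] += x
    return {label: l2_normalize(acc) for label, acc in sums.items()}


def cosine_scores(query, prototypes):
    """Score a query feature against every class prototype."""
    q = l2_normalize(query)
    return {
        label: sum(a * b for a, b in zip(q, proto))
        for label, proto in prototypes.items()
    }


def gated_fusion(scores_2d, scores_3d, gate):
    """Assumed fusion rule: a convex combination of the two score sets.
    `gate` in [0, 1] is a hypothetical fixed scalar, not a learned gate."""
    return {
        label: gate * scores_2d[label] + (1.0 - gate) * scores_3d[label]
        for label in scores_2d
    }


# Toy few-shot support set: one example per novel class in each modality.
protos_2d = imprint_prototypes([[1.0, 0.0], [0.0, 1.0]], ["chair", "lamp"])
protos_3d = imprint_prototypes([[0.9, 0.1], [0.2, 0.8]], ["chair", "lamp"])

# Score one detected box using its 2D semantic and 3D geometric features.
s2d = cosine_scores([0.8, 0.2], protos_2d)
s3d = cosine_scores([0.7, 0.3], protos_3d)
fused = gated_fusion(s2d, s3d, gate=0.6)
best = max(fused, key=fused.get)
```

In this toy setup the query features lie closer to the "chair" prototypes in both modalities, so the fused score ranks "chair" first. A useful property of prototype imprinting in general is that novel classes are added by storing new prototypes and need no retraining of a classifier head, which suits the few-shot incremental setting.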
