Think as Needed: Geometry-Driven Adaptive Perception for Autonomous Driving
2026-05-11 • Computer Vision and Pattern Recognition • Artificial Intelligence
AI summary
The authors address the problem that current 3D detection models for autonomous driving waste computing power on easy scenes and struggle with complex ones. They introduce Enhanced HOPE, which adaptively decides how much processing a LiDAR frame needs based on scene complexity, without needing manual labels. Their system also groups nearby objects to speed up interaction modeling and remembers objects for several seconds even if they become hidden. Experiments show the approach cuts latency on simple scenes, improves accuracy on rare cases, and tracks objects through long occlusions.
3D detection • LiDAR • Transformer • autonomous driving • adaptive computation • object tracking • occlusion • mean Average Precision • scene complexity • temporal memory
Authors
Donghyun Kim, Jaehyoung Park
Abstract
Autonomous driving scenes range from empty highways to dense intersections with dozens of interacting road users, yet current 3D detection models apply a fixed computation budget to every frame, wasting resources on simple scenes while lacking capacity for complex ones. Existing approaches compound this problem: Transformer-based interaction models scale quadratically with the number of detected objects, and frame-by-frame processing causes the system to forget objects the moment they become occluded. We propose Enhanced HOPE, an adaptive perception architecture that measures the geometric complexity of each incoming LiDAR frame using an unsupervised statistical estimator and routes it through a shallow or deep processing path accordingly, requiring no manual scene labels. To keep interaction modeling efficient, we replace quadratic pairwise attention with a linear-time subspace-based network that groups nearby objects into clusters and processes them jointly. The computational savings from these two mechanisms free up resources for a persistent temporal memory module that retains previously detected objects and traffic rules across frames, enabling the system to recall occluded objects seconds after they disappear from view. On the nuScenes and CARLA benchmarks, Enhanced HOPE reduces latency by 38% on simple scenes with no accuracy loss, improves mean Average Precision by 2.7 points on rare long-tail scenarios, and tracks objects through occlusions lasting over 5 seconds, where all tested baselines fail.
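The three mechanisms named in the abstract (complexity-based routing, cluster-level interaction, persistent memory) lend themselves to compact illustration. The Python sketches below are minimal approximations under assumed design choices, not the authors' implementation: the voxel-entropy statistic, grid-hash clustering, greedy nearest-neighbor matching, and every threshold are placeholders standing in for whatever Enhanced HOPE actually uses.

One possible shape for the unsupervised complexity estimator and the shallow/deep routing; `scene_complexity`, `shallow_model`, `deep_model`, `voxel_size`, and `threshold` are all hypothetical names and values:

```python
import numpy as np

def scene_complexity(points: np.ndarray, voxel_size: float = 2.0) -> float:
    """Unsupervised complexity score: entropy of the voxel-occupancy
    histogram over the ground plane. Cluttered intersections spread
    points across many cells; empty highways concentrate them."""
    cells = np.floor(points[:, :2] / voxel_size).astype(np.int64)
    _, counts = np.unique(cells, axis=0, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log(p)).sum())

def route(points, shallow_model, deep_model, threshold=4.0):
    """Send geometrically simple frames through the cheap path,
    complex ones through the full-capacity path."""
    if scene_complexity(points) < threshold:
        return shallow_model(points)  # fast path: sparse scene
    return deep_model(points)         # deep path: dense scene
```

For the linear-time interaction module, one way to avoid quadratic pairwise attention is to hash detections into spatial cells and mix features only within each cell; mean pooling stands in here for the paper's subspace-based joint processing, and the cell radius is an illustrative choice:

```python
def cluster_interaction(features: np.ndarray, centers: np.ndarray,
                        radius: float = 10.0) -> np.ndarray:
    """Group nearby objects by grid cell and add shared cluster context
    to each member: cost is linear in the number of detections."""
    keys = np.floor(centers[:, :2] / radius).astype(np.int64)
    _, labels = np.unique(keys, axis=0, return_inverse=True)
    out = features.copy()
    for c in np.unique(labels):
        idx = labels == c
        out[idx] += features[idx].mean(axis=0)  # intra-cluster mixing
    return out
```

Finally, a toy persistent memory: tracks that stop matching detections are retained for a time-to-live window rather than dropped, so an object occluded for a few seconds can be re-associated when it reappears. The 5 s TTL and greedy nearest-neighbor matching are assumptions for illustration:

```python
class PersistentMemory:
    def __init__(self, ttl: float = 5.0, match_dist: float = 3.0):
        self.ttl, self.match_dist = ttl, match_dist
        self.tracks = {}   # track_id -> (last_position, last_seen_time)
        self.next_id = 0

    def update(self, detections: np.ndarray, t: float) -> list:
        """Greedily match detections to stored tracks; unmatched tracks
        survive until their last sighting is older than the TTL."""
        assigned = []
        for pos in detections:
            best, best_d = None, self.match_dist
            for tid, (p, _) in self.tracks.items():
                d = float(np.linalg.norm(pos - p))
                if d < best_d and tid not in assigned:
                    best, best_d = tid, d
            if best is None:
                best, self.next_id = self.next_id, self.next_id + 1
            self.tracks[best] = (pos, t)
            assigned.append(best)
        # prune only tracks unseen for longer than the TTL window
        self.tracks = {tid: v for tid, v in self.tracks.items()
                       if t - v[1] <= self.ttl}
        return assigned
```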