PillarDETR: YOLO-Backbone and RT-DETR Head for Real-Time 3D Object Detection

2026-06-01 • Computer Vision and Pattern Recognition

AI summary

The authors present PillarDETR, a system for fast 3D object detection from LiDAR data, aimed at self-driving cars and robots. Instead of relying on slow 3D convolutions, it converts point clouds into pillar-based pseudoimages and extracts features with a YOLOv8-derived Cross Stage Partial (CSP) backbone. An RT-DETR transformer decoder then predicts 3D boxes directly, without a non-maximum suppression (NMS) step. Experiments on KITTI and nuScenes indicate that PillarDETR offers a better balance of accuracy and speed than the PointPillars baseline.
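To make the pillar-to-pseudoimage step concrete, here is a minimal sketch in plain Python. It is not the authors' implementation: the grid ranges, the 1 m pillar size, the function name `pillarize`, and the two hand-crafted channels (point count and maximum height) are illustrative assumptions. A PointPillars-style encoder would instead learn per-pillar features with a small network.

```python
from typing import List, Tuple

Point = Tuple[float, float, float]  # (x, y, z) in metres


def pillarize(
    points: List[Point],
    x_range: Tuple[float, float] = (0.0, 4.0),   # assumed grid extent
    y_range: Tuple[float, float] = (-2.0, 2.0),  # assumed grid extent
    pillar_size: float = 1.0,                    # assumed pillar edge length
) -> List[List[List[float]]]:
    """Scatter points into a bird's-eye-view grid of shape [2][H][W].

    Channel 0: number of points in each pillar.
    Channel 1: maximum z in each pillar (0.0 for empty pillars).
    These hand-crafted channels stand in for learned pillar features.
    """
    width = int((x_range[1] - x_range[0]) / pillar_size)
    height = int((y_range[1] - y_range[0]) / pillar_size)
    count = [[0.0] * width for _ in range(height)]
    max_z = [[0.0] * width for _ in range(height)]

    for x, y, z in points:
        # Drop points that fall outside the grid.
        if not (x_range[0] <= x < x_range[1] and y_range[0] <= y < y_range[1]):
            continue
        # Map metric coordinates to pillar indices; clamp against float rounding.
        col = min(int((x - x_range[0]) / pillar_size), width - 1)
        row = min(int((y - y_range[0]) / pillar_size), height - 1)
        if count[row][col] == 0.0 or z > max_z[row][col]:
            max_z[row][col] = z
        count[row][col] += 1.0

    return [count, max_z]


pseudoimage = pillarize(
    [(0.5, -1.5, 0.2), (0.7, -1.2, 1.0), (3.9, 1.9, -0.5), (10.0, 0.0, 0.0)]
)
```

The result is a dense 2D array indexed by channel, row, and column, which is what allows a 2D image backbone to process LiDAR data without 3D convolutions.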

LiDAR, 3D object detection, Point clouds, Pillar encoding, YOLOv8, Cross Stage Partial (CSP), Transformer decoder, Non-Maximum Suppression (NMS), KITTI dataset, nuScenes dataset

Authors

Smit Kadvani, Shriya Gumber, Kriti Faujdar, Harsh Dave

Abstract

Real-time 3D object detection is a critical component for the safe operation of autonomous driving systems and robotics. While LiDAR point clouds provide accurate spatial information, processing them efficiently remains a significant challenge. Traditional methods rely on complex 3D convolutions or anchor-based paradigms that struggle to balance detection accuracy with inference speed. In this paper, we propose PillarDETR, a novel end-to-end 3D object detection architecture that combines the efficiency of pillar-based LiDAR encoding with the representational power of modern 2D vision models. Specifically, PillarDETR replaces standard convolutional backbones with a Cross Stage Partial (CSP) network derived from YOLOv8, enabling richer feature extraction from pseudoimages. Furthermore, we discard conventional anchor-based or center-based detection heads in favor of a Real-Time Detection Transformer (RT-DETR) decoder. This hybrid design allows the network to capture global context and directly predict 3D bounding boxes without relying on non-maximum suppression (NMS). Extensive experiments on the KITTI and nuScenes benchmarks demonstrate that PillarDETR achieves a compelling trade-off between mean Average Precision (mAP) and inference latency. Our ablation studies confirm that integrating the YOLOv8 backbone and RT-DETR head yields substantial improvements over the PointPillars baseline, establishing PillarDETR as a highly effective solution for real-time 3D perception.
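The NMS-free prediction described in the abstract can be illustrated with a short sketch of query-based decoding. This is a simplified assumption about how a DETR-style head is typically post-processed, not the authors' code: the function name `decode_queries`, the `(x, y, z, l, w, h, yaw)` box layout, the score threshold, and `top_k` are all hypothetical. Each decoder query emits one box and one set of class scores, and the final detections are simply the highest-scoring queries, with no overlap-based suppression.

```python
from typing import List, NamedTuple, Sequence


class Detection(NamedTuple):
    query_index: int
    class_index: int
    score: float
    box: Sequence[float]  # assumed layout: (x, y, z, l, w, h, yaw)


def decode_queries(
    class_scores: List[Sequence[float]],  # one score vector per query
    boxes: List[Sequence[float]],         # one box per query
    top_k: int = 2,                       # hypothetical output budget
    score_threshold: float = 0.3,         # hypothetical confidence cut-off
) -> List[Detection]:
    """Keep the top-k most confident queries; no IoU-based suppression."""
    candidates = []
    for q, (scores, box) in enumerate(zip(class_scores, boxes)):
        # Each query votes for its single most likely class.
        best_class = max(range(len(scores)), key=lambda c: scores[c])
        if scores[best_class] >= score_threshold:
            candidates.append(Detection(q, best_class, scores[best_class], box))
    # Rank by confidence only; overlapping boxes are never compared.
    candidates.sort(key=lambda d: d.score, reverse=True)
    return candidates[:top_k]


detections = decode_queries(
    class_scores=[[0.9, 0.05], [0.1, 0.2], [0.3, 0.6]],
    boxes=[
        (10.0, 2.0, -1.0, 4.2, 1.8, 1.5, 0.0),
        (25.0, -3.0, -1.0, 4.0, 1.7, 1.5, 1.57),
        (5.0, 1.0, -0.8, 0.8, 0.6, 1.7, 0.3),
    ],
)
```

In DETR-style training, one-to-one matching between queries and ground-truth boxes discourages duplicate predictions, which is why confidence ranking alone can replace NMS at inference time.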
