RepSAM: Bridging Foundation Models to Robotic Vision via Representation-Guided Adaptation

2026-05-25 · Robotics

Robotics, Computer Vision and Pattern Recognition
AI summary

The authors found that different layers of a large vision model behave differently when used for robot vision: the representations in early (shallow) layers shift the most in new environments, while deeper layers transfer well. They created RepSAM, a method that uses this insight to fine-tune only a small fraction of the model's parameters, saving substantial training time and compute. Their tests show that RepSAM performs almost as well as full fine-tuning, trains far faster, and outperforms other parameter-efficient methods. It also improves success rates in real robot tasks such as manipulating objects. Overall, the approach helps robots see better in difficult settings, such as transparent objects and cluttered scenes, without lengthy training.

robotic perception, foundation models, SAM (Segment Anything Model), representation shift, CKA (Centered Kernel Alignment), parameter-efficient fine-tuning, RepSAM, multi-modal fusion, mIoU (mean Intersection over Union), robotic manipulation
Authors
Wenhui Chu
Abstract
Robotic perception in unstructured environments remains challenging despite the zero-shot capabilities of foundation models such as SAM. This work attributes performance degradation to non-uniform representation shifts across transformer layers: shallow layers exhibit substantial domain gaps (CKA < 0.5), whereas deep layers transfer effectively (CKA > 0.7). Based on this observation, we propose RepSAM, a representation-guided parameter-efficient fine-tuning (PEFT) framework for adapting foundation models to robotic vision. RepSAM employs a theoretically grounded CKA-guided rank allocation strategy combined with a multi-modal fusion module for robust handling of challenging robotic scenarios, including transparent objects and cluttered scenes. Experimental evaluation across six benchmarks and robotic manipulation tasks demonstrates that RepSAM achieves 97.9% of full fine-tuning performance (89.0% vs. 90.9% mIoU) while reducing trainable parameters by 158x (from 632M to 4.0M). RepSAM outperforms DoRA by 7.9% mIoU with just 4 hours of training on a single A100 GPU (a 96x reduction from full fine-tuning, which takes 384 GPU-hours). These improvements are statistically significant (p < 0.01) and translate to a 12.0% absolute improvement in robotic manipulation success rates over the LoRA (RGB) baseline.
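
To make the layer-wise CKA comparison more concrete, here is a minimal sketch in Python. It rests on two assumptions: the abstract does not say which CKA variant is used, so the sketch uses linear CKA, and the abstract does not give the rank allocation rule, so `allocate_ranks` is a hypothetical heuristic that gives more low-rank capacity to layers with lower CKA (larger domain gap). Neither function is the paper's implementation, and the example scores are made up for illustration.

```python
import numpy as np


def linear_cka(x, y):
    """Linear CKA between two activation matrices of shape (n_samples, n_features).

    Assumption: the linear-kernel variant; the paper may use a different one.
    """
    # Center each feature column so the score ignores constant offsets.
    x = x - x.mean(axis=0, keepdims=True)
    y = y - y.mean(axis=0, keepdims=True)
    cross = np.linalg.norm(y.T @ x, ord="fro") ** 2
    norm_x = np.linalg.norm(x.T @ x, ord="fro")
    norm_y = np.linalg.norm(y.T @ y, ord="fro")
    return float(cross / (norm_x * norm_y))


def allocate_ranks(cka_scores, total_rank, min_rank=1):
    """Hypothetical heuristic: split a rank budget in proportion to (1 - CKA).

    Every layer gets at least `min_rank`; the result always sums to `total_rank`.
    """
    gap = np.clip(1.0 - np.asarray(cka_scores, dtype=float), 0.0, None)
    n = len(gap)
    spare = total_rank - min_rank * n
    if spare < 0:
        raise ValueError("total_rank is too small for the requested min_rank")
    if gap.sum() == 0:
        share = np.full(n, spare / n)
    else:
        share = spare * gap / gap.sum()
    ranks = np.floor(share).astype(int)
    # Hand out the rounding leftover to the largest fractional parts first.
    leftover = int(spare - ranks.sum())
    order = np.argsort(-(share - ranks))
    ranks[order[:leftover]] += 1
    return (ranks + min_rank).tolist()


if __name__ == "__main__":
    # Illustrative per-layer CKA scores (shallow to deep), not values from the paper.
    example_scores = [0.4, 0.6, 0.8, 0.9]
    print(allocate_ranks(example_scores, total_rank=32))
```

In practice, `x` and `y` would be activations of the same layer collected on source-domain and robot-domain images. Under this heuristic, a layer with CKA below 0.5 would receive a larger share of the budget than one above 0.7.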