Bidirectional Cross-Modal Prompting for Event-Frame Asymmetric Stereo

2026-04-16 · Computer Vision and Pattern Recognition

AI summary

The authors focus on combining two types of cameras: regular frame-based cameras and event cameras, which capture motion differently. They designed a method called Bi-CMPStereo that helps these two camera types work together better for 3D perception, especially in fast-moving or tricky lighting conditions. Their approach aligns features from both cameras into a shared space to improve matching accuracy. Tests showed their method works better than existing ones on this task.

frame-based cameras, event cameras, 3D perception, stereo matching, cross-modal, domain adaptation, semantic features, structural features, dynamic scenes, motion blur
Authors
Ninghui Xu, Fabio Tosi, Lihui Wang, Jiawei Han, Luca Bartolomei, Zhiting Yao, Matteo Poggi, Stefano Mattoccia
Abstract
Conventional frame-based cameras capture rich contextual information but suffer from limited temporal resolution and motion blur in dynamic scenes. Event cameras offer an alternative visual representation with higher dynamic range, free from such limitations. The complementary characteristics of the two modalities make event-frame asymmetric stereo promising for reliable 3D perception under fast motion and challenging illumination. However, the modality gap often leads to the marginalization of domain-specific cues essential for cross-modal stereo matching. In this paper, we introduce Bi-CMPStereo, a novel bidirectional cross-modal prompting framework that fully exploits semantic and structural features from both domains for robust matching. Our approach learns finely aligned stereo representations within a target canonical space and integrates complementary representations by projecting each modality into both the event and frame domains. Extensive experiments demonstrate that our approach significantly outperforms state-of-the-art methods in accuracy and generalization.
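The abstract's core idea, projecting each modality into a shared canonical space and then re-expressing the fused representation in both the event and frame domains, can be sketched very loosely as follows. This is a minimal NumPy illustration, not the paper's method: all dimensions, projection matrices, and the averaging fusion are hypothetical stand-ins for learned components.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical feature dimensions (not taken from the paper).
d_frame, d_event, d_shared = 8, 6, 4

# Random linear maps standing in for learned projections.
W_f2s = rng.standard_normal((d_shared, d_frame))  # frame  -> shared canonical space
W_e2s = rng.standard_normal((d_shared, d_event))  # event  -> shared canonical space
W_s2f = rng.standard_normal((d_frame, d_shared))  # shared -> frame domain
W_s2e = rng.standard_normal((d_event, d_shared))  # shared -> event domain

def bidirectional_project(f_frame, f_event):
    """Align both modalities in a shared space, then project the
    (here, naively averaged) representation back into each domain,
    so each domain receives complementary cues from the other."""
    s_frame = W_f2s @ f_frame
    s_event = W_e2s @ f_event
    fused = 0.5 * (s_frame + s_event)          # placeholder fusion
    return W_s2f @ fused, W_s2e @ fused        # frame-domain, event-domain outputs

f = rng.standard_normal(d_frame)
e = rng.standard_normal(d_event)
out_frame, out_event = bidirectional_project(f, e)
print(out_frame.shape, out_event.shape)  # (8,) (6,)
```

In the actual framework these projections would be learned jointly with the stereo matching objective; the sketch only shows the data flow implied by "projecting each modality into both event and frame domains."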