VoxCor: Training-Free Volumetric Features for Multimodal Voxel Correspondence

2026-05-13

Computer Vision and Pattern Recognition
AI summary

The authors present VoxCor, a method that creates 3D features from 2D Vision Transformer models without additional training. VoxCor uses information from three anatomical planes and finds stable directions in the data to make its features reliable across different imaging types and subjects. This allows new medical images to be analyzed quickly without extra adjustments, improving tasks like image registration and segmentation. Their tests show VoxCor performs well compared to other approaches and can be reused for various medical imaging analyses.

Keywords
3D medical image analysis, Vision Transformer (ViT), cross-modal imaging, voxelwise representation, triplanar inference, weighted partial least squares (WPLS), image registration, nearest-neighbor search, segmentation, deformable registration
Authors
Guney Tombak, Ertunc Erdil, Ender Konukoglu
Abstract
Cross-modal 3D medical image analysis requires voxelwise representations that remain anatomically consistent across imaging contrasts, scanners, and acquisition protocols. Recent work has shown that frozen 2D Vision Transformer (ViT) foundation models can support such representations, but typical pipelines extract features along a single anatomical axis and adapt those features inside a registration solver for one image pair at a time, leaving complementary viewing directions unused and producing representations that do not transfer to new volumes. We introduce VoxCor, a training-free fit–transform method that produces reusable volumetric feature representations from frozen 2D ViT foundation models. During an offline fitting phase, VoxCor combines triplanar ViT inference with a compact closed-form weighted partial least squares (WPLS) projection that uses fitting-time voxel correspondences to select modality-stable anatomical directions in the triplanar feature space. At transform time, new volumes are mapped by triplanar ViT inference and linear projection alone, without fine-tuning or registration. Voxel correspondences can then be queried directly by nearest-neighbor search. We evaluate VoxCor on intra-subject Abdomen MR–CT and inter-subject HCP T2w–T1w tasks using deformable registration, voxelwise k-nearest-neighbor segmentation, and segmentation-center landmark localization. VoxCor improves performance in the hardest cross-subject, cross-modality transfer settings, reduces sensitivity to the choice of encoder in dense correspondence transfer, and yields registration performance competitive with handcrafted descriptors and learned 3D features. This positions VoxCor as a reusable feature layer for downstream multimodal analysis beyond pairwise registration. Code, configuration files, and implementation details are publicly available on GitHub at https://github.com/guneytombak/VoxCor.
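To make the fit–transform pattern concrete, the sketch below illustrates the pipeline the abstract describes, assuming a standard cross-covariance SVD form of weighted PLS over paired triplanar features. The function names, the weighting scheme, and the cosine-similarity matching are illustrative assumptions for this sketch, not the authors' released implementation (see the GitHub repository above for the actual code).

```python
import numpy as np

def fit_wpls(Xa, Xb, w, k=64):
    """Fit a weighted PLS projection from paired voxel features (fit phase).

    Xa, Xb : (n, d) triplanar ViT features at corresponding voxels in
             modalities A and B (e.g. MR and CT), one row per voxel pair.
    w      : (n,) nonnegative fitting-time weights over the correspondences.
    k      : number of projection directions to keep.

    Returns (d, k) projection matrices for each modality whose top singular
    directions maximize weighted cross-covariance, i.e. directions that
    stay stable across the two modalities.
    """
    w = w / w.sum()
    # Weighted centering of both feature sets.
    Xa = Xa - (w[:, None] * Xa).sum(axis=0)
    Xb = Xb - (w[:, None] * Xb).sum(axis=0)
    # Weighted cross-covariance between the two modalities (closed form).
    C = Xa.T @ (w[:, None] * Xb)              # (d, d)
    U, _, Vt = np.linalg.svd(C, full_matrices=False)
    return U[:, :k], Vt[:k].T                  # projections for A and B

def transform(feats, P):
    """Map a new volume's triplanar features by linear projection alone
    (transform phase): no fine-tuning, no per-pair registration solve."""
    return feats @ P                            # (m, d) @ (d, k) -> (m, k)

def query_correspondences(Fa, Fb):
    """Nearest-neighbor voxel correspondence in the shared feature space.

    Brute-force cosine similarity for clarity; a real pipeline would use
    an approximate nearest-neighbor index for full volumes.
    """
    Fa = Fa / np.linalg.norm(Fa, axis=1, keepdims=True)
    Fb = Fb / np.linalg.norm(Fb, axis=1, keepdims=True)
    return np.argmax(Fa @ Fb.T, axis=1)         # index into Fb per row of Fa
```

In this reading, the fitting step reduces to a single SVD of a d-by-d weighted cross-covariance matrix, which is consistent with the abstract's description of the projection as compact and closed-form, and explains why new volumes can be transformed at negligible cost once the projection is fit.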