What DINO saw: ALiBi positional encoding reduces positional bias in Vision Transformers
2026-03-17 • Computer Vision and Pattern Recognition
AI summary
The authors studied how vision transformers (ViTs), especially versions like DINOv2, can sometimes focus too much on where things are in an image instead of what they are. This positional bias is a problem in material science images, which often show uniform textures without a preferred direction. They found that the bias occurs across various types of positional encoding and training objectives. To fix it, the authors fine-tuned the models with a different positional encoding method called ALiBi, which reduced the bias while keeping the models good at understanding image content. The resulting models then worked better for segmenting complex microscopy images.
Vision Transformers · DINOv2 · Positional Encoding · Positional Bias · ALiBi · Linear Probing · Zero-shot Adaptation · Material Science Imaging · Microscopy Image Segmentation
Authors
Moritz Pawlowsky, Antonis Vamvakeros, Alexander Weiss, Anja Bielefeld, Samuel J. Cooper, Ronan Docherty
Abstract
Vision transformers (ViTs), especially feature foundation models like DINOv2, learn rich representations useful for many downstream tasks. However, architectural choices (such as positional encoding) can lead these models to display positional biases and artefacts independent of semantic content. This makes zero-shot adaptation difficult in fields like material science, where images are often cross-sections of homogeneous microstructure (i.e. having no preferred direction). In this work, we investigate the positional bias in ViTs via linear probing, finding it present across a range of objectives and positional encodings, and subsequently reduce it by fine-tuning models to use ALiBi relative positional encoding. We demonstrate that these models retain desirable general semantics and that their unbiased features can be used successfully in trainable segmentation of complex microscopy images.
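For intuition, ALiBi replaces learned absolute position embeddings with a fixed, additive penalty on the attention logits that grows linearly with token distance, so attention depends only on relative offsets. The paper does not specify its 2D formulation here, so the sketch below is an illustrative NumPy implementation assuming Euclidean distance between patch-grid coordinates and the geometric per-head slopes from the original ALiBi scheme; the function name `alibi_bias_2d` and the `dist` parameter are hypothetical.

```python
import numpy as np

def alibi_bias_2d(grid_h, grid_w, n_heads, dist="euclidean"):
    """Illustrative ALiBi-style additive attention bias for a 2D patch grid.

    Each head h subtracts slope_h * distance(i, j) from the pre-softmax
    attention logits, so attention decays with inter-patch distance and is
    invariant to absolute patch position.
    """
    # Patch coordinates on the grid, flattened to (N, 2) with N = grid_h * grid_w.
    ys, xs = np.meshgrid(np.arange(grid_h), np.arange(grid_w), indexing="ij")
    coords = np.stack([ys.ravel(), xs.ravel()], axis=-1).astype(float)

    # Pairwise patch-to-patch distances; the 2D metric is an assumption here.
    diff = coords[:, None, :] - coords[None, :, :]  # (N, N, 2)
    if dist == "euclidean":
        d = np.sqrt((diff ** 2).sum(-1))            # (N, N)
    else:  # "manhattan"
        d = np.abs(diff).sum(-1)

    # Geometric head slopes as in the original ALiBi recipe: 2^(-8h / n_heads).
    slopes = 2.0 ** (-8.0 * np.arange(1, n_heads + 1) / n_heads)  # (H,)

    # Additive bias of shape (H, N, N); add to attention logits before softmax.
    return -slopes[:, None, None] * d[None, :, :]

bias = alibi_bias_2d(4, 4, n_heads=2)  # (2, 16, 16)
```

Because the bias depends only on relative offsets, translating a texture across the image leaves the attention pattern unchanged, which is the property the authors exploit for direction-free microstructure images.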