RATS! Patches Talk Through Registers: Emergent Parts in Register Attention Transformers

2026-06-12
Computer Vision and Pattern Recognition

AI summary

The authors created a model called RATS that tries to understand images like humans do, by breaking down objects into meaningful parts without being told what those parts are. RATS uses special tokens that focus on different image patches and communicate through a structured attention process, which helps the model learn parts on its own. Their experiments show that RATS is better than other methods at segmenting images into meaningful regions and recognizing parts across related objects. This approach could help computers learn more interpretable ways to represent visual information.

self-supervised learning, transformers, attention mechanism, image segmentation, classification token, semantic regions, object parts, benchmark datasets, ADE20K, COCO
Authors
Timing Yang, Predrag Neskovic, Jansen Seheult, Wenchao Han, Anand Bhattad, Alan Yuille, Feng Wang
Abstract
When humans see a bird, they recognize far more than just "bird" -- they see a head, wings, and talons, a structured assembly of reusable parts that can be identified across every bird they have ever seen. We ask whether a self-supervised visual model can discover the same compositional structure on its own. To this end, we propose RATS (Register Attention Transformers), which decomposes the classification token into N learnable register tokens that route patch information through an L->N->N->L bottleneck via a three-step compress-communicate-broadcast attention. The N registers are partitioned across the H attention heads, so that registers assigned to different heads do not interact with each other. Without auxiliary losses or part annotations, each register spontaneously specializes into a proto-semantic region whose emerging structure resembles object parts. RATS surpasses all baselines by +12 mIoU on average across five segmentation benchmarks, with consistent gains on ADE20K (+1.11 mIoU) and COCO (+0.2 AP^m). Its register dictionary further exhibits part-level consistency and semantic proximity across related categories. Our results suggest that RATS may provide a useful architectural prior for structured and interpretable visual representation learning.
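The compress-communicate-broadcast routing described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation; it is one plausible reading of the L->N->N->L bottleneck under several assumptions: L is taken to be the number of patch tokens, each head owns N/H registers and its own slice of D/H channels, and learned query/key/value projections, normalization, and residual connections are omitted. The names `register_attention`, `attend`, and `softmax` are hypothetical helpers introduced only for this example.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attend(q, k, v):
    # Scaled dot-product attention, batched over heads.
    # q: (H, A, d); k, v: (H, B, d) -> output (H, A, d), weights (H, A, B).
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(q.shape[-1])
    weights = softmax(scores, axis=-1)
    return weights @ v, weights

def register_attention(patches, registers, num_heads):
    # patches:   (L, D) patch tokens (assumption: L = number of patches).
    # registers: (N, D // num_heads); rows are grouped head by head, so
    #            register i belongs to head i // (N // num_heads).
    L, D = patches.shape
    N, d = registers.shape
    assert D % num_heads == 0 and d == D // num_heads
    assert N % num_heads == 0
    n = N // num_heads

    p = patches.reshape(L, num_heads, d).transpose(1, 0, 2)  # (H, L, d)
    r = registers.reshape(num_heads, n, d)                   # (H, n, d)

    # 1. Compress (L -> N): each register queries the patches of its head.
    compressed, _ = attend(r, p, p)
    # 2. Communicate (N -> N): registers attend to each other, but only
    #    within their own head, because the head axis is a batch axis.
    communicated, _ = attend(compressed, compressed, compressed)
    # 3. Broadcast (N -> L): patches query the updated registers.
    out, route = attend(p, communicated, communicated)       # route: (H, L, n)

    return out.transpose(1, 0, 2).reshape(L, D), route

# Example usage with random inputs (sizes are illustrative assumptions).
rng = np.random.default_rng(0)
patches = rng.normal(size=(16, 8))     # L = 16 patches, D = 8 channels
registers = rng.normal(size=(4, 4))    # N = 4 registers, H = 2 heads
out, route = register_attention(patches, registers, num_heads=2)
print(out.shape, route.shape)
```

In this sketch, `route[h, l]` is a distribution over the registers of head `h` for patch `l`, which is the kind of patch-to-register assignment one could visualize as a soft part map. Because the head axis is handled as a batch dimension in every step, registers assigned to different heads never exchange information, matching the partitioning stated in the abstract.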