Formalizing the Binding Problem
2026-06-02 • Computer Vision and Pattern Recognition
Computer Vision and Pattern Recognition, Artificial Intelligence, Machine Learning
AI summary
The authors study how deep learning models, specifically Vision Transformers (ViTs), understand which features belong to the same object in an image, a challenge known as the binding problem. They developed a way to measure this 'binding information' in the models' inner workings using information theory. Their experiments show that parts of the ViT architecture do store binding information, and this ability is important for better recognizing and reasoning about objects in complex scenes. This work helps clarify how these models handle scenes in which multiple objects share features or are partially hidden.
Vision Transformers, binding problem, feature binding, information theory, image recognition, deep learning, model representation, object recognition, patch tokens, occlusion
Authors
Lianghuan Huang, Yihao Li, Saeed Salehi, Yingshan Chang, Ansh Soni, Konrad P. Kording
Abstract
Representations of the world arguably contain information about features (e.g., something is blue, something is a circle) as well as information about which features are part of the same object (e.g., the circle is blue), which we call binding information. Any system that understands scenes with multiple objects must solve the binding problem: it needs to know which features belong together. However, although prior work shows that Vision Transformers (ViTs) know which patches belong together, it is not known whether current deep learning models learn binding information at the level of features. One might expect little binding information; after all, misattributing features to the wrong objects is a common failure of ViT-based architectures, especially in scenes where objects share features. Here we formalize the binding problem with an information-theoretic approach and introduce a probing method to measure binding information in model representations. We perform experiments on ViTs, measuring binding in different components of the architecture, such as the image summary token [CLS] and the spatial tokens. We use datasets with different binding challenges, such as feature sharing, occlusion, and natural features, and compare the performance of several pre-trained ViTs. Overall, our research demonstrates that binding is a key ingredient of strong visual recognition and reasoning.
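To make the notion of binding information concrete, the toy sketch below counts how much uncertainty about a scene remains once only its unbound features are known. It is an illustration under stated assumptions, not the authors' formalization or probing method: the two-color, two-shape world, the uniform distribution over scenes, and the helper names `all_scenes`, `feature_bag`, and `conditional_entropy_bits` are all hypothetical.

```python
import itertools
import math
from collections import defaultdict

# Assumed toy world: each object has one color and one shape.
COLORS = ("blue", "red")
SHAPES = ("circle", "square")


def all_scenes():
    """Every scene made of two distinct objects (order does not matter)."""
    objects = list(itertools.product(COLORS, SHAPES))
    return [frozenset(pair) for pair in itertools.combinations(objects, 2)]


def feature_bag(scene):
    """Unbound summary: which colors and shapes occur, not which go together."""
    colors = tuple(sorted(color for color, _ in scene))
    shapes = tuple(sorted(shape for _, shape in scene))
    return colors, shapes


def conditional_entropy_bits(scenes, summarize):
    """H(scene | summary) in bits, assuming scenes are uniformly distributed.

    Scenes with the same summary are indistinguishable, so each group of
    size k contributes p(group) * log2(k).
    """
    groups = defaultdict(list)
    for scene in scenes:
        groups[summarize(scene)].append(scene)
    total = len(scenes)
    return sum(len(g) / total * math.log2(len(g)) for g in groups.values())


scenes = all_scenes()
residual_bits = conditional_entropy_bits(scenes, feature_bag)
print(f"{len(scenes)} scenes, H(scene | feature bag) = {residual_bits:.3f} bits")
```

In this toy world, only the pair {blue circle, red square} versus {blue square, red circle} shares a feature bag, so the residual is one third of a bit: the part of the scene description that a representation can recover only if it encodes which features belong together.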