The Limits of Learning from Pictures and Text: Vision-Language Models and Embodied Scene Understanding

2026-03-27

Computer Vision and Pattern Recognition
AI summary

The authors tested whether computers could learn how humans understand scenes just by looking at lots of pictures paired with descriptions. They found that while these vision-language models (VLMs) did well on general knowledge tasks, they struggled to understand what actions or uses objects afford (affordances). This gap didn't improve with better models or more detailed prompts, suggesting the problem is not just about wording. They concluded that fully grasping how humans see and interact with the world might need real-life, embodied experiences that can't be captured just from photos and captions.

vision-language models, distributional hypothesis, affordances, scene understanding, embodied experience, Human-Calibrated Cosine Distance, image captioning, co-occurrence, agent-centered cognition, 3D spatial information
Authors
Gillian Rosenberg, Skylar Stadhard, Bruce C. Hansen, Michelle R. Greene
Abstract
What information is sufficient to learn the full richness of human scene understanding? The distributional hypothesis holds that the statistical co-occurrence of language and images captures the conceptual knowledge underlying visual cognition. Vision-language models (VLMs) are trained on massive paired text-image corpora but lack embodied experience, making them an ideal test of the distributional hypothesis. We report two experiments comparing descriptions generated by 18 VLMs to those of over 2000 human observers across 15 high-level scene understanding tasks, spanning general knowledge, affordances, sensory experiences, affective responses, and future prediction. Because many tasks lack ground truth answers, we developed a Human-Calibrated Cosine Distance (HCD) metric that measures VLM output similarity to the distribution of human responses, scaled by within-human variability. In Experiment 1, VLMs approached human-level performance on general knowledge tasks, but showed a robust deficit for affordance tasks that resisted prompt engineering and did not improve with newer model releases. In Experiment 2, we tested six mechanistic hypotheses for explaining this affordance gap, finding that the deficit was structural rather than stylistic and was not resolved by providing explicit spatial information. Corpus analyses revealed that image captioning datasets contain sparse agent-addressed affordance language, consistent with Gricean accounts of why embodied knowledge may be systematically underrepresented in language. Together, these findings suggest that distributional learning from images and text is insufficient for affordance-based scene understanding, implying that some dimensions of human visual cognition may require the kind of agent-centered, three-dimensional experience that no photograph or caption can encode.
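A note on the Human-Calibrated Cosine Distance (HCD): the abstract describes it only at a high level, as the similarity of VLM output to the distribution of human responses, scaled by within-human variability. The sketch below is one plausible reading of that description, not the paper's definition: it assumes responses are embedded with some sentence encoder, and the ratio-of-mean-cosine-distances formulation, along with the function names, is an assumption.

```python
import numpy as np

def cosine_distance(a: np.ndarray, b: np.ndarray) -> float:
    """1 minus cosine similarity between two embedding vectors."""
    return 1.0 - float(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))

def human_calibrated_cosine_distance(vlm_emb: np.ndarray,
                                     human_embs: np.ndarray) -> float:
    """Hypothetical HCD-style score: distance from a VLM description to the
    human response distribution, scaled by within-human variability.

    vlm_emb    -- (d,) embedding of the VLM-generated description
    human_embs -- (n, d) embeddings of n human descriptions of the same task
    """
    # Mean distance from the VLM output to each human response
    vlm_to_human = np.mean([cosine_distance(vlm_emb, h) for h in human_embs])

    # Mean pairwise distance among human responses (within-human variability)
    n = len(human_embs)
    pairwise = [cosine_distance(human_embs[i], human_embs[j])
                for i in range(n) for j in range(i + 1, n)]
    within_human = np.mean(pairwise)

    # A value near 1.0 means the VLM is about as far from the human consensus
    # as humans are from each other; larger values indicate a deficit.
    return vlm_to_human / within_human
```

Under this reading, scores remain comparable across tasks with very different response variability, which is what allows general-knowledge and affordance tasks to be placed on a common scale despite lacking ground-truth answers.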