AFUN: Towards an Affordance Foundation Model for Functionality Understanding

2026-06-01 • Robotics

Robotics • Computer Vision and Pattern Recognition

AI summary

The authors developed a new model that helps robots understand where and how to interact with objects to complete tasks, using just a single RGB-D image and a description of the task. Their model predicts both the exact area on an object to touch and the 3D movement required after contact. They created a large dataset combining different sources to teach the model and showed it performs better than existing methods in locating interaction points and predicting movements. Importantly, their model works well in varied real-world settings without needing extra adjustments for different robots or tasks.

affordance understanding, RGB-D imaging, functional segmentation, 3D motion prediction, robot manipulation, task-conditional modeling, point cloud, generalization, object-centric data, open-world environments

Authors

Zhaoning Wang, Yi Zhong, Jiawei Fu, Henrik I. Christensen, Jun Gao

Abstract

Affordance understanding bridges visual perception and physical action, serving as an explainable interface for robot manipulation in open and unstructured real-world environments. Yet building an affordance foundation model that not only understands where and how an interaction should happen, but also generalizes across diverse environments, objects, and tasks, remains a long-standing research challenge. Existing methods typically address only part of this challenge: they either localize task-relevant regions without specifying executable motion, or predict motion with limited scalability. In this paper, we present AFUN, a step towards an affordance foundation model for functionality understanding. From a single RGB-D observation and a language task description, AFUN predicts a task-conditional functional mask (where to interact) and a 3D post-contact motion curve (how to interact). To support open-world generalization, we build a large-scale standardized data pipeline that converts heterogeneous robot, human, simulation, and real-world scan data into a shared affordance schema with language, masks, and object-centric 3D motion labels. We evaluate AFUN from three aspects: for affordance segmentation, AFUN outperforms all baselines by a large margin across 8 test sets from 4 benchmarks, improving mean gIoU/cIoU by +23.9/+26.3; for contact-point prediction, it predicts substantially more accurate points, with a 12.7–61.3% hit-rate gain over the best baseline; and for 3D motion, it achieves the best performance on all three test sets. AFUN can be deployed for real-world robot manipulation without finetuning for robot embodiment or using task-specific heuristics, demonstrating the ability to adapt to open-world affordance tasks. Project page: https://www.zhaoningwang.com/AFUN
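The abstract reports segmentation results as gIoU/cIoU without defining them. The sketch below shows one common reading of these metrics, as used in language-conditioned segmentation benchmarks: gIoU as the mean of per-image IoUs, and cIoU as cumulative intersection over cumulative union across the whole test set. This is an assumption about the convention, not the authors' evaluation code, and the function name `giou_ciou` and the toy masks are hypothetical.

```python
from itertools import chain


def _flatten(mask):
    # Turn a 2D mask (list of rows of 0/1 values) into a flat list of bools.
    return [bool(v) for v in chain.from_iterable(mask)]


def giou_ciou(pred_masks, gt_masks):
    """Return (gIoU, cIoU) under the assumed convention:
    gIoU = mean of per-image IoU, cIoU = sum(intersections) / sum(unions)."""
    if not pred_masks or len(pred_masks) != len(gt_masks):
        raise ValueError("need equal, non-empty lists of masks")
    per_image_ious = []
    total_inter = 0
    total_union = 0
    for pred, gt in zip(pred_masks, gt_masks):
        p, g = _flatten(pred), _flatten(gt)
        inter = sum(a and b for a, b in zip(p, g))
        union = sum(a or b for a, b in zip(p, g))
        # Treat an empty prediction on an empty target as a perfect match.
        per_image_ious.append(inter / union if union else 1.0)
        total_inter += inter
        total_union += union
    giou = sum(per_image_ious) / len(per_image_ious)
    ciou = total_inter / total_union if total_union else 1.0
    return giou, ciou


# Toy example: two 2x2 masks, one half-right and one exact.
preds = [[[1, 1], [0, 0]], [[1, 1], [1, 1]]]
gts = [[[1, 0], [0, 0]], [[1, 1], [1, 1]]]
giou, ciou = giou_ciou(preds, gts)
```

In the toy example the per-image IoUs are 0.5 and 1.0, so gIoU is 0.75, while cIoU is 5/6 because the larger mask contributes more pixels. This is why the two numbers can differ on the same predictions: cIoU is weighted toward large regions, whereas gIoU treats every image equally.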
