Attend to Anything: Foundation Model for Unified Human Attention Modeling

2026-06-02 • Computer Vision and Pattern Recognition

AI summary

The authors present the Attend to Anything Model (AAM), a single model that predicts where people look (attention, or saliency) across image, video, and audio-visual tasks. Instead of treating each task or scene separately, the model describes attention through language prompts arranged from general to specific, embedded in hyperbolic space. To link attention across video frames over time, it treats attention as a diffusion process governed by the Fokker-Planck equation. Across 16 benchmarks, the authors report an average improvement of about 6% over previous methods and roughly 4× faster video inference, suggesting the model could serve as a base for future attention research.

attention modeling • saliency • multi-modal • hierarchical embeddings • hyperbolic space • Fokker-Planck equation • video inference • foundation model • diffusion process

Authors

Wenzhuo Zhao, Ronghao Xian, Keren Fu, Qijun Zhao

Abstract

Existing human attention (saliency) modeling methods remain highly fragmented across modalities, scenes, and task formulations. Consequently, even as model capacity and data scale increase, current models remain predominantly scene-dependent and task-specific, and fail to generalize in real-world applications. To address these fundamental limitations, we present the Attend to Anything Model (AAM), a multi-modal foundation model that unifies attention modeling across various image, video, and audio-visual tasks and scenes. AAM reformulates attention as a cognitive entailment relationship organized in a general-to-specific hierarchy, implemented through language prompts with hierarchical embeddings in hyperbolic space. Furthermore, to unify static image and dynamic video attention, we adopt a fluid-dynamics perspective, formulating video-frame attention as a diffusive temporal evolution governed by the Fokker-Planck equation. Extensive experiments on 16 benchmarks demonstrate that AAM consistently outperforms state-of-the-art methods by an average of 6% across various scenarios, while achieving approximately a 4× speedup in video inference. Overall, these results demonstrate that AAM provides a principled foundation for future research on attention and saliency-related tasks. The dataset and code will be available at https://github.com/wz-zhao/Attend-to-Anything.
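
For readers unfamiliar with the Fokker-Planck equation mentioned in the abstract, the following is a minimal illustrative sketch and not AAM's actual formulation, which the abstract does not specify. It assumes the textbook one-dimensional form ∂p/∂t = -∂(μp)/∂x + D ∂²p/∂x² with constant drift μ and diffusion coefficient D, and evolves a toy attention density along a 1D strip using an explicit finite-volume step. The function names, grid size, and parameter values are all hypothetical choices made for illustration.

```python
def fokker_planck_step(p, drift, diffusion, dx, dt):
    """Advance a 1D density p by one explicit step of
    dp/dt = -d(drift * p)/dx + diffusion * d2p/dx2
    with zero-flux boundaries (assumed constant drift and diffusion)."""
    n = len(p)
    # flux[i] is the probability flux through the interface between
    # cell i-1 and cell i. flux[0] and flux[n] stay 0.0, so no mass
    # leaves the domain.
    flux = [0.0] * (n + 1)
    for i in range(1, n):
        # Upwind choice for the drift term keeps the scheme stable.
        upwind = p[i - 1] if drift >= 0 else p[i]
        flux[i] = drift * upwind - diffusion * (p[i] - p[i - 1]) / dx
    # Each cell changes by the net flux across its two interfaces.
    return [p[i] - dt / dx * (flux[i + 1] - flux[i]) for i in range(n)]


def evolve(p, drift, diffusion, dx, dt, steps):
    """Apply fokker_planck_step repeatedly and return the final density."""
    for _ in range(steps):
        p = fokker_planck_step(p, drift, diffusion, dx, dt)
    return p


# Toy "attention" density: all mass starts at one location on the strip.
n = 41
initial = [0.0] * n
initial[10] = 1.0

# Hypothetical parameters, chosen to satisfy the explicit-scheme
# stability limits (diffusion * dt / dx**2 <= 0.5, |drift| * dt / dx <= 1).
final = evolve(initial, drift=1.0, diffusion=0.5, dx=1.0, dt=0.2, steps=50)

total_mass = sum(final)
mean_position = sum(i * v for i, v in enumerate(final))
print(f"total mass: {total_mass:.6f}, mean position: {mean_position:.2f}")
```

In this toy setting the density stays normalized while its center drifts and its peak spreads out, which is the qualitative behavior a diffusive temporal evolution of attention between frames would have. How AAM parameterizes drift and diffusion, and how it applies them to 2D frames, is not stated in the abstract.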