See & Sniff: Learning Visuo-Olfactory Representations

2026-06-25 Computer Vision and Pattern Recognition

AI summary

The authors created a new dataset called SmellNet-V that links smells with matching images based on their category, without needing to collect both at the same time. They developed a method named See & Sniff that learns to connect visual information with smells, allowing it to identify and locate smells in images. Their approach performs better than smell-only methods and works well for tasks like smell recognition and finding where a smell comes from in a picture. This work opens up a new area in combining vision and smell in AI.

multimodal learning, olfaction, self-supervised learning, visuo-olfactory dataset, cross-modal retrieval, smell classification, spatial grounding, dense local alignment, saliency maps, benchmarking
Authors
Seongyu Kim, Seungwoo Lee, Hyeonggon Ryu, Joon Son Chung, Arda Senocak
Abstract
While modern multimodal models integrate vision with language, audio, or touch, olfaction remains largely unexplored due to the lack of paired visuo-olfactory data. We introduce SmellNet-V, a scalable visuo-olfactory dataset built on the insight that odor identity is largely invariant to visual transformations within a semantic category. This allows us to synthetically pair smell-only samples with semantically aligned in-the-wild web images, converting a unimodal olfactory dataset into a cross-modal benchmark without costly co-collection. Building on this dataset, we propose See & Sniff, a self-supervised framework that learns joint visuo-olfactory representations via dense local alignment and naturally produces smell saliency maps for spatial grounding of odor sources. We further introduce a pixel-level smell localization task and a benchmark for evaluation. Our method surpasses smell-only baselines by 7% in smell classification from smell alone and generalizes to cross-modal retrieval and smell localization, establishing visuo-olfactory learning as a new direction in multimodal perception.
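The two ideas in the abstract, category-level pairing and dense local alignment, can be illustrated with a minimal sketch. The abstract does not specify the actual pairing procedure, encoders, or similarity function, so everything below is an assumption made for illustration: the function names (`pair_by_category`, `smell_saliency`), the random choice of image within a category, the use of cosine similarity, and the toy two-dimensional embeddings are all hypothetical and not taken from the paper.

```python
import math
import random


def pair_by_category(smell_samples, images_by_category, seed=0):
    """Pair each smell sample with one image sharing its category label.

    Hypothetical helper: assumes each smell sample is (sample_id, category)
    and that any image of the same category is an acceptable partner.
    """
    rng = random.Random(seed)
    pairs = []
    for smell_id, category in smell_samples:
        candidates = images_by_category.get(category, [])
        if candidates:  # categories without images are skipped
            pairs.append((smell_id, rng.choice(candidates)))
    return pairs


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def smell_saliency(smell_embedding, patch_embeddings):
    """Score every image patch against one smell embedding.

    patch_embeddings is a grid (list of rows) of patch vectors; the result
    is a grid of the same shape holding cosine similarities. This is an
    assumed stand-in for a saliency map, not the paper's formulation.
    """
    return [[cosine(smell_embedding, p) for p in row] for row in patch_embeddings]


# Toy data: "durian" has no images, so it produces no pair.
smells = [("s1", "coffee"), ("s2", "banana"), ("s3", "durian")]
images = {"coffee": ["coffee_01.jpg", "coffee_02.jpg"], "banana": ["banana_01.jpg"]}
pairs = pair_by_category(smells, images)

# Toy 2x2 patch grid with 2-dimensional embeddings.
smell_vec = [1.0, 0.0]
patches = [[[1.0, 0.0], [0.0, 1.0]],
           [[1.0, 1.0], [-1.0, 0.0]]]
saliency = smell_saliency(smell_vec, patches)
```

In this toy setup the patch whose embedding points in the same direction as the smell embedding receives the highest score, which is the intuition behind reading a per-patch similarity grid as a localization map for the odor source.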