MolSight: Molecular Property Prediction with Images

2026-05-11 • Computer Vision and Pattern Recognition

Computer Vision and Pattern Recognition • Computation and Language

AI summary

The authors study how well machines can predict molecular properties using just simple 2D pictures of molecules, instead of more complex 3D models or large language models. They tested many different vision models and training methods on millions of molecule images and found that using a curriculum based on chemical complexity improved results. Their approach performs competitively on various chemistry tasks while requiring far less computing power. This shows that looking at molecules like a human chemist does, using just images, can still give useful predictions.

Molecular Property Prediction • 2D Skeletal Diagram • Vision-based Models • Curriculum Learning • Chemical Complexity • Physical-property Regression • Drug Discovery • Quantum Chemistry • Pre-training • FLOPs

Authors

Aaditya Baranwal, Akshaj Gupta, Shruti Vyas, Yogesh S Rawat

Abstract

Every molecule ever synthesised can be drawn as a 2D skeletal diagram, yet in modern property prediction this universally available representation has received less attention than molecular graphs, 3D conformers, or billion-parameter language models, each of which imposes its own computational and data-engineering overhead. We present $\textbf{MolSight}$, the first systematic large-scale study of vision-based Molecular Property Prediction (MPP). Using 10 vision architectures, 7 pre-training strategies, and 2 million molecule images, we evaluate performance across 10 downstream tasks spanning physical-property regression, drug-discovery classification, and quantum-chemistry prediction. To account for the wide variation in structural complexity across pre-training molecules, we further propose a $\textbf{chemistry-informed curriculum}$: five structural complexity descriptors partition the corpus into five tiers of increasing chemical difficulty, and pre-training on this curriculum consistently outperforms non-curriculum baselines. We show that a single rendered bond-line image, processed by a vision encoder, is sufficient for competitive molecular property prediction, i.e., $\textit{chemical insight from sight alone}$. The best curriculum-trained configuration achieves the top result on $\textbf{5 of 10}$ benchmarks and a top-two result on $\textbf{all 10}$, at $\mathbf{80\times}$ lower FLOPs than the nearest multi-modal competitor.
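The tiering step of the curriculum can be illustrated with a minimal sketch. The abstract does not name the five descriptors or say how they are combined, so everything here is an assumption: descriptor values are taken as precomputed, each descriptor is standardised and averaged with equal weight, and molecules are split into equal-sized tiers by rank. The function name `assign_tiers` and the toy data are hypothetical and not part of MolSight.

```python
from statistics import mean, pstdev


def assign_tiers(descriptors, n_tiers=5):
    """Map each molecule ID to a tier in 1..n_tiers (1 = simplest).

    `descriptors` maps a molecule ID to a tuple of precomputed
    structural-complexity values (hypothetical; the paper's actual
    descriptors are not listed in the abstract).
    """
    ids = list(descriptors)
    n_features = len(descriptors[ids[0]])

    # Standardise each descriptor column so no single one dominates.
    columns = [[descriptors[i][k] for i in ids] for k in range(n_features)]
    stats = [(mean(col), pstdev(col) or 1.0) for col in columns]

    # Assumed composite score: equal-weight mean of the z-scores.
    scores = {
        i: mean([(descriptors[i][k] - m) / s for k, (m, s) in enumerate(stats)])
        for i in ids
    }

    # Rank by score (ID breaks ties) and cut into equal-sized tiers.
    ranked = sorted(ids, key=lambda i: (scores[i], i))
    return {mol: rank * n_tiers // len(ranked) + 1 for rank, mol in enumerate(ranked)}


# Toy corpus: ten molecules whose five descriptor values grow together.
toy = {f"mol{i}": (i, i, i, i, i) for i in range(10)}
tiers = assign_tiers(toy)
```

A curriculum would then present tier 1 first and add higher tiers as pre-training proceeds; the schedule itself is not described in the abstract and is left out of the sketch.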
