MicroWorld: Empowering Multimodal Large Language Models to Bridge the Microscopic Domain Gap with Multimodal Attribute Graph
2026-05-11 • Computer Vision and Pattern Recognition • Artificial Intelligence
AI summary
The authors created MicroWorld, a system that helps AI models better understand and reason about microscopy images without any extra training. They built a large knowledge graph linking scientific images and captions by identifying biomedical entities and their relationships. When the AI model receives a question, MicroWorld matches it against this graph to supply useful context, improving the model's accuracy on microscopy reasoning tests. Their approach improved performance significantly over previous methods and generalized well to new problems.
Multimodal large language models, Microscopy, Knowledge graph, Biomedical entities, Entity-relation extraction, Embedding space, Scientific image-caption corpora, Inference augmentation, Qwen3-VL, MicroVQA benchmark
Authors
Manyu Li, Ruian He, Chenxi Ma, Weimin Tan, Bo Yan
Abstract
Multimodal large language models (MLLMs) show remarkable potential for scientific reasoning, yet their performance in specialized domains such as microscopy remains limited by the scarcity of domain-specific training data and the difficulty of encoding fine-grained expert knowledge into model parameters. To bridge this gap, we introduce MicroWorld, a framework that constructs a multimodal attributed property graph (MAPG) from large-scale scientific image-caption corpora and leverages it to augment MLLM reasoning at inference time without any domain-specific fine-tuning. MicroWorld extracts biomedical entities and relations via scispaCy or LLM-based triplet mining, aligns images and entities in a shared embedding space using Qwen3-VL-Embedding, and assembles a knowledge graph comprising approximately 111K nodes and 346K typed edges spanning eight relation categories. At inference time, a graph-augmented retrieval pipeline matches query entities to the MAPG and injects structured knowledge context into the MLLM prompt. On the MicroVQA benchmark, MicroWorld improves the reasoning performance of Qwen3-VL-8B-Instruct by 37.5%, outperforming GPT-5 by 13.0% to achieve a new state-of-the-art. Furthermore, it yields a 6.0% performance gain on the MicroBench benchmark. Extensive experiments demonstrate the enhanced generalization capability introduced by MicroWorld. A qualitative case study further reveals both the mechanisms through which structured knowledge improves reasoning and the failure modes that point to promising future directions. Code and data are available at https://github.com/ieellee/MicroWorld.
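The graph-augmented retrieval step described in the abstract (match query entities to MAPG nodes in a shared embedding space, then inject the matched subgraph as structured context into the MLLM prompt) can be sketched as follows. This is a minimal illustration only: the toy graph, the `embed` placeholder (a character-hash vector standing in for Qwen3-VL-Embedding), and the triple-serialization format are all assumptions, not the authors' actual implementation.

```python
# Hedged sketch of graph-augmented retrieval: entity matching by
# embedding similarity, then serializing typed edges into prompt context.
import math

# Toy multimodal attributed property graph: node -> list of typed edges.
# (The real MAPG has ~111K nodes and ~346K edges over 8 relation types.)
MAPG = {
    "mitochondrion": [("part_of", "eukaryotic cell"),
                      ("has_function", "ATP synthesis")],
    "cristae": [("part_of", "mitochondrion"),
                ("visible_in", "TEM image")],
    "ATP synthesis": [("occurs_in", "mitochondrion")],
}

def embed(text: str, dim: int = 64) -> list:
    """Placeholder text embedding (character-hash based, L2-normalized).
    The paper aligns images and entities with Qwen3-VL-Embedding instead."""
    vec = [0.0] * dim
    for i, ch in enumerate(text.lower()):
        vec[(ord(ch) + i) % dim] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

def cosine(a: list, b: list) -> float:
    # Vectors are unit-length, so the dot product is the cosine similarity.
    return sum(x * y for x, y in zip(a, b))

def retrieve_context(query_entities: list, k: int = 2) -> str:
    """Match each query entity to its nearest graph node, then serialize
    up to k of the node's typed edges as textual triples for the prompt."""
    node_vecs = {n: embed(n) for n in MAPG}
    lines = []
    for ent in query_entities:
        qv = embed(ent)
        best = max(node_vecs, key=lambda n: cosine(qv, node_vecs[n]))
        for rel, obj in MAPG[best][:k]:
            lines.append(f"({best}) -[{rel}]-> ({obj})")
    return "Structured knowledge context:\n" + "\n".join(lines)

# A query mentioning "mitochondria" resolves to the nearest node and
# pulls in its typed edges as context for the MLLM prompt.
prompt_context = retrieve_context(["mitochondria"])
print(prompt_context)
```

In the full system this retrieved context is prepended to the question before it reaches the MLLM, so the structured knowledge shapes the model's reasoning without any parameter updates.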