PlantMarkerBench: A Multi-Species Benchmark for Evidence-Grounded Plant Marker Reasoning
2026-05-11 • Computation and Language
AI summary
The authors created PlantMarkerBench, a tool to check how well computers can find and understand plant gene markers from scientific papers. It focuses on four plants and includes thousands of examples showing if sentences correctly describe gene-cell relationships and what type of evidence they provide. They tested different AI models and found that while these models do well with direct gene expression information, they struggle with more complex or weaker evidence. This benchmark helps improve AI methods for accurately extracting scientific facts about plants from research articles.
Keywords
plant marker genes, gene expression, biological evidence, literature mining, AI language models, benchmark dataset, Arabidopsis, maize, rice, tomato
Authors
Sajib Acharjee Dip, Song Li, Liqing Zhang
Abstract
Cell-type-specific marker genes are fundamental to plant biology, yet existing resources primarily rely on curated databases or high-throughput studies without explicitly modeling the supporting evidence found in scientific literature. We introduce PlantMarkerBench, a multi-species benchmark for evaluating literature-grounded plant marker evidence interpretation from full-text biological papers. PlantMarkerBench is constructed using a modular curation pipeline integrating large-scale literature retrieval, hybrid search, species-aware biological grounding, structured evidence extraction, and targeted human review. The benchmark spans four plant species -- Arabidopsis, maize, rice, and tomato -- and contains 5,550 sentence-level evidence instances annotated for marker-evidence validity, evidence type, and support strength. We define two benchmark tasks: determining whether a candidate sentence provides valid marker evidence for a gene-cell-type pair, and classifying the evidence into expression, localization, function, indirect, or negative categories. We benchmark diverse open-weight and closed-source language models across species and prompting strategies. Although frontier models achieve relatively strong performance on direct expression evidence, performance drops substantially on functional, indirect, and weak-support evidence, with evidence-type confusion emerging as a dominant failure mode. Open-weight models additionally exhibit elevated false-positive rates under ambiguous biological contexts. PlantMarkerBench provides a challenging and reproducible evaluation framework for literature-grounded biological evidence attribution and supports future research on trustworthy scientific information extraction and AI-assisted plant biology.
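To make the two benchmark tasks concrete, here is a minimal sketch of what a sentence-level evidence instance and a per-evidence-type evaluation might look like. The field names and schema below are assumptions for illustration; the released benchmark may use different keys and formats.

```python
from dataclasses import dataclass
from collections import defaultdict

# The five evidence categories defined by the benchmark's Task 2.
EVIDENCE_TYPES = ["expression", "localization", "function", "indirect", "negative"]

@dataclass
class MarkerInstance:
    """One sentence-level evidence instance (hypothetical schema)."""
    species: str        # e.g. "Arabidopsis", "maize", "rice", "tomato"
    gene: str           # candidate marker gene
    cell_type: str      # candidate cell type
    sentence: str       # evidence sentence drawn from a full-text paper
    is_valid: bool      # Task 1 label: does the sentence give valid marker evidence?
    evidence_type: str  # Task 2 label: one of EVIDENCE_TYPES

def per_type_accuracy(instances, predictions):
    """Task 2 accuracy broken down by gold evidence type, to surface
    the gap between direct-expression and weaker evidence categories."""
    correct, total = defaultdict(int), defaultdict(int)
    for inst, pred in zip(instances, predictions):
        total[inst.evidence_type] += 1
        correct[inst.evidence_type] += int(pred == inst.evidence_type)
    return {t: correct[t] / total[t] for t in total}
```

Reporting accuracy per evidence type rather than overall is what exposes the failure mode the abstract describes: strong performance on `expression` evidence alongside substantially weaker performance on `function`, `indirect`, and low-support instances.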