Weaving Multi-Source Evidence for Biomedical Reasoning: The BioMedHop Benchmark and BioWeave Framework
2026-06-15 • Computation and Language
AI summary
The authors created BioMedHop, a new test to check how well computers can answer medical questions using information from different sources like databases, documents, and the web. They noticed that past tests didn't focus enough on combining these sources into complex reasoning paths. To tackle this, they also made BioWeave, a system that smartly gathers clues from all these sources to build a connected web of evidence and find the right answers. Their tests showed that BioWeave works better than existing methods and even helps smaller models think more like bigger, more powerful ones.
biomedical question answering, knowledge graph, multi-hop reasoning, evidence retrieval, evidence graph, large language models, hybrid evidence, BioMedHop, BioWeave, entity-level verification
Authors
Xingyu Tan, Shiyuan Liu, Xiaoyang Wang, Qing Liu, Xiwei Xu, Xin Yuan, Liming Zhu, Wenjie Zhang
Abstract
Biomedical question answering (QA) increasingly requires reasoning over interacting entities, where supporting evidence is scattered across biomedical knowledge graphs, literature documents, and web-accessible resources. However, existing biomedical QA benchmarks mainly focus on exam-style knowledge, literature comprehension, or short-range multi-hop inference, leaving source-conditioned graph reasoning and evidence topology construction underexplored. To fill this gap, we introduce BioMedHop, a multi-source graph-grounded benchmark for evaluating biomedical reasoning over structured evidence topologies. BioMedHop contains 10,045 instances across KG, document, web, and hybrid evidence settings, covering shared-neighbor matching, intersection reasoning, path-based reasoning, and counting, with option-based, open-ended, and numeric count renderings. To support this benchmark, we further propose BioWeave, a source-aware reasoning framework that retrieves biomedical KG paths, gathers supporting clues from documents and web sources, assembles them into a unified evidence graph, and verifies answers through entity-level evidence support. Comprehensive experiments show that BioWeave achieves the best overall performance among compared methods on BioMedHop, outperforming the strong hybrid baseline ToG-2 by 10.5% in the overall average. Moreover, BioWeave consistently improves different LLM backbones and enables smaller models, such as Qwen3-4B, to achieve reasoning performance comparable to GPT-4-Turbo.
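To make the evidence-graph idea in the abstract concrete, the following is a minimal sketch, not the authors' implementation. It assumes evidence arrives as (head, relation, tail, source) tuples, merges them into one graph, answers a shared-neighbor question, and accepts a candidate only if it is linked to every anchor entity. All function names, entity names (DrugX, GeneA, and so on), and relations are hypothetical placeholders; the abstract does not specify BioWeave's data structures or verification rule.

```python
from collections import defaultdict


def build_evidence_graph(triples):
    """Merge triples from all sources into one adjacency map keyed by entity.

    Each triple is (head, relation, tail, source); this schema is an assumption.
    """
    graph = defaultdict(set)
    for head, relation, tail, source in triples:
        graph[head].add((relation, tail, source))
        graph[tail].add((relation, head, source))  # treat edges as undirected
    return graph


def neighbors(graph, entity):
    """Entities directly connected to `entity` in the evidence graph."""
    return {other for _, other, _ in graph.get(entity, ())}


def shared_neighbors(graph, a, b):
    """Shared-neighbor matching: entities linked to both a and b."""
    return neighbors(graph, a) & neighbors(graph, b)


def supporting_sources(graph, candidate, anchors):
    """Sources of the edges that connect `candidate` to any anchor entity."""
    return {src for _, other, src in graph.get(candidate, ()) if other in anchors}


def verify(graph, candidate, anchors):
    """Entity-level check (assumed rule): candidate must touch every anchor."""
    linked = neighbors(graph, candidate)
    return all(anchor in linked for anchor in anchors)


# Toy, made-up evidence from three source types (kg, doc, web).
triples = [
    ("DrugX", "targets", "GeneA", "kg"),
    ("DiseaseY", "associated_with", "GeneA", "doc"),
    ("DrugX", "targets", "GeneB", "kg"),
    ("DiseaseY", "associated_with", "GeneC", "web"),
]
graph = build_evidence_graph(triples)
anchors = {"DrugX", "DiseaseY"}

candidates = shared_neighbors(graph, "DrugX", "DiseaseY")
verified = {c for c in candidates if verify(graph, c, anchors)}
sources = supporting_sources(graph, "GeneA", anchors)
```

In this toy graph, GeneA is the only entity connected to both anchors, and its support comes from two different source types, which is the kind of hybrid evidence setting the benchmark describes. A counting question would report the size of the verified set rather than its members.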