HypothesisMed: Inference-Time Answer Fusion and Structured Hypothesis-Space Reporting for Biomedical Question Answering
2026-05-31 • Computation and Language
Computation and Language
AI summaryⓘ
The authors created HypothesisMed, a method to check how reliable biomedical question-answering AI models are beyond just checking if their answers are right. It looks at whether models give answers that can be easily checked, follow rules about answer reliability, and avoid confidently giving wrong answers. They tested this method with several AI models on different medical question sets and showed it improved how often models gave reliable output, not just correct answers. The main idea is to provide a clear way to audit and trust AI answers in medicine, rather than just improving accuracy.
biomedical question answeringlarge language modelschain-of-thought promptinginference-time pipelineanswer accuracyparseabilitySPACE labelsconfidence calibrationanswer fusionreliability evaluation
Authors
Md Motaleb Hossen Manik, Ge Wang
Abstract
Biomedical question answering with large language models is commonly evaluated using answer accuracy, but answer accuracy alone does not indicate whether a model can produce parseable outputs, follow structured reliability instructions, recognize weak answer spaces, or avoid confident incorrect commitments. This paper presents HypothesisMed, an inference-time reliability pipeline for biomedical multiple-choice question answering. It combines direct, chain-of-thought, HypothesisMed-v3 prompting, and answer fusion. The final answer is selected by fusion, while HypothesisMed-v3 supplies SPACE labels and confidence information. SPACE labels mark the answer space as VALID, INCOMPLETE, or CONTRADICTED. We evaluate Qwen2.5-7B, Phi-4-mini, DeepSeek-R1-32B, and BioMistral-7B on MedQA, MedMCQA, and PubMedQA using 1,000 examples per dataset. The pipeline improves weighted accuracy over each model's best direct or chain-of-thought baseline while increasing parse and SPACE coverage. We also scale evaluation to Qwen2.5-7B and Phi-4-mini using 10,183 examples per model. Fusion improves Phi-4-mini accuracy from 0.4296 to 0.5192, while Qwen2.5-7B chain-of-thought remains slightly higher in answer accuracy. However, Qwen2.5-7B fusion achieves complete parse and SPACE coverage with much lower false commitment. A 12,000-example SPACE stress test shows answer-space diagnosis remains difficult, with SPACE accuracy of 0.3074 for Qwen2.5-7B and 0.4168 for Phi-4-mini. These results show that answer accuracy, parseability, structured reliability reporting, calibration behavior, and false-commitment behavior are separable capabilities. The main contribution is not a universal state-of-the-art claim, but a reproducible inference-time framework for evaluating biomedical question answering models as auditable workflow components under structured reliability constraints.