How Seemingly Inconsequential Design Choices Dictate Performance of LLMs in Pathology

2026-06-10 • Computer Vision and Pattern Recognition

AI summary

The authors studied how general-purpose large language models (LLMs) are used to analyze whole-slide images (WSIs), the very large medical images used in pathology. They found that earlier work underestimated LLMs because it fed them small, high-magnification patches that were processed independently and combined by majority voting, which did not make full use of the models' abilities. By varying how image patches are fed to the LLM (patch size, patch count, magnification, and inference mode), they substantially improved LLM accuracy on cancer-type and organ classification tasks. A single improved configuration also carried over to other models and to a held-out dataset without task-specific tuning. This suggests that LLMs can be more competitive in pathology than previously thought when their inputs are configured carefully.

large language models, whole-slide images, pathology, patch size, magnification, inference mode, cancer classification, organ classification, MultiPathQA, generalist vs specialized models

Authors

Kian R. Weihrauch, Thomas A. Buckley, William Lotter, Arjun K. Manrai

Abstract

General-purpose large language models (LLMs) are routinely used as baselines when evaluating specialized pathology models on whole-slide images (WSIs). Because WSIs exceed contemporary model context limits, LLM baselines routinely use small, high-magnification patches processed independently via majority voting, without systematic evaluation of seemingly inconsequential design choices such as patch size, patch count, and magnification. Generalist LLMs have consistently underperformed specialized systems, reinforcing the perception that domain-specific training or architectural adaptation is necessary for pathology tasks involving WSIs. Here, we conduct a systematic factorial analysis of four input design factors: inference mode, patch size, magnification, and patch count. We demonstrate that prior studies have overstated the gap between specialized models and general-purpose LLMs by choosing non-optimized input configurations. On the MultiPathQA benchmark, switching to a single balanced configuration (large patches at lower magnification, processed jointly) raises GPT-5 from 15.1% to 39.5% on cancer-type classification (TCGA) and from 38.1% to 62.9% on organ classification (GTEx). Per-task optimization yields further gains up to 43.9% (TCGA) and 71.6% (GTEx). The same configuration generalizes to two other models and to a fully held-out CPTAC cohort, where it improves Gemini 3 Flash by 23.4 percentage points without any task-specific tuning.
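To make the design space concrete, here is a minimal Python sketch of a full factorial grid over the four named factors, together with the majority-voting step that the abstract attributes to earlier baselines. It is an illustration, not the authors' code: the factor levels, the dictionary keys, and the function names are all hypothetical, since the abstract names the factors but not the values tested.

```python
from collections import Counter
from itertools import product

# Hypothetical factor levels (assumption): the abstract lists the four
# factors but does not state which values were evaluated.
FACTORS = {
    "inference_mode": ["independent_vote", "joint"],
    "patch_size_px": [224, 1024],
    "magnification": ["5x", "20x"],
    "patch_count": [4, 16],
}


def factorial_grid(factors):
    """Return every combination of factor levels as a list of dicts."""
    names = list(factors)
    return [dict(zip(names, levels)) for levels in product(*factors.values())]


def majority_vote(labels):
    """Return the most frequent label; ties go to the label seen first."""
    if not labels:
        raise ValueError("labels must not be empty")
    return Counter(labels).most_common(1)[0][0]


def gain_pp(before_pct, after_pct):
    """Accuracy gain in percentage points, rounded to one decimal."""
    return round(after_pct - before_pct, 1)


if __name__ == "__main__":
    grid = factorial_grid(FACTORS)
    print(len(grid), "configurations in this illustrative grid")
    # Per-patch predictions combined into one slide-level label.
    print(majority_vote(["lung", "breast", "lung", "colon"]))
    # GPT-5 accuracies reported in the abstract for TCGA and GTEx.
    print(gain_pp(15.1, 39.5), gain_pp(38.1, 62.9))
```

In the joint inference mode there is no voting step: all patches are passed to the model in one request and it returns a single label. With the figures reported in the abstract, `gain_pp` gives improvements of 24.4 percentage points on TCGA and 24.8 on GTEx for the single balanced configuration.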
