Research Entity Extraction and Topic Detection from UKRI Grant Proposals

2026-06-29 • Digital Libraries

Digital Libraries, Artificial Intelligence, Information Retrieval

AI summary

The authors compared three methods using large language models (LLMs) to find and organize research topics from UK funding proposals. They tested GPT-4o, Mistral, and a custom method called DSIT-Taxonomies on 42 proposal abstracts. They found that Mistral and GPT-4o identified research topics similarly well, but Mistral was better at correctly classifying those topics. The DSIT-Taxonomies method was less consistent and accurate. Overall, the authors suggest Mistral is a reliable and efficient tool for analyzing sensitive research funding data at scale.

large language models, entity extraction, topic classification, funding proposals, OpenAlex Topics, Mistral, GPT-4o, DSIT-Taxonomies, semantic overlap, research taxonomies

Authors

Xingran Ruan, Angelo Salatino, Rosa Filgueira, Kara Moraw, Alexandru Marcoci, Gemma Derrick, Sarah Callaghan

Abstract

This paper presents preliminary findings from a UKRI-funded Metascience project comparing three LLM-based approaches (GPT-4o, Mistral, and a bespoke algorithm, DSIT-Taxonomies) for extracting and classifying research entities from funding proposals. Our project, "Tracking Stars and Unicorns", aims to identify early signals of emerging research areas to inform public investment. Our methodology employed a three-stage pipeline, leveraging Mistral for primary entity extraction and mapping against the OpenAlex Topics taxonomy. We evaluated our approach on 42 proposal abstracts from different research areas and observed that Mistral and GPT-4o produce comparable, high-quality entity sets with significant semantic overlap, outperforming the fragmented DSIT-Taxonomies approach. Crucially, the Mistral-based approach achieved superior topic classification accuracy (90.5%) compared to the full DSIT-Taxonomies pipeline (71.4%). We conclude that Mistral offers a high-performance, operationally efficient, and secure solution for large-scale analysis of sensitive grant data.
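
To make the two accuracy figures concrete, the sketch below works backwards from the reported percentages to the number of correctly classified abstracts. It assumes that accuracy was computed as one correct/incorrect judgement per abstract over all 42 abstracts, which the abstract does not state explicitly; the function names `accuracy_pct` and `counts_matching` are hypothetical and are not taken from the paper.

```python
# Assumption: accuracy = (correctly classified abstracts) / 42, reported to
# one decimal place. The raw counts are NOT given in the abstract; this only
# checks which counts are consistent with the reported percentages.

N_ABSTRACTS = 42  # number of proposal abstracts stated in the abstract


def accuracy_pct(correct, total=N_ABSTRACTS):
    """Return classification accuracy as a percentage rounded to one decimal."""
    return round(100 * correct / total, 1)


def counts_matching(pct, total=N_ABSTRACTS):
    """Return every count of correct abstracts that rounds to the given percentage."""
    return [k for k in range(total + 1) if accuracy_pct(k, total) == pct]


mistral_counts = counts_matching(90.5)  # Mistral-based approach
dsit_counts = counts_matching(71.4)     # full DSIT-Taxonomies pipeline

print("Counts consistent with 90.5%:", mistral_counts)
print("Counts consistent with 71.4%:", dsit_counts)
```

Under that assumption, the reported figures are consistent with 38 of 42 abstracts classified correctly for the Mistral-based approach and 30 of 42 for the full DSIT-Taxonomies pipeline, a gap of 8 abstracts. With a sample of this size, each abstract accounts for roughly 2.4 percentage points.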
