How Far Can You Get Without a GPU? A Systematic Benchmark of Lightweight Hallucination Detection Across Question Answering, Dialogue, and Summarisation

2026-06-29 • Computation and Language

Computation and Language • Artificial Intelligence

AI summary

The authors study how well computer programs can detect false information generated by AI without needing powerful GPUs or special access to the AI models. They test five simple, CPU-friendly methods on three types of tasks: answering questions, chatting, and summarizing text. Their results show that no one method works best for all tasks, and these methods struggle a lot with summarization. The authors provide guidance on which methods to use depending on the task when computational resources are limited.

hallucination detection, GPU, CPU, ROUGE-L, BERTScore, Natural Language Inference (NLI), DeBERTa, FEVER dataset, HaluEval benchmark, AUC-ROC

Authors

Kriti Faujdar, Smit Kadvani

Abstract

Hallucination detection has become a pressing requirement for trustworthy AI deployment at scale. The most accurate detection methods depend on GPU-intensive inference, proprietary API calls, or white-box access to the generating model. This puts them out of reach for resource-constrained researchers and practitioners. In this paper, we explore a practical alternative: how well can hallucination detection perform using only lightweight, CPU-feasible methods built on publicly available models? We systematically benchmark five such methods: ROUGE-L, semantic similarity, BERTScore, a Natural Language Inference (NLI) detector based on a FEVER-trained DeBERTa model, and a score-level ensemble of similarity and NLI. We evaluate them across all three tasks of the HaluEval benchmark: question answering (QA), dialogue, and summarisation. We calibrate each method on a held-out validation split and evaluate it on 2,000 test instances per task. We find that no single method dominates and performance is highly task-dependent. The ensemble performs best on QA (F1 = 0.792, AUC-ROC = 0.873), the NLI detector leads on dialogue (AUC-ROC = 0.713), and all five methods degrade to near-random performance on summarisation (AUC-ROC between 0.469 and 0.574). This task-dependence and the systematic failure on summarisation map the practical frontier of GPU-free hallucination detection. They give practical guidance for method selection under computational constraints. All experiments run on a standard laptop CPU using public models.
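
The abstract names three steps: a score-level ensemble, threshold calibration on a validation split, and evaluation by F1 and AUC-ROC. The sketch below illustrates them in plain Python under stated assumptions. The abstract does not say how the ensemble combines its inputs or which criterion the calibration optimises, so the equal-weight mean in `ensemble_score` and the F1-maximising search in `calibrate_threshold` are illustrative choices, not the authors' implementation. All function names and the toy scores are hypothetical. Scores are treated as "support" scores in [0, 1], where a lower value means the output is more likely hallucinated.

```python
from typing import List, Sequence, Tuple


def ensemble_score(similarity: float, nli_support: float, weight: float = 0.5) -> float:
    """Assumed score-level ensemble: weighted mean of two support scores."""
    return weight * similarity + (1.0 - weight) * nli_support


def f1_at_threshold(scores: Sequence[float], labels: Sequence[int], threshold: float) -> float:
    """F1 for the hallucinated class (label 1); predict 1 when score <= threshold."""
    tp = fp = fn = 0
    for score, label in zip(scores, labels):
        predicted = 1 if score <= threshold else 0
        if predicted == 1 and label == 1:
            tp += 1
        elif predicted == 1 and label == 0:
            fp += 1
        elif predicted == 0 and label == 1:
            fn += 1
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def calibrate_threshold(scores: Sequence[float], labels: Sequence[int]) -> Tuple[float, float]:
    """Return the (threshold, F1) pair that maximises F1 on a validation split."""
    best_threshold, best_f1 = 0.0, -1.0
    for candidate in sorted(set(scores)):
        f1 = f1_at_threshold(scores, labels, candidate)
        if f1 > best_f1:
            best_threshold, best_f1 = candidate, f1
    return best_threshold, best_f1


def auc_roc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Probability that a hallucinated item scores below a faithful one (ties count half)."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p < n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Toy validation data (made up): (similarity, NLI support) pairs and labels,
# where label 1 marks a hallucinated output.
val_pairs: List[Tuple[float, float]] = [
    (0.9, 0.8), (0.8, 0.9), (0.7, 0.6),
    (0.3, 0.2), (0.4, 0.1), (0.6, 0.5),
]
val_labels = [0, 0, 0, 1, 1, 1]

val_scores = [ensemble_score(sim, nli) for sim, nli in val_pairs]
threshold, best_f1 = calibrate_threshold(val_scores, val_labels)
auc = auc_roc(val_scores, val_labels)
```

In a real run, the threshold chosen on the validation split would be frozen and applied unchanged to the test instances. AUC-ROC is threshold-free, which is why it can be reported alongside F1 as a check that does not depend on the calibration step.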
