SwanNLP at SemEval-2026 Task 5: An LLM-based Framework for Plausibility Scoring in Narrative Word Sense Disambiguation
2026-04-17 • Computation and Language
AI summary
The authors studied how well large language models (LLMs) can identify which meanings of ambiguous words make sense in short stories, measured against human plausibility judgments. They built a system that uses structured reasoning strategies and tested both small fine-tuned models and large models prompted with examples. Their findings show that large LLMs given a few example prompts score word sense plausibility much as humans do. Combining multiple models also gave slightly better results and matched human agreement patterns more closely.
Large Language Models · Natural Language Understanding · Word Sense Disambiguation · Homonymous Words · Plausibility Scoring · Few-shot Prompting · Fine-tuning · Model Ensembling · SemEval · Narrative Contexts
Authors
Deshan Sumanathilaka, Nicholas Micallef, Julian Hough, Saman Jayasinghe
Abstract
Recent advances in language models have substantially improved Natural Language Understanding (NLU). Although widely used benchmarks suggest that Large Language Models (LLMs) can effectively disambiguate word senses, their practical applicability in real-world narrative contexts remains underexplored. SemEval-2026 Task 5 addresses this gap by introducing a task that predicts the human-perceived plausibility of a word sense within a short story. In this work, we propose an LLM-based framework for plausibility scoring of homonymous word senses in narrative texts using a structured reasoning mechanism. We examine the impact of fine-tuning low-parameter LLMs with diverse reasoning strategies, alongside dynamic few-shot prompting for large-parameter models, on accurate sense identification and plausibility estimation. Our results show that commercial large-parameter LLMs with dynamic few-shot prompting closely replicate human-like plausibility judgments. Furthermore, model ensembling slightly improves performance, better simulating the agreement patterns of five human annotators compared to single-model predictions.
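The abstract's ensembling idea can be illustrated with a minimal sketch: average per-item plausibility scores across several models, then compare the ensemble against human ratings by rank correlation. All scores below are hypothetical toy values, not the paper's data, and the plain-Python Spearman helper is an illustrative stand-in for whatever evaluation metric the task actually uses.

```python
from statistics import mean

def spearman(xs, ys):
    """Spearman rank correlation via Pearson on ranks (no tie handling)."""
    def ranks(vs):
        order = sorted(range(len(vs)), key=lambda i: vs[i])
        r = [0.0] * len(vs)
        for rank, i in enumerate(order, start=1):
            r[i] = float(rank)
        return r
    rx, ry = ranks(xs), ranks(ys)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)

# Hypothetical plausibility scores (0-1) for four story/sense items,
# one row per model in the ensemble.
model_scores = [
    [0.90, 0.20, 0.70, 0.40],
    [0.80, 0.30, 0.60, 0.50],
    [0.95, 0.10, 0.80, 0.30],
]
# Illustrative mean rating across five human annotators.
human = [0.85, 0.25, 0.65, 0.45]

# Ensemble each item by averaging across models, then score against humans.
ensemble = [mean(col) for col in zip(*model_scores)]
print(spearman(ensemble, human))
```

The averaging step smooths out individual-model outliers, which is one plausible reason an ensemble can track the aggregated judgment of several annotators more closely than any single model.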