EVENT5Ws: A Large Dataset for Open-Domain Event Extraction from Documents
2026-04-23 • Computation and Language
AI summary
The authors created a new dataset called EVENT5Ws to help computers better identify important details about events from text. This dataset is large, carefully checked by humans, and covers many types of events from different topics and places. They tested current advanced language models using this dataset and found that models trained on it work well even with events from other regions. The authors also share what they learned about making such datasets to help others build similar tools in the future.
Keywords: event extraction, dataset annotation, open-domain, large language models, machine learning, natural language processing, benchmark dataset, generalization, data verification, information extraction
Authors
Praval Sharma, Ashok Samal, Leen-Kiat Soh, Deepti Joshi
Abstract
Event extraction identifies the central aspects of events from text. It supports event understanding and analysis, which is crucial for tasks such as informed decision-making in emergencies. Therefore, it is necessary to develop automated event extraction approaches. However, existing datasets for algorithm development have limitations, including limited coverage of event types in closed-domain settings and a lack of large, manually verified datasets in open-domain settings. To address these limitations, we create EVENT5Ws, a large, manually annotated, and statistically verified open-domain event extraction dataset. We design a systematic annotation pipeline to create the dataset and provide empirical insights into annotation complexity. Using EVENT5Ws, we evaluate state-of-the-art pre-trained large language models and establish a benchmark for future research. We further show that models trained on EVENT5Ws generalize effectively to datasets from different geographical contexts, demonstrating its potential for developing generalizable algorithms. Finally, we summarize the lessons learned during dataset development and provide recommendations to support future large-scale dataset development.