Beyond Majority Voting: Agreement-Based Clustering to Model Annotator Perspectives in Subjective NLP Tasks

2026-05-11 · Computation and Language

AI summary

The authors study how annotators disagree when labeling data for subjective language tasks such as sentiment analysis, emotion classification, and hate speech detection. Instead of keeping only the most common label, they group annotators by how much they agree with one another. They test this method on 40 datasets in 18 languages and find that these groups improve task performance more than majority voting or modeling individual annotators does. They also show that multi-label and multitask learning combine these groups more effectively than ensemble or majority-vote aggregation.

annotator disagreement, majority voting, annotation aggregation, sentiment analysis, emotion classification, hate speech detection, agreement-based clustering, multi-label learning, multitask learning
Authors
Tadesse Destaw Belay, Ibrahim Said Ahmad, Idris Abdulmumin, Abinew Ali Ayele, Alexander Gelbukh, Eusebio Ricárdez-Vázquez, Olga Kolesnikova, Shamsuddeen Hassan Muhammad, Seid Muhie Yimam
Abstract
Disagreement in annotation is a common phenomenon in the development of NLP datasets and serves as a valuable source of insight. While majority voting remains the dominant strategy for aggregating labels, recent work has explored modeling individual annotators to preserve their perspectives. However, modeling each annotator is resource-intensive and remains underexplored across NLP tasks. We propose an agreement-based clustering technique to model the disagreement between annotators. We conduct comprehensive experiments on 40 datasets in 18 typologically diverse languages, covering three subjective NLP tasks: sentiment analysis, emotion classification, and hate speech detection. We evaluate four aggregation approaches: majority vote, ensemble, multi-label, and multitask. The results demonstrate that agreement-based clustering can leverage the full spectrum of annotator perspectives and significantly enhance classification performance in subjective NLP tasks compared to majority voting and individual annotator modeling. Regarding the aggregation approach, the multi-label and multitask approaches are better suited to modeling clustered annotators than the ensemble and majority vote approaches.
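To make the idea of agreement-based clustering concrete, the following is a minimal sketch and not the authors' implementation. The abstract does not specify the agreement measure or the clustering algorithm, so the raw agreement rate, the greedy average-linkage merge, the `threshold` value, and all function names here are illustrative assumptions. The sketch groups annotators whose labels tend to match, then takes a majority label within each group, which yields one target per cluster that a multi-label or multitask model could be trained on.

```python
from collections import Counter
from itertools import combinations


def pairwise_agreement(a, b):
    """Fraction of co-annotated items on which two annotators chose the same label."""
    shared = a.keys() & b.keys()
    if not shared:
        return 0.0  # assumption: no overlapping items counts as no agreement
    return sum(a[i] == b[i] for i in shared) / len(shared)


def cluster_annotators(annotations, threshold=0.6):
    """Greedy average-linkage clustering over agreement scores (assumed algorithm)."""
    clusters = [[name] for name in sorted(annotations)]

    def linkage(c1, c2):
        # Mean agreement over all cross-cluster annotator pairs.
        scores = [
            pairwise_agreement(annotations[x], annotations[y])
            for x in c1
            for y in c2
        ]
        return sum(scores) / len(scores)

    while len(clusters) > 1:
        # Find the most similar pair of clusters.
        score, i, j = max(
            (linkage(clusters[i], clusters[j]), i, j)
            for i, j in combinations(range(len(clusters)), 2)
        )
        if score < threshold:
            break  # remaining clusters disagree too much to merge
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters


def cluster_targets(annotations, clusters):
    """For every item, the majority label inside each cluster (None if unlabeled)."""
    items = sorted(set().union(*(a.keys() for a in annotations.values())))
    targets = {}
    for item in items:
        row = []
        for members in clusters:
            votes = Counter(
                annotations[m][item] for m in members if item in annotations[m]
            )
            row.append(votes.most_common(1)[0][0] if votes else None)
        targets[item] = row
    return targets


# Toy data (hypothetical): four annotators, four items, two labels.
annotations = {
    "ann1": {"t1": "pos", "t2": "neg", "t3": "pos", "t4": "neg"},
    "ann2": {"t1": "pos", "t2": "neg", "t3": "pos", "t4": "pos"},
    "ann3": {"t1": "neg", "t2": "pos", "t3": "neg", "t4": "pos"},
    "ann4": {"t1": "neg", "t2": "pos", "t3": "neg", "t4": "neg"},
}
clusters = cluster_annotators(annotations, threshold=0.6)
targets = cluster_targets(annotations, clusters)
```

On this toy data the two clusters hold opposite views on items `t1` to `t3`, so each item keeps two cluster-level labels instead of a single aggregated one. A chance-corrected measure such as Cohen's kappa could replace the raw agreement rate; that choice is likewise an assumption rather than something the abstract states.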