A Controlled Synthetic Benchmark for Educational Aspect-Based Sentiment Analysis

2026-05-25 • Computation and Language

Computation and Language, Artificial Intelligence

AI summary

The authors built a synthetic dataset of 10,000 student course reviews labeled against a 20-aspect schema covering areas such as instructional quality and course management, intended to support aspect-based sentiment analysis in education. They benchmarked several models on this data and found the task difficult: the best model, BERT, reached a detection micro-F1 of about 0.29 after modest tuning. They also evaluated a model trained on the synthetic data against 2,829 real student reviews and observed partial transfer. The authors contribute the dataset, the generation procedure, and the benchmark for educational sentiment analysis, where real labeled data is hard to obtain.

Aspect-Based Sentiment Analysis, Synthetic Data, Educational Feedback, BERT, Micro-F1 Score, Zero-Shot Learning, Few-Shot Prompting, Natural Language Processing, Domain Adaptation, Sentiment Classification

Authors

Yehudit Aperstein, Alexander Apartsin

Abstract

Educational aspect-based sentiment analysis (ABSA) can support course improvement, but public aspect-labeled student feedback remains scarce because educational reviews are private, institution-specific, and expensive to annotate. This study introduces a controlled synthetic benchmark for educational ABSA built from 10,000 synthetic course reviews with explicit train-validation-test splits and a 20-aspect pedagogical schema spanning instructional quality, assessment and course management, learning demand, learning environment, and engagement. The corpus is generated with sampled target labels, sampled nuance attributes, and a realism-tuned prompt refined through a three-cycle judge-editor procedure. On the resulting benchmark, local baselines with TF-IDF, two-step transformers, and joint encoders show that the task is nontrivial; the strongest untuned model, BERT, reaches a held-out detection micro-F1 of 0.2760, while a modest lower-rate BERT schedule improves this to 0.2930. Full-test GPT-based inference with gpt-5.2 reaches 0.2519 micro-F1 in zero-shot mode and 0.2501 with retrieval-based few-shot prompting, placing batch inference above the classical baseline and close to the compact joint encoders. A conservative external evaluation on 2,829 mapped student-feedback reviews from Herath et al. yields a micro-F1 of 0.4593 for BERT on a 9-aspect overlap, indicating partial synthetic-to-real transfer. Realism and faithfulness analyses are reported as generator diagnostics that clarify how the benchmark was stabilized and where label noise remains. The study therefore contributes a synthetic educational ABSA corpus, a documented generation procedure, and a reproducible benchmark setting for a domain in which public labeled data remain difficult to obtain.
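The detection micro-F1 figures quoted in the abstract can be read against the following minimal sketch, which assumes that aspect detection is scored as a multi-label task where each review carries a set of gold aspects and a set of predicted aspects. The function name `micro_f1`, the optional `aspects` filter (one possible way to score only an overlapping subset of a schema, as in the 9-aspect external evaluation), and the toy aspect names are all hypothetical and are not taken from the paper; the authors' actual evaluation protocol may differ.

```python
from typing import Iterable, Optional, Sequence, Set


def micro_f1(
    gold: Sequence[Set[str]],
    pred: Sequence[Set[str]],
    aspects: Optional[Iterable[str]] = None,
) -> float:
    """Micro-averaged F1 over per-review aspect sets.

    If `aspects` is given, gold and predicted sets are first restricted
    to it, so only that subset of the schema is scored (an assumption
    about how an aspect overlap could be evaluated).
    """
    if len(gold) != len(pred):
        raise ValueError("gold and pred must cover the same reviews")
    keep = set(aspects) if aspects is not None else None
    tp = fp = fn = 0
    for g, p in zip(gold, pred):
        if keep is not None:
            g, p = g & keep, p & keep
        tp += len(g & p)   # aspects correctly detected
        fp += len(p - g)   # aspects predicted but not in gold
        fn += len(g - p)   # gold aspects that were missed
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Toy labels with made-up aspect names, for illustration only.
gold = [{"workload", "feedback"}, {"clarity"}, set()]
pred = [{"workload"}, {"clarity", "pacing"}, {"feedback"}]

score = micro_f1(gold, pred)  # pooled counts: tp=2, fp=2, fn=1
overlap_score = micro_f1(gold, pred, aspects={"workload", "clarity"})
print(round(score, 4), overlap_score)
```

Because micro-averaging pools true positives, false positives, and false negatives across all reviews and aspects before computing F1, frequent aspects weigh more heavily than rare ones, which is worth keeping in mind when comparing scores across a 20-aspect schema and a 9-aspect overlap.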
