Why Machines Misread Pedagogical Quality: Human-Machine Alignment in LLM-Based Pretest Question Evaluation

2026-06-22 · Human-Computer Interaction

AI summary

The authors studied how to use AI to help create good pretest questions for learning. They developed a workflow in which AI generates questions, rates them against a rubric, and the best ones are selected iteratively. They found that disagreements between human and AI ratings were systematic rather than random, and that revising the rubric improved agreement more than rationale-first evaluation did, although the two changes complemented each other. This means that making AI-assisted pretesting work at scale depends heavily on defining quality in ways AI can interpret. The authors highlight that it is not just about AI generating questions but about how pedagogical quality is specified for AI to judge.

pretest questions, AI-assisted generation, rubric-based evaluation, human-machine alignment, cognitive depth, learning objectives, iterative selection, pedagogical quality, machine interpretation
Authors
Pei-Yu Tseng, Mahir Akgun, Peng Liu
Abstract
Designing effective pretest questions is challenging at scale: high-quality questions require careful calibration of openness, cognitive depth, and alignment with learning objectives, yet generating and evaluating them manually is time-consuming. We present an AI-assisted workflow for pretest question development that combines automated generation, rubric-based evaluation, and iterative selection. Because the workflow relies on machine evaluation to filter questions at scale, we investigate the alignment between human and machine judgments across a 2x2 design varying rubric operationalization and evaluation mode. Our findings show that human-machine disagreements are systematic rather than random, that rubric revision has a larger effect on alignment than rationale-first evaluation, and that the two interventions are complementary. These findings highlight that scalable AI-assisted pretesting depends not only on generation capability but on how pedagogical quality is operationalized for machine interpretation.