Measuring Whether LLM Tutors Teach or Solve: A Diagnostic for Educational Impact

2026-06-15 • Artificial Intelligence

Artificial Intelligence • Computation and Language • Computers and Society • Human-Computer Interaction

AI summary

The authors studied how well large language models (LLMs) help students learn, not just how well they solve problems. They found that a model that gets the right answers doesn't always provide good teaching support, like asking helpful questions or giving hints. By comparing different models on a tutoring benchmark, the authors showed that solving problems and supporting learning are related but different skills. They suggest that future evaluations should measure these skills separately to better understand a model's educational impact.

large language models, educational tutoring, learning support, problem solving, benchmarks, pedagogy, scaffolding, active learning, evaluation metrics, student agency

Authors

Junyi Yao, Zihao Zheng, Baichuan Li

Abstract

Large language models are increasingly proposed as educational tutors, yet stronger task-solving ability does not necessarily imply stronger learning support. Motivated by recent calls to measure the social impact of NLP systems in practice, we study whether public LLM tutoring benchmarks distinguish learning-supportive behavior from mere answer production. We propose a lightweight diagnostic based on the gap between solving-oriented and pedagogy-oriented benchmark performance. Using public MathTutorBench leaderboard results, we show that these dimensions are only partially aligned: across eight publicly reported models, the correlation between solving and pedagogy composites is 0.421, and several models shift meaningfully in rank when evaluation moves from solving to pedagogy. We then analyze the public TutorBench sample and show that agency-relevant behaviors are explicitly encoded in benchmark rubrics, especially in active-learning settings that reward guiding questions, calibrated hints, and non-disclosive scaffolding. Together, these findings suggest that educational-impact evaluation should not treat task success as a sufficient proxy for learning support. We argue that public tutoring benchmarks can better support positive-impact evaluation by reporting solving-oriented and pedagogy-oriented scores separately and by making disclosure-sensitive, student-agency-preserving criteria more explicit.
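The gap-based diagnostic described in the abstract can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the abstract does not say how the solving and pedagogy composites are built or which correlation coefficient is used, so the sketch takes one composite score per model as given and uses Pearson correlation. The model names and scores are hypothetical placeholders, not MathTutorBench results.

```python
from statistics import mean


def pearson(xs, ys):
    """Pearson correlation of two equal-length score lists (assumed choice of coefficient)."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def ranks(scores):
    """Map each model to its rank, where rank 1 is the highest score."""
    ordered = sorted(scores, key=scores.get, reverse=True)
    return {name: position + 1 for position, name in enumerate(ordered)}


def diagnostic(solving, pedagogy):
    """Return (correlation, per-model score gap, per-model rank shift).

    gap   = solving composite minus pedagogy composite
    shift = solving rank minus pedagogy rank, so a positive value means
            the model places higher when judged on pedagogy.
    """
    names = sorted(solving)
    r = pearson([solving[n] for n in names], [pedagogy[n] for n in names])
    solving_rank, pedagogy_rank = ranks(solving), ranks(pedagogy)
    gap = {n: solving[n] - pedagogy[n] for n in names}
    shift = {n: solving_rank[n] - pedagogy_rank[n] for n in names}
    return r, gap, shift


if __name__ == "__main__":
    # Hypothetical composite scores in [0, 1]; not taken from any leaderboard.
    solving = {"model_a": 0.9, "model_b": 0.7, "model_c": 0.5}
    pedagogy = {"model_a": 0.5, "model_b": 0.8, "model_c": 0.6}
    r, gap, shift = diagnostic(solving, pedagogy)
    print(f"correlation: {r:.3f}")
    for name in sorted(solving):
        print(f"{name}: gap={gap[name]:+.2f}, rank shift={shift[name]:+d}")
```

Reporting the gap and the rank shift per model, rather than a single merged score, mirrors the recommendation to keep solving-oriented and pedagogy-oriented results separate. With only a handful of models, a rank-based coefficient such as Spearman's could be substituted for Pearson if the composites are not on comparable scales.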
