Counterfactual learning of new adaptive instructional policies using logged data

2026-06-22
Machine Learning

AI summary

The authors introduce a method to improve how intelligent tutoring systems choose lessons by learning from past student interactions instead of costly experiments. They use a Rasch model that places student ability and question difficulty on the same continuous scale, and treat teaching as a problem of choosing challenges that keep students engaged ('in flow'). Their approach estimates how previous tutoring decisions were made in order to evaluate and improve new teaching strategies. Tests on four large real-world datasets show that the method finds policies that improve on the logged one within seconds of computation, without needing more data.

Intelligent Tutoring Systems, Contextual Bandits, Offline Learning, Rasch Model, Latent Proficiency, Task Difficulty, Reward Function, Off-Policy Evaluation, Adaptive Learning, Flow Theory
Authors
Samuel Girard, Sein Minn, Amel Bouzeghoub, Jill-Jênn Vie
Abstract
Optimizing instructional policies in Intelligent Tutoring Systems (ITS) typically requires costly online experimentation or student simulators that may fail to capture real-world dynamics. This paper introduces an offline contextual bandit framework that learns new adaptive policies directly from logged interaction data. By mapping student-item interactions onto a continuous latent proficiency-difficulty scale using a Rasch model, we cast the tutoring process as a continuous stochastic bandit problem. We propose a novel reward function designed to optimize "flow" by balancing task challenge with student success. Our approach includes a round-specific behavior policy estimation that serves as both a propensity model for off-policy evaluation and a diagnostic tool for ITS adaptivity. We demonstrate the efficacy of this framework across four large-scale real-world datasets, achieving consistent policy improvements over the logged behavior policy. The results show that effective instructional policies can be learned and visualized within seconds of computation, providing a scalable path for improving adaptive learning systems without further data collection.
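
To make two ingredients of the abstract concrete, the following is a minimal sketch, not the authors' implementation. It shows the standard Rasch success probability, sigmoid(proficiency - difficulty), and a textbook inverse propensity scoring (IPS) estimate of a new policy's value from logged data. The function names, the discrete-action simplification, and the placeholder rewards are assumptions made for illustration; the paper's flow-based reward and its continuous-action treatment are not reproduced here.

```python
import math


def rasch_success_probability(theta: float, b: float) -> float:
    """Rasch model: probability that a student with proficiency `theta`
    answers an item of difficulty `b` correctly, sigmoid(theta - b)."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))


def ips_value(rewards, target_probs, behavior_probs) -> float:
    """Textbook IPS estimate of a target policy's expected reward.

    rewards[i]        : reward observed for the i-th logged decision
                        (placeholder values; not the paper's flow reward)
    target_probs[i]   : probability the new policy assigns to the logged action
    behavior_probs[i] : estimated probability the logging policy assigned to it
    """
    if not (len(rewards) == len(target_probs) == len(behavior_probs)):
        raise ValueError("all inputs must have the same length")
    if not rewards:
        raise ValueError("need at least one logged decision")
    total = 0.0
    for r, pi, mu in zip(rewards, target_probs, behavior_probs):
        if mu <= 0.0:
            raise ValueError("behavior propensities must be positive")
        # Reweight each logged reward by how much more (or less) likely
        # the new policy is to take the logged action.
        total += r * pi / mu
    return total / len(rewards)


if __name__ == "__main__":
    # Equal proficiency and difficulty gives a 50% chance of success.
    print(rasch_success_probability(0.0, 0.0))
    # Hypothetical log of three decisions with made-up propensities.
    print(ips_value([1.0, 0.0, 1.0], [0.6, 0.3, 0.5], [0.5, 0.5, 0.5]))
```

When the target and behavior propensities coincide, the IPS estimate reduces to the plain average of the logged rewards, which is a quick sanity check for an estimated behavior policy.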