Continual Harness: Online Adaptation for Self-Improving Foundation Agents

2026-05-11

Machine Learning, Artificial Intelligence
AI summary

The authors built Gemini Plays Pokemon (GPP), a system that improved how AI plays long, complex games like Pokemon: with iterative human-guided refinement of the harness, the agent eventually began refining its own strategies over time. They then developed Continual Harness, an approach that removes the human entirely and lets the agent learn and improve continuously without restarting the game, using its past experience to get better. The method worked across different Pokemon versions, letting an agent that starts from scratch play far more efficiently and closing much of the gap to expert harnesses built by humans. Finally, they combined this with a learning loop in which the agent's improvements are used to further train the model itself, driving steady in-game progress without resetting the environment.

Foundation models, Embodied agents, Partial observability, Human-in-the-loop, Prompt optimization, Long-context memory, Self-improvement, Continual learning, Self-refinement, Process-reward co-learning
Authors
Seth Karten, Joel Zhang, Tersoo Upaa, Ruirong Feng, Wenzhe Li, Chengshuai Shi, Chi Jin, Kiran Vodrahalli
Abstract
Coding harnesses such as Claude Code and OpenHands wrap foundation models with tools, memory, and planning, but no equivalent exists for the long-horizon, partially observable decision-making faced by embodied agents. We first report our Gemini Plays Pokemon (GPP) experiments. With iterative human-in-the-loop harness refinement, GPP became the first AI system to complete Pokemon Blue, Yellow Legacy on hard mode, and Crystal without a lost battle. In the hardest stages, the agent itself began iterating on its strategy through long-context memory, surfacing emergent self-improvement signals alongside human-in-the-loop refinement. Continual Harness removes the human from this loop entirely: it is a reset-free, self-improving harness for embodied agents that formalizes and automates what we observed. Starting from only a minimal environment interface, the agent alternates between acting and refining its own prompt, sub-agents, skills, and memory, drawing on any past trajectory data. Existing prompt-optimization methods require episode resets; Continual Harness adapts online within a single run. On Pokemon Red and Emerald across frontier models, Continual Harness, starting from scratch, substantially reduces button-press cost relative to the minimalist baseline and recovers a majority of the gap to a hand-engineered expert harness, with capability-dependent gains, despite starting from the same raw interface with no curated knowledge, no hand-crafted tools, and no domain scaffolding. We then close the loop with the model itself: an online process-reward co-learning loop, in which an open-source agent's rollouts through the refining harness are relabeled by a frontier teacher and used to update the model, drives sustained in-game milestone progress on Pokemon Red without resetting the environment between training iterations.
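
The abstract's central mechanism is the reset-free act/refine alternation: starting from a minimal environment interface, the agent acts for a while, then rewrites its own prompt, skills, and memory using the trajectory accumulated so far, all within a single run. The sketch below is a minimal illustration of that control flow under assumed interfaces; the names (MinimalEnv, HarnessState, query_model), the fixed refinement cadence, and the trajectory format are illustrative assumptions, not the authors' implementation.

```python
"""Minimal sketch of a reset-free act/refine loop (illustrative only).

Assumptions (not from the paper): the environment exposes only raw
observations and button presses, the foundation model is reached through a
single query_model call, and refinement is triggered every refine_every
steps. The real Continual Harness refines prompts, sub-agents, skills, and
memory; here all of that is collapsed into one mutable HarnessState.
"""
from dataclasses import dataclass, field


@dataclass
class HarnessState:
    """Everything the agent is allowed to rewrite about itself."""
    prompt: str = "You control the game. Choose one button per step."
    skills: list[str] = field(default_factory=list)   # reusable behaviors
    memory: list[str] = field(default_factory=list)   # long-context notes


def query_model(instruction: str) -> str:
    """Placeholder for a foundation-model call (assumption)."""
    return "A"  # dummy button press / refinement note


class MinimalEnv:
    """Stand-in for the raw game interface: observation in, button out."""
    def observe(self) -> str:
        return "screen state"

    def press(self, button: str) -> float:
        return 1.0  # e.g., the cost of one button press


def run_continual_harness(env: MinimalEnv, steps: int = 1000,
                          refine_every: int = 100) -> HarnessState:
    harness = HarnessState()
    trajectory: list[tuple[str, str]] = []  # (observation, action) pairs

    for t in range(steps):  # note: no env.reset() anywhere
        obs = env.observe()
        # Act: the current prompt + recent memory condition the next button press.
        action = query_model(
            f"{harness.prompt}\nMemory: {harness.memory[-5:]}\nObs: {obs}"
        )
        env.press(action)
        trajectory.append((obs, action))

        # Refine: periodically rewrite the harness from past trajectory data.
        if (t + 1) % refine_every == 0:
            note = query_model(
                "Review these recent (obs, action) pairs and propose an "
                f"improved prompt or a reusable skill: {trajectory[-refine_every:]}"
            )
            harness.memory.append(note)                            # grow memory
            harness.skills.append(note)                            # register a skill
            harness.prompt = f"{harness.prompt}\nLearned: {note}"  # self-edit prompt

    return harness


if __name__ == "__main__":
    run_continual_harness(MinimalEnv(), steps=10, refine_every=5)
```

Unlike prompt-optimization methods that score candidate prompts over fresh episodes, every refinement step here draws only on the trajectory accumulated so far in the same run; the process-reward co-learning loop described at the end of the abstract (teacher-relabeled rollouts used to update the model) sits outside this sketch.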