ManimAgent: Self-Evolving Multimodal Agents for Visual Education

2026-06-29

Artificial Intelligence
AI summary

The authors study how an AI agent can carry lessons from past tasks into future ones instead of discarding them when each task ends. They create ManimAgent, which stores past success rationales and validated failure patterns in a two-channel memory bank built only from its own task history, without retraining the model. The agent writes Python code with the Manim library to animate sections of scientific papers, and a vision-language model scores the rendered keyframes. Their tests show that as the memory grows, human-judged first-attempt success rises and the agent needs fewer reflection rounds, compared with baselines that use no memory, retrieval-augmented generation at a matched budget, or shuffled memory.

multi-round reflection, large language models, code generation, episodic memory, Manim library, vision-language model, retrieval-augmented generation, reference examples, known pitfalls
Authors
Wenjia Jiang, Zongyuan Cai, Yuanhang Shao, Chenru Wang, Boyan Han, Zhixue Song, Keyu Chen, Shengwei An, Xu Yang, Zhou Yang
Abstract
Multi-round reflection lets agents built on large language models recover from failures within a single task, but each task remains an isolated episode: lessons learned across many reflection rounds on one task are discarded before the next begins. We study this gap on a code-generation task: from a scientific paper section, the agent writes Python in the open-source Manim library to render a mathematical animation. We present ManimAgent, a self-evolving multimodal agent that carries reflection experience across tasks through a dual-channel Episodic Memory Bank grown entirely from its own task stream, with no weight updates and no human seeds. After each animation converges, a vision-language model scores the rendered keyframes; the resulting signals populate a positive channel M+ that stores success rationales as soft Reference Examples, and a negative channel M- that stores validated failure patterns as hard Known Pitfalls. On a fixed-probe evaluation against no-memory, matched-budget retrieval-augmented generation, and shuffled-memory baselines, blind human Pass@1 rises and reflection rounds fall as memory size grows. We will release the code, frozen memory snapshots, and the task stream.
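
The dual-channel memory described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not say how the vision-language model's signals are routed into M+ and M-, how entries are retrieved, or how they enter the prompt. The class names, the `threshold` cut-off, the `failure_validated` flag, the token-overlap similarity, and the prompt layout are all hypothetical choices made for illustration only.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


def _jaccard(a: str, b: str) -> float:
    """Token-overlap similarity; an assumed stand-in for a real retriever."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


@dataclass
class MemoryEntry:
    task: str  # short description of the task that produced this entry
    note: str  # success rationale (M+) or failure pattern (M-)


@dataclass
class EpisodicMemoryBank:
    positive: List[MemoryEntry] = field(default_factory=list)  # M+
    negative: List[MemoryEntry] = field(default_factory=list)  # M-
    threshold: float = 0.7  # hypothetical score cut-off, not from the paper

    def update(self, task: str, note: str, vlm_score: float,
               failure_validated: bool = False) -> None:
        """Route one finished task into a channel (assumed routing rule)."""
        if vlm_score >= self.threshold:
            self.positive.append(MemoryEntry(task, note))
        elif failure_validated:
            # Only failures flagged as validated are stored as pitfalls.
            self.negative.append(MemoryEntry(task, note))

    def retrieve(self, task: str, k: int = 2
                 ) -> Tuple[List[MemoryEntry], List[MemoryEntry]]:
        """Return the k most similar entries from each channel."""
        def top(entries: List[MemoryEntry]) -> List[MemoryEntry]:
            ranked = sorted(entries, key=lambda e: _jaccard(task, e.task),
                            reverse=True)
            return ranked[:k]
        return top(self.positive), top(self.negative)

    def build_prompt(self, task: str, k: int = 2) -> str:
        """Present M+ as soft references and M- as hard constraints."""
        refs, pitfalls = self.retrieve(task, k)
        lines = [f"Task: {task}"]
        if refs:
            lines.append("Reference Examples (soft guidance):")
            lines += [f"- {e.note}" for e in refs]
        if pitfalls:
            lines.append("Known Pitfalls (must avoid):")
            lines += [f"- {e.note}" for e in pitfalls]
        return "\n".join(lines)


# Usage: grow the bank from a (made-up) task stream, then query it.
bank = EpisodicMemoryBank()
bank.update("animate gradient descent on a loss surface",
            "Draw and label the axes before moving the dot.", vlm_score=0.9)
bank.update("animate fourier series of a square wave",
            "Formula labels overlapped the plotted curve.", vlm_score=0.2,
            failure_validated=True)
print(bank.build_prompt("animate gradient descent with momentum"))
```

In this sketch the bank grows only through `update`, so no model weights change and no hand-written seed entries are needed, which mirrors the property the abstract claims. A shuffled-memory baseline of the kind the abstract mentions could be approximated here by pairing each `note` with a random `task` before retrieval, though how the authors construct that baseline is not stated.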