REFLEX: Reflective Evolution from LLM Experience
2026-06-15 • Computation and Language
Computation and Language, Machine Learning
AI summary
The authors describe REFLEX, a system that uses visual evidence of a policy's behavior to repair and improve programmatic policies step by step. Earlier approaches interpret the visual behavior and rewrite the code in a single model call. REFLEX splits these into two parts: a Critic that interprets the visual evidence, and an Actor that writes new code based on the Critic's diagnosis and on past experience. This separation makes each mutation easier to audit and lets the system retain reusable code snippets across runs. In their tests, REFLEX solves Acrobot and Pendulum in under 10 LLM calls, reaches a best Normalized Weighted Score of 1.092 on Lunar Lander, and is also evaluated on a 36-dimensional antenna array synthesis task.
multimodal language models, evolutionary search, programmatic policies, visual diagnosis, code synthesis, transparent mutation, skill memory, control benchmarks, sample efficiency, behavioral evidence
Authors
Pan Wang
Abstract
Large multimodal language models (LLMs) have emerged as powerful tools for guiding evolutionary search toward interpretable programmatic policies. However, existing frameworks rely on a monolithic model call to simultaneously interpret visual behavioral evidence and synthesize corrective code. This diagnosis-repair entanglement creates an opaque feedback loop, obscuring the rationale behind mutations and preventing the retention of algorithmic insights across independent runs. To achieve auditable and efficient policy search, we argue that visual diagnosis must be structurally decoupled from code generation. We present REFLEX, a training-free evolutionary framework that operationalizes this decoupling. In REFLEX, a vision-enabled Critic first distills task-specific behavioral evidence into structured, auditable diagnoses. Subsequently, a text-optimized Actor synthesizes child policies using these diagnoses alongside a persistent, self-evolving Skill Memory of reusable code snippets. This architecture not only provides transparent mutation traces but also enables cross-run programmatic knowledge transfer. Extensive evaluations across control benchmarks (Lunar Lander, Acrobot, Pendulum) and a 36-dimensional antenna array synthesis task demonstrate exceptional sample efficiency. Notably, REFLEX solves Acrobot and Pendulum in under 10 LLM calls and reaches a best Normalized Weighted Score of 1.092 on Lunar Lander, achieving highly competitive final performance while significantly accelerating the early-stage discovery of transparent policies.
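The Critic/Actor/Skill Memory loop described in the abstract can be sketched roughly as follows. This is a toy illustration under stated assumptions, not the authors' implementation: the Critic and Actor are plain Python stubs rather than model calls, a "policy" is a parameter vector rather than generated code, and every name here (`Diagnosis`, `SkillMemory`, `critic`, `actor`, `evolve`, `TARGET`) is hypothetical.

```python
import random
from dataclasses import dataclass, field
from typing import List, Tuple

Policy = List[float]        # toy stand-in for a programmatic policy
Skill = Tuple[int, float]   # toy stand-in for a reusable code snippet: (index, delta)

TARGET: Policy = [0.5, -1.0, 2.0]  # hypothetical optimum of the toy task


@dataclass
class Diagnosis:
    """Structured, auditable Critic output (fields are assumptions)."""
    worst_index: int
    direction: float  # +1.0 means "increase", -1.0 means "decrease"
    note: str


@dataclass
class SkillMemory:
    """Store of edits that improved a policy; persists across runs."""
    skills: List[Skill] = field(default_factory=list)

    def add(self, skill: Skill) -> None:
        if skill not in self.skills:
            self.skills.append(skill)


def evaluate(policy: Policy) -> float:
    """Toy fitness: negative squared distance to TARGET (higher is better)."""
    return -sum((p - t) ** 2 for p, t in zip(policy, TARGET))


def critic(policy: Policy) -> Diagnosis:
    """Stub for the vision-enabled Critic: reports the worst dimension."""
    errors = [t - p for p, t in zip(policy, TARGET)]
    i = max(range(len(errors)), key=lambda k: abs(errors[k]))
    direction = 1.0 if errors[i] > 0 else -1.0
    side = "low" if direction > 0 else "high"
    return Diagnosis(i, direction, f"dimension {i} is too {side}")


def actor(policy: Policy, diag: Diagnosis, memory: SkillMemory,
          rng: random.Random) -> Tuple[Policy, Skill]:
    """Stub for the text-only Actor: edits the policy using the diagnosis,
    sometimes reusing a remembered skill that matches it."""
    matching = [s for s in memory.skills
                if s[0] == diag.worst_index and s[1] * diag.direction > 0]
    if matching and rng.random() < 0.5:
        edit = rng.choice(matching)
    else:
        edit = (diag.worst_index, diag.direction * rng.uniform(0.1, 0.5))
    child = list(policy)
    child[edit[0]] += edit[1]
    return child, edit


def evolve(steps: int, memory: SkillMemory,
           seed: int = 0) -> Tuple[Policy, float, List[str]]:
    """Diagnose, mutate, evaluate; keep improvements and remember their edits."""
    rng = random.Random(seed)
    best: Policy = [0.0] * len(TARGET)
    best_score = evaluate(best)
    trace: List[str] = []  # human-readable mutation trace
    for _ in range(steps):
        diag = critic(best)
        child, edit = actor(best, diag, memory, rng)
        score = evaluate(child)
        trace.append(diag.note)
        if score > best_score:
            best, best_score = child, score
            memory.add(edit)
    return best, best_score, trace


if __name__ == "__main__":
    shared_memory = SkillMemory()
    _, first_score, first_trace = evolve(30, shared_memory, seed=0)
    _, second_score, _ = evolve(30, shared_memory, seed=1)  # reuses the memory
    print(first_trace[0], first_score, second_score, len(shared_memory.skills))
```

Passing the same `SkillMemory` instance into a second `evolve` call is what stands in for cross-run knowledge transfer here, and the `trace` list plays the role of the transparent mutation trace. In the actual system both roles would be filled by model calls over rendered behavior and source code.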