AI summary
The authors present DuoMem, a method that helps smaller language models solve multi-step tasks nearly as well as much larger models while using far fewer resources. DuoMem teaches the smaller model in two ways: it replaces the model's self-generated memories with higher-quality memories written by a large model, and it fine-tunes a small set of added parameters (LoRA adapters) on the large model's successful task solutions. On ALFWorld, a challenging embodied decision-making benchmark, DuoMem raised a 4-billion-parameter model's success rate from 4.3% to 77.9%, approaching the 87.1% of a 72-billion-parameter teacher, while running more than three times faster and using less memory. As a result, the smaller model can handle complex tasks quickly on devices with limited computing power. The authors show that the two parts of DuoMem improve the model in complementary ways.
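As a rough illustration of the context-space idea described above, the sketch below prepends pre-computed teacher memories to a student model's input. It is a minimal sketch under stated assumptions, not the authors' implementation: the `Memory` class, the word-overlap retriever in `select_memories`, the prompt layout, and the example memory texts are all hypothetical, since the source does not specify how memories are stored, selected, or formatted.

```python
from dataclasses import dataclass


@dataclass
class Memory:
    """A teacher-written procedural memory (hypothetical structure)."""
    task_type: str  # illustrative label, e.g. "heat"
    text: str       # procedural guidance produced offline by the teacher


def select_memories(task: str, bank: list[Memory], k: int = 2) -> list[Memory]:
    """Return the k memories sharing the most words with the task.

    Word overlap is a stand-in retriever chosen for simplicity; the source
    does not say which retrieval method DuoMem uses.
    """
    task_words = set(task.lower().split())
    ranked = sorted(
        bank,
        key=lambda m: len(task_words & set(m.text.lower().split())),
        reverse=True,
    )
    return ranked[:k]


def build_student_prompt(task: str, observation: str,
                         bank: list[Memory], k: int = 2) -> str:
    """Prepend the selected teacher memories to the student's input."""
    memories = select_memories(task, bank, k)
    memory_block = "\n".join(f"- {m.text}" for m in memories)
    return (
        f"Procedural memories:\n{memory_block}\n\n"
        f"Task: {task}\n"
        f"Observation: {observation}\n"
        f"Next action:"
    )


if __name__ == "__main__":
    # Illustrative memory bank; these texts are invented for the example.
    bank = [
        Memory("heat", "To heat an object, take it to the microwave"),
        Memory("clean", "To clean an object, take it to the sinkbasin"),
        Memory("look", "examine object under the desklamp"),
    ]
    print(build_student_prompt(
        task="heat the egg and put it in the fridge",
        observation="You are in the kitchen.",
        bank=bank,
        k=1,
    ))
```

Because the memories are computed once by the teacher and stored as plain text, the student pays only for the extra prompt tokens at inference time; the parameter-space half of the method (LoRA fine-tuning on teacher trajectories) is independent of this step and is not shown here.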
Large Language Models, Model Distillation, Procedural Memory, Context-space Distillation, Parameter-space Distillation, ALFWorld, LoRA Adapters, Edge Deployment, Embodied Decision-Making
Authors
Peyman Hosseini, Ondrej Bohdal, Ahmed Alajrami, Andrea Maracani, Ignacio Castro, Matthew Purver, Mete Ozay, Savas Ozkan, Taha Ceritli
Abstract
Large Language Model (LLM)-based agents can solve complex procedural tasks by interacting with environments over multiple turns, but this ability typically depends on large models, long contexts, and repeated inference calls. This makes advanced memory-augmented agents difficult to deploy on resource-constrained devices. We introduce DuoMem, a dual-space distillation framework that transfers procedural problem-solving ability from a large teacher model to compact student models. DuoMem distils in two complementary spaces: (1) context-space distillation, which replaces student-generated memories with higher-quality teacher-generated procedural memories prepended to the student's input, and (2) parameter-space distillation, which fine-tunes lightweight LoRA adapters on successful teacher trajectories. Evaluated on ALFWorld, a challenging embodied decision-making benchmark, DuoMem boosts a 4B-parameter model from 4.3% to 77.9% task success rate, closing most of the gap to a 72B teacher model (87.1%), while adding fewer than 10M trainable parameters and only a few megabytes of pre-computed teacher memories. Moreover, the DuoMem-enhanced 4B model completes tasks over 3x faster than the 72B teacher in wall-clock time, making it viable for real-time edge deployment, which would be challenging for the teacher. Extensive ablations across eight models spanning 2B-72B parameters reveal that both distillation axes contribute complementary