Cross-Environment Neural Reranking for Sample-Efficient Action Selection in Text-Based Agents
2026-06-01 • Computation and Language
AI summary
The authors studied whether one small model can select actions across different text-based environments instead of maintaining a separate model for each. They trained DeBERTa-v3 jointly on three environments (ALFWorld, WebShop, and ScienceWorld) and found that joint training substantially improved ALFWorld performance while remaining competitive on WebShop, with three-environment results approaching those of specialized single-environment models. The method also adapts to new environments with little data, and the results suggest that diverse training data matters more than model size. An adapter-based technique (environment-aware LoRA routing with PCGrad) shows promise but is unstable across seeds. The authors plan to release their dataset and model checkpoints.
large language models, action selection, DeBERTa-v3, multi-environment training, minority-class upsampling, cross-domain transfer, fine-tuning, LoRA adapters, PCGrad, joint training
Authors
Kan Shao
Abstract
Large language model agents achieve strong performance on text-based benchmarks but incur prohibitive inference costs, motivating the use of compact neural rerankers for action selection. We investigate whether a single lightweight model can perform action selection across multiple diverse environments, a capability that would eliminate per-environment model maintenance. Training DeBERTa-v3 (184M to 434M parameters) jointly on ALFWorld, WebShop, and ScienceWorld with minority-class upsampling, we find that rebalanced two-environment joint training substantially improves over single-environment ALFWorld performance (net gain +0.412) while maintaining competitive WebShop performance (+0.214 vs. +0.249 single-environment). Three-environment training yields a mean combined net gain of +0.551 ± 0.024 across 4 seeds, with per-environment results approaching specialized single-environment models while providing positive cross-domain transfer. Cross-environment adaptation is highly sample-efficient: fine-tuning on only 9.2% of target-domain data recovers 93% of full-data performance. Scaling model capacity yields limited benefits, indicating that data diversity is the primary driver. Environment-aware LoRA adapter routing with PCGrad achieves a best-seed result of +0.611 (seed 42), with seeds 456 and 789 at +0.554 and +0.559, but exhibits high variance because seed 123 collapses to +0.263 (4-seed mean +0.497 ± 0.158), making it a promising but currently unstable direction. Joint training with clean splits and data rebalancing is a key ingredient of these results. We will release our three-environment benchmark of 51,580 training instances (41,740 raw unique states with minority-class upsampling) and all model checkpoints upon acceptance.
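The per-seed figures quoted for LoRA adapter routing with PCGrad can be checked with a few lines of arithmetic. The sketch below is illustrative only: the variable names are hypothetical, and the use of the sample standard deviation (n - 1 denominator) is an assumption, although it does reproduce the reported +0.497 +/- 0.158. It also shows how much the collapsed seed 123 drives the spread, and how many instances the upsampling adds to the 41,740 raw states.

```python
from statistics import mean, stdev

# Per-seed net gains for LoRA adapter routing with PCGrad, as quoted in the abstract.
seed_gains = {42: 0.611, 123: 0.263, 456: 0.554, 789: 0.559}

values = list(seed_gains.values())
four_seed_mean = mean(values)
# Sample standard deviation (n - 1 denominator); assumed convention, not stated in the abstract.
four_seed_std = stdev(values)

# Excluding the collapsed seed shows how strongly it dominates the variance.
stable = [gain for seed, gain in seed_gains.items() if seed != 123]
stable_mean = mean(stable)
stable_std = stdev(stable)

# Instances contributed by minority-class upsampling (total minus raw unique states).
upsampled_extra = 51_580 - 41_740

print(f"4 seeds:          {four_seed_mean:+.3f} +/- {four_seed_std:.3f}")
print(f"without seed 123: {stable_mean:+.3f} +/- {stable_std:.3f}")
print(f"instances added by upsampling: {upsampled_extra}")
```

Under these assumptions the three non-collapsed seeds average about +0.575 with a spread of roughly 0.03, comparable to the +0.551 +/- 0.024 reported for plain three-environment training, so the instability appears to come from a single failed run rather than from uniformly noisy seeds.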