EvoPool: Evolutionary Programmatic Annotation for Label-Efficient Specialized Supervision

2026-06-01, Computation and Language

Computation and Language, Artificial Intelligence
AI summary

The authors present EvoPool, a method for specialized tasks where training labels are scarce and expensive. Three specialized agents iteratively write small annotator programs; the pool of programs evolves across generations as candidates are scored on a small validation set and only those that pass viability, diversity, and marginal-contribution checks are kept. Running the resulting pool is 4500 to 31000x faster than annotating 100K examples with a large language model, and on 7 of 8 tasks spanning biomedicine, law, and complex reasoning it beats the strongest LLM annotation baseline in macro-F1. The authors also introduce EvoAgg, an aggregator that turns the annotators' votes into soft training labels.

Large Language Models, Supervised Learning, Evolutionary Algorithms, Data Annotation, Multi-Agent Systems, Biomedical Relation Extraction, Legal Text Classification, Macro-F1 Score, Semantic Features, Model Aggregation
Authors
Tianyi Xu, Yaolun Zhang, Xuan Ouyang, Huazheng Wang
Abstract
Large language models excel at general tasks but underperform smaller supervised models in specialized, high-stakes domains where training labels are costly. We address this regime with EvoPool, an evolutionary multi-agent framework inspired by Darwinian evolution. Three specialized agents iteratively propose executable annotator code, a small validation set provides a fitness signal, and a deterministic gate keeps only annotators that pass viability, diversity, and marginal-contribution checks across generations. Pool votes are mapped to soft training labels by EvoAgg, a text-aware aggregator combining semantic features with annotator-vote features. The authored pool runs at near-zero per-example cost and is 4500 to 31000x faster than LLM annotation on 100K examples. Across 7 of 8 LLM-weak specialized and complex tasks spanning biomedical relation extraction, legal-clause classification, complex reasoning, and dense multi-label biomedical classification, EvoPool beats the strongest LLM annotation baseline by an average +0.141 macro-F1, peaking at +0.301 on ChemProt and +0.265 on PubMed. Code is available at: https://github.com/tianyi0216/EvoPool
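To make the gating step in the abstract concrete, here is a minimal Python sketch of a deterministic gate that applies viability, diversity, and marginal-contribution checks in sequence. It is an illustration under stated assumptions, not the authors' implementation: the function names (`gate`, `build_pool`, `majority_accuracy`), the thresholds, the use of plain majority voting as the fitness measure, and the toy data are all hypothetical. EvoAgg, which the abstract describes as combining semantic and vote features, is not modeled here.

```python
from collections import Counter

# Hypothetical conventions: an annotator maps a text to an integer label,
# or to None when it abstains. All thresholds below are illustrative.


def majority_accuracy(vote_rows, labels):
    """Accuracy of a plain majority vote; examples with no votes count as wrong."""
    correct = 0
    for i, gold in enumerate(labels):
        cast = [row[i] for row in vote_rows if row[i] is not None]
        if cast and Counter(cast).most_common(1)[0][0] == gold:
            correct += 1
    return correct / len(labels)


def gate(candidate_votes, pool_votes, labels,
         min_coverage=0.2, min_precision=0.6, max_agreement=0.9, min_gain=0.0):
    """Return (accepted, reason) for one candidate annotator's validation votes."""
    n = len(labels)
    fired = [(v, g) for v, g in zip(candidate_votes, labels) if v is not None]

    # Viability: the annotator must fire often enough and be right when it fires.
    if not fired or len(fired) / n < min_coverage:
        return False, "viability"
    if sum(v == g for v, g in fired) / len(fired) < min_precision:
        return False, "viability"

    # Diversity: reject near-duplicates of annotators already in the pool.
    for row in pool_votes:
        same = sum(a == b for a, b in zip(candidate_votes, row)) / n
        if same > max_agreement:
            return False, "diversity"

    # Marginal contribution: adding the candidate must improve the pooled vote.
    before = majority_accuracy(pool_votes, labels) if pool_votes else 0.0
    after = majority_accuracy(pool_votes + [candidate_votes], labels)
    if after - before <= min_gain:
        return False, "marginal-contribution"
    return True, "accepted"


def build_pool(candidates, texts, labels):
    """Run each (name, annotator) pair through the gate in order."""
    pool_votes, accepted, log = [], [], []
    for name, annotate in candidates:
        cand = [annotate(t) for t in texts]
        ok, reason = gate(cand, pool_votes, labels)
        log.append((name, reason))
        if ok:
            pool_votes.append(cand)
            accepted.append(name)
    return accepted, log


# Toy validation set (invented for illustration): 1 = inhibition, 0 = other.
texts = [
    "aspirin inhibits COX-1",
    "the drug activates the receptor",
    "compound X inhibits kinase Y",
    "protein A binds protein B",
]
labels = [1, 0, 1, 0]

candidates = [
    ("kw_inhibits", lambda t: 1 if "inhibits" in t else None),
    ("kw_inhibit_dup", lambda t: 1 if "inhibit" in t else None),
    ("kw_other", lambda t: 0 if ("activates" in t or "binds" in t) else None),
    ("always_zero", lambda t: 0),
    ("kw_cox", lambda t: 1 if "COX" in t else None),
]

accepted, log = build_pool(candidates, texts, labels)
for name, reason in log:
    print(f"{name}: {reason}")
```

In this toy run, each rejected candidate fails a different check: the near-duplicate keyword rule fails diversity, the constant annotator fails viability, and the narrow `kw_cox` rule is viable and distinct but adds nothing to the pooled vote. Because the gate uses no randomness, the same candidates in the same order always yield the same pool.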