RubricEM: Meta-RL with Rubric-guided Policy Decomposition beyond Verifiable Rewards
2026-05-11 • Computation and Language • Machine Learning
AI summary
The authors address the challenge of training AI research agents that work through complex, multi-step tasks with no single verifiable answer. They propose RubricEM, an approach that uses rubrics not just to judge final results but to guide the entire decision process, breaking it into stages such as planning and evidence gathering. RubricEM provides detailed stage-level feedback during training and helps the agent turn past attempts into reusable guidance for future ones. The method achieves strong results on multiple long-form research benchmarks, outperforming comparable open models.
reinforcement learning, rubrics, policy decomposition, meta-policy, long-horizon optimization, credit assignment, self-reflection, deep research agents, evidence synthesis, stagewise feedback
Authors
Gaotang Li, Bhavana Dalvi Mishra, Zifeng Wang, Jun Yan, Yanfei Chen, Chun-Liang Li, Long T. Le, Rujun Han, George Lee, Hanghang Tong, Chen-Yu Lee, Tomas Pfister
Abstract
Training deep research agents, i.e., systems that plan, search, evaluate evidence, and synthesize long-form reports, pushes reinforcement learning beyond the regime of verifiable rewards: their outputs lack ground-truth answers, their trajectories span many tool-augmented decisions, and standard post-training offers few mechanisms for turning past attempts into reusable experience. In this work, we argue that rubrics should serve not merely as final-answer evaluators, but as the shared interface that structures policy execution, judge feedback, and agent memory. Building on this view, we introduce RubricEM, a rubric-guided reinforcement learning framework that combines stagewise policy decomposition with reflection-based meta-policy evolution. RubricEM first makes research trajectories stage-aware by conditioning planning, evidence gathering, review, and synthesis on self-generated rubrics. It then assigns credit with Stage-Structured GRPO, which uses stagewise rubric judgments to provide denser semantic feedback for long-horizon optimization. In parallel, RubricEM trains a shared-backbone reflection meta-policy that distills judged trajectories into reusable, rubric-grounded guidance for future attempts. The resulting RubricEM-8B achieves strong performance across four long-form research benchmarks, outperforming comparable open models and approaching proprietary deep-research systems. Beyond final performance, we conduct thorough analyses to identify the key ingredients of RubricEM.
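To make the credit-assignment idea concrete, here is a minimal Python sketch of how Stage-Structured GRPO's stagewise advantages might be computed, assuming a rubric judge that scores each of the four stages of every sampled trajectory and GRPO-style normalization applied within the group; the stage names, function signature, and per-stage normalization are illustrative assumptions, not the paper's released implementation.

import numpy as np

# Hypothetical stage split mirroring the decomposition named in the abstract.
STAGES = ["planning", "evidence_gathering", "review", "synthesis"]

def stage_structured_advantages(stage_scores: np.ndarray,
                                eps: float = 1e-8) -> np.ndarray:
    """Group-relative advantages from stagewise rubric scores.

    stage_scores: array of shape (G, S), where G is the number of
    trajectories sampled for the same research prompt (the GRPO group)
    and S is the number of stages. Each entry is a rubric-judge score
    for one stage of one trajectory.

    Following the GRPO recipe, each stage's scores are normalized
    within the group: A[g, s] = (r[g, s] - mean_s) / (std_s + eps).
    Tokens emitted during stage s of trajectory g would then be
    optimized with advantage A[g, s], giving denser stage-level
    feedback than a single trajectory-level reward.
    """
    mean = stage_scores.mean(axis=0, keepdims=True)  # shape (1, S)
    std = stage_scores.std(axis=0, keepdims=True)    # shape (1, S)
    return (stage_scores - mean) / (std + eps)

# Toy usage: 4 sampled trajectories judged on the 4 stages, scores in [0, 1].
rng = np.random.default_rng(0)
scores = rng.uniform(size=(4, len(STAGES)))
print(stage_structured_advantages(scores).round(2))

Under these assumptions, normalizing each stage within the sampled group keeps advantages comparable across stages, so a trajectory with a strong plan but a weak synthesis receives credit and blame separately rather than a single blended reward, which is one way to realize the denser semantic feedback the abstract describes.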