G-Zero: Self-Play for Open-Ended Generation from Zero Data
2026-05-11 • Machine Learning
Machine Learning • Artificial Intelligence • Computation and Language • Emerging Technologies
AI summary
The authors present G-Zero, a method for language models to improve themselves without an external judge to verify their answers. They introduce an intrinsic reward called Hint-δ, which measures how much a model's answer shifts when it is given a helpful hint generated within the system itself. One model (the Proposer) learns to pose challenging questions and informative hints, while another (the Generator) learns to answer better by internalizing those hints. This setup lets the models co-evolve from their own internal signals rather than outside feedback, sidestepping the capability ceilings and reward hacking that limit judge-based methods.
Large Language Models (LLMs) • Self-improvement • Intrinsic Reward • Hint-δ • Proposer Model • Generator Model • Direct Preference Optimization (DPO) • GRPO • Autonomous Learning • Reward Hacking
Authors
Chengsong Huang, Haolin Liu, Tong Zheng, Runpeng Dai, Langlin Huang, Jinyuan Li, Zongxia Li, Zhepei Wei, Yu Meng, Jiaxin Huang
Abstract
Self-evolving LLMs excel in verifiable domains but struggle in open-ended tasks, where reliance on proxy LLM judges introduces capability bottlenecks and reward hacking. To overcome this, we introduce G-Zero, a verifier-free, co-evolutionary framework for autonomous self-improvement. Our core innovation is Hint-$\delta$, an intrinsic reward that quantifies the predictive shift between a Generator model's unassisted response and its response conditioned on a self-generated hint. Using this signal, a Proposer model is trained via GRPO to continuously target the Generator's blind spots by synthesizing challenging queries and informative hints. The Generator is concurrently optimized via DPO to internalize these hint-guided improvements. Theoretically, we prove a best-iterate suboptimality guarantee for an idealized standard-DPO version of G-Zero, provided that the Proposer induces sufficient exploration coverage and the data filtration keeps pseudo-label score noise low. By deriving supervision entirely from internal distributional dynamics, G-Zero bypasses the capability ceilings of external judges, providing a scalable, robust pathway for continuous LLM self-evolution across unverifiable domains.
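To make the Hint-$\delta$ reward concrete, here is a minimal sketch of a predictive-shift score using Hugging Face `transformers`. It treats Hint-$\delta$ as the gain in mean per-token log-probability of a response once the hint is added to the prompt; the model name, prompt templates, and this particular scoring rule are illustrative assumptions, not the authors' released implementation.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "Qwen/Qwen2.5-7B-Instruct"  # illustrative choice, not specified by the paper
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
model.eval()

@torch.no_grad()
def mean_logprob(prompt: str, response: str) -> float:
    """Mean per-token log-probability the model assigns to `response`
    when conditioned on `prompt`."""
    prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
    full_ids = tokenizer(prompt + response, return_tensors="pt").input_ids
    logits = model(full_ids).logits  # [1, seq_len, vocab]
    start = prompt_ids.shape[1]
    # Logits at position t predict token t + 1, so shift the slice by one.
    targets = full_ids[0, start:]
    logprobs = torch.log_softmax(logits[0, start - 1 : -1], dim=-1)
    return logprobs[torch.arange(targets.shape[0]), targets].mean().item()

def hint_delta(query: str, hint: str, response: str) -> float:
    """Predictive shift induced by the hint: how much more likely the
    response becomes once the hint is prepended. Prompt templates here
    are assumptions for illustration."""
    assisted = mean_logprob(f"{query}\nHint: {hint}\nAnswer: ", response)
    unassisted = mean_logprob(f"{query}\nAnswer: ", response)
    return assisted - unassisted
```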
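Building on that scorer, here is a schematic of one G-Zero co-evolution round as the abstract describes it: the Proposer is rewarded (via GRPO) for queries and hints that shift the Generator the most, while the Generator does DPO on pairs of hint-assisted versus unassisted responses. The `proposer`/`generator` wrappers, the `grpo_step`/`dpo_step` trainers, and the filtering threshold are hypothetical placeholders, not the paper's code.

```python
def g_zero_round(proposer, generator, num_queries: int = 64,
                 delta_threshold: float = 0.0):
    """One co-evolution round (schematic sketch; helper names are assumptions)."""
    dpo_pairs, proposer_rollouts = [], []

    for _ in range(num_queries):
        # Proposer synthesizes a challenging query plus an informative hint.
        query, hint = proposer.generate_query_and_hint()

        # Generator responds twice: unassisted, then conditioned on the hint.
        unassisted = generator.generate(query)
        assisted = generator.generate(f"{query}\nHint: {hint}")

        # Intrinsic reward: predictive shift attributable to the hint.
        delta = hint_delta(query, hint, assisted)
        proposer_rollouts.append((query, hint, delta))

        # Keep only clear improvements, so pseudo-label score noise stays
        # low (the abstract's filtering step; the threshold is an assumption).
        if delta > delta_threshold:
            dpo_pairs.append({"prompt": query,
                              "chosen": assisted,
                              "rejected": unassisted})

    grpo_step(proposer, proposer_rollouts)  # Proposer learns to target blind spots
    dpo_step(generator, dpo_pairs)          # Generator internalizes hint-guided gains
```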