Natural Ungrokking: Asymmetric Control of Which Rules Survive Pretraining

2026-06-24 · Machine Learning

Machine Learning, Artificial Intelligence, Computation and Language
AI summary

The authors found that during training, a small language model learns a rule linking girls' names to the pronoun 'she', but later unlearns it even though the training data still supports it. They call this within-run reversal 'natural ungrokking' and argue that which rules persist depends on how often the training data shows the rule winning over competing patterns. They show that a competing surface pattern displaces the learned rule, and that control through data edits is asymmetric: editing the data can destroy the rule, but adding supporting data back does not restore it. The same dynamics appear across different corpora, training budgets, and model sizes.

language model, pronoun-gender rule, training dynamics, natural ungrokking, corpus statistics, model forgetting, log-probability margin, behavioral collapse, data-to-parameter ratio, model scaling
Authors
Juliana Li, Diya Sreedhar
Abstract
Midway through an ordinary pretraining run, a small language model learns the pronoun-gender rule: cued with a girl's name ("Sue cried because"), it resolves the next pronoun to she, generalizing to held-out probes (0.94 by step 925). By step 3,500 the same model scores near zero on the same probes, although the rule's evidence is still in the training data. We call this within-run reversal natural ungrokking: the corpus decides, with no trace in the loss curve, which learned rules a model keeps. Which rules survive is predictable from one corpus statistic: how often the training stream shows the rule winning. Across un-intervened runs (two corpora, three budgets, three seeds), support frequency decides a rule's fate; the data-to-parameter ratio only modulates how deeply a doomed rule falls. The same emerge-then-collapse dynamics appear in public Pythia checkpoints, collapse depth ordered by model scale as predicted. The forgetting is a displacement: a competing surface pattern out-competes the rule, and the log-probability margin between them crosses zero within 100 training steps of the behavioral collapse. Control over this fate is asymmetric: the same edit that destroys a rule on demand cannot restore it. Flipping support to counter-evidence in place kills the rule with monotone dose-response in two unrelated rules; but injecting support back, even to 450 times the level that naturally sustains it, buys no recovery. Every confirmatory threshold and prediction was pre-registered before the data it governed was read.
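
The log-probability margin mentioned in the abstract can be illustrated with a minimal sketch. It assumes the margin is the log-probability of the rule-consistent next token (for example "she") minus that of the competing token, measured at each checkpoint. The function names, token indices, and this exact definition are illustrative assumptions, not the authors' code.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over a list of next-token logits."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]

def margin(logits, rule_id, competitor_id):
    """Log-probability margin of the rule token over the competitor token.

    Assumed definition: log p(rule) - log p(competitor). For single tokens
    this equals the raw logit difference, since the normalizer cancels.
    """
    lp = log_softmax(logits)
    return lp[rule_id] - lp[competitor_id]

def first_zero_crossing(steps, margins):
    """Return the first step where the margin falls from positive to
    non-positive, or None if it never does."""
    for i in range(1, len(margins)):
        if margins[i - 1] > 0 and margins[i] <= 0:
            return steps[i]
    return None
```

Applied to a per-checkpoint series of margins, `first_zero_crossing` gives the step at which the competing pattern overtakes the rule, which could then be compared with the step at which probe accuracy collapses.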