Continual Visual and Verbal Learning Through a Child's Egocentric Input
2026-06-03 • Computer Vision and Pattern Recognition
Computer Vision and Pattern Recognition, Artificial Intelligence, Computation and Language
AI summary
The authors introduce BabyCL, a learning system that mimics how children learn words by processing a child's egocentric video and its paired language in the order they naturally occur, without cycling through the data many times. It learns to connect words to what appears in the video by seeing the recording once, in chronological order, and it uses temporal segmentation and replay buffers to retain earlier experience. Tested on the SAYCam dataset of children's egocentric recordings, BabyCL outperforms other streaming learning baselines. The results suggest that word learning can emerge under realistic, continuous conditions similar to how children experience the world.
continual learning, multimodal learning, egocentric video, contrastive learning, streaming data, replay buffer, word-referent mapping, temporal segmentation, SAYCam dataset, 4AFC benchmark
Authors
Xiaoyang Jiang, Yanlai Yang, Kenneth A. Norman, Brenden Lake, Mengye Ren
Abstract
Children learn the meanings of words from a continuous, temporally structured stream of egocentric experience. Recent work shows that neural networks can also learn word-referent mappings from a child's egocentric video recordings, but these models cycle through the shuffled data for hundreds of epochs, unlike how children actually encounter their environment. We introduce BabyCL, a continual multimodal learning framework that processes the SAYCam dataset in a single chronological pass, combining streaming visual representation learning with an image-text contrastive objective. BabyCL pairs a multi-stage temporal segmentation of the stream with a dual replay buffer that independently manages visual and multimodal histories, and it is jointly trained with three contrastive losses on a shared backbone. Under a matched optimization budget, BabyCL outperforms streaming learning baselines on the SAYCam Labeled-S 4AFC benchmark, substantially narrowing the gap to the upper bound set by offline training. Ablations show that the gains are robust to the length of the online temporal segmentation window and to the eviction rule of the replay buffer. Together, these results show that meaningful word-referent mappings can emerge under training conditions much closer to a child's actual experience.
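The 4AFC (four-alternative forced-choice) evaluation named in the abstract can be illustrated with a minimal sketch. The abstract does not spell out the scoring rule, so the following assumes that each trial presents one word and four candidate images (one target, three foils), and that the model chooses the image whose embedding has the highest cosine similarity to the word embedding. The function names, the toy embeddings, and the use of cosine similarity are illustrative assumptions, not the authors' code.

```python
import math
from typing import Iterable, Sequence, Tuple

Vector = Sequence[float]
# One trial: (word embedding, four image embeddings, index of the target image).
Trial = Tuple[Vector, Sequence[Vector], int]


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity between two vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0


def four_afc_choice(word_emb: Vector, image_embs: Sequence[Vector]) -> int:
    """Return the index of the candidate image most similar to the word."""
    if len(image_embs) != 4:
        raise ValueError("a 4AFC trial needs exactly four candidate images")
    sims = [cosine(word_emb, img) for img in image_embs]
    return max(range(4), key=sims.__getitem__)


def four_afc_accuracy(trials: Iterable[Trial]) -> float:
    """Fraction of trials in which the chosen image is the target."""
    trials = list(trials)
    if not trials:
        raise ValueError("no trials given")
    correct = sum(four_afc_choice(w, imgs) == target for w, imgs, target in trials)
    return correct / len(trials)


# Toy 2-D embeddings (hypothetical values, for illustration only).
toy_trials = [
    ([1.0, 0.0], [[0.9, 0.1], [0.0, 1.0], [-1.0, 0.0], [0.1, 0.9]], 0),
    ([0.0, 1.0], [[1.0, 0.0], [0.2, 0.8], [-1.0, 0.0], [0.5, -0.5]], 1),
    ([1.0, 1.0], [[1.0, -1.0], [-1.0, -1.0], [0.0, 1.0], [1.0, 0.9]], 2),
]
toy_accuracy = four_afc_accuracy(toy_trials)
```

With four alternatives, a model that guesses at random is expected to score 25%, which is the natural reference point when reading 4AFC accuracies. In the toy trials above, the first two are answered correctly and the third is not, because a foil lies closer to the word embedding than the target does.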