Learning Dynamics Reveal a Hierarchy of Weight-Induced Layerwise Gram Metrics

2026-06-08 | Machine Learning

AI summary

The authors analyze gradient-descent training of feed-forward ReLU networks with a fixed readout and quadratic loss, describing it through quantities defined over the training set rather than through individual weights. For networks with one hidden layer, the training dynamics can be written as a closed equation for the residuals, governed by a kernel that combines the geometry of the inputs with a co-activation matrix tracking neuron activity. In deeper networks the residual dynamics keep a clean layer-by-layer kernel structure, but from depth three onward, closing the equations requires a hierarchy of Gram operators, built from the weights, that carry information across layers.

feed-forward ReLU networks, gradient descent, quadratic loss, activation dynamics, residuals, collective kernel, input geometry, co-activation matrix, Gram operators, layer-wise structure
Authors
Claudio Nordio
Abstract
We study feed-forward ReLU networks with fixed readout and quadratic loss. The aim is to rewrite gradient descent not primarily as a dynamics in weight space, but as a collective dynamics closed in terms of fields defined on the training-set space. For a single hidden layer, the weight variables can be eliminated from the activation dynamics, yielding a closed equation for the residuals governed by a collective kernel that factorizes into an input-geometric matrix and a dynamical co-activation matrix. For deeper networks, the residual dynamics retains a clean layer-wise kernel structure. However, from depth three onward, closure requires a hierarchy of weight-induced Gram operators that mediate information transport across layers.
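
The single-hidden-layer statement can be illustrated numerically. The sketch below is not taken from the paper: it assumes the network f(x) = sum_i a_i relu(w_i . x) with the readout a held fixed, and it assumes that "factorizes" means an elementwise (Hadamard) product of the input Gram matrix G = X X^T with a readout-weighted co-activation matrix C. The function names, the 1/sqrt(m) readout scaling, and the random data are all hypothetical choices for illustration, and the paper's normalization may differ. Under these assumptions, the kernel that drives the residuals under gradient flow on the quadratic loss is J J^T, where J is the Jacobian of the outputs with respect to the hidden weights, and the code checks that it equals G * C.

```python
import numpy as np


def factorized_kernel(X, W, a):
    """Return (K, G, C) with K = G * C for f(x) = sum_i a_i * relu(w_i . x).

    X: (n, d) training inputs, W: (m, d) hidden weights, a: (m,) fixed readout.
    """
    G = X @ X.T                        # input-geometric Gram matrix, shape (n, n)
    S = (X @ W.T > 0).astype(float)    # activation pattern of each neuron, shape (n, m)
    C = (S * a**2) @ S.T               # readout-weighted co-activation matrix, (n, n)
    return G * C, G, C                 # elementwise (Hadamard) product


def jacobian_kernel(X, W, a):
    """Same kernel computed directly as J J^T, with J = d f(x_mu) / d W."""
    S = (X @ W.T > 0).astype(float)
    # d f(x_mu) / d w_i = a_i * 1[w_i . x_mu > 0] * x_mu
    J = (S * a)[:, :, None] * X[:, None, :]    # shape (n, m, d)
    J = J.reshape(X.shape[0], -1)              # flatten the weight indices
    return J @ J.T


# Hypothetical toy data: 6 training points in 4 dimensions, 32 hidden neurons.
rng = np.random.default_rng(0)
n, d, m = 6, 4, 32
X = rng.normal(size=(n, d))
W = rng.normal(size=(m, d))
a = rng.choice([-1.0, 1.0], size=m) / np.sqrt(m)

K, G, C = factorized_kernel(X, W, a)
kernels_agree = np.allclose(K, jacobian_kernel(X, W, a))
```

In this sketch G depends only on the training inputs, while C depends on the weights only through the activation pattern S, which is why C changes during training and G does not.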