Learning Sparse Compositional Functions with Norm-Constrained Neural Networks

2026-05-25Machine Learning

Machine Learning
AI summary

The authors study why deep neural networks are good at learning complex functions by building them up in layers (hierarchies). They focus on situations where the network has more parameters than data points, measuring complexity by how large the parameters are rather than just counting them. Their theory shows that deep networks can learn special types of functions built from smaller parts arranged like graphs, and they avoid the usual problems caused by high-dimensional data. This applies to many practical models, suggesting deep networks are effective at using the structure of the function they try to learn.

Deep neural networksHierarchical featuresCurse of dimensionalityOverparameterizationParameter normCompositional functionsDirected acyclic graphsFrobenius normExcess riskApproximation rates
Authors
Shuo Huang, Lorenzo Fiorito, Lorenzo Rosasco, Tomaso Poggio
Abstract
The ability of deep neural networks to learn hierarchical features is widely regarded as a key mechanism underlying their success in high-dimensional learning. Existing theory partially supports this view by establishing approximation rates based on parameter counts and sample complexity guarantees for compositional models without incurring the curse of dimensionality (CoD). To study overparameterized regimes, where the number of parameters exceeds the sample size, we develop a framework that measures complexity via the parameter norm. Within this approach, we establish approximation rates and excess risk bounds for learning sparse compositional functions whose compositional structure is represented by directed acyclic graphs (DAGs), using Frobenius norm-constrained deep neural networks. Our results have broad applicability since every function that is efficiently Turing computable admits sparse compositional representations. In particular, we cover a range of representative models, including multi-index models, binary tree structures, and general compositional architectures. The rates we derive show that deep networks can exploit the compositional structure of the target functions, effectively avoiding the CoD through hierarchical representations.