Tight Sample Complexity of Transformers

2026-06-08Machine Learning

Machine Learning
AI summary

The authors studied how complex depth-L Transformers are by measuring their VC dimension, which relates to how well they can learn from data. They found tight upper and lower bounds on this complexity based on the number of parameters (W), layers (L), and input length (T). They also analyzed learning when the Transformer predicts step-by-step reasoning (chain-of-thought) and determined the number of samples needed for learning under teacher forcing and general learning scenarios. Their results show how the architecture and sequence lengths influence learning difficulty in a precise way.

VC dimensionTransformersdepth-L networkssample complexitychain-of-thought learningteacher forcingautoregressive modelsparameter countlearning theoryinput sequence length
Authors
Chenxiao Yang, Nathan Srebro, Zhiyuan Li
Abstract
We tightly characterize the VC dimension of depth-$L$ Transformers with a total of $W$ parameters, mapping an input sequence of length $T$ to a single output, establishing an upper bound of $O(L W \log (T W))$ and a nearly matching lower bound of $Ω(L W \log (T W / L))$. We further tightly characterize the sample complexity of chain-of-thought learning using such a Transformer, showing teacher forcing (i.e. selecting a predictor consistent with the entire chain-of-thought on training data) learns with sample complexity $O\left(L W \log \left(\left(T+T^{\prime}\right) W\right)\right)$ and that any learning rule that uses chain-of-thought data requires at least $Ω\left(L W \log \left(\left(T+T^{\prime}\right) W / L\right)\right)$ examples, where $T$ is the input length and $T^{\prime}$ is the number of autoregressive steps.