AI summary
The authors argue that when solving partial differential equations (PDEs) with neural networks, the right built-in assumptions (called inductive biases) matter more than model size. They introduce WaveLiT, a model that combines a wavelet transform with a linear attention block, a shared-weight multiscale feature pyramid, and a wavelet-domain auxiliary loss. WaveLiT models with 1-10M parameters compete with foundation models 100-1000 times their size on eight TheWell benchmarks. The gains are largest on wave- and acoustics-dominated tasks, where the wavelet-multiscale prior fits the dynamics well. The authors also find that the pattern of where the model does worse, notably on chaotic advection-dominated flows, gives useful clues about what the design captures or misses. The entire pipeline trains on a single GPU, suggesting that small, well-designed models can compete with much larger ones in this area.
Neural PDE solvers, Architectural inductive bias, Wavelet transform, Multi-resolution tokenization, Linear attention, Feature pyramid, Wavelet-domain loss, TheWell benchmarks, Dynamical systems, Model scaling
Authors
Shyam Sankaran, Hanwen Wang, Paris Perdikaris
Abstract
Neural PDE solvers have followed the scaling trajectory of vision and language, with recent foundation models reaching billions of parameters. We argue that scale is a poor substitute for architectural inductive bias in this domain: structured priors deliver outsized parameter efficiency, and the pattern of where they succeed and fail is itself informative about what they capture. We instantiate this argument in WaveLiT, an architecture combining a discrete wavelet transform for lossless multi-resolution tokenization, an augmented linear attention block, a shared-weight multiscale feature pyramid, and a wavelet-domain auxiliary loss. Bespoke 1-10M-parameter WaveLiT models compete with foundation models of 100-1000$\times$ their size across eight TheWell benchmarks, with the largest gains on wave and acoustic-dominated benchmarks where the wavelet-multiscale prior fits the dominant dynamical structure and small per-step errors do not compound geometrically under rollout. Trained jointly across all eight benchmarks, a 10M-parameter foundation variant exhibits a structured, physically interpretable transfer pattern -- strongest where the wavelet-multiscale prior matches the dynamics, weakest on chaotic advection-dominated flows. The entire pipeline trains on a single GPU. The results suggest that small-model PDE performance is shaped by architectural inductive bias rather than scale, and that the structure of a prior's failures is a useful empirical signal about its content.
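To make the phrase "lossless multi-resolution tokenization" concrete, the following is a minimal sketch of one level of a 2D discrete wavelet transform. It is an illustration under stated assumptions, not the authors' implementation: the abstract does not name the wavelet family, the number of decomposition levels, or the token layout, so the Haar wavelet and the function names `haar_dwt2` and `haar_idwt2` are hypothetical choices made here. The sketch shows that each 2x2 patch of a field maps to four coefficients (one coarse average and three details) and that the original field can be recovered from them without loss.

```python
def haar_dwt2(field):
    """One level of an orthonormal 2D Haar transform (assumed wavelet).

    `field` is a list of rows with even height and width. Returns four
    half-resolution sub-bands: coarse average, column-difference detail,
    row-difference detail, and diagonal detail.
    """
    rows, cols = len(field), len(field[0])
    assert rows % 2 == 0 and cols % 2 == 0, "grid sides must be even"
    approx, d_col, d_row, d_diag = [], [], [], []
    for i in range(0, rows, 2):
        r_a, r_c, r_r, r_d = [], [], [], []
        for j in range(0, cols, 2):
            a, b = field[i][j], field[i][j + 1]
            c, d = field[i + 1][j], field[i + 1][j + 1]
            r_a.append((a + b + c + d) / 2)  # scaled local average
            r_c.append((a - b + c - d) / 2)  # difference across columns
            r_r.append((a + b - c - d) / 2)  # difference across rows
            r_d.append((a - b - c + d) / 2)  # diagonal difference
        approx.append(r_a)
        d_col.append(r_c)
        d_row.append(r_r)
        d_diag.append(r_d)
    return approx, d_col, d_row, d_diag


def haar_idwt2(approx, d_col, d_row, d_diag):
    """Invert haar_dwt2, rebuilding the full-resolution field."""
    field = []
    for i in range(len(approx)):
        top, bottom = [], []
        for j in range(len(approx[0])):
            s, x = approx[i][j], d_col[i][j]
            y, z = d_row[i][j], d_diag[i][j]
            top.extend([(s + x + y + z) / 2, (s - x + y - z) / 2])
            bottom.extend([(s + x - y - z) / 2, (s - x - y + z) / 2])
        field.append(top)
        field.append(bottom)
    return field


# Toy 4x4 field standing in for one PDE snapshot (made-up values).
field = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
bands = haar_dwt2(field)
restored = haar_idwt2(*bands)
assert restored == field  # the transform discards no information
```

Applying `haar_dwt2` again to the coarse sub-band would yield the next, coarser resolution level, which is one plausible reading of "multi-resolution" here. Stacking the four sub-band values at each location gives a four-channel token per 2x2 patch; whether WaveLiT arranges its tokens this way is not stated in the abstract.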