Heads, Not Backbones: Output Heads Dominate Architectures on Fat-Tailed Returns

2026-06-29

Machine Learning
AI summary

The authors studied how different parts of a deep forecasting model affect its ability to predict financial returns with fat tails (occasional extreme moves). They found that at short horizons the choice of output layer (the 'head'), which determines how uncertainty is modeled, affects probabilistic forecast quality as measured by CRPS more than the choice of core architecture (the 'backbone'). A Gaussian mixture head improves CRPS beyond the point and single-Gaussian heads, most of all in high-volatility regimes such as crises. Squared-error comparisons do not separate the models, and at longer horizons (h >= 6) the backbone matters more. Overall, the output head's ability to model tail risk is the key factor for short-horizon return forecasts.

fat-tailed returns, financial forecasting, backbone architecture, output head, Gaussian mixture model, CRPS (Continuous Ranked Probability Score), anchored walk-forward validation, volatility regimes, forecast horizon, risk management
Authors
Sichao He, Yansong Zhang
Abstract
In a deep forecasting pipeline for fat-tailed financial returns at short horizons, which matters more: the backbone architecture or the output head? We compare four modern backbones (TimesNet, DLinear, N-BEATS, iTransformer) under three output heads: a point head, a single-Gaussian density head, and a Gaussian mixture density head with K=4 components. On S&P 500 monthly log-returns (1871-2023) under anchored walk-forward validation, the three heads form a strict gradient: switching from point to Gaussian improves CRPS by about 1.3 percent, and switching from Gaussian to mixture adds roughly a further 2.4 percent. Switching between backbones, in contrast, changes CRPS by less than 1.5 percent on the point-head row and on the backbone-mean axis; the backbone spread under the density heads is larger (up to 5.1 percent on the h=1 Gaussian row, driven by N-BEATS), but the head gradient (3.7 percentage points) still dominates. The Model Confidence Set on squared errors does not exclude any of the 12 variants at the 5 percent level: the head separates them only on distributional metrics (CRPS, pinball, coverage), not on squared error. The mixture head's incremental value over a single Gaussian is largest in the highest-volatility regimes (13.9 percent in the 1970s stagflation at h=12), confirming that the mixture captures tail risk beyond what a unimodal Gaussian can express. The picture is horizon-dependent: the head dominates at short horizons, but at long horizons (h >= 6) the backbone retakes the lead, a horizon split we document against classical baselines (Section 5.1). We conclude that on fat-tailed returns at short horizons, the head dominates the backbone, and the mixture distribution adds genuine value over a single Gaussian during crisis periods when risk-management decisions actually matter.
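Because the abstract's comparison rests on CRPS, it may help to see how the score can be computed for the two density heads. The sketch below is illustrative only and is not taken from the paper: it uses the standard closed forms for a Gaussian and for a Gaussian mixture, which follow from the identity CRPS(F, y) = E|X - y| - 0.5 E|X - X'|. The function names and all numeric parameters are hypothetical, and how the authors actually compute CRPS is not stated in the abstract.

```python
import math

def _phi(x):
    """Standard normal density."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def _Phi(x):
    """Standard normal CDF, via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def _abs_mean(mu, var):
    """E|X| for X ~ N(mu, var); the building block of both closed forms."""
    sigma = math.sqrt(var)
    z = mu / sigma
    return 2.0 * sigma * _phi(z) + mu * (2.0 * _Phi(z) - 1.0)

def crps_gaussian(y, mu, sigma):
    """Closed-form CRPS of N(mu, sigma^2) at observation y (lower is better)."""
    z = (y - mu) / sigma
    return sigma * (z * (2.0 * _Phi(z) - 1.0) + 2.0 * _phi(z)
                    - 1.0 / math.sqrt(math.pi))

def crps_mixture(y, weights, mus, sigmas):
    """Closed-form CRPS of a Gaussian mixture at observation y.

    weights must be non-negative and sum to 1; sigmas must be positive.
    """
    comps = list(zip(weights, mus, sigmas))
    # E|X - y|: weighted average over components.
    first = sum(w * _abs_mean(y - m, s * s) for w, m, s in comps)
    # E|X - X'|: the difference of components i and j is
    # N(mu_i - mu_j, sigma_i^2 + sigma_j^2).
    second = sum(wi * wj * _abs_mean(mi - mj, si * si + sj * sj)
                 for wi, mi, si in comps for wj, mj, sj in comps)
    return first - 0.5 * second

# Hypothetical example: score a large negative monthly log-return under a
# single Gaussian and under a two-component mixture with a wide tail component.
y_obs = -0.20
score_gauss = crps_gaussian(y_obs, mu=0.005, sigma=0.04)
score_mix = crps_mixture(y_obs, [0.9, 0.1], [0.005, -0.02], [0.04, 0.12])
print(score_gauss, score_mix)
```

With a single component of weight 1, crps_mixture reduces to crps_gaussian, which gives a quick sanity check on any implementation. In a walk-forward evaluation, these per-observation scores would be averaged over the out-of-sample forecasts to compare heads.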