FAME: Forecastability-Aware Mixture of Experts for Heterogeneous Time Series Forecasting

2026-06-08 • Artificial Intelligence

AI summary

The authors address the challenge of forecasting large collections of sales time series whose patterns differ substantially, for example in lifecycle, sparsity, volatility, and seasonality. They propose FAME, a system that learns which forecasting methods (experts) suit which kinds of series by analyzing each series' characteristics. FAME activates only a few experts per series, improving accuracy while keeping inference cost low. On a large real-world vending-machine sales dataset, FAME reduced MSE by 12.4% relative to the strongest single expert, LightGBM, and the results indicate that expert suitability varies systematically with the data regime. The method turns model selection into a data-driven process based on forecastability patterns.

forecasting, time series, mixture of experts, forecastability, model routing, sales data, machine learning, ensemble methods, demand forecasting, validation performance

Authors

Qianyang Li, Xingjun Zhang, Shaoxun Wang, Tao Peng, Jia Wei

Abstract

Large-scale retail and industrial forecasting systems contain many heterogeneous time series whose lifecycle, sparsity, volatility, seasonality, spectral patterns, and contextual sensitivity differ substantially. A single forecasting model rarely performs well across all regimes, while dense ensembles increase inference cost and provide limited insight into expert suitability. This paper studies forecastability-aware expert routing: learning how data characteristics determine the suitability of forecasting experts. We propose FAME, a sparse mixture-of-experts framework that represents each series with a multidimensional forecastability fingerprint, mines expert-suitability targets from validation performance, and trains a cost-aware sparse router to activate a small budgeted set of experts for each series. Using a production-scale vending-machine sales dataset from Shandong New Beiyang (SNBC), where the forecasting component has been integrated into the replenishment-planning pipeline, together with public retail benchmarks, we show that expert suitability varies systematically across data regimes. On the industrial dataset with 5,000+ machines and 60M+ transactions, FAME Top-2 reduces MSE by 12.4% over the strongest single expert, LightGBM, while executing 1.92 experts per series on average. The deployed component produces demand forecasts, while inventory-oriented gains are estimated by an offline replay simulator under a fixed replenishment policy rather than by online intervention. The framework turns heterogeneous sales forecasting from heuristic model selection into data mining of forecastability patterns and expert specialization. Code is available at https://github.com/hit636/FAME
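
The three steps named in the abstract (fingerprint, suitability targets, budgeted routing) can be illustrated with a minimal Python sketch. It is not taken from the FAME code base: the function names, the three fingerprint features, the softmax form of the suitability targets, the linear cost penalty, and all numeric values are assumptions chosen for illustration. A trained router would predict the scores from the fingerprint; here the mined targets stand in for router outputs.

```python
import numpy as np

def fingerprint(y, season=7):
    """Toy forecastability fingerprint (assumed features, not the paper's)."""
    y = np.asarray(y, dtype=float)
    mean = y.mean()
    sparsity = float(np.mean(y == 0))                # share of zero-sales periods
    cv = float(y.std() / mean) if mean > 0 else 0.0  # volatility proxy
    yc = y - mean
    denom = float(np.dot(yc, yc))
    # Autocorrelation at the seasonal lag as a seasonality-strength proxy.
    if denom > 0 and len(y) > season:
        acf = float(np.dot(yc[:-season], yc[season:]) / denom)
    else:
        acf = 0.0
    return np.array([sparsity, cv, acf])

def suitability_targets(val_errors, temperature=1.0):
    """Softmax over negative validation errors: lower error, higher suitability."""
    e = np.asarray(val_errors, dtype=float)
    w = np.exp(-(e - e.min()) / temperature)
    return w / w.sum()

def route_top_k(scores, costs, k=2, cost_weight=0.1):
    """Select the k experts with the best cost-penalized score."""
    scores = np.asarray(scores, dtype=float)
    adjusted = scores - cost_weight * np.asarray(costs, dtype=float)
    chosen = np.argsort(-adjusted)[:k]
    weights = scores[chosen] / scores[chosen].sum()  # renormalize over chosen experts
    return chosen, weights

# Hypothetical series with a weekly pattern, and made-up validation errors
# and relative inference costs for three experts.
fp = fingerprint(np.tile([0, 1, 2, 3, 4, 5, 6], 4))
targets = suitability_targets([4.0, 1.0, 2.0])
chosen, weights = route_top_k(targets, costs=[1.0, 3.0, 1.0], k=2)
print(fp, chosen, weights)
```

With `k=2` only two of the three experts run for this series, and the forecast would be the weighted combination of their outputs. Raising `cost_weight` shifts the selection toward cheaper experts at the expense of validation-based suitability.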
