AI summary
The authors study how to train language models when data is spread across multiple sites with limited communication capacity, like in hospitals or research groups. They focus on understanding what is statistically possible when the amount of data each site can send is limited, rather than building a ready-to-use system. They analyze two methods: one for training (FPLD) that balances accuracy with data size and communication limits, and one for making predictions (FC-RAG) that factors in how much information can be retrieved from each site. Their mathematical results show how bandwidth affects learning and prediction quality, and simple tests confirm these trends on synthetic and small real data, but a full practical test was not done.
Keywords: distributed learning, federated learning, quantization, bandwidth constraints, KL-consistency, marginal coverage, language models, retrieval-augmented generation, conformal prediction, statistical guarantees
Abstract
Training a language model on data scattered across bandwidth-limited nodes that cannot be centralized is a setting that arises in clinical networks, enterprise knowledge bases, and scientific consortia. We study the regime in which data must remain distributed across nodes, and ask what statistical guarantees are in principle achievable under explicit bandwidth budgets; we aim to characterize what is provably possible, not to demonstrate a deployment-ready system. Existing theory treats either training-time consistency or inference-time calibration in isolation, and none makes bandwidth a first-class statistical parameter. We analyze two protocols, Federated Probe-Logit Distillation (FPLD) for training and Federated Conformal RAG (FC-RAG) for inference, as the analytical vehicles for our results. Our first main result is an explicit high-probability KL-consistency rate for FPLD with simultaneous dependence on node count $K$, per-node sample size $n$, quantization budget $B$, probe-set size $m$, and vocabulary size $V$; bandwidth enters only through an exponentially vanishing quantization term. Our second main result is a distribution-free marginal-coverage bound for FC-RAG, whose novel retrieval-bandwidth slack $\Delta_{\mathrm{RAG}} = f_{\max}\sqrt{K^{-2}\sum_i v(B_i)}$ makes per-node retrieval bandwidth a first-class statistical parameter, with arithmetic aggregation across $K$ nodes shrinking the slack as $K^{-1/2}$ in the per-node-uniform regime. A Pinsker-type corollary composes the two bounds into an end-to-end coverage guarantee. Synthetic experiments verify the predicted scaling along the bounds' parameters; small-scale experiments on a GPT-2 testbed illustrate that the qualitative bandwidth-accuracy tradeoff survives on a real language model. A deployment-scale empirical evaluation is out of scope.
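The $K^{-1/2}$ shrinkage claimed for the retrieval-bandwidth slack follows directly from the stated formula: in the per-node-uniform regime $v(B_i) = v$ for all $i$, the sum becomes $Kv$ and $\Delta_{\mathrm{RAG}} = f_{\max}\sqrt{v/K}$. The sketch below is not the paper's code; it treats $v(B_i)$ as given numeric values (here a hypothetical uniform variance proxy) and numerically checks the scaling.

```python
import math

# Hedged sketch: compute the retrieval-bandwidth slack
#   Delta_RAG = f_max * sqrt(K^{-2} * sum_i v(B_i))
# where the per-node terms v(B_i) are supplied as plain numbers.
def delta_rag(f_max, v_values):
    K = len(v_values)
    return f_max * math.sqrt(sum(v_values) / K**2)

# Per-node-uniform regime: v(B_i) = v for every node, so
# Delta_RAG = f_max * sqrt(v / K), shrinking as K^{-1/2}.
f_max, v = 1.0, 0.04          # hypothetical illustrative values
d4 = delta_rag(f_max, [v] * 4)    # K = 4
d16 = delta_rag(f_max, [v] * 16)  # K = 16
print(d4, d16)                # quadrupling K halves the slack
```

Running this gives `d4 = 0.1` and `d16 = 0.05`: a 4x increase in node count yields a 2x reduction in slack, consistent with the $K^{-1/2}$ rate stated in the abstract.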