Quantifying the Energy Floor: Direct Measurement and Replay Buffer Bias in SAC-Based HVAC Control on sbsim
2026-06-01 • Machine Learning
AI summary
The authors studied how well a Soft Actor-Critic (SAC) algorithm can control heating and cooling in a building simulator, focusing on the lowest cost achievable within the limits of the allowed control actions. They measured this minimum at about $35.51 per day, almost all of it from continuous electrical loads, with negligible gas use. The standard SAC setup, whose replay buffer is pre-filled with data from a schedule-based policy, ended up about 4.7% above that minimum, while training from an empty buffer nearly matched it. Widening the supply water temperature range saved almost nothing, and widening it further violated physical constraints. They also found that the algorithm's effective planning horizon is much shorter than intended (about 46 minutes instead of 8.3 hours), which affects results.
Soft Actor-Critic (SAC), HVAC control, energy floor, building simulator, action space constraints, replay buffer, discount factor, planning horizon, minimum power, electrical loads
Authors
Bo Li, Chen Zhang
Abstract
We quantify the energy floor -- the minimum achievable cost given action space constraints -- for Soft Actor-Critic (SAC) HVAC control on the sbsim calibrated building simulator. Through minimum-action experiments, we directly measure this floor at USD 35.51/day, dominated by continuous electrical loads (USD 35.44, 99.8%) with negligible gas consumption. The standard SAC baseline, initialized with schedule-policy replay buffer transitions, converges to USD 37.18/day, 4.7% above the floor. We identify buffer initialization as the dominant source of sub-optimality in this scenario: training from an empty buffer reduces cost to USD 35.57/day, eliminating 96% of the gap. Expanding the supply water temperature range by 10 K yields negligible additional savings (USD 0.03/day), and further expansion triggers physical constraint violations. We additionally uncover a discount factor coupling (gamma_eff = 0.891) shrinking the effective planning horizon from 8.3 h to 46 min -- a benchmark-wide issue warranting audit. Systematic ablation across planning horizon, reward weights, and observation enrichment confirms all pre-filled-buffer configurations cluster within 0.7% (USD 37.18--USD 37.42), demonstrating that equipment minimum power -- not algorithmic design -- imposes the binding constraint.
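The percentages and horizons quoted in the abstract can be cross-checked with a few lines of arithmetic. The sketch below is illustrative only and is not the authors' code: the function names are hypothetical, the effective horizon is assumed to follow the common 1/(1 - gamma) rule of thumb, and the 5-minute control step and nominal discount of 0.99 are assumptions chosen because they reproduce the quoted 8.3 h and 46 min figures; the abstract states neither.

```python
# Cost figures quoted in the abstract, in USD/day.
FLOOR = 35.51        # measured energy floor (minimum-action experiment)
ELECTRICAL = 35.44   # electrical share of the floor
PREFILLED = 37.18    # SAC baseline with schedule-policy pre-filled buffer
EMPTY = 35.57        # SAC trained from an empty buffer


def pct_above_floor(cost, floor=FLOOR):
    """Percentage by which a daily cost exceeds the floor."""
    return 100.0 * (cost - floor) / floor


def pct_gap_closed(baseline, improved, floor=FLOOR):
    """Percentage of the baseline-to-floor gap removed by the improved run."""
    return 100.0 * (baseline - improved) / (baseline - floor)


def effective_horizon_minutes(gamma, step_minutes):
    """Rule-of-thumb horizon of 1 / (1 - gamma) steps, converted to minutes.

    ASSUMPTION: this formula and the step length are not given in the abstract.
    """
    return step_minutes / (1.0 - gamma)


STEP_MINUTES = 5.0    # ASSUMPTION: control step length
NOMINAL_GAMMA = 0.99  # ASSUMPTION: nominal discount implied by the 8.3 h figure
GAMMA_EFF = 0.891     # effective discount reported in the abstract

print(round(100.0 * ELECTRICAL / FLOOR, 1))       # 99.8 (% electrical)
print(round(pct_above_floor(PREFILLED), 1))       # 4.7 (% above floor)
print(round(pct_gap_closed(PREFILLED, EMPTY), 1)) # 96.4 (% of gap closed)
print(round(effective_horizon_minutes(NOMINAL_GAMMA, STEP_MINUTES) / 60.0, 1))  # 8.3 (hours)
print(round(effective_horizon_minutes(GAMMA_EFF, STEP_MINUTES)))                # 46 (minutes)
```

Under these assumptions the quoted figures are mutually consistent: the empty-buffer run leaves only USD 0.06/day of the original USD 1.67/day gap, and the drop from 0.99 to 0.891 shortens the horizon by roughly a factor of eleven.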