Between Zeros and Ones: Behavioral Characterization Beyond Binary Labeling Across Public ICS Datasets
2026-06-29 • Cryptography and Security
AI summary
The authors point out that current methods for detecting attacks in Industrial Control Systems only label events as 'normal' or 'attack,' which hides the different types of attack behaviors. They created a new way to describe attack behavior using five simple patterns: drift, spike, oscillation, repetition, and switching. By testing on three popular datasets, they found each had unique behavior patterns during attacks, and using only a simple attack/normal label misses these details. Their tests showed that evaluating detection using these behavior patterns reveals weaknesses that usual scoring hides. They suggest adding this behavior-based evaluation to better understand and respond to attacks.
Industrial Control Systems (ICS), Intrusion Detection, Cyber-Physical Attacks, Behavioral Characterization, Multivariate Process Traces, SWaT Dataset, WADI Dataset, HAI Dataset, Random Forest, F1 Score
Authors
Konstantinos E. Kampourakis, Vyron Kampourakis, Georgios Spathoulas, Constantinos Kolias
Abstract
Intrusion detection in Industrial Control Systems (ICS) is typically evaluated on a small set of public benchmarks using binary "normal" versus "attack" labels, a practice that can mask the behavioral diversity of cyber-physical attacks. To address this limitation, we propose a behavioral characterization framework that maps raw multivariate process traces into five interpretable physical primitives: drift, spike, oscillation, repetition, and switching. We apply the framework to three widely used ICS benchmarks, namely SWaT, WADI, and HAI, and show that attack windows exhibit clear behavioral shifts relative to normal operation while the three datasets occupy largely distinct regions of the behavioral space, revealing both cross-dataset bias and intra-dataset diversity. In particular, WADI is dominated by repetition, HAI emphasizes sustained drift and oscillation, and SWaT is characterized by stealthier frozen-telemetry behavior. To examine the evaluation implications, we use an indicative Random Forest baseline and show that aggregate binary metrics can limit visibility into performance across different behavioral proxies. For example, in SWaT, macro F1 drops from 85.44% under binary evaluation to 37.84% under behavior-proxy multiclass prediction, with similar degradations observed on WADI and HAI. Based on these findings, we argue for complementing conventional binary benchmarking with behavior-stratified evaluation to expose blind spots that aggregate scores leave hidden and to better support targeted incident response.
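The gap between binary and behavior-stratified macro F1 reported in the abstract can be illustrated with a small, self-contained sketch. It is not the authors' evaluation pipeline: the labels below are invented toy data, the helper names `macro_f1` and `to_binary` are hypothetical, and only three of the five primitives appear, purely to keep the example short. It uses the standard definition of macro F1 (the unweighted mean of per-class F1) and shows how a detector that finds most attacks, but assigns them all to one behavior class, scores well under binary labels and poorly once the behavior classes are scored separately.

```python
from collections import defaultdict


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over all classes seen in y_true or y_pred."""
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = defaultdict(int), defaultdict(int), defaultdict(int)
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted class gains a false positive
            fn[t] += 1  # true class gains a false negative
    scores = []
    for c in labels:
        denom = 2 * tp[c] + fp[c] + fn[c]
        scores.append(2 * tp[c] / denom if denom else 0.0)
    return sum(scores) / len(scores)


def to_binary(labels):
    """Collapse every non-normal behavior label into a single 'attack' class."""
    return ["normal" if label == "normal" else "attack" for label in labels]


# Toy window labels (assumption: invented for illustration, not from SWaT/WADI/HAI).
y_true = ["normal"] * 6 + ["drift"] * 2 + ["spike"] * 2 + ["repetition"] * 2
# A hypothetical detector that flags most attacks but calls all of them "drift".
y_pred = ["normal"] * 6 + ["drift"] * 5 + ["normal"]

binary_score = macro_f1(to_binary(y_true), to_binary(y_pred))
behavior_score = macro_f1(y_true, y_pred)

print(f"binary macro F1:   {binary_score:.4f}")
print(f"behavior macro F1: {behavior_score:.4f}")
```

On this toy data the binary score stays high because nearly every attack window is flagged, while the behavior-level score is pulled down by the spike and repetition classes, which the detector never predicts. This is the kind of blind spot that behavior-stratified evaluation is meant to expose.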