Contrastive Learning and Correlation Clustering for Sequences of Network Telescope Data
2026-06-03 • Machine Learning
Machine Learning, Networking and Internet Architecture
AI summary
The authors study how to understand the behavior of Internet scanners by finding meaningful connections between patterns of network activity. They use a transformer model trained with contrastive learning to compare sequences of network flow records, without needing any labels or pretraining. Their method groups similar sequences together, mostly matching known scanner sources, even for new unseen data. This approach helps identify which network activities come from the same scanner source. The authors also made their code publicly available for others to use.
Internet scanners, network flow records, contrastive learning, transformer model, correlation clustering, semantic similarity, unsupervised learning, network security, embedding, clustering algorithms
Authors
Jannik Presberger, Alexander Männel, Maynard Koch, Thomas C. Schmidt, Matthias Wählisch, Bjoern Andres
Abstract
Understanding the activities of Internet scanners is challenging; it often requires identifying relationships between sources, a task for which semantic annotations are scarce. This work investigates whether semantically meaningful pairwise relationships between sequences of network flow records can be estimated by contrastive learning, without pretraining and without annotations. To this end, we propose a transformer model that embeds minimally preprocessed sequences of network flow records and train it using contrastive learning. With the similarities obtained from this model, we state a correlation clustering problem and solve it locally. Experimentally, we show that learned similarities are higher on average for sequences originating from the same source than for sequences originating from different sources, and that this property generalizes to unseen sequences of unseen sources. Moreover, correlation clustering yields clusters consistent with scanner labels. The complete source code for the algorithms and for reproducing the experiments is publicly available.
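To make the clustering step concrete, the following is a minimal sketch of correlation clustering solved by local search over a matrix of pairwise similarities. It is not the authors' implementation: the abstract does not say how similarities are converted into costs or which local search is used, so the function name `correlation_clustering_local`, the greedy node-move heuristic, the fixed threshold of 0.5, and the toy similarity values are all assumptions made for illustration.

```python
import numpy as np


def correlation_clustering_local(weights, max_rounds=100):
    """Greedy node-move local search for correlation clustering (illustrative).

    weights: symmetric (n, n) array. A positive entry rewards putting the two
    sequences in the same cluster, a negative entry rewards separating them.
    Returns one integer cluster label per sequence.
    """
    w = np.array(weights, dtype=float)
    np.fill_diagonal(w, 0.0)          # a node has no cost with itself
    n = w.shape[0]
    labels = np.arange(n)             # start with every node in its own cluster

    for _ in range(max_rounds):
        moved = False
        for i in range(n):
            # gain[c] = total weight between node i and the nodes in cluster c.
            # Unused labels keep gain 0, which models "open a new cluster".
            gain = np.zeros(n)
            np.add.at(gain, labels, w[i])
            best = int(np.argmax(gain))
            # Move only if the within-cluster weight strictly increases,
            # so the objective improves with every move and the search stops.
            if gain[best] > gain[labels[i]] + 1e-12:
                labels[i] = best
                moved = True
        if not moved:                 # local optimum reached
            break
    return labels


# Toy similarities in [0, 1] (assumed values, not results from the paper):
# sequences 0-2 resemble each other, as do sequences 3-4.
sim = np.array([
    [1.00, 0.90, 0.80, 0.10, 0.00],
    [0.90, 1.00, 0.85, 0.20, 0.10],
    [0.80, 0.85, 1.00, 0.00, 0.10],
    [0.10, 0.20, 0.00, 1.00, 0.90],
    [0.00, 0.10, 0.10, 0.90, 1.00],
])

# Assumed conversion: similarities above 0.5 attract, those below repel.
labels = correlation_clustering_local(sim - 0.5)
print(labels)
```

Unlike k-means, this formulation does not require the number of clusters in advance; it follows from the signs of the pairwise weights. The result is only a local optimum and may depend on the order in which nodes are visited.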