Multimodal Reasoning with LLM for Encrypted Traffic Interpretation: A Benchmark

2026-04-09 • Cryptography and Security

Cryptography and Security, Artificial Intelligence, Multimedia, Networking and Internet Architecture

AI summary

The authors address problems in analyzing network traffic data, which is important for internet security but hard to interpret beyond simple categorization. They note that existing datasets lack detailed explanations and semantic information, so they created a new dataset called Byte-Grounded Traffic Description (BGTD) that pairs raw traffic data with expert annotations. Using BGTD, the authors developed a model called mmTraffic that combines both data encoding and language understanding to provide clear, human-readable reports explaining traffic behavior. Their approach improves interpretability without losing accuracy compared to traditional models. The work aims to make encrypted traffic analysis more explainable and trustworthy.

network traffic, multimodal reasoning, encrypted traffic, semantic annotation, large language model (LLM), traffic classification, explainable AI, traffic encoding, perception-cognition architecture

Authors

Longgang Zhang, Xiaowei Fu, Fuxiang Huang, Lei Zhang

Abstract

Network traffic, as a key media format, is crucial for ensuring security and communications in modern internet infrastructure. While existing methods offer excellent performance, they face two key bottlenecks: (1) they fail to capture multidimensional semantics beyond unimodal sequence patterns; (2) their black-box nature, i.e., providing only category labels, means they lack an auditable reasoning process. We identify a key factor: existing network traffic datasets are primarily designed for classification and inherently lack rich semantic annotations, so they cannot support the generation of human-readable evidence reports. To address this data scarcity, this paper proposes the first Byte-Grounded Traffic Description (BGTD) benchmark, which combines raw bytes with structured expert annotations. BGTD provides the behavioral features and verifiable chains of evidence needed for multimodal reasoning towards explainable encrypted traffic interpretation. Built upon BGTD, this paper proposes an end-to-end traffic-language representation framework (mmTraffic), a multimodal reasoning architecture that bridges physical traffic encoding and semantic interpretation. To alleviate modality interference and generative hallucinations, mmTraffic adopts a jointly optimized perception-cognition architecture. By incorporating a perception-centered traffic encoder and a cognition-centered LLM generator, mmTraffic achieves refined traffic interpretation with guaranteed category prediction. Extensive experiments demonstrate that mmTraffic autonomously generates high-fidelity, human-readable, and evidence-grounded traffic interpretation reports, while maintaining highly competitive classification accuracy compared to specialized unimodal models (e.g., NetMamba). The source code is available at https://github.com/lgzhangzlg/Multimodal-Reasoning-with-LLM-for-Encrypted-Traffic-Interpretation-A-Benchmark
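The abstract does not spell out the BGTD record format, so the following is only a minimal sketch of what pairing raw bytes with structured annotations and "verifiable chains of evidence" could look like. The names `Evidence`, `TrafficSample`, and `unverified_evidence`, and every field in them, are hypothetical and are not taken from the paper or its repository. The idea illustrated is that each annotated claim cites a byte range, so a claim can be checked mechanically against the raw data.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Evidence:
    """A claim tied to a byte range of the raw flow (hypothetical schema)."""
    offset: int      # start index into the raw bytes
    length: int      # number of bytes covered
    hex_value: str   # bytes the annotation claims to see there, as hex
    claim: str       # human-readable statement those bytes support


@dataclass
class TrafficSample:
    """Raw bytes paired with a structured annotation (hypothetical schema)."""
    raw: bytes
    label: str
    evidence: List[Evidence] = field(default_factory=list)


def unverified_evidence(sample: TrafficSample) -> List[Evidence]:
    """Return evidence items whose cited bytes do not match the raw data."""
    bad = []
    for ev in sample.evidence:
        end = ev.offset + ev.length
        in_range = ev.offset >= 0 and end <= len(sample.raw)
        cited = sample.raw[ev.offset:end].hex() if in_range else None
        if cited != ev.hex_value.lower():
            bad.append(ev)
    return bad


# Start of a TLS record: content type 0x16 (handshake), version 0x0301.
sample = TrafficSample(
    raw=bytes.fromhex("160301020001"),
    label="tls-handshake",
    evidence=[
        Evidence(0, 1, "16", "TLS record content type is handshake"),
        # Deliberately wrong claim, to show a failed grounding check.
        Evidence(1, 2, "0303", "record version field is 0x0303"),
    ],
)

for ev in unverified_evidence(sample):
    print(f"not grounded in bytes: {ev.claim}")
```

A check of this kind is one way a generated report could be audited for hallucinated evidence; whether mmTraffic or BGTD performs such a check is not stated in the abstract.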
