How Small Can 6G Reason? Scaling Tiny Language Models for AI-Native Networks
2026-03-02 • Networking and Internet Architecture • Artificial Intelligence
AI summary
The authors studied how different sizes of language models perform in AI-based 6G networks, focusing on their speed, memory use, and accuracy in making decisions. They found that very large models, while more accurate, are less practical for fast, edge-based tasks because they use too many resources. Smaller to mid-sized models around 1.5 to 3 billion parameters hit the best balance between accuracy, speed, and efficiency for these networks. Their research helps guide which models to choose when building smarter 6G systems. They also created a benchmark with 30 tasks to test these models and shared their results publicly.
6G networks · AI-native systems · large language models · semantic reasoning · edge computing · model scaling · benchmark · inference latency · memory footprint · deterministic accuracy
Authors
Mohamed Amine Ferrag, Abderrahmane Lakas, Merouane Debbah
Abstract
Emerging 6G visions, reflected in ongoing standardization efforts within 3GPP, IETF, ETSI, ITU-T, and the O-RAN Alliance, increasingly characterize networks as AI-native systems in which high-level semantic reasoning layers operate above standardized control- and data-plane functions. Although frontier-scale large language models (LLMs) such as Qwen2.5-7B and Olmo-3-7B demonstrate strong reasoning capability, their computational footprint limits deployment in latency-sensitive, edge-native infrastructures. This paper presents a systematic empirical study of the scaling behavior and deployment efficiency of compact language models for network-level semantic reasoning in AI-native 6G systems. Using 6G-Bench, a standardization-aligned benchmark comprising 30 decision-making tasks across five capability domains, we evaluate models ranging from 135M (SmolLM2-135M) to 7B parameters (Qwen2.5-7B), including mid-scale architectures such as Llama-3.2-1B, Granite-1B, and Qwen2.5-3B. Deterministic accuracy (pass@1) increases from 0.224 at 135M to 0.707 at 7B, but scaling gains are highly non-uniform. A pronounced stability transition occurs in the 1 to 1.5B range, where accuracy rises from 0.373 (Llama-3.2-1B) to 0.531 (Qwen2.5-1.5B) and the instability gap Delta_5 contracts from 0.356 to 0.138. Beyond 3B parameters, improvements diminish (+0.064 from 3B to 7B). Through single-query inference profiling and an Edge Score metric that normalizes accuracy by latency and memory footprint, we show that semantic reliability per unit edge resource does not scale monotonically with parameter count. Instead, mid-scale models (approximately 1.5 to 3B) achieve the most favorable balance between deterministic stability and computational efficiency, providing deployment-relevant guidance for AI-native 6G architectures. All scripts and results are publicly available at https://github.com/maferrag/6G-Bench.
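The abstract describes an Edge Score that normalizes accuracy by latency and memory footprint, and an instability gap Delta_5, but does not give their exact formulas. The sketch below illustrates one plausible form of each; the division of pass@1 by a latency-memory cost, the reference constants, and the max-minus-min gap definition are all assumptions for illustration, not the paper's actual metrics.

```python
def edge_score(pass_at_1, latency_s, memory_gb, latency_ref=1.0, memory_ref=1.0):
    """Hypothetical Edge Score: deterministic accuracy per unit edge resource.

    Assumed form: accuracy divided by a normalized latency x memory cost.
    Higher is better. The paper's exact normalization is not stated in the
    abstract, so this is an illustrative sketch only.
    """
    cost = (latency_s / latency_ref) * (memory_gb / memory_ref)
    return pass_at_1 / cost


def instability_gap(run_accuracies):
    """Delta_5-style gap over repeated runs of the same benchmark.

    Assumed definition: spread between the best and worst accuracy across
    the runs (the abstract names Delta_5 but does not define it).
    """
    return max(run_accuracies) - min(run_accuracies)


# Illustrative comparison. The pass@1 values come from the abstract
# (0.531 for Qwen2.5-1.5B, 0.707 for Qwen2.5-7B); the latency and
# memory numbers are invented placeholders.
mid_scale = edge_score(0.531, latency_s=0.4, memory_gb=3.0)
frontier = edge_score(0.707, latency_s=2.5, memory_gb=14.0)
# Under these placeholder costs, the mid-scale model wins on Edge Score
# even though the 7B model has higher raw accuracy.
```

This captures the abstract's qualitative finding: once accuracy is divided by resource cost, reliability per unit edge resource need not grow with parameter count.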