Vortex: Efficient and Programmable Sparse Attention Serving for AI Agents

2026-06-04 | Artificial Intelligence

AI summary

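The authors present Vortex, a system for building, testing, and deploying sparse attention methods for large language models. Sparse attention lets a model handle long sequences by attending only to the most relevant parts of the context, but implementing and evaluating a new method at scale usually takes substantial engineering effort. Vortex pairs a Python-embedded frontend for expressing these methods with an efficient backend integrated into LLM serving stacks, so that theoretical savings become measured throughput gains. Algorithms generated by AI agents with Vortex reach up to 3.46× higher throughput than full attention while preserving accuracy. Vortex also extends sparse attention to newer architectures and very large models, reaching up to 4.7× higher throughput on the MLA-based GLM-4.7-Flash and 1.37× on the 229B-parameter MiniMax-M2.7 on NVIDIA B200 GPUs.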

sparse attention, large language models, tensor abstraction, throughput, MLA-based GLM-4.7-Flash, MiniMax-M2.7, NVIDIA B200 GPUs, deep learning, model serving, algorithm prototyping
Authors
Zhuoming Chen, Xinrui Zhong, Qilong Feng, Ranajoy Sadhukhan, Yang Zhou, Michael Qizhe Shieh, Zhihao Jia, Beidi Chen
Abstract
Sparse attention is becoming increasingly important for serving large language models (LLMs) as generation lengths continue to grow. However, deploying and evaluating new sparse attention algorithms at scale remains highly engineering-intensive, slowing both human researchers and AI agents in exploring the sparse attention design space. To address this challenge, we present Vortex, a system that combines a Python-embedded frontend language, built atop a page-centric tensor abstraction for expressing a broad range of sparse attention algorithms, with an efficient backend tightly integrated into modern LLM serving stacks. Vortex enables rapid prototyping, deployment, and evaluation of sparse attention algorithms, effectively translating their theoretical efficiency gains into real-world throughput improvements. As a result, Vortex substantially accelerates the design and iteration of sparse attention algorithms. First, AI agents use Vortex to automatically generate and refine diverse algorithms, the best reaching up to $3.46\times$ higher throughput than full attention while preserving accuracy. Second, Vortex extends sparse attention to emerging architectures and very large models that are otherwise hard to experiment with, reaching up to $4.7\times$ higher throughput on the MLA-based GLM-4.7-Flash and $1.37\times$ on the 229B-parameter MiniMax-M2.7 on NVIDIA B200 GPUs.
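To make the idea of page-level sparse attention concrete, here is a minimal pure-Python sketch of one decode step. It is not Vortex's frontend language or API, which the abstract does not show. The function names (`attention`, `select_pages`, `page_sparse_attention`) are hypothetical, and scoring each page by the dot product of the query with the page's mean key is an assumed heuristic chosen only for illustration.

```python
import math

def attention(q, keys, values):
    """Softmax attention of one query vector over lists of key/value vectors."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
    m = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) / total
            for i in range(dim)]

def select_pages(q, keys, page_size, top_k):
    """Return the indices of the top_k pages, in ascending order.

    Assumption: a page is scored by q . mean(keys in page).
    """
    if len(keys) % page_size != 0:
        raise ValueError("number of keys must be a multiple of page_size")
    num_pages = len(keys) // page_size
    scores = []
    for p in range(num_pages):
        page = keys[p * page_size:(p + 1) * page_size]
        mean_key = [sum(col) / page_size for col in zip(*page)]
        scores.append(sum(a * b for a, b in zip(q, mean_key)))
    ranked = sorted(range(num_pages), key=lambda p: scores[p], reverse=True)
    return sorted(ranked[:top_k])

def page_sparse_attention(q, keys, values, page_size, top_k):
    """Attend only to the tokens that live in the selected pages."""
    pages = select_pages(q, keys, page_size, top_k)
    idx = [i for p in pages for i in range(p * page_size, (p + 1) * page_size)]
    out = attention(q, [keys[i] for i in idx], [values[i] for i in idx])
    return out, pages

# Toy cache: 3 pages of 2 tokens each, head dimension 2.
q = [1.0, 0.0]
keys = [[-1.0, 0.0], [-1.0, 0.0], [5.0, 0.0], [5.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
values = [[9.0, 9.0], [9.0, 9.0], [1.0, 2.0], [3.0, 4.0], [7.0, 7.0], [7.0, 7.0]]
out, pages = page_sparse_attention(q, keys, values, page_size=2, top_k=1)
print(pages, out)
```

In this sketch the attention cost scales with `top_k * page_size` rather than with the full context length, which is where the theoretical savings of sparse attention come from. Setting `top_k` to the total number of pages recovers full attention.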