Vortex: Programmable Sparse Attention for Agents as Algorithm Designers

★ Highlights

Agents discover sparse attention — and it actually runs faster

AI agents generate and refine diverse sparse-attention algorithms with Vortex, and every one is benchmarked end-to-end in a real serving stack — so these are measured throughput gains, not paper estimates. The headline results:

From idea to deployed algorithm. A researcher or agent writes a flow over Vortex's page-centric tensor abstraction, and it compiles into fused kernels that plug straight into SGLang. The payoff is immediate: across agent-generated variants, the best reaches up to 3.46× the throughput of full attention while preserving accuracy.

Figure: **(a)** A workflow to study sparse attention with Vortex: papers and agents feed vFlow, vTensor, and a serving system. **(b)** Agent-generated sparse attention on the accuracy–throughput plane (Qwen3-1.7B, AIME, NVIDIA H200): the best reaches up to **3.46×** the throughput of full attention while preserving accuracy.

Many agents, many algorithms. This isn't one hand-tuned kernel. Claude Opus 4.7, Claude Sonnet 4.6, and GPT-5 each generate structurally diverse designs — and after a staged filtering pipeline, the selected ones are efficient: full-attention accuracy at 2–3.1× higher throughput across three benchmarks.

Figure: Performance of AI-agent-generated algorithms across **(a)** RULER, **(b)** AMC23, and **(c)** AIME24. Claude Opus 4.7, Claude Sonnet 4.6, and GPT-5 each produce **diverse** sparse-attention designs; after a staged filtering pipeline, the selected ones are **efficient**: full-attention accuracy at substantially higher throughput.

And it compounds. Given only the framework and a goal, an agent runs a long-horizon loop — proposing, benchmarking, and refining four variants per round. Over 18 hours (23 iterations, 92 submissions) it steadily pushes the accuracy–throughput frontier outward, entirely on its own.

Long-horizon autonomous optimization on AIME24 (23 iterations, 92 submissions): **(a)** mean@16 per iteration, **(b)** throughput per iteration, **(c)** the accuracy–throughput frontier of all submissions, colored by iteration order.


01 · Vision

Agents as algorithm designers

Sparse attention has become a fundamental technique for serving large language models. As generation lengths explode across reasoning, agentic systems, and reinforcement learning, moving the KV cache during decoding — not compute — is the dominant bottleneck. Attending to only the tokens that matter is the way out, and it now appears both as a core architectural choice in frontier models (DeepSeek, GLM) and as a drop-in optimization for pretrained ones.
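To make the memory-movement argument concrete, here is a back-of-the-envelope sketch of how many KV-cache bytes a single decode step has to read. It assumes a standard MHA/GQA cache layout (one key and one value vector per token, KV head, and layer); the configuration values are hypothetical and do not describe any specific model, and the cost of choosing which tokens to read is ignored.

```python
def kv_cache_bytes(seq_len, num_layers, num_kv_heads, head_dim, bytes_per_elem=2):
    """Bytes of K and V cached for one sequence (standard MHA/GQA layout)."""
    # Factor of 2: one key vector and one value vector per token, head, and layer.
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem * seq_len


# Illustrative configuration (hypothetical, not a specific model).
cfg = dict(num_layers=32, num_kv_heads=8, head_dim=128, bytes_per_elem=2)

# Full attention reads the whole cache on every decode step.
full = kv_cache_bytes(seq_len=32_768, **cfg)
# Attending to only 1/8 of the tokens reads 1/8 of the bytes.
sparse = kv_cache_bytes(seq_len=32_768 // 8, **cfg)

print(f"full attention: {full / 2**30:.2f} GiB per decode step")
print(f"1/8 sparse:     {sparse / 2**30:.2f} GiB per decode step")
```

Because the byte count is linear in the number of tokens attended to, the fraction of tokens kept sets an upper bound on how much memory traffic sparse attention can save.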

Yet deploying and evaluating new sparse-attention algorithms at scale, with real end-to-end speedups, has stayed painfully engineering-intensive — slowing both human researchers and the emerging class of AI agents that could explore this design space. Modern serving systems store the KV cache in a non-contiguous, paged, block-sparse layout reached through indirect addressing, which breaks the contiguous-tensor assumptions of frameworks like PyTorch. As a result, a new idea that is a few lines of math on paper can take thousands of lines of kernel and plumbing code to try.
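The paged layout and its indirect addressing can be illustrated with a toy page table. This is a minimal sketch, not the data structure used by SGLang or Vortex: the class `PagedKV`, its methods, and the tiny `PAGE_SIZE` are hypothetical names chosen for the example.

```python
PAGE_SIZE = 4  # tokens per page (tiny, for illustration)


class PagedKV:
    """Toy paged KV store: a shared pool of pages plus a per-sequence page table."""

    def __init__(self, num_pages):
        self.pool = [[None] * PAGE_SIZE for _ in range(num_pages)]
        self.free = list(range(num_pages - 1, -1, -1))  # stack of free page ids
        self.page_table = {}  # seq_id -> list of physical page ids
        self.length = {}      # seq_id -> number of tokens stored

    def append(self, seq_id, kv):
        n = self.length.get(seq_id, 0)
        table = self.page_table.setdefault(seq_id, [])
        if n % PAGE_SIZE == 0:
            # Previous page is full (or this is the first token): take a new page.
            table.append(self.free.pop())
        self.pool[table[n // PAGE_SIZE]][n % PAGE_SIZE] = kv
        self.length[seq_id] = n + 1

    def get(self, seq_id, pos):
        # Indirect addressing: logical position -> (physical page, offset).
        page = self.page_table[seq_id][pos // PAGE_SIZE]
        return self.pool[page][pos % PAGE_SIZE]


cache = PagedKV(num_pages=8)
for t in range(6):
    # Two sequences grow in lockstep, so their pages interleave in the pool.
    cache.append("a", f"a{t}")
    cache.append("b", f"b{t}")

print(cache.page_table)    # physical pages of each sequence are not adjacent
print(cache.get("a", 5))   # token 5 of sequence "a", found via the page table
```

Every read goes through the page table, so a sequence's tokens are never one contiguous slice of the pool. That is the assumption ordinary tensor code relies on and the reason page-aware operators are needed.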

Vortex removes that wall. It is built for autonomous algorithm discovery: an AI agent proposes a sparse-attention idea, expresses it in a few lines of high-level Python, and Vortex compiles it into fused kernels that run inside a production serving stack — then measures real throughput and accuracy. The agent reads the result and refines, closing the research loop without a human in the inner iteration.

Across hundreds of generated variants, agents consistently discover Pareto-efficient algorithms: full-attention–level accuracy at a fraction of the cost. The best reaches up to 3.46× higher throughput than full attention while preserving accuracy — and because every variant is benchmarked in a real serving stack, those gains are measured, not theoretical.
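For readers less familiar with the term, a variant is Pareto-efficient when no other variant is at least as good on both accuracy and throughput and strictly better on one. The sketch below shows one simple way to extract that frontier; the function name and the data points are made up for illustration and are not results from Vortex.

```python
def pareto_front(points):
    """Return points not dominated on (accuracy, throughput); higher is better for both."""
    front = []
    for i, (acc, thr) in enumerate(points):
        dominated = any(
            a >= acc and t >= thr and (a > acc or t > thr)
            for j, (a, t) in enumerate(points)
            if j != i
        )
        if not dominated:
            front.append((acc, thr))
    return sorted(front)


# Synthetic (accuracy, relative throughput) pairs, for illustration only.
variants = [(0.39, 1.0), (0.38, 2.1), (0.39, 3.0), (0.30, 3.4), (0.35, 2.5)]
print(pareto_front(variants))
```

In this toy set only two points survive: the fastest variant and the one that matches the best accuracy at higher throughput. The quadratic scan is fine for a few hundred variants.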

It runs as a closed discovery loop:

1. Express: An agent writes the idea in a few lines of high-level Python ops (scoring, reductions, top-k).
2. Deploy: Vortex JIT-compiles it into fused kernels inside a real LLM serving stack, with no model-code changes.
3. Measure: Real throughput and accuracy come back from an end-to-end benchmark, not an estimate.
4. Refine: The agent reads the result and proposes the next variant, and the loop returns to step 1. Discovery proceeds without a human in the inner loop.
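The four steps above can be sketched as a plain search loop. Everything in this example is hypothetical: `discovery_loop`, `toy_propose`, and `toy_benchmark` are stand-ins written for illustration, not Vortex APIs, and the synthetic score replaces what would be a real end-to-end serving benchmark. The round and batch sizes mirror the long-horizon run described in the Highlights (23 iterations of four variants each).

```python
import random


def discovery_loop(propose, benchmark, rounds, variants_per_round):
    """Generic propose -> benchmark -> refine loop; returns every (variant, result)."""
    history = []
    for _ in range(rounds):
        # The proposer sees all earlier results, which is what lets it refine.
        batch = [propose(history) for _ in range(variants_per_round)]
        history.extend((v, benchmark(v)) for v in batch)
    return history


rng = random.Random(0)  # fixed seed so the toy run is repeatable


def toy_propose(history):
    # Hypothetical proposer: a "variant" is just a top-k budget (an integer).
    if not history:
        return rng.randint(1, 64)
    best, _ = max(history, key=lambda h: h[1])
    return max(1, best + rng.randint(-4, 4))  # perturb the best budget so far


def toy_benchmark(k):
    # Synthetic score peaking at k = 16; stands in for a real measurement.
    return -abs(k - 16)


history = discovery_loop(toy_propose, toy_benchmark, rounds=23, variants_per_round=4)
print(len(history))  # 23 rounds x 4 variants = 92 submissions
```

The point of the sketch is the shape of the loop: the only channel between rounds is the measured history, so the quality of the search depends on the benchmark being fast and trustworthy.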

02 · Design

A frontend for ideas, a backend for serving

Vortex pairs a Python-embedded frontend over a page-centric tensor abstraction (vTensor) — concise enough to express a broad range of sparse-attention algorithms — with an efficient backend tightly integrated into modern LLM serving stacks (SGLang). The guiding principle is a clean split: you say what sparsity to apply and how attention is computed, and the framework owns the low-level tensor layout and memory management. Theoretical efficiency becomes real-world throughput, without touching core model code.

You describe what to attend to — score pages, reduce per-block summaries, select a top-k — and Vortex handles batching, paged KV caching, gather minimization, and kernel fusion. A flow is written as modular, composable operators (GeMM, Reduce, Top-K, …) over paged tensors, rather than a monolithic custom kernel per algorithm — so new patterns combine instead of being reimplemented from scratch.
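As a rough illustration of the score, reduce, and top-k pattern described above, here is a plain-Python sketch of block top-k sparse attention for a single query. It is not vFlow code and uses no Vortex API: the function names, the mean-key page summary, and the toy data are all assumptions chosen for the example.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def page_summaries(keys, page_size):
    """Reduce: one mean-key vector per page."""
    pages = [keys[i:i + page_size] for i in range(0, len(keys), page_size)]
    return [[sum(col) / len(p) for col in zip(*p)] for p in pages]


def select_pages(query, summaries, k):
    """Score each page against the query, then keep the top-k page indices."""
    scores = [dot(query, s) for s in summaries]
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    return sorted(top)


def sparse_attention(query, keys, values, page_size, k):
    """Softmax attention restricted to the tokens of the selected pages."""
    chosen = select_pages(query, page_summaries(keys, page_size), k)
    idx = [t for p in chosen
           for t in range(p * page_size, min((p + 1) * page_size, len(keys)))]
    logits = [dot(query, keys[t]) / math.sqrt(len(query)) for t in idx]
    m = max(logits)                      # subtract the max for numerical stability
    w = [math.exp(x - m) for x in logits]
    z = sum(w)
    dim = len(values[0])
    out = [sum(w[j] * values[t][d] for j, t in enumerate(idx)) / z for d in range(dim)]
    return out, chosen


# Toy data: 8 tokens in 4 pages of 2; each value is just the token index.
q = [1.0, 0.0]
keys = [[0, 1], [0, 1], [1, 0], [2, 0], [-1, 0], [-1, 0], [0.5, 0], [0.5, 0]]
values = [[float(t)] for t in range(8)]

out, chosen = sparse_attention(q, keys, values, page_size=2, k=2)
print(chosen, out)
```

Only the tokens in the selected pages are ever touched by the attention step, which is where the memory savings come from. The mean-key summary is just one possible scoring rule; other rules would change `page_summaries` and `select_pages` while leaving the rest of the pipeline as is.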

Crucially, Vortex treats the dynamic part — deciding the sparsity pattern on the fly — as a first-class, optimized stage, not an afterthought. The same abstraction covers MHA and MLA models, exact and approximate top-k, and even a programmable per-sequence token budget written as a small CUDA snippet.

The result drops into FlashInfer / CUDA Graph / radix-cache decoding — so a new algorithm is deployable and benchmarkable the moment it compiles, and its speedups survive contact with a real serving system.

Under the hood, a flow passes through three composable layers:

- `vFlow`: A Python-embedded frontend language. Declare what to attend to; the framework handles batching, caching, and fusion.
- `vTensor`: A page-centric tensor abstraction where the page is the unit of sparsity, uniform across MHA and MLA models.
- Serving System: A backend tightly integrated with SGLang: FlashInfer kernels, CUDA Graph, and radix-cache decoding.

Together, these close three gaps that have kept sparse attention slow to iterate on:

- Dynamic sparsity: Static-pattern kernels (FlashInfer, FlexAttention) optimize attention once the pattern is known. Vortex also makes the accurate, dynamic case efficient: computing the pattern on the fly.
- Programmability: Adding a variant to a serving system can mean ~2000 lines of code re-implementing GeMM/Reduce/Top-K over paged tensors. In Vortex it is a few composable lines.
- Serving compatibility: Custom-kernel methods often break paged attention or prefix caching; Quest's original code is 44.4× slower than full attention. Vortex stays native to the stack.

03 · Experiments

Discovery, scale, and emerging architectures

Vortex accelerates sparse-attention research along three axes: autonomous discovery, reach into emerging architectures and very large models, and use as a research instrument for understanding where the routing signal lives.

① Autonomous discovery (up to 3.46× throughput): Agents generate and refine diverse algorithms that are consistently Pareto-efficient.

② Scale & architectures (4.7× on GLM-4.7-Flash, 1.37× on the 229B MiniMax-M2.7): Sparse attention extended to MLA models and very large models that are otherwise hard to experiment with.

③ Research instrument (interpretability): A lens on sparse attention itself, pinpointing where the routing signal lives.

Autonomous discovery is shown in the Highlights above; here we focus on reach — new architectures and the largest models:

MLA models · GLM-4.7-Flash · up to 4.7×

Figure: Sparse attention on the MLA-based GLM-4.7-Flash. Three MLA sparse-attention flows (rope-aware / rope-unaware block-sparse, and Quest) are expressed in vFlow and swept over block sizes on AIME26 with 32K-token generation (NVIDIA B200), extending sparse attention to an architecture that is otherwise hard to experiment with.

Scaling · MiniMax-M2.7 (229B) · up to 1.37×

Figure: Sparse attention on the 229B MiniMax-M2.7 with tensor parallelism. The same flows scale to a 229B-parameter model under tensor parallelism (TP=4, four NVIDIA B200 GPUs) on AIME26, so sparse attention stays practical at the largest scales, with full-attention accuracy preserved.

| Model | Benchmark | Hardware | Throughput vs. full attn | Accuracy |
|---|---|---|---|---|
| Qwen3-1.7B (agent-discovered) | AIME24 | H200 | ↑ 3.46× | matched (38.96 vs 38.54) |
| GLM-4.7-Flash (MLA) | AIME26 · 32K | B200 | ↑ 4.7× | matched (mean@16 ≈ 0.75) |
| Qwen3-30B-A3B (MoE, FP8) | AIME24 · 32K | B200 | ↑ 1.63× | matched (0.802 vs 0.80) |
| MiniMax-M2.7 (229B, TP=4) | AIME26 · 32K | 4× B200 | ↑ 1.37× | ≥ full (0.84 vs 0.83) |

Best operating point per model; “throughput vs. full attn” is end-to-end decode throughput at matched-or-better accuracy. Block top-k also reaches up to 3.60× server throughput and 11.7–12.8× lower P95 latency at high request rates.

Because every algorithm is benchmarked end-to-end in a real serving stack, these are measured throughput gains, not paper estimates. Beyond raw numbers, Vortex doubles as a research instrument: agents produce structurally diverse algorithms, and controlled ablations using the same abstraction localize where the routing signal lives — a small set of query–key channel groups turns out to carry most of the routing information across model sizes.

Get started

Install in a minute

Install

git clone --recursive https://github.com/Infini-AI-Lab/vortex_torch.git
cd vortex_torch

# SGLang dependency (vendored)
cd third_party/sglang/v0.5.9/sglang
pip install -e "python"
cd ../../../../

# Vortex
pip install -e .

Cite

@misc{chen2026vortexefficientprogrammablesparse,
  title  = {Vortex: Efficient and Programmable Sparse Attention Serving for AI Agents},
  author = {Zhuoming Chen and Xinrui Zhong and Qilong Feng and Ranajoy Sadhukhan and
            Yang Zhou and Michael Qizhe Shieh and Zhihao Jia and Beidi Chen},
  year   = {2026},
  eprint = {2606.06453},
  archivePrefix = {arXiv},
  primaryClass  = {cs.AI},
  url    = {https://arxiv.org/abs/2606.06453}
}