Vortex

Vortex turns sparse-attention algorithm design into something AI agents can do. You describe a sparse-attention flow in a few lines of high-level Python ops, and Vortex compiles it into fused Triton/CUDA kernels that plug straight into SGLang’s decode loop. You write no kernels by hand, and the result runs, and is benchmarked, inside a real serving stack.

A flow is just three methods on a vFlow:

  • create_cache — declare the auxiliary per-page state you want to keep (e.g. a centroid, a min/max envelope) alongside the K/V cache.

  • forward_cache — fill that state from the keys/values as each page completes (runs once per page).

  • forward_indexer — score the cached pages against the query and emit the sparse set of pages to attend to (runs every decode step).
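
To make the division of labor between the three methods concrete, here is a minimal, framework-free sketch in plain Python. It is not the Vortex API: the class name ToyCentroidFlow, the method signatures, the dict-based cache, and the budget argument are all assumptions made for illustration, and nothing here is compiled to Triton/CUDA. It only mirrors the idea of declaring per-page state, filling it once per completed page, and scoring pages against the query on every decode step.

```python
# Toy illustration of the three-phase split described above.
# NOT the Vortex API: names, signatures and data layout are hypothetical.

class ToyCentroidFlow:
    def __init__(self, head_dim, page_size):
        self.head_dim = head_dim
        self.page_size = page_size

    def create_cache(self):
        # Declare the auxiliary per-page state: one centroid per completed page.
        return {"centroid": []}

    def forward_cache(self, cache, page_keys):
        # Called once when a page is full: summarize its keys by their mean.
        assert len(page_keys) == self.page_size
        centroid = [
            sum(key[d] for key in page_keys) / self.page_size
            for d in range(self.head_dim)
        ]
        cache["centroid"].append(centroid)

    def forward_indexer(self, cache, query, budget):
        # Called every decode step: score each page by dot(query, centroid)
        # and return the indices of the `budget` highest-scoring pages.
        scores = [
            sum(q * c for q, c in zip(query, centroid))
            for centroid in cache["centroid"]
        ]
        ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
        return sorted(ranked[:budget])


flow = ToyCentroidFlow(head_dim=2, page_size=2)
cache = flow.create_cache()

pages = [
    [[1.0, 0.0], [1.0, 0.0]],    # page 0: keys point along +x
    [[0.0, 1.0], [0.0, 1.0]],    # page 1: keys point along +y
    [[-1.0, 0.0], [-1.0, 0.0]],  # page 2: keys point along -x
]
for page_keys in pages:
    flow.forward_cache(cache, page_keys)

selected = flow.forward_indexer(cache, query=[1.0, 0.2], budget=2)
print(selected)  # [0, 1]
```

With the query pointing mostly along +x, page 0 scores highest and page 2 lowest, so a budget of two pages keeps pages 0 and 1. In a real flow the same logic would be expressed with the ops listed in the API reference rather than Python lists.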

See Quick Start for the shortest end-to-end path, Examples for detailed recipes (custom flows, VortexConfig, a programmable budget, MLA models, and server mode), and the API reference for the full op set.

Note

Looking for the big picture and benchmarks? See the project page and the paper.