Quick Start
A working setup is two files:
- The flow module: a .py file that defines your sparse-attention algorithm as a vFlow subclass and registers it under a name with @register. It contains only Vortex ops; it never imports SGLang.
- The launch script: imports sglang + vortex_torch and starts the engine, pointing at the flow by its registered name.
1. Define a flow: custom_sparse_attention.py
A vFlow declares its cache layout in create_cache, refreshes per-page state
in forward_cache, and scores/selects pages every decode step in
forward_indexer. Save this anywhere on disk:
from typing import Dict

import torch

from vortex_torch.flow import vFlow, register
from vortex_torch.indexer import GeMM, Mean, topK
from vortex_torch.cache import Mean as CMean
from vortex_torch.abs import ContextBase


@register("custom_sparse_attention")
class CustomSparseAttention(vFlow):
    def __init__(self):
        super().__init__()
        # Indexer-side ops (run every decode step)
        self.mean = Mean(dim=1)        # average over the query heads
        self.gemm = GeMM()             # GeMM(x, y) = y @ xᵀ
        self.output_func = topK()      # must end in topK / approxTopK
        # Cache-side ops (run once per finished page)
        self.reduction = CMean(dim=1)  # one centroid (mean key) per page

    def forward_indexer(self, q, o, cache: Dict[str, torch.Tensor], ctx: ContextBase):
        # No native torch ops here: every tensor flows through Vortex ops.
        q_mean = self.mean(q, ctx=ctx)                          # [B, 1, D]
        score = self.gemm(q_mean, cache["centroids"], ctx=ctx)  # [S, 1, 1]
        self.output_func(score, o, ctx=ctx)                     # selected pages -> o

    def forward_cache(self, cache: Dict[str, torch.Tensor], loc, ctx: ContextBase):
        # Triggered only when a page is finished
        self.reduction(cache["k"], cache["centroids"], loc=loc, ctx=ctx)

    def create_cache(self, block_size: int, head_dim: int):
        # "k" and "v" are provided automatically; do not declare them
        return {"centroids": (1, head_dim)}
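
To make the data flow of this module concrete, here is a plain-Python reference of the same idea: one mean-key centroid per finished page, a head-averaged query, a dot-product score per page, and a top-k selection. It is a sketch for intuition only. It uses nested lists instead of tensors, ignores batching and the Vortex context, and the helper names (`mean_vectors`, `page_centroids`, `select_pages`) are hypothetical, not part of Vortex.

```python
from typing import List

Vector = List[float]


def mean_vectors(vectors: List[Vector]) -> Vector:
    """Element-wise mean of equally sized vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def page_centroids(keys: List[Vector], page_size: int) -> List[Vector]:
    """One centroid (mean key) per finished page; a trailing partial page is skipped."""
    finished = len(keys) // page_size
    return [
        mean_vectors(keys[p * page_size:(p + 1) * page_size])
        for p in range(finished)
    ]


def select_pages(query_heads: List[Vector], centroids: List[Vector], k: int) -> List[int]:
    """Score each page against the head-averaged query and keep the k best page ids."""
    q_mean = mean_vectors(query_heads)
    scores = [dot(c, q_mean) for c in centroids]
    ranked = sorted(range(len(scores)), key=lambda p: scores[p], reverse=True)
    return sorted(ranked[:k])


# Three finished pages of two keys each, plus one key in an unfinished page.
keys = [[1.0, 0.0], [1.0, 0.0],
        [0.0, 1.0], [0.0, 1.0],
        [-1.0, 0.0], [-1.0, 0.0],
        [5.0, 5.0]]
centroids = page_centroids(keys, page_size=2)
query_heads = [[2.0, 0.0], [0.0, 1.0]]  # two heads, averaged to [1.0, 0.5]

print(centroids)                                # [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
print(select_pages(query_heads, centroids, 2))  # [0, 1]
```

The centroid step corresponds to what runs once per finished page, and the scoring step to what runs every decode step; in the real flow both are expressed as Vortex ops rather than Python loops.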
2. Launch SGLang with your flow
The launch script is a separate file. Importing vortex_torch is what wires
Vortex into SGLang (it installs the ServerArgs ↔ VortexConfig adapter), so the
import is required even though you don’t call it directly. Every Vortex knob lives
in a single VortexConfig; passing it turns sparsity on.
import sglang as sgl
import vortex_torch  # noqa: F401  (side-effect import: installs the VortexConfig adapter)
from vortex_torch.engine.sgl.config import VortexConfig

llm = sgl.Engine(
    model_path="Qwen/Qwen3-0.6B",
    page_size=16,
    attention_backend="flashinfer",   # mandatory: SGLang's base backend
    disable_overlap_schedule=True,    # mandatory for Vortex sparsity
    mem_fraction_static=0.85,
    vortex=VortexConfig(
        module_path="path/to/custom_sparse_attention.py",
        module_name="custom_sparse_attention",  # the @register name
        topk_val=30,                  # pages kept per query
        layers_skip=[0],              # layer 0 runs full/dense attention
        block_reserved_bos=1,         # always keep the first page (sink)
        block_reserved_eos=1,         # always keep the most-recent page
        max_seq_lens=8192,
    ),
)
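
As a rough sanity check on these settings, the sketch below works out how much of the context a budget of `topk_val` pages covers. The helper `page_budget` is hypothetical and only does arithmetic on the values above. It deliberately ignores `block_reserved_bos` and `block_reserved_eos`, since this section does not say whether reserved pages count against `topk_val`.

```python
import math
from typing import Tuple


def page_budget(seq_len: int, page_size: int, topk_val: int) -> Tuple[int, int, int]:
    """Return (total_pages, kept_pages, kept_tokens_upper_bound) for one query."""
    total_pages = math.ceil(seq_len / page_size)
    # A short sequence may have fewer pages than the budget allows.
    kept_pages = min(topk_val, total_pages)
    # Upper bound: the last page of a sequence may be only partially filled.
    return total_pages, kept_pages, kept_pages * page_size


total, kept, tokens = page_budget(seq_len=8192, page_size=16, topk_val=30)
print(total, kept, tokens)    # 512 30 480
print(f"{kept / total:.1%}")  # 5.9%
```

Under these assumptions, a sequence at the full 8192-token limit attends to at most 480 tokens' worth of pages per query in the sparse layers, while layer 0 stays dense.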
3. Generate
out = llm.generate(
    ["The capital of France is"],
    {"temperature": 0.0, "max_new_tokens": 32},
)
print(out[0]["text"])
That’s the whole loop: write a flow, point the engine at it, and generate, with
the sparse attention running as fused kernels inside SGLang. Next, see
Examples for the full VortexConfig, a programmable per-sequence
budget, MLA models, and serving over HTTP.