Quick Start

A working setup consists of two files:

  1. The flow module: a .py file that defines your sparse-attention algorithm as a vFlow subclass and registers it under a name with @register. It contains only Vortex ops; it never imports SGLang.

  2. The launch script: imports sglang and vortex_torch and starts the engine, pointing at the flow by its registered name.

1. Define a flow — custom_sparse_attention.py

A vFlow declares its cache layout in create_cache, refreshes per-page state in forward_cache, and scores/selects pages every decode step in forward_indexer. Save this anywhere on disk:

from typing import Dict
import torch

from vortex_torch.flow import vFlow, register
from vortex_torch.indexer import GeMM, Mean, topK
from vortex_torch.cache import Mean as CMean
from vortex_torch.abs import ContextBase


@register("custom_sparse_attention")
class CustomSparseAttention(vFlow):

    def __init__(self):
        super().__init__()
        # Indexer-side ops (run every decode step)
        self.mean = Mean(dim=1)        # average over the query heads
        self.gemm = GeMM()             # GeMM(x, y) = y @ xᵀ
        self.output_func = topK()      # must end in topK / approxTopK
        # Cache-side ops (run once per finished page)
        self.reduction = CMean(dim=1)  # one centroid (mean key) per page

    def forward_indexer(self, q, o, cache: Dict[str, torch.Tensor], ctx: ContextBase):
        # No native torch ops here — every tensor flows through Vortex ops.
        q_mean = self.mean(q, ctx=ctx)                          # [B, 1, D]
        score = self.gemm(q_mean, cache["centroids"], ctx=ctx)  # [S, 1, 1]
        self.output_func(score, o, ctx=ctx)                     # selected pages -> o

    def forward_cache(self, cache: Dict[str, torch.Tensor], loc, ctx: ContextBase):
        # triggered only when a page is finished
        self.reduction(cache["k"], cache["centroids"], loc=loc, ctx=ctx)

    def create_cache(self, block_size: int, head_dim: int):
        # "k" and "v" are provided automatically — do not declare them
        return {"centroids": (1, head_dim)}
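
For intuition, the flow above amounts to centroid-based page selection. The sketch below restates that idea in plain Python with no Vortex or torch dependency. The helper names (mean_vectors, page_centroids, select_pages) and the list-of-lists layout are assumptions made for illustration, not part of the Vortex API; the real flow runs as fused kernels with its own tensor layouts.

```python
# Plain-Python reference of centroid-based page selection.
# Illustrative only: names and data layout are assumptions, not Vortex's API.

def mean_vectors(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def page_centroids(keys, page_size):
    """One centroid (mean key) per finished page; a trailing partial page is skipped."""
    full_pages = len(keys) // page_size
    return [
        mean_vectors(keys[p * page_size:(p + 1) * page_size])
        for p in range(full_pages)
    ]

def select_pages(query_heads, centroids, k):
    """Score each page by dot(mean query, centroid) and return the k best page ids."""
    q_mean = mean_vectors(query_heads)  # average over the query heads
    scores = [sum(q * c for q, c in zip(q_mean, cen)) for cen in centroids]
    ranked = sorted(range(len(scores)), key=lambda p: scores[p], reverse=True)
    return sorted(ranked[:k])

# Toy data: 3 finished pages of 2 keys each (head_dim = 2), plus one unfinished key.
keys = [[1.0, 0.0], [1.0, 0.0],    # page 0 points along +x
        [0.0, 1.0], [0.0, 1.0],    # page 1 points along +y
        [-1.0, 0.0], [-1.0, 0.0],  # page 2 points along -x
        [5.0, 5.0]]                # unfinished page: no centroid yet
centroids = page_centroids(keys, page_size=2)
query_heads = [[1.0, 0.2], [0.8, 0.0]]  # two query heads
selected = select_pages(query_heads, centroids, k=2)
print(selected)  # [0, 1]
```

The query points mostly along +x, so page 0 scores highest, page 1 second, and page 2 (pointing the opposite way) is dropped. The unfinished page has no centroid, which mirrors forward_cache being triggered only when a page is finished.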

2. Launch SGLang with your flow

The launch script is a separate file. Importing vortex_torch is what wires Vortex into SGLang (it installs the adapter between SGLang’s ServerArgs and VortexConfig), so the import is required even though you don’t call it directly. Every Vortex knob lives in a single VortexConfig; passing it to the engine turns sparsity on.

import sglang as sgl
import vortex_torch  # noqa: F401 — installs the VortexConfig adapter
from vortex_torch.engine.sgl.config import VortexConfig

llm = sgl.Engine(
    model_path="Qwen/Qwen3-0.6B",
    page_size=16,
    attention_backend="flashinfer",   # mandatory: SGLang's base backend
    disable_overlap_schedule=True,    # mandatory for Vortex sparsity
    mem_fraction_static=0.85,
    vortex=VortexConfig(
        module_path="path/to/custom_sparse_attention.py",
        module_name="custom_sparse_attention",  # the @register name
        topk_val=30,            # pages kept per query
        layers_skip=[0],        # layer 0 runs full/dense attention
        block_reserved_bos=1,   # always keep the first page (sink)
        block_reserved_eos=1,   # always keep the most-recent page
        max_seq_lens=8192,
    ),
)
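
To get a feel for how much attention work this configuration leaves, the sketch below turns the page-level settings into a rough per-step token budget. It assumes the reserved first and last pages are kept in addition to topk_val; that accounting is an assumption of this sketch (the reserved_in_topk switch is hypothetical, not a Vortex option), so check the VortexConfig reference for the actual rule.

```python
import math

def attended_token_budget(seq_len, page_size, topk_val,
                          reserved_bos, reserved_eos,
                          reserved_in_topk=False):
    """Rough upper bound on tokens attended per decode step with page-level top-k.

    ASSUMPTION: reserved_in_topk is a hypothetical switch for this sketch. It
    only models the two possible ways of counting the reserved pages.
    """
    total_pages = math.ceil(seq_len / page_size)
    kept = topk_val if reserved_in_topk else topk_val + reserved_bos + reserved_eos
    kept = min(kept, total_pages)  # cannot keep more pages than exist
    return kept * page_size, total_pages

# Values taken from the launch script above, at the maximum sequence length.
tokens, total_pages = attended_token_budget(
    seq_len=8192, page_size=16, topk_val=30, reserved_bos=1, reserved_eos=1
)
print(tokens, total_pages)  # 512 512
print(tokens / 8192)        # 0.0625
```

Under that assumption, a full 8192-token context spans 512 pages, of which 32 are kept, so each sparse layer attends to at most 512 tokens (about 6% of the context). If the reserved pages instead count against topk_val, the bound drops to 480 tokens.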

3. Generate

out = llm.generate(
    ["The capital of France is"],
    {"temperature": 0.0, "max_new_tokens": 32},
)
print(out[0]["text"])

That’s the whole loop: write a flow, point the engine at it, generate — with the sparse attention running as fused kernels inside SGLang. Next, see Examples for VortexConfig in full, a programmable per-sequence budget, MLA models, and serving over HTTP.