# Quick Start

A working setup is **two files**:

1. **The flow module**: a `.py` file that *defines* your sparse-attention algorithm as a `vFlow` subclass and `@register`s it under a name. It contains only Vortex ops; it never imports SGLang.
2. **The launch script**: imports `sglang` + `vortex_torch` and starts the engine, pointing at the flow by its registered name.

## 1. Define a flow: `custom_sparse_attention.py`

A `vFlow` declares its cache layout in `create_cache`, refreshes per-page state in `forward_cache`, and scores/selects pages every decode step in `forward_indexer`. Save this anywhere on disk:

```python
from typing import Dict

import torch

from vortex_torch.flow import vFlow, register
from vortex_torch.indexer import GeMM, Mean, topK
from vortex_torch.cache import Mean as CMean
from vortex_torch.abs import ContextBase


@register("custom_sparse_attention")
class CustomSparseAttention(vFlow):
    def __init__(self):
        super().__init__()
        # Indexer-side ops (run every decode step)
        self.mean = Mean(dim=1)          # average over the query heads
        self.gemm = GeMM()               # GeMM(x, y) = y @ xᵀ
        self.output_func = topK()        # must end in topK / approxTopK
        # Cache-side ops (run once per finished page)
        self.reduction = CMean(dim=1)    # one centroid (mean key) per page

    def forward_indexer(self, q, o, cache: Dict[str, torch.Tensor], ctx: ContextBase):
        # No native torch ops here: every tensor flows through Vortex ops.
        q_mean = self.mean(q, ctx=ctx)                           # [B, 1, D]
        score = self.gemm(q_mean, cache["centroids"], ctx=ctx)   # [S, 1, 1]
        self.output_func(score, o, ctx=ctx)                      # selected pages -> o

    def forward_cache(self, cache: Dict[str, torch.Tensor], loc, ctx: ContextBase):
        # Triggered only when a page is finished
        self.reduction(cache["k"], cache["centroids"], loc=loc, ctx=ctx)

    def create_cache(self, block_size: int, head_dim: int):
        # "k" and "v" are provided automatically; do not declare them
        return {"centroids": (1, head_dim)}
```

## 2. Launch SGLang with your flow

The launch script is a **separate file**. Importing `vortex_torch` is what wires Vortex into SGLang (it installs the `ServerArgs` ↔ `VortexConfig` adapter), so the import is required even though you don't call it directly. Every Vortex knob lives in a single [`VortexConfig`](examples.md); passing it turns sparsity **on**.

```python
import sglang as sgl
import vortex_torch  # noqa: F401  (installs the VortexConfig adapter)
from vortex_torch.engine.sgl.config import VortexConfig

llm = sgl.Engine(
    model_path="Qwen/Qwen3-0.6B",
    page_size=16,
    attention_backend="flashinfer",   # mandatory: SGLang's base backend
    disable_overlap_schedule=True,    # mandatory for Vortex sparsity
    mem_fraction_static=0.85,
    vortex=VortexConfig(
        module_path="path/to/custom_sparse_attention.py",
        module_name="custom_sparse_attention",  # the @register name
        topk_val=30,                  # pages kept per query
        layers_skip=[0],              # layer 0 runs full/dense attention
        block_reserved_bos=1,         # always keep the first page (sink)
        block_reserved_eos=1,         # always keep the most-recent page
        max_seq_lens=8192,
    ),
)
```

## 3. Generate

```python
out = llm.generate(
    ["The capital of France is"],
    {"temperature": 0.0, "max_new_tokens": 32},
)
print(out[0]["text"])
```

That's the whole loop: write a flow, point the engine at it, generate, with the sparse attention running as fused kernels inside SGLang.

Next, see [Examples](examples.md) for `VortexConfig` in full, a programmable per-sequence budget, MLA models, and serving over HTTP.
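
To build intuition for what the flow in step 1 computes, here is a rough pure-Python sketch of the same idea: average the query heads, keep one mean-key centroid per finished page, score each page by a dot product, and keep the best pages. It is not Vortex code and uses none of its API. The helper names (`mean_vector`, `page_centroids`, `select_pages`) are hypothetical, and the sketch assumes the reserved first and last pages are kept in addition to the top-k pages, which this page does not confirm.

```python
from typing import List

Vector = List[float]


def mean_vector(vectors: List[Vector]) -> Vector:
    """Element-wise mean of equally sized vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def page_centroids(keys: List[Vector], page_size: int) -> List[Vector]:
    """One centroid (mean key) per finished page; a partial last page is skipped."""
    n_pages = len(keys) // page_size
    return [
        mean_vector(keys[p * page_size:(p + 1) * page_size])
        for p in range(n_pages)
    ]


def select_pages(
    query_heads: List[Vector],
    centroids: List[Vector],
    topk: int,
    reserved_bos: int = 1,
    reserved_eos: int = 1,
) -> List[int]:
    """Return sorted page indices: reserved pages plus the top-k scored pages.

    Assumption (not taken from Vortex): reserved pages do not count
    against the top-k budget.
    """
    q_mean = mean_vector(query_heads)  # average over the query heads
    scores = [sum(q * c for q, c in zip(q_mean, cen)) for cen in centroids]

    n = len(centroids)
    reserved = set(range(min(reserved_bos, n)))
    reserved |= set(range(max(n - reserved_eos, 0), n))

    candidates = sorted(
        (p for p in range(n) if p not in reserved),
        key=lambda p: scores[p],
        reverse=True,
    )
    return sorted(reserved | set(candidates[:topk]))


# Toy input: 4 pages of 2 keys each, head_dim = 2, two query heads.
keys = [[1, 0], [1, 0], [0, 1], [0, 1], [5, 5], [5, 5], [0, 0], [0, 0]]
query_heads = [[1, 0], [0, 1]]
centroids = page_centroids(keys, page_size=2)
print(select_pages(query_heads, centroids, topk=1))  # [0, 2, 3]
```

In the toy input, page 2 has the largest centroid score, so it is the single top-k pick, while pages 0 and 3 are kept as the reserved first and most-recent pages.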