Examples
This page collects the detailed recipes: configuring a run with VortexConfig,
writing a programmable per-sequence budget, targeting MLA models, and serving
over HTTP. If you haven’t yet, start with the Quick Start.
Anatomy of a flow
Every flow is a vFlow with three methods, written entirely from
vortex_torch.indexer.* / vortex_torch.cache.* ops (no native torch):
| Method | Runs | Job |
|---|---|---|
| create_cache | once at setup | Declare auxiliary per-page state (e.g. centroids). |
| forward_cache | once per finished page | Reduce keys/values into that state (cross-block reductions belong on the indexer side). |
| forward_indexer | every decode step | Score cached pages from the query and emit the selected pages. Must end in |
See vortex_torch.indexer and vortex_torch.cache for the full op
set, and vortex_torch.flow.algorithms for ready-made flows (block-sparse,
Quest-style envelopes, LServe sub-block centroids, …).
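To make the division of labour between the three methods concrete, here is a conceptual sketch in plain Python. It is not Vortex code: real flows must be composed from vortex_torch.indexer / vortex_torch.cache ops, and the helper names `page_centroid` and `select_pages` are hypothetical. It only illustrates the idea of reducing each page to a centroid and then scoring pages against the query.

```python
from typing import List

Vector = List[float]


def page_centroid(keys: List[Vector]) -> Vector:
    """forward_cache analogue: reduce one finished page of keys to its mean."""
    n = len(keys)
    return [sum(column) / n for column in zip(*keys)]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def select_pages(query: Vector, centroids: List[Vector], k: int) -> List[int]:
    """forward_indexer analogue: score every page, keep the k best page ids."""
    scores = [dot(query, c) for c in centroids]
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])


# create_cache analogue: one centroid slot per page (toy 2-dim keys).
pages = [
    [[1.0, 0.0], [3.0, 0.0]],      # page 0 -> centroid [2, 0]
    [[0.0, 1.0], [0.0, 3.0]],      # page 1 -> centroid [0, 2]
    [[-1.0, -1.0], [-3.0, -3.0]],  # page 2 -> centroid [-2, -2]
]
centroids = [page_centroid(p) for p in pages]
selected = select_pages([1.0, 0.5], centroids, k=2)
print(selected)  # pages 0 and 1 score highest for this query
```

The per-page reduction runs once per page, while the scoring step is the only work repeated at every decode step, which is why the two are kept in separate methods.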
What is VortexConfig?
VortexConfig (vortex_torch.engine.sgl.config) is a single dataclass that
holds every Vortex hyper-parameter in one place, instead of ~18 loose
vortex_* arguments scattered across SGLang’s ServerArgs. Its presence on the
engine is also the on/off switch: pass a VortexConfig and sparsity is enabled;
leave it out and the model runs ordinary dense attention.
import sglang as sgl
from vortex_torch.engine.sgl.config import VortexConfig

cfg = VortexConfig(
    module_name="custom_sparse_attention",
    topk_val=30,
    layers_skip=[0],
)
llm = sgl.Engine(model_path="Qwen/Qwen3-0.6B", attention_backend="flashinfer",
                 disable_overlap_schedule=True, vortex=cfg)
Every field, with what it controls and an example value:
| Field | Explanation | Example |
|---|---|---|
| | Path to your flow’s | |
| | The | |
| topk_val | Static page budget: fixed minimum pages each sequence keeps. The core accuracy↔throughput knob. | |
| topk_ratio | Dynamic page budget: a fraction of the sequence’s pages; engine keeps | |
| | Upper bound on the selected-page count, used to size/pick the top-k kernel. | |
| layers_skip | Layer indices that bypass sparse attention and run dense. | |
| block_reserved_bos | Pages at the start always selected (attention sink). Int ≥ 1. | |
| block_reserved_eos | Pages at the end (most recent) always selected. Int ≥ 1. | |
| max_seq_lens | Maximum sequence length to plan buffers for. | |
| block_size | Vortex page size (unit of sparsity). Power of 2; defaults to SGLang’s | |
| | Planner granularity: blocks grouped into one indexer workload. Power of 2. | |
| | dtype for intermediate indexer tensors. | |
| | Directory for the JIT kernel cache. | |
| schedule_policy | A CUDA C++ snippet computing each sequence’s page budget (see below). | |
| attention_backend | Sparse-attention kernel family: | |
| impl_backend | Indexer op implementation backend: | |
| use_tensor_core | Tensor-core ( | |
Tip
Budget recap: pages attended per sequence ≈
min(num_pages, max(topk_val + bos + eos, topk_ratio × num_pages)),
where bos and eos are block_reserved_bos and block_reserved_eos.
topk_val dominates for short sequences, topk_ratio for long ones.
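The recap formula can be tried out with a few lines of Python. This is only an arithmetic sketch of the formula as stated, not the engine's planner; the function name `page_budget`, the choice of topk_ratio=0.25, and the reserved-page values bos=eos=1 are illustrative assumptions.

```python
def page_budget(num_pages: int, topk_val: int, topk_ratio: float,
                bos: int = 1, eos: int = 1) -> int:
    """Pages attended per sequence, following the budget recap formula."""
    static_budget = topk_val + bos + eos          # fixed floor
    dynamic_budget = int(num_pages * topk_ratio)  # grows with the sequence
    return min(num_pages, max(static_budget, dynamic_budget))


for num_pages in (10, 100, 1000):
    print(num_pages, page_budget(num_pages, topk_val=30, topk_ratio=0.25))
# 10 pages   -> 10  (the sequence is shorter than the static budget)
# 100 pages  -> 32  (static budget 30 + 1 + 1 wins over 25)
# 1000 pages -> 250 (dynamic budget wins)
```

With these values the crossover sits at 128 pages: below it the static term sets the budget, above it the ratio does.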
The legacy flat form —
sgl.Engine(enable_vortex_sparsity=True, vortex_topk_val=30, vortex_module_name=..., ...)
— still works (the adapter folds those vortex_* kwargs into a VortexConfig),
but the explicit object is clearer and self-documenting.
Programmable budget: the schedule_policy
Instead of a fixed formula, the per-sequence page budget can be computed by a
CUDA C++ snippet you provide. Vortex injects it as the body of a __device__
function, JIT-compiles it into the decode planner (cached by content hash), and
runs it for every sequence on every backend. The default body is the standard
budget formula:
// default schedule_policy — returns the number of pages to attend to.
const int static_kv_budget = topk_val + block_reserved_bos + block_reserved_eos;
const int dynamic_kv_budget = int(cached_block_len * topk_ratio);
return max(static_kv_budget, dynamic_kv_budget);
The snippet must return an int. In scope: cached_block_len (the sequence’s
length in pages), topk_val, topk_ratio, block_reserved_bos,
block_reserved_eos. Because it’s real device code, you can express budgets the
scalar knobs can’t — e.g. a length-adaptive budget that grows slowly and caps:
vortex=VortexConfig(
    module_name="custom_sparse_attention",
    topk_val=32,
    schedule_policy=r"""
        // base budget + 1 extra page per 64 cached pages, capped at 256
        const int base = topk_val + block_reserved_bos + block_reserved_eos;
        const int extra = cached_block_len / 64;
        return min(base + extra, 256);
    """,
)
The planner is JIT-compiled once per distinct snippet and then served from the cache, so the policy adds no per-step compilation overhead.
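Before handing a policy to the JIT, it can help to mirror it in Python and look at the budgets it produces. The sketch below re-implements the length-adaptive snippet above on the host; the function name `adaptive_budget` is hypothetical, and block_reserved_bos=1 and block_reserved_eos=1 are assumed values rather than documented defaults. Integer division of non-negative values behaves the same in C++ (`/`) and Python (`//`).

```python
def adaptive_budget(cached_block_len: int, topk_val: int = 32,
                    block_reserved_bos: int = 1, block_reserved_eos: int = 1,
                    pages_per_extra: int = 64, cap: int = 256) -> int:
    """Host-side mirror of the length-adaptive schedule_policy snippet."""
    base = topk_val + block_reserved_bos + block_reserved_eos
    extra = cached_block_len // pages_per_extra  # 1 extra page per 64 cached pages
    return min(base + extra, cap)


for pages in (0, 640, 14207, 14208, 100_000):
    print(pages, adaptive_budget(pages))
# 0 -> 34, 640 -> 44, 14207 -> 255, 14208 -> 256, 100000 -> 256
```

Under these assumptions the cap of 256 is only reached at 14208 cached pages, which shows how slowly this particular policy grows.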
MLA models (DeepSeek-V3 / GLM-4.7 / Kimi-style)
Models with Multi-head Latent Attention compress the KV cache into a single
shared low-rank latent. Vortex supports them with a parallel base class,
vFlowMLA:
- The cache exposes one auto-provided field, cache["latent"] (the fused [ kv_c | k_pe ]); there is no "k"/"v".
- create_cache(block_size, kv_lora_rank, qk_rope_head_dim) declares only your aux tensors.
- forward_indexer receives the fused absorbed query [ q_nope_out | q_pe ]; a single dot ⟨q, centroid⟩ equals the full decode logit (RoPE included).
from typing import Dict
import torch
from vortex_torch.flow import vFlowMLA, register
from vortex_torch.indexer import GeMM, Mean, topK
from vortex_torch.cache import Mean as CMean
from vortex_torch.abs import ContextBase

@register("rope_aware_block_sparse_mla")
class RopeAwareBlockSparseMLA(vFlowMLA):
    def __init__(self):
        super().__init__()
        self.mean = Mean(dim=1)        # average the fused query over its H heads
        self.gemm = GeMM()             # per-page score
        self.output_func = topK()
        self.reduction = CMean(dim=1)  # centroid = mean of the fused latent per page

    def forward_indexer(self, q, o, cache: Dict[str, torch.Tensor], ctx: ContextBase):
        q_mean = self.mean(q, ctx=ctx)                           # [B, 1, latent_dim]
        score = self.gemm(q_mean, cache["centroids"], ctx=ctx)   # [S, 1, 1], the FULL logit
        self.output_func(score, o, ctx=ctx)

    def forward_cache(self, cache: Dict[str, torch.Tensor], loc, ctx: ContextBase):
        self.reduction(cache["latent"], cache["centroids"], loc=loc, ctx=ctx)

    def create_cache(self, block_size: int, kv_lora_rank: int, qk_rope_head_dim: int):
        # "latent" is auto-provided; declare only the aux centroid (full width).
        return {"centroids": (1, kv_lora_rank + qk_rope_head_dim)}
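The reason one dot product against a full-width centroid is enough can be checked numerically. The sketch below uses plain Python lists with hypothetical toy sizes (kv_lora_rank=3, qk_rope_head_dim=2) and made-up values; it ignores the softmax scale and the averaging over heads. It shows that the fused dot product splits into a no-RoPE term plus a RoPE term, and that scoring the mean of a page's latents gives the mean of that page's per-token logits.

```python
import math
from typing import List

Vector = List[float]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def mean_rows(rows: List[Vector]) -> Vector:
    return [sum(column) / len(rows) for column in zip(*rows)]


# Toy query halves: q_nope (width 3) and q_pe (width 2).
q_nope, q_pe = [0.5, -1.0, 2.0], [1.5, 0.25]
# One page holding two tokens, each stored as (kv_c, k_pe).
page = [
    ([1.0, 2.0, 0.0], [0.5, -1.0]),
    ([0.0, -1.0, 1.0], [2.0, 4.0]),
]

fused_q = q_nope + q_pe                               # [ q_nope | q_pe ]
fused_latents = [kv_c + k_pe for kv_c, k_pe in page]  # [ kv_c | k_pe ] per token
centroid = mean_rows(fused_latents)                   # what the cache reduction stores

# Per-token logits computed the "split" way: content term + RoPE term.
split_logits = [dot(q_nope, kv_c) + dot(q_pe, k_pe) for kv_c, k_pe in page]
mean_logit = sum(split_logits) / len(split_logits)

centroid_score = dot(fused_q, centroid)
assert math.isclose(centroid_score, mean_logit)
print(split_logits, centroid_score)
```

Because the dot product is linear, no separate RoPE-aware correction is needed as long as the centroid keeps the k_pe columns, which is why create_cache declares it at width kv_lora_rank + qk_rope_head_dim.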
Launching is the same VortexConfig flow, with the MLA decode backend on the
engine and the tensor-core indexer enabled:
import sglang as sgl
import vortex_torch # noqa: F401
from vortex_torch.engine.sgl.config import VortexConfig
llm = sgl.Engine(
    model_path="zai-org/GLM-4.7-Flash",   # any MLA model (DeepSeek-V3, Kimi, …)
    trust_remote_code=True,
    page_size=32,
    attention_backend="trtllm_mla",       # Vortex CUDA MLA decode kernel
    kv_cache_dtype="auto",
    mem_fraction_static=0.9,
    vortex=VortexConfig(
        module_name="rope_aware_block_sparse_mla",
        attention_backend="trtllm",       # 2D block-table indexer
        impl_backend="triton",
        use_tensor_core=True,
        block_size=32,
        topk_val=61,
        block_reserved_bos=1,
        block_reserved_eos=2,
        max_seq_lens=8192,
    ),
)
A runnable single-GPU MLA demo lives in examples/run_ruler_mla.py.
Server mode (OpenAI-compatible endpoint)
To serve Vortex over HTTP, use examples/server_launch.sh, which boots an SGLang
server with an OpenAI-compatible API on 127.0.0.1:30000:
# ./server_launch.sh <MODEL_NAME> <TP_SIZE>
examples/server_launch.sh Qwen/Qwen3-4B 1
Two details make server mode work:
- import vortex_torch must run first. The script doesn’t call python -m sglang.launch_server directly: that builds ServerArgs before Vortex is imported, so the adapter wouldn’t be installed yet. It imports vortex_torch, then calls SGLang’s run_server, so the ServerArgs ↔ VortexConfig adapter is in place before the args are pickled to the worker.
- Knobs are passed as JSON via --vortex-config. The per-knob --vortex-* flags no longer exist; the script writes the VortexConfig fields (prefix stripped) to a temp JSON file and feeds it through --vortex-config '<json>'. A non-null config implicitly enables sparsity.
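As a rough illustration of the "prefix stripped" step, the sketch below turns legacy vortex_* keyword names into a JSON object keyed by VortexConfig field names. The helper `to_vortex_config_json` is hypothetical and is not the script's or the adapter's actual code, and the assumption that --vortex-config accepts a flat JSON object of field names is inferred from the description above rather than confirmed.

```python
import json

PREFIX = "vortex_"


def to_vortex_config_json(legacy_kwargs: dict) -> str:
    """Keep only vortex_* entries, strip the prefix, and serialize to JSON."""
    fields = {
        key[len(PREFIX):]: value
        for key, value in legacy_kwargs.items()
        if key.startswith(PREFIX)
    }
    return json.dumps(fields, sort_keys=True)


payload = to_vortex_config_json({
    "enable_vortex_sparsity": True,  # dropped: a non-null config already enables sparsity
    "vortex_topk_val": 30,
    "vortex_module_name": "custom_sparse_attention",
})
print(payload)
```

Keeping the configuration as a single JSON value means new fields do not require new command-line flags.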
Query it like any OpenAI endpoint:
curl http://127.0.0.1:30000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{"model": "Qwen/Qwen3-4B", "messages": [{"role": "user", "content": "Hello!"}]}'
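The same request can be sent from Python with only the standard library. This is a minimal sketch that assumes the server from examples/server_launch.sh is listening on 127.0.0.1:30000 and that the reply follows the usual OpenAI chat-completions shape (choices[0].message.content); the helper names `build_chat_request` and `ask` are hypothetical.

```python
import json
import urllib.request

BASE_URL = "http://127.0.0.1:30000"


def build_chat_request(prompt: str, model: str = "Qwen/Qwen3-4B") -> urllib.request.Request:
    """Build the POST request equivalent to the curl command above."""
    body = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }).encode("utf-8")
    return urllib.request.Request(
        f"{BASE_URL}/v1/chat/completions",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(prompt: str, timeout: float = 60.0) -> str:
    """Send the request and return the assistant's reply text."""
    with urllib.request.urlopen(build_chat_request(prompt), timeout=timeout) as resp:
        reply = json.load(resp)
    return reply["choices"][0]["message"]["content"]
```

With the server running, `print(ask("Hello!"))` prints the model's answer; nothing is sent until `ask` is called.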