Examples

This page collects the detailed recipes: configuring a run with VortexConfig, writing a programmable per-sequence budget, targeting MLA models, and serving over HTTP. If you haven’t yet, start with the Quick Start.

Anatomy of a flow

Every flow is a vFlow with three methods, written entirely from vortex_torch.indexer.* / vortex_torch.cache.* ops (no native torch):

| Method | Runs | Job |
| --- | --- | --- |
| create_cache | once at setup | Declare auxiliary per-page state (e.g. centroids). "k"/"v" are auto-provided; don’t declare them. |
| forward_cache | once per finished page | Reduce keys/values into that state (cross-block reductions belong on the indexer side). |
| forward_indexer | every decode step | Score cached pages from the query and emit the selected pages. Must end in topK(...) or approxTopK(tolerate_ratio=...)(...). |

See vortex_torch.indexer and vortex_torch.cache for the full op set, and vortex_torch.flow.algorithms for ready-made flows (block-sparse, Quest-style envelopes, LServe sub-block centroids, …).
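The three-method lifecycle can be illustrated without any Vortex code. The sketch below is a plain-Python toy, not a vFlow: the class name ToyCentroidFlow, its method signatures, and the list-based cache are all hypothetical and only mirror the roles in the table above (declare state, reduce each finished page to a centroid, score pages and keep the top k).

```python
from typing import Dict, List

Vector = List[float]


class ToyCentroidFlow:
    """Hypothetical stand-in for a flow: one mean-key centroid per finished page."""

    def __init__(self, top_k: int):
        self.top_k = top_k

    def create_cache(self) -> Dict[str, List[Vector]]:
        # Runs once at setup: declare the auxiliary per-page state.
        return {"centroids": []}

    def forward_cache(self, cache: Dict[str, List[Vector]], page_keys: List[Vector]) -> None:
        # Runs once per finished page: reduce the page's keys to their mean.
        n = len(page_keys)
        centroid = [sum(column) / n for column in zip(*page_keys)]
        cache["centroids"].append(centroid)

    def forward_indexer(self, query: Vector, cache: Dict[str, List[Vector]]) -> List[int]:
        # Runs every decode step: score each page against the query, keep the top k.
        scores = [sum(q * c for q, c in zip(query, cen)) for cen in cache["centroids"]]
        ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
        return sorted(ranked[: self.top_k])


flow = ToyCentroidFlow(top_k=2)
cache = flow.create_cache()
pages = [
    [[1.0, 0.0], [1.0, 0.0]],    # centroid [1, 0]
    [[0.0, 1.0], [0.0, 3.0]],    # centroid [0, 2]
    [[-1.0, 0.0], [-3.0, 0.0]],  # centroid [-2, 0]
]
for page in pages:
    flow.forward_cache(cache, page)

selected = flow.forward_indexer([1.0, 1.0], cache)
print(selected)
```

A real flow expresses the same steps with vortex_torch.indexer.* and vortex_torch.cache.* ops rather than Python loops.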

What is VortexConfig?

VortexConfig (vortex_torch.engine.sgl.config) is a single dataclass that holds every Vortex hyper-parameter in one place, instead of ~18 loose vortex_* arguments scattered across SGLang’s ServerArgs. Its presence on the engine is also the on/off switch: pass a VortexConfig and sparsity is enabled; leave it out and the model runs ordinary dense attention.

import sglang as sgl
import vortex_torch  # noqa: F401  (must be imported so the SGLang adapter is installed)
from vortex_torch.engine.sgl.config import VortexConfig

cfg = VortexConfig(
    module_name="custom_sparse_attention",
    topk_val=30,
    layers_skip=[0],
)
llm = sgl.Engine(model_path="Qwen/Qwen3-0.6B", attention_backend="flashinfer",
                 disable_overlap_schedule=True, vortex=cfg)

Every field, with what it controls and an example value:

| Field | Explanation | Example |
| --- | --- | --- |
| module_path | Path to your flow’s .py file. None → search vortex_torch.flow.algorithms. | "submissions/custom.py" |
| module_name | The @register(...) name of the vFlow to load. Must match exactly. | "custom_sparse_attention" |
| topk_val | Static page budget: fixed minimum pages each sequence keeps. The core accuracy↔throughput knob. | 30 |
| topk_ratio | Dynamic page budget: a fraction of the sequence’s pages; engine keeps max(static floor, topk_ratio × pages). 0.0 disables it. | 0.0625 |
| max_topk_val | Upper bound on the selected-page count, used to size/pick the top-k kernel. None → derived from max_seq_lens. | 256 |
| layers_skip | Layer indices that bypass sparse attention and run dense. None → all sparse. | [0, 4, 8, 12] |
| block_reserved_bos | Pages at the start always selected (attention sink). Int ≥ 1. | 1 |
| block_reserved_eos | Pages at the end (most recent) always selected. Int ≥ 1. | 1 |
| max_seq_lens | Maximum sequence length to plan buffers for. -1 → model default. | 8192 |
| block_size | Vortex page size (unit of sparsity). Power of 2; defaults to SGLang’s page_size. | 16 |
| workload_chunk_size | Planner granularity: blocks grouped into one indexer workload. Power of 2. | 32 |
| dtype | dtype for intermediate indexer tensors. | "bfloat16" |
| compilation_cache_dir | Directory for the JIT kernel cache. None → next to the compiler module. | "~/.vortex_compilation_cache" |
| schedule_policy | A CUDA C++ snippet computing each sequence’s page budget (see below). None → default formula. | None |
| attention_backend | Sparse-attention kernel family: "flashinfer" (default) or "trtllm". | "flashinfer" |
| impl_backend | Indexer op implementation backend: "triton" (default) or "cuda". | "triton" |
| use_tensor_core | Tensor-core (bf16 tl.dot) codegen in the Triton kernel. Triton-only. | False |
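Several fields in the table carry constraints (powers of 2, values ≥ 1, a fixed set of backend names). The following is a minimal pre-flight sketch that checks a plain dict of field values against those stated constraints before building a config. The helper check_vortex_fields is hypothetical and is not part of vortex_torch; whether VortexConfig performs similar validation itself is not stated here.

```python
def is_power_of_two(n: int) -> bool:
    """True for 1, 2, 4, 8, ... using the single-set-bit trick."""
    return n > 0 and (n & (n - 1)) == 0


def check_vortex_fields(fields: dict) -> list:
    """Hypothetical helper: return a list of problems; empty means all checks passed."""
    problems = []
    for name in ("block_size", "workload_chunk_size"):
        value = fields.get(name)
        if value is not None and not is_power_of_two(value):
            problems.append(f"{name}={value} is not a power of 2")
    for name in ("block_reserved_bos", "block_reserved_eos"):
        value = fields.get(name)
        if value is not None and value < 1:
            problems.append(f"{name}={value} must be >= 1")
    allowed = {
        "attention_backend": ("flashinfer", "trtllm"),
        "impl_backend": ("triton", "cuda"),
    }
    for name, choices in allowed.items():
        value = fields.get(name)
        if value is not None and value not in choices:
            problems.append(f"{name}={value!r} is not one of {choices}")
    # use_tensor_core is documented as Triton-only; "triton" is the documented default.
    if fields.get("use_tensor_core") and fields.get("impl_backend", "triton") != "triton":
        problems.append("use_tensor_core requires impl_backend='triton'")
    return problems


good = {"block_size": 16, "workload_chunk_size": 32, "block_reserved_bos": 1}
bad = {"block_size": 24, "block_reserved_eos": 0, "impl_backend": "cuda", "use_tensor_core": True}
print(check_vortex_fields(good))
print(check_vortex_fields(bad))
```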

Tip

Budget recap: pages attended per sequence ≈ min(num_pages, max(topk_val + bos + eos, topk_ratio × num_pages)). topk_val dominates short sequences, topk_ratio long ones.
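To see where the two knobs cross over, the recap formula can be evaluated directly. This is a sketch of the arithmetic only: the function name pages_attended is made up for illustration, and the bos/eos defaults of 1 are assumed from the example column of the table rather than from documented defaults.

```python
def pages_attended(num_pages: int, topk_val: int, topk_ratio: float = 0.0,
                   bos: int = 1, eos: int = 1) -> int:
    """Approximate pages attended per sequence, following the budget recap."""
    static_budget = topk_val + bos + eos          # floor, dominates short sequences
    dynamic_budget = int(num_pages * topk_ratio)  # fraction, dominates long sequences
    return min(num_pages, max(static_budget, dynamic_budget))


for n in (20, 100, 512, 1024):
    print(n, pages_attended(n, topk_val=30, topk_ratio=0.0625))
```

With topk_val=30 and topk_ratio=0.0625, the static floor is 32 pages, so the ratio only takes over once a sequence exceeds 512 pages (32 / 0.0625).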

The legacy flat form — sgl.Engine(enable_vortex_sparsity=True, vortex_topk_val=30, vortex_module_name=..., ...) — still works (the adapter folds those vortex_* kwargs into a VortexConfig), but the explicit object is clearer and self-documenting.
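The folding of flat kwargs into one config can be pictured as a prefix-stripping pass. The sketch below illustrates that idea only; split_legacy_kwargs is a hypothetical function and does not reproduce the adapter's actual code or its handling of edge cases.

```python
def split_legacy_kwargs(kwargs: dict):
    """Hypothetical illustration: separate vortex_* kwargs from ordinary engine kwargs.

    Returns (enabled, vortex_fields, engine_kwargs), where vortex_fields has the
    "vortex_" prefix stripped so its keys line up with VortexConfig field names.
    """
    prefix = "vortex_"
    enabled = bool(kwargs.get("enable_vortex_sparsity", False))
    vortex_fields = {}
    engine_kwargs = {}
    for key, value in kwargs.items():
        if key == "enable_vortex_sparsity":
            continue  # consumed above as the on/off switch
        if key.startswith(prefix):
            vortex_fields[key[len(prefix):]] = value
        else:
            engine_kwargs[key] = value
    return enabled, vortex_fields, engine_kwargs


enabled, vortex_fields, engine_kwargs = split_legacy_kwargs({
    "model_path": "Qwen/Qwen3-0.6B",
    "enable_vortex_sparsity": True,
    "vortex_topk_val": 30,
    "vortex_module_name": "custom_sparse_attention",
})
print(enabled, vortex_fields, engine_kwargs)
```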

Programmable budget — the schedule_policy

Instead of a fixed formula, the per-sequence page budget can be computed by a CUDA C++ snippet you provide. Vortex injects it as the body of a __device__ function, JIT-compiles it into the decode planner (cached by content hash), and runs it for every sequence on every backend. The default body is the standard budget formula:

// default schedule_policy — returns the number of pages to attend to.
const int static_kv_budget  = topk_val + block_reserved_bos + block_reserved_eos;
const int dynamic_kv_budget = int(cached_block_len * topk_ratio);
return max(static_kv_budget, dynamic_kv_budget);

The snippet must return an int. In scope: cached_block_len (the sequence’s length in pages), topk_val, topk_ratio, block_reserved_bos, block_reserved_eos. Because it’s real device code, you can express budgets the scalar knobs can’t — e.g. a length-adaptive budget that grows slowly and caps:

from vortex_torch.engine.sgl.config import VortexConfig

cfg = VortexConfig(
    module_name="custom_sparse_attention",
    topk_val=32,
    schedule_policy=r"""
        // base budget + 1 extra page per 64 cached pages, capped at 256
        const int base  = topk_val + block_reserved_bos + block_reserved_eos;
        const int extra = cached_block_len / 64;
        return min(base + extra, 256);
    """,
)
# Pass it to the engine as sgl.Engine(..., vortex=cfg).
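Before committing a policy to device code, it can help to tabulate what it returns. The Python mirror below reproduces the length-adaptive snippet above line for line. It assumes block_reserved_bos = block_reserved_eos = 1 (the example values from the field table, not confirmed defaults), and the function name is illustrative.

```python
def length_adaptive_budget(cached_block_len: int, topk_val: int = 32,
                           block_reserved_bos: int = 1,
                           block_reserved_eos: int = 1) -> int:
    """Python mirror of the CUDA snippet: base + 1 page per 64 cached pages, capped at 256."""
    base = topk_val + block_reserved_bos + block_reserved_eos
    extra = cached_block_len // 64  # matches C++ int division for non-negative lengths
    return min(base + extra, 256)


for pages in (0, 64, 640, 14207, 14208, 100000):
    print(pages, length_adaptive_budget(pages))
```

Under these assumptions the base is 34 pages, so the cap of 256 is first reached at 222 × 64 = 14208 cached pages.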

The planner is JIT-compiled once per distinct snippet, so compilation adds no per-step overhead.
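Caching by content hash means two configs with byte-identical snippets share one compiled planner. A minimal sketch of that pattern follows. The SnippetCache class, the choice of SHA-256, and the in-memory dict are assumptions for illustration; Vortex's actual cache layout and hash function are not described here.

```python
import hashlib
from typing import Callable, Dict


class SnippetCache:
    """Hypothetical content-addressed cache: compile each distinct snippet once."""

    def __init__(self, compile_fn: Callable[[str], object]):
        self._compile_fn = compile_fn
        self._compiled: Dict[str, object] = {}
        self.compile_count = 0

    def get(self, snippet: str) -> object:
        # Key on the snippet's bytes, so any edit (even whitespace) is a new entry.
        key = hashlib.sha256(snippet.encode("utf-8")).hexdigest()
        if key not in self._compiled:
            self._compiled[key] = self._compile_fn(snippet)
            self.compile_count += 1
        return self._compiled[key]


# Stand-in "compiler" that just records the snippet length.
snippet_cache = SnippetCache(compile_fn=lambda src: {"size": len(src)})
policy = "return min(topk_val + cached_block_len / 64, 256);"
first = snippet_cache.get(policy)
second = snippet_cache.get(policy)          # cache hit, no recompilation
third = snippet_cache.get(policy + " ")     # different bytes, compiled again
print(snippet_cache.compile_count)
```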

MLA models (DeepSeek-V3 / GLM-4.7 / Kimi-style)

Models with Multi-head Latent Attention compress the KV cache into a single shared low-rank latent. Vortex supports them with a parallel base class, vFlowMLA:

  • The cache exposes one auto-provided field, cache["latent"] (the fused [ kv_c | k_pe ]) — there is no "k"/"v".

  • create_cache(block_size, kv_lora_rank, qk_rope_head_dim) declares only your aux tensors.

  • forward_indexer receives the fused absorbed query [ q_nope_out | q_pe ]; a single dot ⟨q, centroid⟩ equals the full decode logit (RoPE included).

from typing import Dict
import torch

from vortex_torch.flow import vFlowMLA, register
from vortex_torch.indexer import GeMM, Mean, topK
from vortex_torch.cache import Mean as CMean
from vortex_torch.abs import ContextBase


@register("rope_aware_block_sparse_mla")
class RopeAwareBlockSparseMLA(vFlowMLA):

    def __init__(self):
        super().__init__()
        self.mean = Mean(dim=1)        # average the fused query over its H heads
        self.gemm = GeMM()             # per-page score
        self.output_func = topK()
        self.reduction = CMean(dim=1)  # centroid = mean of the fused latent per page

    def forward_indexer(self, q, o, cache: Dict[str, torch.Tensor], ctx: ContextBase):
        q_mean = self.mean(q, ctx=ctx)                          # [B, 1, latent_dim]
        score = self.gemm(q_mean, cache["centroids"], ctx=ctx)  # [S, 1, 1] — FULL logit
        self.output_func(score, o, ctx=ctx)

    def forward_cache(self, cache: Dict[str, torch.Tensor], loc, ctx: ContextBase):
        self.reduction(cache["latent"], cache["centroids"], loc=loc, ctx=ctx)

    def create_cache(self, block_size: int, kv_lora_rank: int, qk_rope_head_dim: int):
        # "latent" is auto-provided — declare only the aux centroid (full width).
        return {"centroids": (1, kv_lora_rank + qk_rope_head_dim)}
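Two linear-algebra facts underlie the scoring in forward_indexer above, and both can be checked with small numbers. First, a dot product over the concatenation [ kv_c | k_pe ] equals the sum of the two partial dot products. Second, the score against a page's mean latent equals the mean of the per-token scores in that page. The sketch uses plain Python lists with made-up values and ignores any softmax scaling.

```python
import math
from typing import List


def dot(a: List[float], b: List[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


# Fused absorbed query [ q_nope_out | q_pe ], illustrative values only.
q_nope_out, q_pe = [1.0, 2.0], [3.0]
q = q_nope_out + q_pe

# Two cached tokens in one page, each stored as a fused latent [ kv_c | k_pe ].
tokens = [
    ([1.0, 0.0], [2.0]),
    ([3.0, 2.0], [4.0]),
]
latents = [kv_c + k_pe for kv_c, k_pe in tokens]

# Fact 1: the fused dot equals the no-RoPE part plus the RoPE part.
fused = [dot(q, latent) for latent in latents]
split = [dot(q_nope_out, kv_c) + dot(q_pe, k_pe) for kv_c, k_pe in tokens]

# Fact 2: scoring the page centroid equals averaging the per-token scores.
centroid = [sum(column) / len(latents) for column in zip(*latents)]
centroid_score = dot(q, centroid)
mean_token_score = sum(fused) / len(fused)

print(fused, split, centroid_score, mean_token_score)
```

This is why the flow can keep a single full-width centroid per page, of size kv_lora_rank + qk_rope_head_dim, and score it with one GeMM.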

Launching is the same VortexConfig flow, with the MLA decode backend on the engine and the tensor-core indexer enabled:

import sglang as sgl
import vortex_torch  # noqa: F401
from vortex_torch.engine.sgl.config import VortexConfig

llm = sgl.Engine(
    model_path="zai-org/GLM-4.7-Flash",   # any MLA model (DeepSeek-V3, Kimi, …)
    trust_remote_code=True,
    page_size=32,
    attention_backend="trtllm_mla",       # Vortex CUDA MLA decode kernel
    kv_cache_dtype="auto",
    mem_fraction_static=0.9,
    vortex=VortexConfig(
        module_name="rope_aware_block_sparse_mla",
        attention_backend="trtllm",       # 2D block-table indexer
        impl_backend="triton",
        use_tensor_core=True,
        block_size=32,
        topk_val=61,
        block_reserved_bos=1,
        block_reserved_eos=2,
        max_seq_lens=8192,
    ),
)

A runnable single-GPU MLA demo lives in examples/run_ruler_mla.py.

Server mode (OpenAI-compatible endpoint)

To serve Vortex over HTTP, use examples/server_launch.sh, which boots an SGLang server with an OpenAI-compatible API on 127.0.0.1:30000:

# ./server_launch.sh <MODEL_NAME> <TP_SIZE>
examples/server_launch.sh Qwen/Qwen3-4B 1

Two details make server mode work:

  1. import vortex_torch must run first. The script doesn’t call python -m sglang.launch_server directly, because that builds ServerArgs before Vortex is imported, so the adapter wouldn’t be installed yet. Instead it imports vortex_torch and then calls SGLang’s run_server, so the ServerArgs → VortexConfig adapter is in place before the args are pickled to the worker.

  2. Knobs are passed as JSON via --vortex-config. The per-knob --vortex-* flags no longer exist; the script writes the VortexConfig fields (prefix stripped) to a temp JSON file and feeds it through --vortex-config '<json>'. A non-null config implicitly enables sparsity.
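The JSON side of step 2 can be sketched in Python. The snippet below only shows how a dict of VortexConfig field names serializes with the standard json module. The field selection is illustrative, and how the launch script hands the result to --vortex-config is not reproduced here.

```python
import json

# Keys are VortexConfig field names, i.e. without the legacy "vortex_" prefix.
fields = {
    "module_name": "custom_sparse_attention",
    "topk_val": 30,
    "layers_skip": [0],
    "schedule_policy": None,  # Python None serializes to JSON null
}

payload = json.dumps(fields)
print(payload)

# Round-trip check: the server side should recover the same values.
restored = json.loads(payload)
```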

Query it like any OpenAI endpoint:

curl http://127.0.0.1:30000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "Qwen/Qwen3-4B", "messages": [{"role": "user", "content": "Hello!"}]}'
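The same request can be issued from Python with only the standard library. This is a minimal sketch that assumes the server from examples/server_launch.sh is listening on 127.0.0.1:30000. The helper names build_chat_request and send are illustrative, and the call that contacts the server is left commented out so the block can be read and imported without a running endpoint.

```python
import json
import urllib.request


def build_chat_request(prompt: str, model: str = "Qwen/Qwen3-4B",
                       base_url: str = "http://127.0.0.1:30000") -> urllib.request.Request:
    """Build the POST request equivalent to the curl command above."""
    body = {"model": model, "messages": [{"role": "user", "content": prompt}]}
    return urllib.request.Request(
        url=f"{base_url}/v1/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request: urllib.request.Request, timeout: float = 60.0) -> dict:
    """Send the request and decode the JSON response body."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)


request = build_chat_request("Hello!")
# With the server running:
# reply = send(request)
# print(reply["choices"][0]["message"]["content"])
```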