# Examples

This page collects the detailed recipes: configuring a run with `VortexConfig`,
writing a programmable per-sequence budget, targeting **MLA** models, and serving
over HTTP. If you haven't yet, start with the [Quick Start](quickstart.md).

## Anatomy of a flow

Every flow is a `vFlow` with three methods, written entirely from
`vortex_torch.indexer.*` / `vortex_torch.cache.*` ops (no native torch):

| Method | Runs | Job |
| --- | --- | --- |
| `create_cache` | once at setup | Declare auxiliary per-page state (e.g. centroids). `"k"`/`"v"` are auto-provided — don't declare them. |
| `forward_cache` | once per finished page | Reduce keys/values into that state (cross-block reductions belong on the indexer side). |
| `forward_indexer` | every decode step | Score cached pages from the query and emit the selected pages. **Must end in** `topK(...)` or `approxTopK(tolerate_ratio=...)(...)`. |

See {mod}`vortex_torch.indexer` and {mod}`vortex_torch.cache` for the full op
set, and {mod}`vortex_torch.flow.algorithms` for ready-made flows (block-sparse,
Quest-style envelopes, LServe sub-block centroids, …).
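
To make the division of labour concrete, here is a conceptual sketch of what the three methods compute, written in plain Python rather than Vortex ops. Lists stand in for tensors, the numbers are made up, and the helper names (`page_centroid`, `score_pages`, `select_top_k`) are illustrative only, not part of `vortex_torch`.

```python
# Conceptual sketch only: plain Python lists stand in for tensors, and none of
# these helper names belong to the vortex_torch API.

def page_centroid(page_keys):
    """forward_cache analogue: reduce one finished page of keys to its mean."""
    dim = len(page_keys[0])
    return [sum(k[d] for k in page_keys) / len(page_keys) for d in range(dim)]


def score_pages(query, centroids):
    """forward_indexer analogue, step 1: one dot-product score per page."""
    return [sum(q * c for q, c in zip(query, cen)) for cen in centroids]


def select_top_k(scores, k):
    """forward_indexer analogue, step 2: indices of the k best-scoring pages."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])


# Three pages holding two 2-d keys each (made-up values).
pages = [
    [[1.0, 0.0], [1.0, 0.0]],  # page 0 points along x
    [[0.0, 1.0], [0.0, 1.0]],  # page 1 points along y
    [[0.5, 0.5], [0.5, 0.5]],  # page 2 sits in between
]

# create_cache would declare this per-page state; forward_cache fills it in.
centroids = [page_centroid(p) for p in pages]

# Every decode step: score the pages against the query, then keep the best k.
scores = score_pages([1.0, 0.2], centroids)
selected = select_top_k(scores, k=2)
```

In a real flow each of these steps is an op from `vortex_torch.cache` or `vortex_torch.indexer`, and the final selection is the mandatory `topK(...)` call.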

## What is `VortexConfig`?

`VortexConfig` ({mod}`vortex_torch.engine.sgl.config`) is a single dataclass that
holds **every** Vortex hyper-parameter in one place, instead of ~18 loose
`vortex_*` arguments scattered across SGLang's `ServerArgs`. Its presence on the
engine is also the on/off switch: pass a `VortexConfig` and sparsity is enabled;
leave it out and the model runs ordinary dense attention.

```python
import sglang as sgl
import vortex_torch  # noqa: F401
from vortex_torch.engine.sgl.config import VortexConfig

cfg = VortexConfig(
    module_name="custom_sparse_attention",
    topk_val=30,
    layers_skip=[0],
)
llm = sgl.Engine(model_path="Qwen/Qwen3-0.6B", attention_backend="flashinfer",
                 disable_overlap_schedule=True, vortex=cfg)
```

Every field, with what it controls and an example value:

| Field | Explanation | Example |
| --- | --- | --- |
| `module_path` | Path to your flow's `.py` file. `None` → search `vortex_torch.flow.algorithms`. | `"submissions/custom.py"` |
| `module_name` | The `@register(...)` name of the `vFlow` to load. Must match exactly. | `"custom_sparse_attention"` |
| `topk_val` | **Static page budget**: the fixed minimum number of pages each sequence keeps, on top of the reserved `bos`/`eos` pages. The core accuracy↔throughput knob. | `30` |
| `topk_ratio` | **Dynamic page budget**: a fraction of the sequence's pages. The engine keeps `max(topk_val + block_reserved_bos + block_reserved_eos, topk_ratio × pages)`. `0.0` disables it. | `0.0625` |
| `max_topk_val` | Upper bound on the selected-page count, used to size/pick the top-k kernel. `None` → derived from `max_seq_lens`. | `256` |
| `layers_skip` | Layer indices that **bypass sparse attention and run dense**. `None` → all sparse. | `[0, 4, 8, 12]` |
| `block_reserved_bos` | Pages at the **start** always selected (attention sink). Int ≥ 1. | `1` |
| `block_reserved_eos` | Pages at the **end** (most recent) always selected. Int ≥ 1. | `1` |
| `max_seq_lens` | Maximum sequence length to plan buffers for. `-1` → model default. | `8192` |
| `block_size` | Vortex **page size** (unit of sparsity). Power of 2; defaults to SGLang's `page_size`. | `16` |
| `workload_chunk_size` | Planner granularity — blocks grouped into one indexer workload. Power of 2. | `32` |
| `dtype` | dtype for **intermediate** indexer tensors. | `"bfloat16"` |
| `compilation_cache_dir` | Directory for the JIT kernel cache. `None` → next to the compiler module. | `"~/.vortex_compilation_cache"` |
| `schedule_policy` | A CUDA C++ snippet computing each sequence's page budget (see below). `None` → default formula. | `None` |
| `attention_backend` | Sparse-attention kernel family: `"flashinfer"` (default) or `"trtllm"`. | `"flashinfer"` |
| `impl_backend` | Indexer op implementation backend: `"triton"` (default) or `"cuda"`. | `"triton"` |
| `use_tensor_core` | Tensor-core (`bf16 tl.dot`) codegen in the Triton kernel. Triton-only. | `False` |

```{tip}
**Budget recap:** pages attended per sequence ≈
`min(num_pages, max(topk_val + bos + eos, topk_ratio × num_pages))`.
`topk_val` dominates short sequences, `topk_ratio` long ones.
```
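
As a worked example of the recap, the following host-side sketch evaluates that formula for a few sequence lengths. The function `pages_attended` is a hypothetical helper written for this page, not a Vortex API, and the knob values are the example values from the table.

```python
def pages_attended(num_pages, topk_val, topk_ratio, bos, eos):
    """Approximate pages attended per sequence, following the budget recap."""
    static_budget = topk_val + bos + eos          # fixed floor
    dynamic_budget = int(num_pages * topk_ratio)  # grows with the sequence
    return min(num_pages, max(static_budget, dynamic_budget))


knobs = dict(topk_val=30, topk_ratio=0.0625, bos=1, eos=1)

short = pages_attended(16, **knobs)       # fewer pages than the floor: keep all 16
medium = pages_attended(256, **knobs)     # static floor wins: 30 + 1 + 1 = 32
crossover = pages_attended(512, **knobs)  # 512 * 0.0625 = 32, the two budgets meet
long = pages_attended(4096, **knobs)      # dynamic budget wins: 256
```

With these values the ratio takes over once a sequence exceeds 512 pages, which is where `topk_ratio × num_pages` passes the static floor of 32.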

The legacy flat form —
`sgl.Engine(enable_vortex_sparsity=True, vortex_topk_val=30, vortex_module_name=..., ...)`
— still works (the adapter folds those `vortex_*` kwargs into a `VortexConfig`),
but the explicit object is clearer and self-documenting.
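
The folding step can be pictured as a prefix strip. The sketch below is an assumption about the shape of that mapping, not the adapter's actual code: `fold_vortex_kwargs` is a hypothetical helper, and it returns a plain dict of field values instead of constructing a `VortexConfig`.

```python
def fold_vortex_kwargs(kwargs):
    """Split flat engine kwargs into (other kwargs, config fields or None).

    Hypothetical illustration of the legacy-to-VortexConfig mapping.
    """
    rest = dict(kwargs)
    enabled = rest.pop("enable_vortex_sparsity", False)
    fields = {}
    for key in list(rest):
        if key.startswith("vortex_"):
            # "vortex_topk_val" -> "topk_val"
            fields[key[len("vortex_"):]] = rest.pop(key)
    # No config at all means dense attention, mirroring the on/off rule above.
    return rest, (fields if enabled else None)


legacy = {
    "model_path": "Qwen/Qwen3-0.6B",
    "enable_vortex_sparsity": True,
    "vortex_topk_val": 30,
    "vortex_module_name": "custom_sparse_attention",
}
engine_kwargs, config_fields = fold_vortex_kwargs(legacy)
```

Here `config_fields` holds `topk_val` and `module_name`, which are exactly the arguments you would pass to `VortexConfig(...)` in the explicit form.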

## Programmable budget — the `schedule_policy`

Instead of a fixed formula, the per-sequence **page budget can be computed by a
CUDA C++ snippet you provide**. Vortex injects it as the body of a `__device__`
function, JIT-compiles it into the decode planner (cached by content hash), and
runs it for every sequence on every backend. The default body *is* the standard
budget formula:

```cpp
// default schedule_policy — returns the number of pages to attend to.
const int static_kv_budget  = topk_val + block_reserved_bos + block_reserved_eos;
const int dynamic_kv_budget = int(cached_block_len * topk_ratio);
return max(static_kv_budget, dynamic_kv_budget);
```

The snippet must `return` an `int`. In scope: `cached_block_len` (the sequence's
length in pages), `topk_val`, `topk_ratio`, `block_reserved_bos`,
`block_reserved_eos`. Because it's real device code, you can express budgets the
scalar knobs can't — e.g. a length-adaptive budget that grows slowly and caps:

```python
from vortex_torch.engine.sgl.config import VortexConfig

cfg = VortexConfig(
    module_name="custom_sparse_attention",
    topk_val=32,
    schedule_policy=r"""
        // base budget + 1 extra page per 64 cached pages, capped at 256
        const int base  = topk_val + block_reserved_bos + block_reserved_eos;
        const int extra = cached_block_len / 64;
        return min(base + extra, 256);
    """,
)
# Pass it to the engine as before: sgl.Engine(..., vortex=cfg)

Each distinct snippet is JIT-compiled once and cached by its content hash, so
decode steps pay no recompilation cost.
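
Before handing a policy to the engine, it can help to check its arithmetic on the host. The sketch below mirrors the length-adaptive snippet above in Python. `capped_growth_policy` is a hypothetical name, and the reserved-page values (`bos=1`, `eos=1`) are assumed for illustration. Python's `//` matches C++ integer division here because every operand is non-negative.

```python
def capped_growth_policy(cached_block_len, topk_val, bos, eos):
    """Host-side mirror of the length-adaptive schedule_policy body."""
    base = topk_val + bos + eos
    extra = cached_block_len // 64   # 1 extra page per 64 cached pages
    return min(base + extra, 256)    # hard cap


# Budget at a few sequence lengths (in pages), with topk_val=32.
budgets = {
    pages: capped_growth_policy(pages, topk_val=32, bos=1, eos=1)
    for pages in (64, 1024, 8192, 16384)
}
```

With these values the budget starts at 35 pages for a 64-page sequence, reaches 162 at 8192 pages, and hits the cap of 256 at 16384 pages. The policy itself does not clamp to the sequence length, so it can return more pages than a short sequence has.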

## MLA models (DeepSeek-V3 / GLM-4.7 / Kimi-style)

Models with **Multi-head Latent Attention** compress the KV cache into a single
shared low-rank *latent*. Vortex supports them with a parallel base class,
{class}`~vortex_torch.flow.flow_mla.vFlowMLA`:

- The cache exposes **one auto-provided field, `cache["latent"]`** (the fused
  `[ kv_c | k_pe ]`) — there is no `"k"`/`"v"`.
- `create_cache(block_size, kv_lora_rank, qk_rope_head_dim)` declares only your
  aux tensors.
- `forward_indexer` receives the **fused absorbed query** `[ q_nope_out | q_pe ]`;
  a single dot `⟨q, centroid⟩` equals the full decode logit (RoPE included).

```python
from typing import Dict
import torch

from vortex_torch.flow import vFlowMLA, register
from vortex_torch.indexer import GeMM, Mean, topK
from vortex_torch.cache import Mean as CMean
from vortex_torch.abs import ContextBase


@register("rope_aware_block_sparse_mla")
class RopeAwareBlockSparseMLA(vFlowMLA):

    def __init__(self):
        super().__init__()
        self.mean = Mean(dim=1)        # average the fused query over its H heads
        self.gemm = GeMM()             # per-page score
        self.output_func = topK()
        self.reduction = CMean(dim=1)  # centroid = mean of the fused latent per page

    def forward_indexer(self, q, o, cache: Dict[str, torch.Tensor], ctx: ContextBase):
        q_mean = self.mean(q, ctx=ctx)                          # [B, 1, latent_dim]
        score = self.gemm(q_mean, cache["centroids"], ctx=ctx)  # [S, 1, 1] — FULL logit
        self.output_func(score, o, ctx=ctx)

    def forward_cache(self, cache: Dict[str, torch.Tensor], loc, ctx: ContextBase):
        self.reduction(cache["latent"], cache["centroids"], loc=loc, ctx=ctx)

    def create_cache(self, block_size: int, kv_lora_rank: int, qk_rope_head_dim: int):
        # "latent" is auto-provided — declare only the aux centroid (full width).
        return {"centroids": (1, kv_lora_rank + qk_rope_head_dim)}
```
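
The reason one fused dot product suffices is linearity. The following plain-Python check uses made-up numbers to show that the dot product of the concatenated vectors equals the sum of the two partial dot products, and that scoring against a page's mean latent equals the mean of its per-token logits. It ignores any softmax scaling factor and is independent of the Vortex API.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


# Made-up absorbed query parts and per-token latents for one page.
q_nope_out, q_pe = [1.0, 2.0, 3.0], [0.5, -1.0]
tokens = [  # (kv_c, k_pe) for each token in the page
    ([1.0, 0.0, 1.0], [2.0, 0.0]),
    ([0.0, 1.0, 0.0], [0.0, 4.0]),
]

q = q_nope_out + q_pe                              # fused [ q_nope_out | q_pe ]
latents = [kv_c + k_pe for kv_c, k_pe in tokens]   # fused [ kv_c | k_pe ]

# Two partial dot products versus one fused dot product per token.
split_logits = [dot(q_nope_out, kv_c) + dot(q_pe, k_pe) for kv_c, k_pe in tokens]
fused_logits = [dot(q, lat) for lat in latents]

# Centroid = mean of the fused latents; its score is the mean token logit.
centroid = [sum(col) / len(latents) for col in zip(*latents)]
centroid_score = dot(q, centroid)
mean_logit = sum(fused_logits) / len(fused_logits)
```

Averaging the query over heads, as `self.mean` does in the flow above, is another linear step, so the resulting score is the head-averaged version of the same quantity.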

Launching is the same `VortexConfig` flow, with the **MLA decode backend** on the
engine and the tensor-core indexer enabled:

```python
import sglang as sgl
import vortex_torch  # noqa: F401
from vortex_torch.engine.sgl.config import VortexConfig

llm = sgl.Engine(
    model_path="zai-org/GLM-4.7-Flash",   # any MLA model (DeepSeek-V3, Kimi, …)
    trust_remote_code=True,
    page_size=32,
    attention_backend="trtllm_mla",       # Vortex CUDA MLA decode kernel
    kv_cache_dtype="auto",
    mem_fraction_static=0.9,
    vortex=VortexConfig(
        module_name="rope_aware_block_sparse_mla",
        attention_backend="trtllm",       # 2D block-table indexer
        impl_backend="triton",
        use_tensor_core=True,
        block_size=32,
        topk_val=61,
        block_reserved_bos=1,
        block_reserved_eos=2,
        max_seq_lens=8192,
    ),
)
```

A runnable single-GPU MLA demo lives in `examples/run_ruler_mla.py`.

## Server mode (OpenAI-compatible endpoint)

To serve Vortex over HTTP, use `examples/server_launch.sh`, which boots an SGLang
server with an OpenAI-compatible API on `127.0.0.1:30000`:

```bash
# ./server_launch.sh <MODEL_NAME> <TP_SIZE>
examples/server_launch.sh Qwen/Qwen3-4B 1
```

Two details make server mode work:

1. **`import vortex_torch` must run first.** The script doesn't call
   `python -m sglang.launch_server` directly — that builds `ServerArgs` before
   Vortex is imported, so the adapter wouldn't be installed yet. It imports
   `vortex_torch`, then calls SGLang's `run_server`, so the `ServerArgs` ↔
   `VortexConfig` adapter is in place before the args are pickled to the worker.
2. **Knobs are passed as JSON via `--vortex-config`.** The per-knob `--vortex-*`
   flags no longer exist; the script writes the `VortexConfig` fields (prefix
   stripped) to a temp JSON file and feeds it through `--vortex-config '<json>'`.
   A non-null config implicitly enables sparsity.
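
If you build the launch command yourself, the JSON payload can be produced with the standard library. This is a minimal sketch that assumes the JSON keys are the `VortexConfig` field names from the table above. It only assembles the argument string and does not start a server.

```python
import json
import shlex

# VortexConfig fields, named without the vortex_ prefix.
vortex_fields = {
    "module_name": "custom_sparse_attention",
    "topk_val": 30,
    "layers_skip": [0],
}

payload = json.dumps(vortex_fields)

# shlex.quote wraps the JSON in single quotes so the shell passes it through
# as one argument.
cli_fragment = "--vortex-config " + shlex.quote(payload)
```

Quoting matters because the payload contains double quotes, spaces, and brackets that a shell would otherwise split or interpret.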

Query it like any OpenAI endpoint:

```bash
curl http://127.0.0.1:30000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "Qwen/Qwen3-4B", "messages": [{"role": "user", "content": "Hello!"}]}'
```