vortex_torch.indexer.context

Functions

get_ctx()

Classes

Context()

Static, single-instance indexer context.

class Context

Bases: ContextBase

Static, single-instance indexer context.

Holds only configuration that is fixed for the lifetime of the compiled indexer: shapes, page/block sizes, head counts, allocation budgets, codegen scratch state (op/tensor lists), backend identity, …

Per-forward-batch buffers (winfo_*, dense/sparse_kv_indptr and dense/sparse_kv_indices, dense/sparse_seqlens, dense/sparse_block_tables, kv_last_page_len) and batch_size live on a separate MetaData object, which is pre-allocated in the attention backend's __init__ and exposed as ctx.metadata. Codegen emits ctx.metadata.<field> for any value that varies between forward batches, so the compiled code reads the current batch's buffers at call time instead of baking them in.
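To make the static/per-batch split concrete, here is a minimal sketch of how a code generator could route references: per-batch fields become `ctx.metadata.<field>`, while static fields are read from `ctx` directly. The helpers `emit_arg` and `generate_forward` and the `PER_BATCH_FIELDS` set are hypothetical and only illustrate the contract described above; they are not the indexer's actual codegen.

```python
from types import SimpleNamespace
from typing import Sequence

# Hypothetical set of per-batch field names, taken from the MetaData buffer
# list above; the real indexer decides this internally.
PER_BATCH_FIELDS = {
    "batch_size",
    "kv_last_page_len",
    "dense_kv_indptr",
    "sparse_kv_indptr",
    "dense_kv_indices",
    "sparse_kv_indices",
    "dense_seqlens",
    "sparse_seqlens",
    "dense_block_tables",
    "sparse_block_tables",
}


def emit_arg(field: str) -> str:
    """Render a reference to `field` as Python source text."""
    if field in PER_BATCH_FIELDS or field.startswith("winfo_"):
        # Varies between forward batches: read it from MetaData on every call.
        return f"ctx.metadata.{field}"
    # Fixed for the lifetime of the compiled indexer: read it from ctx itself.
    return f"ctx.{field}"


def generate_forward(fields: Sequence[str]) -> str:
    """Emit source for a toy forward(ctx) that returns the requested values."""
    args = ", ".join(emit_arg(f) for f in fields)
    return f"def forward(ctx):\n    return ({args},)\n"


namespace: dict = {}
exec(generate_forward(["page_size", "batch_size", "winfo_num_workloads"]), namespace)
forward = namespace["forward"]

# Stand-in objects built with SimpleNamespace; in the real indexer most
# per-batch fields are tensors rather than plain ints.
ctx = SimpleNamespace(page_size=16, metadata=SimpleNamespace(batch_size=4, winfo_num_workloads=7))
print(forward(ctx))  # (16, 4, 7)

ctx.metadata.batch_size = 8  # a new forward batch; the generated code is reused
print(forward(ctx))  # (16, 8, 7)
```

The generated function is produced once, yet it observes the new batch size because it dereferences `ctx.metadata` on each call rather than capturing a value at generation time.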

Build pattern (from each attention backend's _compile):

```python
ctx = Context()
ctx.create(self, model_runner)  # static fields
ctx.metadata = MetaData.preallocate(ctx, device=...)  # per-batch buffers
```

metadata: MetaData | None

max_bs: int

vortex_attention_backend: str

max_num_workloads: int

workload_chunk_size: int

group_size: int

num_kv_heads: int

num_qo_heads: int

head_dim: int

num_sms: int

page_size: int

max_num_pages: int

max_num_pages_per_request: int

block_size: int

max_num_blocks: int

max_num_blocks_per_request: int

num_blocks_per_page: int

num_pages_per_workload: int

topk_val: int

topk_ratio: float

block_reserved_bos: int

block_reserved_eos: int

max_topk_val: int | None

tensor_list: list

op_list: list

output_tensor_to_op_list: list

op_to_input_tensor_list: list

op_to_output_tensor_list: list

side_effect_op_ids: list

sparse_attention_name: str

impl_backend: str

tensor_id_to_tensor_name_map: dict

compilation_header_lines: list

auxilary_func_def_lines: list

compilation_cache_dir: str

use_tensor_core: bool

property batch_size: int

Current batch size, proxied from self.metadata.

Kept as a property (not a slot) so that the only writable copy lives on self.metadata; ctx.batch_size is now read-only.

set_batch_size(n)

Compatibility shim that forwards to self.metadata.set_batch_size.

Parameters: n (int)
Return type: None
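The proxy arrangement can be illustrated with two small stand-in classes. This is a sketch under assumptions: only `batch_size` is modeled, and the `__slots__` declaration is inferred from the "not a slot" remark above; neither class is the real vortex_torch implementation.

```python
from typing import Optional


class MetaData:
    """Stand-in for the per-batch MetaData; only batch_size is modeled."""

    def __init__(self) -> None:
        self.batch_size = 0

    def set_batch_size(self, n: int) -> None:
        self.batch_size = int(n)


class Context:
    """Stand-in showing the batch_size proxy; not the real vortex_torch class."""

    __slots__ = ("metadata",)  # batch_size is deliberately not a slot

    def __init__(self) -> None:
        self.metadata: Optional[MetaData] = None

    @property
    def batch_size(self) -> int:
        # Read-through to the single writable copy on self.metadata.
        return self.metadata.batch_size

    def set_batch_size(self, n: int) -> None:
        # Compatibility shim: forwards to MetaData.set_batch_size.
        self.metadata.set_batch_size(n)


ctx = Context()
ctx.metadata = MetaData()
ctx.set_batch_size(3)
print(ctx.batch_size, ctx.metadata.batch_size)  # 3 3

try:
    ctx.batch_size = 5  # the property defines no setter
    read_only = False
except AttributeError:
    read_only = True
print(read_only)  # True
```

Because the property has no setter, any stray `ctx.batch_size = ...` fails loudly instead of silently creating a second copy that could drift from `ctx.metadata.batch_size`.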

create(parent, model_runner, *, overwrite=False)

Populate the static fields. Per-batch MetaData is allocated separately by the caller via MetaData.preallocate(ctx, device=...); see this class's docstring.

Parameters:

parent (Any)
model_runner (Any)
overwrite (bool)

Return type:

Context

name: str

Human-readable context name.

mode: Literal['profile', 'execute']

Current operating mode.

vortex_dtype: torch.dtype

Intermediate-tensor dtype (default torch.bfloat16).

query_arg_names

class MetaData

Bases: object

Per-forward-batch state for the indexer. See module docstring.

winfo_q_indices: torch.Tensor

winfo_is_first_workload_per_batch: torch.Tensor

winfo_kv_offsets: torch.Tensor

winfo_kv_lens: torch.Tensor

winfo_num_workloads: torch.Tensor

winfo_chunk_size: torch.Tensor

dense_kv_indptr: torch.Tensor | None

sparse_kv_indptr: torch.Tensor | None

dense_kv_indices: torch.Tensor | None

sparse_kv_indices: torch.Tensor | None

dense_seqlens: torch.Tensor | None

sparse_seqlens: torch.Tensor | None

dense_block_tables: torch.Tensor | None

sparse_block_tables: torch.Tensor | None

kv_last_page_len: torch.Tensor

batch_size: int

classmethod preallocate(ctx, *, device)

Build a backend-appropriate MetaData from a populated Context.

Picks the buffer set from ctx.vortex_attention_backend:

"flashinfer" → allocates CSR buffers (dense/sparse_kv_indptr, dense/sparse_kv_indices); leaves block-table buffers as None.
"trtllm" → allocates 2D block_tables + per-row seqlens; leaves CSR buffers as None.

winfo_* and kv_last_page_len are always allocated.
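The backend dispatch can be sketched without a GPU by computing a plan of buffer shapes instead of allocating tensors. All shapes below are illustrative assumptions, not vortex_torch's real sizes: CSR indptr arrays of length max_bs + 1 (the usual CSR convention), index arrays budgeted by max_num_pages, block tables of shape (max_bs, max_num_pages_per_request) with pages assumed as the row unit, and one entry per request for seqlens and kv_last_page_len. The winfo_* buffers are omitted because their sizes are not documented here, and the hypothetical `plan_metadata_buffers` helper is not part of the library.

```python
from typing import Dict, Optional, Tuple

Shape = Tuple[int, ...]

CSR_FIELDS = ("dense_kv_indptr", "sparse_kv_indptr", "dense_kv_indices", "sparse_kv_indices")
TABLE_FIELDS = ("dense_seqlens", "sparse_seqlens", "dense_block_tables", "sparse_block_tables")


def plan_metadata_buffers(
    backend: str,
    max_bs: int,
    max_num_pages: int,
    max_num_pages_per_request: int,
) -> Dict[str, Optional[Shape]]:
    """Map each buffer name to an assumed shape, or None if left unallocated."""
    # Start with every backend-specific buffer unallocated (None).
    plan: Dict[str, Optional[Shape]] = {name: None for name in CSR_FIELDS + TABLE_FIELDS}
    # Allocated for every backend (winfo_* omitted: sizes not documented here).
    plan["kv_last_page_len"] = (max_bs,)

    if backend == "flashinfer":
        # CSR layout: request i owns indices[indptr[i]:indptr[i + 1]].
        for prefix in ("dense", "sparse"):
            plan[f"{prefix}_kv_indptr"] = (max_bs + 1,)
            plan[f"{prefix}_kv_indices"] = (max_num_pages,)
    elif backend == "trtllm":
        # Padded 2D layout: one row of page ids per request, plus its length.
        for prefix in ("dense", "sparse"):
            plan[f"{prefix}_block_tables"] = (max_bs, max_num_pages_per_request)
            plan[f"{prefix}_seqlens"] = (max_bs,)
    else:
        # Assumption: unknown backends are rejected rather than ignored.
        raise ValueError(f"unsupported vortex_attention_backend: {backend!r}")
    return plan


fi = plan_metadata_buffers("flashinfer", max_bs=8, max_num_pages=1024, max_num_pages_per_request=128)
print(fi["dense_kv_indptr"], fi["dense_block_tables"])  # (9,) None
```

A real implementation would turn each non-None shape into a tensor on `device`, for example with `torch.zeros(shape, dtype=torch.int32, device=device)`; the dtype there is likewise an assumption.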

Parameters:

ctx (Context)
device (torch.device | str)

Return type:

MetaData

set_batch_size(n)

Parameters: n (int)
Return type: None

get_ctx()

Return type: Context