vortex_torch.indexer.context

Functions

Classes

Context()

Static, single-instance indexer context.

class Context[source]

Bases: ContextBase

Static, single-instance indexer context.

Holds only configuration that’s fixed for the lifetime of the compiled indexer — shapes, page/block sizes, head counts, allocation budgets, codegen scratch (op/tensor lists), backend identity, …

Per-forward-batch buffers (winfo_*, dense/sparse_kv_indptr+indices, dense/sparse_seqlens, dense/sparse_block_tables, kv_last_page_len) and batch_size live on a separate MetaData object, pre-allocated at attention-backend __init__ and exposed as ctx.metadata. Codegen emits ctx.metadata.<field> for any value that varies between forward batches.
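As a rough illustration of the rule above, a code generator could decide which expression to emit by checking whether a field is per-batch. This is a minimal sketch, not the library's actual codegen: `emit_field_ref` and `PER_BATCH_FIELDS` are hypothetical names, and the field set is simply copied from the MetaData attributes documented on this page.

```python
# Hypothetical helper, not part of vortex_torch. It only illustrates the
# static vs. per-batch split described in the Context docstring.

# Field names copied from the MetaData attribute list on this page.
PER_BATCH_FIELDS = frozenset({
    "winfo_q_indices",
    "winfo_is_first_workload_per_batch",
    "winfo_kv_offsets",
    "winfo_kv_lens",
    "winfo_num_workloads",
    "winfo_chunk_size",
    "dense_kv_indptr",
    "sparse_kv_indptr",
    "dense_kv_indices",
    "sparse_kv_indices",
    "dense_seqlens",
    "sparse_seqlens",
    "dense_block_tables",
    "sparse_block_tables",
    "kv_last_page_len",
    "batch_size",
})


def emit_field_ref(name: str, ctx_var: str = "ctx") -> str:
    """Return the source expression to emit for the field `name`.

    Per-batch fields are routed through `<ctx_var>.metadata`; everything
    else is treated as static configuration read directly from the context.
    """
    if name in PER_BATCH_FIELDS:
        return f"{ctx_var}.metadata.{name}"
    return f"{ctx_var}.{name}"
```

Under this assumption, `emit_field_ref("page_size")` yields `ctx.page_size`, while `emit_field_ref("dense_kv_indptr")` yields `ctx.metadata.dense_kv_indptr`.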

Build pattern (from each attention backend’s _compile):

    ctx = Context()
    ctx.create(self, model_runner)  # static fields
    ctx.metadata = MetaData.preallocate(ctx, device=…)

metadata: MetaData | None
max_bs: int
vortex_attention_backend: str
max_num_workloads: int
workload_chunk_size: int
group_size: int
num_kv_heads: int
num_qo_heads: int
head_dim: int
num_sms: int
page_size: int
max_num_pages: int
max_num_pages_per_request: int
block_size: int
max_num_blocks: int
max_num_blocks_per_request: int
num_blocks_per_page: int
num_pages_per_workload: int
topk_val: int
topk_ratio: float
block_reserved_bos: int
block_reserved_eos: int
max_topk_val: int | None
tensor_list: list
op_list: list
output_tensor_to_op_list: list
op_to_input_tensor_list: list
op_to_output_tensor_list: list
side_effect_op_ids: list
sparse_attention_name: str
impl_backend: str
tensor_id_to_tensor_name_map: dict
compilation_header_lines: list
auxilary_func_def_lines: list
compilation_cache_dir: str
use_tensor_core: bool
property batch_size: int

Current batch size — proxied from self.metadata.

Kept as a property (not a slot) so the only writable copy lives on self.metadata; ctx.batch_size is now read-only.

set_batch_size(n)[source]

Compatibility shim — forwards to self.metadata.set_batch_size.

Parameters:

n (int)

Return type:

None
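The read-only property plus forwarding shim described above can be sketched with two stand-in classes. `SketchContext` and `SketchMetaData` are hypothetical and only mirror the documented behaviour; the real classes carry many more fields, and what happens when `metadata` is still `None` is not specified on this page.

```python
from typing import Optional


class SketchMetaData:
    """Stand-in for MetaData: owns the only writable batch_size."""

    def __init__(self) -> None:
        self.batch_size = 0

    def set_batch_size(self, n: int) -> None:
        self.batch_size = n


class SketchContext:
    """Stand-in for Context: exposes batch_size as a read-only proxy."""

    def __init__(self) -> None:
        self.metadata: Optional[SketchMetaData] = None

    @property
    def batch_size(self) -> int:
        # No setter is defined, so `ctx.batch_size = n` raises AttributeError.
        return self.metadata.batch_size

    def set_batch_size(self, n: int) -> None:
        # Compatibility shim: the write lands on the metadata object.
        self.metadata.set_batch_size(n)


ctx = SketchContext()
ctx.metadata = SketchMetaData()
ctx.set_batch_size(4)
```

After the last line, both `ctx.batch_size` and `ctx.metadata.batch_size` read 4, and a direct assignment to `ctx.batch_size` fails because the property has no setter.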

create(parent, model_runner, *, overwrite=False)[source]

Populate the static fields. Per-batch MetaData is allocated separately by the caller via MetaData.preallocate(ctx, device=...) — see this class’s docstring.

Parameters:
  • parent (Any)

  • model_runner (Any)

  • overwrite (bool)

Return type:

Context

name: str

Human-readable context name.

mode: Literal['profile', 'execute']

Current operating mode.

vortex_dtype: torch.dtype

Intermediate-tensor dtype (default torch.bfloat16).

query_arg_names

class MetaData[source]

Bases: object

Per-forward-batch state for the indexer. See module docstring.

winfo_q_indices: torch.Tensor
winfo_is_first_workload_per_batch: torch.Tensor
winfo_kv_offsets: torch.Tensor
winfo_kv_lens: torch.Tensor
winfo_num_workloads: torch.Tensor
winfo_chunk_size: torch.Tensor
dense_kv_indptr: torch.Tensor | None
sparse_kv_indptr: torch.Tensor | None
dense_kv_indices: torch.Tensor | None
sparse_kv_indices: torch.Tensor | None
dense_seqlens: torch.Tensor | None
sparse_seqlens: torch.Tensor | None
dense_block_tables: torch.Tensor | None
sparse_block_tables: torch.Tensor | None
kv_last_page_len: torch.Tensor
batch_size: int

classmethod preallocate(ctx, *, device)[source]

Build a backend-appropriate MetaData from a populated Context.

Picks the buffer set from ctx.vortex_attention_backend:
  • "flashinfer" → allocates CSR buffers (dense/sparse_kv_indptr, dense/sparse_kv_indices); leaves block-table buffers as None.

  • "trtllm" → allocates 2D block_tables + per-row seqlens; leaves CSR buffers as None.

winfo_* and kv_last_page_len are always allocated.

Parameters:
  • ctx (Context)

  • device

Return type:

MetaData
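The backend-dependent choice of buffers can be sketched as follows. This is a simplified stand-in, not the real `preallocate`: plain Python lists replace `torch.Tensor` buffers, only the dense variants are shown, the explicit size arguments replace the `ctx` and `device` parameters, and the buffer sizes and the `ValueError` for unknown backends are assumptions.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class SketchMetaData:
    """Hypothetical stand-in for MetaData with list-based buffers."""

    kv_last_page_len: List[int]                         # always allocated
    dense_kv_indptr: Optional[List[int]] = None         # CSR ("flashinfer")
    dense_kv_indices: Optional[List[int]] = None        # CSR ("flashinfer")
    dense_block_tables: Optional[List[List[int]]] = None  # 2D ("trtllm")
    dense_seqlens: Optional[List[int]] = None           # per-row ("trtllm")

    @classmethod
    def preallocate(
        cls,
        backend: str,
        max_bs: int,
        max_num_pages: int,
        max_num_pages_per_request: int,
    ) -> "SketchMetaData":
        md = cls(kv_last_page_len=[0] * max_bs)
        if backend == "flashinfer":
            # A CSR row-pointer array has one more entry than there are rows.
            md.dense_kv_indptr = [0] * (max_bs + 1)
            md.dense_kv_indices = [0] * max_num_pages  # assumed capacity
        elif backend == "trtllm":
            md.dense_block_tables = [
                [0] * max_num_pages_per_request for _ in range(max_bs)
            ]
            md.dense_seqlens = [0] * max_bs
        else:
            # Assumption: the real behaviour for other values is not documented.
            raise ValueError(f"unsupported backend: {backend!r}")
        return md


fi = SketchMetaData.preallocate("flashinfer", 2, 8, 4)
trt = SketchMetaData.preallocate("trtllm", 2, 8, 4)
```

Here `fi` carries CSR buffers and leaves `dense_block_tables` as `None`, while `trt` does the opposite. Consumers therefore have to check for `None` before touching a buffer that belongs to the other backend.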

set_batch_size(n)[source]

Parameters:

n (int)

Return type:

None

get_ctx()[source]

Return type:

Context