vortex_torch.cache.context

Classes

Context()

Mutable, single-instance cache context; populate later via .create(...).

Bases: ContextBase

Beyond the minimal runtime settings (page/block layout, head shape), this context also carries the graph and codegen state used by the cache compiler, mirroring vortex_torch.indexer.context.Context.

The graph state (tensor_list, op_list, op_to_input_tensor_list, op_to_output_tensor_list, output_tensor_to_op_list) is populated during the profile phase as cache ops register themselves; the codegen state (compilation_header_lines, auxilary_func_def_lines, tensor_id_to_tensor_name_map, compilation_cache_dir, sparse_attention_name, impl_backend) is consumed by vortex_torch.cache.compiler.

max_new_tokens_per_batch: int: Maximum number of new tokens per batch.

page_size: int: Page size used for memory paging.

total_num_pages: int: Total number of pages available in the cache.

block_size: int: Block size; must evenly divide page_size (page_size % block_size == 0).

num_blocks_per_page: int: page_size // block_size.

total_num_blocks: int: num_blocks_per_page * total_num_pages.

head_dim: int: Dimension per attention head.

head_num: int: Number of (KV) heads.

num_sms: int: Number of streaming multiprocessors.
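
The derived layout fields follow from the independent ones by integer arithmetic. A minimal sketch of those relations, assuming a hypothetical helper derive_block_layout that is not part of vortex_torch and only restates the formulas documented above:

```python
def derive_block_layout(page_size: int, block_size: int, total_num_pages: int) -> dict:
    """Hypothetical helper (not a vortex_torch API): compute the derived layout fields."""
    # The documented constraint: block_size must evenly divide page_size.
    if block_size <= 0 or page_size % block_size != 0:
        raise ValueError("block_size must be positive and evenly divide page_size")
    num_blocks_per_page = page_size // block_size
    total_num_blocks = num_blocks_per_page * total_num_pages
    return {
        "num_blocks_per_page": num_blocks_per_page,
        "total_num_blocks": total_num_blocks,
    }


# Example values chosen for illustration only.
layout = derive_block_layout(page_size=16, block_size=4, total_num_pages=1024)
```

With these illustrative values each page holds 4 blocks, so the cache holds 4 * 1024 blocks in total.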

tensor_list: list: All vTensors that flow through the graph.

op_list: list: All registered vOps in profile order.

output_tensor_to_op_list: list: For each tensor_id, the producing op_id (or None).

op_to_input_tensor_list: list: For each op_id, the list of input tensor_ids.

op_to_output_tensor_list: list: For each op_id, the (single) output tensor_id wrapped in a list.
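
To make the relationship between the five graph-state lists concrete, here is a toy sketch. It assumes that tensor_id and op_id are simply indices into tensor_list and op_list; the class ToyGraphState and its methods are hypothetical and are not the registration code vortex_torch uses.

```python
class ToyGraphState:
    """Hypothetical stand-in for the five graph-state lists (not vortex_torch code)."""

    def __init__(self):
        self.tensor_list = []
        self.op_list = []
        self.output_tensor_to_op_list = []  # tensor_id -> producing op_id, or None
        self.op_to_input_tensor_list = []   # op_id -> list of input tensor_ids
        self.op_to_output_tensor_list = []  # op_id -> [output tensor_id]

    def add_tensor(self, tensor, producer_op_id=None):
        # Assumption: a tensor's id is its position in tensor_list.
        tensor_id = len(self.tensor_list)
        self.tensor_list.append(tensor)
        self.output_tensor_to_op_list.append(producer_op_id)
        return tensor_id

    def register_op(self, op, input_tensor_ids, output_tensor):
        # Assumption: an op's id is its position in op_list (profile order).
        op_id = len(self.op_list)
        self.op_list.append(op)
        self.op_to_input_tensor_list.append(list(input_tensor_ids))
        out_id = self.add_tensor(output_tensor, producer_op_id=op_id)
        self.op_to_output_tensor_list.append([out_id])
        return op_id, out_id


graph = ToyGraphState()
k = graph.add_tensor("k_cache")  # a graph input has no producing op
op_id, pooled = graph.register_op("mean_pool", [k], "k_pooled")
```

In this sketch the input tensor maps to None in output_tensor_to_op_list, while the op's output maps back to the op that produced it.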

sparse_attention_name: str: Unique name for the cache-pipeline instance.

impl_backend: str: Implementation backend (default "triton").

tensor_id_to_tensor_name_map: dict: tensor_id -> Python name in the generated code.

compilation_header_lines: list: Lines inserted at the top of the generated module.

auxilary_func_def_lines: list: Auxiliary function/kernel definitions to embed.

compilation_cache_dir: str: Where to write the generated module.
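
How vortex_torch.cache.compiler consumes these fields is not described here. Purely as an illustration of how such pieces could fit together, the sketch below joins header lines, auxiliary definitions, and a body that uses mapped tensor names into one module, then writes it to a directory. The function assemble_module, the file name toy_pipeline.py, and the generated run function are all hypothetical, and a temporary directory stands in for compilation_cache_dir.

```python
import os
import tempfile


def assemble_module(header_lines, aux_def_lines, body_lines):
    """Hypothetical: join the three sections into one module source string."""
    return "\n".join([*header_lines, "", *aux_def_lines, "", *body_lines]) + "\n"


# tensor_id -> name used in the generated code (illustrative values).
names = {0: "t0", 1: "t1"}
header = ["import math"]
aux = ["def _scale(x):", "    return x * math.sqrt(4.0)"]
body = [
    f"def run({names[0]}):",
    f"    {names[1]} = _scale({names[0]})",
    f"    return {names[1]}",
]
source = assemble_module(header, aux, body)

# A throwaway directory stands in for compilation_cache_dir.
with tempfile.TemporaryDirectory() as cache_dir:
    module_path = os.path.join(cache_dir, "toy_pipeline.py")
    with open(module_path, "w") as f:
        f.write(source)
    namespace = {}
    with open(module_path) as f:
        # Execute the generated source so its functions land in `namespace`.
        exec(compile(f.read(), module_path, "exec"), namespace)

result = namespace["run"](3.0)
```

The generated code here is plain Python for readability; with impl_backend set to "triton" the real output presumably contains kernel definitions instead.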

create(parent, model_runner, *, overwrite=False)

Populate this instance once. Pass overwrite=True to allow re-initialization. Note: create() takes no lock, so concurrent callers may race; call it from a single thread.

Parameters:

parent (Any)
model_runner (Any)
overwrite (bool)

Return type:

Context
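
The populate-once contract can be sketched with a toy class. ToyContext below is hypothetical and takes a single page_size argument instead of parent and model_runner, whose types are not specified here. Raising RuntimeError on a second call without overwrite=True is an assumption; the real create() may handle that case differently.

```python
class ToyContext:
    """Hypothetical populate-once context; mirrors only the documented create() contract."""

    def __init__(self):
        self._created = False
        self.page_size = None

    def create(self, page_size, *, overwrite=False):
        # Assumption: a repeated call without overwrite=True is an error.
        if self._created and not overwrite:
            raise RuntimeError("context already created; pass overwrite=True to re-initialize")
        self.page_size = page_size
        self._created = True
        return self  # create() is documented to return the Context


ctx = ToyContext()
ctx.create(16)                   # first population
ctx.create(32, overwrite=True)   # explicit re-initialization
```

Like the documented method, this sketch takes no lock, so it is only safe when called from a single thread.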

name: str: Human-readable context name.

mode: Literal['profile', 'execute']: Current operating mode.

vortex_dtype: torch.dtype: Intermediate-tensor dtype (default torch.bfloat16).

get_ctx()

Return type:

Context