vortex_torch.cache.context

Classes

Context()

Mutable, single-instance cache context; populate later via .create(...).

class Context

Bases: ContextBase

Mutable, single-instance cache context; populate later via .create(...).

Beyond the minimal runtime knobs (page/block layout, head shape), this context also carries the graph and codegen state used by the cache compiler, mirroring vortex_torch.indexer.context.Context.

The graph state (tensor_list, op_list, op_to_input_tensor_list, op_to_output_tensor_list, output_tensor_to_op_list) is populated during the profile phase as cache ops register themselves.

The codegen state (compilation_header_lines, auxilary_func_def_lines, tensor_id_to_tensor_name_map, compilation_cache_dir, sparse_attention_name, impl_backend) is consumed by vortex_torch.cache.compiler.

max_new_tokens_per_batch: int

Maximum number of new tokens per batch.

page_size: int

Page size used for memory paging.

total_num_pages: int

Total available pages in the cache.

block_size: int

Block size. Must divide page_size evenly (page_size % block_size == 0).

num_blocks_per_page: int

Number of blocks per page: page_size // block_size.

total_num_blocks: int

Total number of blocks in the cache: num_blocks_per_page * total_num_pages.
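The derived layout fields above follow directly from page_size, block_size and total_num_pages. The sketch below shows that arithmetic in isolation. The helper name derive_block_layout is hypothetical and not part of vortex_torch; how Context itself computes or validates these fields is not stated here.

```python
from typing import Tuple


def derive_block_layout(
    page_size: int, block_size: int, total_num_pages: int
) -> Tuple[int, int]:
    """Return (num_blocks_per_page, total_num_blocks).

    Hypothetical helper for illustration; not a vortex_torch API.
    """
    # The documented invariant: block_size must divide page_size evenly.
    if block_size <= 0 or page_size % block_size != 0:
        raise ValueError("block_size must be positive and divide page_size")
    num_blocks_per_page = page_size // block_size
    total_num_blocks = num_blocks_per_page * total_num_pages
    return num_blocks_per_page, total_num_blocks


# Example values chosen for illustration only.
blocks_per_page, total_blocks = derive_block_layout(
    page_size=64, block_size=16, total_num_pages=1024
)
```

With these example values, each page holds 4 blocks and the cache holds 4096 blocks in total.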

head_dim: int

Dimension per attention head.

head_num: int

Number of (KV) heads.

num_sms: int

Number of streaming multiprocessors.

tensor_list: list

All vTensors that flow through the graph.

op_list: list

All registered vOps in profile order.

output_tensor_to_op_list: list

For each tensor_id, the producing op_id (or None).

op_to_input_tensor_list: list

For each op_id, the list of input tensor_ids.

op_to_output_tensor_list: list

For each op_id, the (single) output tensor_id wrapped in a list.
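The five graph-state lists above are parallel index structures. The sketch below shows one way they could stay consistent as tensors and ops are registered. It assumes that tensor_id and op_id are simply positions in tensor_list and op_list; the GraphState class and its add_tensor/add_op methods are hypothetical stand-ins, not the registration path vortex_torch actually uses.

```python
from typing import Any, List, Optional


class GraphState:
    """Hypothetical stand-in for the graph-state fields of Context."""

    def __init__(self) -> None:
        self.tensor_list: List[Any] = []
        self.op_list: List[Any] = []
        self.output_tensor_to_op_list: List[Optional[int]] = []
        self.op_to_input_tensor_list: List[List[int]] = []
        self.op_to_output_tensor_list: List[List[int]] = []

    def add_tensor(self, tensor: Any) -> int:
        # Assumption: a tensor_id is the tensor's index in tensor_list.
        tensor_id = len(self.tensor_list)
        self.tensor_list.append(tensor)
        self.output_tensor_to_op_list.append(None)  # no producing op yet
        return tensor_id

    def add_op(self, op: Any, input_ids: List[int], output_id: int) -> int:
        # Assumption: an op_id is the op's index in op_list (profile order).
        op_id = len(self.op_list)
        self.op_list.append(op)
        self.op_to_input_tensor_list.append(list(input_ids))
        self.op_to_output_tensor_list.append([output_id])  # single output
        self.output_tensor_to_op_list[output_id] = op_id
        return op_id


# Placeholder strings stand in for real vTensor and vOp objects.
graph = GraphState()
k_id = graph.add_tensor("k_cache")
centroid_id = graph.add_tensor("centroids")
pool_op_id = graph.add_op("mean_pool", [k_id], centroid_id)
```

In this sketch a graph input such as k_cache keeps None as its producer, while centroids maps back to the op that wrote it.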

sparse_attention_name: str

Unique name for the cache-pipeline instance.

impl_backend: str

Implementation backend (default "triton").

tensor_id_to_tensor_name_map: dict

tensor_id -> Python name in the generated code.

compilation_header_lines: list

Lines inserted at the top of the generated module.

auxilary_func_def_lines: list

Auxiliary function/kernel definitions to embed.

compilation_cache_dir: str

Where to write the generated module.
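As a rough illustration of how the codegen fields could fit together, the sketch below joins header lines, auxiliary definitions and a body into one module source and writes it to a cache directory. Everything here is an assumption made for illustration: the helpers assemble_module_source and write_module, the file naming and the section ordering are hypothetical, and the actual behaviour of vortex_torch.cache.compiler may differ.

```python
import tempfile
from pathlib import Path
from typing import Dict, List


def assemble_module_source(
    header_lines: List[str],
    aux_func_def_lines: List[str],
    body_lines: List[str],
) -> str:
    """Join the sections in order, separated by blank lines (assumed layout)."""
    sections = [header_lines, aux_func_def_lines, body_lines]
    return "\n\n".join("\n".join(s) for s in sections if s) + "\n"


def write_module(cache_dir: str, name: str, source: str) -> Path:
    """Write the source to <cache_dir>/<name>.py (hypothetical naming)."""
    out_dir = Path(cache_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    path = out_dir / f"{name}.py"
    path.write_text(source)
    return path


# tensor_id -> Python name, in the spirit of tensor_id_to_tensor_name_map.
tensor_names: Dict[int, str] = {0: "k_cache", 1: "centroids"}

source = assemble_module_source(
    header_lines=["import math"],
    aux_func_def_lines=["def _scale(x):", "    return x / math.sqrt(4.0)"],
    body_lines=[f"{tensor_names[1]} = _scale(8.0)"],
)

with tempfile.TemporaryDirectory() as tmp:
    module_path = write_module(tmp, "demo_pipeline", source)
    namespace: dict = {}
    # Execute the generated text to show it is a valid module.
    exec(module_path.read_text(), namespace)
```

Keeping the name map separate from the body lets generated code refer to tensors by stable Python names rather than by numeric ids.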

create(parent, model_runner, *, overwrite=False)

Populate this instance once. Pass overwrite=True to allow re-initialization. Note: create() takes no lock, so concurrent callers may race; call it from a single thread.

Parameters:
  • parent (Any)

  • model_runner (Any)

  • overwrite (bool)

Return type:

Context
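The populate-once contract of create() can be sketched with a small stand-in class. OnceContext below is hypothetical and is not vortex_torch's Context. In particular, it assumes that a second call without overwrite=True raises RuntimeError; the real method may handle that case differently.

```python
from typing import Any, Optional


class OnceContext:
    """Hypothetical illustration of a populate-once context."""

    def __init__(self) -> None:
        self._created = False
        self.parent: Optional[Any] = None
        self.model_runner: Optional[Any] = None

    def create(
        self, parent: Any, model_runner: Any, *, overwrite: bool = False
    ) -> "OnceContext":
        # No lock is taken, so this check-then-set is not thread-safe.
        if self._created and not overwrite:
            # Assumption: reject re-initialization unless explicitly allowed.
            raise RuntimeError("already created; pass overwrite=True")
        self.parent = parent
        self.model_runner = model_runner
        self._created = True
        return self


ctx = OnceContext()
ctx.create(parent="parent-a", model_runner="runner-a")

try:
    ctx.create(parent="parent-b", model_runner="runner-b")
except RuntimeError:
    second_call_rejected = True
else:
    second_call_rejected = False

# Explicit opt-in replaces the earlier state.
ctx.create(parent="parent-b", model_runner="runner-b", overwrite=True)
```

Because the guard is a plain flag check, two threads calling create() at the same time could both pass it, which is why the method is documented as single-threaded.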

name: str

Human-readable context name.

mode: Literal['profile', 'execute']

Current operating mode.

vortex_dtype: torch.dtype

Intermediate-tensor dtype (default torch.bfloat16).

get_ctx()
Return type:

Context