vortex_torch.indexer.utils_sglang

Functions

get_decode_planner([policy])

get_decode_planner_trtllm([policy])

Decode planner variant that emits trtllm-ready outputs directly.

get_prefill_planner()

Mirror of get_decode_planner() for the prefill path.

get_chunkwise_nh2hn_transpose()

Factory for the Chunkwise_NH2HN_Transpose kernel; mirrors get_decode_planner()'s closure pattern so the module reference is bound once at backend init.

get_chunkwise_hn2nh_transpose()

Factory for the Chunkwise_HN2NH_Transpose kernel; mirrors get_decode_planner()'s closure pattern.

get_decode_planner(policy=None)
Parameters:

policy (str or None) – defaults to None.

get_decode_planner_trtllm(policy=None)

Decode planner variant that emits trtllm-ready outputs directly.

The trtllm planner is indptr-free: it never reads or writes dense_kv_indptr / sparse_kv_indptr / dense_kv_indices / sparse_kv_indices. The underlying CUDA kernel fills the following outputs:

  • ctx.dense_block_tables — every selected page for the dense path

  • ctx.sparse_block_tables — only the BOS+EOS slots; the middle slots are filled later by the topk kernel

  • ctx.dense_seqlens / ctx.sparse_seqlens — int32 token counts consumed by trtllm_batch_decode_with_kv_cache, the trtllm topk kernels, and the Schedule.S Triton kernels (which derive per-row block counts via ceil(tokens / block_size))

  • ctx.kv_last_page_len — same semantics as in the non-trtllm get_decode_planner() path

  • ctx.winfo_* — workload-scheduler outputs; winfo_kv_offsets[j] carries row * max_blocks_per_seq + col so the Schedule.W kernel preamble (with indices = dense_block_tables.view(-1)) resolves page ids correctly.
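The indexing scheme described in the list above can be illustrated on plain Python lists. This is a minimal sketch, not the CUDA kernel: the helper names (build_block_table, blocks_from_seqlen, flat_offset), the padding value -1 for unused slots, and the example page ids are all hypothetical. It shows how a padded block table row is flattened (the analogue of dense_block_tables.view(-1)), how row * max_blocks_per_seq + col then addresses the same page, and how a per-row block count follows from a token count via ceil(tokens / block_size).

```python
def build_block_table(pages_per_request, max_blocks_per_seq, pad=-1):
    """Pack per-request page-id lists into a padded 2-D block table (hypothetical helper)."""
    table = []
    for pages in pages_per_request:
        if len(pages) > max_blocks_per_seq:
            raise ValueError("request has more pages than max_blocks_per_seq")
        # Unused trailing slots get the (assumed) pad value.
        table.append(list(pages) + [pad] * (max_blocks_per_seq - len(pages)))
    return table


def blocks_from_seqlen(tokens, block_size):
    # ceil(tokens / block_size) using integer arithmetic only.
    return -(-tokens // block_size)


def flat_offset(row, col, max_blocks_per_seq):
    # Same encoding as winfo_kv_offsets[j]: row * max_blocks_per_seq + col.
    return row * max_blocks_per_seq + col


# Example: two requests, block_size 16, at most 4 blocks per sequence.
block_size = 16
max_blocks_per_seq = 4
pages = [[7, 3], [12, 5, 9]]
seqlens = [20, 40]

table = build_block_table(pages, max_blocks_per_seq)
flat = [p for row in table for p in row]  # row-major flattening, like .view(-1)
block_counts = [blocks_from_seqlen(t, block_size) for t in seqlens]
page_at_1_2 = flat[flat_offset(1, 2, max_blocks_per_seq)]
```

Because the flattened table is row-major, a single flat offset is enough for a kernel preamble to recover any (row, col) page id without a separate indptr array.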

Parameters:

policy (str or None) – defaults to None.

get_prefill_planner()

Mirror of get_decode_planner() for the prefill path.

Triggers a one-time JIT compile of the prefill module on first call, then returns a closure that calls sglang_plan_prefill with the module reference baked in (no per-call lookup).

get_chunkwise_nh2hn_transpose()

Factory for the Chunkwise_NH2HN_Transpose kernel; mirrors get_decode_planner()’s closure pattern so the module reference is bound once at backend init.

get_chunkwise_hn2nh_transpose()

Factory for the Chunkwise_HN2NH_Transpose kernel; mirrors get_decode_planner()’s closure pattern.
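On the caller side, "bound once at backend init" means that the factories run in the backend constructor and the hot path only invokes the stored closures. The sketch below shows that shape with hypothetical stub factories (make_nh2hn_stub, make_hn2nh_stub) and a hypothetical SketchBackend class; the stubs do not perform any real transpose, and the actual closures' call signatures are not documented here.

```python
# Counts how often each (hypothetical) factory runs.
factory_calls = {"nh2hn": 0, "hn2nh": 0}


def make_nh2hn_stub():
    # Stand-in for get_chunkwise_nh2hn_transpose(); returns a closure.
    factory_calls["nh2hn"] += 1
    return lambda x: ("nh2hn", x)


def make_hn2nh_stub():
    # Stand-in for get_chunkwise_hn2nh_transpose(); returns a closure.
    factory_calls["hn2nh"] += 1
    return lambda x: ("hn2nh", x)


class SketchBackend:
    def __init__(self):
        # Resolve each kernel once, at init time.
        self._nh2hn = make_nh2hn_stub()
        self._hn2nh = make_hn2nh_stub()

    def forward(self, x):
        # Hot path: only calls the pre-bound closures.
        return self._hn2nh(self._nh2hn(x))


backend = SketchBackend()
outputs = [backend.forward(i) for i in range(3)]
```

However many times forward() runs, each factory has executed exactly once.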