vortex_torch.indexer.utils_sglang

Functions

`get_decode_planner`([policy])
`get_decode_planner_trtllm`([policy])	Decode planner variant that emits trtllm-ready outputs directly.
`get_prefill_planner`()	Mirror of `get_decode_planner()` for the prefill path.
`get_chunkwise_nh2hn_transpose`()	Factory for the `Chunkwise_NH2HN_Transpose` kernel; mirrors `get_decode_planner()`'s closure pattern so the module reference is bound once at backend init.
`get_chunkwise_hn2nh_transpose`()	Factory for the `Chunkwise_HN2NH_Transpose` kernel; mirrors `get_decode_planner()`'s closure pattern.

get_decode_planner(policy=None)

Parameters: policy (str)

get_decode_planner_trtllm(policy=None)

Decode planner variant that emits trtllm-ready outputs directly.

The trtllm planner is indptr-free: it never reads or writes dense_kv_indptr / sparse_kv_indptr / dense_kv_indices / sparse_kv_indices. Outputs filled by the underlying CUDA kernel:

ctx.dense_block_tables — every selected page for the dense path

ctx.sparse_block_tables — only the BOS+EOS slots; the middle is filled by the topk kernel later

ctx.dense_seqlens / ctx.sparse_seqlens — int32 token counts consumed by trtllm_batch_decode_with_kv_cache, the trtllm topk kernels, and the Schedule.S Triton kernels (which derive per-row block counts via ceil(tokens / block_size))

ctx.kv_last_page_len — same semantics as before

ctx.winfo_* — workload-scheduler outputs; winfo_kv_offsets[j] carries row * max_blocks_per_seq + col so the Schedule.W kernel preamble (with indices = dense_block_tables.view(-1)) resolves page ids correctly.
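The two index calculations mentioned above (per-row block counts via ceil(tokens / block_size), and the flattened row * max_blocks_per_seq + col offset) can be sketched in plain Python. This is only an illustration of the arithmetic, not the planner's code: the helper names, the example block table, and the padding value 0 are assumptions, and a flat list stands in for dense_block_tables.view(-1).

```python
def blocks_for_tokens(tokens: int, block_size: int) -> int:
    """Ceiling division: number of pages needed to hold `tokens` tokens."""
    return (tokens + block_size - 1) // block_size


def flat_offset(row: int, col: int, max_blocks_per_seq: int) -> int:
    """Offset of block-table entry (row, col) in the row-major flattened table."""
    return row * max_blocks_per_seq + col


# Hypothetical block table for a batch of 2 requests, padded with 0
# to max_blocks_per_seq = 4 columns.
max_blocks_per_seq = 4
block_tables = [
    [7, 3, 9, 0],
    [5, 2, 0, 0],
]

# Row-major flattening, the pure-Python analogue of a tensor's .view(-1).
flat_indices = [page for row in block_tables for page in row]

# Resolve the page id stored at (row=1, col=1) through the flat offset.
offset = flat_offset(1, 1, max_blocks_per_seq)
page_id = flat_indices[offset]

# 33 tokens with 16-token pages occupy 3 pages; 32 tokens occupy exactly 2.
blocks_needed = blocks_for_tokens(33, 16)
```

Because the offset already encodes the row, a consumer that only sees the flattened table needs no separate indptr array to locate a request's pages.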

Parameters: policy (str)

get_prefill_planner()

Mirror of get_decode_planner() for the prefill path.

Triggers a one-time JIT compile of the prefill module on first call, then returns a closure that calls sglang_plan_prefill with the module reference baked in (no per-call lookup).

get_chunkwise_nh2hn_transpose()

Factory for the Chunkwise_NH2HN_Transpose kernel; mirrors get_decode_planner()’s closure pattern so the module reference is bound once at backend init.

get_chunkwise_hn2nh_transpose()

Factory for the Chunkwise_HN2NH_Transpose kernel; mirrors get_decode_planner()’s closure pattern.