vortex_torch.indexer.utils_sglang
Functions
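get_decode_planner_trtllm([policy]) | Decode planner variant that emits trtllm-ready outputs directly.
get_prefill_planner() | Mirror of get_decode_planner() for the prefill path.
get_chunkwise_nh2hn_transpose() | Factory for the Chunkwise_NH2HN_Transpose kernel.
get_chunkwise_hn2nh_transpose() | Factory for the Chunkwise_HN2NH_Transpose kernel.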
- get_decode_planner_trtllm(policy=None)
Decode planner variant that emits trtllm-ready outputs directly.
The trtllm planner is indptr-free: it never reads or writes dense_kv_indptr, sparse_kv_indptr, dense_kv_indices, or sparse_kv_indices. Outputs filled by the underlying CUDA kernel:
  - ctx.dense_block_tables: every selected page for the dense path.
  - ctx.sparse_block_tables: only the BOS+EOS slots; the middle is filled by the topk kernel later.
  - ctx.dense_seqlens / ctx.sparse_seqlens: int32 token counts consumed by trtllm_batch_decode_with_kv_cache, the trtllm topk kernels, and the Schedule.S Triton kernels (which derive per-row block counts via ceil(tokens / block_size)).
  - ctx.kv_last_page_len: same semantics as before.
  - ctx.winfo_*: workload-scheduler outputs; winfo_kv_offsets[j] carries row * max_blocks_per_seq + col so that the Schedule.W kernel preamble (with indices = dense_block_tables.view(-1)) resolves page ids correctly.
- Parameters:
policy (str)
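The two index computations mentioned above (per-row block counts via ceil(tokens / block_size), and the flattened offset row * max_blocks_per_seq + col) can be illustrated with a small pure-Python sketch. The helper names blocks_needed and flat_offset, the toy page ids, and the block size of 16 are assumptions made for illustration; they are not part of vortex_torch, and the real kernels operate on device tensors rather than Python lists.

```python
def blocks_needed(tokens: int, block_size: int) -> int:
    """Integer form of ceil(tokens / block_size)."""
    return (tokens + block_size - 1) // block_size


def flat_offset(row: int, col: int, max_blocks_per_seq: int) -> int:
    """Offset into a row-major flattened [rows, max_blocks_per_seq] table."""
    return row * max_blocks_per_seq + col


# Toy stand-ins (hypothetical values): 2 sequences, at most 4 blocks each.
block_size = 16
max_blocks_per_seq = 4
dense_block_tables = [
    [10, 11, 12, 0],  # row 0: three used pages, one unused slot
    [20, 21, 0, 0],   # row 1: two used pages, two unused slots
]
dense_seqlens = [40, 17]  # token counts per row

# Flattening row by row plays the role of dense_block_tables.view(-1).
indices = [page for row in dense_block_tables for page in row]

# Per-row block counts derived from token counts.
block_counts = [blocks_needed(t, block_size) for t in dense_seqlens]

# Resolve the page id stored at (row=1, col=1) through the flat offset.
page = indices[flat_offset(1, 1, max_blocks_per_seq)]
```

Because the offset already encodes both row and column, a consumer holding only the flattened table can look up a page id without any indptr array, which is consistent with the indptr-free design described above.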
- get_prefill_planner()
Mirror of get_decode_planner() for the prefill path.
Triggers a one-time JIT compile of the prefill module on first call, then returns a closure that calls sglang_plan_prefill with the module reference baked in (no per-call lookup).
- get_chunkwise_nh2hn_transpose()
Factory for the Chunkwise_NH2HN_Transpose kernel; mirrors get_decode_planner()’s closure pattern so the module reference is bound once at backend init.
- get_chunkwise_hn2nh_transpose()
Factory for the Chunkwise_HN2NH_Transpose kernel; mirrors get_decode_planner()’s closure pattern.