TL;DR: We introduce Kinetics, which rethinks traditional test-time scaling (TTS) laws from a practical efficiency perspective. It reveals that prior compute-optimal approaches overlook the key-value memory access bottlenecks that arise across TTS strategies. By jointly considering memory and compute, the Kinetics scaling law shows that it is more efficient to scale model size up to a threshold before investing more compute in test-time scaling. Motivated by this, we propose a new scaling paradigm centered on sparse attention, which fundamentally reshapes the scaling law. According to our Sparse Kinetics scaling law, sparse attention significantly improves the scalability and performance of TTS and becomes increasingly valuable in high-cost scenarios.
The existing compute-optimal test-time scaling law often favors smaller models with more test-time compute. We revisit this paradigm from a practical efficiency perspective and find that the effectiveness of small models is often overestimated because of overlooked key-value (KV) memory bottlenecks introduced by inference strategies such as Best-of-N sampling or long chains of thought (CoTs).
To address this, we propose the Kinetics Scaling Law, which jointly considers both compute and memory costs. Unlike prior work, it shows that test-time compute is best allocated to scaling model size—up to a threshold (e.g., 14B parameters for Qwen3)—before increasing generation length. (See Figure 1)
Furthermore, we demonstrate that sparse attention unlocks new scaling opportunities by mitigating KV memory overhead, enabling longer generations and more parallel reasoning trials within the same budget. This leads to substantial gains in test-time accuracy and efficiency. Empirically, sparse attention models consistently outperform their dense counterparts on AIME problem-solving accuracy, with gains of over 60 points in low-cost regimes and over 5 points in high-cost regimes, including evaluations on state-of-the-art MoE models. (See Figure 2.) Moreover, according to our Sparse Kinetics scaling law, computational resources are best allocated to test-time strategies rather than to reducing sparsity.
We demonstrate the practicality of Sparse Kinetics using a simple block-sparse attention mechanism, which is known to be hardware-efficient and shows scalability comparable to oracle top-k attention. Block top-k attention achieves 3.2–33.3× throughput improvements on H200 GPUs.
While sparsity has traditionally been employed either for regularization in small models or to reduce computation in over-parameterized networks, our work introduces a fundamentally different perspective: sparsity as a central enabler of efficient and scalable test-time compute. In contrast to pretraining, where scaling exhibits diminishing returns, TTS continues to benefit from increased token generation and more reasoning trials.
As a first step, we revisit the cost model to understand the relative importance of compute and memory in TTS and how it can affect the choice of optimal model size and TTS configurations like chain-of-thought (CoT) length or number of trials.
We analyze test-time inference cost using a model that combines compute and memory access, and define an equivalent cost metric, eFLOPs, which unifies both kinds of cost in a single number.
Definitions:
Computation Cost:
$$C_{\text{compute}} = C_{\text{param-compute}} + C_{\text{attn-compute}} = 2PNL_{\text{out}} + \left(2rNL_{\text{in}}DL_{\text{out}} + rNDL_{\text{out}}^{2}\right)$$
Memory Access Cost:
$$C_{\text{mem}} = C_{\text{param-mem}} + C_{\text{attn-mem}} = 2PL_{\text{out}} + \left(2L_{\text{in}}DL_{\text{out}} + NDL_{\text{out}}^{2}\right)$$
We also assume that the prompt KV cache is reused across the $N$ reasoning trials. Here, $P$ is the model parameter count, $N$ the number of reasoning trials, $L_{\text{in}}$ and $L_{\text{out}}$ the prompt and generation lengths, and $D$ and $r$ model-dependent attention constants.
eFLOPs scales the memory cost by the arithmetic intensity of the hardware to unify compute and memory costs in a single metric:

$$\text{eFLOPs} = C_{\text{compute}} + I \cdot C_{\text{mem}},$$

where $I$ is the arithmetic intensity of the hardware (e.g., 562.5 for NVIDIA B200).
In real-world deployment, model parameter loading is amortized across large batches, rendering $C_{\text{param-mem}}$ insignificant.
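As a reading aid, the cost model above can be transcribed directly into a few lines of Python. This is a minimal sketch, not code from the paper; in particular, treating $D$ as the per-token KV cache size and $r$ as a model-dependent attention constant (e.g., the GQA group size) is my own reading of the symbols.

```python
def test_time_cost_eflops(P, N, L_in, L_out, D, r, I=562.5):
    """Equivalent cost (eFLOPs) of N reasoning trials of L_out tokens each.

    P: parameters, N: trials, L_in/L_out: prompt/generation lengths,
    D: per-token KV cache size (assumed), r: attention constant (assumed),
    I: hardware arithmetic intensity (562.5 for an NVIDIA B200).
    The prompt KV cache is shared across the N trials, so its memory term
    is not multiplied by N.
    """
    c_compute = 2 * P * N * L_out + (2 * r * N * L_in * D * L_out
                                     + r * N * D * L_out ** 2)
    c_mem = 2 * P * L_out + (2 * L_in * D * L_out + N * D * L_out ** 2)
    return c_compute + I * c_mem  # memory traffic scaled into "equivalent FLOPs"
```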
Key Insight: In long chains of thought (CoTs), attention-related costs dominate. We define $\Phi$ as the ratio of attention-related cost to parameter-related cost (in eFLOPs). When $L_{\text{out}} \ge 4096$, $\Phi$ can exceed 10–1000×, indicating that memory-bound attention dominates parameter-bound compute.
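As a rough back-of-the-envelope check, keeping only the dominant terms of the cost model above and plugging in illustrative values that are my assumptions rather than numbers from the paper ($P = 0.6\times10^{9}$, a per-token KV size $D \approx 5.7\times10^{4}$, and $I = 562.5$):

$$\Phi \approx \frac{I \cdot N D L_{\text{out}}^{2}}{2 P N L_{\text{out}}} = \frac{I\,D\,L_{\text{out}}}{2P} \approx \frac{562.5 \times 5.7\times10^{4} \times 4096}{2 \times 0.6\times10^{9}} \approx 110,$$

i.e., already on the order of 100× at $L_{\text{out}} = 4096$ for a 0.6B-parameter model, consistent with the 10–1000× range above.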
We compare the TTS scaling obtained from the existing FLOPs-based analysis with that from our eFLOPs-based analysis. For each cost model, we identify the optimal configuration for each individual task and average the resulting accuracy scores under the same cost budget (FLOPs or eFLOPs).
The “smallness” of smaller models is deceptive. As shown in Figure 4a, their KV cache footprint can be substantial, and proportionally even larger than that of much bigger models. This issue is exacerbated by the quadratic dependence of cost on generation length, which the traditional FLOPs-based cost model fails to capture.
The plots below highlight this discrepancy through an iso-cost analysis showing how cost budgets influence the optimal combination of model size and CoT length. In Figure 4b, the Iso-FLOPs contours are nearly vertical—indicating that model size plays a dominant role. In contrast, the Iso-eFLOPs contours in Figure 4c appear more horizontal—revealing that optimal generation length is more responsive to cost budgets under the eFLOPs model.
Figure 4a: KV memory trend across model sizes
Figure 4b: Iso-FLOPs cost contours
Figure 4c: Iso-eFLOPs cost contours
KV memory scales sublinearly with model size: Smaller models can have substantial memory footprints due to disproportionately large KV caches. For instance, Qwen3-0.6B needs 3.5GB of KV cache for 32K tokens, while Qwen3-32B uses only 8GB. Empirically, doubling the parameter count increases KV memory by only ~1.2×.

Quadratic cost penalizes long generations: Under the $L_{\text{out}}^{2}D$ term, cost grows faster with generation length than with model size. As a result, small models can no longer compensate for limited capacity by simply generating longer outputs, especially for complex reasoning tasks.
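As a quick sanity check on the sublinear KV trend, here is a minimal sketch (not code from the paper) that estimates KV cache size from a model's attention configuration. The layer counts, KV-head counts, and head dimensions below are approximate values for the public Qwen3 configs and should be treated as assumptions.

```python
# Minimal sketch: estimate per-model KV cache size for a 32K-token context.
# Configs below are approximate Qwen3 values (assumed, not taken from the paper).

def kv_cache_bytes(num_layers: int, num_kv_heads: int, head_dim: int,
                   seq_len: int, bytes_per_elem: int = 2) -> int:
    """KV cache = 2 (K and V) x layers x KV heads x head_dim x tokens x dtype bytes."""
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_elem

configs = {
    "Qwen3-0.6B": dict(num_layers=28, num_kv_heads=8, head_dim=128),
    "Qwen3-32B":  dict(num_layers=64, num_kv_heads=8, head_dim=128),
}

for name, cfg in configs.items():
    gb = kv_cache_bytes(seq_len=32_768, **cfg) / 2**30
    print(f"{name}: ~{gb:.1f} GiB KV cache at 32K tokens")
# Roughly 3.5 GiB vs 8.0 GiB: ~53x more parameters, but only ~2.3x more KV memory.
```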
Our cost analysis hints at the importance of attention efficiency in TTS. Our sparse scaling law studies the effectiveness of attention sparsity and how it can further reshape the dense TTS Pareto frontier. To gauge the full potential of sparse attention, we first study oracle top-k attention.
Figure 5a: Best-of-N scaling comparison between oracle top-k and dense attention
Figure 5b: Best-of-N using Qwen3-30B-A3B
Figure 5c: Best-of-N using Qwen3-32B
Figure 5d: Long-CoT scaling comparison between oracle top-k and dense attention
Figure 5e: Long-CoT using Qwen3-30B-A3B
Figure 5f: Long-CoT using Qwen3-32B
Figure 5: Figures 5a and 5d show that sparse attention significantly improves performance and scalability, enabling higher accuracy at lower cost budgets. Figures 5b, 5c, 5e, and 5f highlight that both the Qwen3-30B-A3B and Qwen3-32B models benefit from sparsity, with 50–60 percentage point gains in low-cost settings and consistent ~5 point improvements even in high-cost regimes. Notably, sparse models reach these performance levels at much lower costs. For reference, $10^5$ Tera-eFLOPs is 22 seconds of B200 usage at 100% utilization.
We investigate the tradeoff between the KV budget $B$ and the number of generated tokens. For Best-of-N, we analyze how the optimal KV budget and the number of generated tokens scale with cost across $N$ reasoning trials.
Our analysis reveals a consistent trend: allocating additional compute toward generating more tokens is generally more effective than expanding the KV cache.
On the Best-of-N frontier, doubling the cost leads to only a 1.18× increase in the KV budget, compared to a 1.74× increase in total generated tokens. This trend indicates that, as more compute is invested at test time, high sparsity becomes increasingly critical to fully leverage the benefits of these strategies.

Figure 6: Tradeoff between generated tokens and KV budget on the Best-of-N frontier.
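To make the asymmetry concrete, a simple extrapolation of these two growth rates (my own arithmetic, not a figure from the paper): after doubling the budget four times, i.e., a 16× increase in cost,

$$\text{KV budget} \propto 1.18^{4} \approx 1.9\times, \qquad \text{generated tokens} \propto 1.74^{4} \approx 9.2\times,$$

so nearly all of the additional budget flows into generating more tokens rather than into a denser KV cache.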
To understand the importance of attention sparsity, we add KV sparsification to the set of TTS variables. Our holistic analysis reveals that:

With an increasing cost budget, it is more beneficial to scale the number of generated tokens than the KV cache budget.
In other words, the optimal KV cache budget grows very slowly with increasing cost budget. Consequently, this allows us to generate substantially more tokens at the same cost budget, resulting in higher performance. Importantly, the original quadratic cost scaling with generation length is replaced by a linear dependency, significantly reducing the cost of longer generations.

Figure 7: Compared to dense scaling, small models (0.6B, 1.7B, 4B) are more effective with sparse attention; in other words, they occupy a larger portion of the Pareto frontier.
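As a sketch of why the dependency becomes linear (my own reading, under the assumption that each decoding step attends to at most $B$ cached tokens under top-k selection), the attention terms of the earlier cost model are capped by $B$ instead of growing with the full context:

$$C_{\text{attn-mem}} \approx N D B L_{\text{out}}, \qquad C_{\text{attn-compute}} \approx 2 r N D B L_{\text{out}},$$

which scales linearly in $L_{\text{out}}$, in place of the dense $N D L_{\text{out}}^{2}$ terms.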
We explore two tractable alternatives to oracle top-k attention: block top-k attention and sliding-window attention. Although sliding-window attention is easier to implement and has zero search overhead, its performance is extremely poor. Block top-k attention demonstrates scaling comparable to oracle top-k attention, improving accuracy by 45 points in the low-cost regime and matching the accuracy of dense attention while using 8.58× fewer resources.
Figure 8a: Sparse algorithm comparison
Figure 8b: Block top-k Best-of-N scaling
Figure 8c: Block top-k Long-CoT scaling
Figure 8: (a) illustrates how block top-k attention scaling closely follows the oracle top-k scaling for the Qwen3-8B model on AIME24 tasks. (b) and (c) illustrate the Pareto frontiers of block top-k attention.
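For concreteness, here is a minimal single-query block top-k attention sketch in PyTorch. It is an illustration of the general idea rather than the authors' implementation; the block size, the k_blocks budget, the mean-key block summary, and the choice to always keep the most recent block are assumptions made for this sketch.

```python
import torch
import torch.nn.functional as F

def block_topk_attention(q, k, v, block_size=64, k_blocks=8):
    """Single-head, single-query decoding step.
    q: (d,)   k, v: (T, d) cached keys/values.
    Scores each block of keys by its mean key, keeps the top k_blocks blocks,
    then runs exact softmax attention over only those blocks.
    """
    T, d = k.shape
    n_blocks = (T + block_size - 1) // block_size
    pad = n_blocks * block_size - T
    k_pad = F.pad(k, (0, 0, 0, pad))                            # (n_blocks*block_size, d)
    # Block summaries: mean key per block.
    k_means = k_pad.view(n_blocks, block_size, d).mean(dim=1)   # (n_blocks, d)
    # Score blocks with the query; always keep the most recent block.
    block_scores = k_means @ q                                  # (n_blocks,)
    block_scores[-1] = float("inf")
    top_blocks = torch.topk(block_scores, min(k_blocks, n_blocks)).indices
    # Gather token indices of the selected blocks (drop padded positions).
    idx = (top_blocks[:, None] * block_size
           + torch.arange(block_size)).reshape(-1)
    idx = idx[idx < T]
    # Exact attention over the selected tokens only.
    attn = torch.softmax(k[idx] @ q / d ** 0.5, dim=0)          # (|idx|,)
    return attn @ v[idx]                                        # (d,)

# Tiny usage example with random tensors.
torch.manual_seed(0)
q, k, v = torch.randn(128), torch.randn(4096, 128), torch.randn(4096, 128)
out = block_topk_attention(q, k, v)
print(out.shape)  # torch.Size([128])
```

Scoring a block with a single dot product against its mean key keeps the search cost to roughly T / block_size comparisons per query, and the selected KV entries are contiguous blocks in memory, which is what makes this variant hardware-friendly compared to per-token top-k selection.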
In this post, we introduced the Kinetics Scaling Law, emphasizing that attention cost, not parameter count, is the dominant factor at test time, fundamentally reshaping the previous scaling law. We further demonstrated that sparse attention is crucial for achieving more scalable and effective test-time scaling. While our discussion focused on a simple sparse attention algorithm, block top-k attention, we anticipate that more advanced algorithms will approach or even outperform oracle top-k scaling. Moreover, sparse attention enables more reasoning trials and longer generations, unlocking greater flexibility in configuring TTS strategies within a fixed resource budget. Overall, our work aims to contribute to the understanding of efficiency and scalability challenges in the test-time scaling era, spanning model architecture, system-level implementation, and hardware design. We highlight the central role of sparsity in addressing these challenges.
@misc{sadhukhan2025kineticsrethinkingtesttimescaling,
title={Kinetics: Rethinking Test-Time Scaling Laws},
author={Ranajoy Sadhukhan and Zhuoming Chen and Haizhong Zheng and Yang Zhou and Emma Strubell and Beidi Chen},
year={2025},
eprint={2506.05333},
archivePrefix={arXiv},
primaryClass={cs.LG},
url={https://arxiv.org/abs/2506.05333},
}