TL;DR: We introduce Kinetics, which challenges traditional test-time scaling (TTS) laws by adopting a practical efficiency perspective. It reveals that prior compute-optimal approaches overlook the key-value memory access bottlenecks introduced by various TTS strategies. By jointly considering memory and compute, the Kinetics scaling law shows that it is more efficient to scale model size up to a threshold before investing more compute in test-time scaling. Additionally, Kinetics promotes sparse attention to achieve even better scalability.
The existing compute-optimal test-time scaling law often favors smaller models with more test-time compute. We revisit this paradigm from a practical efficiency perspective, uncovering that the effectiveness of small models is often overestimated due to overlooked key-value (KV) memory bottlenecks introduced by inference strategies such as Best-of-N sampling or long chains of thought (CoT).
To address this, we propose the Kinetics Scaling Law, which jointly considers both compute and memory costs. Unlike prior work, it shows that test-time compute is best allocated to scaling model size—up to a threshold (e.g., 14B parameters for Qwen3)—before increasing generation length. (See Figure 1)
Furthermore, we demonstrate that sparse attention unlocks new scaling opportunities by mitigating KV memory overhead, enabling longer generations and more parallel reasoning trials within the same budget. This leads to substantial gains in test-time accuracy and efficiency. Empirically, we show that sparse-attention models consistently outperform their dense counterparts, achieving gains of over 60 points in low-cost regimes and over 5 points in high-cost regimes in problem-solving accuracy on AIME, including evaluations of state-of-the-art MoEs (see Figure 2). These results suggest that sparse attention is essential for realizing the full potential of test-time scaling because, unlike training, where parameter scaling saturates, test-time accuracy continues to improve with increased generation.
These insights form the foundation of Kinetics, a new perspective on scaling that aligns resource allocation more closely with real-world inference constraints.
As a first step, we revisit the cost model to understand the relative importance of compute and memory in TTS and how it can affect the choice of optimal model size and TTS configurations like chain-of-thought (CoT) length or number of trials.
We analyze test-time inference cost using a model that accounts for both compute and memory access, and define an equivalent cost, measured in eFLOPs, that unifies both kinds of cost in a single metric.
Definitions:
Computation Cost:
$$C_{\text{compute}} = C_{\text{param-compute}} + C_{\text{attn-compute}} = 2PNL_{\text{out}} + \left(2rNL_{\text{in}}DL_{\text{out}} + rNDL_{\text{out}}^{2}\right)$$
Memory Access Cost:
$$C_{\text{mem}} = C_{\text{param-mem}} + C_{\text{attn-mem}} = 2PL_{\text{out}} + \left(2L_{\text{in}}DL_{\text{out}} + NDL_{\text{out}}^{2}\right)$$
Here, $P$ is the number of model parameters, $N$ the number of reasoning trials, $L_{\text{in}}$ and $L_{\text{out}}$ the prompt and generation lengths, $D$ the KV cache size per token, and $r$ the GQA group size. We also assume the prompt KV cache is reused across the $N$ reasoning trials.
eFLOPs (equivalent FLOPs) scale the memory-access cost by the arithmetic intensity of the hardware so that compute and memory costs can be unified in a single metric:

$$C_{\text{eFLOPs}} = C_{\text{compute}} + I \cdot C_{\text{mem}},$$

where $I$ is the arithmetic intensity of the hardware (e.g., 562.5 for NVIDIA B200).
In real-world deployment, model parameters are amortized across large batches, rendering $C_{\text{param-mem}}$ insignificant.
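To make the cost model concrete, here is a minimal Python sketch of the two cost terms and the eFLOPs metric exactly as written above. It is an illustration of the formulas, not the authors' implementation; the intensity default simply reuses the B200 value quoted above.

```python
# Minimal sketch of the cost model above. Illustrative only.

def compute_cost(P, N, L_in, L_out, D, r):
    """C_compute = 2*P*N*L_out + (2*r*N*L_in*D*L_out + r*N*D*L_out**2)."""
    param = 2 * P * N * L_out
    attn = 2 * r * N * L_in * D * L_out + r * N * D * L_out**2
    return param + attn

def memory_cost(P, N, L_in, L_out, D, include_param_mem=True):
    """C_mem = 2*P*L_out + (2*L_in*D*L_out + N*D*L_out**2).

    With large serving batches, the parameter-read term amortizes away
    and can be dropped (include_param_mem=False), as noted above.
    """
    param = 2 * P * L_out if include_param_mem else 0
    attn = 2 * L_in * D * L_out + N * D * L_out**2
    return param + attn

def eflops(P, N, L_in, L_out, D, r, intensity=562.5):
    """eFLOPs = C_compute + I * C_mem (I = hardware arithmetic intensity)."""
    return compute_cost(P, N, L_in, L_out, D, r) + intensity * memory_cost(
        P, N, L_in, L_out, D, include_param_mem=False
    )
```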
Key Insight: In long chains of thought (CoTs), attention-related costs dominate. We define $\Phi$ as the ratio of attention-related cost to parameter-related cost (in eFLOPs). When $L_{\text{out}} \ge 4096$, $\Phi$ can exceed 10–1000×, indicating that memory-bound attention dominates over parameter-bound compute.
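As a rough worked example of why $\Phi$ grows with generation length, the snippet below plugs assumed, unofficial numbers for an 8B-class GQA model into our reading of the ratio (attention eFLOPs over parameter eFLOPs); every constant except the B200 intensity is an illustrative assumption. With these values $\Phi$ already exceeds 10 at $L_{\text{out}} = 4096$, and shrinking $P$ (smaller models) or lengthening $L_{\text{out}}$ pushes it further toward the upper end of the quoted range.

```python
# Rough illustration of Phi = attention cost / parameter cost (our reading).
# All model constants below are assumptions for illustration only.
P = 8e9            # parameter count of an 8B-class model (assumed)
D = 7.4e4          # KV cache elements per token (assumed)
r = 4              # GQA group size: query heads per KV head (assumed)
I = 562.5          # B200 arithmetic intensity, as quoted above
N, L_in = 1, 1024  # one reasoning trial, 1K-token prompt (assumed)

for L_out in (1024, 4096, 16384):
    # Parameter cost: compute only (parameter reads amortize across the batch).
    param = 2 * P * N * L_out
    # Attention cost: compute plus memory access scaled by intensity I.
    attn = (2 * r * N * L_in * D * L_out + r * N * D * L_out**2) \
         + I * (2 * L_in * D * L_out + N * D * L_out**2)
    print(f"L_out={L_out}: Phi ~ {attn / param:.0f}")
```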
We compare the TTS scaling behavior obtained from the existing FLOPs-based analysis with that from our eFLOPs-based analysis. For each cost model, we identify the optimal configuration for each individual task and average the resulting accuracy scores under the same cost budget (FLOPs or eFLOPs).
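For clarity, here is a small sketch of this selection procedure under hypothetical data structures: a list of per-task runs with measured accuracy, and a pluggable cost function standing in for either FLOPs or eFLOPs.

```python
# Sketch of the per-task "best configuration under a budget" selection.
# Data structures and names are hypothetical.
from collections import defaultdict

def best_accuracy_under_budget(runs, cost_fn, budgets):
    """Average best per-task accuracy achievable under each cost budget.

    runs: list of dicts like {"task": ..., "config": ..., "accuracy": ...}
    cost_fn: maps a config to its cost (FLOPs or eFLOPs)
    budgets: iterable of cost budgets to sweep
    """
    by_task = defaultdict(list)
    for run in runs:
        by_task[run["task"]].append(run)

    curve = {}
    for budget in budgets:
        per_task_best = [
            max((r["accuracy"] for r in task_runs
                 if cost_fn(r["config"]) <= budget), default=0.0)
            for task_runs in by_task.values()
        ]
        curve[budget] = sum(per_task_best) / len(per_task_best)
    return curve
```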
The “smallness” of smaller models is deceptive. As shown in Figure 4a, their KV cache footprint can be substantial, and proportionally even larger than that of much bigger models. The issue is exacerbated by the quadratic dependence of cost on generation length, which the traditional FLOPs-based cost model fails to capture.
The plots below highlight this discrepancy through an iso-cost analysis showing how cost budgets influence the optimal combination of model size and CoT length. In Figure 4b, the Iso-FLOPs contours are nearly vertical—indicating that model size plays a dominant role. In contrast, the Iso-eFLOPs contours in Figure 4c appear more horizontal—revealing that optimal generation length is more responsive to cost budgets under the eFLOPs model.
Figure 4a: KV memory trend across model sizes
Figure 4b: Iso-FLOPs cost contours
Figure 4c: Iso-eFLOPs cost contours
KV memory scales sublinearly with model size: smaller models can have substantial memory footprints due to disproportionately large KV caches. For instance, Qwen3-0.6B needs 3.5GB of KV cache for 32K tokens, while Qwen3-32B uses only 8GB. Empirically, doubling the parameter count increases KV memory by only ~1.2×.

Quadratic cost penalizes long generations: under the quadratic ($L_{\text{out}}^{2}D$) attention cost, generation cost grows faster than model size. As a result, small models can no longer compensate for limited capacity by simply generating longer outputs, especially for complex reasoning tasks.
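These numbers can be sanity-checked with a back-of-the-envelope calculation. The sketch below assumes bf16 KV caches and the layer / KV-head / head-dimension values we believe the Qwen3 configs use; treat those values as assumptions rather than official specs.

```python
# Back-of-the-envelope KV cache size. Config values are our assumptions
# about Qwen3 (layers, KV heads, head dim); bf16 = 2 bytes per element.
def kv_cache_gb(layers, kv_heads, head_dim, tokens, bytes_per_elem=2):
    # 2x for storing both keys and values at every layer.
    return 2 * layers * kv_heads * head_dim * bytes_per_elem * tokens / 1024**3

print(kv_cache_gb(28, 8, 128, 32 * 1024))   # ~3.5 GB (Qwen3-0.6B-like config)
print(kv_cache_gb(64, 8, 128, 32 * 1024))   # ~8.0 GB (Qwen3-32B-like config)
```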
Our cost analysis hints at the importance of attention efficiency in TTS. Our sparse scaling law studies the effectiveness of attention sparsity and how it can further reshape the dense TTS Pareto frontier. To gauge the full potential of sparse attention, we first study oracle top-k attention, sketched below.
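For concreteness, here is a minimal single-head sketch of what oracle top-k attention means at one decode step: the exact attention scores over the full KV cache are computed first, and only the k highest-scoring tokens are kept, so token selection is perfect but nothing is saved at selection time (which is why it is an oracle rather than a deployable method). Shapes and names are illustrative.

```python
import torch
import torch.nn.functional as F

def oracle_topk_attention(q, K, V, k):
    """Single-head decode step: q [d], K/V [t, d]; attend to the top-k keys.

    "Oracle" because the exact scores over all t cached tokens are computed
    first and only then pruned to the k largest, so there is no approximation
    in which tokens are kept, but also no cost savings at selection time.
    """
    scores = K @ q / K.shape[-1] ** 0.5            # [t] exact attention logits
    k = min(k, scores.shape[0])
    top_scores, top_idx = torch.topk(scores, k)    # keep the k strongest tokens
    weights = F.softmax(top_scores, dim=-1)        # renormalize over the kept set
    return weights @ V[top_idx]                    # [d] sparse attention output
```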
Figure 5a: Best-of-N scaling comparison between oracle top-k and dense attention
Figure 5b: Best-of-N using Qwen3-8B
Figure 5c: Best-of-N using Qwen3-32B
Figure 5d: Long-CoT scaling comparison between oracle top-k and dense attention
Figure 5e: Long-CoT using Qwen3-8B
Figure 5f: Long-CoT using Qwen3-32B
Figure 5: Figures 5a and 5d show that sparse attention significantly improves performance and scalability, enabling higher accuracy at lower compute budgets. Figures 5b, 5c, 5e, and 5f highlight that both the 8B and 32B models benefit from sparsity, with 50–60 percentage-point gains in low-cost settings and consistent ~5-point improvements even in high-cost regimes. Notably, sparse models reach these performance levels at much lower cost: for context, $10^5$ Tera-eFLOPs corresponds to only about 22 seconds of B200 usage.
To understand the importance of attention sparsity, we include KV sparsification in the list of TTS variables. Our holistic analysis reveals that:
With increasing cost budget, it is more beneficial to scale generation tokens than KV cache budget.
In other words, the optimal KV cache budget changes very slowly as the cost budget increases. Consequently, we can generate substantially more tokens at the same cost budget, resulting in higher performance. Importantly, the original quadratic cost scaling with generation length is replaced by a linear dependency, significantly reducing the cost of longer generations.

Figure 6: Compared to dense scaling, small models (0.6B, 1.7B, 4B) are more effective with sparse attention. In other words, they occupy more of the Pareto frontier.
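To spell out the linear-versus-quadratic point, suppose each trial's attention is restricted to a KV budget of $B$ tokens. Under our reading of the cost model above (a sketch; the exact accounting may differ in constants), the generation-side attention memory term changes as

$$C_{\text{attn-mem}}^{\text{dense}} \approx NDL_{\text{out}}^{2} \quad\longrightarrow\quad C_{\text{attn-mem}}^{\text{sparse}} \approx NDBL_{\text{out}}, \qquad B \ll L_{\text{out}},$$

so cost grows linearly rather than quadratically in $L_{\text{out}}$, and the eFLOPs saved at a fixed budget can be spent on longer generations or additional trials.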
We explore two tractable alternatives to oracle top-k attention: block top-k attention and sliding window attention. Although sliding window attention is easier to implement and has zero search overhead, its performance is far inferior. Block top-k attention demonstrates scaling comparable to oracle top-k attention, improving accuracy by 45 points in the low-cost regime and achieving equivalent accuracy while using 8.58× fewer resources than dense attention.
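For reference, here is a minimal single-head sketch of the block top-k idea: the KV cache is split into fixed-size blocks, each block is ranked by a cheap proxy score (its mean key dotted with the query), and exact attention then runs only over the selected blocks. The shapes, block size, and proxy choice are illustrative, not the exact kernel used in the experiments.

```python
import torch
import torch.nn.functional as F

def block_topk_attention(q, K, V, block_size=64, num_blocks=8):
    """Single-head decode step with block-level top-k key/value selection.

    q: [d] current query; K, V: [t, d] cached keys/values.
    For simplicity this sketch assumes t is a multiple of block_size.
    Blocks are ranked by the score of their mean key (a cheap proxy),
    and exact attention then runs only over the selected blocks.
    """
    t, d = K.shape
    assert t % block_size == 0, "sketch assumes a whole number of blocks"
    Kb = K.view(-1, block_size, d)                    # [n_blocks, block, d]
    Vb = V.view(-1, block_size, d)

    block_scores = Kb.mean(dim=1) @ q                 # proxy score per block
    keep = torch.topk(block_scores, min(num_blocks, Kb.shape[0])).indices

    Ks = Kb[keep].reshape(-1, d)                      # gathered sparse keys
    Vs = Vb[keep].reshape(-1, d)
    weights = F.softmax(Ks @ q / d ** 0.5, dim=-1)    # exact attention on subset
    return weights @ Vs
```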
Figure 7a: Sparse algorithm comparison
Figure 7b: Block top-k Best-of-N scaling
Figure 7c: Block top-k Long-CoT scaling
Figure 7: (a) illustrates how block top-k attention scaling closely follows the oracle top-k scaling for the Qwen3-8B model on AIME24 tasks. (b) and (c) illustrate the Pareto frontiers of block top-k attention.
In this post, we introduced the Kinetics Scaling Law, which emphasizes that attention cost, not parameter count, is the dominant factor at test time, fundamentally reshaping the traditional scaling landscape. We further demonstrated that sparse attention is crucial for more effective and scalable test-time scaling. While our discussion focused on a simple sparse attention algorithm, block top-k attention, we anticipate that more advanced algorithms will approach or even outperform oracle top-k scaling. Moreover, sparse attention drastically reduces inference cost, enabling more reasoning trials and longer generations, and thereby unlocking greater flexibility in configuring TTS strategies within a fixed resource budget. Overall, we believe the Kinetics Scaling Law serves as a guiding principle for end-to-end design across agent deployment, model architectures, LLM serving systems, and hardware.
@misc{sadhukhan2025kinetics,
title={Kinetics: Rethinking Test-Time Scaling Laws},
author={Ranajoy Sadhukhan and Zhuoming Chen and Haizhong Zheng and Yang Zhou and Emma Strubell and Beidi Chen},
year={2025},
eprint={2506.05333},
archivePrefix={arXiv},
primaryClass={cs.LG},
url={https://arxiv.org/abs/2506.05333},
}