
Kinetics: Rethinking Test-Time Scaling Laws



Carnegie Mellon University
*Indicates Equal Contribution

Kinetics Scaling Law

Figure 1: On AIME24, compute-optimal model choices can be up to 3× costlier than those guided by the Kinetics scaling law.

Sparse Scaling Advantage

Figure 2: Sparse attention unlocks stronger scaling by enabling longer sequences and more parallel samples.

TL;DR: We introduce Kinetics, which challenges the traditional test-time scaling (TTS) laws by adopting a practical efficiency perspective. It reveals that prior compute-optimal approaches overlook major key-value memory access bottlenecks in various TTS strategies. By jointly considering memory and compute, the Kinetics scaling law shows that it is more efficient to scale model size up to a threshold before investing more compute in test-time scaling. Motivated by this, we propose a new scaling paradigm centered on sparse attention, which fundamentally reshapes the scaling law. According to our Sparse Kinetics scaling law, sparse attention significantly enhances the scalability and performance of TTS and becomes increasingly valuable in high-cost scenarios.

  Introduction

The existing compute-optimal test-time scaling law often favors smaller models with more test-time compute. We revisit this paradigm from a practical efficiency perspective, uncovering that the effectiveness of small models is often overestimated due to overlooked key-value (KV) memory bottlenecks introduced by inference strategies such as Best-of-N sampling and long chain-of-thought (CoT) reasoning.

To address this, we propose the Kinetics Scaling Law, which jointly considers both compute and memory costs. Unlike prior work, it shows that test-time compute is best allocated to scaling model size—up to a threshold (e.g., 14B parameters for Qwen3)—before increasing generation length. (See Figure 1)

Furthermore, we demonstrate that sparse attention unlocks new scaling opportunities by mitigating KV memory overhead, enabling longer generations and more parallel reasoning trials within the same budget. This leads to substantial gains in test-time accuracy and efficiency. Empirically, sparse attention models consistently outperform their dense counterparts on AIME problem-solving accuracy, with gains of over 60 points in low-cost regimes and over 5 points in high-cost regimes, including evaluations on state-of-the-art MoEs (see Figure 2). Moreover, according to our Sparse Kinetics scaling law, computational resources are best allocated to test-time strategies rather than to reducing sparsity.

We demonstrate the practicality of Sparse Kinetics using a simple block-sparse attention mechanism, which is known to be hardware-efficient and shows promising scalability comparable to oracle top-k attention. Block top-k attention achieves throughput improvements of 3.2–33.3× on H200 GPUs.

While sparsity has traditionally been employed either for regularization in small models or to reduce computation in over-parameterized networks, our work introduces a fundamentally different perspective: sparsity as a central enabler of efficient and scalable test-time compute. In contrast to pretraining, where scaling exhibits diminishing returns, TTS continues to benefit from increased token generation and more reasoning trials.

Note:
Advanced test-time strategies shift evaluation from token-centric metrics (e.g., perplexity, latency) to task-level throughput—the number of tasks completed per unit time. This shift is especially relevant for reasoning tasks, where intermediate steps may vary widely depending on the strategy, yet the ultimate utility hinges almost entirely on the correctness of the final output. In contrast, traditional tasks like chat completions focus on token-level quality and throughput. This paradigm shift paves the way for novel algorithmic approaches, such as sparse scaling.

  Rethinking Test-Time Scaling Law

 A Holistic Cost Model for TTS

As a first step, we revisit the cost model to understand the relative importance of compute and memory in TTS and how it can affect the choice of optimal model size and TTS configurations like chain-of-thought (CoT) length or number of trials.

Full Cost Model Derivation

We analyze test-time inference cost using a model that combines both compute and memory access, and define an equivalent cost metric, eFLOPs, that unifies both kinds of cost in a single measure.

Definitions:

  • $P$: model parameter size
  • $N$: number of reasoning trials
  • $L_{\text{in}}$: input (prefix) length
  • $L_{\text{out}}$: output (generation) length
  • $D$: attention head dimension
  • $r$: GQA group size
  • $I$: arithmetic intensity of the hardware (peak FLOPs per byte of memory bandwidth)

Computation Cost:

$$C_{\text{comp}} = C_{\text{param-compute}} + C_{\text{attn-compute}} = 2PNL_{\text{out}} + \left(2rNL_{\text{in}}DL_{\text{out}} + rNDL_{\text{out}}^2\right)$$

Memory Access Cost:

$$C_{\text{mem}} = C_{\text{param-mem}} + C_{\text{attn-mem}} = 2PL_{\text{out}} + \left(2L_{\text{in}}DL_{\text{out}} + NDL_{\text{out}}^2\right)$$

We assume the prompt KV cache is reused across the $N$ reasoning trials.

eFLOPs (Equivalent FLOPs)

eFLOPs scales memory cost by the arithmetic intensity of the hardware to unify compute and memory costs in a single metric.

$$\text{eFLOPs} = C_{\text{comp}} + I \times C_{\text{mem}}$$

where I is the arithmetic intensity of the hardware (e.g., 562.5 for NVIDIA B200).

Final Cost Model:

In real-world deployment, model parameters are amortized across large batches, rendering $C_{\text{param-mem}}$ insignificant.

$$C_{\text{TTS}} = 2NPL_{\text{out}} + 2rNL_{\text{in}}DL_{\text{out}} + rNDL_{\text{out}}^2 + 2IL_{\text{in}}DL_{\text{out}} + INDL_{\text{out}}^2$$
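
For concreteness, here is a minimal Python sketch of the cost model above. The symbols mirror the definitions in this section; the function is purely illustrative bookkeeping, not the paper's evaluation code.

```python
def tts_cost_eflops(P, N, L_in, L_out, D, r, I=562.5):
    """Illustrative C_TTS in eFLOPs, following the formula above.

    P: parameters, N: trials, L_in: prompt length, L_out: generated length,
    D: attention (KV) dimension, r: GQA group size, I: arithmetic intensity
    (defaults to the B200 value quoted above). Parameter memory is omitted,
    assuming it is amortized across a large batch.
    """
    param_compute = 2 * N * P * L_out
    attn_compute = 2 * r * N * L_in * D * L_out + r * N * D * L_out ** 2
    attn_memory = 2 * L_in * D * L_out + N * D * L_out ** 2
    return param_compute + attn_compute + I * attn_memory
```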

Key Insight: In long chains-of-thought (CoTs), attention-related costs dominate. We define the ratio:

$$\Phi = \frac{2rL_{\text{in}}D + (rD + ID)L_{\text{out}}}{2P}$$

When $L_{\text{out}} \geq 4096$, $\Phi$ can reach 10–1000×, indicating that memory-bound attention dominates parameter-bound compute.

Implication: As generation length increases, the inference bottleneck shifts from the linear $L_{\text{out}}P$ parameter term to the quadratic $L_{\text{out}}^2 D$ attention terms, underscoring the need for attention efficiency.
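
To get a feel for the magnitude of $\Phi$, the sketch below evaluates it for a few generation lengths. The model settings are assumptions (roughly Qwen3-8B-like, with $D$ read as the aggregate per-token KV dimension across all layers, keys and values combined); they illustrate the trend rather than reproduce the paper's numbers.

```python
def attention_to_param_ratio(P, L_in, L_out, D, r, I=562.5):
    """Phi from the formula above: attention cost relative to parameter cost."""
    return (2 * r * L_in * D + (r * D + I * D) * L_out) / (2 * P)

# Assumed 8B-dense-like settings: 36 layers x 8 KV heads x 128 head dim x 2 (K and V),
# GQA group size 4, and a 2K-token prompt. All of these are illustrative assumptions.
P, D, r = 8e9, 36 * 8 * 128 * 2, 4
for L_out in (1024, 4096, 16384, 32768):
    phi = attention_to_param_ratio(P, L_in=2048, L_out=L_out, D=D, r=r)
    print(f"L_out={L_out:>6}: Phi ~ {phi:.1f}")
```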


  Kinetics Scaling Law

We compare the TTS scaling obtained from the existing FLOPs-based analysis and our eFLOPs-based analysis. For each cost model, we identify the optimal configuration for each individual task and average their accuracy scores under the same cost budget (FLOPs or eFLOPs).

Figure 3: Comparison of AIME24 Pareto frontiers for long-CoT-based Qwen3 TTS under FLOPs (a, b) and eFLOPs (c, d) cost models. Optimal model choices are highlighted in (a, c), while optimal CoT lengths are shown in (b, d). Similar observations hold for Best-of-N scaling.
Takeaway:
  1. A FLOPs-based cost model tends to favor smaller models with more test-time compute (e.g., more trials or longer generations), until their accuracy gains begin to plateau.
  2. In contrast, the eFLOPs-based model suggests that scaling up model size is more effective than increasing test-time compute, even in low-accuracy regimes. We find that only beyond an emergent model size (14B for the Qwen3 series) do longer CoTs begin to outperform further parameter scaling. Therefore, even under limited budgets, it is often better to choose a larger model upfront (a selection sketch follows below).
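
Conceptually, each point on the frontier in Figure 3 comes from a constrained selection: for every budget, keep the highest-accuracy configuration whose cost fits. Here is a schematic sketch of that procedure; the `configs` table and its field names are hypothetical, and real entries would come from benchmark runs such as AIME24.

```python
def best_under_budget(configs, budget_eflops):
    """Pick the highest-accuracy configuration whose cost fits the budget.

    configs: iterable of dicts with keys 'name', 'accuracy', 'cost_eflops'
    (e.g., one entry per (model size, CoT length) pair). Hypothetical schema.
    """
    feasible = [c for c in configs if c["cost_eflops"] <= budget_eflops]
    return max(feasible, key=lambda c: c["accuracy"], default=None)


def pareto_frontier(configs):
    """Keep configurations that are not dominated in (cost, accuracy)."""
    frontier, best_acc = [], float("-inf")
    for c in sorted(configs, key=lambda c: c["cost_eflops"]):
        if c["accuracy"] > best_acc:
            frontier.append(c)
            best_acc = c["accuracy"]
    return frontier
```
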
Why do they diverge?

The “smallness” of smaller models is deceptive. As shown in Figure 4a, their KV cache footprint can be substantial, proportionally even larger than that of much bigger models. This issue is exacerbated by the quadratic cost dependency on generation length, which the traditional FLOPs-based cost model fails to capture.

The plots below highlight this discrepancy through an iso-cost analysis showing how cost budgets influence the optimal combination of model size and CoT length. In Figure 4b, the Iso-FLOPs contours are nearly vertical—indicating that model size plays a dominant role. In contrast, the Iso-eFLOPs contours in Figure 4c appear more horizontal—revealing that optimal generation length is more responsive to cost budgets under the eFLOPs model.


Figure 4a: KV memory trend across model sizes


Figure 4b: Iso-FLOPs cost contours


Figure 4c: Iso-eFLOPs cost contours

KV memory scales sublinearly with model size: smaller models can have substantial memory footprints due to disproportionately large KV caches. For instance, Qwen3-0.6B needs 3.5GB of KV cache for 32K tokens, while Qwen3-32B uses only 8GB. Empirically, doubling the parameter count increases KV memory by only ~1.2×.

Quadratic cost penalizes long generations: under the quadratic $L_{\text{out}}^2 D$ attention term, generation cost grows faster than model size. As a result, small models can no longer compensate for limited capacity by simply generating longer outputs, especially for complex reasoning tasks.
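
A quick back-of-the-envelope check on the numbers quoted above makes the sublinearity concrete (the KV sizes are the ones stated in the text; the byte conversion assumes 1 GB = 2^30 bytes):

```python
# Per-token KV footprint and how weakly it tracks parameter count,
# using the Qwen3 figures quoted in the paragraph above.
GB = 2 ** 30
models = {
    "Qwen3-0.6B": {"params": 0.6e9, "kv_bytes_32k": 3.5 * GB},
    "Qwen3-32B":  {"params": 32e9,  "kv_bytes_32k": 8.0 * GB},
}
for name, m in models.items():
    per_token_kb = m["kv_bytes_32k"] / 32768 / 1024
    print(f"{name}: ~{per_token_kb:.0f} KB of KV cache per token")

ratio_params = models["Qwen3-32B"]["params"] / models["Qwen3-0.6B"]["params"]
ratio_kv = models["Qwen3-32B"]["kv_bytes_32k"] / models["Qwen3-0.6B"]["kv_bytes_32k"]
print(f"{ratio_params:.0f}x the parameters, but only {ratio_kv:.1f}x the KV cache")
```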

  Kinetics Sparse Test-Time Scaling

Our cost analysis hints at the importance of attention efficiency in TTS. Our sparse scaling law studies the effectiveness of attention sparsity and how it can further reshape the dense TTS Pareto frontier. To see the full potential of sparse attention, we first study oracle top-k attention.

 Sparse attention significantly enhances problem-solving performance


Figure 5a: Best-of-N scaling comparison between oracle top-k and dense attention


Figure 5b: Best-of-N using Qwen3-30B-A3B


Figure 5c: Best-of-N using Qwen3-32B


Figure 5d: Long-CoT scaling comparison between oracle top-k and dense attention


Figure 5e: Long-CoT using Qwen3-30B-A3B


Figure 5f: Long-CoT using Qwen3-32B


Figure 5: Figures 5a and 5d show that sparse attention significantly improves performance and scalability, enabling higher accuracy at lower cost budgets. Figures 5b, 5c, 5e, and 5f highlight that both Qwen3 30B-A3B and 32B models benefit from sparsity, with 50–60 percentage point gains in low-cost settings and consistent ~5 point improvements even in high-cost regimes. Notably, sparse models reach these performance levels at much lower costs. For reference, 10⁵ Tera-eFLOPs is 22 seconds of B200 usage at 100% utilization.


  Sparse attention becomes increasingly valuable in high-cost scenarios

We investigate the tradeoff between KV budget B and generation tokens. For Best-of-N, we analyze how the optimal KV budget and the number of generated tokens scale with cost across N reasoning trials.

Our analysis reveals a consistent trend: allocating additional compute toward generating more tokens is generally more effective than expanding the KV cache.

On the Best-of-N frontier, doubling the cost leads to only a 1.18× increase in the KV budget, compared to a 1.74× increase in total generated tokens. This trend indicates that, as more compute is invested at test time, high sparsity becomes increasingly critical to fully leverage the benefits of these strategies.


Figure 6: Tradeoff between generated tokens and KV budget on the Best-of-N frontier.


  Can sparse attention reshape the Kinetics scaling law?

To understand the importance of attention sparsity, we include KV sparsification among the TTS variables. Our holistic analysis reveals that:

With increasing cost budget, it is more beneficial to scale generation tokens than KV cache budget.

In other words, the optimal KV cache budget grows very slowly with increasing cost budget. Consequently, we can generate substantially more tokens at the same cost budget, resulting in higher performance. Importantly, the original quadratic cost scaling with generation length is replaced by a linear dependency, significantly reducing the cost of longer generations.
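
Here is a minimal sketch of why a fixed KV budget linearizes the cost: dense decoding reads the entire (growing) KV cache at every step, while top-k-style sparse attention reads at most $B$ entries per step. Constant factors are dropped and the overhead of selecting which entries to read is ignored; the symbols follow the cost model above.

```python
def attn_decode_cost(L_in, L_out, D, r, I, B=None):
    """Per-trial attention cost of decoding L_out tokens, in eFLOPs-style units.

    B=None models dense attention (read the whole cache at each step);
    an integer B models top-k sparse attention with a fixed KV budget.
    Constant factors and index-selection overhead are ignored.
    """
    total = 0.0
    for t in range(L_out):
        ctx = L_in + t                             # tokens currently cached
        read = ctx if B is None else min(B, ctx)   # KV entries actually touched
        total += 2 * r * D * read + I * D * read   # compute + memory traffic
    return total
```

With `B` fixed, the per-step term becomes constant, so the total grows linearly in `L_out`; with dense attention it grows quadratically, matching the $L_{\text{out}}^2 D$ terms in the cost model.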


Figure 7: Compared to dense scaling, small models (0.6B, 1.7B, 4B) are more effective with sparse attention; they occupy a larger portion of the Pareto frontier.


  Exploring tractable sparse attention algorithms

We explore two tractable alternatives to oracle top-k attention: block top-k attention and sliding-window attention. Although sliding-window attention is easier to implement and has zero search overhead, its performance is substantially worse. Block top-k attention demonstrates scaling comparable to oracle top-k attention, improving accuracy by 45 points in the low-cost regime and achieving equivalent accuracy while using 8.58× fewer resources than dense attention.
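
For reference, here is a single-query, single-head sketch of the block top-k idea: keys are grouped into fixed-size blocks, each block is scored by the similarity of its mean key to the query, and ordinary attention runs only over the selected blocks. The mean-key scoring rule and all shapes are illustrative assumptions; the hardware-efficient, kernel-level implementation used in our experiments is not captured here.

```python
import torch

def block_topk_attention(q, K, V, block_size=64, k_blocks=8):
    """Single-query, single-head sketch. q: (d,), K and V: (T, d)."""
    T, d = K.shape
    # Score each block of keys by the dot product of its mean key with the query.
    block_means = torch.stack([blk.mean(dim=0) for blk in K.split(block_size)])
    top = torch.topk(block_means @ q, k=min(k_blocks, block_means.shape[0])).indices

    # Gather the token indices belonging to the selected blocks.
    idx = torch.cat([
        torch.arange(b * block_size, min((b + 1) * block_size, T))
        for b in top.tolist()
    ])
    # Ordinary softmax attention restricted to the selected tokens.
    attn = torch.softmax((K[idx] @ q) / d ** 0.5, dim=0)
    return attn @ V[idx]
```

A sliding-window variant would simply replace the selected indices with the most recent tokens, which is why it needs no search step but can discard relevant context.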


Figure 8a: Sparse algorithm comparison


Figure 8b: Block top-k Best-of-N scaling


Figure 8c: Block top-k Long-CoT scaling

Figure 8: (a) illustrates how block top-k attention scaling closely follows the oracle top-k scaling for the Qwen3-8B model on AIME24 tasks. (b) and (c) illustrate the Pareto frontiers of block top-k attention.

  Conclusion and Future Work

In this post, we introduced the Kinetics Scaling Law, emphasizing that attention cost, not parameter count, is the dominant factor at test time, fundamentally reshaping the previous scaling law. We further demonstrated that sparse attention is crucial for achieving more scalable and effective test-time scaling. While our discussion focused on a simple sparse attention algorithm, block top-k attention, we anticipate that more advanced algorithms will approach or even outperform oracle top-k scaling. Moreover, sparse attention enables more reasoning trials and longer generations, unlocking greater flexibility in configuring TTS strategies within a fixed resource budget. Overall, our work aims to contribute to the understanding of efficiency and scalability challenges in the test-time scaling era, spanning model architecture, system-level implementation, and hardware design. We highlight the central role of sparsity in addressing these challenges.


BibTeX

@misc{sadhukhan2025kineticsrethinkingtesttimescaling,
      title={Kinetics: Rethinking Test-Time Scaling Laws}, 
      author={Ranajoy Sadhukhan and Zhuoming Chen and Haizhong Zheng and Yang Zhou and Emma Strubell and Beidi Chen},
      year={2025},
      eprint={2506.05333},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2506.05333}, 
}