
Kinetics: Rethinking Test-Time Scaling Laws



Carnegie Mellon University
*Indicates Equal Contribution

Kinetics Scaling Law

Figure 1: On AIME24, compute-optimal model choices can be up to 3× costlier than those guided by the Kinetics scaling law.

Sparse Scaling Advantage

Figure 2: Sparse attention unlocks stronger scaling by enabling longer sequences and more parallel samples.

TL;DR: We introduce Kinetics, which challenges the traditional test-time scaling (TTS) laws by adopting a practical efficiency perspective. It reveals that prior compute-optimal approaches overlook major key-value memory access bottlenecks in various TTS strategies. By jointly considering memory and compute, the Kinetics scaling law shows that it is more efficient to scale model size up to a threshold before investing more compute in test-time scaling. In addition, Kinetics promotes sparse attention to achieve even better scalability.

  Introduction

The existing compute-optimal test-time scaling law often favors smaller models with more test-time compute. We revisit this paradigm from a practical efficiency perspective and find that the effectiveness of small models is often overestimated because of overlooked key-value (KV) memory bottlenecks introduced by inference strategies such as Best-of-N sampling or long chains of thought (CoT).

To address this, we propose the Kinetics Scaling Law, which jointly considers both compute and memory costs. Unlike prior work, it shows that test-time compute is best allocated to scaling model size—up to a threshold (e.g., 14B parameters for Qwen3)—before increasing generation length. (See Figure 1)

Furthermore, we demonstrate that sparse attention unlocks new scaling opportunities by mitigating KV memory overhead, enabling longer generations and more parallel reasoning trials within the same budget. This leads to substantial gains in test-time accuracy and efficiency. Empirically, we show that sparse attention models consistently outperform their dense counterparts, with accuracy gains of over 60 points in low-cost regimes and over 5 points in high-cost regimes on AIME problem solving, including evaluations on state-of-the-art MoEs (see Figure 2). These results suggest that sparse attention is essential for realizing the full potential of test-time scaling because, unlike training, where parameter scaling saturates, test-time accuracy continues to improve with longer generation.

These insights form the foundation of Kinetics, a new perspective on scaling that aligns resource allocation more closely with real-world inference constraints.

  Rethinking Test-Time Scaling Law

 A Holistic Cost Model for TTS

As a first step, we revisit the cost model to understand the relative importance of compute and memory in TTS and how it can affect the choice of optimal model size and TTS configurations like chain-of-thought (CoT) length or number of trials.

Full Cost Model Derivation

We analyze test-time inference cost using a model that accounts for both compute and memory access, and define an equivalent cost, eFLOPs, that unifies the two in a single metric.

Definitions:

  • $P$: model parameter count
  • $N$: number of reasoning trials
  • $L_{\text{in}}$: input (prefix) length
  • $L_{\text{out}}$: output (generation) length
  • $D$: attention head dimension
  • $r$: GQA group size
  • $I$: arithmetic intensity of the hardware (ratio of peak FLOP throughput to memory bandwidth, in FLOPs per byte)

Computation Cost:

$$C_{\text{comp}} = C_{\text{param-compute}} + C_{\text{attn-compute}} = 2PNL_{\text{out}} + \left(2rNL_{\text{in}}DL_{\text{out}} + rNDL_{\text{out}}^2\right)$$

Memory Access Cost:

$$C_{\text{mem}} = C_{\text{param-mem}} + C_{\text{attn-mem}} = 2PL_{\text{out}} + \left(2L_{\text{in}}DL_{\text{out}} + NDL_{\text{out}}^2\right)$$

Here we assume the prompt KV cache is reused across all $N$ reasoning trials, which is why the prefix term in $C_{\text{attn-mem}}$ is not multiplied by $N$.

eFLOPs (Equivalent FLOPs)

eFLOPs scales memory cost by the arithmetic intensity of the hardware to unify compute and memory costs in a single metric.

$$\text{eFLOPs} = C_{\text{comp}} + I \times C_{\text{mem}}$$

where $I$ is the arithmetic intensity of the hardware (e.g., 562.5 FLOPs per byte for an NVIDIA B200).

Final Cost Model:

In real-world deployments, parameter memory traffic is amortized across large batches, rendering $C_{\text{param-mem}}$ insignificant.

$$C_{\text{TTS}} = 2NPL_{\text{out}} + 2rNL_{\text{in}}DL_{\text{out}} + rNDL_{\text{out}}^2 + 2IL_{\text{in}}DL_{\text{out}} + INDL_{\text{out}}^2$$
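To make these formulas concrete, here is a minimal Python sketch of the cost model above (a direct transcription of the equations; the example values at the end are purely illustrative and not tied to any specific model configuration):

```python
# Minimal sketch of the eFLOPs cost model defined above.
# Symbols follow the definitions in this section:
#   P: parameter count, N: reasoning trials, L_in / L_out: prefix / generation length,
#   D: attention head dimension, r: GQA group size, I: hardware arithmetic intensity.

def compute_cost(P, N, L_in, L_out, D, r):
    """C_comp = 2*P*N*L_out + (2*r*N*L_in*D*L_out + r*N*D*L_out**2)."""
    c_param = 2 * P * N * L_out
    c_attn = 2 * r * N * L_in * D * L_out + r * N * D * L_out ** 2
    return c_param + c_attn

def memory_cost(P, N, L_in, L_out, D):
    """C_mem = 2*P*L_out + (2*L_in*D*L_out + N*D*L_out**2).

    The prompt KV term (2*L_in*D*L_out) is reused across all N trials.
    """
    return 2 * P * L_out + 2 * L_in * D * L_out + N * D * L_out ** 2

def eflops(P, N, L_in, L_out, D, r, I=562.5):
    """eFLOPs = C_comp + I * C_mem."""
    return compute_cost(P, N, L_in, L_out, D, r) + I * memory_cost(P, N, L_in, L_out, D)

def c_tts(P, N, L_in, L_out, D, r, I=562.5):
    """Deployment cost: the parameter-memory term is dropped (amortized over large batches)."""
    return (2 * N * P * L_out
            + 2 * r * N * L_in * D * L_out + r * N * D * L_out ** 2
            + I * (2 * L_in * D * L_out + N * D * L_out ** 2))

# Illustrative values only, not a specific model configuration.
print(c_tts(P=8e9, N=8, L_in=1024, L_out=8192, D=4096, r=4))
```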

Key Insight: In long chains of thought (CoTs), attention-related costs dominate. We define the ratio of attention cost to parameter cost:

$$\Phi = \frac{2rL_{\text{in}}D + (rD + ID)\,L_{\text{out}}}{2P}$$

When $L_{\text{out}} \geq 4096$, $\Phi$ can reach 10–1000×, indicating that memory-bound attention dominates parameter-bound compute.
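This expression can be obtained by dividing the attention-related terms of $C_{\text{TTS}}$ by the parameter term $2NPL_{\text{out}}$ (a quick sketch of the algebra; the only extra step is dropping the amortized prefix-memory contribution $IL_{\text{in}}D/(NP)$, which vanishes as $N$ grows):

$$\Phi \;\approx\; \frac{2rNL_{\text{in}}DL_{\text{out}} + rNDL_{\text{out}}^2 + INDL_{\text{out}}^2}{2NPL_{\text{out}}} \;=\; \frac{2rL_{\text{in}}D + (rD + ID)\,L_{\text{out}}}{2P}$$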

Implication: As generation length increases, inference bottlenecks shift from the linear $L_{\text{out}}P$ term to the quadratic $L_{\text{out}}^2 D$ attention terms, underscoring the need for attention efficiency.


  Kinetics Scaling Law

We compare the TTS scaling behavior obtained from the existing FLOPs-based analysis and our eFLOPs-based analysis. For each cost model, we identify the optimal configuration for each individual task and average the accuracy scores under the same cost budget (FLOPs or eFLOPs); a sketch of this selection procedure appears after the takeaways below.

Figure 3: Comparison of AIME24 Pareto frontiers for long-CoT Qwen3 TTS under the FLOPs (a, b) and eFLOPs (c, d) cost models. Optimal model choices are highlighted in (a, c), while optimal CoT lengths are shown in (b, d). Similar observations hold for Best-of-N scaling.
Takeaways:
  1. A FLOPs-based cost model tends to favor smaller models with more test-time compute (e.g., more trials or longer generations), until their accuracy gains begin to plateau.
  2. In contrast, the eFLOPs-based model suggests that scaling up model size is more effective than increasing test-time compute, even in low-accuracy regimes. We find that only beyond an emergent model size (14B for the Qwen3 series) does longer CoT begin to outperform further parameter scaling. Therefore, even under limited budgets, it is often better to choose a larger model upfront.
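To make the selection procedure concrete, here is a minimal sketch of how accuracy at a given budget can be computed from a sweep of configurations; the data layout and names are hypothetical, not the evaluation code behind these figures:

```python
# Hypothetical sketch of budget-constrained selection used to trace a Pareto
# frontier: per task, take the best accuracy among configurations whose cost
# fits the budget, then average across tasks.
# `runs` maps task -> {config: (cost, accuracy)}; field names are illustrative.

def accuracy_at_budget(runs, budget):
    best_per_task = []
    for configs in runs.values():
        feasible = [acc for (cost, acc) in configs.values() if cost <= budget]
        best_per_task.append(max(feasible, default=0.0))
    return sum(best_per_task) / len(best_per_task)

def pareto_frontier(runs, budgets):
    """Sweep budgets to trace accuracy vs. cost under a given cost model (FLOPs or eFLOPs)."""
    return [(b, accuracy_at_budget(runs, b)) for b in budgets]
```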
Why do they diverge?

The “smallness” of smaller models is deceptive. As shown in Figure 4a, their KV cache footprint can be substantial, and proportionally even larger than that of much bigger models. This issue is exacerbated by the quadratic cost dependency on generation length, which the traditional FLOPs-based cost model fails to capture.

The plots below highlight this discrepancy through an iso-cost analysis showing how cost budgets influence the optimal combination of model size and CoT length. In Figure 4b, the Iso-FLOPs contours are nearly vertical—indicating that model size plays a dominant role. In contrast, the Iso-eFLOPs contours in Figure 4c appear more horizontal—revealing that optimal generation length is more responsive to cost budgets under the eFLOPs model.

Figure 4a: KV memory trend across model sizes

Figure 4b: Iso-FLOPs cost contours

Figure 4c: Iso-eFLOPs cost contours

KV memory scales sublinearly with model size: smaller models can have substantial memory footprints due to disproportionately large KV caches. For instance, Qwen3-0.6B needs 3.5GB of KV cache for 32K tokens, while Qwen3-32B uses only 8GB. Empirically, doubling the parameter count increases KV memory by only ~1.2×.

Quadratic cost penalizes long generations: under the $L_{\text{out}}^2 D$ cost terms, generation cost grows faster than model size. As a result, small models can no longer compensate for limited capacity by simply generating longer outputs, especially for complex reasoning tasks.
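As a quick sanity check on the sublinearity claim, the KV footprint can be estimated from the per-token cache size; the Qwen3 layer/head counts below are approximate values assumed here, with a bf16 KV cache:

```python
# Rough KV-cache estimate: 2 (K and V) * layers * KV heads * head dim * bytes/elem
# per token. Configs are approximate Qwen3 values (bf16 cache assumed); treat them
# as illustrative rather than authoritative.

def kv_bytes_per_token(layers, kv_heads, head_dim, dtype_bytes=2):
    return 2 * layers * kv_heads * head_dim * dtype_bytes

configs = {
    "Qwen3-0.6B": dict(layers=28, kv_heads=8, head_dim=128),
    "Qwen3-32B":  dict(layers=64, kv_heads=8, head_dim=128),
}

for name, cfg in configs.items():
    gib = kv_bytes_per_token(**cfg) * 32 * 1024 / 2**30
    print(f"{name}: ~{gib:.1f} GiB KV cache at 32K tokens")
# ~3.5 GiB vs ~8 GiB: ~53x more parameters, but only ~2.3x more KV memory.
```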

  Kinetics Sparse Test-Time Scaling

Our cost analysis hints at the importance of attention efficiency in TTS. Our sparse scaling law studies the effectiveness of attention sparsity and how it can further reshape the dense TTS Pareto frontier. To gauge the full potential of sparse attention, we first study oracle top-k attention.
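Here, oracle top-k attention refers to keeping, for each query, only the k keys with the highest exact attention scores. Below is a minimal single-head sketch of that idea (an illustrative implementation, not the authors' kernel):

```python
import torch

def oracle_topk_attention(q, K, V, k):
    """Single-query, single-head top-k attention sketch.

    q: (d,) current query; K: (L, d) cached keys; V: (L, d) cached values.
    Keeps only the k highest-scoring keys before the softmax.
    """
    scores = K @ q / K.shape[-1] ** 0.5        # (L,) exact attention logits
    k = min(k, scores.shape[0])
    top_vals, top_idx = torch.topk(scores, k)  # "oracle": selection uses full scores
    probs = torch.softmax(top_vals, dim=-1)    # softmax over the selected keys only
    return probs @ V[top_idx]                  # (d,) attention output
```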

 Sparse attention significantly enhances problem-solving performance

Figure 5a: Best-of-N scaling comparison between oracle top-k and dense attention

Figure 5b: Best-of-N using Qwen3-8B

Figure 5c: Best-of-N using Qwen3-32B

Figure 5d: Long-CoT scaling comparison between oracle top-k and dense attention

Figure 5e: Long-CoT using Qwen3-8B

Figure 5f: Long-CoT using Qwen3-32B


Figure 5: Figures 5a and 5d show that sparse attention significantly improves performance and scalability, enabling higher accuracy at lower compute budgets. Figures 5b, 5c, 5e, and 5f highlight that both the 8B and 32B models benefit from sparsity, with 50–60 percentage point gains in low-cost settings and consistent ~5 point improvements even in high-cost regimes. Notably, sparse models reach these performance levels at much lower cost; for context, $10^5$ Tera-eFLOPs corresponds to only about 22 seconds of B200 usage.


  Can sparse attention reshape the Kinetics scaling law?

To understand the importance of attention sparsity, we add the KV cache budget (i.e., the degree of KV sparsification) to the set of TTS variables. Our holistic analysis reveals that:

With increasing cost budget, it is more beneficial to scale generation tokens than KV cache budget.

In other words, the optimal KV cache budget grows very slowly with the cost budget. This allows us to generate substantially more tokens at the same cost, resulting in higher performance. Importantly, the original quadratic cost scaling with generation length is replaced by a linear dependency, significantly reducing the cost of longer generations.
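A rough way to see where the linearity comes from (our own simplification, assuming each decoded token attends to at most a fixed KV budget $B$ rather than the full, growing context): the quadratic attention terms in $C_{\text{TTS}}$ become

$$rNDL_{\text{out}}^2 + INDL_{\text{out}}^2 \;\longrightarrow\; rNBDL_{\text{out}} + INBDL_{\text{out}},$$

which grows only linearly in $L_{\text{out}}$ once $L_{\text{out}} > B$.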


Figure 6: Compared to dense scaling, small models (0.6B, 1.7B, 4B) are more effective with sparse attention; in other words, they occupy a larger portion of the Pareto frontier.


  Exploring tractable sparse attention algorithms

We explore two tractable alternatives to oracle top-k attention: block top-k attention and sliding window attention. Although sliding window attention is easier to implement and has zero search overhead, its performance is far worse. Block top-k attention demonstrates scaling comparable to oracle top-k attention, improving accuracy by 45 points in the low-cost regime and achieving equivalent accuracy while using 8.58× fewer resources than dense attention.
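Block top-k attention, roughly, partitions the KV cache into fixed-size blocks, scores each block by the similarity between the query and the block's mean key, and attends only within the top-scoring blocks. Below is a minimal single-head sketch of that idea (an illustration only; the block size and mean-key scoring are assumptions, not necessarily the exact variant evaluated here):

```python
import torch

def block_topk_attention(q, K, V, num_blocks, block_size=64):
    """Single-query sketch: select blocks by mean-key score, attend within them."""
    L, d = K.shape
    n = (L + block_size - 1) // block_size
    # Score each block by the dot product between the query and the block's mean key.
    block_scores = torch.stack([
        K[i * block_size:(i + 1) * block_size].mean(dim=0) @ q for i in range(n)
    ])
    keep = torch.topk(block_scores, min(num_blocks, n)).indices
    # Gather the token indices covered by the selected blocks.
    idx = torch.cat([torch.arange(i * block_size, min((i + 1) * block_size, L))
                     for i in keep.tolist()])
    scores = K[idx] @ q / d ** 0.5
    return torch.softmax(scores, dim=-1) @ V[idx]
```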

Figure 7a: Sparse algorithm comparison

Figure 7b: Block top-k Best-of-N scaling

Figure 7c: Block top-k Long-CoT scaling

Figure 7: (a) illustrates how block top-k attention scaling closely follows the oracle top-k scaling for the Qwen3-8B model on AIME24 tasks. (b) and (c) illustrate the Pareto frontiers of block top-k attention.

  Conclusion and Future Work

In this post, we introduced the Kinetics Scaling Law, emphasizing that attention cost, not parameter count, is the dominant factor at test time, fundamentally reshaping the traditional scaling landscape. We further demonstrated that sparse attention is crucial for achieving more effective and scalable test-time scaling. While our discussion focused on a simple sparse attention algorithm, block top-k attention, we anticipate that more advanced algorithms will approach or even outperform oracle top-k scaling. Moreover, sparse attention drastically reduces inference cost, enabling more reasoning trials and longer generations. This unlocks greater flexibility in configuring TTS strategies within a fixed resource budget. Overall, we believe the Kinetics Scaling Law serves as a guiding principle for end-to-end design in agent deployment, model architectures, LLM serving systems, and hardware.


BibTeX

@misc{sadhukhan2025kinetics,
      title={Kinetics: Rethinking Test-Time Scaling Laws},
      author={Ranajoy Sadhukhan and Zhuoming Chen and Haizhong Zheng and Yang Zhou and Emma Strubell and Beidi Chen},
      year={2025},
      eprint={2506.05333},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2506.05333},
}