
Kinetics: Rethinking Test-Time Scaling Laws



Carnegie Mellon University
*Indicates Equal Contribution

Kinetics Scaling Law

Figure 1: On AIME24, compute-optimal model choices can be up to 3× costlier than those guided by the Kinetics scaling law.

Sparse Scaling Advantage

Figure 2: Sparse attention unlocks stronger scaling by enabling longer sequences and more parallel samples.

TL;DR: We introduce Kinetics, which challenges the traditional test-time scaling (TTS) laws by adopting a practical efficiency perspective. It reveals that prior compute-optimal approaches overlook major key-value memory access bottlenecks in various TTS strategies. By jointly considering memory and compute, the Kinetics scaling law shows that it is more efficient to scale model size up to a threshold before investing more compute in test-time scaling. In addition, Kinetics promotes sparse attention to achieve even better scalability.

  Introduction

The existing compute-optimal test-time scaling law often favors smaller models with more test-time compute. We revisit this paradigm from a practical efficiency perspective and find that the effectiveness of small models is often overestimated because of overlooked key-value (KV) memory bottlenecks introduced by inference strategies such as Best-of-N sampling or long chains of thought (CoT).

To address this, we propose the Kinetics Scaling Law, which jointly considers both compute and memory costs. Unlike prior work, it shows that test-time compute is best allocated to scaling model size—up to a threshold (e.g., 14B parameters for Qwen3)—before increasing generation length. (See Figure 1)

Furthermore, we demonstrate that sparse attention unlocks new scaling opportunities by mitigating KV memory overhead, enabling longer generations and more parallel reasoning trials within the same budget. This leads to substantial gains in test-time accuracy and efficiency. Empirically, we show that sparse attention models consistently outperform their dense counterparts, with accuracy gains of over 60 points in low-cost regimes and over 5 points in high-cost regimes on AIME problem solving, including evaluations on state-of-the-art MoEs (see Figure 2). These results suggest that sparse attention is essential for realizing the full potential of test-time scaling because, unlike training, where parameter scaling saturates, test-time accuracy continues to improve with longer generation.

These insights form the foundation of Kinetics, a new perspective on scaling that aligns resource allocation more closely with real-world inference constraints.

  Rethinking Test-Time Scaling Law

 A Holistic Cost Model for TTS

As a first step, we revisit the cost model to understand the relative importance of compute and memory in TTS and how it can affect the choice of optimal model size and TTS configurations like chain-of-thought (CoT) length or number of trials.

Full Cost Model Derivation

We analyze test-time inference cost using a model that accounts for both compute and memory access, and define an equivalent cost, eFLOPs, that unifies the two in a single metric.

Definitions:

  • $P$: model parameter count
  • $N$: number of reasoning trials
  • $L_{\text{in}}$: input (prefix) length
  • $L_{\text{out}}$: output (generation) length
  • $D$: attention head dimension
  • $r$: GQA group size
  • $I$: arithmetic intensity of the hardware (ratio of peak FLOP throughput to memory bandwidth, in FLOPs per byte)

Computation Cost:

$$C_{\text{comp}} = C_{\text{param-compute}} + C_{\text{attn-compute}} = 2PNL_{\text{out}} + \left(2rNL_{\text{in}}DL_{\text{out}} + rNDL_{\text{out}}^2\right)$$

Memory Access Cost:

$$C_{\text{mem}} = C_{\text{param-mem}} + C_{\text{attn-mem}} = 2PL_{\text{out}} + \left(2L_{\text{in}}DL_{\text{out}} + NDL_{\text{out}}^2\right)$$

Here we assume the prompt KV cache is reused across all $N$ reasoning trials, which is why the prefix term in $C_{\text{attn-mem}}$ is not multiplied by $N$.

eFLOPs (Equivalent FLOPs)

eFLOPs scales memory cost by the arithmetic intensity of the hardware to unify compute and memory costs in a single metric.

$$\text{eFLOPs} = C_{\text{comp}} + I \times C_{\text{mem}}$$

where $I$ is the arithmetic intensity of the hardware (e.g., 562.5 FLOPs per byte for an NVIDIA B200).

Final Cost Model:

In real-world deployments, parameter memory traffic is amortized across large batches, rendering $C_{\text{param-mem}}$ insignificant.

$$C_{\text{TTS}} = 2NPL_{\text{out}} + 2rNL_{\text{in}}DL_{\text{out}} + rNDL_{\text{out}}^2 + 2IL_{\text{in}}DL_{\text{out}} + INDL_{\text{out}}^2$$
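To make these formulas concrete, here is a minimal Python sketch of the cost model above (a direct transcription of the equations; the example values at the end are purely illustrative and not tied to any specific model configuration):

```python
# Minimal sketch of the eFLOPs cost model defined above.
# Symbols follow the definitions in this section:
#   P: parameter count, N: reasoning trials, L_in / L_out: prefix / generation length,
#   D: attention head dimension, r: GQA group size, I: hardware arithmetic intensity.

def compute_cost(P, N, L_in, L_out, D, r):
    """C_comp = 2*P*N*L_out + (2*r*N*L_in*D*L_out + r*N*D*L_out**2)."""
    c_param = 2 * P * N * L_out
    c_attn = 2 * r * N * L_in * D * L_out + r * N * D * L_out ** 2
    return c_param + c_attn

def memory_cost(P, N, L_in, L_out, D):
    """C_mem = 2*P*L_out + (2*L_in*D*L_out + N*D*L_out**2).

    The prompt KV term (2*L_in*D*L_out) is reused across all N trials.
    """
    return 2 * P * L_out + 2 * L_in * D * L_out + N * D * L_out ** 2

def eflops(P, N, L_in, L_out, D, r, I=562.5):
    """eFLOPs = C_comp + I * C_mem."""
    return compute_cost(P, N, L_in, L_out, D, r) + I * memory_cost(P, N, L_in, L_out, D)

def c_tts(P, N, L_in, L_out, D, r, I=562.5):
    """Deployment cost: the parameter-memory term is dropped (amortized over large batches)."""
    return (2 * N * P * L_out
            + 2 * r * N * L_in * D * L_out + r * N * D * L_out ** 2
            + I * (2 * L_in * D * L_out + N * D * L_out ** 2))

# Illustrative values only, not a specific model configuration.
print(c_tts(P=8e9, N=8, L_in=1024, L_out=8192, D=4096, r=4))
```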

Key Insight: In long chains of thought (CoTs), attention-related costs dominate. We define the ratio of attention cost to parameter cost:

$$\Phi = \frac{2rL_{\text{in}}D + (rD + ID)\,L_{\text{out}}}{2P}$$

When $L_{\text{out}} \geq 4096$, $\Phi$ can reach 10–1000×, indicating that memory-bound attention dominates parameter-bound compute.
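This expression can be obtained by dividing the attention-related terms of $C_{\text{TTS}}$ by the parameter term $2NPL_{\text{out}}$ (a quick sketch of the algebra; the only extra step is dropping the amortized prefix-memory contribution $IL_{\text{in}}D/(NP)$, which vanishes as $N$ grows):

$$\Phi \;\approx\; \frac{2rNL_{\text{in}}DL_{\text{out}} + rNDL_{\text{out}}^2 + INDL_{\text{out}}^2}{2NPL_{\text{out}}} \;=\; \frac{2rL_{\text{in}}D + (rD + ID)\,L_{\text{out}}}{2P}$$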

Implication: As generation length increases, inference bottlenecks shift from the linear $L_{\text{out}}P$ term to the quadratic $L_{\text{out}}^2 D$ attention terms, underscoring the need for attention efficiency.


  Kinetics Scaling Law

We compare the TTS scaling behavior obtained from the existing FLOPs-based analysis and our eFLOPs-based analysis. For each cost model, we identify the optimal configuration for each individual task and average the accuracy scores under the same cost budget (FLOPs or eFLOPs); a sketch of this selection procedure appears after the takeaways below.

Figure 3: Comparison of AIME24 Pareto frontiers for long-CoT Qwen3 TTS under the FLOPs (a, b) and eFLOPs (c, d) cost models. Optimal model choices are highlighted in (a, c), while optimal CoT lengths are shown in (b, d). Similar observations hold for Best-of-N scaling.
Takeaways:
  1. A FLOPs-based cost model tends to favor smaller models with more test-time compute (e.g., more trials or longer generations), until their accuracy gains begin to plateau.
  2. In contrast, the eFLOPs-based model suggests that scaling up model size is more effective than increasing test-time compute, even in low-accuracy regimes. We find that only beyond an emergent model size (14B for the Qwen3 series) does longer CoT begin to outperform further parameter scaling. Therefore, even under limited budgets, it is often better to choose a larger model upfront.
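To make the selection procedure concrete, here is a minimal sketch of how accuracy at a given budget can be computed from a sweep of configurations; the data layout and names are hypothetical, not the evaluation code behind these figures:

```python
# Hypothetical sketch of budget-constrained selection used to trace a Pareto
# frontier: per task, take the best accuracy among configurations whose cost
# fits the budget, then average across tasks.
# `runs` maps task -> {config: (cost, accuracy)}; field names are illustrative.

def accuracy_at_budget(runs, budget):
    best_per_task = []
    for configs in runs.values():
        feasible = [acc for (cost, acc) in configs.values() if cost <= budget]
        best_per_task.append(max(feasible, default=0.0))
    return sum(best_per_task) / len(best_per_task)

def pareto_frontier(runs, budgets):
    """Sweep budgets to trace accuracy vs. cost under a given cost model (FLOPs or eFLOPs)."""
    return [(b, accuracy_at_budget(runs, b)) for b in budgets]
```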
Why do they diverge?

The “smallness” of smaller models is deceptive. As shown in Figure 4a, their KV cache footprint can be substantial, and proportionally even larger than that of much bigger models. This issue is exacerbated by the quadratic cost dependency on generation length, which the traditional FLOPs-based cost model fails to capture.

The plots below highlight this discrepancy through an iso-cost analysis showing how cost budgets influence the optimal combination of model size and CoT length. In Figure 4b, the Iso-FLOPs contours are nearly vertical—indicating that model size plays a dominant role. In contrast, the Iso-eFLOPs contours in Figure 4c appear more horizontal—revealing that optimal generation length is more responsive to cost budgets under the eFLOPs model.

Figure 4a: KV memory trend across model sizes

Figure 4b: Iso-FLOPs cost contours

Figure 4c: Iso-eFLOPs cost contours

KV memory scales sublinearly with model size: smaller models can have substantial memory footprints due to disproportionately large KV caches. For instance, Qwen3-0.6B needs 3.5GB of KV cache for 32K tokens, while Qwen3-32B uses only 8GB. Empirically, doubling the parameter count increases KV memory by only ~1.2×.

Quadratic cost penalizes long generations: under the $L_{\text{out}}^2 D$ cost terms, generation cost grows faster than model size. As a result, small models can no longer compensate for limited capacity by simply generating longer outputs, especially for complex reasoning tasks.
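As a quick sanity check on the sublinearity claim, the KV footprint can be estimated from the per-token cache size; the Qwen3 layer/head counts below are approximate values assumed here, with a bf16 KV cache:

```python
# Rough KV-cache estimate: 2 (K and V) * layers * KV heads * head dim * bytes/elem
# per token. Configs are approximate Qwen3 values (bf16 cache assumed); treat them
# as illustrative rather than authoritative.

def kv_bytes_per_token(layers, kv_heads, head_dim, dtype_bytes=2):
    return 2 * layers * kv_heads * head_dim * dtype_bytes

configs = {
    "Qwen3-0.6B": dict(layers=28, kv_heads=8, head_dim=128),
    "Qwen3-32B":  dict(layers=64, kv_heads=8, head_dim=128),
}

for name, cfg in configs.items():
    gib = kv_bytes_per_token(**cfg) * 32 * 1024 / 2**30
    print(f"{name}: ~{gib:.1f} GiB KV cache at 32K tokens")
# ~3.5 GiB vs ~8 GiB: ~53x more parameters, but only ~2.3x more KV memory.
```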

  Kinetics Sparse Test-Time Scaling

Our cost analysis hints at the importance of attention efficiency in TTS. Our sparse scaling law studies the effectiveness of attention sparsity and how it can further reshape the dense TTS Pareto frontier. To gauge the full potential of sparse attention, we first study oracle top-k attention.
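Here, oracle top-k attention refers to keeping, for each query, only the k keys with the highest exact attention scores. Below is a minimal single-head sketch of that idea (an illustrative implementation, not the authors' kernel):

```python
import torch

def oracle_topk_attention(q, K, V, k):
    """Single-query, single-head top-k attention sketch.

    q: (d,) current query; K: (L, d) cached keys; V: (L, d) cached values.
    Keeps only the k highest-scoring keys before the softmax.
    """
    scores = K @ q / K.shape[-1] ** 0.5        # (L,) exact attention logits
    k = min(k, scores.shape[0])
    top_vals, top_idx = torch.topk(scores, k)  # "oracle": selection uses full scores
    probs = torch.softmax(top_vals, dim=-1)    # softmax over the selected keys only
    return probs @ V[top_idx]                  # (d,) attention output
```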

 Sparse attention significantly enhances problem-solving performance

Figure 5a: Best-of-N scaling comparison between oracle top-k and dense attention

Figure 5b: Best-of-N using Qwen3-8B

Figure 5c: Best-of-N using Qwen3-32B

Figure 5d: Long-CoT scaling comparison between oracle top-k and dense attention

Figure 5e: Long-CoT using Qwen3-8B

Figure 5f: Long-CoT using Qwen3-32B


Figure 5: Figures 5a and 5d show that sparse attention significantly improves performance and scalability, enabling higher accuracy at lower compute budgets. Figures 5b, 5c, 5e, and 5f highlight that both the 8B and 32B models benefit from sparsity, with 50–60 percentage point gains in low-cost settings and consistent ~5 point improvements even in high-cost regimes. Notably, sparse models reach these performance levels at much lower cost; for context, $10^5$ Tera-eFLOPs corresponds to only about 22 seconds of B200 usage.


  Can sparse attention reshape the Kinetics scaling law?

To understand the importance of attention sparsity, we add the KV cache budget (i.e., the degree of KV sparsification) to the set of TTS variables. Our holistic analysis reveals that:

With increasing cost budget, it is more beneficial to scale generation tokens than KV cache budget.

In other words, the optimal KV cache budget grows very slowly with the cost budget. This allows us to generate substantially more tokens at the same cost, resulting in higher performance. Importantly, the original quadratic cost scaling with generation length is replaced by a linear dependency, significantly reducing the cost of longer generations.
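A rough way to see where the linearity comes from (our own simplification, assuming each decoded token attends to at most a fixed KV budget $B$ rather than the full, growing context): the quadratic attention terms in $C_{\text{TTS}}$ become

$$rNDL_{\text{out}}^2 + INDL_{\text{out}}^2 \;\longrightarrow\; rNBDL_{\text{out}} + INBDL_{\text{out}},$$

which grows only linearly in $L_{\text{out}}$ once $L_{\text{out}} > B$.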


Figure 6: Compared to dense scaling, small models (0.6B, 1.7B, 4B) are more effective with sparse attention; in other words, they occupy a larger portion of the Pareto frontier.


  Exploring tractable sparse attention algorithms

We explore two tractable alternatives to oracle top-k attention: block top-k attention and sliding window attention. Although sliding window attention is easier to implement and has zero search overhead, its performance is far worse. Block top-k attention demonstrates scaling comparable to oracle top-k attention, improving accuracy by 45 points in the low-cost regime and achieving equivalent accuracy while using 8.58× fewer resources than dense attention.
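Block top-k attention, roughly, partitions the KV cache into fixed-size blocks, scores each block by the similarity between the query and the block's mean key, and attends only within the top-scoring blocks. Below is a minimal single-head sketch of that idea (an illustration only; the block size and mean-key scoring are assumptions, not necessarily the exact variant evaluated here):

```python
import torch

def block_topk_attention(q, K, V, num_blocks, block_size=64):
    """Single-query sketch: select blocks by mean-key score, attend within them."""
    L, d = K.shape
    n = (L + block_size - 1) // block_size
    # Score each block by the dot product between the query and the block's mean key.
    block_scores = torch.stack([
        K[i * block_size:(i + 1) * block_size].mean(dim=0) @ q for i in range(n)
    ])
    keep = torch.topk(block_scores, min(num_blocks, n)).indices
    # Gather the token indices covered by the selected blocks.
    idx = torch.cat([torch.arange(i * block_size, min((i + 1) * block_size, L))
                     for i in keep.tolist()])
    scores = K[idx] @ q / d ** 0.5
    return torch.softmax(scores, dim=-1) @ V[idx]
```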

Figure 7a: Sparse algorithm comparison

Figure 7b: Block top-k Best-of-N scaling

Figure 7c: Block top-k Long-CoT scaling

Figure 7: (a) illustrates how block top-k attention scaling closely follows the oracle top-k scaling for the Qwen3-8B model on AIME24 tasks. (b) and (c) illustrate the Pareto frontiers of block top-k attention.

  Conclusion and Future Work

In this post, we introduced the Kinetics Scaling Law, emphasizing that attention cost, not parameter count, is the dominant factor at test time, fundamentally reshaping the traditional scaling landscape. We further demonstrated that sparse attention is crucial for achieving more effective and scalable test-time scaling. While our discussion focused on a simple sparse attention algorithm, block top-k attention, we anticipate that more advanced algorithms will approach or even outperform oracle top-k scaling. Moreover, sparse attention drastically reduces inference cost, enabling more reasoning trials and longer generations. This unlocks greater flexibility in configuring TTS strategies within a fixed resource budget. Overall, we believe the Kinetics Scaling Law serves as a guiding principle for end-to-end design in agent deployment, model architectures, LLM serving systems, and hardware.


BibTeX

@misc{sadhukhan2025kinetics,
      title={Kinetics: Rethinking Test-Time Scaling Laws},
      author={Ranajoy Sadhukhan and Zhuoming Chen and Haizhong Zheng and Yang Zhou and Emma Strubell and Beidi Chen},
      year={2025},
      eprint={2506.05333},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2506.05333},
}