Prosperity before Collapse: How Far Can Off-Policy RL Reach with Stale Data on LLMs?



1Carnegie Mellon University 2Meta AI

TL;DR: Our work shows that stale data can be as informative as on-policy data if exploited properly. We introduce M2PO (Second-Moment Trust Policy Optimization), which significantly reduces the fraction of clipped tokens under high staleness while maintaining stable optimization. Extensive evaluation across six model scales (1.7B–32B) shows that M2PO delivers stable off-policy training even with data stale by at least 256 model updates and matches on-policy performance.

Introduction

Reinforcement learning has been central to recent advances in large language model reasoning, but most algorithms rely on on-policy training that demands fresh rollouts at every update, limiting efficiency and scalability. Asynchronous RL systems alleviate this by decoupling rollout generation from training, yet their effectiveness hinges on tolerating large staleness in rollout data, a setting where existing methods either degrade in performance or collapse. We revisit this challenge and uncover a prosperity-before-collapse phenomenon: stale data can be as informative as on-policy data if exploited properly. Building on this insight, we introduce M2PO (Second-Moment Trust Policy Optimization), which constrains the second moment of importance weights to suppress only extreme outliers while preserving informative updates. Notably, M2PO sharply reduces the fraction of clipped tokens under high staleness (from 1.22% to 0.006% over training), precisely masking high-variance tokens while maintaining stable optimization. Extensive evaluation across six model scales (1.7B–32B) and eight reasoning benchmarks shows that M2PO delivers stable off-policy training even with data stale by at least 256 model updates and matches on-policy performance.

[Figure 1]

Figure 1 Comparison of on-policy GRPO and off-policy training under a staleness of 256 model updates on Qwen-2.5-32B. Left: Standard GRPO suffers from degradation with stale rollouts, while removing the trust region (GRPO no TR) reveals a clear prosperity-before-collapse phenomenon. In contrast, M2PO achieves stable training and matches on-policy performance even under high staleness. Right: Token clipping ratio comparison shows that M2PO dramatically reduces clipping events compared to GRPO, while avoiding training collapse.

Why is tolerance to staleness in RL important? Most RL algorithms for LLMs rely on an on-policy setup: the model must constantly generate fresh (or only slightly stale) examples to learn from. This makes training stable and reliable, but also very costly, as each update requires waiting for new rollouts to finish. To get around this inefficiency, researchers have been experimenting with asynchronous RL systems such as AREAL and SLIME, in which rollout generation and model training happen independently, often spread across large computing clusters. Such approaches improve resource utilization and enable training to scale more efficiently across large and heterogeneous clusters, but their effectiveness fundamentally relies on the ability of RL algorithms to tolerate rollout staleness without sacrificing stability or performance.

Study Staleness with Stale-k Training. To study the effect of stale data in reinforcement learning for large language models, we introduce Stale-$k$ RL training, where the model is updated using data generated $k$ updates earlier. In our setup, each training step performs four model updates, following configurations commonly adopted in recent work. Consequently, even stale-0 ($s=0$) training involves data with a staleness of 0–3 updates, while stale-256 ($s=256$) corresponds to staleness of 256–259 updates. During the first $k$ updates, when no stale model is yet available, the policy is trained on rollouts from the original base model. After this initialization phase, all training data comes from stale models, enabling a controlled study of how staleness influences the dynamics and effectiveness of RL training.
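To make the stale-$k$ schedule concrete, the sketch below spells out the update loop. The interfaces (generate_rollouts, update) are hypothetical placeholders, so treat this as an illustration of the protocol described above rather than the training code used in the paper.

import copy
from collections import deque

def split_into_minibatches(rollouts, n):
    """Split a list of rollouts into n roughly equal minibatches."""
    size = max(1, len(rollouts) // n)
    return [rollouts[i * size:(i + 1) * size] for i in range(n)]

def stale_k_training(policy, k, total_updates, updates_per_step=4):
    """Sketch of stale-k RL training: rollouts consumed at update t come from
    the policy snapshot taken k updates earlier; before k updates exist,
    rollouts come from the original base model."""
    snapshots = deque([(0, copy.deepcopy(policy))])  # (update index, frozen copy)
    update = 0
    while update < total_updates:
        # Use the newest snapshot that is at least k updates behind the current policy.
        target = max(0, update - k)
        while len(snapshots) > 1 and snapshots[1][0] <= target:
            snapshots.popleft()
        behavior_policy = snapshots[0][1]

        # One training step = `updates_per_step` gradient updates on rollouts
        # produced by the (stale) behavior policy.
        rollouts = behavior_policy.generate_rollouts()       # hypothetical API
        for minibatch in split_into_minibatches(rollouts, updates_per_step):
            policy.update(minibatch)                         # hypothetical API
            update += 1

        # Snapshot for later reuse (a real system would store rollouts, not full copies).
        snapshots.append((update, copy.deepcopy(policy)))

With updates_per_step = 4, the rollouts consumed within a single step span staleness $k$ to $k+3$, matching the stale-0 (0–3) and stale-256 (256–259) ranges described above.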

[Figure: GRPO training under varying staleness levels on Qwen2.5-Math-7B]

As shown in the figure above, we train Qwen2.5-Math-7B with GRPO under varying staleness levels and report test accuracy across eight math reasoning benchmarks. The results reveal a clear trend: as staleness increases, convergence slows and the model settles at lower accuracy, whereas low-staleness training reaches higher accuracy faster.

Observation: Prosperity before Collapse

[Figure: off-policy training with and without a trust region under stale data]

Prosperity before collapse: training without a trust region. To disentangle whether the performance drop stems from stale data generated by highly shifted old policies or from bias introduced by the training algorithm, we remove the trust region entirely, eliminating the bias introduced by the training algorithm. Surprisingly, we observe a distinct prosperity-before-collapse phenomenon. As shown in Figure 2 and the figure above, although training without a trust region eventually collapses, it achieves substantially better performance prior to collapse. In fact, under stale data ($s=256$), the no-clipping setting outperforms clipped training, sometimes even matching on-policy baselines.

[Figure: token clipping ratio under staleness (left) and average token entropy vs. $|r-1|$ (right)]

Pivotal token masking by $\epsilon$-clipping when training with stale data. As also discussed in recent work, $\epsilon$-clipping may inadvertently mask important tokens, preventing them from contributing useful training signals. We extend this observation to the stale-data setting and show that the problem becomes significantly more severe, since larger staleness induces a greater mismatch between the behavior and target policies. As illustrated in the figure above, the clipping ratio increases sharply under large staleness ($s=256$), while remaining much lower in the on-policy baseline.
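For reference, the snippet below shows one way to measure the token clipping ratio for a PPO/GRPO-style surrogate: a token stops contributing gradient once its ratio leaves $[1-\epsilon, 1+\epsilon]$ on the side favored by its advantage. The tensors are synthetic and the value of $\epsilon$ is a typical default, so this is only an illustration of the quantity being plotted, not the paper's measurement code.

import torch

def clipped_token_ratio(logp_new, logp_behavior, advantages, eps=0.2):
    """Fraction of tokens whose gradient is zeroed by epsilon-clipping:
    r > 1 + eps with positive advantage, or r < 1 - eps with negative advantage."""
    r = torch.exp(logp_new - logp_behavior)  # per-token importance ratio
    clipped = ((r > 1.0 + eps) & (advantages > 0)) | ((r < 1.0 - eps) & (advantages < 0))
    return clipped.float().mean().item()

# Toy usage with synthetic log-probabilities: a larger gap between the behavior
# and current policies produces far more clipped tokens.
torch.manual_seed(0)
logp_behavior = torch.randn(4096) * 0.5 - 2.0
adv = torch.randn(4096)
logp_near = logp_behavior + 0.05 * torch.randn(4096)   # near on-policy
logp_stale = logp_behavior + 0.80 * torch.randn(4096)  # large policy mismatch
print(clipped_token_ratio(logp_near, logp_behavior, adv))
print(clipped_token_ratio(logp_stale, logp_behavior, adv))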

To better understand this phenomenon, we conduct a quantitative analysis on 90 million training tokens collected during Qwen2.5-Math-7B training with staleness $s=256$. Specifically, we gather all training tokens generated between 800 and 1200 model updates, ensuring the model is already in a stable training phase but has not yet converged. The figure above (right) shows a clear trend: as $|r-1|$ increases, the average token entropy also rises. This indicates that $\epsilon$-clipping disproportionately prunes high-entropy tokens, which are typically the most informative for model improvement. Consequently, clipping under stale data leads to degraded performance. This observation reveals a dilemma: while high-entropy tokens are crucial for learning progress, they also introduce instability in the off-policy setting, which motivates our key research question:

Can a more accurate and adaptive trust region strategy preserve the benefits of stale data while ensuring stable training?

M2PO: Second-Moment Trust Policy Optimization

The main source of instability in off-policy RL lies in the distributional mismatch between the behavior policy that generates training data and the current policy being optimized. As the divergence between these two distributions grows, importance sampling corrections produce high-variance gradient estimates, leading to noisy and unreliable updates. Our motivation is therefore to constrain the distributional gap between the behavior policy and the current policy at the batch level, directly coupling the constraint with model updates while avoiding over-constraining individual token-level variation.

To achieve this goal, we propose the M2 metric to measure the distributional gap between the behavior policy and the current policy:

[Equation: the M2 metric]
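The metric itself appears as an equation image above; as a rough reconstruction consistent with the properties discussed next, one natural form is the batch-level second moment of per-token log importance ratios. Treat the following as our reading of the description rather than the paper's exact formula:

$$
M_2 = \frac{1}{|\mathcal{B}|} \sum_{i \in \mathcal{B}} \left(\log r_i\right)^2,
\qquad
r_i = \frac{\pi_\theta\left(y_i \mid x, y_{<i}\right)}{\pi_{\text{behavior}}\left(y_i \mid x, y_{<i}\right)}.
$$

Under this form, $M_2$ equals the variance of $\log r_i$ plus its squared mean, so it tracks both the average shift between the two policies and the spread of the importance weights.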

There are two key advantages of using the M2 metric as a trust-region constraint: First, each per-token estimate is always non-negative, so the constraint can be reliably applied even when $r>1$. Second, while the batch KL only measures the mean shift between policies, M2 also reflects the variance of the importance weights. This makes it more sensitive to outliers and noisy tokens with extreme ratios $r_i$.
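As a toy illustration under the assumed form above: if a batch contains two tokens with $\log r_1 = +1$ and $\log r_2 = -1$, the batch mean log-ratio is $0$ and a KL-style estimate would suggest almost no policy shift, whereas $M_2 = \tfrac{1}{2}\left(1^2 + (-1)^2\right) = 1$ correctly flags a large mismatch driven by high-variance importance weights.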

To maintain training stability, M2PO applies a masking strategy that selectively excludes tokens until the batch-level M2 of the remaining tokens falls below a predefined threshold $\tau_{M_2}$¹. Finally, with the resulting mask $\boldsymbol{M}$, we update the policy by maximizing the following objective:


[Equation: the M2PO training objective with token mask $\boldsymbol{M}$]

¹ Importantly, we observe that $\tau_{M_2}$ is not a sensitive hyperparameter. Across all our experiments, we consistently set $\tau_{M_2} = 0.04$, and this single setting proved effective for stabilizing training in every scenario.
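To make the masking procedure concrete, here is a minimal PyTorch sketch. It assumes the M2 metric is the batch mean of squared per-token log importance ratios (the reconstruction above) and that tokens with the largest contribution are removed first until the batch-level value drops below $\tau_{M_2}$; both are our reading of the description, not the released implementation. The resulting mask is then applied to a plain importance-weighted surrogate.

import torch

def m2po_mask(logp_new, logp_behavior, tau_m2=0.04):
    """Mask tokens until the batch-level second moment of log importance ratios
    (assumed M2 definition) drops below tau_m2. Tokens with the largest (log r)^2
    are removed first, which is an assumption about the selection order."""
    sq = (logp_new - logp_behavior).pow(2)            # per-token (log r)^2
    order = torch.argsort(sq, descending=True)        # most extreme tokens first
    mask = torch.ones_like(sq)
    kept_sum, kept_count = sq.sum(), sq.numel()
    for idx in order:
        if kept_sum / kept_count <= tau_m2:
            break
        mask[idx] = 0.0                               # drop the most extreme remaining token
        kept_sum -= sq[idx]
        kept_count -= 1
    return mask

def m2po_surrogate(logp_new, logp_behavior, advantages, tau_m2=0.04):
    """Masked importance-weighted surrogate (sketch of the M2PO objective);
    normalizing over kept tokens is one plausible choice, not a paper detail."""
    mask = m2po_mask(logp_new.detach(), logp_behavior, tau_m2)
    r = torch.exp(logp_new - logp_behavior)
    return (mask * r * advantages).sum() / mask.sum().clamp(min=1.0)

In a full GRPO-style pipeline the advantages would be group-normalized rewards; the sketch isolates only the trust-region mechanism.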

Stable Off-Policy Training without Accuracy Drop

[Table 1: math reasoning accuracy across eight benchmarks for models from four families (1.7B–32B) under on-policy and off-policy training]

Prosperity without collapse: Stable off-policy training without performance degradation using M2PO. To verify the effectiveness of M2PO, Table 1 presents a comprehensive comparison of math reasoning performance across eight benchmarks using models from four different families and scales, ranging from 1.7B to 32B parameters. We evaluate multiple reinforcement learning methods under both on-policy and off-policy settings, including GRPO, GSPO, and our proposed M2PO. The results show that while both GRPO and GSPO often suffer significant performance drops under large staleness, M2PO consistently achieves accuracy comparable to the on-policy baseline in all training settings. Surprisingly, in some model settings M2PO trained with $s=256$ even outperforms the on-policy baseline: for instance, on the Qwen3-Base-1.7B model, M2PO with $s=256$ (36.6%) surpasses on-policy GRPO with $s=0$ (33.0%).

BibTeX

@article{zheng2025m2po,
    title={Prosperity before Collapse: How Far Can Off-Policy RL Reach with Stale Data on LLMs?},
    author={Zheng, Haizhong and Zhao, Jiawei and Chen, Beidi},
    journal={arXiv preprint arXiv:2510.01161},
    year={2025}
}