Act Only When It Pays: Efficient Reinforcement Learning for LLM Reasoning via Selective Rollouts



1Carnegie Mellon University 2Lawrence Livermore National Laboratory
3University of Illinois Urbana-Champaign 4Meta AI

TL;DR: GRESO is a lightweight pre-rollout filter that skips uninformative prompts using reward dynamics, saving RL training time without hurting accuracy.

Introduction

Reinforcement learning algorithms such as PPO and GRPO have powered recent breakthroughs in LLM reasoning. Scaling the rollout stage to sample more prompts enables models to selectively use higher-quality data for training, which can stabilize RL training and improve model performance, but at the cost of significant computational overhead. In this paper, we first show that a substantial portion of this overhead can be avoided by skipping uninformative prompts before rollout. Our analysis of reward dynamics reveals a strong temporal consistency in prompt value: prompts that are uninformative in one epoch of training are likely to remain uninformative in near-future epochs. Based on these insights, we propose GRESO (GRPO with Efficient Selective Rollout), an online, lightweight pre-rollout filtering algorithm that predicts and skips uninformative prompts using reward training dynamics. By evaluating GRESO on a broad range of math reasoning benchmarks and models, including Qwen2.5-Math-1.5B/7B and DeepSeek-R1-Distill-Qwen-1.5B, we show that GRESO achieves up to 2.4× wall-clock time speedup in rollout and up to 2.0× speedup in total training time without accuracy degradation.

Rollout Scaling: Better Performance with More Rollouts

Scaling computational resources to sample responses for more prompts at the rollout stage can enhance reinforcement learning: it allows models to selectively utilize higher-quality data and thus reach better converged performance. However, scaling up rollouts introduces significant computational overhead, as rollout remains a major bottleneck in RL training. For instance, as shown in the figures on the right, filtering out uninformative examples [1] and resampling to fill the batch with effective data, also known as Dynamic Sampling (DS), can improve model performance, but it comes at the cost of significantly increased rollout overhead.
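To make the Dynamic Sampling procedure concrete, here is a minimal sketch of the idea. The rollout and reward functions (`generate_responses`, `compute_rewards`) and the `prompt_pool` are hypothetical stand-ins for an actual rollout engine, reward model, and dataset, not part of any specific implementation; real systems also sample prompts without replacement and roll them out in batches.

```python
import random

def has_learning_signal(rewards):
    """A prompt is informative only if its sampled responses receive
    different rewards; identical rewards give zero advantage in GRPO."""
    return len(set(rewards)) > 1

def dynamic_sampling_batch(prompt_pool, batch_size, group_size,
                           generate_responses, compute_rewards):
    """Keep rolling out freshly sampled prompts and discard zero-variance
    ones until the batch is filled with effective (non-zero-advantage) data."""
    batch = []
    while len(batch) < batch_size:
        prompt = random.choice(prompt_pool)
        responses = generate_responses(prompt, n=group_size)   # expensive rollout
        rewards = compute_rewards(prompt, responses)
        if has_learning_signal(rewards):
            batch.append((prompt, responses, rewards))
        # otherwise the rollout compute is wasted: the prompt is discarded
    return batch
```

The extra cost of DS comes precisely from the discarded branch: every filtered prompt still pays full generation cost before being thrown away.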

Motivated by this challenge, we aim to investigate the following research question in this work:

How can we perform more selective rollouts—focusing on sampling more valuable prompts—to make this scaling more efficient?
[Figure: rollout-scaling]

[1] In GRPO, many examples yield identical rewards across all responses, resulting in zero advantage and thus contributing no learning signal during training.
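For concreteness, a common form of the GRPO group-normalized advantage (some implementations add a small ε to the denominator for numerical stability) makes this explicit: when all responses to a prompt receive the same reward, every advantage in the group is zero, so the prompt contributes no gradient.

$$
\hat{A}_i \;=\; \frac{r_i - \mathrm{mean}(r_1,\dots,r_G)}{\mathrm{std}(r_1,\dots,r_G) + \epsilon},
\qquad
r_1 = \dots = r_G
\;\Longrightarrow\;
r_i - \mathrm{mean}(r_1,\dots,r_G) = 0
\;\Longrightarrow\;
\hat{A}_i = 0 \ \text{for all } i .
$$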

More Efficient Rollout Scaling with GRESO

[Figure: Scaling]

In this paper, we aim to design an efficient selective rollout strategy for LLM RL to make rollout scaling more efficient. We begin by analyzing the training dynamics of prompts and observe a strong temporal consistency across training epochs. In particular, prompts that yield zero advantage in one epoch are more likely to do so in future epochs as well. This temporal correlation suggests that historical reward dynamics can be leveraged to predict and preemptively skip zero-variance examples before rollout. Building on these observations, we propose GRESO (GRPO with Efficient Selective Rollout), an online, lightweight pre-rollout filtering algorithm that reduces rollout cost by selectively skipping prompts predicted to be zero-variance. Instead of performing filtering after rollout, GRESO estimates a skipping probability for each prompt prior to the rollout stage, based on its reward dynamics during training, significantly reducing prompt selection overhead and making rollout scaling more efficient.

As shown in the above figure, compared to the baseline method (Dynamic Sampling), our approach (GRESO) reduces rollout overhead by up to 2x while achieving comparable training performance, improving the efficiency of rollout scaling. (We train Qwen2.5-Math-1.5B/7B on the DAPO + MATH dataset and evaluate them on five math reasoning benchmarks: MATH500, AMC, Gaokao, Minerva, and Olympiad Bench.)

GRESO: GRPO with Efficient Selective Rollout

[Figure: pipeline]

Observation: Temporal Correlation of Prompts across Epochs

The training dynamics of individual examples typically exhibit strong temporal correlations across epochs. We hypothesize that zero-variance prompts in GRPO training similarly show such strong correlations in their training dynamics, creating opportunities to identify these prompts more efficiently prior to the rollout stage. To test this hypothesis, we conduct a study on the temporal correlation of zero-variance prompts in GRPO training. Specifically, we train Qwen2.5-Math-7B with GRPO and measure two probabilities: 1) P(Previous|Current): the probability that a prompt identified as zero-variance in the current epoch was also zero-variance in any previous epoch. 2) P(Current|Previous): the probability that a prompt identified as zero-variance in any previous epoch remains zero-variance in the current epoch. (A minimal sketch of this measurement is given after the observations below.)

The results shown in the above Figure (a) indicate that zero-variance prompts exhibit strong temporal correlations throughout training. We have two key observations:

  • 1) Prompts previously identified as zero-variance are likely to remain zero-variance:
    The P(Previous|Current) curve shows that the majority of zero-variance prompts in a given epoch (e.g., over 90%) were also identified as zero-variance in earlier epochs.
  • 2) Some zero-variance prompts can become effective again in future epochs:
    The P(Current|Previous) curve shows that approximately 20% of prompts previously labeled as zero-variance later become effective prompts that contribute to training again. This suggests that, rather than statically pruning zero-variance prompts, it is beneficial to keep some degree of exploration so that potentially valuable prompts can be recovered.
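The two conditional probabilities above can be tallied directly from per-epoch records of which prompts were zero-variance. Below is a minimal sketch of that bookkeeping, under the simplifying assumption that every prompt is rolled out in every epoch; `zero_variance_by_epoch` (a list of per-epoch sets of prompt IDs) is a hypothetical data structure, not part of the GRESO implementation.

```python
def temporal_correlation(zero_variance_by_epoch):
    """Given a list of sets (zero-variance prompt IDs for each epoch), compute
    P(Previous|Current) and P(Current|Previous) for every epoch t >= 1."""
    stats = []
    seen_before = set()  # prompts that were zero-variance in any earlier epoch
    for t, current in enumerate(zero_variance_by_epoch):
        if t > 0 and current and seen_before:
            overlap = current & seen_before
            p_prev_given_cur = len(overlap) / len(current)      # P(Previous|Current)
            p_cur_given_prev = len(overlap) / len(seen_before)  # P(Current|Previous)
            stats.append((t, p_prev_given_cur, p_cur_given_prev))
        seen_before |= current
    return stats
```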

GRESO with Probabilistic Pre-rollout Prompt Filtering

Based on the above observations, as shown in the above Figure (b), we propose a probabilistic filtering strategy based on reward dynamics observed during training. Rather than deterministically discarding prompts that previously yielded identical (i.e., zero-variance) rewards, we assign each prompt a probability of being filtered; this probability increases with the number of recent consecutive rollouts in which the prompt showed zero variance. Concretely, for each prompt, we track how many times in a row it has produced zero-variance responses in the most recent epochs. The more consecutive times this occurs, the more likely the prompt is to be skipped in the current rollout. However, we always retain a minimum exploration probability, ensuring that even frequently zero-variance prompts have a small chance of being re-sampled. This approach skips uninformative prompts to improve training efficiency, while occasionally revisiting them to maintain exploration.
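Below is a minimal sketch of such a probabilistic pre-rollout filter. The linear schedule and the constants `skip_per_hit` and `min_explore_prob` are illustrative assumptions, not GRESO's exact formulation or hyperparameters; what matters is that the skip probability grows with the consecutive zero-variance streak and is capped so that an exploration floor always remains.

```python
import random

class ProbabilisticPreRolloutFilter:
    """Sketch of probabilistic pre-rollout filtering in the spirit of GRESO."""

    def __init__(self, skip_per_hit=0.25, min_explore_prob=0.1):
        self.skip_per_hit = skip_per_hit          # skip-prob increment per consecutive zero-variance rollout
        self.min_explore_prob = min_explore_prob  # floor on the chance of re-sampling a prompt
        self.zero_variance_streak = {}            # prompt_id -> current streak length

    def skip_probability(self, prompt_id):
        streak = self.zero_variance_streak.get(prompt_id, 0)
        return min(1.0 - self.min_explore_prob, streak * self.skip_per_hit)

    def should_skip(self, prompt_id):
        """Decide BEFORE rollout whether to skip this prompt."""
        return random.random() < self.skip_probability(prompt_id)

    def update(self, prompt_id, rewards):
        """Update the streak after a prompt is actually rolled out."""
        if len(set(rewards)) == 1:   # identical rewards across the group -> zero variance
            self.zero_variance_streak[prompt_id] = \
                self.zero_variance_streak.get(prompt_id, 0) + 1
        else:
            self.zero_variance_streak[prompt_id] = 0
```

In this sketch, a skipped prompt keeps its streak, so its skip probability stays high until an exploratory rollout observes non-identical rewards and resets the counter.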

Comparable Performance with Fewer Rollouts

No performance drop with up to 3.35x fewer rollouts

[Table: rollout comparison]

To verify the effectiveness of GRESO, we present a comprehensive evaluation of GRESO and Dynamic Sampling (DS), which filters out zero-variance examples and resamples to fill the batch with effective data, across six math reasoning benchmarks and three model settings in the table above. The models are trained on either the DAPO + MATH dataset (DM) or the Open R1 subset (OR1). We report both the performance and the number of rollouts from the checkpoint that achieves the best average performance across the six benchmarks. Across all training settings, GRESO achieves accuracy comparable to DS while significantly reducing the number of rollout samples, achieving up to 3.35× fewer rollouts. For example, on Qwen2.5-Math-7B trained on the DM dataset, GRESO achieves average accuracy comparable to DS (57.5% vs. 57.8%) while reducing the number of rollouts from 13.1M to 6.3M. These results demonstrate that GRESO maintains performance while substantially lowering rollout cost. Similar improvements are observed across the other evaluation settings.

Up to 2.0x wall-clock time speed-up in training

To better understand the efficiency of our proposed method, we report a detailed end-to-end training time breakdown across stages: rollout, actor model update, and other overheads (e.g., reference model and advantage calculation). Qwen2.5-Math-1.5B is trained on 4×H100 GPUs, while the other two models are trained on 8×H100 GPUs. The table below compares the training time breakdown between GRESO and Dynamic Sampling for models trained on the DAPO + MATH dataset. For all three models, GRESO significantly reduces rollout time, achieving up to 2.4× speedup in rollout and 2.0× speedup in total training time compared to DS. For instance, on Qwen2.5-Math-7B, GRESO reduces rollout time from 155.9 hours to 65.5 hours, cutting overall training time from 178.0 to 88.3 hours.

[Table: training time breakdown (rollout-scaling)]

Case Study: Selection Dynamics

Selection dynamics of different prompts in GRESO: each row is a prompt, and each column is an epoch. We present a case study illustrating how GRESO selects or skips prompts over training epochs. We observe that very easy prompts tend to remain easy throughout training; although frequently skipped, GRESO still occasionally selects them to ensure a minimal level of exploration. For prompts of moderate difficulty, as the model becomes stronger over time, these prompts gradually become easier and are increasingly skipped. In contrast, some hard prompts become solvable (i.e., effective prompts) or even easy in later epochs. However, certain hard prompts remain unsolved throughout training.

[Figure: selection dynamics (rollout-scaling)]

BibTeX

@article{zheng2025greso,
    title={Act Only When It Pays: Efficient Reinforcement Learning for LLM Reasoning via Selective Rollouts},
    author={Zheng, Haizhong and Zhou, Yang and Bartoldson, Brian R. and Kailkhura, Bhavya and Lai, Fan and Zhao, Jiawei and Chen, Beidi},
    journal={arXiv preprint arXiv:2506.02177},
    year={2025}
}