Jackpot: Optimal Budgeted Rejection Sampling for Extreme Actor-Policy Mismatch Reinforcement Learning

Zhuoming Chen* Hongyi Liu* Yang Zhou* Haizhong Zheng Beidi Chen

Carnegie Mellon University, * for equal contribution

Introduction

Reinforcement learning for LLMs is expensive, with rollout generation often dominating runtime at around 80% [Seer]. Building on prior work that explores ways to lower rollout costs, Jackpot pushes the actor-policy mismatch scenario to the extreme and asks the following question:
Is it possible to perform rollouts using a completely different model from the one we ultimately want to train?
The potential efficiency benefit of completely decoupling the rollout and the policy is obvious, and a smaller variant from the same model family can usually achieve non-zero rewards on challenging datasets. However, unlike past actor-policy mismatch scenarios (large staleness, KV quantization, etc.), the KL divergence between an actor and a policy from two different models is orders of magnitude larger. To attain our goal, we propose Jackpot, which uses Optimal Budgeted Rejection Sampling (OBRS) to actively modify the rollout distribution and reduce the distribution mismatch. We show that Jackpot reduces KL divergence with provable guarantees and empirically achieves significant stability improvements in the challenging setting of Qwen3-1.7B-Base rollouts with Qwen3-8B-Base updates (up to 300 steps of stable training on the DeepScaleR dataset).

General Description of actor-policy mismatch and Jackpot Performance
Left: general description of the actor-policy mismatch scenario, to which Jackpot is generally applicable. Right: illustration of Jackpot's performance in the two-model joint training scenario, achieving significantly more stable training than TIS.

Optimal Budgeted Rejection Sampling (OBRS)

Prior distribution mismatch correction in RL. A mismatch between the actor (i.e., the rollout model) and the policy (i.e., the trained model) is a common problem that has long been studied, e.g., by Espeholt et al. (2018). To alleviate the actor-policy distribution gap, prior methods leverage importance sampling to approximate the true PPO objective:

\[ \mathcal{L}^{\text{PPO}}(\theta) = \mathbb{E}_{x \sim p_{\text{inf}}}\Big[ \text{SG}\!\Big( \underbrace{\boldsymbol{F}\!\Big(\tfrac{p_{\text{ref}}(x)}{p_{\text{inf}}(x)}\Big)}_{\text{adjustment}} \Big) \min\Big( r_\theta(x)\,\hat{A}(x), \operatorname{clip}\!\big(r_\theta(x), 1-\epsilon, 1+\epsilon\big)\,\hat{A}(x) \Big) \Big]. \]

\( \text{SG} \) means stop gradients. The choice of adjustment function \( \boldsymbol{F} \) is one of the core focuses of prior work. Areal uses \( \boldsymbol{F}(x)=x \); Flash-RL and Llama-RL use \( \boldsymbol{F}(x)=\min(x,C) \) (i.e., truncated importance sampling, TIS), where \( C \) is a hyper-parameter; IceProp uses a bi-directional truncation:

\[ \boldsymbol{F}(x)= \begin{cases} x, & \text{if } x \in [\alpha,\beta],\\[6pt] 0, & \text{otherwise}. \end{cases} \]
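For concreteness, these adjustment choices can be written in a few lines of numpy. This is an illustrative sketch of ours; the hyper-parameter values below (the cap C, the truncation interval [α, β]) are arbitrary defaults, not the settings used by the cited works.

```python
import numpy as np

# ratio = p_ref(x) / p_inf(x), computed per sampled token.
def f_identity(ratio):
    """Plain importance sampling (Areal): F(x) = x."""
    return ratio

def f_tis(ratio, C=2.0):
    """Truncated importance sampling (Flash-RL, Llama-RL): F(x) = min(x, C)."""
    return np.minimum(ratio, C)

def f_bidirectional(ratio, alpha=0.5, beta=2.0):
    """Bi-directional truncation (IceProp): keep the ratio inside [alpha, beta], zero it otherwise."""
    return np.where((ratio >= alpha) & (ratio <= beta), ratio, 0.0)

ratios = np.array([0.1, 0.8, 1.2, 5.0])
print(f_identity(ratios), f_tis(ratios), f_bidirectional(ratios))
```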

From a systems perspective, FP32 LM heads and deterministic LLM inference are implemented to mitigate numerical issues in serving systems during rollout.

However, importance-sampling-based methods are vulnerable to extreme actor-policy mismatch. Once the actor drifts too far, many tokens that the actor samples with high probability have very low probability under the policy, since \( p_{\text{inf}} > p_{\text{target}} \) for those tokens. These actor trajectories are effectively treated as low-likelihood samples by the policy, causing TIS to train on tokens the policy would never select at inference and creating a widening train-inference mismatch. This motivates us to look at distribution alignment methods that directly modify \( p_{\text{inf}} \).

Rejection Sampling (RS).
Rejection Sampling (RS) allows us to directly use the policy distribution \( p_{\text{target}} \) to affect the rollout distribution \( p_{\text{inf}} \).

Definition: Rejection Sampling (RS)

RS stochastically rejects tokens in trajectories sampled with \(p_{\text{inf}}\) based on the difference between two distributions. In standard rejection sampling, a token \(i\) proposed from \(\boldsymbol{q}\) is accepted with probability \( \frac{p_i}{\lambda q_i} \), where the constant \(\lambda\) must satisfy \( \lambda \ge \max_i \frac{p_i}{q_i} \).

Hence \( \tilde{q}_i \propto q_i \cdot \frac{p_i}{\lambda q_i} \propto p_i \), so \( \tilde{q}_i = p_i \) after normalization: rejection sampling transforms samples from \(\boldsymbol{q}\) into exact samples from the target distribution \(\boldsymbol{p}\).

\[ \text{Accept}(i) = \frac{p_i}{\lambda q_i}, \qquad \lambda \ge \max_i \frac{p_i}{q_i} \] \[ \tilde{q}_i \propto q_i \cdot \frac{p_i}{\lambda q_i} \;\Rightarrow\; \tilde{q}_i \propto p_i \;\Rightarrow\; \tilde{q}_i = p_i \;(\text{normalized}) \]
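For intuition, here is a minimal numpy sketch of exact rejection sampling over a categorical distribution (the toy distributions below are our own, chosen only for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

p = np.array([0.5, 0.3, 0.15, 0.05])    # target distribution
q = np.array([0.25, 0.25, 0.25, 0.25])  # proposal distribution

lam = np.max(p / q)  # dominating constant: lambda >= max_i p_i / q_i

def sample_exact_rs(n):
    """Draw n samples from p by proposing from q and accepting with prob p_i / (lambda * q_i)."""
    out = []
    while len(out) < n:
        i = rng.choice(len(q), p=q)
        if rng.random() < p[i] / (lam * q[i]):
            out.append(i)
    return np.array(out)

samples = sample_exact_rs(10000)
print(np.bincount(samples, minlength=len(p)) / len(samples))  # empirical frequencies ~= p
```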

The problem with RS is that the LLM output distribution is high-dimensional, with vocabularies usually exceeding 30k tokens. Realistically, naive RS requires \( \lambda \) to take very large values, effectively rejecting all tokens from rollouts.
Instead, we turn to Optimal Budgeted Rejection Sampling [OBRS]. Rather than forcing the modified rollout distribution to be identical to the target distribution, we ask: given a budget of how many tokens we are willing to reject, how do we select the tokens to reject so as to minimize the KL divergence between the modified rollout distribution and the target distribution?

Definition: Optimal Budgeted Rejection Sampling (OBRS)

Instead of using the dominating constant \( \lambda = \max_i \tfrac{p_i}{q_i} \), OBRS selects a smaller user-specified parameter \( \lambda > 0 \) that reflects the desired rejection budget. For a proposed token \( i \sim \boldsymbol{q} \), OBRS accepts it with probability

\[ a_i = \min\!\left(1,\; \frac{p_i}{\lambda q_i}\right). \]

The resulting post-rejection distribution is therefore

\[ \tilde{q}_i \propto q_i \cdot a_i . \]
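In code, the OBRS update to the proposal distribution is only a few lines. This is a minimal sketch on toy distributions of ours; here \( \lambda \) is simply picked by hand rather than derived from a budget:

```python
import numpy as np

p = np.array([0.5, 0.3, 0.15, 0.05])    # target distribution
q = np.array([0.25, 0.25, 0.25, 0.25])  # proposal distribution
lam = 1.2                               # budget parameter, smaller than max_i p_i/q_i = 2.0

a = np.minimum(1.0, p / (lam * q))      # OBRS acceptance probabilities
accept_rate = np.sum(q * a)             # expected fraction of proposals kept
q_tilde = q * a / accept_rate           # post-rejection distribution

def kl(p, q):
    return np.sum(p * np.log(p / q))

print(accept_rate)                      # < 1: some tokens are rejected
print(kl(p, q), kl(p, q_tilde))         # KL to the target shrinks after rejection
```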

Next, we will look at the numerical simulation of OBRS and show theoretical guarantees of OBRS.

KL-Divergence Reduction via Numerical Simulation and Theoretical Guarantees

Numerical Simulation of OBRS
We conducted numerical experiments in (a) and (b), where we simulate the LLM output distribution with randomly generated Dirichlet distributions with controllable noise to attain different levels of KL divergence. (a) plots the simulated OBRS acceptance rate across different pairs of actor-policy distributions. While the overall trend shows the acceptance rate slowly decreasing as the distributions move further apart, it remains high (\(>90\%\)) throughout the spectrum. For reference, we also mark the actor-policy gaps of several commonly seen off-policy settings. In (b), we show that OBRS significantly shrinks the KL divergence between the target distribution and the applied distribution in our simulations, sometimes by an order of magnitude. In (c), our proposed method, Jackpot (yellow), maintains a small KL divergence between the actor and policy model probability distributions, whereas both no-alignment and TIS see the KL divergence rise explosively as training continues.
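A simplified version of this simulation can be reproduced in a few lines of numpy. The Dirichlet concentration, vocabulary size, noise levels, and \( \lambda \) below are illustrative choices of ours, not the exact settings used in the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
V, lam = 1000, 1.1  # toy vocabulary size and OBRS budget parameter

def kl(p, q):
    return np.sum(p * np.log(p / q))

for noise in [0.01, 0.1, 0.5]:               # larger noise -> larger actor-policy gap
    p = rng.dirichlet(np.ones(V))            # simulated "policy" distribution
    logits = np.log(p) + noise * rng.standard_normal(V)
    q = np.exp(logits) / np.exp(logits).sum()  # perturbed "actor" distribution

    a = np.minimum(1.0, p / (lam * q))       # OBRS acceptance probabilities
    q_tilde = q * a / np.sum(q * a)          # post-rejection distribution
    print(f"noise={noise}: accept={np.sum(q * a):.3f}, "
          f"KL before={kl(p, q):.4f}, KL after={kl(p, q_tilde):.4f}")
```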
In addition to the numerical simulation, OBRS provides the following strong theoretical guarantees.

Theorem: OBRS Improves Distribution Alignment (KL Divergence \( \downarrow \))

Let \( \boldsymbol{p} \) be the target distribution and \( \boldsymbol{q} \) be the proposal distribution. For any \( \lambda > 0 \), define the OBRS acceptance rule

\[ a_i = \min\!\left(1, \frac{p_i}{\lambda q_i}\right), \qquad \tilde{q}_i \propto q_i a_i . \]

Then the post-rejection distribution \( \tilde{\boldsymbol{q}} \) is strictly closer to \( \boldsymbol{p} \) than the original proposal \( \boldsymbol{q} \) in the sense that

\[ D_{\mathrm{KL}}(\boldsymbol{p} \,\|\, \tilde{\boldsymbol{q}}) \;\le\; D_{\mathrm{KL}}(\boldsymbol{p} \,\|\, \boldsymbol{q}), \] \[ \text{whenever } \lambda < \max_i \frac{p_i}{q_i}. \]

In other words, OBRS always moves the proposal distribution toward the target distribution under any nontrivial rejection budget.

Theorem: Optimality of OBRS under a Fixed Acceptance Budget

For any desired average acceptance rate \( \bar{a} \in (0,1] \), there exists a unique scaling factor \( \lambda > 0 \) such that the OBRS acceptance rule

\[ a_i = \min\!\left(1, \frac{p_i}{\lambda q_i}\right) \]

achieves the exact acceptance budget:

\[ \sum_i q_i a_i = \bar{a}. \]

Moreover, among all acceptance rules \( a_i \in [0,1] \) that satisfy this constraint, the OBRS rule is the unique minimizer of the divergence to the target distribution:

\[ \tilde{\boldsymbol{q}} = \arg\min_{\hat{\boldsymbol{q}}} D_{\mathrm{KL}}(\boldsymbol{p} \,\|\, \hat{\boldsymbol{q}}), \quad \text{s.t.}\quad \hat{q}_i \propto q_i a_i,\; \sum_i q_i a_i = \bar{a}. \]

Thus, OBRS is provably optimal for aligning \( \boldsymbol{q} \) toward \( \boldsymbol{p} \) under any specified rejection budget. Detailed proofs are provided in the paper.
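In practice, the \( \lambda \) that realizes a given acceptance budget \( \bar{a} \) can be found numerically: the expected acceptance rate \( \sum_i q_i \min(1, p_i/(\lambda q_i)) \) is continuous and non-increasing in \( \lambda \), so a simple bisection suffices. The sketch below is one such procedure under our own assumptions; the paper may use a different solver.

```python
import numpy as np

def accept_rate(p, q, lam):
    """Expected OBRS acceptance rate: sum_i q_i * min(1, p_i / (lam * q_i))."""
    return np.sum(q * np.minimum(1.0, p / (lam * q)))

def solve_lambda(p, q, a_bar, iters=100):
    """Bisection for the lambda whose expected acceptance rate equals a_bar.

    The rate is 1 as lambda -> 0 and at most a_bar at the upper bound below,
    so bisection converges to the unique solution."""
    lo, hi = 1e-12, max(np.max(p / q), 1.0 / a_bar)
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if accept_rate(p, q, mid) > a_bar:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

# Example: find lambda for a 95% acceptance budget on toy distributions.
rng = np.random.default_rng(0)
p = rng.dirichlet(np.ones(100))
q = rng.dirichlet(np.ones(100))
lam = solve_lambda(p, q, a_bar=0.95)
print(lam, accept_rate(p, q, lam))  # acceptance rate ~= 0.95
```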

Jackpot


Jackpot jointly optimizes three components (see the system overview in the paper); below, we describe the policy objective.

OBRS-Adjusted RL Loss for the Policy Model

Tokens are sampled from the rollout distribution \(p_{\text{inf}}\). Each token \(x\) is accepted via OBRS:

\[ a(x)= \min\!\left(1,\frac{p_{\text{target}}(x)}{\lambda\,p_{\text{inf}}(x)}\right), \qquad \mathrm{Mask}(x)\sim\mathrm{Bernoulli}(a(x)). \]

Accepted tokens define the adjusted sampling distribution

\[ p'_{\text{inf}}(x)= \frac{p_{\text{inf}}(x)a(x)} {\sum_{x'} p_{\text{inf}}(x')a(x')}. \]

The standard PPO objective

\[ \mathcal{L}^{\text{PPO}}(\theta)= \mathbb{E}_{x\sim p_{\text{inf}}} \Big[ \mathrm{SG}\!\left( \boldsymbol{F}\!\left(\tfrac{p_{\text{ref}}(x)}{p_{\text{inf}}(x)}\right) \right) \min\big( r_\theta(x)\hat{A}(x), \operatorname{clip}(r_\theta(x),1-\epsilon,1+\epsilon)\hat{A}(x) \big) \Big] \]

becomes the OBRS-aware objective

\[ \mathcal{L}^{\text{PPO-OBRS}}(\theta)= \mathbb{E}_{x\sim p_{\text{inf}}} \Big[ \mathrm{Mask}(x)\, \mathrm{SG}\!\left( \boldsymbol{F}\!\left( \frac{p_{\text{target}}(x)}{p'_{\text{inf}}(x)} \right) \frac{p_{\text{ref}}(x)}{p_{\text{target}}(x)} \right) \min\big( r_\theta(x)\hat{A}(x), \operatorname{clip}(r_\theta(x),1-\epsilon,1+\epsilon)\hat{A}(x) \big) \Big] \]

Rejected samples are removed by \( \mathrm{Mask}(x) \), while accepted samples are reweighted through \( p'_{\text{inf}} \). The function \( \boldsymbol{F} \) provides truncated importance correction. The target distribution \(p_{\text{target}}\) may be chosen as either \(p_{\text{ref}}\) or the updated policy \(p_{\theta_{\text{new}}}\) for alignment.
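As a reference, here is a per-token sketch of this objective in PyTorch. It is a minimal illustration under several assumptions of ours: log-probabilities under the four distributions are already gathered for the sampled tokens, the OBRS normalizer \( \sum_{x'} p_{\text{inf}}(x')a(x') \) is precomputed per position from the full rollout distribution (passed as `log_norm`), \( r_\theta(x)=p_\theta(x)/p_{\text{ref}}(x) \) as in standard PPO, and TIS is used as \( \boldsymbol{F} \). The function and argument names are hypothetical, not from the paper's code.

```python
import math
import torch

def ppo_obrs_loss(logp_theta, logp_ref, logp_target, logp_inf, log_norm,
                  advantages, lam=1.1, clip_eps=0.2, tis_cap=2.0):
    """Minimal per-token sketch of the OBRS-aware PPO objective (returned negated, as a loss).

    All arguments are 1-D tensors over sampled tokens:
      logp_theta  - log p_theta(x), the model being updated
      logp_ref    - log p_ref(x), used in the PPO ratio r_theta
      logp_target - log p_target(x), the alignment target
      logp_inf    - log p_inf(x), the actor / rollout model
      log_norm    - log sum_x' p_inf(x') a(x'), precomputed per position
      advantages  - advantage estimates A_hat(x)
    """
    # OBRS acceptance probability a(x) = min(1, p_target / (lam * p_inf)) and Bernoulli mask.
    accept = torch.clamp((logp_target - logp_inf - math.log(lam)).exp(), max=1.0)
    mask = torch.bernoulli(accept)

    # Adjusted rollout distribution p'_inf(x) = p_inf(x) a(x) / Z, in log space.
    logp_inf_adj = logp_inf + accept.clamp_min(1e-12).log() - log_norm

    # Stop-gradient adjustment F(p_target / p'_inf) * p_ref / p_target, with TIS as F.
    with torch.no_grad():
        adj = torch.clamp((logp_target - logp_inf_adj).exp(), max=tis_cap) \
              * (logp_ref - logp_target).exp()

    # Clipped PPO term with r_theta = p_theta / p_ref.
    ratio = (logp_theta - logp_ref).exp()
    ppo_term = torch.minimum(ratio * advantages,
                             torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantages)

    # Rejected tokens are zeroed out by the mask; accepted tokens are reweighted by adj.
    return -(mask * adj * ppo_term).mean()
```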

Experiments

Jackpot Enables Stable Training with Rollouts from One Model and Updates to a Completely Different Model

Jackpot enables probability distribution alignment beyond existing methods. In the extreme two-model joint training setting, Jackpot allows the smaller, weaker model to roll out trajectories that are then used by the larger, stronger model for training. We show that prior TIS methods — even when augmented with KL regularization — consistently suffer from unstable training across three settings: Qwen2.5 (1.5B vs. 3B), Qwen3 (1.7B vs. 4B), and Qwen3 (1.7B vs. 8B) base models. In contrast, Jackpot achieves performance comparable to the large model’s on-policy performance.

Jackpot Can Enable Removal of Clipping in Other Actor-Policy Mismatch Settings

Clipping Experiments

(a) Jackpot enables removal of clipping in stale RL training. (b) Jackpot shows neither improvement nor degradation when actor–policy distributions are relatively close and can be sufficiently corrected by TIS. (c) When KL regularization is removed, Jackpot consistently sustains training longer than TIS counterparts.

We present more details about the experiment setup, ablations, and results in the paper.

Limitations

Despite Jackpot's effectiveness, there are still important limitations and gaps relative to existing methods.
1. Although Jackpot reduces the mismatch between the rollout model and the policy model, using a fully separate, smaller actor model to train a large and expensive policy does not completely eliminate the distribution gap or the resulting training instability. In our experiments, training may still diverge after extended optimization (e.g., beyond 300 update steps).
2. As shown in Paper Section 6, when the distribution shift is already small or adequately controlled by existing techniques (e.g., TIS, PPO clipping, KL regularization), Jackpot yields only minor improvements over standard baselines. We leave these as future work.

References

If you find Jackpot useful, please cite:

BibTeX

@misc{chen2026jackpotoptimalbudgetedrejection,
title={Jackpot: Optimal Budgeted Rejection Sampling for Extreme Actor-Policy Mismatch Reinforcement Learning}, 
author={Zhuoming Chen and Hongyi Liu and Yang Zhou and Haizhong Zheng and Beidi Chen},
year={2026},
eprint={2602.06107},
archivePrefix={arXiv},
primaryClass={cs.AI},
url={https://arxiv.org/abs/2602.06107}, 
}