Tokens are sampled from the rollout distribution \(p_{\text{inf}}\).
Each sampled token \(x\) is then accepted or rejected via OBRS, with acceptance probability \(a(x)\):
\[
a(x)=
\min\!\left(1,\frac{p_{\text{target}}(x)}{\lambda\,p_{\text{inf}}(x)}\right),
\qquad
\mathrm{Mask}(x)\sim\mathrm{Bernoulli}(a(x)).
\]
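A minimal PyTorch sketch of this acceptance step, assuming per-token probabilities under \(p_{\text{target}}\) and \(p_{\text{inf}}\) are given as tensors; the function name \texttt{obrs\_accept} and the argument \texttt{lam} (for \(\lambda\)) are illustrative:
\begin{verbatim}
import torch

def obrs_accept(p_target: torch.Tensor,
                p_inf: torch.Tensor,
                lam: float = 1.0):
    """a(x) = min(1, p_target(x) / (lam * p_inf(x))); Mask(x) ~ Bernoulli(a(x))."""
    accept_prob = torch.clamp(p_target / (lam * p_inf), max=1.0)
    mask = torch.bernoulli(accept_prob)  # 1 = accepted, 0 = rejected
    return accept_prob, mask
\end{verbatim}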
Accepted tokens define the adjusted sampling distribution
\[
p'_{\text{inf}}(x)=
\frac{p_{\text{inf}}(x)a(x)}
{\sum_{x'} p_{\text{inf}}(x')a(x')}.
\]
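Under the same assumptions, the renormalization that defines \(p'_{\text{inf}}\) is a single reduction over the vocabulary dimension (taken here as the last tensor dimension):
\begin{verbatim}
import torch

def adjusted_rollout_dist(p_inf: torch.Tensor,
                          accept_prob: torch.Tensor) -> torch.Tensor:
    """p'_inf(x) = p_inf(x) * a(x) / sum_{x'} p_inf(x') * a(x')."""
    unnormalized = p_inf * accept_prob
    return unnormalized / unnormalized.sum(dim=-1, keepdim=True)
\end{verbatim}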
The standard PPO objective
\[
\mathcal{L}^{\text{PPO}}(\theta)=
\mathbb{E}_{x\sim p_{\text{inf}}}
\Big[
\mathrm{SG}\!\left(
\boldsymbol{F}\!\left(\tfrac{p_{\text{ref}}(x)}{p_{\text{inf}}(x)}\right)
\right)
\min\big(
r_\theta(x)\hat{A}(x),
\operatorname{clip}(r_\theta(x),1-\epsilon,1+\epsilon)\hat{A}(x)
\big)
\Big]
\]
becomes the OBRS-aware objective
\[
\mathcal{L}^{\text{PPO-OBRS}}(\theta)=
\mathbb{E}_{x\sim p_{\text{inf}}}
\Big[
\mathrm{Mask}(x)\,
\mathrm{SG}\!\left(
\boldsymbol{F}\!\left(
\frac{p_{\text{target}}(x)}{p'_{\text{inf}}(x)}
\right)
\frac{p_{\text{ref}}(x)}{p_{\text{target}}(x)}
\right)
\min\big(
r_\theta(x)\hat{A}(x),
\operatorname{clip}(r_\theta(x),1-\epsilon,1+\epsilon)\hat{A}(x)
\big)
\Big].
\]
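A hedged sketch of the resulting loss, working in log-probabilities; the truncation threshold \texttt{c\_max} for \(\boldsymbol{F}\) and the convention \(r_\theta(x) = p_\theta(x)/p_{\text{ref}}(x)\) are assumptions made for illustration, not fixed by the formulas above:
\begin{verbatim}
import torch

def ppo_obrs_loss(logp_theta, logp_ref, logp_target, logp_inf_adj,
                  mask, advantages, eps: float = 0.2, c_max: float = 2.0):
    """Masked, reweighted clipped PPO loss under OBRS (negated for minimization).

    mask         -- acceptance mask Mask(x), 0/1 per token
    logp_inf_adj -- log p'_inf(x), the adjusted rollout log-probability
    """
    ratio = torch.exp(logp_theta - logp_ref)                           # assumed r_theta(x)
    f = torch.clamp(torch.exp(logp_target - logp_inf_adj), max=c_max)  # F(p_target / p'_inf), truncated
    weight = (f * torch.exp(logp_ref - logp_target)).detach()          # SG(F(...) * p_ref / p_target)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * advantages
    # Dropping the mask and weighting by F(p_ref / p_inf) instead recovers
    # the standard objective above.
    return -(mask * weight * torch.minimum(unclipped, clipped)).mean()
\end{verbatim}
Averaging over all sampled tokens, with rejected ones contributing zero through the mask, corresponds to the expectation over \(x\sim p_{\text{inf}}\) with \(\mathrm{Mask}(x)\) inside.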
Rejected tokens are zeroed out by \( \mathrm{Mask}(x) \), while accepted
tokens are reweighted through the adjusted distribution \( p'_{\text{inf}} \).
The function \( \boldsymbol{F} \) applies a truncated importance-sampling correction,
evaluated under the stop-gradient \( \mathrm{SG} \).
The target distribution \(p_{\text{target}}\) may be chosen as either
\(p_{\text{ref}}\) or the updated policy \(p_{\theta_{\text{new}}}\),
depending on which distribution the rollout samples should be aligned with.