Five arXiv Papers Advance RLVR Post-Training for LLMs

The Post-Training Problem Space

Outcome-based reinforcement learning with verifiable rewards (RLVR) has become a standard post-training tool for improving LLM reasoning, but practitioners working on hard mathematical or coding tasks routinely encounter three compounding failures: sparse reward signals that leave most rollouts uninformative, poor credit assignment that cannot distinguish which tokens or steps drove a correct answer, and exploration failures that trap policies in gradient dead zones where correct final answers almost never appear [15]. A cluster of late-May 2026 arXiv preprints attacks each of these failure modes with distinct algorithmic interventions, offering practitioners a menu of drop-in or complementary improvements to existing GRPO-based pipelines.
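The dead-zone failure can be made concrete with the group-relative advantage that GRPO-style pipelines compute. The following is a minimal sketch, assuming binary verifier rewards and the usual mean and standard-deviation normalization within a rollout group; the function name and the `eps` constant are illustrative choices, not taken from any of the papers.

```python
from statistics import mean, pstdev

def group_advantages(rewards, eps=1e-6):
    """Group-relative advantages: (r - group mean) / (group std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Every rollout fails: all advantages are zero, so the group yields no gradient.
dead_zone = group_advantages([0.0, 0.0, 0.0, 0.0])

# One rollout succeeds: the same group now carries a learning signal.
mixed = group_advantages([0.0, 0.0, 0.0, 1.0])
```

When every rollout in a group receives the same reward, the numerator vanishes for all of them, which is the situation the methods below try to avoid or repair.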

Fixing Credit Assignment: ConSPO, SCRL, and Temporal Scheduling

ConSPO (Contrastive Sequence-level Policy Optimization) starts from a reformulation of GRPO as a weighted positive-negative score difference, then identifies two structural weaknesses in that framing. First, GRPO optimizes clipped token-level importance-sampling ratios rather than the generation likelihoods that govern autoregressive sampling, creating a misalignment between what is optimized and what is deployed. Second, GRPO assigns rollout-level credit without accounting for the relative score gap between positive and negative rollouts in the same group. ConSPO replaces the clipped-ratio scores with length-normalized sequence log-probabilities and substitutes a group-wise InfoNCE-style contrastive objective that amplifies updates for poorly separated positives while concentrating suppressive updates on high-scoring negatives. A curriculum-scheduled margin guides optimization from coarse positive-negative ordering early in training toward stronger separation later [4].
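The shape of such an objective can be sketched in a few lines. This is an illustrative approximation, not the paper's exact loss: the temperature `tau`, the way the margin enters the positive logit, and the linear `margin_schedule` are all assumptions made for the example.

```python
import math

def seq_score(token_logprobs):
    """Length-normalized sequence log-probability of one rollout."""
    return sum(token_logprobs) / len(token_logprobs)

def contrastive_loss(pos_scores, neg_scores, margin=0.0, tau=1.0):
    """Mean InfoNCE-style loss of each positive against the group's negatives.

    Assumes at least one positive rollout in the group.
    """
    losses = []
    for s_pos in pos_scores:
        # The positive logit is penalized by the margin (an assumed formulation).
        logits = [(s_pos - margin) / tau] + [s / tau for s in neg_scores]
        peak = max(logits)
        log_z = peak + math.log(sum(math.exp(l - peak) for l in logits))
        losses.append(log_z - logits[0])
    return sum(losses) / len(losses)

def margin_schedule(step, total_steps, max_margin=0.5):
    """Hypothetical linear curriculum: margin grows from 0 to max_margin."""
    return max_margin * min(1.0, step / total_steps)

well_separated = contrastive_loss([-0.2], [-2.0])
poorly_separated = contrastive_loss([-1.0], [-0.9])
```

Because the loss is a softmax over the group, its gradient is largest for positives that sit close to the negatives and is spread over negatives in proportion to their scores, which matches the qualitative behavior described above.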

SCRL (Subproblem Curriculum Reinforcement Learning) addresses the gradient dead-zone problem differently. When a hard problem almost never yields a correct final answer, outcome-based RLVR receives near-zero gradient signal. SCRL derives verifiable subproblems directly from reference reasoning chains, fixes the final subproblem as the original problem, and normalizes rewards independently at each subproblem position, assigning the resulting advantages to the corresponding answer spans. This converts partial progress on hard problems into usable learning signals without external rubrics or reward models. On seven mathematical reasoning benchmarks, SCRL improved average accuracy over GRPO by 4.1 points on Qwen3-4B-Base and 1.9 points on Qwen3-14B-Base, with larger relative gains as problem difficulty increased [15].
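The per-position normalization can be illustrated with a small sketch. It assumes each rollout has already been scored by a verifier on every subproblem, with the last column holding the original problem; the data layout and the function name are hypothetical, and mapping advantages onto answer spans is left out.

```python
from statistics import mean, pstdev

def per_position_advantages(rewards, eps=1e-6):
    """rewards[i][k] is the verifier reward of rollout i on subproblem k.

    Each column (subproblem position) is normalized independently across
    the rollout group. The last column is the original problem.
    """
    n_pos = len(rewards[0])
    adv = [[0.0] * n_pos for _ in rewards]
    for k in range(n_pos):
        col = [row[k] for row in rewards]
        mu, sigma = mean(col), pstdev(col)
        for i, r in enumerate(col):
            adv[i][k] = (r - mu) / (sigma + eps)
    return adv

# Four rollouts, three positions; nobody solves the final (original) problem.
group = [
    [1.0, 1.0, 0.0],
    [1.0, 0.0, 0.0],
    [0.0, 0.0, 0.0],
    [1.0, 1.0, 0.0],
]
adv = per_position_advantages(group)
```

In this toy group the final column carries no signal, yet the first two columns still separate the rollouts, which is how partial progress becomes usable gradient.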

A third paper argues that the temporal dimension of credit allocation has been overlooked entirely. Standard RLVR applies a globally broadcast scalar reward across all sampled tokens throughout training, ignoring the heterogeneous policy behaviors that appear at different trajectory positions and at different training stages. The paper introduces temporal scheduling of credit allocation criteria, finding that prioritizing targeted tokens associated with specific policy behaviors early in training, then gradually relaxing toward general optimization, produces more stable entropy dynamics. The analysis shows that standard optimization substantially sacrifices policy entropy when simultaneously accommodating heterogeneous behaviors, whereas temporal scheduling avoids this collapse [26].
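One way to picture a schedule of this kind is a per-token weight that starts concentrated on the targeted tokens and relaxes toward uniform. The sketch below is purely illustrative: the linear ramp, the boolean `targeted_mask`, and both function names are assumptions, and the paper's actual criteria for selecting tokens are not reproduced here.

```python
def token_weights(targeted_mask, step, total_steps):
    """Targeted tokens keep weight 1.0; all others ramp from 0.0 to 1.0."""
    alpha = min(1.0, max(0.0, step / total_steps))
    return [1.0 if is_target else alpha for is_target in targeted_mask]

def token_advantages(advantage, targeted_mask, step, total_steps):
    """Scale one rollout-level advantage into per-token advantages."""
    weights = token_weights(targeted_mask, step, total_steps)
    return [advantage * w for w in weights]

mask = [True, False, False, True]
early = token_advantages(1.5, mask, step=0, total_steps=100)
late = token_advantages(1.5, mask, step=100, total_steps=100)
```

At step 0 only the targeted tokens receive credit; by the end of the ramp the scheme reduces to the standard broadcast reward.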

Guiding Exploration Without Reward Hacking: SMEPO and Self-Play Stability

Expert traces offer a natural remedy for exploration failures, but they introduce a reward hacking risk: if the trace exposes the final answer or intermediate values that the verifier checks, the policy can obtain reward by copying rather than reasoning. SMEPO (Semantic Masked Expert Policy Optimization) addresses this with fine-grained semantic masking that targets reward-relevant spans along the critical path, including final answers, intermediate values, executable implementations, and answer-related entities, while preserving the expert’s decomposition, plan, and procedural structure. The result is a fill-in-the-blank process: the policy follows the expert’s problem-solving route but must reconstruct the masked content independently. SMEPO requires no changes to the reward function or RL objective and reported accuracy improvements of up to 3.2 points over GRPO alongside training-time reductions of up to 4.2x across math, code, and agentic search domains [23].
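The fill-in-the-blank idea can be sketched as span replacement over an expert trace. How SMEPO actually identifies reward-relevant spans is not described here, so the example uses a deliberately crude stand-in (every run of digits) and a hypothetical `[MASK]` placeholder.

```python
import re

def mask_spans(trace, spans, token="[MASK]"):
    """Replace each (start, end) character span in the trace with a mask token."""
    out, cursor = [], 0
    for start, end in sorted(spans):
        if start < cursor:
            raise ValueError("spans must not overlap")
        out.append(trace[cursor:start])
        out.append(token)
        cursor = end
    out.append(trace[cursor:])
    return "".join(out)

trace = "Plan: factor the quadratic. Roots are 2 and 3. Final answer: 5"

# Crude stand-in detector (an assumption): treat every number as reward-relevant.
spans = [m.span() for m in re.finditer(r"\d+", trace)]
masked = mask_spans(trace, spans)
```

The plan and procedural wording survive intact, while the values a verifier could check must be regenerated by the policy.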

A separate study on self-play RL stability challenges the common assumption that training collapse is primarily a reward-design problem. Through controlled experiments on a Python output-prediction task and a deterministic-DSL variant, the paper identifies two asymmetric levers: a data-level gate that controls which proposer-generated tasks enter the training pool, and the reward signal applied to admitted tasks. A strict gate proved sufficient for stability across every reward variant tested, including a self-consistency reward with no ground-truth access. No reward variant was sufficient once the gate was removed. The paper also identifies what it calls the Grounded Proposer Paradox: a proposer with ground-truth access collapsed faster than an ungrounded one when paired with a self-consistency solver, because it concentrated training on clean tasks that formed the fastest path to a spurious self-consistent attractor [14].
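A data-level gate of this sort sits between the proposer and the training pool. The sketch below shows only the plumbing; the admission criterion (the task runs without error and returns the same output on repeated runs) is an assumption chosen for the output-prediction setting, not the gate the paper used.

```python
import itertools

def passes_gate(task_fn, task_input, n_runs=3):
    """Admit a task only if it runs cleanly and deterministically."""
    outputs = []
    for _ in range(n_runs):
        try:
            outputs.append(repr(task_fn(task_input)))
        except Exception:
            return False
    return len(set(outputs)) == 1

def admit(pool, proposed):
    """Append only gated tasks to the training pool."""
    for task_fn, task_input in proposed:
        if passes_gate(task_fn, task_input):
            pool.append((task_fn, task_input))
    return pool

counter = itertools.count()
proposed = [
    (sorted, [3, 1, 2]),                  # clean and deterministic
    (lambda x: next(counter), None),      # output changes on every run
    (lambda x: 1 / x, 0),                 # raises ZeroDivisionError
]
pool = admit([], proposed)
```

The reward function never appears in this code path, which reflects the paper's framing of the gate and the reward as separate levers.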

Distillation Improvements: MOPD, Local Teachability Collapse, and F-TIS

On-policy distillation (OPD) provides denser token-level supervision than sparse verifier rewards, but existing methods distill each rollout independently. MOPD (Multi-Rollout On-Policy Distillation) conditions the teacher on the student’s full local rollout group, using successful peer rollouts as positive evidence for valid reasoning patterns and failed rollouts as structured negative evidence about plausible mistakes. Two peer-context constructions are studied: positive peer imitation and contrastive success-failure conditioning. Teacher-signal analysis showed that mixed success-failure contexts better aligned teacher scores with verifier rewards, indicating that gains arise from more faithful, instance-adaptive supervision [3].
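The two peer-context constructions can be sketched as prompt assembly for the teacher. The template wording, the function name, and the choice to leave the rollout being scored out of its own context are all assumptions made for illustration.

```python
def build_teacher_context(problem, rollouts, target_idx, mode="contrastive"):
    """Assemble peer evidence for the teacher.

    rollouts is a list of (text, reward) pairs from one rollout group.
    mode="positive" keeps only verified-correct peers;
    mode="contrastive" keeps both correct and incorrect peers.
    """
    parts = [f"Problem:\n{problem}"]
    for i, (text, reward) in enumerate(rollouts):
        if i == target_idx:
            continue  # assumed: the rollout being scored is not its own peer
        if reward > 0:
            parts.append(f"Peer attempt (verified correct):\n{text}")
        elif mode == "contrastive":
            parts.append(f"Peer attempt (verified incorrect):\n{text}")
    return "\n\n".join(parts)

group = [
    ("Direct division gives 4", 1.0),
    ("x = 7", 0.0),
    ("x = 4 by substitution", 1.0),
]
positive_ctx = build_teacher_context("Solve 2x = 8.", group, 0, mode="positive")
contrastive_ctx = build_teacher_context("Solve 2x = 8.", group, 0)
```

The teacher would then score the target rollout's tokens conditioned on this context rather than on the problem alone.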

A complementary paper identifies a failure mode in strong-to-weak OPD called local teachability collapse. Later segments of a generated trajectory may retain a non-zero teacher-student advantage yet lack the local contrast that makes dense feedback effective for prioritizing student learning. The proposed fix is a trajectory-specific release rule that measures the teacher’s margin over the student’s top-K candidate set, aggregates this margin across sentence segments, and truncates dense OPD supervision upon detecting a BIC-style downward change point. Experiments on the Qwen3 model family showed consistent outperformance over standard full-trajectory OPD on five in-domain benchmarks and better preservation of out-of-domain capabilities [11].
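The release rule can be approximated with a single-change-point test on the per-segment margins. This is a generic sketch rather than the paper's procedure: it assumes the margins have already been aggregated per sentence segment, fits a piecewise-constant mean, and uses a textbook BIC comparison with illustrative parameter counts.

```python
import math

def _rss(xs):
    """Residual sum of squares around the mean."""
    mu = sum(xs) / len(xs)
    return sum((x - mu) ** 2 for x in xs)

def _bic(rss, n, k, eps=1e-12):
    """Gaussian BIC up to a constant; eps guards against log(0)."""
    return n * math.log(rss / n + eps) + k * math.log(n)

def downward_change_point(margins, min_seg=2):
    """Return the segment index where the mean margin drops, or None."""
    n = len(margins)
    best_idx, best_bic = None, _bic(_rss(margins), n, k=1)
    for t in range(min_seg, n - min_seg + 1):
        left, right = margins[:t], margins[t:]
        if sum(right) / len(right) >= sum(left) / len(left):
            continue  # only downward shifts count
        bic = _bic(_rss(left) + _rss(right), n, k=3)
        if bic < best_bic:
            best_idx, best_bic = t, bic
    return best_idx

margins = [2.0, 2.1, 1.9, 2.0, 0.2, 0.1, 0.3, 0.2]
cut = downward_change_point(margins)
supervised_segments = margins[:cut] if cut is not None else margins
```

Dense supervision would be applied only to the segments before `cut`; when no split beats the single-mean model, the full trajectory is kept.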

For practitioners operating in decentralized or heterogeneous compute environments, F-TIS (Filtered Truncated Importance Sampling) extends GRPO to settings where collaborating models differ in architecture or scale. Off-policy samples from heterogeneous participants are handled through filtered importance sampling, and in the paper's evaluations the framework matched the final convergence of purely on-policy training, with up to 12% better performance on out-of-distribution tasks in some configurations [21].
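A rough picture of filtered, truncated importance weighting is given below. Truncating the ratio at a cap is the standard truncated-importance-sampling step; the filter rule (drop samples whose ratio falls below `min_ratio`) and both threshold values are assumptions, since the paper's exact criterion is not given here.

```python
import math

def ftis_weights(logp_current, logp_behavior, cap=2.0, min_ratio=0.1):
    """Per-sample weights for off-policy rollouts.

    logp_current: sequence log-probs under the model being trained.
    logp_behavior: sequence log-probs under the model that generated them.
    """
    weights = []
    for lc, lb in zip(logp_current, logp_behavior):
        ratio = math.exp(lc - lb)
        if ratio < min_ratio:
            weights.append(0.0)              # filtered: too far off-policy
        else:
            weights.append(min(ratio, cap))  # truncated importance weight
    return weights

weights = ftis_weights(
    logp_current=[-1.0, 0.0, -10.0],
    logp_behavior=[-1.0, -5.0, -1.0],
)
```

Each weight would multiply the corresponding sample's contribution to the policy-gradient loss, so on-policy samples pass through unchanged.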

Data Selection and Curation for Reasoning Training

Data quality is increasingly recognized as a binding constraint on post-training effectiveness. GRACE (Gradient-Aligned Reasoning Data Curation) scores each step within a reasoning trace by two signals: its alignment with the answer-oriented gradient direction and its consistency with the preceding reasoning trajectory. Step-level scores are aggregated into sample-level values using only the model’s internal optimization signals, with no external reward models or step annotations. A representation-level gradient proxy estimates step-level alignment from token-level signals in a single forward pass. Post-training Qwen3-VL-2B-Instruct on MMathCoT-1M, GRACE reached 108.8% of full-data performance with 20% of the data and retained 100.2% with only 5% [8].
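The two-signal scoring can be sketched with cosine similarities over per-step vectors. Everything concrete here is an assumption: the vectors stand in for whatever gradient proxy the method extracts, consistency is measured against the running sum of earlier steps, and the equal-weight mix and mean aggregation are illustrative choices.

```python
import math

def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def sample_score(step_vecs, answer_vec, mix=0.5):
    """Average step score: alignment with the answer direction mixed with
    consistency against the running sum of preceding steps."""
    scores, running = [], None
    for vec in step_vecs:
        align = cosine(vec, answer_vec)
        consist = cosine(vec, running) if running is not None else align
        scores.append(mix * align + (1 - mix) * consist)
        running = list(vec) if running is None else [r + x for r, x in zip(running, vec)]
    return sum(scores) / len(scores)

def select_top(scored_samples, fraction):
    """Keep the highest-scoring fraction of (sample_id, score) pairs."""
    k = max(1, int(len(scored_samples) * fraction))
    return sorted(scored_samples, key=lambda s: s[1], reverse=True)[:k]

coherent = sample_score([[1.0, 0.0], [1.0, 0.0]], [1.0, 0.0])
reversing = sample_score([[1.0, 0.0], [-1.0, 0.0]], [1.0, 0.0])
```

A trace whose steps keep pointing toward the answer scores higher than one that reverses direction midway, and `select_top` then keeps the chosen data fraction.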

A training-free alternative, the High-Entropy Sum (HES) metric, quantifies reasoning quality by summing the entropy of only the highest-entropy tokens in each sample (the top 0.5% by entropy). HES was validated across supervised fine-tuning, rejection fine-tuning, and RL paradigms. In SFT, training on the top 20% HES-ranked data matched full-dataset performance, while using the lowest-HES data degraded it [13].
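The metric as described is simple enough to sketch directly. The example assumes per-token entropies are computed from the model's next-token distributions and that the top 0.5% is rounded up to at least one token; the rounding rule is an assumption.

```python
import math

def token_entropy(probs):
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def high_entropy_sum(entropies, top_frac=0.005):
    """Sum the entropies of the highest-entropy tokens in one sample."""
    k = max(1, math.ceil(len(entropies) * top_frac))
    return sum(sorted(entropies, reverse=True)[:k])

# A 200-token sample where a single token is highly uncertain.
entropies = [0.1] * 199 + [3.0]
hes = high_entropy_sum(entropies)
```

Ranking a dataset by this score and keeping the top slice requires only token-level probabilities from a forward pass, with no training.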

A third data-focused paper investigated whether reasoning dataset utility can be predicted before training using intrinsic metrics, finding that predictors are scale-dependent. Smaller models relied on alignment-focused metrics to ensure precision, while larger models benefited from high redundancy, using verbose traces to solve complex tasks [7].

FAQ

Q. Can ConSPO be adopted as a drop-in replacement for GRPO without changing the training infrastructure? ConSPO replaces GRPO’s clipped-ratio scoring with length-normalized log-probabilities and substitutes an InfoNCE-style contrastive objective, but the paper does not describe infrastructure changes beyond the objective itself [4]. Practitioners should expect changes to the loss computation but not to the rollout generation pipeline.

Q. Does SCRL require annotated step-level reward models or external rubrics? No. SCRL derives verifiable subproblems from reference reasoning chains and uses subproblem-level normalization to assign advantages to corresponding answer spans, explicitly without external rubrics or reward models [15].

Q. Is data gating in self-play RL more important than reward design for preventing collapse? According to the self-play stability study, a strict data-level gate was sufficient for stability under every reward variant tested, while no reward variant was sufficient once the gate was removed. The paper characterizes data gating as the binding constraint [14].

Q. Does SMEPO require modifications to the reward function or RL objective? No. SMEPO operates by masking reward-relevant semantic spans in expert traces before they are presented to the policy and requires no changes to the reward function or RL objective [23].

Q. How does the state-distribution view change how practitioners should think about choosing between SFT, RL, and on-policy distillation? The state-distribution paper argues that the source and locality of training states can be as important as the form of the supervision signal. It found that on-policy distillation from a degraded SFT teacher surpassed that teacher on multiple benchmarks, and that a lightweight on-policy RL run improved reasoning while preserving retention, supporting a state-centric rather than loss-function-centric view of method selection [20].

Key Takeaways

ConSPO addresses two structural weaknesses in GRPO by replacing clipped-ratio scores with length-normalized log-probabilities and introducing an InfoNCE-style contrastive objective with a curriculum-scheduled margin [4].
SCRL lifts hard problems out of gradient dead zones by deriving verifiable subproblems from reference chains and normalizing rewards at each subproblem position, yielding larger relative gains as problem difficulty increases [15].
In self-play RL, data gating is the primary lever against training collapse; strict gating stabilized training across all reward variants tested, while no reward design compensated for a missing gate [14].
GRACE and HES offer complementary data-selection strategies: GRACE uses gradient-aligned step-level scoring to keep only 5-20% of the data while matching or exceeding full-data performance, while HES provides a training-free entropy-based metric that generalizes across SFT, RFT, and RL paradigms [8][13].
The state-distribution view of post-training suggests that where supervision is applied, specifically whether training states come from fixed datasets or from the current learner, can be as consequential as the choice of loss function [20].