CUA-Gym Generates 32K Verified RLVR Tuples for Computer-Use Agents

The Data Bottleneck Blocking Computer-Use Agent Training

Reinforcement learning with verifiable rewards has produced measurable gains in domains such as mathematics, tool use, and software engineering, where deterministic reward signals are relatively straightforward to construct [1]. Computer-use agents have not shared in those gains. Building training data for this category of agent requires three things to coexist: a consistent task instruction, an executable environment, and a verifiable reward function. Achieving all three at scale has proven difficult.

Two prior approaches each solve part of the problem. Hand-curated benchmarks deliver high reward fidelity but cover only a narrow slice of applications. Datasets that rely on a large language model as a judge scale more broadly but sacrifice reliable verification [1]. Neither path alone produces the volume of trustworthy training signal that RLVR requires. CUA-Gym is a pipeline designed to close that gap.

What CUA-Gym Is and What It Produces

CUA-Gym is a scalable co-generation pipeline that outputs verified RLVR training tuples suitable for training computer-use agents. The released dataset contains 32,112 verified tuples grounded across 110 environments [1].
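To make the idea of a training tuple concrete, here is a minimal sketch of how one could be represented in code. The class name, field names, and the shopping-cart task are illustrative assumptions; they are not the schema of the released dataset.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict

# Hypothetical representation: an environment state as a plain dictionary.
EnvState = Dict[str, Any]


@dataclass(frozen=True)
class RLVRTuple:
    """Illustrative container; field names are assumptions, not the released schema."""

    instruction: str                          # natural-language task given to the agent
    initial_state: EnvState                   # state the agent starts from
    reward_fn: Callable[[EnvState], float]    # deterministic check on the final state


def cart_has_two_items(state: EnvState) -> float:
    """Return 1.0 only when the (hypothetical) cart holds exactly two items."""
    return 1.0 if len(state.get("cart", [])) == 2 else 0.0


example = RLVRTuple(
    instruction="Add two notebooks to the shopping cart.",
    initial_state={"cart": []},
    reward_fn=cart_has_two_items,
)
```

The reward here is a pure function of the final environment state, which is what makes it verifiable without a model acting as judge.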

Because real-world training environments were themselves scarce, the researchers also synthesized CUA-Gym-Hub, described as a broad suite of high-fidelity mock web applications grounded in real-world software-use distributions. CUA-Gym-Hub expands the scale of available computer-use RLVR data by a significant margin compared to prior work [1]. Together, the dataset, the environment suite, the synthesis pipeline, and trained model checkpoints are slated for open-source release.

How the Generator-Discriminator-Orchestrator Loop Works

The pipeline centers on three distinct agents working in coordination. A Generator agent is responsible for constructing both the initial environment state and the golden (target) environment state for a given task. A Discriminator agent, working from the task specification, writes the reward function that will be used to evaluate whether an agent has completed the task correctly. An orchestrator agent drives both the Generator and Discriminator through iterative execution rounds, managing the loop until the components are mutually consistent [1].

This architecture addresses a core challenge in computer-use agent data generation: task instructions, environment states, and reward functions must be coherent with one another. Generating them independently and combining them afterward risks misalignment between what the task asks, what the environment contains, and what the reward function actually measures. The adversarial loop between Generator and Discriminator, mediated by the orchestrator, is intended to enforce that coherence before any tuple leaves the generation stage [1].
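The loop described above can be sketched with toy stand-ins for the three agents. Everything in this example is an assumption made for illustration: the function names, the dictionary-based state, the feedback string, and in particular the consistency check (the golden state must earn full reward and the initial state must not). The source does not describe the actual stopping criterion.

```python
from typing import Callable, Dict, Optional, Tuple

State = Dict[str, int]
RewardFn = Callable[[State], float]


def generator(task: str, feedback: Optional[str]) -> Tuple[State, State]:
    """Toy Generator: returns (initial_state, golden_state).

    The first attempt is deliberately wrong so the loop has something to repair.
    """
    initial = {"unread": 3}
    golden = {"unread": 3} if feedback is None else {"unread": 0}
    return initial, golden


def discriminator(task: str) -> RewardFn:
    """Toy Discriminator: builds a reward function from the task alone."""
    return lambda state: 1.0 if state["unread"] == 0 else 0.0


def orchestrate(task: str, max_rounds: int = 3) -> Optional[dict]:
    """Toy orchestrator: iterate until states and reward function agree."""
    feedback: Optional[str] = None
    for round_idx in range(1, max_rounds + 1):
        initial, golden = generator(task, feedback)
        reward = discriminator(task)
        # Assumed consistency check: golden state passes, initial state fails.
        if reward(golden) == 1.0 and reward(initial) == 0.0:
            return {"task": task, "initial": initial,
                    "reward": reward, "rounds": round_idx}
        feedback = "golden state does not satisfy the reward function"
    return None  # no consistent tuple within the round budget


result = orchestrate("Mark all emails as read.")
```

In this toy run the first round fails the check, the feedback is passed back to the Generator, and the second round yields a consistent tuple.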

Quality Filtering and Reward Verification

The per-task adversarial loop is not the final quality gate. After generation, each tuple passes through an additional filtering stage that combines LLM majority voting with agent rollouts [1]. Majority voting aggregates judgments from multiple model calls to reduce the influence of any single erroneous evaluation. Agent rollouts test whether an agent can actually complete the task in the generated environment and receive the expected reward signal.

This two-part filter is positioned as a safeguard that enforces quality beyond what the Generator-Discriminator loop alone can guarantee [1]. The combination is designed to catch cases where the adversarial loop produces internally consistent but practically flawed tuples, such as tasks that are unsolvable in the generated environment state or reward functions that do not correctly reflect task completion.
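A minimal sketch of such a two-part filter follows. The acceptance rules are assumptions chosen for illustration: a strict majority of judge votes must accept the tuple, and at least one agent rollout must earn full reward. The source does not state the actual thresholds or how the two signals are combined.

```python
from typing import Sequence


def majority_vote(votes: Sequence[bool]) -> bool:
    """True only if strictly more than half of the judge votes accept."""
    return sum(1 for v in votes if v) > len(votes) / 2


def successful_rollouts(rewards: Sequence[float]) -> int:
    """Count rollouts that received full reward (1.0)."""
    return sum(1 for r in rewards if r == 1.0)


def keep_tuple(
    votes: Sequence[bool],
    rollout_rewards: Sequence[float],
    min_successes: int = 1,  # hypothetical threshold, not from the source
) -> bool:
    """Keep a tuple only if both the vote and the rollout check pass."""
    return (
        majority_vote(votes)
        and successful_rollouts(rollout_rewards) >= min_successes
    )
```

Under these assumptions, a tuple whose task no rollout can solve is dropped even if every judge accepts it, which corresponds to the "internally consistent but practically flawed" case described above.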

Benchmark Results with CUA-Gym-A3B and GSPO Training

Models trained on CUA-Gym data with the GSPO training algorithm were evaluated on two benchmarks. On OSWorld-Verified, the CUA-Gym-A3B model reached 62.1% and the CUA-Gym-A17B model reached 72.6%. Both figures represent improvements over prior open-source computer-use agents at comparable parameter scales [1].

The researchers also evaluated the same checkpoints on WebArena, a benchmark held out from training. Improvements on WebArena indicate that the models transfer beyond the specific environments included in the CUA-Gym training set [1]. The paper also reports that performance scales smoothly with both data volume and environment diversity, suggesting that expanding the pipeline’s output could yield further gains without architectural changes.
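For readers unfamiliar with the training algorithm named above, the following sketch shows the core quantities of a GSPO-style update. It assumes that GSPO here refers to Group Sequence Policy Optimization; the source does not expand the acronym. The function names, the use of the population standard deviation, and the clipping value are illustrative assumptions and are not taken from the CUA-Gym paper.

```python
import math
from statistics import mean, pstdev
from typing import List, Sequence


def group_advantages(rewards: Sequence[float], eps: float = 1e-8) -> List[float]:
    """Normalize rewards within a group of rollouts for the same task."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std is an assumption
    return [(r - mu) / (sigma + eps) for r in rewards]


def sequence_ratio(new_logps: Sequence[float], old_logps: Sequence[float]) -> float:
    """Length-normalized, sequence-level importance ratio.

    Equals exp(mean over tokens of (log pi_new - log pi_old)).
    """
    if not new_logps or len(new_logps) != len(old_logps):
        raise ValueError("log-prob sequences must be non-empty and equal length")
    delta = sum(n - o for n, o in zip(new_logps, old_logps))
    return math.exp(delta / len(new_logps))


def clipped_term(ratio: float, advantage: float, clip_eps: float = 0.2) -> float:
    """PPO-style clipped surrogate applied to the whole sequence.

    clip_eps=0.2 is a placeholder value, not a reported setting.
    """
    clipped = min(max(ratio, 1.0 - clip_eps), 1.0 + clip_eps)
    return min(ratio * advantage, clipped * advantage)
```

With binary verifiable rewards, a group in which some rollouts succeed and others fail yields positive advantages for the successes and negative ones for the failures, which is the signal the policy update uses.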

Implications for Scaling Computer-Use Agent Research

CUA-Gym is primarily relevant to researchers and practitioners working on training computer-use agents with reinforcement learning. The open-source release of the synthesis pipeline, dataset, CUA-Gym-Hub environments, and model checkpoints means teams do not need to reconstruct the data generation infrastructure from scratch [1].

Limitations remain. The 110 environments, while broader than prior hand-curated benchmarks, still represent a bounded slice of real-world software. The mock web applications in CUA-Gym-Hub are grounded in real-world software-use distributions but are not identical to production systems. Transfer to WebArena is a positive indicator, but it covers one held-out benchmark rather than the full range of deployment contexts practitioners encounter.

For the broader RLVR and agent-training community, the dataset addresses a specific structural gap: the absence of verified, scalable training data with deterministic rewards for GUI-based agents. Whether the pipeline’s co-generation approach generalizes to non-web desktop environments or other interaction modalities is not addressed in the current work [1].

FAQ

Q. What training algorithm was used with CUA-Gym data, and is it required? The paper reports results using GSPO to train the CUA-Gym-A3B and CUA-Gym-A17B models [1]. The sources do not specify whether the dataset is compatible with other RLVR training algorithms or whether GSPO is a hard requirement.

Q. How does CUA-Gym-Hub differ from existing web benchmarks? CUA-Gym-Hub is described as a suite of mock web applications synthesized to reflect real-world software-use distributions, rather than a set of live or recorded web interactions [1]. It was created specifically to expand the number of available training environments, which the researchers identified as a separate bottleneck from training data volume.

Q. Does the dataset transfer to environments outside the 110 included ones? The WebArena benchmark results suggest some degree of transfer beyond the training environments, but the sources do not quantify how broadly that transfer extends or which environment types benefit most [1].

Q. When will the pipeline, dataset, and models be publicly available? The paper states the researchers will open-source the full synthesis pipeline, dataset, CUA-Gym-Hub environments, and models, but the sources do not specify a release date or hosting location [1].

Q. What distinguishes CUA-Gym’s verification from LLM-as-judge approaches? CUA-Gym combines a Generator-Discriminator adversarial loop during generation with a final filter using LLM majority voting and agent rollouts [1]. The paper positions this as more reliable than LLM-as-judge datasets, which the authors describe as scaling broadly but lacking reliable verification.

Key takeaways

CUA-Gym produces 32,112 verified RLVR training tuples across 110 environments, addressing the scarcity of deterministic training signals for computer-use agents [1].
A three-agent architecture (Generator, Discriminator, orchestrator) co-generates task instructions, environment states, and reward functions through iterative execution rounds, enforcing internal consistency before filtering [1].
A final quality filter combining LLM majority voting and agent rollouts provides an additional verification layer beyond the per-task adversarial loop [1].
Models trained on CUA-Gym data with GSPO reached 62.1% (CUA-Gym-A3B) and 72.6% (CUA-Gym-A17B) on OSWorld-Verified, outperforming prior open-source agents at comparable sizes, with transfer gains also observed on WebArena [1].
The full pipeline, dataset, CUA-Gym-Hub mock web application suite, and model checkpoints are planned for open-source release [1].