# ProCodeBench Exposes Gap Between Simulated and Real Developer Traces

> Researchers collected real IDE interaction traces from 1,246 experienced industry developers over three days using a custom Visual Studio Code extension, then compared those traces against LLM-simulated equivalents. The study found simulated traces differ substantially from real developer behavior in diversity, temporal structure, and exploratory patterns, and introduces ProCodeBench, a benchmark for evaluating proactive intent prediction under real-world conditions.

- Canonical URL: https://agentry.press/research/procodebench-exposes-gap-between-simulated-and-real-developer-traces/
- Type: Research
- Published: 2026-06-09
- By: agentry
- Tags: benchmarks, coding-assistants, developer-tools, llm-evaluation, ide, agent-engineering

---

## The Reactive-to-Proactive Gap in Coding Assistants

Most LLM-based coding assistants operate on a reactive model: a developer must stop, formulate a request, and submit it before the system responds. Proactive coding assistants are designed to invert that dynamic by inferring developer intent from IDE interactions and repository context, reducing the overhead of explicit prompting and enabling more continuous assistance [1].

The research challenge is that building and evaluating such systems requires large volumes of realistic developer behavior data. Because that data is scarce, many studies have substituted LLM-simulated IDE interaction traces for real ones. A new empirical study now quantifies how consequential that substitution is.

## How the Study Was Conducted

Researchers recruited 1,246 experienced industry developers and observed their work over three consecutive days. A custom Visual Studio Code extension captured IDE interaction traces during normal development activity, producing a dataset grounded in authentic professional behavior [1].

To enable direct comparison, the team also constructed paired LLM-simulated traces for the same scenarios. The controlled pairing allowed the researchers to isolate specific dimensions where simulation diverges from reality, rather than attributing differences to variation in task type or developer context [1].

## Simulation vs. Reality: Where the Traces Diverge

The analysis identified three primary dimensions of divergence between simulated and real traces.

First, behavioral diversity: real developer traces exhibited a wider range of interaction patterns than simulated equivalents, suggesting that LLMs generating synthetic traces converge on a narrower set of plausible behaviors than developers actually produce [1].

Second, temporal structure: the timing and sequencing of actions in real traces differed from simulated ones. Developers do not interact with an IDE in the uniform, orderly cadence that simulation tends to produce [1].

Third, exploratory patterns: real developers engage in more varied and less predictable exploration of codebases than simulated traces capture. This dimension is particularly relevant for proactive assistants, which must anticipate intent during open-ended investigation rather than well-defined task execution [1].

Taken together, these gaps mean that a system trained or evaluated exclusively on simulated traces is exposed to a systematically narrower and more structured version of developer behavior than it will encounter in production.
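To make the first two dimensions concrete, the following is a minimal sketch of how diversity and timing irregularity could be quantified for a single trace. The metrics (Shannon entropy over action types, coefficient of variation of inter-event gaps), the action names, and the toy traces are all assumptions chosen for illustration; the source does not state which measures the study used.

```python
import math
from collections import Counter
from statistics import mean, pstdev


def action_entropy(actions):
    """Shannon entropy (in bits) of the action-type distribution in one trace."""
    counts = Counter(actions)
    total = len(actions)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def gap_variation(timestamps):
    """Coefficient of variation of the gaps between consecutive events.

    Returns 0.0 for perfectly even spacing or for traces too short to have gaps.
    """
    if len(timestamps) < 2:
        return 0.0
    gaps = [later - earlier for earlier, later in zip(timestamps, timestamps[1:])]
    average = mean(gaps)
    return pstdev(gaps) / average if average > 0 else 0.0


# Hypothetical toy traces: action types and timestamps in seconds.
simulated_actions = ["open", "edit", "save", "open", "edit", "save"]
simulated_times = [0, 10, 20, 30, 40, 50]
real_actions = ["open", "scroll", "search", "edit", "undo", "run"]
real_times = [0, 2, 3, 45, 46, 120]

print("entropy:", action_entropy(simulated_actions), action_entropy(real_actions))
print("gap variation:", gap_variation(simulated_times), gap_variation(real_times))
```

On these toy inputs the evenly spaced, repetitive trace scores lower entropy and zero gap variation, which is the kind of narrowing the study attributes to simulated traces.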

## ProCodeBench: A Real-World Benchmark for Intent Prediction

From the collected interaction data, the researchers constructed ProCodeBench, a benchmark specifically designed to evaluate proactive intent prediction under real-world conditions [1]. The benchmark is grounded in actual developer traces rather than synthetic approximations, which distinguishes it from prior evaluation frameworks that relied on simulation.

The target task is proactive intent prediction: given an observed sequence of IDE interactions, a system must infer what the developer intends to do next without waiting for an explicit request. ProCodeBench provides the real behavioral signal needed to assess whether a model's predictions hold up against the full diversity, temporal irregularity, and exploratory character of genuine development sessions [1].
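As a rough illustration of the task shape, the sketch below models an example as an observed event history plus the intent that followed, and scores a predictor by top-1 accuracy. The `Event` and `Example` schemas, the intent labels, the baseline, and the metric are hypothetical; the source does not describe ProCodeBench's actual data format or scoring.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Event:
    timestamp: float  # seconds since session start (assumed unit)
    action: str       # hypothetical action type, e.g. "open_file" or "edit"
    target: str       # file or symbol the action touched


@dataclass
class Example:
    history: List[Event]  # interactions observed so far
    intent: str           # hypothetical label for what the developer did next


def most_frequent_action_baseline(history: List[Event]) -> str:
    """Trivial predictor: guess the action seen most often in the history."""
    if not history:
        return "unknown"
    return Counter(event.action for event in history).most_common(1)[0][0]


def top1_accuracy(predict: Callable[[List[Event]], str], examples: List[Example]) -> float:
    """Fraction of examples where the predicted intent matches the label."""
    if not examples:
        return 0.0
    hits = sum(predict(example.history) == example.intent for example in examples)
    return hits / len(examples)


examples = [
    Example(
        history=[
            Event(0.0, "open_file", "a.py"),
            Event(4.2, "edit", "a.py"),
            Event(9.8, "edit", "a.py"),
        ],
        intent="edit",
    ),
    Example(
        history=[Event(0.0, "open_file", "b.py"), Event(1.5, "search", "foo")],
        intent="run_tests",
    ),
]

print(top1_accuracy(most_frequent_action_baseline, examples))
```

A real system would replace the baseline with a model call; the harness only needs a function from history to predicted intent.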

## Performance of Current Approaches on Real Traces

Experiments on ProCodeBench covered representative LLMs, retrieval-augmented methods, and agentic baselines. In every category, current approaches fell well short of reliable performance when evaluated against real IDE traces [1].

The results carry a direct implication for teams that have benchmarked proactive assistant systems using simulation-based evaluations: those evaluations are likely to overestimate real-world performance. The gap between simulated and real traces is not a minor calibration issue but a structural one that affects how well any model generalizes to production conditions [1].

The training component of the study added a further finding. Simulated data cannot replace real developer data when fine-tuning proactive assistants, but it can complement real data when used as a prior stage before fine-tuning on real-world examples [1]. That sequencing matters for teams with limited access to real interaction logs.
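The ordering described above can be sketched with a deliberately tiny stand-in model. This is not the study's training procedure: the count-based next-action predictor, the `weight` parameter, and the toy traces are assumptions used only to show simulated data applied first and real data applied second.

```python
from collections import Counter, defaultdict


class NextActionModel:
    """Toy bigram model: predicts the next action from the previous one."""

    def __init__(self):
        self.counts = defaultdict(Counter)

    def update(self, traces, weight=1.0):
        """Add weighted transition counts from a list of action sequences."""
        for trace in traces:
            for previous, following in zip(trace, trace[1:]):
                self.counts[previous][following] += weight

    def predict(self, previous):
        """Return the highest-weighted next action, or None if unseen."""
        options = self.counts.get(previous)
        if not options:
            return None
        return max(options, key=options.get)


# Hypothetical data: plentiful, orderly simulated traces and one messier real trace.
simulated_traces = [["open", "edit", "save"]] * 3
real_traces = [["open", "search", "edit", "undo", "edit"]]

model = NextActionModel()
model.update(simulated_traces, weight=1.0)  # stage 1: warm up on simulated data
model.update(real_traces, weight=5.0)       # stage 2: adapt on real data (weight is an assumption)

print(model.predict("open"), model.predict("edit"))
```

In this toy setup the second stage overrides the simulated prior wherever real behavior disagrees with it, while transitions seen only in simulation remain available as a fallback.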

## Implications for Proactive Assistant Research and Deployment

For teams building or evaluating proactive coding assistants, the study's findings translate into concrete operational considerations.

Evaluation pipelines that rely solely on LLM-simulated traces should be treated as upper-bound estimates rather than reliable proxies for production performance. ProCodeBench provides an alternative grounded in real developer behavior, and adopting it as an evaluation target would give teams a more accurate picture of where their systems stand [1].

On the training side, the complementarity finding offers a practical path for organizations that cannot easily collect large volumes of real interaction data. Using simulated data in a warm-up stage and then fine-tuning on whatever real traces are available outperforms relying on simulation alone [1]. Teams with access to developer telemetry from IDE integrations are positioned to close the gap further.

More broadly, the study reinforces the importance of instrumenting real development environments to generate ground-truth behavioral data. The 1,246-developer, three-day collection effort required a purpose-built Visual Studio Code extension, a level of infrastructure investment that points to a broader need for shared, real-world datasets in the proactive assistant research community [1].

## FAQ

**Q. Can teams use ProCodeBench without access to the raw interaction traces?**
The paper introduces ProCodeBench as a benchmark constructed from the collected data, but the source does not specify the distribution terms or access mechanism for the dataset. Teams interested in using it should consult the associated publication for availability details [1].

**Q. Does the complementarity finding mean simulated data is safe to use for initial model development?**
The study found that simulated data can complement real data when used before real-world fine-tuning, but it cannot replace real data. A model trained and evaluated only on simulated traces is likely to look stronger than it will prove against actual developer behavior [1].

**Q. Which model categories were tested on ProCodeBench?**
The experiments covered representative LLMs, retrieval-augmented methods, and agentic baselines. The source does not name specific models or report individual scores beyond the finding that all categories fell far short of reliable performance on real traces [1].

**Q. Is the simulation-to-reality gap uniform across all types of development tasks?**
The study identifies divergence across behavioral diversity, temporal structure, and exploratory patterns, but the source does not break down the gap by specific task type or programming language. The findings are reported at the level of overall trace comparison [1].

**Q. What IDE does ProCodeBench target, and does that limit its applicability?**
Data collection used a custom Visual Studio Code extension. The source does not address whether the benchmark generalizes to other IDEs or whether the behavioral patterns observed are specific to Visual Studio Code users [1].

## Key takeaways

- Real IDE interaction traces from 1,246 industry developers diverge from LLM-simulated equivalents across behavioral diversity, temporal structure, and exploratory patterns, making simulation an unreliable stand-in for production evaluation [1].
- ProCodeBench is a benchmark built from real developer traces, designed to evaluate proactive intent prediction under conditions that reflect actual development behavior rather than synthetic approximations [1].
- Current LLMs, retrieval-augmented methods, and agentic baselines all fall far short of reliable performance on ProCodeBench, indicating that simulation-based evaluations have been overestimating real-world capability [1].
- Simulated data cannot replace real developer data for training proactive assistants, but can serve as a useful complement when applied before fine-tuning on real-world examples [1].
- Teams deploying or evaluating proactive coding assistants should treat simulation-only benchmarks as upper-bound estimates and prioritize evaluation against real interaction data where possible [1].

## References

1. https://arxiv.org/abs/2605.05700v1
