# GroupTravelBench Tests LLM Agents on Group Travel Planning

> Researchers have released GroupTravelBench, a benchmark designed to evaluate large language model agents on multi-person travel planning tasks. The benchmark comprises 650 synthesized tasks drawn from real user profiles and point-of-interest data, and tests three capabilities absent from prior single-user benchmarks: preference elicitation, conflict coordination, and group-utility planning.

- Canonical URL: https://agentry.press/research/grouptravelbench-tests-llm-agents-on-group-travel-planning/
- Type: Research
- Published: 2026-06-09
- By: agentry
- Tags: benchmarking, llm-agents, travel-planning, multi-agent, evaluation, planning

---

## What GroupTravelBench Is

GroupTravelBench is a benchmark designed to evaluate large language model agents on multi-person, multi-turn travel planning tasks [1]. The release targets a specific gap in existing evaluation infrastructure: prior travel planning benchmarks assume a single user, which sidesteps the conflict-resolution and preference-aggregation challenges that arise when multiple people plan a trip together [1].

The benchmark comprises 650 synthesized tasks divided into three difficulty levels. Its authors describe it as the first benchmark focused explicitly on multi-user, multi-turn travel planning [1].

## Three Core Capabilities Under Evaluation

GroupTravelBench tests three capabilities that do not appear in standard single-user travel planning evaluations.

The first is elicitation. Agents must proactively engage in multi-turn dialogue to gather preferences from each individual user in the group. This requires the agent to initiate and sustain conversation rather than simply respond to a fully specified query [1].
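To make the elicitation requirement concrete, here is a minimal sketch of an agent questioning each group member in turn. It is illustrative only: the slot names, the `SimulatedUser` class, and the `elicit` function are assumptions made for this example and are not taken from the benchmark.

```python
# Hypothetical preference slots; the benchmark's actual schema is not described in the source.
SLOTS = ("budget", "pace", "must_see")


class SimulatedUser:
    """A stand-in group member whose preferences stay hidden until asked."""

    def __init__(self, name, profile):
        self.name = name
        self._profile = profile

    def answer(self, slot):
        # None means the user has no stated preference for this slot.
        return self._profile.get(slot)


def elicit(users, slots=SLOTS, max_turns=20):
    """Ask every user about every slot, stopping once the turn budget is spent."""
    known = {user.name: {} for user in users}
    turns = 0
    for user in users:
        for slot in slots:
            if turns >= max_turns:
                return known, turns
            turns += 1
            reply = user.answer(slot)
            if reply is not None:
                known[user.name][slot] = reply
    return known, turns


users = [
    SimulatedUser("ana", {"budget": 100, "pace": "slow"}),
    SimulatedUser("ben", {"must_see": "beach"}),
]
known, turns = elicit(users)
```

The turn budget matters in this sketch: an agent that stops asking too early ends up planning with an incomplete picture of the group's preferences.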

The second is coordination. Once preferences are collected, agents must resolve conflicts among users. The benchmark recognizes two resolution strategies: compromise, where all users accept a middle-ground option, and subgrouping, where the group splits to accommodate incompatible preferences [1].
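The two strategies can be sketched in a few lines. This is a toy illustration under stated assumptions: the ratings, the acceptance threshold, and the `compromise` and `subgroup` functions are hypothetical and do not reproduce how the benchmark itself models or scores conflict resolution.

```python
from statistics import mean

# Hypothetical ratings: how much each user likes each option, from 0 to 1.
ratings = {
    "ana": {"museum": 0.9, "beach": 0.2, "market": 0.6},
    "ben": {"museum": 0.1, "beach": 0.9, "market": 0.6},
    "cho": {"museum": 0.8, "beach": 0.3, "market": 0.5},
}


def compromise(ratings, threshold=0.5):
    """Pick the option that every user rates at least `threshold`.

    Among acceptable options, the one with the highest mean rating wins.
    Returns None when no option is acceptable to everyone.
    """
    options = list(next(iter(ratings.values())))
    acceptable = [
        option for option in options
        if all(user[option] >= threshold for user in ratings.values())
    ]
    if not acceptable:
        return None
    return max(acceptable, key=lambda o: mean(user[o] for user in ratings.values()))


def subgroup(ratings):
    """Split the group so that each user joins their own top-rated option."""
    groups = {}
    for name, user in ratings.items():
        favourite = max(user, key=user.get)
        groups.setdefault(favourite, []).append(name)
    return groups


shared_choice = compromise(ratings)   # one option for the whole group
split_choice = subgroup(ratings)      # one option per subgroup
```

With these ratings the market is the only option everyone tolerates, while subgrouping sends two users to the museum and one to the beach. Raising the threshold removes the compromise entirely, which is the situation where splitting the group becomes the remaining option.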

The third is planning. Agents must search for travel plans that maximize overall group utility while simultaneously maintaining fairness across group members and ensuring feasibility given real-world constraints [1]. Each of these dimensions extends beyond the multi-step reasoning and tool-use abilities that single-user benchmarks already probe.
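One way to picture the trade-off among utility, fairness, and feasibility is a small exhaustive search. The sketch below assumes a scoring rule of mean utility minus a penalty on the gap between the best-off and worst-off user; that rule, the point-of-interest data, and every function name are assumptions for illustration, not the benchmark's objective.

```python
from itertools import combinations

# Hypothetical points of interest with a per-person cost and a duration in hours.
pois = {
    "museum": {"cost": 20, "hours": 3},
    "beach": {"cost": 0, "hours": 4},
    "market": {"cost": 10, "hours": 2},
    "tower": {"cost": 30, "hours": 2},
}
# Hypothetical per-user utility for visiting each point of interest.
prefs = {
    "ana": {"museum": 0.9, "beach": 0.2, "market": 0.6, "tower": 0.7},
    "ben": {"museum": 0.1, "beach": 0.9, "market": 0.6, "tower": 0.4},
}


def feasible(plan, pois, budget, max_hours):
    """A plan is feasible if it fits both the budget and the time limit."""
    cost = sum(pois[p]["cost"] for p in plan)
    hours = sum(pois[p]["hours"] for p in plan)
    return cost <= budget and hours <= max_hours


def score(plan, prefs, fairness_weight):
    """Mean utility across users, penalised by the gap between best and worst off."""
    utils = [sum(user[p] for p in plan) for user in prefs.values()]
    return sum(utils) / len(utils) - fairness_weight * (max(utils) - min(utils))


def best_plan(pois, prefs, budget, max_hours, fairness_weight=0.5):
    """Exhaustively search all subsets of points of interest (fine for tiny inputs)."""
    best, best_score = (), float("-inf")
    for size in range(1, len(pois) + 1):
        for plan in combinations(sorted(pois), size):
            if not feasible(plan, pois, budget, max_hours):
                continue
            plan_score = score(plan, prefs, fairness_weight)
            if plan_score > best_score:
                best, best_score = plan, plan_score
    return best, best_score


plan, plan_score = best_plan(pois, prefs, budget=40, max_hours=8)
```

Exhaustive search only works here because the example has four options. A realistic task, with many candidate stops and constraints, is the kind of search the benchmark asks agents to handle through reasoning and tool use.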

## Benchmark Construction and Sandbox Environment

The 650 tasks in GroupTravelBench were synthesized from real user profiles, point-of-interest data, and ticket price data [1]. Grounding the tasks in real-world data sources is intended to make the scenarios representative of actual planning conditions rather than purely hypothetical.

To support reliable evaluation without requiring live API calls, the researchers built an interactive offline sandbox environment. The sandbox uses cached real-world tool data, allowing agents to invoke tools during multi-turn dialogue while keeping evaluation reproducible across runs [1]. This design choice addresses a common challenge in agent benchmarking: live external services introduce variability that can make repeated evaluation inconsistent.
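The general idea of replaying cached tool results can be sketched as follows. This is not the benchmark's sandbox: the `CachedToolSandbox` class, the `search_poi` tool name, and the cached record are all hypothetical, chosen only to show why a cache keyed on the tool name and its arguments yields identical answers on every run.

```python
import json


class CachedToolSandbox:
    """Serve tool calls from a fixed cache instead of a live service."""

    def __init__(self, cache):
        self._cache = cache
        self.call_log = []  # records every call so a run can be audited afterwards

    @staticmethod
    def make_key(tool, args):
        # Sorting keys makes the lookup independent of argument order.
        return f"{tool}:{json.dumps(args, sort_keys=True)}"

    def call(self, tool, **args):
        key = self.make_key(tool, args)
        self.call_log.append(key)
        if key not in self._cache:
            # Deterministic failure: a cache miss never falls back to the network.
            return {"error": "no cached result for this query"}
        return self._cache[key]


# Hypothetical cached record standing in for real tool data.
cache = {
    CachedToolSandbox.make_key(
        "search_poi", {"city": "Lisbon", "category": "museum"}
    ): {"results": [{"name": "Example Museum", "ticket_price": 12}]},
}

sandbox = CachedToolSandbox(cache)
first = sandbox.call("search_poi", category="museum", city="Lisbon")
second = sandbox.call("search_poi", city="Lisbon", category="museum")
```

Both calls return the same record regardless of argument order, and a query outside the cache returns a fixed error rather than a variable live response.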

## Difficulty Levels and Evaluation Methodology

The benchmark's 650 tasks are organized into three difficulty tiers [1]. The source does not detail the precise criteria distinguishing each tier, but the tiered structure is intended to capture a range of planning complexity.

The evaluation metrics include preference coverage and group fairness [1]. The agents' multi-turn dialogue is also assessed, which corresponds to the benchmark's elicitation dimension.
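The source does not give formulas for these metrics, so the following is only a plausible reading rather than the benchmark's definitions. It assumes preference coverage is the share of stated preferences that a plan satisfies, and it substitutes Jain's fairness index, a standard measure from resource allocation, for whatever fairness measure the authors use. The example preferences are made up.

```python
def per_user_coverage(stated, plan_features):
    """Fraction of each user's stated preferences that the plan satisfies."""
    return {
        name: (len(wants & plan_features) / len(wants)) if wants else 1.0
        for name, wants in stated.items()
    }


def preference_coverage(stated, plan_features):
    """Fraction of all stated preferences, pooled across users, that the plan satisfies."""
    total = sum(len(wants) for wants in stated.values())
    if total == 0:
        return 1.0
    met = sum(len(wants & plan_features) for wants in stated.values())
    return met / total


def jain_fairness(values):
    """Jain's index: 1.0 when all values are equal, 1/n when one user gets everything."""
    values = list(values)
    squares = sum(v * v for v in values)
    if squares == 0:
        return 1.0  # nobody received anything, which is equal by convention here
    return sum(values) ** 2 / (len(values) * squares)


# Hypothetical stated preferences and the features of one candidate plan.
stated = {
    "ana": {"museum", "vegetarian"},
    "ben": {"beach", "vegetarian"},
    "cho": {"nightlife"},
}
plan_features = {"museum", "vegetarian", "beach"}

coverage = preference_coverage(stated, plan_features)
fairness = jain_fairness(per_user_coverage(stated, plan_features).values())
```

In this example the plan meets four of the five stated preferences, yet one user gets nothing they asked for. Pooled coverage looks healthy while the fairness score drops, which shows why the two dimensions are measured separately.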

## How Frontier Models Performed

The researchers evaluated a wide range of LLMs on GroupTravelBench, including frontier models. The results showed that even leading models exhibit substantial weaknesses on the benchmark's core metrics [1]. Preference coverage and group fairness were the two areas where frontier models fell most noticeably short, suggesting that current LLMs handle single-user planning better than they handle the aggregation and conflict-resolution demands of group scenarios [1].

The source does not provide specific numeric scores for individual models, but the characterization of the shortfalls as "substantial" indicates the gap is not marginal.

## Implications for Agent Benchmarking

GroupTravelBench surfaces a category of multi-stakeholder planning that existing agent benchmarks do not systematically address. The benchmark's construction, combining real data sources with an offline sandbox, is designed to make it both practical and reproducible for the research community [1].

The findings suggest that improving LLM agent performance on group travel planning will likely require advances in proactive dialogue management, conflict-resolution reasoning, and fairness-aware optimization. These are capabilities that single-user benchmarks, by design, do not demand, meaning that high scores on prior travel planning evaluations do not transfer directly to the multi-user setting [1].

## FAQ

**Q. Does GroupTravelBench require live internet access or API calls to run evaluations?**
No. The benchmark includes an interactive offline sandbox environment with cached real-world tool data, enabling reliable and reproducible evaluation without live external service calls [1].

**Q. What kinds of conflict-resolution strategies does the benchmark recognize?**
The benchmark recognizes two strategies: compromise, where all group members accept a shared middle-ground option, and subgrouping, where the party splits to accommodate incompatible preferences [1].

**Q. How many tasks are in the benchmark, and how are they organized?**
The benchmark contains 650 synthesized tasks divided into three difficulty levels [1]. Tasks were constructed from real user profiles, point-of-interest data, and ticket price data.

**Q. Do high scores on existing single-user travel planning benchmarks predict performance on GroupTravelBench?**
The source indicates that GroupTravelBench tests capabilities absent from single-user benchmarks, specifically elicitation, coordination, and group-utility planning, so performance on prior benchmarks does not directly predict results here [1].

**Q. Which specific models were tested, and what scores did they achieve?**
The source states that a wide range of LLMs, including frontier models, were evaluated and showed substantial weaknesses in preference coverage and group fairness, but does not provide individual model names or numeric scores in the available material [1].

## Key takeaways

- GroupTravelBench is a 650-task benchmark targeting multi-user, multi-turn travel planning, a scenario not covered by existing single-user travel planning evaluations [1].
- The benchmark evaluates three distinct capabilities: preference elicitation through multi-turn dialogue, conflict coordination via compromise or subgrouping, and group-utility planning with fairness constraints [1].
- An offline sandbox with cached real-world tool data enables reproducible agent evaluation without dependence on live external services [1].
- Frontier LLMs showed substantial weaknesses specifically on preference coverage and group fairness metrics, the two dimensions most tied to multi-stakeholder dynamics [1].
- The benchmark is grounded in real user profiles, point-of-interest data, and ticket price data, connecting synthesized tasks to realistic planning conditions [1].

## References

1. https://arxiv.org/abs/2605.25200v1
