Eval Harness for a CrewAI Agent Using Braintrust and Verifiable Rewards

Why this matters

CUA-Gym [1] demonstrated that pairing a Generator agent with a separate Discriminator agent that writes verifiable reward functions produces training data far more reliable than LLM-as-judge scoring. The key insight is that deterministic rewards, ones that execute against a known ground-truth state, eliminate the ambiguity that makes LLM-graded evals hard to trust in CI. The same pattern translates directly to evaluation: instead of asking a model whether an agent’s answer is “good”, you write a reward function that checks whether the tool-call trajectory produced the right side effects.

Most published CrewAI tutorials stop at demos. Without an eval story, teams discover regressions only when a customer reports them. This tutorial wires a CrewAI research workflow to a Braintrust eval harness where each test case carries its own reward function, producing a score between 0 and 1 that is fully reproducible and diffable across commits.

Prerequisites

Python 3.11 or 3.12
A Braintrust API key (free tier works; set as BRAINTRUST_API_KEY)
An OpenAI API key (set as OPENAI_API_KEY)
Basic familiarity with CrewAI agents and tasks

Setup

Install the required packages. braintrust provides the eval SDK; autoevals ships scoring helpers; crewai and crewai-tools provide the agent framework.

uv pip install crewai crewai-tools braintrust autoevals openai

Export your API keys. The blocks that actually call OpenAI or Braintrust are marked skip_execution_reason so the sandbox does not fail on missing credentials. All structural and reward-function blocks run without keys.

export BRAINTRUST_API_KEY="your-braintrust-key-here"
export OPENAI_API_KEY="your-openai-key-here"

Step 1: Define the CrewAI workflow

The workflow models a two-agent research pipeline: a researcher agent that calls a web-search tool and a summariser agent that condenses the findings. Only the researcher has a tool. Each call it makes is recorded by the tool itself into a shared trajectory list, which the reward functions inspect later.

The tool is a stub that returns deterministic fake results so the structural tests run without network access.

# filename: crew_workflow.py
from __future__ import annotations

from typing import Any

from crewai import Agent, Crew, Task
from crewai.tools import BaseTool
from pydantic import BaseModel, Field


# ---------------------------------------------------------------------------
# Shared trajectory collector
# ---------------------------------------------------------------------------

class TrajectoryCollector:
    """Accumulates tool-call events produced during a crew run."""

    def __init__(self) -> None:
        self.events: list[dict[str, Any]] = []

    def record(self, tool_name: str, input_data: Any, output: Any) -> None:
        self.events.append(
            {"tool": tool_name, "input": input_data, "output": output}
        )

    def reset(self) -> None:
        self.events.clear()


COLLECTOR = TrajectoryCollector()


# ---------------------------------------------------------------------------
# Stub search tool
# ---------------------------------------------------------------------------

class SearchInput(BaseModel):
    query: str = Field(description="The search query string.")


class StubSearchTool(BaseTool):
    name: str = "web_search"
    description: str = "Search the web for information about a topic."
    args_schema: type[BaseModel] = SearchInput
    # Injected at construction time so tests can swap results
    fake_results: dict[str, str] = Field(default_factory=dict)

    def _run(self, query: str) -> str:  # type: ignore[override]
        result = self.fake_results.get(
            query,
            f"[stub] No result configured for query: {query}",
        )
        COLLECTOR.record("web_search", {"query": query}, result)
        return result


# ---------------------------------------------------------------------------
# Agent and crew factory
# ---------------------------------------------------------------------------

def build_crew(fake_results: dict[str, str] | None = None) -> Crew:
    """Return a Crew whose agents share the global COLLECTOR."""
    search_tool = StubSearchTool(fake_results=fake_results or {})

    researcher = Agent(
        role="Research Analyst",
        goal="Find accurate information on the requested topic.",
        backstory="You are a meticulous researcher who always cites sources.",
        tools=[search_tool],
        verbose=False,
        # Model is injected lazily by CrewAI from env; no client constructed here
        llm="gpt-4o-mini",
    )

    summariser = Agent(
        role="Content Summariser",
        goal="Produce a concise, accurate summary of research findings.",
        backstory="You distil complex information into clear prose.",
        tools=[],
        verbose=False,
        llm="gpt-4o-mini",
    )

    research_task = Task(
        description=(
            "Search for information about '{topic}' and return the raw findings."
        ),
        expected_output="Raw search results relevant to the topic.",
        agent=researcher,
    )

    summary_task = Task(
        description="Summarise the research findings in three bullet points.",
        expected_output="A three-bullet-point summary.",
        agent=summariser,
        context=[research_task],
    )

    return Crew(
        agents=[researcher, summariser],
        tasks=[research_task, summary_task],
        verbose=False,
    )

Step 2: Write the reward functions

This is the core of the CUA-Gym discriminator pattern [1]. Each reward function is a plain Python callable that accepts the agent’s output and the recorded trajectory, then returns a float in [0, 1]. Deterministic checks (did the right tool fire? did the output contain required strings?) replace LLM-as-judge scoring for the cases where ground truth is knowable.

# filename: rewards.py
from __future__ import annotations

import re
from typing import Any


def reward_tool_called(tool_name: str):
    """Return a reward function that checks whether a specific tool was called."""

    def _reward(output: str, trajectory: list[dict[str, Any]]) -> float:
        called = any(e["tool"] == tool_name for e in trajectory)
        return 1.0 if called else 0.0

    _reward.__name__ = f"reward_tool_called_{tool_name}"
    return _reward


def reward_query_contains(substring: str):
    """Return a reward function checking that a tool input contained a substring."""

    def _reward(output: str, trajectory: list[dict[str, Any]]) -> float:
        for event in trajectory:
            inp = event.get("input", {})
            query = inp.get("query", "") if isinstance(inp, dict) else str(inp)
            if substring.lower() in query.lower():
                return 1.0
        return 0.0

    _reward.__name__ = f"reward_query_contains_{substring}"
    return _reward


def reward_output_contains(substring: str):
    """Return a reward function checking that the final output contains a substring."""

    def _reward(output: str, trajectory: list[dict[str, Any]]) -> float:
        return 1.0 if substring.lower() in output.lower() else 0.0

    _reward.__name__ = f"reward_output_contains_{substring}"
    return _reward


def reward_bullet_count(expected: int):
    """Return a reward function that checks the number of bullet points."""

    def _reward(output: str, trajectory: list[dict[str, Any]]) -> float:
        bullets = re.findall(r"^\s*[-*•]", output, re.MULTILINE)
        count = len(bullets)
        if count == expected:
            return 1.0
        # Partial credit: 0.5 if within one bullet of expected
        if abs(count - expected) == 1:
            return 0.5
        return 0.0

    _reward.__name__ = f"reward_bullet_count_{expected}"
    return _reward


def composite_reward(reward_fns: list, weights: list[float] | None = None):
    """Weighted average of multiple reward functions."""
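    if weights is None:
        weights = [1.0] * len(reward_fns)
    # Raise instead of assert so the check survives `python -O`
    if len(reward_fns) != len(weights):
        raise ValueError("reward_fns and weights must have the same length")
    total_weight = sum(weights)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")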

    def _reward(output: str, trajectory: list[dict[str, Any]]) -> float:
        score = sum(
            w * fn(output, trajectory)
            for fn, w in zip(reward_fns, weights)
        )
        return score / total_weight

    return _reward

Verify the reward functions work in isolation before wiring them to Braintrust.

from rewards import (
    reward_tool_called,
    reward_query_contains,
    reward_output_contains,
    reward_bullet_count,
    composite_reward,
)

fake_trajectory = [
    {"tool": "web_search", "input": {"query": "climate change 2024"}, "output": "Some results"}
]
fake_output = "- Point one\n- Point two\n- Point three"

assert reward_tool_called("web_search")(fake_output, fake_trajectory) == 1.0
assert reward_tool_called("missing_tool")(fake_output, fake_trajectory) == 0.0
assert reward_query_contains("climate")(fake_output, fake_trajectory) == 1.0
assert reward_query_contains("quantum")(fake_output, fake_trajectory) == 0.0
assert reward_bullet_count(3)(fake_output, fake_trajectory) == 1.0
assert reward_bullet_count(5)(fake_output, fake_trajectory) == 0.0
assert reward_bullet_count(4)(fake_output, fake_trajectory) == 0.5

combo = composite_reward(
    [reward_tool_called("web_search"), reward_bullet_count(3)],
    weights=[1.0, 1.0],
)
assert combo(fake_output, fake_trajectory) == 1.0

print("All reward function assertions passed.")

Step 3: Define the eval dataset

Each test case is a dictionary with:

input: the topic the crew will research
fake_results: the stub search results to inject
expected: a keyword that should appear in the final output (the same string passed to reward_output_contains)
reward_fn: the composite reward function for this case

This mirrors the CUA-Gym tuple structure (task instruction, environment state, reward function) [1] but expressed as plain Python rather than a database.

# filename: eval_dataset.py
from rewards import (
    composite_reward,
    reward_bullet_count,
    reward_output_contains,
    reward_query_contains,
    reward_tool_called,
)

DATASET = [
    {
        "input": {"topic": "climate change"},
        "fake_results": {
            "climate change": (
                "Global temperatures rose 1.1°C above pre-industrial levels in 2023. "
                "Arctic ice loss accelerated. Renewable energy capacity doubled."
            )
        },
        "expected": "climate",
        "reward_fn": composite_reward(
            [
                reward_tool_called("web_search"),
                reward_query_contains("climate"),
                reward_output_contains("climate"),
                reward_bullet_count(3),
            ],
            weights=[1.0, 1.0, 1.0, 2.0],
        ),
    },
    {
        "input": {"topic": "quantum computing"},
        "fake_results": {
            "quantum computing": (
                "IBM unveiled a 1000-qubit processor. Error correction improved. "
                "Commercial quantum advantage demonstrated for optimization problems."
            )
        },
        "expected": "quantum",
        "reward_fn": composite_reward(
            [
                reward_tool_called("web_search"),
                reward_query_contains("quantum"),
                reward_output_contains("quantum"),
                reward_bullet_count(3),
            ],
            weights=[1.0, 1.0, 1.0, 2.0],
        ),
    },
    {
        "input": {"topic": "large language models"},
        "fake_results": {
            "large language models": (
                "LLMs now exceed 1 trillion parameters. "
                "Inference costs dropped 10x year-over-year. "
                "Multimodal capabilities became standard."
            )
        },
        "expected": "language",
        "reward_fn": composite_reward(
            [
                reward_tool_called("web_search"),
                reward_query_contains("language"),
                reward_output_contains("language"),
                reward_bullet_count(3),
            ],
            weights=[1.0, 1.0, 1.0, 2.0],
        ),
    },
]
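
To make the effect of the [1.0, 1.0, 1.0, 2.0] weights concrete, the sketch below reproduces the weighted-average arithmetic that composite_reward performs for a few hand-picked component scores. weighted_score is a hypothetical helper written only for this illustration, and the scenarios are made up rather than measured.

```python
def weighted_score(components: list[float], weights: list[float]) -> float:
    """Weighted average, the same arithmetic composite_reward performs."""
    return sum(w * c for c, w in zip(components, weights)) / sum(weights)


# Order: tool_called, query_contains, output_contains, bullet_count
WEIGHTS = [1.0, 1.0, 1.0, 2.0]

all_pass = weighted_score([1.0, 1.0, 1.0, 1.0], WEIGHTS)      # 5 / 5
four_bullets = weighted_score([1.0, 1.0, 1.0, 0.5], WEIGHTS)  # 4 / 5, off by one bullet
no_bullets = weighted_score([1.0, 1.0, 1.0, 0.0], WEIGHTS)    # 3 / 5, prose instead of a list
no_tool_call = weighted_score([0.0, 0.0, 1.0, 1.0], WEIGHTS)  # 3 / 5, answered without searching

print(all_pass, four_bullets, no_bullets, no_tool_call)
```

Because the bullet check carries twice the weight of each other component, a formatting miss costs as much as skipping the search entirely. Adjust the weights if trajectory correctness matters more to you than output shape.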

Step 4: Build the Braintrust eval harness

Braintrust’s Eval function accepts a data iterable, a task callable, and a list of scores. The task runs the crew; the scores are thin wrappers that call the per-case reward function.

The key design decision: the reward function travels with the dataset row, so each test case is self-contained. Adding a new test case means adding one dictionary, not modifying a central scoring file.

# filename: eval_harness.py
from __future__ import annotations

import asyncio
from typing import Any

import braintrust
from braintrust import Eval

from crew_workflow import COLLECTOR, build_crew
from eval_dataset import DATASET


# ---------------------------------------------------------------------------
# Task: run the crew and return (output, trajectory)
# ---------------------------------------------------------------------------

def run_crew_task(input_data: dict[str, Any]) -> dict[str, Any]:
    """Execute the crew for one eval case and return output + trajectory."""
    COLLECTOR.reset()
    crew = build_crew(fake_results=input_data.get("fake_results", {}))
    result = crew.kickoff(inputs={"topic": input_data["input"]["topic"]})
    raw_output = str(result)
    return {"output": raw_output, "trajectory": list(COLLECTOR.events)}


# ---------------------------------------------------------------------------
# Score wrappers — Braintrust expects callables that accept (output, expected)
# ---------------------------------------------------------------------------

def make_trajectory_scorer(name: str):
    """Return a Braintrust-compatible scorer that unpacks trajectory from output."""

    def scorer(output: dict[str, Any], expected: Any) -> braintrust.Score:
        reward_fn = expected  # we pass the reward_fn as the 'expected' value
        score_value = reward_fn(
            output.get("output", ""),
            output.get("trajectory", []),
        )
        return braintrust.Score(name=name, score=score_value)

    scorer.__name__ = name
    return scorer


COMPOSITE_SCORER = make_trajectory_scorer("composite_reward")


# ---------------------------------------------------------------------------
# Braintrust Eval entry point
# ---------------------------------------------------------------------------

def build_eval_data():
    """Yield records with 'input' and 'expected' keys, the shape passed to Eval as data."""
    for case in DATASET:
        yield {
            "input": case,          # full case dict passed to run_crew_task
            "expected": case["reward_fn"],  # reward fn passed to scorer
        }


async def run_eval():
    await Eval(
        name="crewai-verifiable-rewards",
        data=list(build_eval_data()),
        task=run_crew_task,
        scores=[COMPOSITE_SCORER],
    )


if __name__ == "__main__":
    asyncio.run(run_eval())
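
In this harness the expected field carries a Python callable, so a dataset row is not plain data. If you ever need rows that round-trip through JSON (an assumption about your workflow, not something the harness above requires), one option is to reference each reward by name and resolve it at scoring time. REWARD_REGISTRY, register_reward and score_row are hypothetical names, and the lambda is a stand-in for a real composite_reward(...).

```python
import json
from typing import Any, Callable

RewardFn = Callable[[str, list[dict[str, Any]]], float]

# Hypothetical registry: maps a stable name to a reward function.
REWARD_REGISTRY: dict[str, RewardFn] = {}


def register_reward(name: str, fn: RewardFn) -> RewardFn:
    """Store fn under name and refuse silent overwrites."""
    if name in REWARD_REGISTRY:
        raise ValueError(f"duplicate reward name: {name!r}")
    REWARD_REGISTRY[name] = fn
    return fn


def score_row(row: dict[str, Any], output: str, trajectory: list[dict[str, Any]]) -> float:
    """Resolve the reward named in row['expected'], then apply it."""
    reward_fn = REWARD_REGISTRY[row["expected"]]
    return reward_fn(output, trajectory)


# Stand-in reward for the sketch; a real case would register a composite reward.
register_reward(
    "climate_case",
    lambda output, trajectory: 1.0 if "climate" in output.lower() else 0.0,
)

row = {"input": {"topic": "climate change"}, "expected": "climate_case"}
print(json.dumps(row))  # the row is now plain, serialisable data
print(score_row(row, "- Climate trends are shifting.", []))
```

The reward still travels with the row, only as a name rather than as a function object. The cost is one extra lookup and the need to keep registry names stable across commits.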

Step 5: Run the eval against Braintrust

With keys set, run the harness. Braintrust streams results to stdout and records them in your project dashboard.

python eval_harness.py

This block requires live API keys and is skipped in the sandbox.

Step 6: CI-friendly offline scoring

For pull-request checks you often want a fast pass/fail without hitting Braintrust’s API. The offline scorer runs the same crew and the same reward functions locally and exits non-zero if the mean score falls below a threshold. It still calls crew.kickoff(), so OPENAI_API_KEY must be set; only the Braintrust upload is skipped. Use this in your CI pipeline before the full Braintrust run.

# filename: offline_eval.py
from __future__ import annotations

import sys
from typing import Any

from crew_workflow import COLLECTOR, build_crew
from eval_dataset import DATASET

PASS_THRESHOLD = 0.6  # mean composite score required to pass


def run_case_offline(case: dict[str, Any]) -> float:
    COLLECTOR.reset()
    crew = build_crew(fake_results=case.get("fake_results", {}))
    result = crew.kickoff(inputs={"topic": case["input"]["topic"]})
    raw_output = str(result)
    trajectory = list(COLLECTOR.events)
    return case["reward_fn"](raw_output, trajectory)


def main() -> None:
    scores: list[float] = []
    for i, case in enumerate(DATASET):
        score = run_case_offline(case)
        topic = case["input"]["topic"]
        status = "PASS" if score >= PASS_THRESHOLD else "FAIL"
        print(f"[{status}] case {i+1} ({topic!r}): score={score:.3f}")
        scores.append(score)

    mean = sum(scores) / len(scores) if scores else 0.0
    print(f"\nMean score: {mean:.3f} (threshold: {PASS_THRESHOLD})")
    if mean < PASS_THRESHOLD:
        print("EVAL FAILED: mean score below threshold.")
        sys.exit(1)
    else:
        print("EVAL PASSED.")


if __name__ == "__main__":
    main()
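
Gating on the mean alone can hide a single case that fails completely. The sketch below works through that arithmetic with made-up scores and shows one possible stricter gate. gate and its floor parameter are hypothetical and not part of offline_eval.py.

```python
def gate(scores: list[float], mean_threshold: float = 0.6, floor: float = 0.0) -> bool:
    """Pass only if the mean clears mean_threshold and every case clears floor."""
    if not scores:
        return False  # mirrors offline_eval.py, where an empty run has mean 0.0
    mean = sum(scores) / len(scores)
    return mean >= mean_threshold and min(scores) >= floor


scores = [1.0, 1.0, 0.0]            # two perfect cases, one total failure
mean = sum(scores) / len(scores)     # about 0.667, above the 0.6 threshold
print(f"mean={mean:.3f}")
print(gate(scores))                  # mean-only gating lets the failure through
print(gate(scores, floor=0.5))       # a per-case floor catches it
```

With three cases the mean tolerates one zero. As the dataset grows, the mean tolerates proportionally more failing cases, so a per-case floor becomes more useful over time.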

Verify it works

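The verification block exercises the reward functions, the dataset, and the trajectory collector without any API keys (crewai must still be installed, because crew_workflow is imported). It uses a stubbed output and trajectory that simulate what the crew would produce, confirming that the reward pipeline is wired correctly before any model is called.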

from rewards import (
    composite_reward,
    reward_bullet_count,
    reward_output_contains,
    reward_query_contains,
    reward_tool_called,
)
from eval_dataset import DATASET
from crew_workflow import COLLECTOR, StubSearchTool, TrajectoryCollector

# Simulate what the crew would produce for the first test case
case = DATASET[0]
fake_trajectory = [
    {
        "tool": "web_search",
        "input": {"query": "climate change"},
        "output": case["fake_results"]["climate change"],
    }
]
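# The output must mention "climate" so reward_output_contains("climate") scores 1.0;
# without it the composite is exactly 0.8 and the strict > 0.8 assertion below fails.
fake_output = (
    "- Climate records show temperatures 1.1 degrees above pre-industrial levels.\n"
    "- Arctic ice loss accelerated significantly in 2023.\n"
    "- Renewable energy capacity doubled over the past decade."
)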

score = case["reward_fn"](fake_output, fake_trajectory)
print(f"Composite reward for climate case: {score:.3f}")
assert score > 0.8, f"Expected score > 0.8, got {score:.3f}"

# Confirm all three dataset cases have reward functions attached
for i, c in enumerate(DATASET):
    assert callable(c["reward_fn"]), f"Case {i} missing reward_fn"
    assert "input" in c and "topic" in c["input"], f"Case {i} missing topic"

# Confirm the trajectory collector resets cleanly
COLLECTOR.record("web_search", {"query": "test"}, "result")
assert len(COLLECTOR.events) == 1
COLLECTOR.reset()
assert len(COLLECTOR.events) == 0

print("End-to-end reward pipeline verified.")

Troubleshooting

ModuleNotFoundError: No module named 'crewai': Run uv pip install crewai crewai-tools braintrust autoevals openai from the same Python environment you use to run the scripts. CrewAI is a third-party package and must be installed into the active environment.

AuthenticationError from OpenAI during crew.kickoff(): The OPENAI_API_KEY environment variable is not set or is set to a placeholder. Export the real key before running the offline or Braintrust eval.

braintrust.Score import fails: Older Braintrust SDK versions used a different import path. Run uv pip install --upgrade braintrust to get the current SDK, which exposes braintrust.Score directly.

Crew produces output but reward_bullet_count always returns 0: CrewAI’s kickoff() return value is a CrewOutput object. The str() call in run_crew_task converts it to text, but the summariser agent may not use Markdown bullet syntax. Adjust the task’s expected_output field to explicitly request - bullet format, or relax the reward function to also match numbered lists.

Braintrust eval finishes without printing results: run_eval() is defined with async def, so calling run_eval() directly only creates a coroutine object and never executes it. Invoke it with asyncio.run(run_eval()), as the __main__ block in eval_harness.py does.

Scores are lower than expected for every case: StubSearchTool looks up fake_results by exact query string. If the researcher reformulates the query (e.g. “climate change 2024” instead of “climate change”), the stub returns the fallback string, the summariser has no findings to condense, and the output-based rewards drop even though web_search was called and reward_query_contains still matches. Either add the reformulated queries as keys in fake_results or make the stub lookup tolerant of variations.
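
For the query-reformulation problem, one way to make the stub more forgiving is to match configured keys as substrings of the incoming query. This is a sketch under that assumption, not the tutorial's implementation: lookup_fake_result is a hypothetical helper that could replace the self.fake_results.get(...) call inside StubSearchTool._run.

```python
def lookup_fake_result(query: str, fake_results: dict[str, str]) -> str:
    """Return the first configured result whose key appears in the query.

    Matching is case-insensitive; insertion order of fake_results decides ties.
    """
    normalised = query.lower()
    for key, result in fake_results.items():
        if key.lower() in normalised:
            return result
    # Same fallback text the original stub produces
    return f"[stub] No result configured for query: {query}"


fake_results = {"climate change": "Global temperatures rose in 2023."}
print(lookup_fake_result("Climate change 2024", fake_results))
print(lookup_fake_result("quantum computing", fake_results))
```

Substring matching trades precision for robustness: a short key such as "change" would match many unrelated queries, so keep the keys as specific as the test topics.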

Next steps

Add a regression gate to CI: run python offline_eval.py in your GitHub Actions workflow and fail the PR if the mean score drops below the threshold. Store per-commit scores as Braintrust experiment metadata to get a trend chart.
Replace stubs with recorded fixtures: capture real crew runs with COLLECTOR and serialize the trajectories to JSON. Replay them in tests without hitting OpenAI, giving you fast deterministic checks for trajectory shape.
Implement a Generator/Discriminator loop: following CUA-Gym [1], have a second agent automatically write reward functions for new task types by inspecting the task description and expected output schema. This scales the dataset without manual reward engineering.
Score tool-call ordering: extend the reward functions to check not just whether a tool was called but whether it was called before or after another tool, catching agent reasoning regressions that produce the right answer via the wrong path.
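
As a starting point for the ordering idea, the sketch below follows the same factory shape as the rewards in rewards.py. reward_tool_order is a hypothetical name, and fetch_page is an invented second tool used only to have something to order against; the crew in this tutorial has a single tool.

```python
from typing import Any


def reward_tool_order(first: str, second: str):
    """Hypothetical reward: 1.0 if `first` is first called before `second` is first called."""

    def _reward(output: str, trajectory: list[dict[str, Any]]) -> float:
        tools = [event["tool"] for event in trajectory]
        if first not in tools or second not in tools:
            return 0.0  # ordering is undefined unless both tools were used
        return 1.0 if tools.index(first) < tools.index(second) else 0.0

    _reward.__name__ = f"reward_tool_order_{first}_before_{second}"
    return _reward


trajectory = [
    {"tool": "web_search", "input": {"query": "climate change"}, "output": "..."},
    {"tool": "fetch_page", "input": {"url": "https://example.com"}, "output": "..."},
]
print(reward_tool_order("web_search", "fetch_page")("", trajectory))
print(reward_tool_order("fetch_page", "web_search")("", trajectory))
```

The check compares only the first occurrence of each tool. A stricter policy, such as requiring every call to one tool to precede every call to another, would need a different comparison.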

FAQ

How do verifiable rewards differ from LLM-as-judge scoring?

Verifiable rewards are deterministic functions that check concrete facts about the agent’s trajectory and output, such as whether a specific tool was called or whether the output contains required keywords. LLM-as-judge scoring asks a model to evaluate quality subjectively, introducing ambiguity and non-reproducibility. Verifiable rewards eliminate this ambiguity by comparing against known ground truth.

What does the trajectory collector record during a crew run?

The TrajectoryCollector records each tool call as a dictionary containing the tool name, input parameters, and output. These events are accumulated in a list that the reward functions inspect later to verify whether the agent called the right tools and passed the correct queries.
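
Because each event is a plain dictionary, a trajectory can be written out as JSON and replayed against reward functions later, provided the tool inputs and outputs are themselves JSON-compatible (true for the stub tool here, an assumption for other tools). dump_trajectory and load_trajectory are hypothetical helpers, shown as a minimal sketch.

```python
import json
from typing import Any

REQUIRED_KEYS = {"tool", "input", "output"}


def dump_trajectory(events: list[dict[str, Any]]) -> str:
    """Serialise recorded events to a stable, diff-friendly JSON string."""
    return json.dumps(events, indent=2, sort_keys=True)


def load_trajectory(text: str) -> list[dict[str, Any]]:
    """Parse a JSON trajectory and check each event has the expected keys."""
    events = json.loads(text)
    for index, event in enumerate(events):
        missing = REQUIRED_KEYS - set(event)
        if missing:
            raise ValueError(f"event {index} is missing keys: {sorted(missing)}")
    return events


events = [
    {"tool": "web_search", "input": {"query": "climate change"}, "output": "Some results"}
]
restored = load_trajectory(dump_trajectory(events))
print(restored == events)
```

Sorting keys and indenting the output keeps fixture files readable in code review, which matters when trajectories are committed alongside the tests that replay them.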

How can I run eval checks in CI without hitting the Braintrust API?

The offline_eval.py script runs the same reward functions locally and exits with a non-zero status if the mean composite score falls below a configurable threshold. This provides fast pass-fail checks suitable for pull-request gates before the full Braintrust run.

Why does the reward function travel with each dataset row?

Embedding the reward function in each test case makes each case self-contained and allows different test cases to use different scoring logic without modifying a central scoring file. Adding a new test case requires only adding one dictionary to the dataset.

What should I do if the stub search tool returns a fallback string?

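The stub returns a fallback string when the agent’s query does not exactly match any key in the fake_results dictionary. Either add the reformulated query as a key in fake_results, or relax the lookup in StubSearchTool._run to a substring match. Note that reward_query_contains already matches on substrings, so it still scores 1.0 for a reformulated query that includes the keyword; it is the stub lookup, not the reward, that is exact-match.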