# Tool-Call Debugging in CrewAI Agents Using Braintrust Evals

> Build a CrewAI agent with three instrumented tools, log every tool-call input and output as a Braintrust experiment row, write a custom scorer, and diff two agent versions side-by-side. When tool calls fail silently in production, this pipeline tells you exactly which call, which input, and why.

- Canonical URL: https://agentry.press/tutorial/tool-call-debugging-in-crewai-agents-using-braintrust-evals/
- Type: Tutorial
- Published: 2026-06-05
- By: agentry
- Tags: crewai, braintrust, evals, tool-calling, debugging, llm-ops

---

## Why this matters

Multi-agent frameworks like CrewAI are increasingly used to coordinate teams of specialized agents, each relying on tool calls to interact with the world [2]. The operational problem is that tool calls fail in ways that are invisible at the agent level: a search tool returns an empty result, a calculator receives a malformed expression, a formatter silently truncates output. The agent continues, produces a plausible-looking answer, and the failure is only discovered downstream.

Braintrust's experiment API lets you log individual tool-call inputs and outputs as scored rows, replay them deterministically, and diff two versions of an agent against the same dataset. Without this harness, debugging a CrewAI agent means reading raw logs and guessing which of a dozen tool calls caused the regression. With it, you get a table of every call, a score per call, and a visual diff between `v1` and `v2`.

This tutorial builds that harness from scratch: a three-tool CrewAI agent, a logging wrapper that captures every tool invocation, a Braintrust experiment that scores each call, and a version-diff script.

## Prerequisites

- Python 3.11 or later
- A Braintrust account and API key (free tier works; set `BRAINTRUST_API_KEY`)
- An OpenAI API key (set `OPENAI_API_KEY`)
- Familiarity with CrewAI's `Agent`, `Task`, and `Tool` primitives

## Setup

Install the required packages.

```bash
uv pip install crewai braintrust autoevals openai
```

Export your API keys. Only the blocks that call OpenAI or Braintrust need them; the structural checks in this tutorial run without credentials.

```bash
export BRAINTRUST_API_KEY="your-braintrust-api-key"
export OPENAI_API_KEY="your-openai-api-key"
```

Verify the installs.

```python
from importlib.metadata import version
for pkg in ["crewai", "braintrust", "autoevals", "openai"]:
    print(f"{pkg}: {version(pkg)}")
```

## Step 1: Define the Three Tools

The agent will use three tools: a unit converter, a word counter, and a simple CSV formatter. Each tool is deliberately simple so the tutorial focuses on the logging harness rather than tool complexity.

```python
# filename: tools.py
import csv
import io
from crewai.tools import tool

@tool("unit_converter")
def unit_converter(expression: str) -> str:
    """
    Convert between common units. Input format: '<value> <from_unit> to <to_unit>'.
    Supported: km/miles, kg/lbs, celsius/fahrenheit.
    Example: '5 km to miles'
    """
    try:
        parts = expression.strip().lower().split()
        if len(parts) != 4 or parts[2] != "to":
            return f"ERROR: invalid format '{expression}'. Use '<value> <from> to <to>'"
        value = float(parts[0])
        from_unit = parts[1]
        to_unit = parts[3]
        conversions = {
            ("km", "miles"): lambda v: v * 0.621371,
            ("miles", "km"): lambda v: v * 1.60934,
            ("kg", "lbs"): lambda v: v * 2.20462,
            ("lbs", "kg"): lambda v: v / 2.20462,
            ("celsius", "fahrenheit"): lambda v: v * 9/5 + 32,
            ("fahrenheit", "celsius"): lambda v: (v - 32) * 5/9,
        }
        fn = conversions.get((from_unit, to_unit))
        if fn is None:
            return f"ERROR: unsupported conversion '{from_unit}' to '{to_unit}'"
        result = fn(value)
        return f"{value} {from_unit} = {result:.4f} {to_unit}"
    except Exception as e:
        return f"ERROR: {e}"


@tool("word_counter")
def word_counter(text: str) -> str:
    """
    Count words, sentences, and characters in the provided text.
    Returns a summary string.
    """
    if not text or not text.strip():
        return "ERROR: empty input"
    words = text.split()
    sentences = [s.strip() for s in text.replace('!', '.').replace('?', '.').split('.') if s.strip()]
    chars = len(text)
    return f"words={len(words)}, sentences={len(sentences)}, characters={chars}"


@tool("csv_formatter")
def csv_formatter(data: str) -> str:
    """
    Format a pipe-delimited table string into CSV.
    Input: 'header1|header2|header3\\nval1|val2|val3\\n...'
    Returns a valid CSV string.
    """
    if not data or not data.strip():
        return "ERROR: empty input"
    output = io.StringIO()
    writer = csv.writer(output)
    for line in data.strip().splitlines():
        row = [cell.strip() for cell in line.split("|")]
        writer.writerow(row)
    return output.getvalue().strip()
```

## Step 2: Build the Logging Wrapper

This is the core of the tutorial. The `LoggedTool` wrapper intercepts every call to a CrewAI tool, records the input, calls the real tool, records the output and any error, and appends the record to a shared list. After the agent run, that list becomes the Braintrust experiment dataset.

```python
# filename: logged_tool.py
import time
import traceback
from typing import Any
from crewai.tools import BaseTool
from pydantic import ConfigDict


class LoggedTool(BaseTool):
    """
    Wraps any CrewAI BaseTool and records every invocation to a shared log list.
    """
    name: str = "logged_tool"
    description: str = "A logged wrapper around another tool."
    inner: Any  # the real BaseTool
    call_log: Any  # a list shared across all LoggedTool instances

    # Pydantic v2 style; the nested `class Config` form is deprecated
    model_config = ConfigDict(arbitrary_types_allowed=True)

    def __init__(self, inner: BaseTool, call_log: list, **kwargs):
        # Reuse the inner tool's argument schema so the agent still sees the
        # real parameter names rather than the wrapper's bare **kwargs
        kwargs.setdefault("args_schema", inner.args_schema)
        super().__init__(
            name=inner.name,
            description=inner.description,
            inner=inner,
            call_log=call_log,
            **kwargs,
        )

    def _run(self, **kwargs) -> str:
        # CrewAI passes tool arguments as keyword arguments.
        # For the log record, flatten a single argument to its bare value.
        if len(kwargs) == 1:
            input_value = next(iter(kwargs.values()))
        else:
            input_value = str(kwargs)

        start = time.perf_counter()
        error = None
        output = ""
        try:
            if len(kwargs) == 1:
                output = self.inner._run(input_value)
            else:
                # Multi-argument tools need the original keywords, not the
                # stringified dict that is kept only for the log record
                output = self.inner._run(**kwargs)
        except Exception as e:
            error = traceback.format_exc()
            output = f"ERROR: {e}"
        elapsed_ms = (time.perf_counter() - start) * 1000

        self.call_log.append({
            "tool": self.name,
            "input": input_value,
            "output": output,
            "error": error,
            "latency_ms": round(elapsed_ms, 2),
        })
        return output
```
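
The capture pattern can also be tried without CrewAI installed. The sketch below is a framework-free analogue, not one of the tutorial's modules: `logged` is a hypothetical decorator and `shout` is a made-up tool, but the record it appends has the same five fields that `LoggedTool` writes.

```python
import time
import traceback
from functools import wraps


def logged(call_log: list, tool_name: str):
    """Hypothetical decorator: record each call in the LoggedTool record shape."""
    def decorator(fn):
        @wraps(fn)
        def wrapper(arg: str) -> str:
            start = time.perf_counter()
            error = None
            try:
                output = fn(arg)
            except Exception as e:
                # Keep the traceback for debugging, hand the caller a string
                error = traceback.format_exc()
                output = f"ERROR: {e}"
            call_log.append({
                "tool": tool_name,
                "input": arg,
                "output": output,
                "error": error,
                "latency_ms": round((time.perf_counter() - start) * 1000, 2),
            })
            return output
        return wrapper
    return decorator


calls = []


@logged(calls, "shout")
def shout(text: str) -> str:
    if not text:
        raise ValueError("empty input")
    return text.upper()


shout("hello")
shout("")  # raises inside the tool; the wrapper turns it into an ERROR string

for record in calls:
    print(record["tool"], repr(record["input"]), "->", repr(record["output"]))
```

Note the two failure channels: a tool that raises gets a populated `error` traceback, while a tool that returns an `ERROR:` string on its own (as the three tools in `tools.py` do) leaves `error` as `None`. The scorers later in the tutorial key off the output prefix, which covers both.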

## Step 3: Assemble the Agent

The agent receives the three logged tools. The `build_crew` function takes the model name as a parameter so structural tests can assemble the crew without sending a request to OpenAI.

```python
# filename: agent.py
from crewai import Agent, Task, Crew
from tools import unit_converter, word_counter, csv_formatter
from logged_tool import LoggedTool


def build_crew(call_log: list, model: str = "gpt-4o-mini"):
    """
    Build a CrewAI Crew with three logged tools.
    Returns (crew, task) so callers can run crew.kickoff().
    """
    logged_tools = [
        LoggedTool(inner=unit_converter, call_log=call_log),
        LoggedTool(inner=word_counter, call_log=call_log),
        LoggedTool(inner=csv_formatter, call_log=call_log),
    ]

    analyst = Agent(
        role="Data Analyst",
        goal="Answer multi-part questions using the provided tools.",
        backstory=(
            "You are a precise data analyst. You always use tools to compute "
            "answers rather than guessing. You use each tool at least once."
        ),
        tools=logged_tools,
        llm=model,
        verbose=False,
    )

    task = Task(
        description=(
            "Complete ALL of the following:\n"
            "1. Convert 42 km to miles.\n"
            "2. Count the words in: 'The quick brown fox jumps over the lazy dog. "
            "Pack my box with five dozen liquor jugs.'\n"
            "3. Format this pipe-delimited table as CSV:\n"
            "   name|age|city\n"
            "   Alice|30|New York\n"
            "   Bob|25|London\n"
            "Report all three results clearly."
        ),
        expected_output="Three clearly labeled results: conversion, word count, and CSV.",
        agent=analyst,
    )

    crew = Crew(agents=[analyst], tasks=[task], verbose=False)
    return crew, task
```

Verify the module loads and the crew assembles without touching any API.

```python
from agent import build_crew

call_log = []
# The model name is only stored at this point; no request is sent to OpenAI until kickoff()
crew, task = build_crew(call_log, model="gpt-4o-mini")
print("agents:", [a.role for a in crew.agents])
print("tools:", [t.name for t in crew.agents[0].tools])
print("crew assembled OK")
```

## Step 4: Run the Agent and Capture Tool Calls

This block requires `OPENAI_API_KEY`; nothing is sent to Braintrust yet. It runs the crew, prints the final answer, and dumps the raw call log so you can inspect it before sending anything to Braintrust.

```python
# filename: run_agent.py
import json
from agent import build_crew

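def run_and_log(model: str = "gpt-4o-mini", version_tag: str = "v1"):
    # version_tag is only a label kept for symmetry with run_experiment();
    # it does not change the agent. Vary `model` or the task to get a real v2.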
    call_log = []
    crew, _ = build_crew(call_log, model=model)
    result = crew.kickoff()
    return str(result), call_log

if __name__ == "__main__":
    answer, calls = run_and_log(version_tag="v1")
    print("=== Agent answer ===")
    print(answer)
    print(f"\n=== {len(calls)} tool calls logged ===")
    for i, c in enumerate(calls):
        print(f"[{i}] {c['tool']} | input={c['input']!r} | output={c['output']!r} | {c['latency_ms']}ms")
```
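
Before uploading anything, a quick tally by tool shows whether the agent called each tool the expected number of times. This is a minimal sketch using only the standard library; `tally_calls` is a hypothetical helper, and the sample records are invented to match the record shape produced by `LoggedTool`.

```python
from collections import Counter


def tally_calls(call_log: list) -> dict:
    """Hypothetical helper: count calls and ERROR outputs per tool."""
    total = Counter(r["tool"] for r in call_log)
    errors = Counter(r["tool"] for r in call_log if r["output"].startswith("ERROR"))
    return {tool: {"calls": n, "errors": errors[tool]} for tool, n in total.items()}


# Invented sample: the agent retried unit_converter after a malformed first call
sample = [
    {"tool": "unit_converter", "input": "42km",
     "output": "ERROR: invalid format '42km'. Use '<value> <from> to <to>'"},
    {"tool": "unit_converter", "input": "42 km to miles",
     "output": "42.0 km = 26.0976 miles"},
    {"tool": "word_counter", "input": "hi there",
     "output": "words=2, sentences=1, characters=8"},
]

tally = tally_calls(sample)
for tool, counts in tally.items():
    print(f"{tool}: {counts['calls']} calls, {counts['errors']} errors")
```

A tool that appears more often than the task requires is worth a closer look: the final answer may be correct only because the agent retried after an error string.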

## Step 5: Score Tool Calls and Log to Braintrust

Each tool-call record becomes one Braintrust experiment row. The custom scorer checks three things: the output does not start with `ERROR:`, the output is non-empty, and the latency is under 500 ms. A second scorer uses `autoevals.Levenshtein` to measure output similarity against a reference (useful when you have golden outputs).

```python
# filename: eval_runner.py
import os
import braintrust
from autoevals import Levenshtein

# Reference outputs for the three expected tool calls.
# In a real harness these come from a curated dataset.
REFERENCE_OUTPUTS = {
    "unit_converter": "42.0 km = 26.0976 miles",
    "word_counter": "words=17, sentences=2, characters=85",
    "csv_formatter": 'name,age,city\r\nAlice,30,New York\r\nBob,25,London',
}


def tool_health_scorer(output: str, expected: str, input: str, metadata: dict) -> dict:
    """
    Returns a score of 1.0 if the tool call looks healthy, 0.0 otherwise.
    Checks: no ERROR prefix, non-empty output, latency under 500 ms.
    """
    latency_ok = metadata.get("latency_ms", 0) < 500
    no_error = not output.strip().startswith("ERROR")
    non_empty = bool(output.strip())
    score = 1.0 if (no_error and non_empty and latency_ok) else 0.0
    reason_parts = []
    if not no_error:
        reason_parts.append("output is an error")
    if not non_empty:
        reason_parts.append("output is empty")
    if not latency_ok:
        reason_parts.append(f"latency {metadata.get('latency_ms')}ms >= 500ms")
    return {
        "name": "tool_health",
        "score": score,
        "reason": "; ".join(reason_parts) if reason_parts else "healthy",
    }


def run_experiment(call_log: list, version_tag: str = "v1", project: str = "crewai-tool-debug"):
    """
    Log each tool-call record as a Braintrust experiment row.
    Returns the experiment object.
    """
    experiment = braintrust.init(
        project=project,
        experiment=version_tag,
        api_key=os.environ.get("BRAINTRUST_API_KEY"),
    )

    levenshtein = Levenshtein()

    for record in call_log:
        tool_name = record["tool"]
        input_str = record["input"]
        output_str = record["output"]
        reference = REFERENCE_OUTPUTS.get(tool_name, "")

        health = tool_health_scorer(
            output=output_str,
            expected=reference,
            input=input_str,
            metadata={"latency_ms": record["latency_ms"]},
        )

        lev_result = levenshtein(
            output=output_str,
            expected=reference,
        )

        experiment.log(
            input={"tool": tool_name, "args": input_str},
            output=output_str,
            expected=reference,
            scores={
                "tool_health": health["score"],
                "levenshtein": lev_result.score if lev_result else None,
            },
            metadata={
                "tool": tool_name,
                "latency_ms": record["latency_ms"],
                "error": record["error"],
                "health_reason": health["reason"],
                "version": version_tag,
            },
        )

    experiment.flush()
    return experiment


if __name__ == "__main__":
    from run_agent import run_and_log

    print("Running v1 agent...")
    _, calls_v1 = run_and_log(model="gpt-4o-mini", version_tag="v1")
    exp_v1 = run_experiment(calls_v1, version_tag="v1")
    print(f"v1 experiment logged: {len(calls_v1)} rows")

    print("Running v2 agent (identical to v1 until you change the model or task wording)...")
    _, calls_v2 = run_and_log(model="gpt-4o-mini", version_tag="v2")
    exp_v2 = run_experiment(calls_v2, version_tag="v2")
    print(f"v2 experiment logged: {len(calls_v2)} rows")
```
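
To build intuition for the `levenshtein` column, the sketch below computes a normalized edit-distance similarity by hand. It assumes the score is `1 - distance / max(len(output), len(expected))`; that normalization is an assumption about how `autoevals.Levenshtein` behaves, so check the library before relying on exact numbers. `edit_distance` and `similarity` are hypothetical helpers.

```python
def edit_distance(a: str, b: str) -> int:
    """Classic dynamic-programming Levenshtein distance, one row at a time."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # substitute (free if equal)
            ))
        prev = curr
    return prev[-1]


def similarity(output: str, expected: str) -> float:
    """Assumed normalization: 1.0 for identical strings, lower as they diverge."""
    longest = max(len(output), len(expected))
    if longest == 0:
        return 1.0
    return 1.0 - edit_distance(output, expected) / longest


reference = "42.0 km = 26.0976 miles"
exact = similarity("42.0 km = 26.0976 miles", reference)
near = similarity("42 km = 26.0976 miles", reference)   # formatting drift only
error = similarity("ERROR: empty input", reference)

print(f"exact={exact:.3f} near={near:.3f} error={error:.3f}")
```

Under this assumption a purely cosmetic difference such as `42` versus `42.0` still lowers the score, so a reference string has to match the tool's exact formatting for a healthy call to score 1.0.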

## Step 6: Replay a Failing Tool Call

Once you have a Braintrust experiment, the rows where `tool_health < 1.0` tell you which calls to replay locally without re-running the full agent. The script below does not fetch rows from Braintrust; it works on the in-memory call log, selects records whose output starts with `ERROR`, and re-invokes each tool directly.

```python
# filename: replay.py
from tools import unit_converter, word_counter, csv_formatter

TOOL_MAP = {
    "unit_converter": unit_converter,
    "word_counter": word_counter,
    "csv_formatter": csv_formatter,
}


def replay_call(tool_name: str, args: str) -> str:
    """
    Re-invoke a tool by name with the original args string.
    Useful for debugging a specific failing row without re-running the agent.
    """
    tool = TOOL_MAP.get(tool_name)
    if tool is None:
        return f"ERROR: unknown tool '{tool_name}'"
    return tool._run(args)


def replay_from_log(call_log: list):
    """
    Filter call_log for rows where the output starts with ERROR,
    then replay each one and print the before/after.
    """
    failing = [r for r in call_log if r["output"].startswith("ERROR")]
    if not failing:
        print("No failing tool calls found in log.")
        return
    for record in failing:
        print(f"\n--- Replaying failing call: {record['tool']} ---")
        print(f"  original input : {record['input']!r}")
        print(f"  original output: {record['output']!r}")
        replayed = replay_call(record["tool"], record["input"])
        print(f"  replayed output: {replayed!r}")


if __name__ == "__main__":
    # Simulate a call log with one intentionally bad input
    sample_log = [
        {"tool": "unit_converter", "input": "42 km to miles", "output": "42.0 km = 26.0976 miles", "error": None, "latency_ms": 0.5},
        {"tool": "unit_converter", "input": "bad input here", "output": "ERROR: invalid format 'bad input here'. Use '<value> <from> to <to>'", "error": None, "latency_ms": 0.3},
        {"tool": "word_counter", "input": "", "output": "ERROR: empty input", "error": None, "latency_ms": 0.1},
    ]
    replay_from_log(sample_log)
```

## Step 7: Version Diff Script

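This script compares two versions by tool name and prints the health delta for each tool. The Braintrust UI gives a visual diff of the logged experiments; this script is a coarser terminal equivalent that works on the local call logs and counts a call as unhealthy only when its output starts with `ERROR`.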

```python
# filename: diff_versions.py
from collections import defaultdict


def summarize_log(call_log: list, version_tag: str) -> dict:
    """
    Compute per-tool average health score from a call log.
    Returns {tool_name: {avg_health, count, error_count}}.
    """
    per_tool = defaultdict(lambda: {"scores": [], "errors": 0})
    for record in call_log:
        name = record["tool"]
        is_error = record["output"].startswith("ERROR")
        score = 0.0 if is_error else 1.0
        per_tool[name]["scores"].append(score)
        if is_error:
            per_tool[name]["errors"] += 1

    summary = {}
    for tool, data in per_tool.items():
        scores = data["scores"]
        summary[tool] = {
            "version": version_tag,
            "avg_health": sum(scores) / len(scores) if scores else 0.0,
            "count": len(scores),
            "error_count": data["errors"],
        }
    return summary


def diff_versions(log_v1: list, log_v2: list):
    """
    Print a side-by-side diff of per-tool health scores between two versions.
    """
    s1 = summarize_log(log_v1, "v1")
    s2 = summarize_log(log_v2, "v2")
    all_tools = sorted(s1.keys() | s2.keys())

    print(f"{'Tool':<20} {'v1 health':>10} {'v2 health':>10} {'delta':>8}")
    print("-" * 52)
    for tool in all_tools:
        h1 = s1.get(tool, {}).get("avg_health", float("nan"))
        h2 = s2.get(tool, {}).get("avg_health", float("nan"))
        delta = h2 - h1 if (h1 == h1 and h2 == h2) else float("nan")
        delta_str = f"{delta:+.2f}" if delta == delta else "n/a"
        print(f"{tool:<20} {h1:>10.2f} {h2:>10.2f} {delta_str:>8}")


if __name__ == "__main__":
    # Simulate two versions: v2 fixes the empty-input bug in word_counter
    log_v1 = [
        {"tool": "unit_converter", "input": "42 km to miles", "output": "42.0 km = 26.0976 miles", "error": None, "latency_ms": 1.2},
        {"tool": "word_counter", "input": "", "output": "ERROR: empty input", "error": None, "latency_ms": 0.2},
        {"tool": "csv_formatter", "input": "name|age\nAlice|30", "output": "name,age\r\nAlice,30", "error": None, "latency_ms": 0.8},
    ]
    log_v2 = [
        {"tool": "unit_converter", "input": "42 km to miles", "output": "42.0 km = 26.0976 miles", "error": None, "latency_ms": 1.1},
        {"tool": "word_counter", "input": "The quick brown fox", "output": "words=4, sentences=1, characters=19", "error": None, "latency_ms": 0.3},
        {"tool": "csv_formatter", "input": "name|age\nAlice|30", "output": "name,age\r\nAlice,30", "error": None, "latency_ms": 0.7},
    ]
    diff_versions(log_v1, log_v2)
```

## Verify it works

Run the structural checks that do not require API keys: the tool functions, the replay script, and the version diff.

```python
import sys
sys.path.insert(0, "/workspace")

# 1. Verify the three tools produce correct outputs
from tools import unit_converter, word_counter, csv_formatter

assert "26.0976" in unit_converter._run("42 km to miles"), "unit_converter failed"
assert "words=17" in word_counter._run(
    "The quick brown fox jumps over the lazy dog. Pack my box with five dozen liquor jugs."
), "word_counter failed"
csv_out = csv_formatter._run("name|age|city\nAlice|30|New York\nBob|25|London")
assert "Alice" in csv_out and "New York" in csv_out, "csv_formatter failed"
print("tool outputs: OK")

# 2. Verify LoggedTool captures calls
from logged_tool import LoggedTool
call_log = []
logged_uc = LoggedTool(inner=unit_converter, call_log=call_log)
logged_uc._run(expression="10 kg to lbs")
assert len(call_log) == 1
assert call_log[0]["tool"] == "unit_converter"
assert "22.0462" in call_log[0]["output"]
print("LoggedTool capture: OK")

# 3. Verify error capture
logged_uc._run(expression="bad input")
assert call_log[1]["output"].startswith("ERROR")
print("LoggedTool error capture: OK")

# 4. Verify replay
from replay import replay_call
result = replay_call("unit_converter", "5 km to miles")
assert "3.1069" in result, f"unexpected replay result: {result}"
print("replay: OK")

# 5. Verify diff output
from diff_versions import diff_versions, summarize_log
log_a = [
    {"tool": "unit_converter", "input": "1 km to miles", "output": "1.0 km = 0.6214 miles", "error": None, "latency_ms": 1.0},
    {"tool": "word_counter", "input": "", "output": "ERROR: empty input", "error": None, "latency_ms": 0.1},
]
log_b = [
    {"tool": "unit_converter", "input": "1 km to miles", "output": "1.0 km = 0.6214 miles", "error": None, "latency_ms": 1.0},
    {"tool": "word_counter", "input": "hello world", "output": "words=2, sentences=1, characters=11", "error": None, "latency_ms": 0.2},
]
s_a = summarize_log(log_a, "a")
s_b = summarize_log(log_b, "b")
assert s_a["word_counter"]["avg_health"] == 0.0
assert s_b["word_counter"]["avg_health"] == 1.0
print("version diff logic: OK")

print("\nAll structural checks passed.")
```

To run the full end-to-end pipeline with real API calls:

```bash
# Requires OPENAI_API_KEY and BRAINTRUST_API_KEY to be set
python /workspace/eval_runner.py
```

## Troubleshooting

**`ModuleNotFoundError: No module named 'crewai'`** — The install block did not complete or was skipped. Run `uv pip install crewai braintrust autoevals openai` in your terminal before running any other block.

**`AuthenticationError` from OpenAI** — `OPENAI_API_KEY` is not set or is invalid. Run `export OPENAI_API_KEY=sk-...` in the same shell session before running `eval_runner.py`.

**`braintrust.APIError: 401 Unauthorized`** — `BRAINTRUST_API_KEY` is missing or expired. Retrieve a fresh key from your Braintrust project settings and re-export it.

**`LoggedTool._run` receives no arguments** — CrewAI passes tool arguments as keyword arguments whose names match the tool function's parameter names. If your tool function signature uses a parameter name other than the one CrewAI infers, the `kwargs` dict will be empty. Inspect `kwargs` with a `print` statement and align the parameter name.

**Braintrust experiment shows zero rows** — `experiment.flush()` was not called, or the process exited before the async flush completed. Always call `experiment.flush()` explicitly after the logging loop, as shown in `eval_runner.py`.

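**`csv_formatter` output differs from the reference** — Python's `csv.writer` uses `\r\n` as its default line terminator on every platform, and `io.StringIO` does not translate it, so the `\r\n` in `REFERENCE_OUTPUTS` is correct for the tool as written. A mismatch is more likely to mean the agent passed different input (extra whitespace, or a literal `\n` instead of a newline) or that the writer was changed to use `lineterminator='\n'`. If line endings should not count, normalize both sides with `.replace('\r\n', '\n')` before scoring.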
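
The line-terminator behaviour is easy to check in isolation. The snippet below uses only the standard library; `normalize_newlines` is a hypothetical helper, not part of `eval_runner.py`.

```python
import csv
import io

rows = [["name", "age"], ["Alice", "30"]]

# Default dialect ("excel"): the writer terminates every row with "\r\n"
buf = io.StringIO()
csv.writer(buf).writerows(rows)
default_out = buf.getvalue().strip()

# Explicit override: rows end with "\n" instead
buf = io.StringIO()
csv.writer(buf, lineterminator="\n").writerows(rows)
unix_out = buf.getvalue().strip()


def normalize_newlines(text: str) -> str:
    """Hypothetical helper: collapse CRLF to LF so both sides compare equal."""
    return text.replace("\r\n", "\n")


print(repr(default_out))
print(repr(unix_out))
print(normalize_newlines(default_out) == normalize_newlines(unix_out))
```

Applying the same normalization to both the tool output and the reference before scoring keeps a line-ending difference from being reported as a content regression.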

## Next steps

- **Add a dataset fixture**: Export your `call_log` to a JSON file after the first run and use it as a fixed dataset for regression tests. Braintrust's `Dataset` API lets you push rows once and reference them across experiments.
- **Score with an LLM judge**: Replace `tool_health_scorer` with an `LLMClassifier` from `autoevals` that asks a model whether the tool output is semantically correct given the input. This catches cases where the output is non-empty but wrong.
- **Instrument a multi-agent crew**: The `LoggedTool` wrapper is agent-agnostic. Attach it to tools in a LATTE-style adaptive task graph [2] to measure per-agent tool-call health across a coordinated team.
- **Set up a CI gate**: Run `eval_runner.py` in CI and fail the build if any experiment row has `tool_health < 1.0`. Braintrust's `experiment.summarize()` returns aggregate scores you can threshold in a shell script.
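
The fixture and CI-gate ideas above can be prototyped locally before wiring in Braintrust. The sketch below is a stand-in under stated assumptions: `save_fixture`, `load_fixture`, and `gate` are hypothetical helpers, the file name is arbitrary, and health is judged only by the `ERROR` prefix rather than by scores stored in an experiment.

```python
import json
import tempfile
from pathlib import Path


def save_fixture(call_log: list, path: Path) -> None:
    """Write the call log as JSON so later runs can reuse it as a fixed dataset."""
    path.write_text(json.dumps(call_log, indent=2), encoding="utf-8")


def load_fixture(path: Path) -> list:
    return json.loads(path.read_text(encoding="utf-8"))


def gate(call_log: list) -> int:
    """Return a process exit code: 0 if every call is healthy, 1 otherwise."""
    failing = [r for r in call_log if r["output"].startswith("ERROR")]
    for r in failing:
        print(f"FAIL {r['tool']}: input={r['input']!r} output={r['output']!r}")
    return 1 if failing else 0


# Invented sample log with one failing call
sample_log = [
    {"tool": "unit_converter", "input": "42 km to miles",
     "output": "42.0 km = 26.0976 miles", "error": None, "latency_ms": 0.5},
    {"tool": "word_counter", "input": "",
     "output": "ERROR: empty input", "error": None, "latency_ms": 0.1},
]

with tempfile.TemporaryDirectory() as tmp:
    fixture = Path(tmp) / "call_log_v1.json"  # arbitrary name for illustration
    save_fixture(sample_log, fixture)
    restored = load_fixture(fixture)

exit_code = gate(restored)
print("exit code:", exit_code)
```

In a CI job the return value would be passed to `sys.exit()` so that a non-zero code fails the build; the temporary directory here only keeps the example self-contained.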

## FAQ

### How does the LoggedTool wrapper capture tool invocations?

LoggedTool wraps any CrewAI BaseTool and intercepts its _run method, recording the input, output, any error, and latency before returning the result to the agent. All records are appended to a shared call_log list that becomes the Braintrust experiment dataset.

### What does the tool_health_scorer check?

The scorer returns 1.0 if the output does not start with ERROR, is non-empty, and completes in under 500 ms; otherwise it returns 0.0 and records the reason (error prefix, empty output, or high latency).

### How can you replay a failing tool call without re-running the agent?

The replay.py script filters the call_log for rows where output starts with ERROR, then re-invokes each tool directly with its original input arguments. This isolates the tool behavior from the agent's reasoning.

### What does the version diff script show?

The diff_versions function computes per-tool average health scores for two versions and prints a side-by-side table with the delta, letting you see which tools improved or regressed between versions.

### Why is Braintrust's experiment API useful for CrewAI debugging?

It logs each tool call as a scored row, allows deterministic replay of the same dataset across versions, and provides a visual diff in the UI. Without it, debugging requires reading raw logs and guessing which of many tool calls caused a regression.

## References

1. https://github.com/vercel-labs/open-agents
2. https://arxiv.org/abs/2605.06320v1
