Agent evals that live in notebooks are not evals. They’re demos. The moment you ship a LangGraph coding agent to production, you need a repeatable pipeline that runs the same tasks, captures the same trajectory data, and scores against the same rubric on every iteration.
This tutorial builds that pipeline from scratch. You’ll define a small dataset of coding tasks, run a LangGraph agent against each one inside an isolated sandbox environment, capture the full tool-call trajectory, and push scores to Braintrust for tracking over time.
Why this matters
OpenSandbox [1] entered the CNCF Landscape as a general-purpose sandbox runtime for AI agents, offering Docker and Kubernetes backends, per-sandbox egress controls, and built-in command and filesystem execution APIs. That runtime fills a concrete gap: running untrusted agent-generated code safely during evaluation without spinning up a full Kubernetes cluster for every experiment.
Most teams today evaluate coding agents by eyeballing outputs or writing brittle string-match tests. Neither approach scales. When your agent starts calling tools in multi-step chains, what matters is not just the final answer but the trajectory: did it take the shortest path, did it retry unnecessarily, did it call the right tools in the right order? Without a harness that captures and scores trajectories, regressions in tool-call efficiency are invisible until a user complains.
Braintrust’s dataset-driven scoring API gives you a structured place to store expected outputs, run experiments, and compare scores across model versions or prompt changes. Combined with OpenSandbox’s isolated execution environment, you get evals that are both safe and reproducible.
Prerequisites
- Python 3.11 or 3.12
- A Braintrust account with an API key (free tier works for this tutorial)
- Familiarity with LangGraph’s graph construction API
- Basic understanding of OpenTelemetry spans (helpful but not required)
- An OpenAI or Anthropic API key for the agent’s LLM calls
Setup
Install the required packages. The harness uses LangGraph for the agent, Braintrust’s SDK for experiment tracking, and the OpenSandbox Python SDK [1] for isolated code execution.
uv pip install langgraph langchain-core langchain-openai braintrust autoevals opensandbox
Verify the installs:
from importlib.metadata import PackageNotFoundError, version

for pkg in ["langgraph", "braintrust", "autoevals", "opensandbox"]:
    try:
        print(f"{pkg}: {version(pkg)}")
    except PackageNotFoundError:
        print(f"{pkg}: not found")
Step 1: Define the Eval Dataset
A good eval dataset for a coding agent contains tasks with known correct outputs and a clear rubric for what “correct” means. Each task specifies the prompt, the expected final answer, and the maximum number of tool calls you’d accept for an efficient solution.
# filename: eval_dataset.py
from typing import TypedDict


class EvalTask(TypedDict):
    id: str
    prompt: str
    expected_output: str
    max_tool_calls: int
    tags: list[str]


EVAL_TASKS: list[EvalTask] = [
    {
        "id": "task_001",
        "prompt": "Write a Python function called `add` that takes two integers and returns their sum. Then call it with 3 and 4 and print the result.",
        "expected_output": "7",
        "max_tool_calls": 2,
        "tags": ["arithmetic", "function-definition"],
    },
    {
        "id": "task_002",
        "prompt": "Write a Python script that computes the factorial of 5 using recursion and prints the result.",
        "expected_output": "120",
        "max_tool_calls": 2,
        "tags": ["recursion", "math"],
    },
    {
        "id": "task_003",
        "prompt": "Create a list of the first 10 Fibonacci numbers starting from 0 and print them as a comma-separated string.",
        "expected_output": "0,1,1,2,3,5,8,13,21,34",
        "max_tool_calls": 2,
        "tags": ["sequences", "list-operations"],
    },
    {
        "id": "task_004",
        "prompt": "Write a Python one-liner that reverses the string 'hello world' and prints it.",
        "expected_output": "dlrow olleh",
        "max_tool_calls": 1,
        "tags": ["strings", "one-liner"],
    },
]
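Every later step trusts these fields, so a quick sanity check before a run can catch a duplicated id or an impossible tool-call budget. The sketch below is an addition to the tutorial, not part of eval_dataset.py: validate_tasks is a hypothetical helper, and the sample tasks are made up so that each check fires.

```python
from typing import Any


def validate_tasks(tasks: list[dict[str, Any]]) -> list[str]:
    """Return human-readable problems; an empty list means the dataset looks sane."""
    problems: list[str] = []
    seen_ids: set[str] = set()
    for index, task in enumerate(tasks):
        task_id = task.get("id", "")
        label = task_id or f"<task at index {index}>"
        if not task_id:
            problems.append(f"{label}: missing id")
        elif task_id in seen_ids:
            problems.append(f"{label}: duplicate id")
        seen_ids.add(task_id)
        if not str(task.get("prompt", "")).strip():
            problems.append(f"{label}: empty prompt")
        if not str(task.get("expected_output", "")).strip():
            problems.append(f"{label}: empty expected_output")
        max_calls = task.get("max_tool_calls")
        if not isinstance(max_calls, int) or max_calls < 1:
            problems.append(f"{label}: max_tool_calls must be a positive integer")
    return problems


# Hypothetical sample: the second task reuses an id, has no expected output,
# and allows zero tool calls.
sample_tasks = [
    {"id": "task_001", "prompt": "Print 3 + 4.", "expected_output": "7", "max_tool_calls": 2},
    {"id": "task_001", "prompt": "Print the factorial of 5.", "expected_output": "", "max_tool_calls": 0},
]
problems = validate_tasks(sample_tasks)
for problem in problems:
    print(problem)
```

Running a check like this at the top of the harness is one option; failing fast on a non-empty list keeps a malformed task from silently scoring 0.0.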
Step 2: Build the Trajectory Capture Layer
Before writing the agent, build the data structures that will hold trajectory information. A trajectory is the ordered sequence of tool calls the agent made, along with their inputs, outputs, and timing.
# filename: trajectory.py
from dataclasses import dataclass, field
from typing import Any


@dataclass
class ToolCall:
    tool_name: str
    tool_input: dict[str, Any]
    tool_output: str
    duration_ms: float
    success: bool


@dataclass
class AgentTrajectory:
    task_id: str
    prompt: str
    final_answer: str
    tool_calls: list[ToolCall] = field(default_factory=list)
    total_duration_ms: float = 0.0
    error: str | None = None

    def add_tool_call(
        self,
        tool_name: str,
        tool_input: dict[str, Any],
        tool_output: str,
        duration_ms: float,
        success: bool = True,
    ) -> None:
        self.tool_calls.append(
            ToolCall(
                tool_name=tool_name,
                tool_input=tool_input,
                tool_output=tool_output,
                duration_ms=duration_ms,
                success=success,
            )
        )

    @property
    def tool_call_count(self) -> int:
        return len(self.tool_calls)

    @property
    def failed_tool_calls(self) -> int:
        return sum(1 for tc in self.tool_calls if not tc.success)

    def to_dict(self) -> dict[str, Any]:
        return {
            "task_id": self.task_id,
            "prompt": self.prompt,
            "final_answer": self.final_answer,
            "tool_call_count": self.tool_call_count,
            "failed_tool_calls": self.failed_tool_calls,
            "total_duration_ms": self.total_duration_ms,
            "tool_calls": [
                {
                    "tool_name": tc.tool_name,
                    "tool_input": tc.tool_input,
                    "tool_output": tc.tool_output,
                    "duration_ms": tc.duration_ms,
                    "success": tc.success,
                }
                for tc in self.tool_calls
            ],
            "error": self.error,
        }
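Once trajectories are serialized with to_dict(), run-level statistics fall out of a few sums. The following is a minimal sketch, assuming dicts that carry the same keys as to_dict() above; summarize_trajectories and the sample values are hypothetical and not part of the tutorial’s modules.

```python
from statistics import mean
from typing import Any


def summarize_trajectories(trajectories: list[dict[str, Any]]) -> dict[str, float]:
    """Aggregate dicts shaped like AgentTrajectory.to_dict() into run-level stats."""
    if not trajectories:
        return {
            "tasks": 0.0,
            "mean_tool_calls": 0.0,
            "failed_call_rate": 0.0,
            "mean_duration_ms": 0.0,
        }
    total_calls = sum(t["tool_call_count"] for t in trajectories)
    failed_calls = sum(t["failed_tool_calls"] for t in trajectories)
    return {
        "tasks": float(len(trajectories)),
        "mean_tool_calls": float(mean(t["tool_call_count"] for t in trajectories)),
        # Share of all tool calls in the run that failed
        "failed_call_rate": failed_calls / total_calls if total_calls else 0.0,
        "mean_duration_ms": float(mean(t["total_duration_ms"] for t in trajectories)),
    }


# Hypothetical sample data; only the keys used above are included.
sample = [
    {"task_id": "task_001", "tool_call_count": 1, "failed_tool_calls": 0, "total_duration_ms": 40.0},
    {"task_id": "task_002", "tool_call_count": 3, "failed_tool_calls": 1, "total_duration_ms": 120.0},
]
summary = summarize_trajectories(sample)
print(summary)
```

Tracking mean_tool_calls across runs is one way to surface the tool-call efficiency regressions that a correctness score alone would hide.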
Step 3: Build the Sandbox-Backed Code Execution Tool
The agent needs a tool that executes Python code. In production this would call the OpenSandbox API [1] to run code inside an isolated container. The environment used for this tutorial has no Docker daemon, so the tool falls back to a subprocess-based executor with a timeout. That fallback bounds runtime but is not a security boundary: the code runs with your user’s permissions, so only point it at trusted or mock-generated code. The function signature is the one you’d keep when wiring the tool to OpenSandbox’s command execution API.
# filename: code_tool.py
import subprocess
import sys
import textwrap
import time

from trajectory import AgentTrajectory


def execute_python_code(
    code: str,
    trajectory: AgentTrajectory,
    timeout_seconds: int = 10,
) -> str:
    """
    Execute Python code and record the tool call in the trajectory.

    In production, replace the subprocess.run call with the equivalent
    OpenSandbox command-execution call, for example
        sandbox.command.run(["python", "-c", code])
    where `sandbox` is an OpenSandbox sandbox handle (confirm the exact call
    against the SDK you install). The trajectory recording logic stays
    identical.
    """
    start = time.perf_counter()
    cleaned = textwrap.dedent(code).strip()
    success = True
    output = ""
    try:
        # Run the snippet in a fresh interpreter so it cannot touch harness state
        result = subprocess.run(
            [sys.executable, "-c", cleaned],
            capture_output=True,
            text=True,
            timeout=timeout_seconds,
        )
        if result.returncode == 0:
            output = result.stdout.strip()
        else:
            output = f"ERROR: {result.stderr.strip()}"
            success = False
    except subprocess.TimeoutExpired:
        output = f"ERROR: execution timed out after {timeout_seconds}s"
        success = False
    except Exception as exc:
        output = f"ERROR: {exc}"
        success = False
    duration_ms = (time.perf_counter() - start) * 1000
    # Record every call, including failures, so scorers see the full trajectory
    trajectory.add_tool_call(
        tool_name="execute_python",
        tool_input={"code": cleaned},
        tool_output=output,
        duration_ms=duration_ms,
        success=success,
    )
    return output
Verify the tool works in isolation:
from trajectory import AgentTrajectory
from code_tool import execute_python_code
traj = AgentTrajectory(task_id="test", prompt="test", final_answer="")
result = execute_python_code("print(3 + 4)", traj)
print(f"tool output: {result}")
print(f"tool calls recorded: {traj.tool_call_count}")
print(f"tool success: {traj.tool_calls[0].success}")
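The check above only exercises the success path. The sketch below isolates the three outcomes the executor distinguishes (clean exit, non-zero exit, timeout) without depending on the tutorial’s modules. run_snippet is a hypothetical stand-in that mirrors the subprocess logic of execute_python_code but skips trajectory recording.

```python
import subprocess
import sys


def run_snippet(code: str, timeout_seconds: float = 10) -> tuple[bool, str]:
    """Run code in a child interpreter and return (success, output)."""
    try:
        result = subprocess.run(
            [sys.executable, "-c", code],
            capture_output=True,
            text=True,
            timeout=timeout_seconds,
        )
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child before raising, so nothing is left running
        return False, f"ERROR: execution timed out after {timeout_seconds}s"
    if result.returncode != 0:
        return False, f"ERROR: {result.stderr.strip()}"
    return True, result.stdout.strip()


ok = run_snippet("print(3 + 4)")
crash = run_snippet("raise ValueError('boom')")
slow = run_snippet("import time; time.sleep(30)", timeout_seconds=1)

for label, (success, output) in [("ok", ok), ("crash", crash), ("slow", slow)]:
    # Print only the last line so the traceback from the crash case stays short
    print(f"{label}: success={success} output={output.splitlines()[-1]}")
```

The crash case returns the child’s full traceback in the output string, which is what the agent sees on its next turn and can use to revise its code.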
Step 4: Build the LangGraph Coding Agent
The agent uses a simple ReAct loop: reason about the task, call the code execution tool, observe the output, and decide whether to continue or return a final answer. The graph is constructed without eagerly instantiating any LLM client, so the structure can be verified without API keys.
# filename: agent.py
import re
import time
from typing import Any, Callable

from langgraph.graph import StateGraph, END
from langchain_core.messages import HumanMessage, AIMessage
from typing_extensions import TypedDict

from trajectory import AgentTrajectory
from code_tool import execute_python_code


class AgentState(TypedDict):
    messages: list
    trajectory: AgentTrajectory
    task_id: str
    max_tool_calls: int
    done: bool


def make_reason_node(model: Any) -> Callable:
    """Return a node function that calls the LLM to reason about next action."""

    def reason(state: AgentState) -> AgentState:
        messages = state["messages"]
        response = model.invoke(messages)
        return {**state, "messages": messages + [response]}

    return reason


def act(state: AgentState) -> AgentState:
    """Parse the last AI message for a code block and execute it."""
    messages = state["messages"]
    last_msg = messages[-1]
    trajectory = state["trajectory"]
    # Extract code from a markdown fence, if the reply contains one
    content = last_msg.content if hasattr(last_msg, "content") else str(last_msg)
    code_match = re.search(r"```(?:python)?\n(.*?)```", content, re.DOTALL)
    if code_match:
        code = code_match.group(1).strip()
        output = execute_python_code(code, trajectory)
        # Feed the result back as a plain user turn. The model never issued a
        # structured tool call, and chat APIs typically reject a ToolMessage
        # that has no matching tool_calls entry on the preceding AI message.
        observation = HumanMessage(content=f"Execution output:\n{output}")
        return {**state, "messages": messages + [observation]}
    # No code block found; treat the response as the final answer and keep
    # only the text after 'ANSWER:' when the prefix is present
    trajectory.final_answer = content.split("ANSWER:")[-1].strip()
    return {**state, "done": True}


def should_continue(state: AgentState) -> str:
    if state.get("done"):
        return END
    trajectory = state["trajectory"]
    if trajectory.tool_call_count >= state["max_tool_calls"] + 1:
        # Force termination once the agent is one call over budget and fall
        # back to the last tool output as the final answer
        trajectory.final_answer = (
            trajectory.tool_calls[-1].tool_output
            if trajectory.tool_calls
            else "max tool calls exceeded"
        )
        return END
    return "reason"


def build_graph(model: Any) -> Any:
    """Build and compile the LangGraph agent graph."""
    graph = StateGraph(AgentState)
    graph.add_node("reason", make_reason_node(model))
    graph.add_node("act", act)
    graph.set_entry_point("reason")
    graph.add_edge("reason", "act")
    graph.add_conditional_edges("act", should_continue, {END: END, "reason": "reason"})
    return graph.compile()


SYSTEM_PROMPT = """You are a Python coding agent. For each task:
1. Write a Python code block to solve the problem.
2. Observe the output.
3. If the output is correct, state the final answer clearly.
4. If not, revise and try again.
Always wrap code in ```python ... ``` fences.
When you have the final answer, state it on its own line prefixed with 'ANSWER:'."""


def run_agent(
    task_id: str,
    prompt: str,
    max_tool_calls: int,
    model: Any,
) -> AgentTrajectory:
    """Run the agent on a single task and return the trajectory."""
    trajectory = AgentTrajectory(
        task_id=task_id,
        prompt=prompt,
        final_answer="",
    )
    start = time.perf_counter()
    initial_state: AgentState = {
        "messages": [
            HumanMessage(content=f"{SYSTEM_PROMPT}\n\nTask: {prompt}"),
        ],
        "trajectory": trajectory,
        "task_id": task_id,
        "max_tool_calls": max_tool_calls,
        "done": False,
    }
    app = build_graph(model)
    try:
        final_state = app.invoke(initial_state)
        # Extract final answer from last AI message if not already set
        if not trajectory.final_answer:
            for msg in reversed(final_state["messages"]):
                if hasattr(msg, "content") and isinstance(msg, AIMessage):
                    content = msg.content
                    if "ANSWER:" in content:
                        trajectory.final_answer = content.split("ANSWER:")[-1].strip()
                    else:
                        trajectory.final_answer = content.strip()
                    break
    except Exception as exc:
        trajectory.error = str(exc)
        trajectory.final_answer = ""
    trajectory.total_duration_ms = (time.perf_counter() - start) * 1000
    return trajectory
Verify the graph structure without an API key:
from unittest.mock import MagicMock
from agent import build_graph
mock_model = MagicMock()
mock_model.invoke.return_value = MagicMock(content="no code here")
app = build_graph(mock_model)
nodes = list(app.get_graph().nodes)
print("Graph nodes:", sorted(nodes))
assert "reason" in nodes
assert "act" in nodes
print("graph_structure_ok")
Step 5: Build the Scoring Functions
Three metrics are scored for each run: correctness (did the output match the expected answer?), tool-call efficiency (did the agent stay within its tool-call budget?), and failure-free execution (did every tool call succeed?).
# filename: scorers.py
from trajectory import AgentTrajectory


def score_correctness(trajectory: AgentTrajectory, expected_output: str) -> float:
    """
    Returns 1.0 if any tool output or the final answer contains the expected
    output, 0.0 otherwise. Strips whitespace and normalizes case for comparison.
    """
    expected = expected_output.strip().lower()
    # Check final answer
    if expected in trajectory.final_answer.strip().lower():
        return 1.0
    # Check tool outputs (the agent may have printed the answer in a code block)
    for tc in trajectory.tool_calls:
        if expected in tc.tool_output.strip().lower():
            return 1.0
    return 0.0


def score_efficiency(trajectory: AgentTrajectory, max_tool_calls: int) -> float:
    """
    Returns 1.0 if the agent used at most max_tool_calls tool calls,
    scaling down linearly for each extra call, floored at 0.0.
    A one-call overage gives 0.5; two or more gives 0.0.
    """
    actual = trajectory.tool_call_count
    if actual <= max_tool_calls:
        return 1.0
    overage = actual - max_tool_calls
    return max(0.0, 1.0 - (overage * 0.5))


def score_no_failures(trajectory: AgentTrajectory) -> float:
    """
    Returns 1.0 if no tool calls failed, 0.0 if any did.
    """
    if trajectory.tool_call_count == 0:
        return 1.0
    return 0.0 if trajectory.failed_tool_calls > 0 else 1.0


def compute_scores(
    trajectory: AgentTrajectory,
    expected_output: str,
    max_tool_calls: int,
) -> dict[str, float]:
    return {
        "correctness": score_correctness(trajectory, expected_output),
        "efficiency": score_efficiency(trajectory, max_tool_calls),
        "no_failures": score_no_failures(trajectory),
    }
Verify the scorers with a synthetic trajectory:
from trajectory import AgentTrajectory
from scorers import compute_scores
traj = AgentTrajectory(task_id="t1", prompt="add 3+4", final_answer="7")
traj.add_tool_call(
tool_name="execute_python",
tool_input={"code": "print(3+4)"},
tool_output="7",
duration_ms=12.0,
success=True,
)
scores = compute_scores(traj, expected_output="7", max_tool_calls=2)
print("Scores:", scores)
assert scores["correctness"] == 1.0
assert scores["efficiency"] == 1.0
assert scores["no_failures"] == 1.0
# Test efficiency penalty
traj2 = AgentTrajectory(task_id="t2", prompt="add", final_answer="7")
for _ in range(4):
    traj2.add_tool_call("execute_python", {"code": ""}, "7", 10.0, True)
scores2 = compute_scores(traj2, expected_output="7", max_tool_calls=2)
print("Efficiency with 4 calls (max 2):", scores2["efficiency"])
assert scores2["efficiency"] == 0.0
print("scorers_ok")
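To make the efficiency formula concrete, the sketch below tabulates the score for a budget of two tool calls. The efficiency function is a standalone restatement of score_efficiency that takes plain integers, with the 0.5 penalty exposed as a parameter for illustration only.

```python
def efficiency(actual_calls: int, max_tool_calls: int, penalty: float = 0.5) -> float:
    """Score 1.0 within budget, minus `penalty` per extra call, floored at 0.0."""
    overage = max(0, actual_calls - max_tool_calls)
    return max(0.0, 1.0 - overage * penalty)


budget = 2
table = {calls: efficiency(calls, budget) for calls in range(1, 6)}
for calls, score in table.items():
    print(f"calls={calls} budget={budget} efficiency={score:.1f}")
# calls=1 and calls=2 score 1.0, calls=3 scores 0.5, calls=4 and calls=5 score 0.0
```

One consequence of how the pieces fit together: should_continue in Step 4 stops the agent at max_tool_calls + 1 calls, so trajectories produced by this harness land on 1.0 or 0.5. The 0.0 band only appears for trajectories built by hand or by an agent with a looser stop condition.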
Step 6: Build the Braintrust Experiment Runner
The harness loops over the eval dataset, runs the agent on each task, scores the trajectory, and logs results to a Braintrust experiment. Because Braintrust’s init call requires a valid API key, the runner is split in two: run_eval_locally scores trajectories with no Braintrust dependency, and run_eval_with_braintrust adds the upload. That split keeps the harness testable without credentials.
# filename: harness.py
import os
from typing import Any

from eval_dataset import EVAL_TASKS, EvalTask
from scorers import compute_scores


def run_eval_locally(
    model: Any,
    tasks: list[EvalTask] | None = None,
) -> list[dict[str, Any]]:
    """
    Run the eval harness without Braintrust. Returns a list of result dicts.
    Useful for CI pipelines that assert on scores directly.
    """
    from agent import run_agent

    if tasks is None:
        tasks = EVAL_TASKS
    results = []
    for task in tasks:
        print(f"Running task {task['id']}: {task['prompt'][:60]}...")
        trajectory = run_agent(
            task_id=task["id"],
            prompt=task["prompt"],
            max_tool_calls=task["max_tool_calls"],
            model=model,
        )
        scores = compute_scores(
            trajectory,
            expected_output=task["expected_output"],
            max_tool_calls=task["max_tool_calls"],
        )
        result = {
            "task_id": task["id"],
            "scores": scores,
            "trajectory": trajectory.to_dict(),
            "tags": task["tags"],
        }
        results.append(result)
        print(
            f"  correctness={scores['correctness']:.2f} "
            f"efficiency={scores['efficiency']:.2f} "
            f"no_failures={scores['no_failures']:.2f}"
        )
    return results


def run_eval_with_braintrust(
    model: Any,
    experiment_name: str,
    tasks: list[EvalTask] | None = None,
) -> None:
    """
    Run the eval harness and log results to a Braintrust experiment.
    Requires BRAINTRUST_API_KEY to be set in the environment.
    """
    import braintrust
    from agent import run_agent

    if tasks is None:
        tasks = EVAL_TASKS
    api_key = os.environ.get("BRAINTRUST_API_KEY")
    if not api_key:
        raise EnvironmentError(
            "BRAINTRUST_API_KEY must be set to use run_eval_with_braintrust"
        )
    experiment = braintrust.init(
        project="coding-agent-evals",
        experiment=experiment_name,
        api_key=api_key,
    )
    for task in tasks:
        print(f"Running task {task['id']}...")
        trajectory = run_agent(
            task_id=task["id"],
            prompt=task["prompt"],
            max_tool_calls=task["max_tool_calls"],
            model=model,
        )
        scores = compute_scores(
            trajectory,
            expected_output=task["expected_output"],
            max_tool_calls=task["max_tool_calls"],
        )
        experiment.log(
            input={"prompt": task["prompt"]},
            output=trajectory.final_answer,
            expected=task["expected_output"],
            scores=scores,
            metadata={
                "task_id": task["id"],
                "tool_call_count": trajectory.tool_call_count,
                "total_duration_ms": trajectory.total_duration_ms,
                "tags": task["tags"],
                "tool_calls": [
                    {
                        "name": tc.tool_name,
                        "duration_ms": tc.duration_ms,
                        "success": tc.success,
                    }
                    for tc in trajectory.tool_calls
                ],
            },
        )
    # Flush buffered log calls before the process can exit
    experiment.flush()
    print(f"Experiment '{experiment_name}' logged to Braintrust.")
Step 7: Wire a Mock Model for CI
For CI pipelines without LLM API keys, use a deterministic mock model that returns correct code for each task. This lets the harness validate scoring logic and trajectory capture without any external calls.
# filename: mock_model.py
from langchain_core.messages import AIMessage

CODE_RESPONSES = {
    "add": "```python\nprint(3 + 4)\n```\nANSWER: 7",
    "factorial": "```python\ndef factorial(n):\n    return 1 if n <= 1 else n * factorial(n - 1)\nprint(factorial(5))\n```\nANSWER: 120",
    "fibonacci": "```python\nfibs = [0, 1]\nfor _ in range(8):\n    fibs.append(fibs[-1] + fibs[-2])\nprint(','.join(str(x) for x in fibs))\n```\nANSWER: 0,1,1,2,3,5,8,13,21,34",
    "reverse": "```python\nprint('hello world'[::-1])\n```\nANSWER: dlrow olleh",
}


class MockCodingModel:
    """Deterministic mock that returns correct code for each task type."""

    def invoke(self, messages: list) -> AIMessage:
        # Look only at the most recent message: the task prompt on the first
        # turn, the execution output on later turns
        last_content = ""
        for msg in reversed(messages):
            if hasattr(msg, "content"):
                last_content = msg.content.lower()
                break
        for keyword, response in CODE_RESPONSES.items():
            if keyword in last_content:
                return AIMessage(content=response)
        # Default: return a no-op that signals completion
        return AIMessage(content="ANSWER: done")
Verify it works
Run the full harness end-to-end using the mock model. Every task should score 1.0 on correctness.
from mock_model import MockCodingModel
from harness import run_eval_locally
from eval_dataset import EVAL_TASKS
model = MockCodingModel()
results = run_eval_locally(model, tasks=EVAL_TASKS)
print("\n=== Eval Summary ===")
total_correctness = 0.0
for r in results:
    c = r["scores"]["correctness"]
    e = r["scores"]["efficiency"]
    total_correctness += c
    print(f"{r['task_id']}: correctness={c:.2f} efficiency={e:.2f} tool_calls={r['trajectory']['tool_call_count']}")
avg = total_correctness / len(results)
print(f"\nAverage correctness: {avg:.2f}")
assert avg >= 0.75, f"Expected avg correctness >= 0.75, got {avg:.2f}"
print("eval_harness_ok")
To run with Braintrust and a real LLM, set your environment variables and call run_eval_with_braintrust instead:
# This block requires BRAINTRUST_API_KEY, an OpenAI API key, and the
# langchain-openai package.
# Shown here for reference; skip in CI without credentials.
from langchain_openai import ChatOpenAI
from harness import run_eval_with_braintrust

model = ChatOpenAI(model="gpt-4o-mini", temperature=0)
run_eval_with_braintrust(
    model=model,
    experiment_name="coding-agent-v1",
)
To swap in OpenSandbox [1] for real isolated execution, replace the subprocess.run call in code_tool.py with:
# OpenSandbox integration (requires a running OpenSandbox server).
# Replace the subprocess block in execute_python_code with something like the
# following. The client and method names are illustrative; confirm them against
# the OpenSandbox SDK version you install before uncommenting.
import opensandbox
# client = opensandbox.Client(domain="localhost:8080", protocol="http")
# sandbox = client.sandbox.create(image="python:3.12", timeout="10m")
# try:
#     result = sandbox.command.run(["python", "-c", cleaned])
#     success = result.exit_code == 0
#     output = result.stdout.strip() if success else f"ERROR: {result.stderr.strip()}"
# finally:
#     sandbox.stop()
Troubleshooting
ModuleNotFoundError: No module named 'braintrust' after install. Run uv pip install braintrust again and confirm the install succeeded with from importlib.metadata import version; print(version('braintrust')). If you’re in a virtual environment, make sure it’s activated before running the harness.
EnvironmentError: BRAINTRUST_API_KEY must be set when calling run_eval_with_braintrust. Export the key before running: export BRAINTRUST_API_KEY=your_key_here. For CI, add it as a repository secret and inject it into the environment at run time.
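Agent hits max tool calls on every task, or returns immediately with no tool calls. With the mock model, an immediate return means the keyword lookup missed: when the task prompt contains none of the keys in CODE_RESPONSES (add, factorial, fibonacci, reverse), the mock replies ANSWER: done, no code runs, and correctness is 0.0. With a real LLM, a model that keeps emitting code blocks runs until should_continue stops it at max_tool_calls + 1 calls. Increase max_tool_calls in the task definition, or tighten the system prompt so the model states its answer as soon as the output looks right.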
Tool output is empty or ERROR: execution timed out. The subprocess executor defaults to a 10-second timeout. For tasks that generate large outputs or run slow algorithms, pass timeout_seconds=30 to execute_python_code. With OpenSandbox [1], set the sandbox timeout parameter at creation time.
Braintrust experiment shows no scores. Call experiment.flush() before the process exits. The Braintrust SDK buffers log calls and may not flush on process exit in short-lived scripts.
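Scores are all 0.0 for correctness even though the agent printed the right answer. The score_correctness function does a case-insensitive substring check against trajectory.final_answer and every tool output, so explanation text around the answer does not cause a miss. A 0.0 usually means the printed text differs in formatting from expected_output (for example 0, 1, 1, 2 with spaces versus 0,1,1,2), or the answer never reached stdout because the code returned a value without printing it. Compare the tool_output fields in the trajectory against expected_output, then either pin the output format in the task prompt or normalize both sides in the scorer.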
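The substring check in score_correctness can also err in the other direction: an expected output of 7 matches 17, or a traceback that mentions line 7. A stricter alternative is to require that one full line equal the expected output after normalization. The sketch below is one possible variant, not the tutorial’s scorer: normalize and exact_line_match are hypothetical names, and dropping all whitespace is an assumption that suits these four tasks but would be wrong for whitespace-sensitive outputs.

```python
def normalize(text: str) -> str:
    """Lowercase and drop all whitespace so '0, 1, 1' and '0,1,1' compare equal."""
    return "".join(text.lower().split())


def exact_line_match(expected: str, text: str) -> float:
    """Return 1.0 only if some full line of `text` equals `expected` after normalization."""
    target = normalize(expected)
    for line in text.splitlines():
        # Ignore an optional 'ANSWER:' prefix on the line
        candidate = normalize(line.split("ANSWER:")[-1])
        if candidate == target:
            return 1.0
    return 0.0


print(exact_line_match("7", "17"))                              # substring check would pass this
print(exact_line_match("7", "ANSWER: 7"))
print(exact_line_match("0,1,1,2,3,5,8,13,21,34", "0, 1, 1, 2, 3, 5, 8, 13, 21, 34"))
print(exact_line_match("7", 'ERROR: File "<string>", line 7'))  # substring check would pass this
```

Applied to each tool output and the final answer in turn, a check like this trades a little leniency for far fewer accidental passes.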
Next steps
- Add a dataset versioning layer. Store your EVAL_TASKS list in a Braintrust dataset object using braintrust.init_dataset(...) so you can track which dataset version each experiment ran against and diff task sets over time.
- Integrate OpenSandbox’s Kubernetes runtime [1]. For large-scale parallel evals, replace the local subprocess executor with OpenSandbox’s distributed scheduler. Each task gets its own isolated container, and you can run hundreds of tasks concurrently without resource contention.
- Add LLM-as-judge scoring. Supplement the string-match correctness scorer with a secondary scorer that uses a small LLM to evaluate semantic correctness. Braintrust’s autoevals library ships an LLMClassifier that integrates directly with the experiment logging API.
- Wire the harness into CI. Add a GitHub Actions step that runs run_eval_locally with the mock model on every pull request and fails the build if average correctness drops below a threshold. Reserve the full Braintrust experiment run for nightly or pre-release builds.
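The CI gate described in the last item can be sketched as a pair of small functions over the result dicts that run_eval_locally returns. Everything here is illustrative: average_scores and check_thresholds are hypothetical helpers, and the sample results and threshold values are made up.

```python
from typing import Any


def average_scores(results: list[dict[str, Any]]) -> dict[str, float]:
    """Average each metric across results shaped like run_eval_locally's output."""
    totals: dict[str, float] = {}
    for result in results:
        for name, value in result["scores"].items():
            totals[name] = totals.get(name, 0.0) + value
    return {name: total / len(results) for name, total in totals.items()}


def check_thresholds(
    results: list[dict[str, Any]],
    thresholds: dict[str, float],
) -> list[str]:
    """Return one message per metric whose average falls below its minimum."""
    averages = average_scores(results)
    return [
        f"{name}: {averages.get(name, 0.0):.2f} < {minimum:.2f}"
        for name, minimum in thresholds.items()
        if averages.get(name, 0.0) < minimum
    ]


# Hypothetical results: one wrong answer and one over-budget run.
results = [
    {"task_id": "task_001", "scores": {"correctness": 1.0, "efficiency": 1.0}},
    {"task_id": "task_002", "scores": {"correctness": 1.0, "efficiency": 0.5}},
    {"task_id": "task_003", "scores": {"correctness": 0.0, "efficiency": 1.0}},
    {"task_id": "task_004", "scores": {"correctness": 1.0, "efficiency": 1.0}},
]
failures = check_thresholds(results, {"correctness": 0.75, "efficiency": 0.9})
for failure in failures:
    print("threshold failed:", failure)
```

In a CI script, ending with sys.exit(1 if failures else 0) turns a non-empty failure list into a failed build step. Here correctness averages exactly 0.75 and passes, while efficiency averages 0.875 and trips the 0.9 minimum.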
FAQ
How does the harness capture agent trajectories?
The harness records each tool call the agent makes in an AgentTrajectory object, storing the tool name, input, output, duration, and success status. This trajectory data is then passed to scoring functions and logged to Braintrust for tracking over time.
What scoring metrics does the harness use?
The harness scores on three metrics: correctness (whether the expected answer appears in the final answer or any tool output), efficiency (whether the agent stayed within the allowed number of tool calls), and no_failures (whether every tool call succeeded). Efficiency is 1.0 within the budget and drops by 0.5 for each extra tool call, floored at 0.0.
Can the harness run without Braintrust API credentials?
Yes. The run_eval_locally function runs the full eval pipeline independently of Braintrust, making it suitable for CI pipelines that assert on scores directly without external API calls. The Braintrust integration is optional and only required for experiment tracking.
How does the harness execute untrusted agent code safely?
The tutorial uses a subprocess-based executor with a timeout for local testing, but the code_tool.py module is designed to swap in OpenSandbox’s isolated container runtime for production. OpenSandbox provides Docker and Kubernetes backends with per-sandbox egress controls.
What is the mock model used for in CI?
The MockCodingModel provides deterministic, correct responses for each task type without requiring LLM API keys. This allows CI pipelines to validate the harness structure, scoring logic, and trajectory capture without external dependencies.