# Building a Multi-Agent Eval Harness with LangGraph and LangSmith

> Build a two-agent LangGraph pipeline, then wire a LangSmith eval harness around it that scores each node's output on correctness and latency. The result is a reproducible Python project with two evaluator functions you can drop straight into CI.

- Canonical URL: https://agentry.press/tutorial/building-a-multi-agent-eval-harness-with-langgraph-and-langsmith/
- Type: Tutorial
- Published: 2026-05-31
- By: agentry
- Tags: langgraph, langsmith, multi-agent, evaluation, ci

---

## Why this matters

The LATTE paper [2] demonstrates that fixed-role multi-agent pipelines waste tokens and wall-clock time because they can't adapt coordination mid-task. The paper's core insight is that agents should maintain a shared, evolving task graph rather than executing a static decomposition. LangGraph's node-and-edge model maps directly onto that idea: each node is an agent, edges encode dependencies, and the state object is the shared coordination surface.

The missing piece for most teams is evaluation. Without per-node measurements, you can't tell whether a latency regression comes from a slow planner node or a slow executor node, and you can't catch correctness regressions before they reach production. LangSmith's `evaluate` API lets you attach evaluator functions to a dataset and get structured results back in a single call, making it straightforward to embed in CI.

This tutorial builds a minimal but realistic harness: a two-node LangGraph pipeline (planner + executor), a LangSmith dataset with ground-truth examples, and two evaluator functions (correctness and latency) that produce a JSON report you can `assert` against in pytest.

## Prerequisites

- Python 3.11 or 3.12
- A LangSmith account and API key (`LANGSMITH_API_KEY`)
- Familiarity with LangGraph basics (nodes, edges, `StateGraph`)
- Basic pytest knowledge

## Setup

Install the required packages. `langsmith` ships the `evaluate` API and dataset management, `langgraph` provides the graph runtime, `langchain-core` supplies the message primitives both libraries share, and `pytest` runs the tests in Step 5.

```bash
uv pip install langgraph langsmith langchain-core pytest
```

Export your LangSmith credentials. The eval harness reads these at runtime to push datasets and pull results.

```bash
export LANGSMITH_API_KEY="your-langsmith-api-key"
export LANGSMITH_TRACING=true
```

## Step 1: Define the shared state and agent nodes

The pipeline has two nodes: a **planner** that breaks a question into a numbered plan, and an **executor** that answers the question given that plan. Both nodes receive and return the same `AgentState` TypedDict, which is the shared coordination surface described in LATTE [2].

The nodes are written as plain functions that accept a state dict and return a partial update. No LLM client is constructed at module load time, so the module imports cleanly without credentials.

```python
# filename: pipeline.py
from __future__ import annotations

import time
from typing import Any, TypedDict

from langgraph.graph import StateGraph, END


class AgentState(TypedDict):
    question: str
    plan: str
    answer: str
    planner_latency_ms: float
    executor_latency_ms: float


def planner_node(state: AgentState, model_fn=None) -> dict[str, Any]:
    """Produce a numbered plan for the question."""
    t0 = time.perf_counter()
    question = state["question"]
    if model_fn is not None:
        plan = model_fn(f"Write a short numbered plan to answer: {question}")
    else:
        # Deterministic stub used in tests and CI without an LLM key.
        plan = (
            "1. Identify the key concepts in the question.\n"
            "2. Retrieve relevant facts.\n"
            "3. Synthesise a concise answer."
        )
    elapsed = (time.perf_counter() - t0) * 1000
    return {"plan": plan, "planner_latency_ms": round(elapsed, 2)}


def executor_node(state: AgentState, model_fn=None) -> dict[str, Any]:
    """Answer the question using the plan."""
    t0 = time.perf_counter()
    question = state["question"]
    plan = state.get("plan", "")
    if model_fn is not None:
        prompt = f"Plan:\n{plan}\n\nQuestion: {question}\nAnswer:"
        answer = model_fn(prompt)
    else:
        # Deterministic stub: echo the question back as a simple answer.
        answer = f"Answer to '{question}': 42 (stub)"
    elapsed = (time.perf_counter() - t0) * 1000
    return {"answer": answer, "executor_latency_ms": round(elapsed, 2)}


def build_graph(model_fn=None):
    """Compile the two-node LangGraph pipeline.

    model_fn: optional callable(prompt: str) -> str injected into both nodes.
    Passing None uses the deterministic stubs, which is the default for CI.
    """
    builder = StateGraph(AgentState)

    builder.add_node("planner", lambda s: planner_node(s, model_fn))
    builder.add_node("executor", lambda s: executor_node(s, model_fn))

    builder.set_entry_point("planner")
    builder.add_edge("planner", "executor")
    builder.add_edge("executor", END)

    return builder.compile()
```

Verify the graph compiles and has the expected nodes:

```python
from pipeline import build_graph

graph = build_graph()  # no model_fn -> uses stubs
nodes = list(graph.get_graph().nodes.keys())
print("nodes:", nodes)
assert "planner" in nodes
assert "executor" in nodes
print("graph_compile_ok")
```
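
Both nodes repeat the same `time.perf_counter()` bookkeeping. One way to factor it out is a small decorator that adds the latency field to whatever partial update the node returns. This is only a sketch and not part of `pipeline.py`; the names `timed` and `demo_node` are hypothetical.

```python
import time
from functools import wraps
from typing import Any, Callable

NodeFn = Callable[[dict[str, Any]], dict[str, Any]]


def timed(latency_field: str) -> Callable[[NodeFn], NodeFn]:
    """Wrap a node so its wall-clock time is added to the partial update."""

    def decorator(node_fn: NodeFn) -> NodeFn:
        @wraps(node_fn)
        def wrapper(state: dict[str, Any]) -> dict[str, Any]:
            t0 = time.perf_counter()
            update = node_fn(state)
            elapsed_ms = (time.perf_counter() - t0) * 1000
            # Merge the measured latency into the node's own update.
            return {**update, latency_field: round(elapsed_ms, 2)}

        return wrapper

    return decorator


@timed("planner_latency_ms")
def demo_node(state: dict[str, Any]) -> dict[str, Any]:
    # Hypothetical stand-in for planner_node, minus the timing code.
    return {"plan": f"1. Answer: {state['question']}"}


update = demo_node({"question": "What is the capital of France?"})
print(update)
```

The trade-off is that the latency field name moves out of the node body and into the decorator argument, which keeps each node focused on producing its output.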

## Step 2: Run the pipeline on a sample input

Invoke the compiled graph with a sample question to confirm the state flows through both nodes and both latency fields are populated.

```python
from pipeline import build_graph

graph = build_graph()
result = graph.invoke({
    "question": "What is the capital of France?",
    "plan": "",
    "answer": "",
    "planner_latency_ms": 0.0,
    "executor_latency_ms": 0.0,
})

print("plan:", result["plan"][:60])
print("answer:", result["answer"])
print("planner_latency_ms:", result["planner_latency_ms"])
print("executor_latency_ms:", result["executor_latency_ms"])
assert result["planner_latency_ms"] >= 0
assert result["executor_latency_ms"] >= 0
assert "Answer to" in result["answer"]
print("pipeline_run_ok")
```
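
Each node returns only the keys it changed. Because `AgentState` declares no reducers, the graph overwrites those keys in the shared state and carries the rest over. The plain-Python sketch below mimics that merge for intuition only; `apply_update` is a hypothetical helper, not a LangGraph API.

```python
from typing import Any


def apply_update(state: dict[str, Any], update: dict[str, Any]) -> dict[str, Any]:
    """Return a new state with the node's partial update applied (last write wins)."""
    return {**state, **update}


state = {
    "question": "What is the capital of France?",
    "plan": "",
    "answer": "",
    "planner_latency_ms": 0.0,
    "executor_latency_ms": 0.0,
}

# The planner touches only its own keys; everything else carries over unchanged.
state = apply_update(state, {"plan": "1. Look it up.", "planner_latency_ms": 0.01})
# The executor then sees the plan written by the planner.
state = apply_update(state, {"answer": "Paris", "executor_latency_ms": 0.02})

print(state)
```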

## Step 3: Define the evaluator functions

LangSmith evaluators are plain Python callables with the signature `(run, example) -> dict`. The `run` object carries the pipeline's outputs; `example` carries the ground-truth reference from the dataset.

Two evaluators are defined here:

1. **correctness_evaluator** checks whether the expected keyword appears in the answer.
2. **latency_evaluator** checks that total pipeline latency stays under a configurable threshold.

> [!PULLQUOTE]
> Evaluator functions are plain Python callables. No framework lock-in, no special base class: if it returns a dict with a `score` key, LangSmith accepts it.

```python
# filename: evaluators.py
from __future__ import annotations

LATENCY_THRESHOLD_MS = 2000  # CI budget: total pipeline must finish in 2 s


def correctness_evaluator(run, example) -> dict:
    """Score 1 if the expected keyword appears in the answer, else 0."""
    outputs = run.outputs or {}
    reference = example.outputs or {}

    answer = outputs.get("answer", "").lower()
    expected_keyword = reference.get("expected_keyword", "").lower()

    if not expected_keyword:
        return {"key": "correctness", "score": None, "comment": "no reference keyword"}

    score = 1 if expected_keyword in answer else 0
    return {
        "key": "correctness",
        "score": score,
        "comment": f"keyword='{expected_keyword}' found={bool(score)}",
    }


def latency_evaluator(run, example) -> dict:
    """Score 1 if total latency is under threshold, else 0."""
    outputs = run.outputs or {}
    planner_ms = outputs.get("planner_latency_ms", 0.0)
    executor_ms = outputs.get("executor_latency_ms", 0.0)
    total_ms = planner_ms + executor_ms

    score = 1 if total_ms < LATENCY_THRESHOLD_MS else 0
    return {
        "key": "latency",
        "score": score,
        "comment": f"total={total_ms:.1f}ms threshold={LATENCY_THRESHOLD_MS}ms",
    }
```

Confirm the evaluators return the right shape using mock objects:

```python
from evaluators import correctness_evaluator, latency_evaluator

class MockRun:
    outputs = {
        "answer": "Answer to 'What is the capital of France?': 42 (stub)",
        "planner_latency_ms": 1.2,
        "executor_latency_ms": 0.8,
    }

class MockExample:
    outputs = {"expected_keyword": "stub"}

run = MockRun()
example = MockExample()

cr = correctness_evaluator(run, example)
lr = latency_evaluator(run, example)

print("correctness:", cr)
print("latency:", lr)
assert cr["key"] == "correctness"
assert cr["score"] == 1
assert lr["key"] == "latency"
assert lr["score"] == 1
print("evaluators_ok")
```
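
`latency_evaluator` scores the sum of both nodes, so a failing score does not say which node slowed down. A possible extension is one evaluator per latency field. The sketch below is an assumption, not part of `evaluators.py`: `make_node_latency_evaluator` and the per-node budgets are hypothetical, and `SimpleNamespace` stands in for the LangSmith `run` and `example` objects.

```python
from types import SimpleNamespace
from typing import Callable


def make_node_latency_evaluator(field: str, threshold_ms: float) -> Callable:
    """Build a (run, example) evaluator that checks one latency field."""
    key = field.removesuffix("_ms")  # e.g. "planner_latency"

    def evaluator(run, example) -> dict:
        outputs = run.outputs or {}
        value = outputs.get(field)
        if value is None:
            return {"key": key, "score": None, "comment": f"{field} missing"}
        score = 1 if value < threshold_ms else 0
        return {
            "key": key,
            "score": score,
            "comment": f"{field}={value:.1f}ms threshold={threshold_ms}ms",
        }

    return evaluator


# Hypothetical budgets: 500 ms for planning, 1500 ms for execution.
planner_latency = make_node_latency_evaluator("planner_latency_ms", 500)
executor_latency = make_node_latency_evaluator("executor_latency_ms", 1500)

run = SimpleNamespace(outputs={"planner_latency_ms": 620.0, "executor_latency_ms": 800.0})
example = SimpleNamespace(outputs={})

planner_result = planner_latency(run, example)
executor_result = executor_latency(run, example)
print(planner_result)
print(executor_result)
```

In this run the total (1420 ms) would pass the 2000 ms budget, while the per-node check flags the planner as over its own budget.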

## Step 4: Build the eval harness

The harness does three things:

1. Creates (or reuses) a LangSmith dataset with ground-truth examples.
2. Wraps the LangGraph pipeline as a `target` function that LangSmith can call.
3. Runs `client.evaluate(target, data=dataset_name, evaluators=[...])` and writes a JSON report.

The harness is written as a module so it can be imported by pytest or run directly.

```python
# filename: harness.py
from __future__ import annotations

import json
import os
from typing import Any

from langsmith import Client

from evaluators import correctness_evaluator, latency_evaluator
from pipeline import build_graph

DATASET_NAME = "multi-agent-eval-tutorial"

EXAMPLES = [
    {
        "inputs": {
            "question": "What is the capital of France?",
            "plan": "",
            "answer": "",
            "planner_latency_ms": 0.0,
            "executor_latency_ms": 0.0,
        },
        "outputs": {"expected_keyword": "stub"},
    },
    {
        "inputs": {
            "question": "What is 2 + 2?",
            "plan": "",
            "answer": "",
            "planner_latency_ms": 0.0,
            "executor_latency_ms": 0.0,
        },
        "outputs": {"expected_keyword": "stub"},
    },
    {
        "inputs": {
            "question": "Name a primary colour.",
            "plan": "",
            "answer": "",
            "planner_latency_ms": 0.0,
            "executor_latency_ms": 0.0,
        },
        "outputs": {"expected_keyword": "stub"},
    },
]


def get_or_create_dataset(client: Client) -> str:
    """Return the dataset name, creating it with examples if it doesn't exist."""
    datasets = list(client.list_datasets(dataset_name=DATASET_NAME))
    if datasets:
        print(f"Reusing existing dataset: {DATASET_NAME}")
        return DATASET_NAME

    print(f"Creating dataset: {DATASET_NAME}")
    dataset = client.create_dataset(DATASET_NAME)
    for ex in EXAMPLES:
        client.create_example(
            inputs=ex["inputs"],
            outputs=ex["outputs"],
            dataset_id=dataset.id,
        )
    print(f"Uploaded {len(EXAMPLES)} examples.")
    return DATASET_NAME


def pipeline_target(inputs: dict[str, Any]) -> dict[str, Any]:
    """Wrap the LangGraph pipeline for LangSmith's evaluate() call."""
    graph = build_graph()  # stubs; swap in model_fn for a live run
    return graph.invoke(inputs)


def run_eval(report_path: str = "eval_report.json") -> dict:
    """Run the full eval harness and write a JSON report."""
    api_key = os.environ.get("LANGSMITH_API_KEY", "")
    if not api_key:
        raise EnvironmentError("LANGSMITH_API_KEY is not set.")

    client = Client()
    dataset_name = get_or_create_dataset(client)

    results = client.evaluate(
        pipeline_target,
        data=dataset_name,
        evaluators=[correctness_evaluator, latency_evaluator],
        experiment_prefix="tutorial-ci",
        max_concurrency=1,
    )

    rows = []
    for r in results:
        # Each entry in "results" is an EvaluationResult object, so use attribute access.
        evals = r.get("evaluation_results", {}).get("results") or []
        row = {
            "run_id": str(r.get("run").id) if r.get("run") else None,
            "scores": {ev.key: ev.score for ev in evals},
        }
        rows.append(row)

    report = {"dataset": dataset_name, "num_examples": len(rows), "results": rows}
    with open(report_path, "w") as f:
        json.dump(report, f, indent=2)
    print(f"Report written to {report_path}")
    return report


if __name__ == "__main__":
    report = run_eval()
    print(json.dumps(report, indent=2))
```
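
The report stores one row of scores per example. For a CI log line it can help to reduce those rows to a mean score per evaluator. A minimal sketch, assuming the report shape written by `run_eval`; `summarize_report` is a hypothetical helper and the sample data is made up for illustration.

```python
from collections import defaultdict


def summarize_report(report: dict) -> dict[str, float]:
    """Return the mean score per evaluator key, skipping None scores."""
    totals: dict[str, list[float]] = defaultdict(list)
    for row in report["results"]:
        for key, score in row["scores"].items():
            if score is not None:  # None means the evaluator abstained
                totals[key].append(float(score))
    return {key: sum(vals) / len(vals) for key, vals in totals.items()}


sample_report = {
    "dataset": "multi-agent-eval-tutorial",
    "num_examples": 4,
    "results": [
        {"run_id": "a", "scores": {"correctness": 1, "latency": 1}},
        {"run_id": "b", "scores": {"correctness": 0, "latency": 1}},
        {"run_id": "c", "scores": {"correctness": 1, "latency": 1}},
        {"run_id": "d", "scores": {"correctness": None, "latency": 1}},
    ],
}

summary = summarize_report(sample_report)
print(summary)
```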

## Step 5: Write the pytest integration

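The pytest file contains three tests: a live test that runs the harness and checks that a report is produced (skipped when `LANGSMITH_API_KEY` is unset), a test that demonstrates the score-assertion pattern on a simulated report, and a unit test for the latency evaluator's failure case. This is the file you add to your CI pipeline.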

```python
# filename: test_eval_harness.py
import json
import os
import pytest


def _load_report(path: str = "eval_report.json") -> dict:
    with open(path) as f:
        return json.load(f)


@pytest.mark.skipif(
    not os.environ.get("LANGSMITH_API_KEY"),
    reason="LANGSMITH_API_KEY not set; skipping live eval",
)
def test_run_eval_produces_report(tmp_path):
    from harness import run_eval
    report_path = str(tmp_path / "report.json")
    report = run_eval(report_path=report_path)
    assert report["num_examples"] > 0, "No examples evaluated"


def test_report_correctness_scores(tmp_path):
    """Simulate a report and assert all correctness scores are 1."""
    fake_report = {
        "dataset": "multi-agent-eval-tutorial",
        "num_examples": 3,
        "results": [
            {"run_id": "a", "scores": {"correctness": 1, "latency": 1}},
            {"run_id": "b", "scores": {"correctness": 1, "latency": 1}},
            {"run_id": "c", "scores": {"correctness": 1, "latency": 1}},
        ],
    }
    report_path = str(tmp_path / "report.json")
    with open(report_path, "w") as f:
        json.dump(fake_report, f)

    report = _load_report(report_path)
    for row in report["results"]:
        assert row["scores"]["correctness"] == 1, f"Correctness failed: {row}"
        assert row["scores"]["latency"] == 1, f"Latency failed: {row}"
    print("all_scores_pass")


def test_latency_evaluator_fails_slow_run():
    """Confirm the latency evaluator correctly scores a slow run as 0."""
    from evaluators import latency_evaluator

    class SlowRun:
        outputs = {"planner_latency_ms": 1500.0, "executor_latency_ms": 800.0}

    class AnyExample:
        outputs = {}

    result = latency_evaluator(SlowRun(), AnyExample())
    assert result["score"] == 0, "Expected latency score 0 for slow run"
    print("latency_fail_case_ok")
```
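
`test_report_correctness_scores` runs against a simulated report. To gate on a real one, the same loop can be pulled into a helper that collects every failing row before asserting, so the CI log shows all failures at once. A sketch, assuming the report shape written by `run_eval`; `failing_rows` is a hypothetical name.

```python
def failing_rows(
    report: dict, required: tuple[str, ...] = ("correctness", "latency")
) -> list[dict]:
    """Return rows where any required evaluator is missing or did not score 1."""
    return [
        row
        for row in report["results"]
        if any(row["scores"].get(key) != 1 for key in required)
    ]


report = {
    "results": [
        {"run_id": "a", "scores": {"correctness": 1, "latency": 1}},
        {"run_id": "b", "scores": {"correctness": 1, "latency": 0}},
        {"run_id": "c", "scores": {"correctness": 1}},  # latency score missing
    ],
}

bad = failing_rows(report)
print([row["run_id"] for row in bad])
```

Inside a test, `bad = failing_rows(report)` followed by `assert not bad, bad` fails with the full list of offending rows in the message.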

## Verify it works

Run the two tests that don't require a LangSmith key. They exercise the evaluator logic and the report-assertion pattern end-to-end.

```bash
python -m pytest test_eval_harness.py::test_report_correctness_scores test_eval_harness.py::test_latency_evaluator_fails_slow_run -v
```

To run the full live eval (requires `LANGSMITH_API_KEY`):

```bash
python -m pytest test_eval_harness.py -v
```

You can also invoke the harness directly to get the JSON report:

```bash
python harness.py
```

## Troubleshooting

**`EnvironmentError: LANGSMITH_API_KEY is not set`** — Export the variable before running: `export LANGSMITH_API_KEY="ls-..."`. The harness checks for it explicitly so the error surfaces early rather than as an obscure HTTP 401.

**`langsmith.utils.LangSmithNotFoundError` during dataset setup**: The `list_datasets` call filters by name. When nothing matches, the API returns an empty iterator and the harness creates a new dataset, so the lookup itself is not expected to fail. If you see this error on `create_example`, your API key may not have write permissions on the project. Check the LangSmith UI under Settings > API Keys.

**`ModuleNotFoundError: No module named 'pipeline'`**: Run pytest from the directory where `pipeline.py` lives. The test file imports `pipeline`, `evaluators`, and `harness` as top-level modules, so that directory must be on `sys.path`. If your CI runner changes directories, add a `conftest.py` next to the modules that calls `sys.path.insert(0, os.path.dirname(__file__))`.

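**Latency evaluator always scores 0 in CI**: The `LATENCY_THRESHOLD_MS` constant in `evaluators.py` defaults to 2000 ms. The timers wrap only the node bodies, so with the stubs the total should sit far below that budget. A score of 0 usually points to a live `model_fn` running on a slow CI runner or against a cold-start model endpoint. Raise the threshold or parameterise it via an environment variable for CI.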
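
One way to parameterise the budget is to read it from the environment and fall back to the current value. A sketch, assuming an environment variable named `LATENCY_THRESHOLD_MS`; that name and the helper `read_latency_threshold` are hypothetical, and nothing in LangSmith reads the variable on its own.

```python
import os


def read_latency_threshold(default_ms: float = 2000.0) -> float:
    """Read the latency budget from the environment, falling back to the default."""
    raw = os.environ.get("LATENCY_THRESHOLD_MS", "")
    try:
        value = float(raw)
    except ValueError:
        return default_ms  # unset or not a number
    return value if value > 0 else default_ms


os.environ["LATENCY_THRESHOLD_MS"] = "5000"
ci_threshold = read_latency_threshold()

os.environ["LATENCY_THRESHOLD_MS"] = "not-a-number"
fallback_threshold = read_latency_threshold()

print(ci_threshold, fallback_threshold)
```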

**`client.evaluate` results appear empty on a second pass**: Depending on your `langsmith` version, the `results` object may behave like a one-shot iterator, and `run_eval` already iterates it once. If you need to inspect results after writing the report, materialise them first with `all_results = list(results)` and reuse that list.

**Graph node not found error after editing `pipeline.py`**: Python caches imported modules in `sys.modules`, so re-importing `pipeline` in a running interpreter or notebook does not pick up your edits, and any graph compiled earlier keeps its old node names. Restart the interpreter (or call `importlib.reload(pipeline)`) and then call `build_graph()` again.

## Next steps

- **Add an LLM-graded evaluator**: Replace the keyword match in `correctness_evaluator` with a call to a judge model (e.g. GPT-4o or Claude) that scores semantic correctness on a 0-1 scale. LangSmith's `evaluate` API accepts any callable, so the harness call stays the same and only the evaluator body changes.
- **Map LATTE's adaptive task graph onto LangGraph** — The LATTE paper [2] shows that agents sharing a mutable coordination graph reduce token usage and wall-clock time. Extend `AgentState` with a `task_graph` field and let the planner node update it dynamically, then add a third evaluator that scores coordination efficiency (tokens per correct answer).
- **Integrate with GitHub Actions**: Add a `.github/workflows/eval.yml` that exports `LANGSMITH_API_KEY` from GitHub Secrets, runs `pytest test_eval_harness.py`, and uploads the JSON report as a workflow artifact. As written, `test_report_correctness_scores` checks a simulated report, so apply the same per-row assertions to the report returned by `run_eval()` to turn it into a real pass/fail gate.
- **Version your datasets**: LangSmith supports dataset versioning. Record the git SHA alongside the dataset, for example in its description (`client.create_dataset(name, description=git_sha)`), so eval results can be traced back to a fixed ground-truth snapshot.
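
For the first item above, a judge-based evaluator can keep the same `(run, example)` shape and take the judge as an injected callable, mirroring how `model_fn` is injected into the nodes. This is a sketch under assumptions: `make_judge_evaluator`, `judge_fn`, the prompt wording, and the `expected_answer` reference field are all hypothetical, and the stub judge stands in for a real model call.

```python
from types import SimpleNamespace
from typing import Callable


def make_judge_evaluator(judge_fn: Callable[[str], str]) -> Callable:
    """Build an evaluator that asks judge_fn for a 0-1 correctness score."""

    def evaluator(run, example) -> dict:
        answer = (run.outputs or {}).get("answer", "")
        expected = (example.outputs or {}).get("expected_answer", "")
        prompt = (
            "Rate from 0 to 1 how well the answer matches the reference.\n"
            f"Reference: {expected}\nAnswer: {answer}\nScore:"
        )
        raw = judge_fn(prompt)
        try:
            score = min(1.0, max(0.0, float(raw.strip())))  # clamp to [0, 1]
        except ValueError:
            # The judge replied with something that is not a number.
            return {"key": "judge_correctness", "score": None, "comment": f"unparseable: {raw!r}"}
        return {"key": "judge_correctness", "score": score, "comment": f"judge said {raw.strip()}"}

    return evaluator


def stub_judge(prompt: str) -> str:
    # Deterministic stand-in: full marks only if the reference appears in the answer.
    reference = prompt.split("Reference: ")[1].split("\n")[0]
    answer = prompt.split("Answer: ")[1].split("\n")[0]
    return "1.0" if reference.lower() in answer.lower() else "0.0"


run = SimpleNamespace(outputs={"answer": "The capital of France is Paris."})
example = SimpleNamespace(outputs={"expected_answer": "Paris"})

judge = make_judge_evaluator(stub_judge)
result = judge(run, example)
print(result)
```

Returning `None` for an unparseable reply keeps a flaky judge from being counted as a wrong answer; whether that is the right policy depends on how strict the CI gate should be.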

## FAQ

### How do you attach evaluators to a LangGraph pipeline in LangSmith?

Wrap the compiled graph in a target function that accepts inputs and returns outputs, then pass that function to `client.evaluate()` along with a list of evaluator callables and a dataset name. LangSmith calls the target on each example and feeds the run outputs and ground-truth reference to each evaluator.

### What signature must an evaluator function have?

An evaluator is a plain Python callable with signature `(run, example) -> dict`. The `run` object carries pipeline outputs, `example` carries ground-truth data from the dataset, and the returned dict must include a `key` and `score` field. No base class or framework lock-in required.

### How do you measure per-node latency in a multi-agent pipeline?

Add latency fields to the shared state dict (e.g., `planner_latency_ms`, `executor_latency_ms`), measure elapsed time in each node using `time.perf_counter()`, and return the rounded milliseconds as part of the node's state update. The evaluator then sums these fields to score total pipeline latency.

### Can the eval harness run without a live LLM?

Yes. The pipeline nodes accept an optional `model_fn` parameter; when None, they use deterministic stubs that return fixed outputs. This allows the harness to run in CI without LLM credentials, and evaluators still score correctness and latency against the stub outputs.

### How do you integrate the eval harness into pytest and CI?

Import the harness module in a pytest file, call `run_eval()` to generate a JSON report, then assert that all rows in the report have correctness and latency scores equal to 1. Add the pytest command to your CI workflow and export `LANGSMITH_API_KEY` from secrets.

## References

1. https://github.com/vercel-labs/open-agents
2. https://arxiv.org/abs/2605.06320v1