# Tracing LangGraph Tool Calls with OpenTelemetry and Phoenix

> Build a LangGraph agent that calls two tools, instruments every tool invocation with OpenInference-compliant OpenTelemetry spans, and exports those spans to a local Phoenix instance. Then generate a cost and latency attribution report directly from the captured trace data.

- Canonical URL: https://agentry.press/tutorial/tracing-langgraph-tool-calls-with-opentelemetry-and-phoenix/
- Type: Tutorial
- Published: 2026-06-04
- By: agentry
- Tags: langgraph, opentelemetry, tracing, phoenix, observability, tool-calling

---

## Why this matters

LangGraph agents in production fail in ways that are invisible without tracing. A tool call returns, the graph moves to the next node, and somewhere downstream the response is wrong. Without span-level visibility you can't tell whether the latency spike came from the LLM call, the tool execution, or the graph's own routing logic. You also can't attribute token costs to individual tool paths, which makes it impossible to optimize the expensive ones.

Arize Phoenix added first-class OpenInference support for LangGraph in its `openinference-instrumentation-langchain` package, which instruments LangChain and LangGraph graphs with a single `instrument()` call. The OpenInference semantic conventions give every span a consistent shape: `input.value`, `output.value`, `llm.token_count.prompt`, `llm.token_count.completion`, and `tool.name` are all first-class attributes you can query without parsing log strings.

This tutorial wires the full path: agent definition, auto-instrumentation, local Phoenix collector, and a Python report that reads spans back out and prints per-tool cost and latency numbers.

## Prerequisites

- Python 3.11 or 3.12
- An OpenAI API key (the live agent steps require it; all structural and tracing steps run without one)
- Basic familiarity with LangGraph graphs and nodes
- Docker, if you want to run the Phoenix UI (the tutorial also shows the in-process mode that needs no Docker)

## Setup

Install the core dependencies. `openinference-instrumentation-langchain` handles both LangChain and LangGraph graphs.

```bash
uv pip install langgraph langchain-openai openai \
  openinference-instrumentation-langchain \
  arize-phoenix-otel opentelemetry-sdk \
  opentelemetry-exporter-otlp-proto-grpc \
  arize-phoenix pandas tabulate
```

Verify the key packages installed correctly:

```python
from importlib.metadata import version
for pkg in ["langgraph", "openinference-instrumentation-langchain", "arize-phoenix"]:
    print(f"{pkg}: {version(pkg)}")
print("imports ok")
```

## Step 1: Define the agent and its tools

The agent has two tools: a mock weather lookup and a mock unit converter. Both are pure Python functions decorated with `@tool` so LangGraph's `ToolNode` can dispatch to them automatically.

```python
# filename: agent.py
from typing import Annotated
from langchain_core.tools import tool
from langchain_core.messages import BaseMessage
from langchain_openai import ChatOpenAI
from langgraph.graph import StateGraph, END
from langgraph.graph.message import add_messages
from langgraph.prebuilt import ToolNode
from typing_extensions import TypedDict


@tool
def get_weather(city: str) -> str:
    """Return the current weather for a city (mock)."""
    data = {
        "london": "12°C, overcast",
        "tokyo": "28°C, sunny",
        "new york": "18°C, partly cloudy",
    }
    return data.get(city.lower(), f"No data for {city}")


@tool
def convert_units(value: float, from_unit: str, to_unit: str) -> str:
    """Convert a value between units (mock: supports celsius<->fahrenheit, km<->miles)."""
    conversions = {
        ("celsius", "fahrenheit"): lambda v: v * 9 / 5 + 32,
        ("fahrenheit", "celsius"): lambda v: (v - 32) * 5 / 9,
        ("km", "miles"): lambda v: v * 0.621371,
        ("miles", "km"): lambda v: v * 1.60934,
    }
    key = (from_unit.lower(), to_unit.lower())
    if key not in conversions:
        return f"Conversion from {from_unit} to {to_unit} not supported"
    result = conversions[key](value)
    return f"{value} {from_unit} = {result:.2f} {to_unit}"


TOOLS = [get_weather, convert_units]


class AgentState(TypedDict):
    messages: Annotated[list[BaseMessage], add_messages]


def build_agent(model: object = None):
    """Build the LangGraph agent. Accepts an injected model for testing."""
    if model is None:
        model = ChatOpenAI(model="gpt-4o-mini", temperature=0)
    llm_with_tools = model.bind_tools(TOOLS)

    def call_model(state: AgentState) -> dict:
        response = llm_with_tools.invoke(state["messages"])
        return {"messages": [response]}

    def should_continue(state: AgentState) -> str:
        last = state["messages"][-1]
        if hasattr(last, "tool_calls") and last.tool_calls:
            return "tools"
        return END

    tool_node = ToolNode(TOOLS)

    graph = StateGraph(AgentState)
    graph.add_node("agent", call_model)
    graph.add_node("tools", tool_node)
    graph.set_entry_point("agent")
    graph.add_conditional_edges("agent", should_continue, {"tools": "tools", END: END})
    graph.add_edge("tools", "agent")

    return graph.compile()
```

Verify the graph compiles and has the expected nodes (no API key needed here because `build_agent()` accepts an injected stub):

```python
from unittest.mock import MagicMock
from agent import build_agent

stub_model = MagicMock()
stub_model.bind_tools.return_value = stub_model

app = build_agent(model=stub_model)
nodes = list(app.get_graph().nodes.keys())
print("nodes:", sorted(nodes))
assert "agent" in nodes
assert "tools" in nodes
print("graph structure ok")
```
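The routing rule inside `build_agent()` decides whether the graph loops back through the tools or stops, so it is worth exercising on its own. The sketch below is a standalone mirror of `should_continue`, not the function from `agent.py`: `route`, the `SimpleNamespace` message stubs, and the local `END` constant (assumed to match LangGraph's `"__end__"` sentinel) are illustrative stand-ins so the example runs without LangGraph installed.

```python
from types import SimpleNamespace

END = "__end__"  # assumed stand-in for langgraph.graph.END


def route(state: dict) -> str:
    """Mirror of should_continue: go to 'tools' only if the last message requests a tool."""
    last = state["messages"][-1]
    if getattr(last, "tool_calls", None):
        return "tools"
    return END


# Hypothetical message stubs: only the tool_calls attribute matters for routing.
wants_tool = SimpleNamespace(tool_calls=[{"name": "get_weather", "args": {"city": "Tokyo"}}])
final_answer = SimpleNamespace(tool_calls=[])
plain_text = SimpleNamespace(content="hello")  # no tool_calls attribute at all

decisions = [route({"messages": [m]}) for m in (wants_tool, final_answer, plain_text)]
print(decisions)
```

Only the first stub routes to `tools`; an empty or missing `tool_calls` ends the run, which is the behavior the conditional edge in the graph relies on.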

## Step 2: Start Phoenix in-process and wire OpenTelemetry

Phoenix can run as a lightweight in-process collector that writes to a local SQLite file, so no Docker or external service is required. The `px.launch_app()` call starts the Phoenix server in a background thread and returns a session object whose `url` attribute is the UI address; the OTLP gRPC collector listens on port 4317.

The `register()` helper from `arize-phoenix-otel` builds the tracer provider and points it at that collector. No browser is needed for the tracing itself; to inspect spans visually, open `http://localhost:6006`.

```python
# filename: tracing.py
import phoenix as px
from phoenix.otel import register
from openinference.instrumentation.langchain import LangChainInstrumentor


def setup_tracing(project_name: str = "langgraph-tool-tracing") -> object:
    """
    Launch Phoenix in-process, register the OTLP tracer provider,
    and instrument LangChain/LangGraph. Returns the tracer provider.
    """
    # Launch Phoenix as a background thread (no Docker needed)
    session = px.launch_app()
    print(f"Phoenix UI: {session.url}")

    # Register a tracer provider that exports to the local Phoenix collector
    tracer_provider = register(
        project_name=project_name,
        endpoint="http://localhost:4317",  # Phoenix default gRPC OTLP port
        set_global_tracer_provider=True,
    )

    # Auto-instrument LangChain and LangGraph
    LangChainInstrumentor().instrument(tracer_provider=tracer_provider)

    return tracer_provider
```

## Step 3: Run the agent and collect spans

This step requires a real OpenAI API key. Set `OPENAI_API_KEY` in your environment before running it.

```python
# filename: run_agent.py
from langchain_core.messages import HumanMessage
from tracing import setup_tracing
from agent import build_agent


def main():
    tracer_provider = setup_tracing()

    app = build_agent()  # uses ChatOpenAI — needs OPENAI_API_KEY

    queries = [
        "What's the weather in Tokyo and London?",
        "Convert 100 km to miles, then convert 25 celsius to fahrenheit.",
        "What's the weather in New York and convert 72 fahrenheit to celsius?",
    ]

    for query in queries:
        print(f"\nQuery: {query}")
        result = app.invoke({"messages": [HumanMessage(content=query)]})
        final = result["messages"][-1]
        print(f"Answer: {final.content[:120]}")

    # Flush all buffered spans before the process exits
    tracer_provider.force_flush()
    print("\nAll spans flushed to Phoenix.")


if __name__ == "__main__":
    main()
```

Run the agent (requires `OPENAI_API_KEY`):

```bash
# Set your key first:
# export OPENAI_API_KEY="sk-..."
python run_agent.py
```

## Step 4: Verify tracing with a synthetic span

To confirm the OTel pipeline works without an API key or a running Phoenix server, emit a synthetic span directly and assert that it lands in an in-memory exporter. This uses `SimpleSpanProcessor` so the span is exported synchronously before the assertion runs.

```python
import time
import phoenix as px
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import SimpleSpanProcessor
from opentelemetry.sdk.trace.export.in_memory_span_exporter import InMemorySpanExporter

# Use an in-memory exporter so we can assert on spans without a live Phoenix server
exporter = InMemorySpanExporter()
provider = TracerProvider()
provider.add_span_processor(SimpleSpanProcessor(exporter))

tracer = provider.get_tracer("test-tracer")

with tracer.start_as_current_span("tool.get_weather") as span:
    span.set_attribute("tool.name", "get_weather")
    span.set_attribute("input.value", "Tokyo")
    span.set_attribute("output.value", "28°C, sunny")
    span.set_attribute("llm.token_count.prompt", 42)
    span.set_attribute("llm.token_count.completion", 18)
    time.sleep(0.01)  # simulate tool latency

spans = exporter.get_finished_spans()
assert len(spans) == 1, f"Expected 1 span, got {len(spans)}"
assert spans[0].name == "tool.get_weather"
assert spans[0].attributes["tool.name"] == "get_weather"
print(f"Captured span: {spans[0].name}")
print(f"tool.name attribute: {spans[0].attributes['tool.name']}")
print(f"input.value: {spans[0].attributes['input.value']}")
print("synthetic span assertion passed")
```

## Step 5: Build the cost and latency attribution report

Phoenix stores spans in a local SQLite database. The `px.Client()` API lets you pull spans as a pandas DataFrame. The report groups by tool name, computes mean latency, and estimates cost using OpenAI's `gpt-4o-mini` pricing.

The block below works against the in-memory exporter from Step 4 to stay runnable without a live Phoenix server. In production, read spans with `px.Client().get_spans_dataframe(project_name="langgraph-tool-tracing")` instead; that call returns a DataFrame rather than span objects, so `build_report_from_spans` needs adapting to its columns.

```python
# filename: report.py
import pandas as pd

# GPT-4o-mini pricing (per 1M tokens, as of mid-2025)
PRICE_PER_1M_INPUT = 0.15   # USD
PRICE_PER_1M_OUTPUT = 0.60  # USD


def build_report_from_spans(finished_spans: list) -> pd.DataFrame:
    """Convert a list of finished OTel spans into a cost+latency report."""
    rows = []
    for span in finished_spans:
        attrs = span.attributes or {}
        duration_ms = (span.end_time - span.start_time) / 1_000_000  # ns -> ms
        prompt_tokens = attrs.get("llm.token_count.prompt", 0)
        completion_tokens = attrs.get("llm.token_count.completion", 0)
        cost_usd = (
            prompt_tokens / 1_000_000 * PRICE_PER_1M_INPUT
            + completion_tokens / 1_000_000 * PRICE_PER_1M_OUTPUT
        )
        rows.append({
            "span_name": span.name,
            "tool_name": attrs.get("tool.name", span.name),
            "duration_ms": round(duration_ms, 2),
            "prompt_tokens": int(prompt_tokens),
            "completion_tokens": int(completion_tokens),
            "cost_usd": round(cost_usd, 8),
        })
    return pd.DataFrame(rows)


def print_report(df: pd.DataFrame) -> None:
    if df.empty:
        print("No spans to report.")
        return
    summary = (
        df.groupby("tool_name")
        .agg(
            calls=("span_name", "count"),
            mean_latency_ms=("duration_ms", "mean"),
            total_prompt_tokens=("prompt_tokens", "sum"),
            total_completion_tokens=("completion_tokens", "sum"),
            total_cost_usd=("cost_usd", "sum"),
        )
        .reset_index()
    )
    summary["mean_latency_ms"] = summary["mean_latency_ms"].round(2)
    summary["total_cost_usd"] = summary["total_cost_usd"].round(8)
    try:
        from tabulate import tabulate
        print(tabulate(summary, headers="keys", tablefmt="github", showindex=False))
    except ImportError:
        print(summary.to_string(index=False))
```

Now run the report against synthetic spans that mimic what a real agent run would produce:

```python
from opentelemetry.sdk.trace.export.in_memory_span_exporter import InMemorySpanExporter
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import SimpleSpanProcessor
import time
from report import build_report_from_spans, print_report

exporter = InMemorySpanExporter()
provider = TracerProvider()
provider.add_span_processor(SimpleSpanProcessor(exporter))
tracer = provider.get_tracer("report-test")

# Simulate five tool calls with realistic token counts
scenarios = [
    ("tool.get_weather",   "get_weather",   "London",   "12°C, overcast",          35, 12, 0.008),
    ("tool.get_weather",   "get_weather",   "Tokyo",    "28°C, sunny",             35, 10, 0.006),
    ("tool.convert_units", "convert_units", "100 km",   "100 km = 62.14 miles",    48, 20, 0.012),
    ("tool.convert_units", "convert_units", "25 C",     "25 celsius = 77.00 F",    48, 18, 0.010),
    ("tool.get_weather",   "get_weather",   "New York", "18°C, partly cloudy",     35, 11, 0.007),
]

for span_name, tool_name, inp, out, prompt_tok, comp_tok, sleep_s in scenarios:
    with tracer.start_as_current_span(span_name) as span:
        span.set_attribute("tool.name", tool_name)
        span.set_attribute("input.value", inp)
        span.set_attribute("output.value", out)
        span.set_attribute("llm.token_count.prompt", prompt_tok)
        span.set_attribute("llm.token_count.completion", comp_tok)
        time.sleep(sleep_s)

finished = exporter.get_finished_spans()
df = build_report_from_spans(finished)
print(f"Total spans captured: {len(finished)}")
print()
print_report(df)
print("report generation ok")
```
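When the spans come from Phoenix rather than the in-memory exporter, the same aggregation can run directly on the DataFrame. The sketch below assumes the DataFrame exposes `name`, `span_kind`, `start_time`, `end_time`, and flattened attribute columns named `attributes.llm.token_count.prompt` and `attributes.llm.token_count.completion`; check these against the columns your Phoenix version actually returns. `summarize_phoenix_spans` and the hand-built `sample` frame are illustrative, not part of Phoenix.

```python
import pandas as pd

PRICE_PER_1M_INPUT = 0.15   # USD, same gpt-4o-mini figures as report.py
PRICE_PER_1M_OUTPUT = 0.60  # USD

# Assumed column names for the Phoenix span DataFrame.
PROMPT_COL = "attributes.llm.token_count.prompt"
COMPLETION_COL = "attributes.llm.token_count.completion"


def summarize_phoenix_spans(spans_df: pd.DataFrame) -> pd.DataFrame:
    """Group a Phoenix-shaped span DataFrame by span kind and name."""
    df = spans_df.copy()
    df["duration_ms"] = (df["end_time"] - df["start_time"]).dt.total_seconds() * 1000
    for col in (PROMPT_COL, COMPLETION_COL):
        if col not in df.columns:
            df[col] = 0.0          # no LLM spans in this frame
        df[col] = df[col].fillna(0)  # tool spans carry no token counts
    df["cost_usd"] = (
        df[PROMPT_COL] / 1_000_000 * PRICE_PER_1M_INPUT
        + df[COMPLETION_COL] / 1_000_000 * PRICE_PER_1M_OUTPUT
    )
    return (
        df.groupby(["span_kind", "name"])
        .agg(
            calls=("duration_ms", "count"),
            mean_latency_ms=("duration_ms", "mean"),
            total_cost_usd=("cost_usd", "sum"),
        )
        .reset_index()
    )


# Hand-built stand-in for px.Client().get_spans_dataframe(...)
t0 = pd.Timestamp("2026-01-01T00:00:00Z")
nan = float("nan")
sample = pd.DataFrame({
    "name": ["ChatOpenAI", "get_weather", "ChatOpenAI"],
    "span_kind": ["LLM", "TOOL", "LLM"],
    "start_time": [t0, t0, t0],
    "end_time": [t0 + pd.Timedelta(milliseconds=ms) for ms in (800, 10, 600)],
    PROMPT_COL: [100.0, nan, 200.0],
    COMPLETION_COL: [20.0, nan, 30.0],
})

summary = summarize_phoenix_spans(sample)
print(summary.to_string(index=False))
```

Grouping by `span_kind` as well as `name` keeps LLM cost and tool latency in separate rows, which reflects where each number is actually recorded.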

## Step 6: Connecting to a live Phoenix UI (optional)

If you have Docker available, start Phoenix with:

```bash
# Run on your own machine: exposes the UI (6006) and the OTLP gRPC collector (4317)
docker run -p 6006:6006 -p 4317:4317 arizephoenix/phoenix:latest
```

Then in `tracing.py`, remove the `px.launch_app()` call so it does not compete with the container for ports 6006 and 4317; the `register()` call already points at `localhost:4317`. Run `python run_agent.py` (with `OPENAI_API_KEY` set) and open `http://localhost:6006`. Navigate to the `langgraph-tool-tracing` project. Each agent invocation appears as a trace tree: a top-level chain span for the graph contains child spans for each LLM call and each tool dispatch. Click a tool span such as `get_weather` to see `input.value`, `output.value`, and the other recorded attributes in the attributes panel.

> [!PULLQUOTE]
> The same span structure indexes the same way on Datadog or Honeycomb. Only the exporter endpoint changes.

For commercial backends (Datadog, Honeycomb, New Relic), replace the `endpoint` in `register()` with the vendor's OTLP ingest URL and set `OTEL_EXPORTER_OTLP_HEADERS` to the vendor's authentication header in `key=value` form (for Honeycomb, `x-honeycomb-team=<your API key>`). The OpenInference span shape is identical regardless of backend.

## Verify it works

Run the full verification sequence. All three checks should pass without any API key:

```python
import time
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import SimpleSpanProcessor
from opentelemetry.sdk.trace.export.in_memory_span_exporter import InMemorySpanExporter
from unittest.mock import MagicMock
from agent import build_agent
from report import build_report_from_spans, print_report

# Check 1: graph structure
stub = MagicMock()
stub.bind_tools.return_value = stub
app = build_agent(model=stub)
assert "agent" in app.get_graph().nodes
assert "tools" in app.get_graph().nodes
print("[1/3] graph structure ok")

# Check 2: span emission
exporter = InMemorySpanExporter()
provider = TracerProvider()
provider.add_span_processor(SimpleSpanProcessor(exporter))
tracer = provider.get_tracer("verify")
with tracer.start_as_current_span("tool.get_weather") as span:
    span.set_attribute("tool.name", "get_weather")
    span.set_attribute("llm.token_count.prompt", 40)
    span.set_attribute("llm.token_count.completion", 15)
    time.sleep(0.005)
assert len(exporter.get_finished_spans()) == 1
print("[2/3] span emission ok")

# Check 3: report generation
df = build_report_from_spans(exporter.get_finished_spans())
assert len(df) == 1
assert df.iloc[0]["tool_name"] == "get_weather"
assert df.iloc[0]["prompt_tokens"] == 40
print("[3/3] report generation ok")
print("all checks passed")
```

## Troubleshooting

**`ModuleNotFoundError: No module named 'openinference'`** — The package name on PyPI is `openinference-instrumentation-langchain`, not `openinference`. Run `uv pip install openinference-instrumentation-langchain` and confirm with `python -c "from openinference.instrumentation.langchain import LangChainInstrumentor"`.

**Phoenix UI shows no spans after `run_agent.py` finishes** — A `BatchSpanProcessor` buffers spans and exports them in the background, on a timer or when a batch fills. If the process exits before the last batch is exported, those spans are lost. Call `tracer_provider.force_flush()` before the script exits, as shown in `run_agent.py`.

**`ConnectionRefusedError` when exporting to `localhost:4317`** — Phoenix is not running or not listening on that port. Either start Phoenix with Docker (`docker run -p 6006:6006 -p 4317:4317 arizephoenix/phoenix:latest`) or use `px.launch_app()` in-process as shown in `tracing.py`. Confirm the server is up with `curl -s http://localhost:6006/healthz`.

**Tool calls are not appearing as child spans** — `LangChainInstrumentor().instrument()` must be called before the graph is compiled and before any invocation. Move `setup_tracing()` to the top of your entry-point script, before any `import` of `agent.py` that triggers graph construction.

**`openai.AuthenticationError`** — `OPENAI_API_KEY` is not set or is invalid. The structural and tracing verification steps in this tutorial run without a key. Only `run_agent.py` (Step 3) requires one.

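**Report shows zero cost for all spans** — The `llm.token_count.prompt` and `llm.token_count.completion` attributes are only set on LLM spans, not on tool spans. When reading real Phoenix data, filter for spans where `span_kind == "LLM"`, or relate tool spans to the LLM spans in the same trace. In a LangGraph trace the tool span's parent is usually the `tools` node span rather than an LLM span, so joining on the trace ID tends to be more reliable than joining on `parent_id`.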
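One way to get a per-tool cost from real traces is to total the LLM cost of each trace and spread it across the tools called in that trace. The sketch below does this with plain dictionaries; the even split is an assumed heuristic, not something Phoenix or OpenInference prescribes, and the record keys (`trace_id`, `kind`, `name`, `prompt_tokens`, `completion_tokens`) and the `attribute_cost_to_tools` helper are hypothetical names you would map from your own span export.

```python
from collections import defaultdict

PRICE_PER_1M_INPUT = 0.15   # USD per 1M prompt tokens (gpt-4o-mini)
PRICE_PER_1M_OUTPUT = 0.60  # USD per 1M completion tokens


def attribute_cost_to_tools(spans: list[dict]) -> dict[str, float]:
    """Split each trace's LLM cost evenly across the tools called in that trace."""
    llm_cost = defaultdict(float)       # trace_id -> total LLM cost in USD
    tools_in_trace = defaultdict(list)  # trace_id -> names of tools invoked
    for s in spans:
        if s["kind"] == "LLM":
            llm_cost[s["trace_id"]] += (
                s.get("prompt_tokens", 0) / 1_000_000 * PRICE_PER_1M_INPUT
                + s.get("completion_tokens", 0) / 1_000_000 * PRICE_PER_1M_OUTPUT
            )
        elif s["kind"] == "TOOL":
            tools_in_trace[s["trace_id"]].append(s["name"])

    per_tool = defaultdict(float)
    for trace_id, cost in llm_cost.items():
        tools = tools_in_trace.get(trace_id)
        if not tools:
            per_tool["(no tool)"] += cost  # trace answered without any tool call
            continue
        share = cost / len(tools)
        for name in tools:
            per_tool[name] += share
    return dict(per_tool)


# Hypothetical span records from two traces
spans = [
    {"trace_id": "t1", "kind": "LLM", "name": "ChatOpenAI", "prompt_tokens": 1000, "completion_tokens": 500},
    {"trace_id": "t1", "kind": "TOOL", "name": "get_weather"},
    {"trace_id": "t1", "kind": "TOOL", "name": "convert_units"},
    {"trace_id": "t2", "kind": "LLM", "name": "ChatOpenAI", "prompt_tokens": 2000, "completion_tokens": 0},
    {"trace_id": "t2", "kind": "TOOL", "name": "get_weather"},
]

costs = attribute_cost_to_tools(spans)
for tool_name, usd in sorted(costs.items()):
    print(f"{tool_name}: ${usd:.6f}")
```

An even split is crude when one tool returns far more text than another; weighting each share by the size of the tool's `output.value` is a possible refinement.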

## Next steps

- **Add a retry loop with structured output validation**: the agent currently trusts tool outputs verbatim. Wrapping tool calls with a validation layer (checking that `get_weather` returns a string matching an expected pattern) and emitting a `tool.validation.error` span attribute when it fails gives you a precise error rate per tool in Phoenix.
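- **Persist traces across restarts**: `px.launch_app()` stores its data in a temporary directory by default. Passing `use_temp_dir=False` and setting the `PHOENIX_WORKING_DIR` environment variable should keep the SQLite database between runs (check the Phoenix documentation for your installed version), after which you can query spans with `px.Client().get_spans_dataframe()`.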
- **Add streaming token attribution**: for long-running tool chains, stream the LLM response and emit intermediate spans with partial token counts so you can see cost accumulate in real time in the Phoenix waterfall view.
- **Export to a production backend**: swap `endpoint="http://localhost:4317"` for your Honeycomb, Datadog, or Grafana Cloud OTLP endpoint. The OpenInference span shape is backend-agnostic; no other code changes are needed.
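The validation idea from the first item can be sketched without any tracing dependency. The example below assumes a valid mock weather string looks like `28°C, sunny`; `WEATHER_PATTERN` and `validation_attributes` are hypothetical names, and the function only builds the attribute dictionary that a wrapper would copy onto the tool span.

```python
import re

# Assumed shape of a valid mock weather string, e.g. "28°C, sunny".
WEATHER_PATTERN = re.compile(r"^-?\d+°C, .+$")


def validation_attributes(tool_name: str, output: str) -> dict:
    """Return span attributes describing whether a tool output passed validation."""
    attrs = {"tool.name": tool_name, "output.value": output}
    if tool_name == "get_weather" and not WEATHER_PATTERN.match(output):
        # Attribute name taken from the item above; present only on failure.
        attrs["tool.validation.error"] = f"unexpected format: {output!r}"
    return attrs


ok = validation_attributes("get_weather", "28°C, sunny")
bad = validation_attributes("get_weather", "No data for Paris")

print("ok has error:", "tool.validation.error" in ok)
print("bad has error:", "tool.validation.error" in bad)
```

Inside a real span you would loop over the returned dictionary and call `span.set_attribute(key, value)` for each entry, then count spans carrying `tool.validation.error` to get an error rate per tool.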

## FAQ

### How does OpenInference standardize LangGraph span attributes?

OpenInference defines standard attribute names such as input.value, output.value, llm.token_count.prompt, llm.token_count.completion, and tool.name, and each span carries the ones relevant to its kind. This consistent shape allows queries and cost attribution without parsing log strings.

### What does the LangChainInstrumentor().instrument() call do?

It auto-instruments both LangChain and LangGraph graphs to emit OpenTelemetry spans for every LLM call, tool dispatch, and graph node transition. A single call wraps the entire graph without modifying application code.

### Can Phoenix run without Docker?

Yes. The px.launch_app() call starts Phoenix as a lightweight in-process collector in a background thread, writing to a local SQLite file. This requires no Docker or external service.

### How are tool costs calculated from spans?

The report reads llm.token_count.prompt and llm.token_count.completion attributes from each span, multiplies by per-token pricing (gpt-4o-mini: 0.15 USD per 1M input tokens, 0.60 per 1M output tokens), and groups by tool.name to show total cost per tool.
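As a worked example of that arithmetic, the sketch below applies the two prices to the token counts of the five synthetic spans from Step 5. `span_cost_usd` is an illustrative helper that restates the formula in `report.py`.

```python
PRICE_PER_1M_INPUT = 0.15   # USD per 1M prompt tokens
PRICE_PER_1M_OUTPUT = 0.60  # USD per 1M completion tokens


def span_cost_usd(prompt_tokens: int, completion_tokens: int) -> float:
    """Cost of one span: tokens scaled to millions, times the per-million price."""
    return (
        prompt_tokens / 1_000_000 * PRICE_PER_1M_INPUT
        + completion_tokens / 1_000_000 * PRICE_PER_1M_OUTPUT
    )


# (tool_name, prompt_tokens, completion_tokens) for the Step 5 scenarios
calls = [
    ("get_weather", 35, 12),
    ("get_weather", 35, 10),
    ("convert_units", 48, 20),
    ("convert_units", 48, 18),
    ("get_weather", 35, 11),
]

totals: dict[str, float] = {}
for tool_name, prompt_tokens, completion_tokens in calls:
    totals[tool_name] = totals.get(tool_name, 0.0) + span_cost_usd(prompt_tokens, completion_tokens)

for tool_name, usd in totals.items():
    print(f"{tool_name}: ${usd:.8f}")
```

At these token counts each tool costs a few thousandths of a cent, which is why the report rounds to eight decimal places rather than two.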

### Why does the agent need tracer_provider.force_flush() before exit?

A BatchSpanProcessor buffers spans and exports them in the background, on a timer or when a batch fills. If the process exits before the last batch is exported, those spans are lost. force_flush() blocks until all buffered spans have been sent to Phoenix, so calling it before the script terminates avoids that loss.

