Tracing LangGraph Tool Calls with OpenTelemetry and Phoenix

Why this matters

LangGraph agents in production fail in ways that are invisible without tracing. A tool call returns, the graph moves to the next node, and somewhere downstream the response is wrong. Without span-level visibility you can’t tell whether the latency spike came from the LLM call, the tool execution, or the graph’s own routing logic. You also can’t attribute token costs to individual tool paths, which makes it impossible to optimize the expensive ones.

Arize Phoenix added first-class OpenInference support for LangGraph in its openinference-instrumentation-langchain package, which instruments both LangChain and LangGraph with a single instrument() call. The OpenInference semantic conventions give spans a consistent shape: input.value and output.value describe what went in and out, llm.token_count.prompt and llm.token_count.completion appear on LLM spans, and tool.name appears on tool spans. All of them are plain attributes you can query without parsing log strings.
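To make the "query without parsing log strings" point concrete, here is a minimal sketch that models two spans as plain dictionaries using the attribute keys named above. The span records and the helpers spans_for_tool and total_tokens are illustrative assumptions, not Phoenix or OpenInference APIs.

```python
# Hypothetical span records: plain dicts keyed by OpenInference attribute names.
spans = [
    {
        "name": "get_weather",
        "attributes": {
            "tool.name": "get_weather",
            "input.value": '{"city": "Tokyo"}',
            "output.value": "28°C, sunny",
        },
    },
    {
        "name": "ChatOpenAI",
        "attributes": {
            "input.value": "What's the weather in Tokyo?",
            "output.value": "It is 28°C and sunny in Tokyo.",
            "llm.token_count.prompt": 42,
            "llm.token_count.completion": 18,
        },
    },
]


def spans_for_tool(records: list[dict], tool_name: str) -> list[dict]:
    """Select spans by attribute lookup instead of string parsing."""
    return [r for r in records if r["attributes"].get("tool.name") == tool_name]


def total_tokens(records: list[dict]) -> int:
    """Sum prompt and completion tokens across all spans that report them."""
    return sum(
        r["attributes"].get("llm.token_count.prompt", 0)
        + r["attributes"].get("llm.token_count.completion", 0)
        for r in records
    )


print([s["name"] for s in spans_for_tool(spans, "get_weather")])
print(total_tokens(spans))
```

Because every field is a named attribute, both questions ("which spans belong to this tool?" and "how many tokens did this run use?") reduce to dictionary lookups.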

This tutorial wires the full path: agent definition, auto-instrumentation, local Phoenix collector, and a Python report that reads spans back out and prints per-tool cost and latency numbers.

Prerequisites

Python 3.11 or 3.12
An OpenAI API key (the live agent steps require it; all structural and tracing steps run without one)
Basic familiarity with LangGraph graphs and nodes
Docker, if you want to run the Phoenix UI (the tutorial also shows the in-process mode that needs no Docker)

Setup

Install the core dependencies. openinference-instrumentation-langchain handles both LangChain and LangGraph graphs.

uv pip install langgraph langchain-openai openai \
  openinference-instrumentation-langchain \
  arize-phoenix-otel opentelemetry-sdk \
  opentelemetry-exporter-otlp-proto-grpc \
  arize-phoenix pandas tabulate

Verify the key packages installed correctly:

from importlib.metadata import version
for pkg in ["langgraph", "openinference-instrumentation-langchain", "arize-phoenix"]:
    print(f"{pkg}: {version(pkg)}")
print("imports ok")

Step 1: Define the agent and its tools

The agent has two tools: a mock weather lookup and a mock unit converter. Both are pure Python functions decorated with @tool so LangGraph’s ToolNode can dispatch to them automatically.

# filename: agent.py
from typing import Annotated
from langchain_core.tools import tool
from langchain_core.messages import BaseMessage
from langchain_openai import ChatOpenAI
from langgraph.graph import StateGraph, END
from langgraph.graph.message import add_messages
from langgraph.prebuilt import ToolNode
from typing_extensions import TypedDict


@tool
def get_weather(city: str) -> str:
    """Return the current weather for a city (mock)."""
    data = {
        "london": "12°C, overcast",
        "tokyo": "28°C, sunny",
        "new york": "18°C, partly cloudy",
    }
    return data.get(city.lower(), f"No data for {city}")


@tool
def convert_units(value: float, from_unit: str, to_unit: str) -> str:
    """Convert a value between units (mock: supports celsius<->fahrenheit, km<->miles)."""
    conversions = {
        ("celsius", "fahrenheit"): lambda v: v * 9 / 5 + 32,
        ("fahrenheit", "celsius"): lambda v: (v - 32) * 5 / 9,
        ("km", "miles"): lambda v: v * 0.621371,
        ("miles", "km"): lambda v: v * 1.60934,
    }
    key = (from_unit.lower(), to_unit.lower())
    if key not in conversions:
        return f"Conversion from {from_unit} to {to_unit} not supported"
    result = conversions[key](value)
    return f"{value} {from_unit} = {result:.2f} {to_unit}"


TOOLS = [get_weather, convert_units]


class AgentState(TypedDict):
    messages: Annotated[list[BaseMessage], add_messages]


def build_agent(model: object = None):
    """Build the LangGraph agent. Accepts an injected model for testing."""
    if model is None:
        model = ChatOpenAI(model="gpt-4o-mini", temperature=0)
    llm_with_tools = model.bind_tools(TOOLS)

    def call_model(state: AgentState) -> dict:
        response = llm_with_tools.invoke(state["messages"])
        return {"messages": [response]}

    def should_continue(state: AgentState) -> str:
        last = state["messages"][-1]
        if hasattr(last, "tool_calls") and last.tool_calls:
            return "tools"
        return END

    tool_node = ToolNode(TOOLS)

    graph = StateGraph(AgentState)
    graph.add_node("agent", call_model)
    graph.add_node("tools", tool_node)
    graph.set_entry_point("agent")
    graph.add_conditional_edges("agent", should_continue, {"tools": "tools", END: END})
    graph.add_edge("tools", "agent")

    return graph.compile()

Verify the graph compiles and has the expected nodes (no API key needed here because build_agent() accepts an injected stub):

from unittest.mock import MagicMock
from agent import build_agent

stub_model = MagicMock()
stub_model.bind_tools.return_value = stub_model

app = build_agent(model=stub_model)
nodes = list(app.get_graph().nodes.keys())
print("nodes:", sorted(nodes))
assert "agent" in nodes
assert "tools" in nodes
print("graph structure ok")

Step 2: Start Phoenix in-process and wire OpenTelemetry

Phoenix can run as a lightweight in-process collector backed by a local SQLite file, with no Docker and no external service. The px.launch_app() call starts the Phoenix server in a background thread and returns a session object. Its url attribute points at the UI, and the server also accepts OTLP traces over gRPC on port 4317 by default, which is where the exporter below sends spans.

For the sandbox (no display, no browser), use the arize-phoenix-otel helper to register the tracer provider. In your own environment you’d open http://localhost:6006 to see the UI.

# filename: tracing.py
import phoenix as px
from phoenix.otel import register
from openinference.instrumentation.langchain import LangChainInstrumentor


def setup_tracing(project_name: str = "langgraph-tool-tracing") -> object:
    """
    Launch Phoenix in-process, register the OTLP tracer provider,
    and instrument LangChain/LangGraph. Returns the tracer provider.
    """
    # Launch Phoenix as a background thread (no Docker needed)
    session = px.launch_app()
    print(f"Phoenix UI: {session.url}")

    # Register a tracer provider that exports to the local Phoenix collector
    tracer_provider = register(
        project_name=project_name,
        endpoint="http://localhost:4317",  # Phoenix default gRPC OTLP port
        set_global_tracer_provider=True,
    )

    # Auto-instrument LangChain and LangGraph (the class is LangChainInstrumentor)
    LangChainInstrumentor().instrument(tracer_provider=tracer_provider)

    return tracer_provider

Step 3: Run the agent and collect spans

This step requires a real OpenAI API key. The block is marked to skip in the sandbox but you can run it locally after setting OPENAI_API_KEY.

# filename: run_agent.py
import os
from langchain_core.messages import HumanMessage
from tracing import setup_tracing
from agent import build_agent


def main():
    tracer_provider = setup_tracing()

    app = build_agent()  # uses ChatOpenAI — needs OPENAI_API_KEY

    queries = [
        "What's the weather in Tokyo and London?",
        "Convert 100 km to miles, then convert 25 celsius to fahrenheit.",
        "What's the weather in New York and convert 72 fahrenheit to celsius?",
    ]

    for query in queries:
        print(f"\nQuery: {query}")
        result = app.invoke({"messages": [HumanMessage(content=query)]})
        final = result["messages"][-1]
        print(f"Answer: {final.content[:120]}")

    # Flush all buffered spans before the process exits
    tracer_provider.force_flush()
    print("\nAll spans flushed to Phoenix.")


if __name__ == "__main__":
    main()

Run the agent (requires OPENAI_API_KEY):

# Set your key first:
# export OPENAI_API_KEY="sk-..."
python run_agent.py

Step 4: Verify tracing with a synthetic span

To confirm the OTel pipeline works without an API key or a running Phoenix server, emit a synthetic span and assert that it reaches an exporter. The check below uses InMemorySpanExporter behind a SimpleSpanProcessor, which exports each span synchronously as it ends, so the assertion can run immediately.

import time
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import SimpleSpanProcessor
from opentelemetry.sdk.trace.export.in_memory_span_exporter import InMemorySpanExporter

# Use an in-memory exporter so we can assert on spans without a live Phoenix server
exporter = InMemorySpanExporter()
provider = TracerProvider()
provider.add_span_processor(SimpleSpanProcessor(exporter))

tracer = provider.get_tracer("test-tracer")

with tracer.start_as_current_span("tool.get_weather") as span:
    span.set_attribute("tool.name", "get_weather")
    span.set_attribute("input.value", "Tokyo")
    span.set_attribute("output.value", "28°C, sunny")
    span.set_attribute("llm.token_count.prompt", 42)
    span.set_attribute("llm.token_count.completion", 18)
    time.sleep(0.01)  # simulate tool latency

spans = exporter.get_finished_spans()
assert len(spans) == 1, f"Expected 1 span, got {len(spans)}"
assert spans[0].name == "tool.get_weather"
assert spans[0].attributes["tool.name"] == "get_weather"
print(f"Captured span: {spans[0].name}")
print(f"tool.name attribute: {spans[0].attributes['tool.name']}")
print(f"input.value: {spans[0].attributes['input.value']}")
print("synthetic span assertion passed")

Step 5: Build the cost and latency attribution report

Phoenix stores spans in a local SQLite database. The px.Client() API lets you pull spans as a pandas DataFrame. The report groups by tool name, computes mean latency, and estimates cost using OpenAI’s gpt-4o-mini pricing.

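The block below works on a list of finished OTel spans, such as those from the in-memory exporter in Step 4, so it stays runnable without a live Phoenix server. In production, px.Client().get_spans_dataframe(project_name="langgraph-tool-tracing") returns a flattened pandas DataFrame rather than span objects, so the row-building loop has to read DataFrame columns instead of span.attributes.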

# filename: report.py
import pandas as pd

# GPT-4o-mini pricing (per 1M tokens, as of mid-2025)
PRICE_PER_1M_INPUT = 0.15   # USD
PRICE_PER_1M_OUTPUT = 0.60  # USD


def build_report_from_spans(finished_spans: list) -> pd.DataFrame:
    """Convert a list of finished OTel spans into a cost+latency report."""
    rows = []
    for span in finished_spans:
        attrs = span.attributes or {}
        duration_ms = (span.end_time - span.start_time) / 1_000_000  # ns -> ms
        prompt_tokens = attrs.get("llm.token_count.prompt", 0)
        completion_tokens = attrs.get("llm.token_count.completion", 0)
        cost_usd = (
            prompt_tokens / 1_000_000 * PRICE_PER_1M_INPUT
            + completion_tokens / 1_000_000 * PRICE_PER_1M_OUTPUT
        )
        rows.append({
            "span_name": span.name,
            "tool_name": attrs.get("tool.name", span.name),
            "duration_ms": round(duration_ms, 2),
            "prompt_tokens": int(prompt_tokens),
            "completion_tokens": int(completion_tokens),
            "cost_usd": round(cost_usd, 8),
        })
    return pd.DataFrame(rows)


def print_report(df: pd.DataFrame) -> None:
    if df.empty:
        print("No spans to report.")
        return
    summary = (
        df.groupby("tool_name")
        .agg(
            calls=("span_name", "count"),
            mean_latency_ms=("duration_ms", "mean"),
            total_prompt_tokens=("prompt_tokens", "sum"),
            total_completion_tokens=("completion_tokens", "sum"),
            total_cost_usd=("cost_usd", "sum"),
        )
        .reset_index()
    )
    summary["mean_latency_ms"] = summary["mean_latency_ms"].round(2)
    summary["total_cost_usd"] = summary["total_cost_usd"].round(8)
    try:
        from tabulate import tabulate
        print(tabulate(summary, headers="keys", tablefmt="github", showindex=False))
    except ImportError:
        print(summary.to_string(index=False))

Now run the report against synthetic spans that mimic what a real agent run would produce:

from opentelemetry.sdk.trace.export.in_memory_span_exporter import InMemorySpanExporter
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import SimpleSpanProcessor
import time
from report import build_report_from_spans, print_report

exporter = InMemorySpanExporter()
provider = TracerProvider()
provider.add_span_processor(SimpleSpanProcessor(exporter))
tracer = provider.get_tracer("report-test")

# Simulate five tool calls with realistic token counts
scenarios = [
    ("tool.get_weather",   "get_weather",   "London",   "12°C, overcast",          35, 12, 0.008),
    ("tool.get_weather",   "get_weather",   "Tokyo",    "28°C, sunny",             35, 10, 0.006),
    ("tool.convert_units", "convert_units", "100 km",   "100 km = 62.14 miles",    48, 20, 0.012),
    ("tool.convert_units", "convert_units", "25 C",     "25 celsius = 77.00 F",    48, 18, 0.010),
    ("tool.get_weather",   "get_weather",   "New York", "18°C, partly cloudy",     35, 11, 0.007),
]

for span_name, tool_name, inp, out, prompt_tok, comp_tok, sleep_s in scenarios:
    with tracer.start_as_current_span(span_name) as span:
        span.set_attribute("tool.name", tool_name)
        span.set_attribute("input.value", inp)
        span.set_attribute("output.value", out)
        span.set_attribute("llm.token_count.prompt", prompt_tok)
        span.set_attribute("llm.token_count.completion", comp_tok)
        time.sleep(sleep_s)

finished = exporter.get_finished_spans()
df = build_report_from_spans(finished)
print(f"Total spans captured: {len(finished)}")
print()
print_report(df)
print("report generation ok")

Step 6: Connecting to a live Phoenix UI (optional)

If you have Docker available, start Phoenix with:

# skip_docker: run this on your own machine, not in the sandbox
docker run -p 6006:6006 -p 4317:4317 arizephoenix/phoenix:latest

Then, in tracing.py, remove or comment out the px.launch_app() call so it does not compete with the container for ports 6006 and 4317; the register() call already points at localhost:4317. Run python run_agent.py (with OPENAI_API_KEY set) and open http://localhost:6006. Navigate to the langgraph-tool-tracing project. Each agent invocation appears as a trace tree: a top-level span for the graph run (typically named LangGraph) contains child spans for each node, with the LLM calls and tool dispatches nested beneath them. Click a tool span such as get_weather to see input.value, output.value, and tool.name in the attributes panel; token counts appear on the LLM spans.
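The trace tree can be pictured with a small dependency-free sketch that rebuilds the hierarchy from parent links. The span IDs, names, and kinds below are hypothetical stand-ins for one agent invocation; the exact names Phoenix shows depend on the instrumentation version.

```python
from collections import defaultdict

# Hypothetical flattened spans for one trace: (span_id, parent_id, name, kind).
SPANS = [
    ("s1", None, "LangGraph", "CHAIN"),
    ("s2", "s1", "agent", "CHAIN"),
    ("s3", "s2", "ChatOpenAI", "LLM"),
    ("s4", "s1", "tools", "CHAIN"),
    ("s5", "s4", "get_weather", "TOOL"),
    ("s6", "s1", "agent", "CHAIN"),
    ("s7", "s6", "ChatOpenAI", "LLM"),
]


def render_tree(spans):
    """Return indented lines, one per span, in parent-before-child order."""
    children = defaultdict(list)
    for span_id, parent_id, name, kind in spans:
        children[parent_id].append((span_id, name, kind))

    lines = []

    def walk(parent_id, depth):
        for span_id, name, kind in children[parent_id]:
            lines.append(f"{'  ' * depth}{name} [{kind}]")
            walk(span_id, depth + 1)

    walk(None, 0)  # root spans have no parent
    return lines


print("\n".join(render_tree(SPANS)))
```

In this layout the tool span sits under the tools node and the LLM spans sit under the agent node, which is why tool spans and LLM spans are related through their shared trace rather than as direct parent and child.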

The same span structure carries over to commercial backends such as Datadog, Honeycomb, or New Relic. Replace the endpoint in register() with the vendor's OTLP ingest URL and set OTEL_EXPORTER_OTLP_HEADERS to the vendor's authentication header, written as a key=value pair that contains your API key. The OpenInference span shape is identical regardless of backend; only the exporter configuration changes.
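As a sketch of that configuration, the helper below builds the two standard OTLP environment variables from an endpoint and a header mapping. OTEL_EXPORTER_OTLP_ENDPOINT and OTEL_EXPORTER_OTLP_HEADERS are standard OpenTelemetry variables; the helper name otlp_env, the example URL, and the header name are placeholders, so take the real values from your vendor's documentation.

```python
from urllib.parse import quote


def otlp_env(endpoint: str, headers: dict[str, str]) -> dict[str, str]:
    """Build OTLP exporter environment variables.

    Headers are written as comma-separated key=value pairs with
    percent-encoded values.
    """
    encoded = ",".join(
        f"{key}={quote(value, safe='')}" for key, value in headers.items()
    )
    return {
        "OTEL_EXPORTER_OTLP_ENDPOINT": endpoint,
        "OTEL_EXPORTER_OTLP_HEADERS": encoded,
    }


env = otlp_env(
    "https://otlp.example.com:4317",   # placeholder vendor endpoint
    {"x-api-key": "YOUR_API_KEY"},     # placeholder header name and value
)
for name, value in env.items():
    print(f"export {name}={value}")
```

Keeping the key in an environment variable rather than in tracing.py means the same code can target Phoenix locally and a vendor backend in deployment.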

Verify it works

Run the full verification sequence. All three checks should pass without any API key:

import time
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import SimpleSpanProcessor
from opentelemetry.sdk.trace.export.in_memory_span_exporter import InMemorySpanExporter
from unittest.mock import MagicMock
from agent import build_agent
from report import build_report_from_spans, print_report

# Check 1: graph structure
stub = MagicMock()
stub.bind_tools.return_value = stub
app = build_agent(model=stub)
assert "agent" in app.get_graph().nodes
assert "tools" in app.get_graph().nodes
print("[1/3] graph structure ok")

# Check 2: span emission
exporter = InMemorySpanExporter()
provider = TracerProvider()
provider.add_span_processor(SimpleSpanProcessor(exporter))
tracer = provider.get_tracer("verify")
with tracer.start_as_current_span("tool.get_weather") as span:
    span.set_attribute("tool.name", "get_weather")
    span.set_attribute("llm.token_count.prompt", 40)
    span.set_attribute("llm.token_count.completion", 15)
    time.sleep(0.005)
assert len(exporter.get_finished_spans()) == 1
print("[2/3] span emission ok")

# Check 3: report generation
df = build_report_from_spans(exporter.get_finished_spans())
assert len(df) == 1
assert df.iloc[0]["tool_name"] == "get_weather"
assert df.iloc[0]["prompt_tokens"] == 40
print("[3/3] report generation ok")
print("all checks passed")

Troubleshooting

ModuleNotFoundError: No module named 'openinference'. The package name on PyPI is openinference-instrumentation-langchain, not openinference. Run uv pip install openinference-instrumentation-langchain and confirm with python -c "from openinference.instrumentation.langchain import LangChainInstrumentor".

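Phoenix UI shows no spans after run_agent.py finishes. Spans may still be buffered when the process exits. This mainly affects batch export (for example register(batch=True)), where a BatchSpanProcessor sends spans on a timer and at shutdown, so anything not yet exported when the process dies is lost. Call tracer_provider.force_flush() before the script exits, as shown in run_agent.py. Note also that a Phoenix server started in-process with px.launch_app() stops when the script exits, so keep the process alive or run Phoenix separately (Step 6) if you want to browse the UI afterwards.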

ConnectionRefusedError when exporting to localhost:4317. Phoenix is not running or not listening on that port. Either start Phoenix with Docker (docker run -p 6006:6006 -p 4317:4317 arizephoenix/phoenix:latest) or use px.launch_app() in-process as shown in tracing.py. Confirm the server is up with curl -s http://localhost:6006/healthz.

Tool calls are not appearing as child spans. LangChainInstrumentor().instrument() must run before the first graph invocation; calling it before the graph is built, as run_agent.py does, is the safest ordering. Make sure setup_tracing() runs at the top of your entry point, before build_agent() and before any app.invoke() call.

openai.AuthenticationError — OPENAI_API_KEY is not set or is invalid. The structural and tracing verification steps in this tutorial run without a key. Only run_agent.py (Step 3) requires one.

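Report shows zero cost for all spans. The llm.token_count.prompt and llm.token_count.completion attributes are set on LLM spans, not on tool spans (the synthetic spans in this tutorial put them on tool spans only to keep the report runnable). When reading real Phoenix data, filter for spans where span_kind == "LLM", or relate LLM spans to tool spans through the trace they share, for example by following parent_id up to a common ancestor or by grouping on the trace ID.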
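One way to act on this with real Phoenix data is to allocate each trace's LLM token cost to the tools called in that trace. The sketch below splits a trace's LLM cost evenly across its tool spans. The even split is an assumed policy, not something Phoenix or OpenInference prescribes, and the flattened column names (context.trace_id, span_kind, attributes.llm.token_count.*) are assumptions to check against your Phoenix version.

```python
from collections import defaultdict

PRICE_PER_1M_INPUT = 0.15   # USD, gpt-4o-mini figures used in report.py
PRICE_PER_1M_OUTPUT = 0.60  # USD


def cost_per_tool(records: list[dict]) -> dict[str, float]:
    """Split each trace's LLM cost evenly across the tool spans in that trace."""
    llm_cost = defaultdict(float)  # trace_id -> total LLM cost in USD
    tools = defaultdict(list)      # trace_id -> names of tool spans

    for rec in records:
        trace_id = rec["context.trace_id"]
        if rec["span_kind"] == "LLM":
            prompt = rec.get("attributes.llm.token_count.prompt", 0)
            completion = rec.get("attributes.llm.token_count.completion", 0)
            llm_cost[trace_id] += (
                prompt / 1_000_000 * PRICE_PER_1M_INPUT
                + completion / 1_000_000 * PRICE_PER_1M_OUTPUT
            )
        elif rec["span_kind"] == "TOOL":
            tools[trace_id].append(rec["name"])

    totals = defaultdict(float)
    for trace_id, names in tools.items():
        share = llm_cost[trace_id] / len(names)
        for name in names:
            totals[name] += share
    return dict(totals)


# Hypothetical rows for two traces.
records = [
    {"context.trace_id": "t1", "span_kind": "LLM", "name": "ChatOpenAI",
     "attributes.llm.token_count.prompt": 100, "attributes.llm.token_count.completion": 20},
    {"context.trace_id": "t1", "span_kind": "TOOL", "name": "get_weather"},
    {"context.trace_id": "t1", "span_kind": "TOOL", "name": "get_weather"},
    {"context.trace_id": "t1", "span_kind": "LLM", "name": "ChatOpenAI",
     "attributes.llm.token_count.prompt": 150, "attributes.llm.token_count.completion": 30},
    {"context.trace_id": "t2", "span_kind": "LLM", "name": "ChatOpenAI",
     "attributes.llm.token_count.prompt": 200, "attributes.llm.token_count.completion": 40},
    {"context.trace_id": "t2", "span_kind": "TOOL", "name": "convert_units"},
]
for tool_name, cost in cost_per_tool(records).items():
    print(f"{tool_name}: {cost:.8f} USD")
```

Traces that contain no tool spans leave their LLM cost unallocated under this policy, so the per-tool totals can sum to less than the overall spend.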

Next steps

Add a retry loop with structured output validation: the agent currently trusts tool outputs verbatim. Wrapping tool calls with a validation layer (checking that get_weather returns a string matching an expected pattern) and emitting a tool.validation.error span attribute when it fails gives you a precise error rate per tool in Phoenix.
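Persist traces across restarts: by default px.launch_app() keeps its data in a temporary directory that is discarded when the session ends. Recent Phoenix releases accept px.launch_app(use_temp_dir=False), which stores the SQLite database under the directory named by the PHOENIX_WORKING_DIR environment variable; check the launch_app signature for your installed version. Earlier runs can then be queried with px.Client().get_spans_dataframe().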
Add streaming token attribution: for long-running tool chains, stream the LLM response and emit intermediate spans with partial token counts so you can see cost accumulate in real time in the Phoenix waterfall view.
Export to a production backend: swap endpoint="http://localhost:4317" for your Honeycomb, Datadog, or Grafana Cloud OTLP endpoint. The OpenInference span shape is backend-agnostic; no other code changes are needed.
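The validation idea in the first item could look roughly like the sketch below. The regular expression and the helper validate_weather_output are assumptions for illustration, and tool.validation.error and tool.validation.message are custom attribute names rather than OpenInference conventions. In a traced run you would copy the returned attributes onto the current span.

```python
import re

# Assumed output shape for the mock tool, e.g. "28°C, sunny".
WEATHER_PATTERN = re.compile(r"^-?\d+°C, [a-z ]+$")


def validate_weather_output(output: str) -> dict[str, object]:
    """Return span attributes describing whether the tool output is well-formed."""
    if WEATHER_PATTERN.match(output):
        return {"tool.validation.error": False}
    return {
        "tool.validation.error": True,
        "tool.validation.message": f"unexpected get_weather output: {output!r}",
    }


print(validate_weather_output("28°C, sunny"))
print(validate_weather_output("No data for Paris"))
```

Counting spans where tool.validation.error is true, grouped by tool.name, then gives the per-tool error rate.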

FAQ

How does OpenInference standardize LangGraph span attributes?

OpenInference defines standard attribute names such as input.value, output.value, llm.token_count.prompt, llm.token_count.completion, and tool.name. Each span kind carries the subset that applies to it (token counts on LLM spans, tool.name on tool spans), so queries and cost attribution work on attributes rather than parsed log strings.

What does the LangChainInstrumentor().instrument() call do?

It auto-instruments both LangChain and LangGraph graphs to emit OpenTelemetry spans for every LLM call, tool dispatch, and graph node transition. A single call wraps the entire graph without modifying application code.

Can Phoenix run without Docker?

Yes. The px.launch_app() call starts Phoenix as a lightweight in-process collector in a background thread, writing to a local SQLite file. This requires no Docker or external service.

How are tool costs calculated from spans?

The report reads the llm.token_count.prompt and llm.token_count.completion attributes from each span, multiplies them by per-token pricing (gpt-4o-mini: 0.15 USD per 1M input tokens, 0.60 USD per 1M output tokens), and groups by tool.name to show total cost per tool.
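As a worked instance of that arithmetic, take one of the synthetic spans from Step 5 (35 prompt tokens, 12 completion tokens) and the prices quoted above. The function name estimate_cost_usd is just for illustration.

```python
PRICE_PER_1M_INPUT = 0.15   # USD per 1M prompt tokens (gpt-4o-mini, as quoted above)
PRICE_PER_1M_OUTPUT = 0.60  # USD per 1M completion tokens


def estimate_cost_usd(prompt_tokens: int, completion_tokens: int) -> float:
    """Token counts times per-token price, summed over input and output."""
    return (
        prompt_tokens / 1_000_000 * PRICE_PER_1M_INPUT
        + completion_tokens / 1_000_000 * PRICE_PER_1M_OUTPUT
    )


cost = estimate_cost_usd(35, 12)
# Input:  35 * 0.15 / 1e6 = 0.00000525 USD
# Output: 12 * 0.60 / 1e6 = 0.00000720 USD
print(f"{cost:.8f} USD")
```

Individual calls cost fractions of a cent, which is why the report keeps eight decimal places and sums per tool before rounding for display.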

Why does the agent need tracer_provider.force_flush() before exit?

With batch export, a BatchSpanProcessor buffers spans and sends them on a timer and at shutdown. If the process exits before an export completes, those spans are lost. force_flush() blocks until buffered spans have been handed to the exporter, so calling it before exit is a cheap safeguard whichever span processor is configured.