Tracing Claude Code Tool Calls with OpenTelemetry and LangSmith

Why this matters

Claude Code v2.1.126 ships a concrete observability hook that most teams haven’t touched yet: the claude_code.skill_activated OpenTelemetry event now fires for every user-typed slash command and carries an invocation_trigger attribute with three possible values: "user-slash", "claude-proactive", and "nested-skill" [1]. Before this release, distinguishing a command the user typed from one Claude invoked autonomously required parsing raw transcript text. Now it’s a structured span attribute.

For teams running Claude Code in CI pipelines, pair-programming sessions, or agent loops, this matters operationally. A slow /test invocation that blocks a pipeline looks identical to a fast one in terminal output. A nested skill call that silently fails leaves no breadcrumb unless you’re capturing spans. LangSmith’s trace UI gives you a timeline view and a queryable API, so you can write a nightly job that flags any tool call exceeding a latency threshold or ending in an error state, without scraping logs.

Prerequisites

Python 3.11 or later
Claude Code v2.1.126 or later (run claude --version to confirm)
A LangSmith account with an API key (free tier is sufficient)
Basic bash familiarity
Node 18+ (Claude Code is a Node package)

Setup

Install the Python dependencies used by the scripts in this tutorial. The OpenTelemetry SDK and OTLP exporter handle span export from the emit scripts; langsmith provides the REST client the query script uses to read traces after they land.

uv pip install opentelemetry-sdk opentelemetry-exporter-otlp-proto-http langsmith

Export the environment variables Claude Code reads to locate its OTel collector and the variables the query script needs for LangSmith. Replace the placeholder values with your real credentials before running subsequent blocks.

export LANGSMITH_API_KEY="ls__your_key_here"
export LANGSMITH_PROJECT="claude-code-traces"
export LANGCHAIN_TRACING_V2="true"
export LANGCHAIN_ENDPOINT="https://api.smith.langchain.com"
# Claude Code reads these to forward OTel spans
export OTEL_EXPORTER_OTLP_ENDPOINT="https://api.smith.langchain.com/otel"
export OTEL_EXPORTER_OTLP_HEADERS="x-api-key=${LANGSMITH_API_KEY}"
export OTEL_SERVICE_NAME="claude-code"
export CLAUDE_CODE_ENABLE_TELEMETRY="1"

Step 1: Understand the span structure

Before writing any code, it helps to know exactly what Claude Code emits. When a user types /test or Claude autonomously invokes a skill, Claude Code emits an OTel span named claude_code.skill_activated and attaches these attributes [1]:

invocation_trigger: one of "user-slash", "claude-proactive", or "nested-skill"
skill.name: the slash command name, e.g. "test" or "build"
skill.status: "success" or "error"
skill.duration_ms: wall-clock time in milliseconds

LangSmith stores each span as a “run” inside a project. The LangSmith Python SDK's Client class exposes a list_runs method that accepts filter expressions, which the query script in Step 4 uses to pull only claude_code.skill_activated runs.
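
To make the attribute list above concrete, here is a minimal sketch of a typed view over one span's attributes. The SkillActivation class and its from_attributes helper are hypothetical names introduced for illustration; they are not part of Claude Code or LangSmith. The sketch assumes the attribute keys and allowed values are exactly those listed above.

```python
from dataclasses import dataclass
from typing import Mapping

# Values listed in the attribute description above; anything else is rejected.
VALID_TRIGGERS = frozenset({"user-slash", "claude-proactive", "nested-skill"})
VALID_STATUSES = frozenset({"success", "error"})


@dataclass(frozen=True)
class SkillActivation:
    """Hypothetical typed view of one claude_code.skill_activated span."""

    skill_name: str
    trigger: str
    status: str
    duration_ms: float

    @classmethod
    def from_attributes(cls, attrs: Mapping[str, object]) -> "SkillActivation":
        trigger = str(attrs.get("invocation_trigger", ""))
        status = str(attrs.get("skill.status", ""))
        if trigger not in VALID_TRIGGERS:
            raise ValueError(f"unexpected invocation_trigger: {trigger!r}")
        if status not in VALID_STATUSES:
            raise ValueError(f"unexpected skill.status: {status!r}")
        return cls(
            skill_name=str(attrs.get("skill.name", "unknown")),
            trigger=trigger,
            status=status,
            # Attribute values may arrive as strings, so coerce explicitly.
            duration_ms=float(attrs.get("skill.duration_ms", 0.0)),
        )


example = SkillActivation.from_attributes({
    "invocation_trigger": "user-slash",
    "skill.name": "test",
    "skill.status": "success",
    "skill.duration_ms": 320.0,
})
print(example)
```

Validating at the boundary like this means a span from an older Claude Code version, which lacks invocation_trigger, fails loudly instead of being silently bucketed as "unknown".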

Step 2: Write a local OTel verification script

Before trusting that LangSmith is receiving spans, verify the OTel pipeline locally by emitting a synthetic claude_code.skill_activated span to a console exporter. This lets you confirm the span shape without needing Claude Code to be running.

# filename: emit_test_span.py
import time
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import SimpleSpanProcessor, ConsoleSpanExporter
from opentelemetry.sdk.resources import Resource

resource = Resource.create({"service.name": "claude-code"})
provider = TracerProvider(resource=resource)
provider.add_span_processor(SimpleSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)

tracer = trace.get_tracer("claude-code.skills")

def emit_skill_span(skill_name: str, trigger: str, duration_ms: float, status: str = "success"):
    with tracer.start_as_current_span("claude_code.skill_activated") as span:
        span.set_attribute("skill.name", skill_name)
        span.set_attribute("invocation_trigger", trigger)
        span.set_attribute("skill.duration_ms", duration_ms)
        span.set_attribute("skill.status", status)
        # Simulate the skill running
        time.sleep(duration_ms / 1000.0)

# Emit three representative spans
emit_skill_span("test", "user-slash", 320.0)
emit_skill_span("build", "claude-proactive", 4800.0)  # slow
emit_skill_span("lint", "nested-skill", 90.0, status="error")

print("verify_span_emit_ok")

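Then, from a separate script or notebook cell, run the file in a subprocess and print the tail of each output stream to confirm it exits cleanly:

import subprocess, sys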
result = subprocess.run([sys.executable, "/workspace/emit_test_span.py"], capture_output=True, text=True, timeout=30)
print(result.stdout[-200:] if result.stdout else "")
print(result.stderr[-200:] if result.stderr else "")

Step 3: Build the LangSmith exporter bridge

This module wraps the OTLP HTTP exporter configured for LangSmith’s endpoint. Claude Code reads OTEL_EXPORTER_OTLP_ENDPOINT and OTEL_EXPORTER_OTLP_HEADERS from the environment automatically when CLAUDE_CODE_ENABLE_TELEMETRY=1 is set, and it handles its own export. The module below is therefore not used by Claude Code itself, nor by the query script in Step 4. It is imported by the synthetic-span script in Step 6, and lets you emit additional spans from Python tests or CI scripts through the same pipeline.

# filename: langsmith_otel.py
import os
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.sdk.resources import Resource
from opentelemetry.exporter.otlp.proto.http.trace_exporter import OTLPSpanExporter

def build_tracer(service_name: str = "claude-code") -> trace.Tracer:
    """Return a tracer that exports spans to LangSmith via OTLP/HTTP."""
    endpoint = os.environ.get(
        "OTEL_EXPORTER_OTLP_ENDPOINT",
        "https://api.smith.langchain.com/otel",
    )
    api_key = os.environ.get("LANGSMITH_API_KEY", "")
    headers = {"x-api-key": api_key}

    exporter = OTLPSpanExporter(
        endpoint=f"{endpoint.rstrip('/')}/v1/traces",
        headers=headers,
    )
    resource = Resource.create({"service.name": service_name})
    provider = TracerProvider(resource=resource)
    provider.add_span_processor(BatchSpanProcessor(exporter))
    trace.set_tracer_provider(provider)
    return trace.get_tracer(service_name)


def emit_skill_event(
    tracer: trace.Tracer,
    skill_name: str,
    trigger: str,
    duration_ms: float,
    status: str = "success",
) -> None:
    """Emit a single claude_code.skill_activated span."""
    with tracer.start_as_current_span("claude_code.skill_activated") as span:
        span.set_attribute("skill.name", skill_name)
        span.set_attribute("invocation_trigger", trigger)
        span.set_attribute("skill.duration_ms", duration_ms)
        span.set_attribute("skill.status", status)
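
Note that build_tracer rebuilds the x-api-key header from LANGSMITH_API_KEY rather than reading OTEL_EXPORTER_OTLP_HEADERS. If you would rather keep a single source of truth, one option is to parse the same variable Claude Code reads. The sketch below assumes the value is a comma-separated list of key=value pairs with optionally percent-encoded values; parse_otlp_headers is a hypothetical helper, not part of the OpenTelemetry SDK.

```python
import os
from urllib.parse import unquote


def parse_otlp_headers(raw: str) -> dict[str, str]:
    """Parse 'k1=v1,k2=v2' into a dict, skipping empty or malformed pairs."""
    headers: dict[str, str] = {}
    for pair in raw.split(","):
        # partition splits on the first '=' only, so values may contain '='.
        key, sep, value = pair.partition("=")
        key = key.strip()
        if not sep or not key:
            continue
        headers[key] = unquote(value.strip())
    return headers


raw_headers = os.environ.get("OTEL_EXPORTER_OTLP_HEADERS", "x-api-key=ls__your_key_here")
# Print only the header names so the key itself never reaches the logs.
print(sorted(parse_otlp_headers(raw_headers)))
```

The resulting dict could be passed as the headers argument in build_tracer in place of the hand-built one.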

Step 4: Write the trace query and flagging script

This is the core deliverable: a script that pulls claude_code.skill_activated runs from LangSmith and flags any that are slow (over a configurable threshold) or ended in an error.

# filename: flag_slow_tool_calls.py
"""
Query LangSmith for claude_code.skill_activated runs and flag
any that exceed SLOW_THRESHOLD_MS or have status=error.
"""
import os
import sys
from datetime import datetime, timedelta, timezone
from typing import Optional

try:
    from langsmith import Client
except ImportError:
    print("langsmith package not installed — run: uv pip install langsmith")
    sys.exit(1)

SLOW_THRESHOLD_MS: float = float(os.environ.get("SLOW_THRESHOLD_MS", "2000"))
LOOKBACK_HOURS: int = int(os.environ.get("LOOKBACK_HOURS", "24"))
PROJECT_NAME: str = os.environ.get("LANGSMITH_PROJECT", "claude-code-traces")


def query_skill_runs(client: Client, project: str, since: datetime):
    """Return all claude_code.skill_activated runs since `since`."""
    runs = client.list_runs(
        project_name=project,
        run_type="chain",
        filter='eq(name, "claude_code.skill_activated")',
        start_time=since,
    )
    return list(runs)


def flag_runs(runs: list, threshold_ms: float) -> list[dict]:
    """Return flagged runs with a reason string."""
    flagged = []
    for run in runs:
        reasons = []
        outputs = run.outputs or {}
        inputs = run.inputs or {}
        # Assumption: OTel span attributes surface under extra.metadata or inputs;
        # inspect one real run in the LangSmith UI to confirm where they land.
        metadata = (run.extra or {}).get("metadata", {})
        duration_ms = metadata.get("skill.duration_ms") or inputs.get("skill.duration_ms")
        status = metadata.get("skill.status") or inputs.get("skill.status", "")
        skill_name = metadata.get("skill.name") or inputs.get("skill.name", "unknown")
        trigger = metadata.get("invocation_trigger") or inputs.get("invocation_trigger", "unknown")

        if run.error:
            reasons.append(f"run error: {run.error[:120]}")
        if status == "error":
            reasons.append("skill.status=error")
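        # Coerce before formatting: the value may arrive as a string, and the
        # ".0f" format spec raises ValueError on str.
        if duration_ms is not None and float(duration_ms) > threshold_ms:
            reasons.append(f"duration {float(duration_ms):.0f}ms > threshold {threshold_ms:.0f}ms")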

        if reasons:
            flagged.append({
                "run_id": str(run.id),
                "skill": skill_name,
                "trigger": trigger,
                "duration_ms": duration_ms,
                "status": status,
                "reasons": reasons,
                "start_time": run.start_time,
            })
    return flagged


def main():
    api_key = os.environ.get("LANGSMITH_API_KEY")
    if not api_key:
        print("ERROR: LANGSMITH_API_KEY is not set.")
        sys.exit(1)

    client = Client(api_key=api_key)
    since = datetime.now(timezone.utc) - timedelta(hours=LOOKBACK_HOURS)

    print(f"Querying project '{PROJECT_NAME}' for skill runs since {since.isoformat()}")
    print(f"Slow threshold: {SLOW_THRESHOLD_MS:.0f} ms")
    print("-" * 60)

    try:
        runs = query_skill_runs(client, PROJECT_NAME, since)
    except Exception as exc:
        print(f"Could not reach LangSmith: {exc}")
        sys.exit(1)

    print(f"Found {len(runs)} claude_code.skill_activated run(s).")

    flagged = flag_runs(runs, SLOW_THRESHOLD_MS)

    if not flagged:
        print("No slow or failed tool calls detected.")
        return

    print(f"\nFlagged {len(flagged)} run(s):")
    for item in flagged:
        print(f"  run_id : {item['run_id']}")
        print(f"  skill  : {item['skill']} (trigger={item['trigger']})")
        print(f"  duration: {item['duration_ms']} ms | status: {item['status']}")
        for reason in item['reasons']:
            print(f"  REASON : {reason}")
        print()

    # Exit non-zero so CI pipelines can catch regressions
    sys.exit(1)


if __name__ == "__main__":
    main()
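
Because the flagging rule only reads a few fields from each run, it can be exercised offline with stub objects before any span reaches LangSmith. The sketch below is a condensed restatement of the rule, not the script itself: reasons_for and stub_run are hypothetical helpers, and the stub assumes attributes live under extra["metadata"].

```python
from types import SimpleNamespace


def reasons_for(run, threshold_ms: float) -> list[str]:
    """Condensed restatement of the flagging rule for one run-like object."""
    metadata = (getattr(run, "extra", None) or {}).get("metadata", {})
    reasons = []
    if getattr(run, "error", None):
        reasons.append("run error")
    if metadata.get("skill.status") == "error":
        reasons.append("skill.status=error")
    duration = metadata.get("skill.duration_ms")
    if duration is not None and float(duration) > threshold_ms:
        reasons.append("slow")
    return reasons


def stub_run(duration_ms, status, error=None):
    """Build a fake run exposing only the fields the rule reads."""
    return SimpleNamespace(
        error=error,
        extra={"metadata": {"skill.duration_ms": duration_ms, "skill.status": status}},
    )


fast_ok = stub_run(320.0, "success")
slow_ok = stub_run("4800", "success")  # string value, in case metadata arrives untyped
fast_err = stub_run(90.0, "error")

for name, run in [("fast_ok", fast_ok), ("slow_ok", slow_ok), ("fast_err", fast_err)]:
    print(name, reasons_for(run, threshold_ms=2000.0))
```

The string-valued duration is deliberate: it checks that the comparison coerces to float rather than assuming a numeric type.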

Step 5: Configure Claude Code to emit spans

Claude Code reads standard OTel environment variables. With the exports from the Setup section in place, start Claude Code normally and run a few slash commands:

# Confirm the environment is wired (no actual Claude Code invocation here)
echo "OTEL_EXPORTER_OTLP_ENDPOINT=${OTEL_EXPORTER_OTLP_ENDPOINT}"
echo "OTEL_SERVICE_NAME=${OTEL_SERVICE_NAME}"
echo "CLAUDE_CODE_ENABLE_TELEMETRY=${CLAUDE_CODE_ENABLE_TELEMETRY}"
echo "env_check_ok"

Once you confirm the variables are set, open a Claude Code session in a project directory and run commands like /test, /build, or /lint. Each invocation emits a claude_code.skill_activated span with invocation_trigger="user-slash" [1]. If Claude autonomously invokes a skill during a task, the span carries invocation_trigger="claude-proactive". Nested skill calls from within another skill carry invocation_trigger="nested-skill".

The spans appear in LangSmith under the project name you set in LANGSMITH_PROJECT, typically within 5-10 seconds of the command completing, though ingestion can occasionally lag by up to 30 seconds.
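
The echo-based check above only shows the values; it does not catch a missing variable or stray whitespace in the header. A rough Python equivalent, under the assumption that the four variables from the Setup section are the ones that matter, might look like the following. The preflight function and REQUIRED_VARS tuple are illustrative names, not part of any tool.

```python
import os
from typing import Mapping

REQUIRED_VARS = (
    "CLAUDE_CODE_ENABLE_TELEMETRY",
    "OTEL_EXPORTER_OTLP_ENDPOINT",
    "OTEL_EXPORTER_OTLP_HEADERS",
    "OTEL_SERVICE_NAME",
)


def preflight(env: Mapping[str, str]) -> list[str]:
    """Return human-readable problems; an empty list means the env looks wired."""
    problems = [f"{name} is not set" for name in REQUIRED_VARS if not env.get(name)]
    if env.get("CLAUDE_CODE_ENABLE_TELEMETRY") not in (None, "", "1"):
        problems.append("CLAUDE_CODE_ENABLE_TELEMETRY should be 1")
    headers = env.get("OTEL_EXPORTER_OTLP_HEADERS", "")
    if " " in headers:
        problems.append("OTEL_EXPORTER_OTLP_HEADERS contains whitespace")
    if headers and "x-api-key=" not in headers:
        problems.append("OTEL_EXPORTER_OTLP_HEADERS lacks x-api-key=<key>")
    return problems


for problem in preflight(os.environ):
    print("WARN:", problem)
```

Run it in the same shell you will launch Claude Code from, since the variables must be set before that process starts.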

Step 6: Emit synthetic spans for testing (without a live Claude Code session)

To validate the full pipeline without running Claude Code, use the langsmith_otel module from Step 3 to push synthetic spans directly to LangSmith.

# This block requires LANGSMITH_API_KEY to be set.
import os, sys

api_key = os.environ.get("LANGSMITH_API_KEY", "")
if not api_key or api_key.startswith("ls__your"):
    print("LANGSMITH_API_KEY not configured — skipping live export (expected in sandbox).")
    print("synthetic_skip_ok")
    sys.exit(0)

from langsmith_otel import build_tracer, emit_skill_event
import time

tracer = build_tracer()

test_cases = [
    ("test",  "user-slash",      320.0,  "success"),
    ("build", "claude-proactive", 4800.0, "success"),  # will be flagged as slow
    ("lint",  "nested-skill",    90.0,   "error"),     # will be flagged as error
]

for skill, trigger, duration_ms, status in test_cases:
    emit_skill_event(tracer, skill, trigger, duration_ms, status)
    print(f"Emitted: {skill} ({trigger}) {duration_ms}ms [{status}]")
    time.sleep(0.1)

# Flush the batch processor
from opentelemetry import trace as otel_trace
otel_trace.get_tracer_provider().force_flush()
print("Spans flushed to LangSmith.")
print("synthetic_emit_ok")

Verify it works

Run the flagging script against your LangSmith project. In the sandbox the API key is not set, so the script exits with a clear error message rather than silently failing. On your machine with a real key, it will query the last 24 hours of runs and print any that exceed 2000 ms or carry skill.status=error.

import subprocess, sys, os

env = os.environ.copy()
env["SLOW_THRESHOLD_MS"] = "2000"
env["LOOKBACK_HOURS"] = "24"

result = subprocess.run(
    [sys.executable, "/workspace/flag_slow_tool_calls.py"],
    capture_output=True,
    text=True,
    env=env,
    timeout=30,
)
output = result.stdout + result.stderr
print(output[:800])
# Accept either a successful query or the expected "not set" error
assert "LANGSMITH_API_KEY" in output or "Querying project" in output or "Found" in output, \
    f"Unexpected output: {output}"
print("verify_script_runs_ok")

When the key is present and spans have been emitted, the output looks like:

Querying project 'claude-code-traces' for skill runs since 2025-01-15T10:00:00+00:00
Slow threshold: 2000 ms
------------------------------------------------------------
Found 3 claude_code.skill_activated run(s).

Flagged 2 run(s):
  run_id : 3f8a1b2c-...
  skill  : build (trigger=claude-proactive)
  duration: 4800.0 ms | status: success
  REASON : duration 4800ms > threshold 2000ms

  run_id : 9d4e7f1a-...
  skill  : lint (trigger=nested-skill)
  duration: 90.0 ms | status: error
  REASON : skill.status=error

The script exits with code 1 when flagged runs exist, making it suitable as a CI gate.

Troubleshooting

Spans do not appear in LangSmith after running slash commands. Confirm CLAUDE_CODE_ENABLE_TELEMETRY=1 is exported in the same shell where you launched Claude Code. The variable must be set before the process starts; setting it after launch has no effect. Also verify OTEL_EXPORTER_OTLP_HEADERS contains x-api-key=<your_key> with no extra whitespace.

list_runs returns zero results even though spans were emitted. LangSmith ingestion can lag by 10-30 seconds. Wait and retry. Also check that LANGSMITH_PROJECT matches the project name exactly, including case. LangSmith project names are case-sensitive.

The filter argument to list_runs raises a validation error. Older versions of the langsmith Python SDK use a different filter syntax. Run uv pip install --upgrade langsmith to get the version that accepts the eq(name, ...) expression syntax.

invocation_trigger is missing from span attributes. This attribute was added in Claude Code v2.1.126 [1]. Run claude --version and upgrade if the version is older. Spans from older versions still appear but lack the invocation_trigger field.

The flagging script exits 1 in CI even when no real regressions exist. The script exits 1 whenever flagged runs are found, including runs from previous sessions. Narrow the lookback window with LOOKBACK_HOURS=1 or filter by a specific invocation_trigger value by adding a secondary filter condition to query_skill_runs.

OTLPSpanExporter raises a connection error in a corporate network. Some networks block gRPC but allow HTTP. The exporter in this tutorial uses otlp-proto-http, which sends spans over HTTPS port 443. If that still fails, check whether an HTTP proxy is required and set HTTPS_PROXY accordingly.
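
For the ingestion-lag case described above, "wait and retry" can be automated with a small polling helper. This is a sketch only: wait_for_runs is a hypothetical function, the fetch callable stands in for something like a call to query_skill_runs, and the default delays (5 s, 10 s, 20 s) are chosen simply to cover the 10-30 second window mentioned above.

```python
import time
from typing import Callable, Sequence


def wait_for_runs(
    fetch: Callable[[], Sequence],
    attempts: int = 4,
    initial_delay_s: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Sequence:
    """Call fetch() until it returns a non-empty result or attempts run out."""
    delay = initial_delay_s
    result: Sequence = []
    for attempt in range(attempts):
        result = fetch()
        if result:
            return result
        if attempt < attempts - 1:
            sleep(delay)
            delay *= 2  # exponential backoff between polls
    return result


# Demo with a fake fetch that succeeds on the third call; sleep is stubbed out
# so the example finishes instantly.
calls = iter([[], [], ["run-1"]])
slept: list[float] = []
found = wait_for_runs(lambda: next(calls), sleep=slept.append)
print(found, slept)
```

In the flagging script, the fetch argument could be lambda: query_skill_runs(client, PROJECT_NAME, since). Injecting sleep keeps the helper testable without real delays.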

Next steps

Add a Grafana dashboard by forwarding the same OTLP spans to an OpenTelemetry Collector that fans out to both LangSmith and a local Tempo instance. The span structure is identical; only the exporter endpoint changes.
Extend flag_slow_tool_calls.py to group results by invocation_trigger and compute p95 latency per skill, giving a clearer picture of whether slowness is user-driven or autonomous.
Wire the flagging script into a GitHub Actions workflow that runs after each Claude Code-assisted PR, posting a summary comment when any tool call exceeds the threshold.
Use the nested-skill trigger value to build a call graph: each nested span carries a parent span ID, so you can reconstruct the full skill invocation tree for complex multi-step tasks.
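
As a starting point for the p95 idea above, the grouping can be sketched with the standard library alone. The input shape mirrors the dicts built in flag_slow_tool_calls.py (keys trigger, skill, duration_ms), but p95_by_trigger_and_skill is a hypothetical helper, and the choice of the "inclusive" quantile method is an assumption you may want to revisit for small samples.

```python
from collections import defaultdict
from statistics import quantiles


def p95_by_trigger_and_skill(records: list[dict]) -> dict[tuple[str, str], float]:
    """Group durations by (trigger, skill) and return the 95th percentile of each."""
    groups: dict[tuple[str, str], list[float]] = defaultdict(list)
    for rec in records:
        if rec.get("duration_ms") is None:
            continue  # skip runs that carried no duration attribute
        groups[(rec["trigger"], rec["skill"])].append(float(rec["duration_ms"]))

    result: dict[tuple[str, str], float] = {}
    for key, values in groups.items():
        if len(values) == 1:
            # A single sample has no meaningful percentile; report it as-is.
            result[key] = values[0]
        else:
            # quantiles(n=100) returns 99 cut points; index 94 is the 95th.
            result[key] = quantiles(values, n=100, method="inclusive")[94]
    return result


sample = [
    {"trigger": "user-slash", "skill": "test", "duration_ms": d}
    for d in (100, 200, 300, 400, 500)
] + [{"trigger": "claude-proactive", "skill": "build", "duration_ms": 4800.0}]

for (trigger, skill), p95 in sorted(p95_by_trigger_and_skill(sample).items()):
    print(f"{trigger:18s} {skill:8s} p95={p95:.0f} ms")
```

Keying on the (trigger, skill) pair keeps user-driven and autonomous invocations of the same skill in separate buckets, which is the comparison the bullet above is after.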