Why this matters
As coding agents gain the ability to autonomously run commands and interact with development tools [2], the gap between “works on my laptop” and “auditable in production” has become a real operational problem. Print-based debugging collapses under any multi-step agent: you get a wall of text with no timing, no causal structure, and no way to correlate a slow response with a specific model or tool call.
OpenLLMetry (the OpenTelemetry instrumentation layer for LLM workloads) solves this by auto-instrumenting LiteLLM’s completion calls and emitting spans that carry gen_ai.model, gen_ai.usage.prompt_tokens, gen_ai.usage.completion_tokens, and tool-call attributes. Because LiteLLM routes to any provider behind a single API surface, the same instrumentation code works whether your agent calls GPT-4o, Claude 3.5, or a self-hosted model on Hetzner.
This tutorial wires OpenLLMetry to a console exporter (runnable anywhere, no Docker needed) and shows the exact span attributes you can forward to Grafana Tempo or any OTLP-compatible backend by swapping one exporter line.
Prerequisites
- Python 3.11 or 3.12
- An API key for at least one LLM provider (OpenAI, Anthropic, or any LiteLLM-supported provider)
- Basic familiarity with OpenTelemetry concepts (spans, exporters, tracer providers)
- Optional: a running Grafana Tempo instance if you want to forward traces beyond the console
Setup
Install LiteLLM, the LiteLLM instrumentation package (published on PyPI as openinference-instrumentation-litellm), the OpenTelemetry SDK, and the OTLP gRPC exporter:
uv pip install litellm opentelemetry-sdk opentelemetry-exporter-otlp-proto-grpc openinference-instrumentation-litellm
Verify the key packages are present:
from importlib.metadata import version
print("litellm:", version("litellm"))
print("openinference-instrumentation-litellm:", version("openinference-instrumentation-litellm"))
print("opentelemetry-sdk:", version("opentelemetry-sdk"))
print("setup_ok")
Step 1: Understand what the instrumentation captures
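Before writing any agent code, it helps to see exactly what OpenLLMetry emits. The LiteLLMInstrumentor patches litellm.completion (and its async variant) when instrument() is called, not at import time. Each call then becomes a span with attributes like the ones below. Exact attribute names vary with the instrumentation package and its version, so confirm them against the console output of a real call: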
| Attribute | Example value |
|---|---|
| gen_ai.system | openai |
| gen_ai.request.model | gpt-4o-mini |
| gen_ai.usage.prompt_tokens | 42 |
| gen_ai.usage.completion_tokens | 18 |
| gen_ai.response.finish_reasons | ["stop"] |
| llm.request.type | chat |
The agent built in Step 3 adds a child span per tool call with tool.name and the serialized arguments. This structure is what lets you write Tempo queries like {span.gen_ai.request.model="gpt-4o-mini"} and immediately see every call to that model across all agent runs.
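To make the query idea concrete before any backend is involved, here is a minimal sketch of the same filter in plain Python. The span records are hypothetical dictionaries shaped like the attribute table above, and spans_for_model and tokens_by_model are illustrative helpers, not part of any library.

```python
from collections import defaultdict

# Hypothetical span records; the attribute names follow the table above.
spans = [
    {"name": "completion", "attributes": {
        "gen_ai.request.model": "gpt-4o-mini",
        "gen_ai.usage.prompt_tokens": 42,
        "gen_ai.usage.completion_tokens": 18}},
    {"name": "completion", "attributes": {
        "gen_ai.request.model": "gpt-4o-mini",
        "gen_ai.usage.prompt_tokens": 10,
        "gen_ai.usage.completion_tokens": 5}},
    {"name": "completion", "attributes": {
        "gen_ai.request.model": "claude-3-haiku-20240307",
        "gen_ai.usage.prompt_tokens": 30,
        "gen_ai.usage.completion_tokens": 20}},
    {"name": "tool.calculate", "attributes": {"tool.name": "calculate"}},
]


def spans_for_model(records, model):
    """Python analogue of filtering on gen_ai.request.model."""
    return [s for s in records
            if s["attributes"].get("gen_ai.request.model") == model]


def tokens_by_model(records):
    """Sum prompt and completion tokens per model, skipping non-LLM spans."""
    totals = defaultdict(int)
    for span in records:
        attrs = span["attributes"]
        model = attrs.get("gen_ai.request.model")
        if model is None:
            continue
        totals[model] += attrs.get("gen_ai.usage.prompt_tokens", 0)
        totals[model] += attrs.get("gen_ai.usage.completion_tokens", 0)
    return dict(totals)


print(len(spans_for_model(spans, "gpt-4o-mini")))
print(tokens_by_model(spans))
```

A trace backend does this filtering and aggregation server-side; the point of the sketch is only that consistent attribute names are what make it possible.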
Step 2: Wire the tracer provider with a console exporter
The console exporter is the right starting point: it requires no running service, and the output is identical in structure to what you’d send to Tempo. You swap the exporter later without changing any agent code.
# filename: tracing_setup.py
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import SimpleSpanProcessor, ConsoleSpanExporter
from openinference.instrumentation.litellm import LiteLLMInstrumentor
def configure_tracing() -> TracerProvider:
    """Set up a TracerProvider that prints spans to stdout.

    Swap ConsoleSpanExporter for OTLPSpanExporter to forward to Tempo.
    """
    provider = TracerProvider()
    # SimpleSpanProcessor flushes each span synchronously -- ideal for scripts
    # and tests. Use BatchSpanProcessor in long-running services.
    provider.add_span_processor(SimpleSpanProcessor(ConsoleSpanExporter()))
    trace.set_tracer_provider(provider)
    # Patch litellm.completion so every call emits a span automatically.
    LiteLLMInstrumentor().instrument()
    return provider
Two design choices worth noting:
- SimpleSpanProcessor flushes each span the moment it ends. This is correct for scripts and tests. In a long-running service, replace it with BatchSpanProcessor for throughput.
- LiteLLMInstrumentor().instrument() monkey-patches LiteLLM globally. Call it once at startup, before any litellm.completion call.
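The difference between the two processors can be illustrated with a toy model. This is a sketch only: ImmediateProcessor and BufferedProcessor are hypothetical stand-ins written for this example, and the real BatchSpanProcessor also flushes from a background thread on a timer, which the toy omits.

```python
class ImmediateProcessor:
    """Toy stand-in for per-span export: every finished span goes straight out."""

    def __init__(self, sink):
        self.sink = sink

    def on_end(self, span):
        self.sink.append(span)

    def force_flush(self):
        # Nothing is ever buffered, so there is nothing to flush.
        pass


class BufferedProcessor:
    """Toy stand-in for batched export: spans wait in a buffer until flushed."""

    def __init__(self, sink, max_batch=3):
        self.sink = sink
        self.max_batch = max_batch
        self._buffer = []

    def on_end(self, span):
        self._buffer.append(span)
        if len(self._buffer) >= self.max_batch:
            self.force_flush()

    def force_flush(self):
        self.sink.extend(self._buffer)
        self._buffer.clear()


immediate_out, buffered_out = [], []
immediate = ImmediateProcessor(immediate_out)
buffered = BufferedProcessor(buffered_out, max_batch=3)

for span_name in ("agent.run", "tool.calculate"):
    immediate.on_end(span_name)
    buffered.on_end(span_name)

print(immediate_out)   # both spans are already exported
print(buffered_out)    # still empty: the batch is not full yet
buffered.force_flush()
print(buffered_out)    # both spans appear after the explicit flush
```

This is why a short script that exits right after the agent call can lose spans under batching unless it flushes first.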
Step 3: Build a minimal tool-calling agent
This agent simulates a two-tool workflow: a calculator and a unit converter. The tools are plain Python functions. The agent loop calls LiteLLM, checks for tool calls in the response, dispatches them, and feeds results back. No framework needed.
# filename: agent.py
import json
import litellm
from opentelemetry import trace
# Tool implementations
def calculate(expression: str) -> str:
    """Evaluate a simple arithmetic expression."""
    try:
        result = eval(expression, {"__builtins__": {}})  # noqa: S307
        return str(result)
    except Exception as exc:
        return f"error: {exc}"


def convert_units(value: float, from_unit: str, to_unit: str) -> str:
    """Convert between a small set of units."""
    conversions = {
        ("km", "miles"): 0.621371,
        ("miles", "km"): 1.60934,
        ("kg", "lbs"): 2.20462,
        ("lbs", "kg"): 0.453592,
    }
    factor = conversions.get((from_unit, to_unit))
    if factor is None:
        return f"unknown conversion: {from_unit} -> {to_unit}"
    return f"{value * factor:.4f} {to_unit}"


TOOL_REGISTRY = {
    "calculate": calculate,
    "convert_units": lambda args: convert_units(
        args["value"], args["from_unit"], args["to_unit"]
    ),
}
TOOL_SCHEMAS = [
    {
        "type": "function",
        "function": {
            "name": "calculate",
            "description": "Evaluate an arithmetic expression",
            "parameters": {
                "type": "object",
                "properties": {"expression": {"type": "string"}},
                "required": ["expression"],
            },
        },
    },
    {
        "type": "function",
        "function": {
            "name": "convert_units",
            "description": "Convert a numeric value between units",
            "parameters": {
                "type": "object",
                "properties": {
                    "value": {"type": "number"},
                    "from_unit": {"type": "string"},
                    "to_unit": {"type": "string"},
                },
                "required": ["value", "from_unit", "to_unit"],
            },
        },
    },
]
def run_agent(model: str, user_message: str) -> str:
    """Run a single-turn agent loop with tool support.

    The model parameter is any LiteLLM model string, e.g.:
    'openai/gpt-4o-mini', 'anthropic/claude-3-haiku-20240307'
    """
    tracer = trace.get_tracer("agent")
    messages = [{"role": "user", "content": user_message}]
    with tracer.start_as_current_span("agent.run") as agent_span:
        agent_span.set_attribute("agent.model", model)
        agent_span.set_attribute("agent.user_message", user_message)
        for step in range(5):  # guard against infinite loops
            response = litellm.completion(
                model=model,
                messages=messages,
                tools=TOOL_SCHEMAS,
                tool_choice="auto",
            )
            choice = response.choices[0]
            if choice.finish_reason == "tool_calls":
                tool_calls = choice.message.tool_calls
                messages.append(choice.message)
                for tc in tool_calls:
                    fn_name = tc.function.name
                    fn_args = json.loads(tc.function.arguments)
                    with tracer.start_as_current_span(f"tool.{fn_name}") as tool_span:
                        tool_span.set_attribute("tool.name", fn_name)
                        tool_span.set_attribute("tool.arguments", json.dumps(fn_args))
                        if fn_name == "calculate":
                            result = calculate(fn_args["expression"])
                        elif fn_name in TOOL_REGISTRY:
                            result = TOOL_REGISTRY[fn_name](fn_args)
                        else:
                            # Report unknown tools to the model instead of raising KeyError.
                            result = f"unknown tool: {fn_name}"
                        tool_span.set_attribute("tool.result", result)
                    messages.append({
                        "role": "tool",
                        "tool_call_id": tc.id,
                        "content": result,
                    })
            else:
                final_answer = choice.message.content or ""
                agent_span.set_attribute("agent.final_answer", final_answer)
                agent_span.set_attribute("agent.steps", step + 1)
                return final_answer
        return "max steps reached"
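One caution about the calculate tool above: passing an empty __builtins__ to eval does not make it safe for model-supplied input, because object introspection can still reach arbitrary classes. A minimal alternative, assuming only basic arithmetic is needed, walks the parsed expression with the standard ast module. The name safe_calculate is a hypothetical drop-in for this sketch, and exponentiation is deliberately left out so an expression cannot request an enormous result.

```python
import ast
import operator

# Only these operators are allowed; anything else is rejected.
_BINARY_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
    ast.FloorDiv: operator.floordiv,
    ast.Mod: operator.mod,
}
_UNARY_OPS = {ast.UAdd: operator.pos, ast.USub: operator.neg}


def _eval_node(node):
    """Recursively evaluate a whitelisted arithmetic AST node."""
    if isinstance(node, ast.Constant):
        value = node.value
        if isinstance(value, (int, float)) and not isinstance(value, bool):
            return value
        raise ValueError("only numeric literals are allowed")
    if isinstance(node, ast.BinOp) and type(node.op) in _BINARY_OPS:
        left = _eval_node(node.left)
        right = _eval_node(node.right)
        return _BINARY_OPS[type(node.op)](left, right)
    if isinstance(node, ast.UnaryOp) and type(node.op) in _UNARY_OPS:
        return _UNARY_OPS[type(node.op)](_eval_node(node.operand))
    raise ValueError(f"unsupported expression element: {type(node).__name__}")


def safe_calculate(expression: str) -> str:
    """Evaluate +, -, *, /, //, % over numeric literals; return an error string otherwise."""
    try:
        tree = ast.parse(expression, mode="eval")
        return str(_eval_node(tree.body))
    except Exception as exc:
        return f"error: {exc}"


print(safe_calculate("42 * 7"))
print(safe_calculate("__import__('os').getcwd()"))
```

Because it returns strings in the same shape as calculate, it could be swapped into the tool registry without touching the agent loop.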
The key structural point: the agent.run span wraps the entire loop, and each tool dispatch gets its own child span. This parent-child relationship is what Tempo renders as a waterfall, letting you see at a glance whether latency came from the LLM call or the tool execution.
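To see why parent IDs are enough to draw that waterfall, here is a small sketch that rebuilds the tree from flat span records. The records and their millisecond timings are made up for illustration, and render_waterfall is a hypothetical helper, not how Tempo renders traces internally.

```python
def render_waterfall(spans):
    """Return indented text lines, nesting each span under its parent."""
    children = {}
    for span in spans:
        children.setdefault(span["parent_id"], []).append(span)

    lines = []

    def walk(parent_id, depth):
        siblings = sorted(children.get(parent_id, []), key=lambda s: s["start_ms"])
        for span in siblings:
            duration = span["end_ms"] - span["start_ms"]
            lines.append(f"{'  ' * depth}{span['name']} ({duration} ms)")
            walk(span["span_id"], depth + 1)

    walk(None, 0)  # root spans have no parent
    return lines


# Illustrative records for one agent run: two LLM calls around one tool call.
example_spans = [
    {"name": "completion", "span_id": "d4", "parent_id": "a1", "start_ms": 620, "end_ms": 945},
    {"name": "agent.run", "span_id": "a1", "parent_id": None, "start_ms": 0, "end_ms": 950},
    {"name": "tool.calculate", "span_id": "c3", "parent_id": "a1", "start_ms": 615, "end_ms": 617},
    {"name": "completion", "span_id": "b2", "parent_id": "a1", "start_ms": 5, "end_ms": 610},
]

for line in render_waterfall(example_spans):
    print(line)
```

In this made-up run the two completion calls account for almost all of the 950 ms, and the tool takes 2 ms, which is the kind of conclusion the waterfall view makes immediate.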
Step 4: Replace print statements with span events
The old pattern looks like this:
# Old print-debug style -- do not use
print(f"[DEBUG] Calling model {model} with {len(messages)} messages")
response = litellm.completion(...)
print(f"[DEBUG] Got response: {response.choices[0].finish_reason}")
print(f"[DEBUG] Tokens used: {response.usage.total_tokens}")
The structured replacement uses span events and attributes:
# filename: span_events_demo.py
from opentelemetry import trace
def demo_span_events():
    tracer = trace.get_tracer("demo")
    with tracer.start_as_current_span("llm.call") as span:
        # Attributes are indexed and queryable
        span.set_attribute("gen_ai.request.model", "gpt-4o-mini")
        span.set_attribute("message_count", 3)
        # Events are timestamped log entries inside the span
        span.add_event("before_completion", {"message_count": 3})
        # ... completion call would go here ...
        span.add_event("after_completion", {
            "finish_reason": "stop",
            "total_tokens": 60,
        })
        span.set_attribute("gen_ai.usage.total_tokens", 60)
    print("span_events_demo_ok")


demo_span_events()
Attributes (set_attribute) are indexed fields you filter on. Events (add_event) are timestamped log lines attached to the span timeline. Use attributes for things you want to aggregate (model name, token counts, tool names) and events for narrative checkpoints (“retrying after rate limit”, “cache hit”).
Step 5: Run the agent and inspect the trace output
The entry point below wires tracing, runs the agent with a mock response (so it executes without a real API key in the sandbox), and prints the span structure to stdout.
# filename: run_demo.py
import json
from unittest.mock import MagicMock, patch
from tracing_setup import configure_tracing
def make_mock_response(content: str, finish_reason: str = "stop"):
    """Build a minimal litellm-shaped response object."""
    choice = MagicMock()
    choice.finish_reason = finish_reason
    choice.message.content = content
    choice.message.tool_calls = None
    usage = MagicMock()
    usage.prompt_tokens = 25
    usage.completion_tokens = 15
    usage.total_tokens = 40
    response = MagicMock()
    response.choices = [choice]
    response.usage = usage
    response.model = "gpt-4o-mini"
    return response


def run_traced_demo():
    provider = configure_tracing()
    mock_response = make_mock_response(
        "The result of 42 * 7 is 294, which is approximately 182.7 miles."
    )
    with patch("litellm.completion", return_value=mock_response):
        from agent import run_agent
        result = run_agent(
            model="openai/gpt-4o-mini",
            user_message="What is 42 * 7, and convert that many km to miles?",
        )
    print("\n=== Agent result ===")
    print(result)
    # Force flush so all spans are written before the process exits
    provider.force_flush()
    print("\ndemo_complete")


run_traced_demo()
python /workspace/run_demo.py
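You’ll see JSON span objects printed to stdout. Each span includes name, context.trace_id, context.span_id, parent_id, start_time, end_time, and the attributes dict. The agent.run span’s parent_id is null (it’s the root). In this mocked run it is the only span: patching litellm.completion replaces the instrumented function, and the mock returns finish_reason "stop", so no tool is dispatched. With a real API key and no mock, the completion span emitted by the instrumentation and any tool.* spans carry the agent span’s ID as their parent_id.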
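If you want to check that parent-child structure programmatically rather than by eye, the console output can be parsed back into dictionaries. The sketch below assumes the exporter writes one JSON object per span, back to back, with nothing else mixed into the text (for example when the exporter writes to its own buffer instead of shared stdout). The sample text is hand-written and abbreviated, and parse_console_spans is a hypothetical helper.

```python
import json


def parse_console_spans(text):
    """Split a stream of concatenated JSON objects into a list of dicts."""
    decoder = json.JSONDecoder()
    spans = []
    pos = 0
    while pos < len(text):
        # raw_decode does not skip leading whitespace, so do it by hand.
        while pos < len(text) and text[pos].isspace():
            pos += 1
        if pos >= len(text):
            break
        obj, pos = decoder.raw_decode(text, pos)
        spans.append(obj)
    return spans


# Abbreviated, hand-written sample in the shape described above.
sample_output = """{
    "name": "tool.calculate",
    "context": {"trace_id": "0x01", "span_id": "0x0b"},
    "parent_id": "0x0a",
    "attributes": {"tool.name": "calculate"}
}
{
    "name": "agent.run",
    "context": {"trace_id": "0x01", "span_id": "0x0a"},
    "parent_id": null,
    "attributes": {"agent.steps": 2}
}
"""

parsed = parse_console_spans(sample_output)
roots = [s["name"] for s in parsed if s["parent_id"] is None]
children = [s["name"] for s in parsed if s["parent_id"] == "0x0a"]
print(roots, children)
```

Child spans are listed before their parent in the sample because a span is exported when it ends, and the root ends last.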
Step 6: Forward traces to Grafana Tempo (OTLP)
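Swapping the exporter is a small change to tracing_setup.py: a different exporter class and, for long-running services, BatchSpanProcessor in place of SimpleSpanProcessor. No agent code changes.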
# filename: tracing_setup_tempo.py
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from openinference.instrumentation.litellm import LiteLLMInstrumentor
import os
def configure_tracing_tempo() -> TracerProvider:
    """Configure tracing to export to Grafana Tempo via OTLP/gRPC.

    Set OTEL_EXPORTER_OTLP_ENDPOINT to your Tempo endpoint, e.g.:
    export OTEL_EXPORTER_OTLP_ENDPOINT=http://localhost:4317
    """
    endpoint = os.environ.get("OTEL_EXPORTER_OTLP_ENDPOINT", "http://localhost:4317")
    exporter = OTLPSpanExporter(endpoint=endpoint, insecure=True)
    provider = TracerProvider()
    # BatchSpanProcessor is correct for production: buffers and flushes efficiently.
    provider.add_span_processor(BatchSpanProcessor(exporter))
    trace.set_tracer_provider(provider)
    LiteLLMInstrumentor().instrument()
    return provider
To run Tempo locally, start it with Docker (outside the sandbox):
# docker-compose.yml for local Tempo
services:
  tempo:
    image: grafana/tempo:latest
    command: ["-config.file=/etc/tempo.yaml"]
    ports:
      - "4317:4317"  # OTLP gRPC
      - "3200:3200"  # Tempo HTTP API
    volumes:
      - ./tempo.yaml:/etc/tempo.yaml
Then set OTEL_EXPORTER_OTLP_ENDPOINT=http://localhost:4317 and replace configure_tracing() with configure_tracing_tempo() in your entry point. The same span structure that printed to the console now indexes in Tempo. You can query it with TraceQL:
{ span.gen_ai.request.model = "gpt-4o-mini" } | avg(duration) > 1s
The same OTLP payload works with Datadog, Honeycomb, or New Relic. Only the exporter endpoint and authentication headers change.
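As a rough sketch of what "only the endpoint and headers change" means in practice, the helper below derives exporter settings from environment-style input. The function names, the example.com endpoint, and the rule that an https scheme implies a secure channel are assumptions made for this illustration. The OpenTelemetry SDK reads these variables itself, so a helper like this is mainly useful for checking what will be sent.

```python
from urllib.parse import unquote, urlparse

DEFAULT_ENDPOINT = "http://localhost:4317"


def parse_otlp_headers(raw: str) -> dict:
    """Parse 'key1=value1,key2=value2' into a dict, ignoring malformed pairs."""
    headers = {}
    for pair in raw.split(","):
        if "=" not in pair:
            continue
        key, value = pair.split("=", 1)
        # gRPC metadata keys are lowercase; values may be URL-encoded.
        headers[key.strip().lower()] = unquote(value.strip())
    return headers


def exporter_settings(env) -> dict:
    """Collect endpoint, headers, and TLS choice from an environment mapping."""
    endpoint = env.get("OTEL_EXPORTER_OTLP_ENDPOINT", DEFAULT_ENDPOINT)
    return {
        "endpoint": endpoint,
        "headers": parse_otlp_headers(env.get("OTEL_EXPORTER_OTLP_HEADERS", "")),
        # Assumption for this sketch: plain http means an insecure channel.
        "insecure": urlparse(endpoint).scheme != "https",
    }


local = exporter_settings({})
managed = exporter_settings({
    "OTEL_EXPORTER_OTLP_ENDPOINT": "https://otlp.example.com:4317",  # placeholder
    "OTEL_EXPORTER_OTLP_HEADERS": "x-api-key=secret123,x-team=agents",
})
print(local)
print(managed)
```

Passing a plain dict instead of os.environ keeps the helper easy to test; in an entry point you would call exporter_settings(os.environ).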
Verify it works
import io
import sys
from unittest.mock import MagicMock, patch
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import SimpleSpanProcessor, ConsoleSpanExporter
from openinference.instrumentation.litellm import LiteLLMInstrumentor
# Fresh provider for this verification block
provider = TracerProvider()
buf = io.StringIO()
provider.add_span_processor(SimpleSpanProcessor(ConsoleSpanExporter(out=buf)))
trace.set_tracer_provider(provider)
# Re-instrument with the fresh provider
LiteLLMInstrumentor().uninstrument()
LiteLLMInstrumentor().instrument()
mock_resp = MagicMock()
mock_resp.choices[0].finish_reason = "stop"
mock_resp.choices[0].message.content = "294"
mock_resp.choices[0].message.tool_calls = None
mock_resp.usage.prompt_tokens = 10
mock_resp.usage.completion_tokens = 5
mock_resp.usage.total_tokens = 15
mock_resp.model = "gpt-4o-mini"
with patch("litellm.completion", return_value=mock_resp):
    from agent import run_agent
    answer = run_agent("openai/gpt-4o-mini", "What is 6 * 7?")
provider.force_flush()
span_output = buf.getvalue()
# Verify the agent span was emitted
assert "agent.run" in span_output, f"Expected 'agent.run' span, got:\n{span_output[:500]}"
# Verify the model attribute was recorded
assert "gpt-4o-mini" in span_output, "Expected model attribute in span output"
# Verify the answer came back
assert answer == "294", f"Unexpected answer: {answer}"
print("verification_passed")
print(f"Answer: {answer}")
print(f"Span output length: {len(span_output)} chars")
Troubleshooting
ModuleNotFoundError: No module named 'openinference' — The package name on PyPI is openinference-instrumentation-litellm, not openinference. Run uv pip install openinference-instrumentation-litellm and confirm with uv pip show openinference-instrumentation-litellm.
Spans appear in the console but not in Tempo — Confirm Tempo is listening on port 4317 with curl -v http://localhost:4317. A successful connection is enough here; the gRPC port will not return a useful HTTP response. If Tempo is behind TLS, set insecure=False in OTLPSpanExporter and provide the CA cert via OTEL_EXPORTER_OTLP_CERTIFICATE. Also confirm you called provider.force_flush() before process exit when using BatchSpanProcessor.
LiteLLMInstrumentor().instrument() logs "Attempting to instrument while already instrumented" — You called instrument() twice in the same process. The second call is ignored rather than raising. Call LiteLLMInstrumentor().uninstrument() before re-instrumenting, or check LiteLLMInstrumentor().is_instrumented_by_opentelemetry first.
Tool call spans are missing — The child tool.* spans are created by your agent code, not by OpenLLMetry. Confirm the with tracer.start_as_current_span(...) block in agent.py wraps the tool dispatch and runs in the same thread (or the same asyncio task) as the litellm.completion call, because the current span is tracked per execution context. Async agents keep the plain with statement: start_as_current_span returns a regular context manager, so async with would fail on it.
Token counts show as zero in spans — Some LiteLLM provider adapters don’t populate response.usage for streaming calls. Set stream=False for the instrumented path, or enable stream_options={"include_usage": True} if the provider supports it.
ConsoleSpanExporter output is empty after the agent call — You’re using BatchSpanProcessor instead of SimpleSpanProcessor. The batch processor flushes asynchronously. Either switch to SimpleSpanProcessor for local testing, or call provider.force_flush() immediately after the agent call and before reading the output buffer.
Next steps
- Add cost tracking: LiteLLM exposes response._hidden_params["response_cost"] after each call. Record it as gen_ai.usage.cost on the span and build a Grafana dashboard that aggregates spend by model and user.
- Propagate trace context across services: If your agent calls downstream microservices over HTTP, inject the W3C traceparent header with opentelemetry.propagate.inject(headers) so Tempo links the full distributed trace.
- Sample high-token traces: Keep 100% of traces where gen_ai.usage.total_tokens > 1000 and sample the rest at 10%, reducing storage costs without losing visibility into expensive calls. Token counts are only known after a call completes, so this rule needs tail-based sampling (for example in an OpenTelemetry Collector); a head sampler such as ParentBasedTraceIdRatio can only apply the flat 10% rate.
- Export to a managed OTLP backend: Grafana Cloud, Honeycomb, and Datadog all accept the same OTLP payload. Replace OTLPSpanExporter(endpoint="http://localhost:4317") with the vendor’s endpoint and set OTEL_EXPORTER_OTLP_HEADERS to the auth token. No agent code changes required.
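The high-token sampling idea can be sketched as a decision function. A sampler that runs when a span starts cannot see token counts, so the rule below is written as a decision made after the trace has finished, which is how a tail-sampling stage would apply it. The function name, threshold, and rate are illustrative assumptions, and the trace-ID comparison only mirrors the general approach of ratio-based samplers.

```python
def keep_trace(trace_id: int, total_tokens: int,
               token_threshold: int = 1000, sample_rate: float = 0.10) -> bool:
    """Keep every expensive trace and a deterministic fraction of the rest."""
    if total_tokens > token_threshold:
        return True
    # Compare the low 64 bits of the trace ID against a bound derived from
    # the rate, so the same trace always gets the same decision.
    bound = int(sample_rate * (1 << 64))
    return (trace_id & ((1 << 64) - 1)) < bound


expensive = keep_trace(trace_id=(1 << 64) - 1, total_tokens=5000)
cheap_low_id = keep_trace(trace_id=1, total_tokens=40)
cheap_high_id = keep_trace(trace_id=(1 << 64) - 1, total_tokens=40)
print(expensive, cheap_low_id, cheap_high_id)
```

Because the decision depends only on the trace ID and the token count, every service that evaluates the same trace reaches the same answer without coordination.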
FAQ
What span attributes does OpenLLMetry capture from LiteLLM calls?
OpenLLMetry captures gen_ai.system, gen_ai.request.model, gen_ai.usage.prompt_tokens, gen_ai.usage.completion_tokens, gen_ai.response.finish_reasons, and llm.request.type. Tool spans, with tool.name and the serialized arguments, are added by the agent code in Step 3 as children of agent.run.
How do I forward traces from the console exporter to Grafana Tempo?
Replace ConsoleSpanExporter with OTLPSpanExporter in tracing_setup.py, pointing to your Tempo endpoint (default http://localhost:4317). The same span structure and agent code work unchanged; only the exporter line changes.
Should I use SimpleSpanProcessor or BatchSpanProcessor?
Use SimpleSpanProcessor for scripts and tests because it flushes each span immediately. Use BatchSpanProcessor in long-running services for better throughput. Remember to call provider.force_flush() before process exit with BatchSpanProcessor.
How do I query traces by model name or token count in Tempo?
Use TraceQL queries like { span.gen_ai.request.model = "gpt-4o-mini" } to filter by model, or { span.gen_ai.usage.total_tokens > 1000 } to find expensive calls. The indexed span attributes make these queries fast.
Why are my tool call spans missing from the trace output?
Tool spans are created by your agent code with tracer.start_as_current_span(), not by OpenLLMetry. Confirm the span block wraps the tool execution and runs in the same thread as the litellm.completion call.