Why this matters
Running agents across multiple model providers is now the default, not the exception. Teams mix a fast cheap model for classification, a capable cloud model for reasoning, and a self-hosted endpoint for data-residency requirements. LiteLLM’s router handles the dispatch, but without instrumentation you’re flying blind: you can’t tell whether a latency spike comes from the remote provider, a cold vLLM instance, or a retry storm.
OpenLLMetry (the OpenTelemetry instrumentation layer for LLM calls) attaches token counts, model names, and finish reasons as span attributes on every call. Combined with LiteLLM’s built-in cost metadata, you can compute per-request dollar cost inside the span itself and ship it to any OTLP receiver, whether that’s a local console exporter during development or Grafana Tempo, SigNoz, or Honeycomb in production.
This tutorial wires all three pieces together: a LiteLLM Router with priority-ordered model groups, an OpenTelemetry TracerProvider with a synchronous console exporter for local verification, and a cost-attribution callback that writes llm.usage.cost_usd onto the active span. The same span structure indexes identically on Datadog or Honeycomb; only the exporter endpoint changes.
Prerequisites
- Python 3.11 or 3.12
- Basic familiarity with OpenTelemetry concepts (tracer, span, exporter)
- A Mistral API key (for the live-call steps; structural steps run without one)
- Optional: a running vLLM endpoint or Ollama instance for the local-model leg
- Optional: Grafana Tempo or any OTLP/HTTP receiver for production export
Setup
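Install the required packages. opentelemetry-sdk provides the tracer and console exporter. openinference-instrumentation-litellm is the auto-instrumentation package for LiteLLM used in this tutorial; it comes from the OpenInference project rather than the OpenLLMetry distribution, but it emits standard OpenTelemetry spans.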
uv pip install "litellm>=1.40.0" \
opentelemetry-sdk \
opentelemetry-exporter-otlp-proto-http \
openinference-instrumentation-litellm \
openinference-semantic-conventions
Verify the installs:
from importlib.metadata import version

for pkg in [
    "litellm",
    "opentelemetry-sdk",
    "openinference-instrumentation-litellm",
    "openinference-semantic-conventions",
]:
    print(f"{pkg}: {version(pkg)}")
Step 1: Configure the LiteLLM Router
The router holds a list of model deployments. Each entry maps a logical name (the model_name you call) to a real provider endpoint. The configuration in this tutorial sets no per-deployment priority; the order in which groups are tried comes from the fallback list, which tells the router which logical group to try if the primary group fails.
The configuration below defines three groups:
- fast points at mistral/mistral-small-latest (low latency, low cost)
- capable points at mistral/mistral-large-latest (higher quality, higher cost)
- local points at an OpenAI-compatible endpoint on localhost:8001 (vLLM or Ollama)
In a real deployment you’d add your vLLM base URL and API key to the local entry. For this tutorial the local entry is present in the config but is not part of the fallback chain: only fast falls back to capable, so a call to local fails if the endpoint can’t be reached.
# filename: router_config.py
import os

MODEL_LIST = [
    {
        "model_name": "fast",
        "litellm_params": {
            "model": "mistral/mistral-small-latest",
            "api_key": os.environ.get("MISTRAL_API_KEY", "placeholder"),
        },
    },
    {
        "model_name": "capable",
        "litellm_params": {
            "model": "mistral/mistral-large-latest",
            "api_key": os.environ.get("MISTRAL_API_KEY", "placeholder"),
        },
    },
    {
        "model_name": "local",
        "litellm_params": {
            "model": "openai/local-model",
            "api_base": os.environ.get("LOCAL_MODEL_BASE_URL", "http://localhost:8001/v1"),
            "api_key": os.environ.get("LOCAL_MODEL_API_KEY", "placeholder"),
        },
    },
]

# Fallback chain: try fast first, then capable, skip local for fallback
FALLBACKS = [
    {"fast": ["capable"]},
]

ROUTER_KWARGS = {
    "model_list": MODEL_LIST,
    "fallbacks": FALLBACKS,
    "num_retries": 2,
    "retry_after": 1,
    "allowed_fails": 1,
    "routing_strategy": "simple-shuffle",
}
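To make the fallback semantics concrete, here is a minimal pure-Python sketch of how a list shaped like FALLBACKS can be read as an ordered list of groups to try. The helper name resolve_attempt_order is hypothetical, and this is a mental model rather than LiteLLM's internal implementation: it ignores retries, cooldowns, and the routing strategy.

```python
def resolve_attempt_order(primary: str, fallbacks: list) -> list:
    """Return the primary group followed by its configured fallback groups."""
    order = [primary]
    for entry in fallbacks:
        # Each entry maps one primary group to an ordered list of fallback groups
        for group in entry.get(primary, []):
            if group not in order:
                order.append(group)
    return order


# Same shape as FALLBACKS in router_config.py
FALLBACKS = [{"fast": ["capable"]}]

fast_order = resolve_attempt_order("fast", FALLBACKS)
local_order = resolve_attempt_order("local", FALLBACKS)
print(fast_order)
print(local_order)
```

Under this reading, a request to the local group has no second option, which is why an unreachable local endpoint surfaces as an error rather than a silent reroute.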
Step 2: Build the OpenTelemetry TracerProvider
The SimpleSpanProcessor flushes each span synchronously as it closes. This is the right choice for local development and for the verification step later in this tutorial. In production, swap it for BatchSpanProcessor with an OTLP exporter pointed at your collector.
# filename: otel_setup.py
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import SimpleSpanProcessor, ConsoleSpanExporter
from opentelemetry.sdk.resources import Resource, SERVICE_NAME


def build_tracer_provider(service_name: str = "litellm-router") -> TracerProvider:
    resource = Resource(attributes={SERVICE_NAME: service_name})
    provider = TracerProvider(resource=resource)
    provider.add_span_processor(SimpleSpanProcessor(ConsoleSpanExporter()))
    trace.set_tracer_provider(provider)
    return provider


def get_tracer(name: str = "litellm-router") -> trace.Tracer:
    return trace.get_tracer(name)
To send spans to Grafana Tempo or any OTLP/HTTP endpoint instead of the console, replace ConsoleSpanExporter() with:
# Illustration only — not executed
from opentelemetry.exporter.otlp.proto.http.trace_exporter import OTLPSpanExporter

exporter = OTLPSpanExporter(
    endpoint="http://tempo:4318/v1/traces",  # or your collector URL
    headers={"Authorization": "Bearer YOUR_TOKEN"},
)
The span attributes written in Step 3 are identical regardless of which exporter you use.
Step 3: Wire the Cost-Attribution Callback
LiteLLM fires a success_callback after every successful completion. The callback receives a kwargs dict and a response_obj. The response_obj carries usage (prompt tokens, completion tokens) and LiteLLM computes response_cost in USD automatically.
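The callback below retrieves the currently active OTel span and writes cost, token counts, model name, and latency as span attributes. The token-count and model-name keys follow the OpenInference semantic conventions for LLM spans; the cost and latency keys (llm.usage.cost_usd, llm.latency_ms) are names chosen for this tutorial.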
# filename: cost_callback.py
import litellm
from opentelemetry import trace


def on_llm_success(kwargs: dict, response_obj, start_time, end_time) -> None:
    """LiteLLM success callback that writes cost + usage onto the active OTel span."""
    span = trace.get_current_span()
    if span is None or not span.is_recording():
        return

    # Token usage reported by the provider
    usage = getattr(response_obj, "usage", None)
    if usage:
        span.set_attribute("llm.token_count.prompt", getattr(usage, "prompt_tokens", 0))
        span.set_attribute("llm.token_count.completion", getattr(usage, "completion_tokens", 0))
        span.set_attribute("llm.token_count.total", getattr(usage, "total_tokens", 0))

    # response_cost can be missing or None when LiteLLM has no pricing data
    cost = kwargs.get("response_cost", 0.0) or 0.0
    span.set_attribute("llm.usage.cost_usd", round(cost, 8))

    model = kwargs.get("model", "unknown")
    span.set_attribute("llm.model_name", model)

    # start_time and end_time are datetime objects
    latency_ms = (end_time - start_time).total_seconds() * 1000
    span.set_attribute("llm.latency_ms", round(latency_ms, 2))

    finish_reason = None
    choices = getattr(response_obj, "choices", [])
    if choices:
        finish_reason = getattr(choices[0], "finish_reason", None)
    if finish_reason:
        span.set_attribute("llm.finish_reason", finish_reason)


def register_callbacks() -> None:
    litellm.success_callback = [on_llm_success]
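The arithmetic behind a per-request cost figure can be sketched in a few lines: tokens multiplied by a per-token price, summed over prompt and completion. The function name estimate_cost_usd is hypothetical, and the prices are placeholder values chosen for illustration, not any provider's actual rates. LiteLLM's own calculation may account for more than this (for example, provider-specific pricing tiers), so treat this as a sanity check rather than a replacement.

```python
def estimate_cost_usd(
    prompt_tokens: int,
    completion_tokens: int,
    input_cost_per_token: float,
    output_cost_per_token: float,
) -> float:
    """Estimate request cost as tokens times per-token price, rounded like the callback."""
    cost = prompt_tokens * input_cost_per_token + completion_tokens * output_cost_per_token
    return round(cost, 8)


# Hypothetical prices, for illustration only
HYPOTHETICAL_PRICING = {
    "input_cost_per_token": 1e-7,
    "output_cost_per_token": 3e-7,
}

cost = estimate_cost_usd(prompt_tokens=42, completion_tokens=18, **HYPOTHETICAL_PRICING)
print(f"estimated cost: ${cost:.8f}")
```

Comparing an estimate like this against the llm.usage.cost_usd attribute on a span is a quick way to spot a model string that LiteLLM has no pricing entry for.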
Step 4: Assemble the Instrumented Router
This module combines the three pieces: it initialises the tracer provider, registers the OpenInference auto-instrumentation for LiteLLM, registers the cost callback, and exposes a route_completion() function that wraps every call in a parent span.
The LiteLLMInstrumentor from openinference-instrumentation-litellm patches LiteLLM’s internal HTTP calls and writes additional span attributes (prompt content, finish reason, model name) automatically. The cost callback in Step 3 adds the cost attributes on top.
# filename: instrumented_router.py
from __future__ import annotations

import litellm
from litellm import Router
from opentelemetry import trace

from otel_setup import build_tracer_provider, get_tracer
from router_config import ROUTER_KWARGS
from cost_callback import register_callbacks


def build_router() -> tuple[Router, trace.Tracer]:
    """Initialise OTel, instrument LiteLLM, and return a configured Router."""
    build_tracer_provider()
    # Register the OpenInference auto-instrumentation
    try:
        from openinference.instrumentation.litellm import LiteLLMInstrumentor
        LiteLLMInstrumentor().instrument()
    except Exception as exc:  # noqa: BLE001
        print(f"[warn] LiteLLMInstrumentor not available: {exc}")
    register_callbacks()
    router = Router(**ROUTER_KWARGS)
    tracer = get_tracer()
    return router, tracer


def route_completion(
    router: Router,
    tracer: trace.Tracer,
    model: str,
    messages: list[dict],
    **kwargs,
):
    """Run a completion inside a parent OTel span named after the logical model group."""
    with tracer.start_as_current_span(f"router.completion/{model}") as span:
        span.set_attribute("router.model_group", model)
        span.set_attribute("router.message_count", len(messages))
        response = router.completion(model=model, messages=messages, **kwargs)
        return response
Step 5: Structural Verification (No API Key Required)
Before making any live calls, verify that the router initialises correctly and the OTel tracer is wired up. This block constructs the router without calling any external endpoint.
from instrumented_router import build_router, route_completion
from opentelemetry import trace
router, tracer = build_router()
# Confirm the router knows about our three model groups
model_names = {d["model_name"] for d in router.model_list}
print("Registered model groups:", sorted(model_names))
assert "fast" in model_names
assert "capable" in model_names
assert "local" in model_names
# Confirm the tracer is the one we registered
current_provider = trace.get_tracer_provider()
print("TracerProvider type:", type(current_provider).__name__)
assert "TracerProvider" in type(current_provider).__name__
print("structural_check_passed")
Step 6: Emit a Traced Span (No API Key Required)
This block emits a real OTel span with manually set cost attributes so you can see the console exporter output without needing a Mistral key. It sets the same attributes the callback writes, on a standalone provider whose console exporter writes into an in-memory buffer.
import io
from opentelemetry.sdk.resources import Resource, SERVICE_NAME
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import ConsoleSpanExporter, SimpleSpanProcessor

# ConsoleSpanExporter's default stream is bound to the original sys.stdout at
# import time, so reassigning sys.stdout later captures nothing. Pass out= instead.
buffer = io.StringIO()

# Use a standalone provider so this check does not depend on the global one
provider = TracerProvider(resource=Resource(attributes={SERVICE_NAME: "cost-trace-test"}))
provider.add_span_processor(SimpleSpanProcessor(ConsoleSpanExporter(out=buffer)))
tracer = provider.get_tracer("cost-trace-test")

with tracer.start_as_current_span("router.completion/fast") as span:
    span.set_attribute("router.model_group", "fast")
    span.set_attribute("llm.model_name", "mistral/mistral-small-latest")
    span.set_attribute("llm.token_count.prompt", 42)
    span.set_attribute("llm.token_count.completion", 18)
    span.set_attribute("llm.token_count.total", 60)
    span.set_attribute("llm.usage.cost_usd", 0.00000720)
    span.set_attribute("llm.latency_ms", 312.5)
    span.set_attribute("llm.finish_reason", "stop")

output = buffer.getvalue()
print(output[:2000])  # print first 2000 chars of the span JSON
assert "router.completion/fast" in output, "span name not found in output"
assert "llm.usage.cost_usd" in output, "cost attribute not found in output"
assert "llm.token_count.prompt" in output, "token attribute not found in output"
print("span_emission_verified")
Step 7: Live Call with Cost Tracing
This block makes a real completion call through the router. It requires MISTRAL_API_KEY to be set. The cost callback fires automatically after the response arrives and writes the cost attributes onto the active span.
import os
from instrumented_router import build_router, route_completion
# This block requires MISTRAL_API_KEY
router, tracer = build_router()
messages = [{"role": "user", "content": "Name the three primary colours. One word each."}]
response = route_completion(
    router=router,
    tracer=tracer,
    model="fast",
    messages=messages,
    max_tokens=30,
)
print("Model used:", response.model)
print("Reply:", response.choices[0].message.content)
print("Prompt tokens:", response.usage.prompt_tokens)
print("Completion tokens:", response.usage.completion_tokens)
print("Cost USD:", getattr(response, "_hidden_params", {}).get("response_cost", "see span"))
Verify it Works
Run the structural and span-emission checks (no API key needed) to confirm the full pipeline is wired correctly:
import io
from instrumented_router import build_router
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import ConsoleSpanExporter, SimpleSpanProcessor

# --- structural check ---
router, tracer = build_router()
model_names = {d["model_name"] for d in router.model_list}
assert {"fast", "capable", "local"} <= model_names, f"Missing groups: {model_names}"

# --- span attribute check ---
# build_router() has already set the global provider, and OpenTelemetry ignores a
# second set_tracer_provider() call, so use a standalone provider that writes to buf.
buf = io.StringIO()
provider2 = TracerProvider()
provider2.add_span_processor(SimpleSpanProcessor(ConsoleSpanExporter(out=buf)))
t2 = provider2.get_tracer("verify-service")
with t2.start_as_current_span("verify.span") as s:
    s.set_attribute("llm.usage.cost_usd", 0.000042)
    s.set_attribute("llm.token_count.total", 99)

out = buf.getvalue()
assert "verify.span" in out, "span name missing"
assert "llm.usage.cost_usd" in out, "cost attribute missing"
assert "llm.token_count.total" in out, "token count attribute missing"
print("all_checks_passed")
Sending Spans to Grafana Tempo
For production, replace the ConsoleSpanExporter in otel_setup.py with the OTLP HTTP exporter. A minimal docker-compose.yml for a local Tempo + Grafana stack is shown below for reference (not executed in this tutorial’s sandbox):
# docker-compose.yml (reference only — requires Docker)
services:
  tempo:
    image: grafana/tempo:latest
    command: ["-config.file=/etc/tempo.yaml"]
    volumes:
      - ./tempo.yaml:/etc/tempo.yaml
    ports:
      - "4318:4318" # OTLP HTTP
      - "3200:3200" # Tempo query API
  grafana:
    image: grafana/grafana:latest
    ports:
      - "3000:3000"
    environment:
      - GF_AUTH_ANONYMOUS_ENABLED=true
Once Tempo is running, update otel_setup.py:
# Illustration: swap ConsoleSpanExporter for OTLP in production
from opentelemetry.exporter.otlp.proto.http.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.trace.export import BatchSpanProcessor
exporter = OTLPSpanExporter(endpoint="http://localhost:4318/v1/traces")
provider.add_span_processor(BatchSpanProcessor(exporter))
In Grafana, add a Tempo datasource pointing at http://tempo:3200, then query by llm.usage.cost_usd or router.model_group in the TraceQL explorer. SigNoz and Honeycomb accept the same OTLP payload; only the endpoint URL and auth header differ.
Troubleshooting
ModuleNotFoundError: openinference.instrumentation.litellm — The package name on PyPI is openinference-instrumentation-litellm. Run uv pip install openinference-instrumentation-litellm and confirm with from importlib.metadata import version; print(version("openinference-instrumentation-litellm")).
Cost attribute is 0.0 on every span — LiteLLM computes response_cost only for models it has pricing data for. Check litellm.model_cost for your model string. For custom or local models, set litellm.model_cost["openai/local-model"] = {"input_cost_per_token": 0.0, "output_cost_per_token": 0.0} before constructing the router.
Spans appear in the console but not in Tempo — Confirm the OTLP endpoint is reachable: curl -s http://localhost:4318/v1/traces -X POST -H 'Content-Type: application/json' -d '{}' should return a 400 (bad payload), not a connection refused. If you switched to BatchSpanProcessor, spans are buffered before export, so also call provider.force_flush() or provider.shutdown() before process exit; otherwise a short-lived script can exit with spans still unsent.
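Router raises a connection error or litellm.exceptions.AuthenticationError on the local group — The local vLLM or Ollama endpoint either isn’t running (connection error) or is rejecting the placeholder API key (authentication error). Either start it and set LOCAL_MODEL_API_KEY, or remove the local entry from MODEL_LIST. The fallback chain only covers the fast -> capable path by default.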
LiteLLMInstrumentor emits duplicate spans — If you call LiteLLMInstrumentor().instrument() more than once (e.g. in a notebook that re-runs cells), call LiteLLMInstrumentor().uninstrument() first or guard with a module-level flag.
Callback fires but trace.get_current_span() returns a NonRecordingSpan — The callback runs outside the with tracer.start_as_current_span(...) context. Make sure route_completion() is the entry point rather than calling router.completion() directly.
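For the duplicate-span case above, a module-level flag is enough to make instrumentation idempotent. A minimal sketch, in which instrument_once is a hypothetical helper and the instrument argument stands in for a call such as LiteLLMInstrumentor().instrument:

```python
import threading

_instrumented = False
_lock = threading.Lock()


def instrument_once(instrument):
    """Call instrument() the first time only; return True if it ran."""
    global _instrumented
    with _lock:
        if _instrumented:
            return False
        instrument()
        _instrumented = True
        return True


# A list append stands in for the real instrumentation call
calls = []
first = instrument_once(lambda: calls.append("instrumented"))
second = instrument_once(lambda: calls.append("instrumented"))
print(first, second, len(calls))
```

In a notebook, the flag needs to live in an imported module rather than in the cell being re-run; otherwise re-executing the cell resets it to False.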
Next Steps
- Add a budget guard: accumulate llm.usage.cost_usd per user session in a Redis counter and raise litellm.BudgetExceededError when the threshold is crossed.
- Route by latency SLO: switch routing_strategy to latency-based-routing and feed the llm.latency_ms span attribute into a Grafana alert that pages when p95 exceeds your SLO.
- Instrument retries: LiteLLM also fires failure_callback and retry_callback. Attach span events (span.add_event("retry", {...})) to surface retry storms in your trace waterfall.
- Export to SigNoz: SigNoz OSS runs entirely on-premise and accepts the same OTLP payload. Replace the Tempo endpoint with http://otel-collector:4318/v1/traces and use SigNoz’s built-in cost dashboard to aggregate llm.usage.cost_usd across services.
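The budget guard from the first item can be prototyped before Redis is involved. The sketch below is an assumption-laden starting point: it keeps counters in an in-memory dict instead of Redis, and it raises a hypothetical SessionBudgetExceeded exception in place of litellm.BudgetExceededError so that it runs without LiteLLM installed. BudgetGuard and its record method are illustrative names.

```python
from collections import defaultdict


class SessionBudgetExceeded(Exception):
    """Hypothetical stand-in for litellm.BudgetExceededError."""


class BudgetGuard:
    def __init__(self, limit_usd):
        self.limit_usd = limit_usd
        self._spent = defaultdict(float)  # session_id -> accumulated USD

    def record(self, session_id, cost_usd):
        """Add a request's cost and raise once the session exceeds its limit."""
        self._spent[session_id] += cost_usd
        total = self._spent[session_id]
        if total > self.limit_usd:
            raise SessionBudgetExceeded(
                f"session {session_id!r} spent ${total:.6f}, limit ${self.limit_usd:.6f}"
            )
        return total


guard = BudgetGuard(limit_usd=0.001)
guard.record("session-a", 0.0006)      # under the limit
try:
    guard.record("session-a", 0.0006)  # pushes the session over the limit
    exceeded = False
except SessionBudgetExceeded:
    exceeded = True
print("budget exceeded:", exceeded)
```

A natural place to call record() is the success callback, using the same value that is written to llm.usage.cost_usd. An in-memory dict only works within a single process, which is the reason the list above suggests Redis for anything shared.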
FAQ
How does the cost callback attach cost data to OpenTelemetry spans?
The on_llm_success callback retrieves the currently active OTel span using trace.get_current_span() and writes cost, token counts, model name, and latency as span attributes using OpenInference semantic conventions. LiteLLM computes response_cost in USD automatically from its pricing metadata, which the callback then rounds and sets as llm.usage.cost_usd on the span.
What happens if a model endpoint fails in the router?
The router follows the fallback chain defined in FALLBACKS. For example, if the fast model group fails, the router automatically retries with the capable group. The num_retries and allowed_fails parameters control retry behavior and how many failures trigger a fallback.
Can the same span structure work with different observability backends?
Yes. The span attributes and structure remain identical regardless of exporter. You can swap ConsoleSpanExporter for OTLPSpanExporter pointed at Grafana Tempo, SigNoz, Honeycomb, or Datadog by changing only the exporter endpoint and auth headers; the span attributes like llm.usage.cost_usd index identically across all backends.
Why does the cost attribute show 0.0 for some models?
LiteLLM computes response_cost only for models it has pricing data for in its litellm.model_cost dictionary. For custom or local models without built-in pricing, you must manually set the cost data: litellm.model_cost["openai/local-model"] = {"input_cost_per_token": 0.0, "output_cost_per_token": 0.0} before constructing the router.
What is the difference between SimpleSpanProcessor and BatchSpanProcessor?
SimpleSpanProcessor flushes each span synchronously as it closes, suitable for local development and verification. BatchSpanProcessor collects spans in batches before sending them to the exporter, reducing overhead in production. For production use with Grafana Tempo or other OTLP receivers, swap to BatchSpanProcessor.
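The difference is easier to see in a toy model. The two classes below are simplified stand-ins written for this explanation, not the OpenTelemetry SDK classes: the real BatchSpanProcessor also exports on a timer from a background thread, which this sketch leaves out.

```python
class SimpleProcessor:
    """Exports each span immediately when it ends."""

    def __init__(self, export):
        self._export = export

    def on_end(self, span):
        self._export([span])


class BatchProcessor:
    """Buffers spans and exports when the batch is full or on force_flush()."""

    def __init__(self, export, max_batch_size=3):
        self._export = export
        self._max = max_batch_size
        self._buffer = []

    def on_end(self, span):
        self._buffer.append(span)
        if len(self._buffer) >= self._max:
            self.force_flush()

    def force_flush(self):
        if self._buffer:
            self._export(self._buffer)
            self._buffer = []  # start a fresh batch; the exported list is untouched


simple_exports, batch_exports = [], []
simple = SimpleProcessor(simple_exports.append)
batch = BatchProcessor(batch_exports.append, max_batch_size=3)

for i in range(4):
    simple.on_end(f"span-{i}")
    batch.on_end(f"span-{i}")

exports_before_flush = len(batch_exports)  # the fourth span is still buffered
batch.force_flush()
print(len(simple_exports), exports_before_flush, len(batch_exports))
```

The span left in the buffer until force_flush() is the same situation a short-lived script runs into when it exits before a batch processor has exported.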