# LiteLLM Router with Per-Request Cost Tracing via OpenTelemetry

> Build a LiteLLM router that dispatches requests across multiple model backends and records per-request token cost and latency as OpenTelemetry spans, exportable to any OTLP-compatible backend. You get routing logic, cost attribution, and a queryable trace in under 200 lines of Python.

- Canonical URL: https://agentry.press/tutorial/litellm-router-with-per-request-cost-tracing-via-opentelemetry/
- Type: Tutorial
- Published: 2026-06-04
- By: agentry
- Tags: litellm, opentelemetry, cost-tracing, routing, observability

---

## Why this matters

Running agents across multiple model providers is now the default, not the exception. Teams mix a fast cheap model for classification, a capable cloud model for reasoning, and a self-hosted endpoint for data-residency requirements. LiteLLM's router handles the dispatch, but without instrumentation you're flying blind: you can't tell whether a latency spike comes from the remote provider, a cold vLLM instance, or a retry storm.

OpenInference (an OpenTelemetry instrumentation layer for LLM calls) attaches token counts, model names, and finish reasons as span attributes on every call. Combine that with LiteLLM's built-in cost metadata and you can compute per-request dollar cost inside the span itself and ship it to any OTLP receiver, whether that's a local console exporter during development or Grafana Tempo, SigNoz, or Honeycomb in production.

This tutorial wires all three pieces together: a LiteLLM `Router` with model groups linked by a fallback chain, an OpenTelemetry `TracerProvider` with a synchronous console exporter for local verification, and a cost-attribution callback that writes `llm.usage.cost_usd` onto the active span. The same span structure indexes identically on Datadog or Honeycomb; only the exporter endpoint changes.

> [!PULLQUOTE]
> The same span structure indexes identically on Datadog or Honeycomb; only the exporter endpoint changes.

## Prerequisites

- Python 3.11 or 3.12
- Basic familiarity with OpenTelemetry concepts (tracer, span, exporter)
- A Mistral API key (for the live-call steps; structural steps run without one)
- Optional: a running vLLM endpoint or Ollama instance for the local-model leg
- Optional: Grafana Tempo or any OTLP/HTTP receiver for production export

## Setup

Install the required packages. `opentelemetry-sdk` provides the tracer and console exporter. `openinference-instrumentation-litellm` is the OpenInference auto-instrumentation package for LiteLLM.

```bash
uv pip install "litellm>=1.40.0" \
  opentelemetry-sdk \
  opentelemetry-exporter-otlp-proto-http \
  openinference-instrumentation-litellm \
  openinference-semantic-conventions
```

Verify the installs:

```python
from importlib.metadata import version
for pkg in [
    "litellm",
    "opentelemetry-sdk",
    "openinference-instrumentation-litellm",
    "openinference-semantic-conventions",
]:
    print(f"{pkg}: {version(pkg)}")
```

## Step 1: Configure the LiteLLM Router

The router holds a list of model deployments. Each entry maps a logical name (the `model_name` you call) to a real provider endpoint. This configuration sets no per-deployment priority: each group has a single deployment, so the `simple-shuffle` strategy has nothing to choose between. Ordering across groups comes from the `fallbacks` list, which tells the router which logical group to try when the primary group fails.

The configuration below defines three groups:

- `fast` points at `mistral/mistral-small-latest` (low latency, low cost)
- `capable` points at `mistral/mistral-large-latest` (higher quality, higher cost)
- `local` points at an OpenAI-compatible endpoint on `localhost:8001` (vLLM or Ollama)

In a real deployment you'd add your vLLM base URL and API key to the `local` entry. For this tutorial the local entry is present in the config but has no fallback: the chain below only covers `fast` -> `capable`, so a call to `local` fails if the endpoint is unreachable.

```python
# filename: router_config.py
import os

MODEL_LIST = [
    {
        "model_name": "fast",
        "litellm_params": {
            "model": "mistral/mistral-small-latest",
            "api_key": os.environ.get("MISTRAL_API_KEY", "placeholder"),
        },
    },
    {
        "model_name": "capable",
        "litellm_params": {
            "model": "mistral/mistral-large-latest",
            "api_key": os.environ.get("MISTRAL_API_KEY", "placeholder"),
        },
    },
    {
        "model_name": "local",
        "litellm_params": {
            "model": "openai/local-model",
            "api_base": os.environ.get("LOCAL_MODEL_BASE_URL", "http://localhost:8001/v1"),
            "api_key": os.environ.get("LOCAL_MODEL_API_KEY", "placeholder"),
        },
    },
]

# Fallback chain: if the fast group fails, retry on capable.
# The local group has no fallback entry.
FALLBACKS = [
    {"fast": ["capable"]},
]

ROUTER_KWARGS = {
    "model_list": MODEL_LIST,
    "fallbacks": FALLBACKS,
    "num_retries": 2,
    "retry_after": 1,
    "allowed_fails": 1,
    "routing_strategy": "simple-shuffle",
}
```
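To make the shape of `FALLBACKS` concrete, here is a toy lookup that returns which groups would be attempted for a given logical name. It is a simplified model for reading the config, not LiteLLM's internal implementation; the function name `attempt_order` is hypothetical, and it assumes fallbacks are not followed transitively.

```python
# Toy model of the fallback lookup; not LiteLLM's internal implementation.
FALLBACKS = [
    {"fast": ["capable"]},
]


def attempt_order(group: str, fallbacks: list[dict[str, list[str]]]) -> list[str]:
    """Return the groups tried for `group`: the primary first, then its listed fallbacks."""
    order = [group]
    for mapping in fallbacks:
        # Each mapping has the form {primary_group: [fallback_group, ...]}
        order.extend(mapping.get(group, []))
    return order


print(attempt_order("fast", FALLBACKS))   # ['fast', 'capable']
print(attempt_order("local", FALLBACKS))  # ['local']
```

The second line of output shows why a failure on `local` is not rerouted: no mapping names it as a primary group.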

## Step 2: Build the OpenTelemetry TracerProvider

The `SimpleSpanProcessor` flushes each span synchronously as it closes. This is the right choice for local development and for the verification step later in this tutorial. In production, swap it for `BatchSpanProcessor` with an OTLP exporter pointed at your collector.

```python
# filename: otel_setup.py
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import SimpleSpanProcessor, ConsoleSpanExporter
from opentelemetry.sdk.resources import Resource, SERVICE_NAME


def build_tracer_provider(service_name: str = "litellm-router") -> TracerProvider:
    resource = Resource(attributes={SERVICE_NAME: service_name})
    provider = TracerProvider(resource=resource)
    provider.add_span_processor(SimpleSpanProcessor(ConsoleSpanExporter()))
    trace.set_tracer_provider(provider)
    return provider


def get_tracer(name: str = "litellm-router") -> trace.Tracer:
    return trace.get_tracer(name)
```

To send spans to Grafana Tempo or any OTLP/HTTP endpoint instead of the console, replace `ConsoleSpanExporter()` with:

```python
# Illustration only — not executed
from opentelemetry.exporter.otlp.proto.http.trace_exporter import OTLPSpanExporter

exporter = OTLPSpanExporter(
    endpoint="http://tempo:4318/v1/traces",  # or your collector URL
    headers={"Authorization": "Bearer YOUR_TOKEN"},
)
```

The span attributes written in Step 3 are identical regardless of which exporter you use.

## Step 3: Wire the Cost-Attribution Callback

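LiteLLM invokes each function in `litellm.success_callback` after a successful completion. A custom callback receives four arguments: a `kwargs` dict, the `response_obj`, and the call's `start_time` and `end_time` as `datetime` objects. The `response_obj` carries `usage` (prompt tokens, completion tokens), and LiteLLM places the cost it computed, in USD, under `kwargs["response_cost"]`.

The callback below retrieves the currently active OTel span and writes cost, token counts, model name, and latency as span attributes. The `llm.token_count.*` and `llm.model_name` keys follow the OpenInference semantic conventions for LLM spans; `llm.usage.cost_usd`, `llm.latency_ms`, and `llm.finish_reason` are names chosen for this tutorial.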

```python
# filename: cost_callback.py
import litellm
from opentelemetry import trace


def on_llm_success(kwargs: dict, response_obj, start_time, end_time) -> None:
    """LiteLLM success callback that writes cost + usage onto the active OTel span."""
    span = trace.get_current_span()
    if span is None or not span.is_recording():
        return

    usage = getattr(response_obj, "usage", None)
    if usage:
        span.set_attribute("llm.token_count.prompt", getattr(usage, "prompt_tokens", 0))
        span.set_attribute("llm.token_count.completion", getattr(usage, "completion_tokens", 0))
        span.set_attribute("llm.token_count.total", getattr(usage, "total_tokens", 0))

    cost = kwargs.get("response_cost", 0.0) or 0.0
    span.set_attribute("llm.usage.cost_usd", round(cost, 8))

    model = kwargs.get("model", "unknown")
    span.set_attribute("llm.model_name", model)

    latency_ms = (end_time - start_time).total_seconds() * 1000
    span.set_attribute("llm.latency_ms", round(latency_ms, 2))

    finish_reason = None
    choices = getattr(response_obj, "choices", [])
    if choices:
        finish_reason = getattr(choices[0], "finish_reason", None)
    if finish_reason:
        span.set_attribute("llm.finish_reason", finish_reason)


def register_callbacks() -> None:
    litellm.success_callback = [on_llm_success]
```
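The arithmetic behind the cost and latency attributes can be sketched in isolation. The example below assumes cost is simply token counts multiplied by per-token prices, using the `input_cost_per_token` and `output_cost_per_token` keys that `litellm.model_cost` entries carry. The prices are hypothetical values chosen for illustration, and real pricing entries may include further fields that this sketch ignores.

```python
from datetime import datetime, timedelta

# Hypothetical per-token prices in USD, chosen only for illustration.
PRICING = {"input_cost_per_token": 1.2e-07, "output_cost_per_token": 1.2e-07}


def estimate_cost_usd(prompt_tokens: int, completion_tokens: int, pricing: dict) -> float:
    """Token counts times per-token prices, rounded to 8 places like the callback."""
    cost = (
        prompt_tokens * pricing["input_cost_per_token"]
        + completion_tokens * pricing["output_cost_per_token"]
    )
    return round(cost, 8)


def latency_ms(start_time: datetime, end_time: datetime) -> float:
    """Same arithmetic as the callback: timedelta converted to milliseconds."""
    return round((end_time - start_time).total_seconds() * 1000, 2)


start = datetime(2026, 1, 1, 12, 0, 0)
end = start + timedelta(milliseconds=312.5)
print(estimate_cost_usd(42, 18, PRICING))
print(latency_ms(start, end))  # 312.5
```

Rounding to eight decimal places keeps sub-cent costs readable without discarding meaningful digits at typical per-token prices.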

## Step 4: Assemble the Instrumented Router

This module combines the three pieces: it initialises the tracer provider, registers the OpenInference auto-instrumentation for LiteLLM, registers the cost callback, and exposes a `route_completion()` function that wraps every call in a parent span.

The `LiteLLMInstrumentor` from `openinference-instrumentation-litellm` wraps LiteLLM's completion functions and writes additional span attributes (prompt content, finish reason, model name) automatically. The cost callback in Step 3 adds the cost attributes on top.

```python
# filename: instrumented_router.py
from __future__ import annotations

import litellm
from litellm import Router
from opentelemetry import trace

from otel_setup import build_tracer_provider, get_tracer
from router_config import ROUTER_KWARGS
from cost_callback import register_callbacks


def build_router() -> tuple[Router, trace.Tracer]:
    """Initialise OTel, instrument LiteLLM, and return a configured Router."""
    build_tracer_provider()

    # Register the OpenInference auto-instrumentation
    try:
        from openinference.instrumentation.litellm import LiteLLMInstrumentor
        LiteLLMInstrumentor().instrument()
    except Exception as exc:  # noqa: BLE001
        print(f"[warn] LiteLLMInstrumentor not available: {exc}")

    register_callbacks()

    router = Router(**ROUTER_KWARGS)
    tracer = get_tracer()
    return router, tracer


def route_completion(
    router: Router,
    tracer: trace.Tracer,
    model: str,
    messages: list[dict],
    **kwargs,
):
    """Run a completion inside a parent OTel span named after the logical model group."""
    with tracer.start_as_current_span(f"router.completion/{model}") as span:
        span.set_attribute("router.model_group", model)
        span.set_attribute("router.message_count", len(messages))
        response = router.completion(model=model, messages=messages, **kwargs)
        return response
```

## Step 5: Structural Verification (No API Key Required)

Before making any live calls, verify that the router initialises correctly and the OTel tracer is wired up. This block constructs the router without calling any external endpoint.

```python
from instrumented_router import build_router, route_completion
from opentelemetry import trace

router, tracer = build_router()

# Confirm the router knows about our three model groups
model_names = {d["model_name"] for d in router.model_list}
print("Registered model groups:", sorted(model_names))
assert "fast" in model_names
assert "capable" in model_names
assert "local" in model_names

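# Confirm the global provider is the SDK TracerProvider we registered.
# A class-name check would also match the default ProxyTracerProvider,
# so use isinstance against the SDK class instead.
from opentelemetry.sdk.trace import TracerProvider

current_provider = trace.get_tracer_provider()
print("TracerProvider type:", type(current_provider).__name__)
assert isinstance(current_provider, TracerProvider)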

print("structural_check_passed")
```

## Step 6: Emit a Traced Span (No API Key Required)

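This block emits a real OTel span with manually set cost attributes so you can see the console exporter output without needing a Mistral key. It sets the same attribute names the callback writes.

```python
import io
from opentelemetry.sdk.resources import Resource, SERVICE_NAME
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import SimpleSpanProcessor, ConsoleSpanExporter

# ConsoleSpanExporter binds its default `out` stream when the SDK is imported,
# so swapping sys.stdout afterwards captures nothing. Pass the buffer explicitly.
buffer = io.StringIO()

# Use a local provider rather than the global one: the global TracerProvider
# can only be set once per process, and build_router() may already have set it.
provider = TracerProvider(resource=Resource(attributes={SERVICE_NAME: "cost-trace-test"}))
provider.add_span_processor(SimpleSpanProcessor(ConsoleSpanExporter(out=buffer)))
tracer = provider.get_tracer("cost-trace-test")

with tracer.start_as_current_span("router.completion/fast") as span:
    span.set_attribute("router.model_group", "fast")
    span.set_attribute("llm.model_name", "mistral/mistral-small-latest")
    span.set_attribute("llm.token_count.prompt", 42)
    span.set_attribute("llm.token_count.completion", 18)
    span.set_attribute("llm.token_count.total", 60)
    span.set_attribute("llm.usage.cost_usd", 0.00000720)
    span.set_attribute("llm.latency_ms", 312.5)
    span.set_attribute("llm.finish_reason", "stop")

# SimpleSpanProcessor exported the span synchronously when the block exited
output = buffer.getvalue()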

print(output[:2000])  # print first 2000 chars of the span JSON
assert "router.completion/fast" in output, "span name not found in output"
assert "llm.usage.cost_usd" in output, "cost attribute not found in output"
assert "llm.token_count.prompt" in output, "token attribute not found in output"
print("span_emission_verified")
```

## Step 7: Live Call with Cost Tracing

This block makes a real completion call through the router. It requires `MISTRAL_API_KEY` to be set. The cost callback fires automatically after the response arrives and writes the cost attributes onto the active span.

```python
import os
from instrumented_router import build_router, route_completion

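# This block requires MISTRAL_API_KEY; fail early with a clear message if it is missing
if not os.environ.get("MISTRAL_API_KEY"):
    raise SystemExit("Set MISTRAL_API_KEY before running the live call")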
router, tracer = build_router()

messages = [{"role": "user", "content": "Name the three primary colours. One word each."}]

response = route_completion(
    router=router,
    tracer=tracer,
    model="fast",
    messages=messages,
    max_tokens=30,
)

print("Model used:", response.model)
print("Reply:", response.choices[0].message.content)
print("Prompt tokens:", response.usage.prompt_tokens)
print("Completion tokens:", response.usage.completion_tokens)
print("Cost USD:", getattr(response, "_hidden_params", {}).get("response_cost", "see span"))
```

## Verify it Works

Run the structural and span-emission checks (no API key needed) to confirm the full pipeline is wired correctly:

```python
import io
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import SimpleSpanProcessor, ConsoleSpanExporter
from instrumented_router import build_router

# --- structural check ---
router, tracer = build_router()
model_names = {d["model_name"] for d in router.model_list}
assert {"fast", "capable", "local"} <= model_names, f"Missing groups: {model_names}"

# --- span attribute check ---
# build_router() has already set the global provider, which cannot be replaced,
# and the console exporter does not follow a later sys.stdout swap. Use a local
# provider whose exporter writes straight into a buffer.
buf = io.StringIO()
provider2 = TracerProvider()
provider2.add_span_processor(SimpleSpanProcessor(ConsoleSpanExporter(out=buf)))
t2 = provider2.get_tracer("verify-service")
with t2.start_as_current_span("verify.span") as s:
    s.set_attribute("llm.usage.cost_usd", 0.000042)
    s.set_attribute("llm.token_count.total", 99)

out = buf.getvalue()
assert "verify.span" in out, "span name missing"
assert "llm.usage.cost_usd" in out, "cost attribute missing"
assert "llm.token_count.total" in out, "token count attribute missing"

print("all_checks_passed")
```

## Sending Spans to Grafana Tempo

For production, replace the `ConsoleSpanExporter` in `otel_setup.py` with the OTLP HTTP exporter. A minimal `docker-compose.yml` for a local Tempo + Grafana stack is shown below for reference (not executed in this tutorial's sandbox):

```yaml
# docker-compose.yml (reference only — requires Docker)
services:
  tempo:
    image: grafana/tempo:latest
    command: ["-config.file=/etc/tempo.yaml"]
    volumes:
      - ./tempo.yaml:/etc/tempo.yaml
    ports:
      - "4318:4318"   # OTLP HTTP
      - "3200:3200"   # Tempo query API
  grafana:
    image: grafana/grafana:latest
    ports:
      - "3000:3000"
    environment:
      - GF_AUTH_ANONYMOUS_ENABLED=true
```

Once Tempo is running, update `otel_setup.py`:

```python
# Illustration: swap ConsoleSpanExporter for OTLP in production
from opentelemetry.exporter.otlp.proto.http.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.trace.export import BatchSpanProcessor

exporter = OTLPSpanExporter(endpoint="http://localhost:4318/v1/traces")
provider.add_span_processor(BatchSpanProcessor(exporter))
```

In Grafana, add a Tempo datasource pointing at `http://tempo:3200`, then filter on `llm.usage.cost_usd` or `router.model_group` in the TraceQL explorer, for example `{ span.router.model_group = "fast" }`. SigNoz and Honeycomb accept the same OTLP payload; only the endpoint URL and auth header differ.

## Troubleshooting

**`ModuleNotFoundError: openinference.instrumentation.litellm`** — The package name on PyPI is `openinference-instrumentation-litellm`. Run `uv pip install openinference-instrumentation-litellm` and confirm with `from importlib.metadata import version; print(version("openinference-instrumentation-litellm"))`.

**Cost attribute is `0.0` on every span** — LiteLLM computes `response_cost` only for models it has pricing data for. Check `litellm.model_cost` for your model string. For custom or local models, set `litellm.model_cost["openai/local-model"] = {"input_cost_per_token": 0.0, "output_cost_per_token": 0.0}` before constructing the router.

**Spans appear in the console but not in Tempo** — Confirm the OTLP endpoint is reachable: `curl -s http://localhost:4318/v1/traces -X POST -H 'Content-Type: application/json' -d '{}'` should return an HTTP response (the status code depends on the receiver), not a connection refused. If you switched from `SimpleSpanProcessor` to `BatchSpanProcessor`, also call `provider.force_flush()` or `provider.shutdown()` before the process exits, because spans still waiting in the batch queue are otherwise lost.

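**Router raises `litellm.exceptions.AuthenticationError` on the `local` group** — The endpoint is reachable but rejected the key in `LOCAL_MODEL_API_KEY`; if the local vLLM or Ollama endpoint isn't running at all, expect a connection error instead. Either start the endpoint with a matching key or remove the `local` entry from `MODEL_LIST`. The fallback chain only covers the `fast` -> `capable` path by default, so failures on `local` are not rerouted.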

**`LiteLLMInstrumentor` emits duplicate spans** — If you call `LiteLLMInstrumentor().instrument()` more than once (e.g. in a notebook that re-runs cells), call `LiteLLMInstrumentor().uninstrument()` first or guard with a module-level flag.
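The module-level flag mentioned above can be sketched as a small helper. The name `instrument_once` is hypothetical, and the helper takes the instrumentation call as an argument so the sketch stays independent of any particular instrumentor. Keep it in an imported module rather than a notebook cell, since re-running a cell that defines the flag would reset it.

```python
from typing import Callable

_instrumented = False  # module-level flag: set after the first successful call


def instrument_once(instrument: Callable[[], None]) -> bool:
    """Run `instrument` only on the first call; return True if it ran."""
    global _instrumented
    if _instrumented:
        return False
    instrument()
    _instrumented = True
    return True


# Demonstration with a stand-in for the real instrumentation call
calls = []
print(instrument_once(lambda: calls.append("instrumented")))  # True
print(instrument_once(lambda: calls.append("instrumented")))  # False
print(len(calls))  # 1
```

In `build_router()` this would be called as `instrument_once(LiteLLMInstrumentor().instrument)`.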

**Callback fires but `trace.get_current_span()` returns a `NonRecordingSpan`** — The callback runs outside the `with tracer.start_as_current_span(...)` context. Make sure `route_completion()` is the entry point rather than calling `router.completion()` directly.
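Another possible cause, stated here as an assumption to check against your LiteLLM version: success callbacks may be invoked on a worker thread, and OpenTelemetry's active span is context-local, so the parent span is not visible there. A workaround that does not depend on the callback is to copy usage and cost from the response onto the span inside `route_completion()`. The sketch below uses a stub span and a fake response so it runs without any dependencies; `annotate_span` and `StubSpan` are hypothetical names, and reading `response_cost` from `_hidden_params` mirrors the lookup used in Step 7.

```python
from types import SimpleNamespace


def annotate_span(span, response) -> None:
    """Copy usage and cost from a completion response onto `span`."""
    usage = getattr(response, "usage", None)
    if usage is not None:
        span.set_attribute("llm.token_count.prompt", getattr(usage, "prompt_tokens", 0))
        span.set_attribute("llm.token_count.completion", getattr(usage, "completion_tokens", 0))
    hidden = getattr(response, "_hidden_params", None) or {}
    cost = hidden.get("response_cost") or 0.0
    span.set_attribute("llm.usage.cost_usd", round(cost, 8))


class StubSpan:
    """Stand-in exposing the same set_attribute(key, value) method as an OTel span."""

    def __init__(self) -> None:
        self.attributes: dict = {}

    def set_attribute(self, key: str, value) -> None:
        self.attributes[key] = value


# Fake response shaped like the fields the function reads
fake_response = SimpleNamespace(
    usage=SimpleNamespace(prompt_tokens=42, completion_tokens=18),
    _hidden_params={"response_cost": 7.2e-06},
)
span = StubSpan()
annotate_span(span, fake_response)
print(span.attributes)
```

In `route_completion()`, the call would go right after `router.completion(...)` returns, while the parent span is still open.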

## Next Steps

- **Add a budget guard**: accumulate `llm.usage.cost_usd` per user session in a Redis counter and raise `litellm.BudgetExceededError` when the threshold is crossed.
- **Route by latency SLO**: switch `routing_strategy` to `latency-based-routing` and feed the `llm.latency_ms` span attribute into a Grafana alert that pages when p95 exceeds your SLO.
- **Instrument retries**: LiteLLM also fires `failure_callback` when a call fails. Attach span events (`span.add_event("retry", {...})`) from it to surface retry storms in your trace waterfall.
- **Export to SigNoz**: SigNoz OSS runs entirely on-premise and accepts the same OTLP payload. Replace the Tempo endpoint with `http://otel-collector:4318/v1/traces` and use SigNoz's built-in cost dashboard to aggregate `llm.usage.cost_usd` across services.
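The budget guard from the first bullet can be sketched with an in-memory counter standing in for Redis. Everything here is illustrative: `SessionBudget` and `BudgetExceeded` are hypothetical names, and a production version would need a shared store and atomic increments.

```python
from collections import defaultdict


class BudgetExceeded(Exception):
    """Hypothetical local exception; swap in your own error type."""


class SessionBudget:
    def __init__(self, limit_usd: float) -> None:
        self.limit_usd = limit_usd
        self.spent = defaultdict(float)  # session_id -> accumulated USD

    def record(self, session_id: str, cost_usd: float) -> float:
        """Add one request's cost and raise once the session passes its limit."""
        self.spent[session_id] += cost_usd
        if self.spent[session_id] > self.limit_usd:
            raise BudgetExceeded(f"{session_id} spent {self.spent[session_id]:.6f} USD")
        return self.spent[session_id]


budget = SessionBudget(limit_usd=0.01)
budget.record("session-a", 0.004)
budget.record("session-a", 0.004)
try:
    budget.record("session-a", 0.004)  # third request pushes the total past 0.01
except BudgetExceeded as exc:
    print("blocked:", exc)
```

Feeding `record()` the same value that goes into `llm.usage.cost_usd` keeps the guard and the traces in agreement.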

## FAQ

### How does the cost callback attach cost data to OpenTelemetry spans?

The `on_llm_success` callback retrieves the currently active OTel span using `trace.get_current_span()` and writes cost, token counts, model name, and latency as span attributes. Token counts and model name use OpenInference semantic-convention keys; the cost and latency keys are specific to this tutorial. LiteLLM computes `response_cost` in USD from its pricing metadata and passes it in `kwargs`; the callback rounds it and sets it as `llm.usage.cost_usd` on the span.

### What happens if a model endpoint fails in the router?

The router follows the fallback chain defined in `FALLBACKS`. For example, if the `fast` model group fails, the router automatically retries with the `capable` group. `num_retries` sets how many times a failed call is retried, and `allowed_fails` sets how many failures a deployment may accumulate before the router takes it out of rotation for a cooldown period.

### Can the same span structure work with different observability backends?

Yes. The span attributes and structure remain identical regardless of exporter. You can swap `ConsoleSpanExporter` for `OTLPSpanExporter` pointed at Grafana Tempo, SigNoz, Honeycomb, or Datadog by changing only the exporter endpoint and auth headers; the span attributes like `llm.usage.cost_usd` index identically across all backends.

### Why does the cost attribute show 0.0 for some models?

LiteLLM computes `response_cost` only for models it has pricing data for in its `litellm.model_cost` dictionary. For custom or local models without built-in pricing, you must manually set the cost data: `litellm.model_cost["openai/local-model"] = {"input_cost_per_token": 0.0, "output_cost_per_token": 0.0}` before constructing the router.

### What is the difference between SimpleSpanProcessor and BatchSpanProcessor?

SimpleSpanProcessor flushes each span synchronously as it closes, suitable for local development and verification. BatchSpanProcessor collects spans in batches before sending them to the exporter, reducing overhead in production. For production use with Grafana Tempo or other OTLP receivers, swap to BatchSpanProcessor.
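The difference is easiest to see in a toy model. The two classes below are simplified stand-ins, not the SDK classes: the real `BatchSpanProcessor` also exports on a timer from a background thread, which this sketch omits. Spans are represented as plain strings.

```python
class SimpleProcessor:
    """Exports every span immediately when it ends."""

    def __init__(self, export) -> None:
        self.export = export

    def on_end(self, span: str) -> None:
        self.export([span])


class BatchProcessor:
    """Buffers spans and exports them in groups, or on an explicit flush."""

    def __init__(self, export, max_batch_size: int = 3) -> None:
        self.export = export
        self.max_batch_size = max_batch_size
        self.queue: list[str] = []

    def on_end(self, span: str) -> None:
        self.queue.append(span)
        if len(self.queue) >= self.max_batch_size:
            self.force_flush()

    def force_flush(self) -> None:
        if self.queue:
            self.export(list(self.queue))
            self.queue.clear()


simple_calls, batch_calls = [], []
simple = SimpleProcessor(simple_calls.append)
batch = BatchProcessor(batch_calls.append, max_batch_size=3)
for name in ["a", "b", "c", "d"]:
    simple.on_end(name)
    batch.on_end(name)

print(len(simple_calls))  # 4
print(batch_calls)        # [['a', 'b', 'c']]
batch.force_flush()       # without this, span 'd' would never be exported
print(batch_calls)        # [['a', 'b', 'c'], ['d']]
```

The stranded span `d` is the reason a batching setup needs a flush or shutdown before the process exits.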
