# Cross-Cloud LLM Routing with LiteLLM and Per-Request Cost Tracing

> Build a LiteLLM-based router that dispatches LLM requests across two providers based on cost and latency thresholds, then attach OpenTelemetry middleware that emits per-request cost and routing-decision spans to a local console exporter you can swap for any OTLP backend.

- Canonical URL: https://agentry.press/tutorial/cross-cloud-llm-routing-with-litellm-and-per-request-cost-tracing/
- Type: Tutorial
- Published: 2026-05-31
- By: agentry
- Tags: litellm, opentelemetry, llm-routing, cost-tracing, observability

---

## Why this matters

Running LLM workloads on a single provider creates a single point of failure and a cost ceiling. Teams operating agent pipelines at scale increasingly split traffic: cheap, fast models for classification and routing decisions; more capable models for generation. LiteLLM's router makes that split programmable, but without per-request cost attribution you're flying blind. You can't tell whether a cost or latency spike comes from a slow provider, a cache miss, or traffic shifting to a larger model that is both pricier per token and slower to respond.

OpenTelemetry's vendor-neutral span model is the right instrument here. A span that carries `llm.cost_usd`, `llm.provider`, and `llm.routing_strategy` as attributes survives a backend migration intact. The same span structure indexes the same way on Datadog or Honeycomb; only the exporter endpoint changes.

This tutorial wires those two pieces together in a fully local setup that runs without API keys, so you can validate the plumbing before pointing it at real providers.

## Prerequisites

- Python 3.11 or 3.12
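- API keys for two LLM providers, needed only when you point the router at real endpoints (OpenAI and Anthropic are used as examples; the router config accepts any LiteLLM-supported provider). The mock backend in Step 4 needs none.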
- Basic familiarity with HTTP proxies and OpenTelemetry concepts (traces, spans, exporters)
- No Docker required for the core tutorial; the OTel collector section notes where Docker would add a Grafana Tempo backend

## Setup

Install LiteLLM and the OpenTelemetry SDK packages.

```bash
uv pip install "litellm>=1.40.0" opentelemetry-sdk opentelemetry-exporter-otlp-proto-grpc opentelemetry-api
```

Verify the installs:

```python
from importlib.metadata import version
for pkg in ["litellm", "opentelemetry-sdk", "opentelemetry-api"]:
    print(f"{pkg}: {version(pkg)}")
print("imports_ok")
```

## Step 1: Define the routing configuration

LiteLLM's `Router` accepts a list of model deployments with per-deployment metadata. The `rpm` (requests per minute) and `tpm` (tokens per minute) limits drive the built-in load balancer. You'll add two custom fields, `cost_per_1k_input_tokens` and `cost_per_1k_output_tokens`, to each deployment's `litellm_params` dict; the OTel middleware in Step 2 looks them up in the `litellm_params` that LiteLLM passes to its logging callbacks.

The `routing_strategy` argument controls how the router picks among healthy deployments that share a `model_name`. `latency-based-routing` tracks recent response latency per deployment and prefers the fastest one. The cost-aware logic in Step 3 is not a `routing_strategy` value; it is a thin dispatcher that chooses a `model_name` before calling the router.

```python
# filename: router_config.py
from dataclasses import dataclass, field
from typing import List, Dict, Any


@dataclass
class DeploymentConfig:
    model_name: str
    litellm_params: Dict[str, Any]
    rpm: int = 60
    tpm: int = 100_000


# Two hypothetical deployments. In production, replace model/api_key/api_base
# with your actual provider credentials.
DEPLOYMENTS: List[Dict[str, Any]] = [
    {
        "model_name": "fast-model",
        "litellm_params": {
            "model": "openai/gpt-4o-mini",
            "api_key": "os.environ/OPENAI_API_KEY",
            # Cost metadata read by the OTel middleware
            "cost_per_1k_input_tokens": 0.00015,
            "cost_per_1k_output_tokens": 0.00060,
        },
        "rpm": 500,
        "tpm": 200_000,
    },
    {
        "model_name": "capable-model",
        "litellm_params": {
            "model": "anthropic/claude-3-5-haiku-20241022",
            "api_key": "os.environ/ANTHROPIC_API_KEY",
            "cost_per_1k_input_tokens": 0.00080,
            "cost_per_1k_output_tokens": 0.00400,
        },
        "rpm": 100,
        "tpm": 100_000,
    },
]

# Thresholds used by the cost-aware routing logic in Step 3
COST_THRESHOLD_USD = 0.001   # estimates below this go to fast-model; others go to capable-model
LATENCY_THRESHOLD_MS = 800   # reserved for a latency check; not enforced in this tutorial's code
```

## Step 2: Build the OpenTelemetry middleware

LiteLLM exposes a `CustomLogger` hook interface. Subclass it and override `log_success_event` to emit a span after every successful completion. The span carries:

- `llm.provider`: the provider LiteLLM resolved for the chosen deployment (for example `openai`)
- `llm.model`: the underlying model string
- `llm.input_tokens` / `llm.output_tokens`: token counts from the response's `usage` block
- `llm.cost_usd`: computed from token counts and the per-deployment rate card
- `llm.routing_strategy`: the strategy that selected this deployment
- `llm.latency_ms`: end-to-end wall time for the completion call

Using `SimpleSpanProcessor` here ensures spans flush synchronously, which matters for short-lived scripts and for the verification step later.

```python
# filename: otel_cost_logger.py
import time
from typing import Any, Dict, Optional

from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import SimpleSpanProcessor, ConsoleSpanExporter
from opentelemetry.sdk.resources import Resource

from litellm.integrations.custom_logger import CustomLogger


def build_tracer_provider(service_name: str = "llm-router") -> TracerProvider:
    """Create a TracerProvider that writes spans to stdout.

    Swap ConsoleSpanExporter for OTLPSpanExporter to send to Tempo / Grafana.
    """
    resource = Resource.create({"service.name": service_name})
    provider = TracerProvider(resource=resource)
    provider.add_span_processor(SimpleSpanProcessor(ConsoleSpanExporter()))
    trace.set_tracer_provider(provider)
    return provider


class OtelCostLogger(CustomLogger):
    """LiteLLM CustomLogger that emits an OTel span per completion."""

    def __init__(self, tracer_provider: Optional[TracerProvider] = None):
        super().__init__()
        if tracer_provider is None:
            tracer_provider = build_tracer_provider()
        self._tracer = tracer_provider.get_tracer("litellm.router")

    # ------------------------------------------------------------------
    # Internal helpers
    # ------------------------------------------------------------------

    @staticmethod
    def _compute_cost(
        input_tokens: int,
        output_tokens: int,
        cost_per_1k_input: float,
        cost_per_1k_output: float,
    ) -> float:
        return (input_tokens / 1000) * cost_per_1k_input + (
            output_tokens / 1000
        ) * cost_per_1k_output

    @staticmethod
    def _extract_deployment_costs(kwargs: Dict[str, Any]):
        """Pull cost rates from the litellm_params stored in kwargs."""
        litellm_params = kwargs.get("litellm_params", {})
        cost_in = litellm_params.get("cost_per_1k_input_tokens", 0.0)
        cost_out = litellm_params.get("cost_per_1k_output_tokens", 0.0)
        return float(cost_in), float(cost_out)

    # ------------------------------------------------------------------
    # CustomLogger hooks
    # ------------------------------------------------------------------

    def log_success_event(self, kwargs, response_obj, start_time, end_time):
        """Called by LiteLLM after every successful completion."""
        latency_ms = (end_time - start_time).total_seconds() * 1000

        usage = getattr(response_obj, "usage", None)
        input_tokens = getattr(usage, "prompt_tokens", 0) or 0
        output_tokens = getattr(usage, "completion_tokens", 0) or 0

        cost_in, cost_out = self._extract_deployment_costs(kwargs)
        cost_usd = self._compute_cost(input_tokens, output_tokens, cost_in, cost_out)

        model = kwargs.get("model", "unknown")
        # LiteLLM records the resolved provider name (e.g. "openai") in litellm_params
        litellm_params = kwargs.get("litellm_params", {})
        provider = litellm_params.get("custom_llm_provider", "unknown")
        routing_strategy = kwargs.get("metadata", {}).get(
            "routing_strategy", "unknown"
        )

        with self._tracer.start_as_current_span("llm.completion") as span:
            span.set_attribute("llm.provider", provider)
            span.set_attribute("llm.model", model)
            span.set_attribute("llm.input_tokens", input_tokens)
            span.set_attribute("llm.output_tokens", output_tokens)
            span.set_attribute("llm.cost_usd", round(cost_usd, 8))
            span.set_attribute("llm.latency_ms", round(latency_ms, 2))
            span.set_attribute("llm.routing_strategy", routing_strategy)

    def log_failure_event(self, kwargs, response_obj, start_time, end_time):
        """Called by LiteLLM when a completion fails."""
        latency_ms = (end_time - start_time).total_seconds() * 1000
        model = kwargs.get("model", "unknown")
        exception = kwargs.get("exception", "unknown")

        with self._tracer.start_as_current_span("llm.completion.error") as span:
            span.set_attribute("llm.model", model)
            span.set_attribute("llm.latency_ms", round(latency_ms, 2))
            span.set_attribute("error.message", str(exception))
            span.set_status(
                trace.StatusCode.ERROR, description=str(exception)
            )
```
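
As a sanity check on the arithmetic in `_compute_cost`, the following standalone sketch reproduces the same formula outside the logger and applies it to the `fast-model` rate card from Step 1. The function name `cost_usd` and the token counts are illustrative assumptions, not part of LiteLLM.

```python
def cost_usd(
    input_tokens: int,
    output_tokens: int,
    cost_per_1k_input: float,
    cost_per_1k_output: float,
) -> float:
    """Same formula as OtelCostLogger._compute_cost, kept standalone."""
    return (input_tokens / 1000) * cost_per_1k_input + (
        output_tokens / 1000
    ) * cost_per_1k_output


# fast-model rate card copied from router_config.py (USD per 1k tokens)
FAST_IN, FAST_OUT = 0.00015, 0.00060

# 100 prompt tokens and 50 completion tokens (illustrative values)
example = cost_usd(100, 50, FAST_IN, FAST_OUT)
print(f"{example:.6f}")  # 0.000045
```

Per-request costs at these rates are tiny fractions of a cent, which is why the logger rounds `llm.cost_usd` to eight decimal places rather than two.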

## Step 3: Implement the cost-aware router

LiteLLM's `Router` accepts a `routing_strategy` string. The built-in `latency-based-routing` strategy works well for latency optimization, but for cost-aware routing you wrap the `Router` in a thin dispatcher that checks estimated request cost before delegating to the underlying router.

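The dispatcher estimates cost from the prompt token count (using LiteLLM's `token_counter` utility), an assumed output length, and the `fast-model` rate card, then compares the estimate with `COST_THRESHOLD_USD` to choose between `fast-model` and `capable-model`. `LATENCY_THRESHOLD_MS` is imported but not enforced by this code; latency-aware selection is left to the router's `latency-based-routing` strategy.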

```python
# filename: cost_router.py
import os
from typing import List, Dict, Any, Optional

import litellm
from litellm import Router
from litellm.utils import token_counter

from router_config import DEPLOYMENTS, COST_THRESHOLD_USD, LATENCY_THRESHOLD_MS
from otel_cost_logger import OtelCostLogger, build_tracer_provider


def build_router(otel_logger: Optional[OtelCostLogger] = None) -> Router:
    """Construct a LiteLLM Router with the cost-logger callback attached."""
    if otel_logger is None:
        provider = build_tracer_provider()
        otel_logger = OtelCostLogger(tracer_provider=provider)

    # Register the custom logger globally so it fires for every completion
    litellm.callbacks = [otel_logger]

    router = Router(
        model_list=DEPLOYMENTS,
        routing_strategy="latency-based-routing",
        # Retry on transient errors before raising
        num_retries=2,
        retry_after=1,
        # Cooldown a deployment for 30 s after a failure
        cooldown_time=30,
    )
    return router


def estimate_request_cost(
    messages: List[Dict[str, str]],
    cost_per_1k_input: float,
    output_estimate_tokens: int = 256,
    cost_per_1k_output: float = 0.0,
) -> float:
    """Rough pre-flight cost estimate based on prompt token count."""
    prompt_tokens = token_counter(model="gpt-4o-mini", messages=messages)
    return (prompt_tokens / 1000) * cost_per_1k_input + (
        output_estimate_tokens / 1000
    ) * cost_per_1k_output


def pick_model_name(
    messages: List[Dict[str, str]],
    prefer_cheap: bool = True,
) -> str:
    """Return 'fast-model' or 'capable-model' based on cost estimate.

    The estimate always uses the fast-model rate card. If it reaches
    COST_THRESHOLD_USD (e.g. very long context), the request goes to the
    capable model instead, on the assumption that a large request warrants
    the more capable (and more expensive) model.
    """
    fast_params = DEPLOYMENTS[0]["litellm_params"]
    estimated = estimate_request_cost(
        messages,
        cost_per_1k_input=fast_params["cost_per_1k_input_tokens"],
        cost_per_1k_output=fast_params["cost_per_1k_output_tokens"],
    )
    if prefer_cheap and estimated < COST_THRESHOLD_USD:
        return "fast-model"
    return "capable-model"
```
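
To see where that threshold actually bites, the sketch below inverts the estimate: given a rate card, the default 256-token output estimate, and `COST_THRESHOLD_USD`, it computes how many prompt tokens are needed before the estimate reaches the threshold. The helper name `breakeven_prompt_tokens` is hypothetical; the constants are copied from `router_config.py` and `estimate_request_cost`.

```python
import math

COST_THRESHOLD_USD = 0.001
FAST_IN, FAST_OUT = 0.00015, 0.00060   # USD per 1k tokens (fast-model)
OUTPUT_ESTIMATE_TOKENS = 256           # default used by estimate_request_cost


def breakeven_prompt_tokens(
    threshold: float, cost_in: float, cost_out: float, output_tokens: int
) -> int:
    """Smallest prompt token count whose estimate is >= threshold."""
    output_cost = (output_tokens / 1000) * cost_out
    remaining = threshold - output_cost
    if remaining <= 0:
        return 0  # the output estimate alone already reaches the threshold
    return math.ceil(remaining / cost_in * 1000)


tokens = breakeven_prompt_tokens(
    COST_THRESHOLD_USD, FAST_IN, FAST_OUT, OUTPUT_ESTIMATE_TOKENS
)
print(tokens)  # 5643
```

Under these assumptions, any prompt shorter than about 5,600 tokens stays on `fast-model`. Lower `COST_THRESHOLD_USD` if you want the switch to happen for smaller requests.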

## Step 4: Wire a mock backend for local testing

So that the pipeline runs without real API keys, you'll build a tiny FastAPI mock that returns a well-formed OpenAI-compatible response. The router will target this mock via `api_base`, so the full OTel pipeline runs without any external calls.

```python
# filename: mock_llm_server.py
from fastapi import FastAPI, Request
from fastapi.responses import JSONResponse
import asyncio
import time

app = FastAPI()


@app.get("/health")
def health():
    return {"status": "ok"}


@app.post("/v1/chat/completions")
async def chat_completions(request: Request):
    body = await request.json()
    model = body.get("model", "mock-model")
    messages = body.get("messages", [])
    prompt = messages[-1]["content"] if messages else ""

    # Simulate a small latency without blocking the event loop
    await asyncio.sleep(0.05)

    return JSONResponse({
        "id": "mock-cmpl-001",
        "object": "chat.completion",
        "created": int(time.time()),
        "model": model,
        "choices": [{
            "index": 0,
            "message": {
                "role": "assistant",
                "content": f"Mock response to: {prompt[:40]}",
            },
            "finish_reason": "stop",
        }],
        "usage": {
            "prompt_tokens": 20,
            "completion_tokens": 15,
            "total_tokens": 35,
        },
    })
```

Install FastAPI and Uvicorn, then start the mock server as a background process:

```bash
uv pip install fastapi uvicorn httpx
```

```bash
nohup uvicorn mock_llm_server:app --host 0.0.0.0 --port 8765 > /tmp/mock_llm.log 2>&1 & disown
sleep 2
curl -sf http://localhost:8765/health || (echo "mock server failed to start" >&2; cat /tmp/mock_llm.log; exit 1)
echo "mock_server_ready"
```
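
Before wiring the router to the mock, it can help to check that a response has the fields an OpenAI-compatible client reads. This is a minimal sketch that assumes only the fields the mock above returns; `check_chat_completion` is a hypothetical helper, and the required-key lists are this tutorial's assumption rather than an official schema.

```python
REQUIRED_TOP_LEVEL = ("id", "object", "model", "choices", "usage")
REQUIRED_USAGE = ("prompt_tokens", "completion_tokens", "total_tokens")


def check_chat_completion(payload: dict) -> list[str]:
    """Return a list of problems found; an empty list means the shape looks OK."""
    problems = [f"missing key: {k}" for k in REQUIRED_TOP_LEVEL if k not in payload]
    usage = payload.get("usage") or {}
    problems += [f"missing usage.{k}" for k in REQUIRED_USAGE if k not in usage]
    choices = payload.get("choices") or []
    if not choices:
        problems.append("choices is empty")
    elif "content" not in choices[0].get("message", {}):
        problems.append("choices[0].message.content missing")
    return problems


# A payload shaped like the mock server's response
sample = {
    "id": "mock-cmpl-001",
    "object": "chat.completion",
    "model": "mock-fast",
    "choices": [{
        "index": 0,
        "message": {"role": "assistant", "content": "hi"},
        "finish_reason": "stop",
    }],
    "usage": {"prompt_tokens": 20, "completion_tokens": 15, "total_tokens": 35},
}
print(check_chat_completion(sample))  # []
```

To run the check against the live mock, parse the JSON body from a `curl` call to `/v1/chat/completions` and pass the resulting dict to the helper.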

## Step 5: Run the router against the mock backend

This script overrides the deployment configs to point both models at the local mock server, then fires three completions through the router and prints the routing decisions. The OTel middleware emits a span to stdout for each call.

```python
# filename: run_router.py
import os
import litellm
from litellm import Router

from otel_cost_logger import OtelCostLogger, build_tracer_provider
from router_config import COST_THRESHOLD_USD
from cost_router import estimate_request_cost

# Silence LiteLLM's verbose logging so OTel span output is readable
litellm.set_verbose = False

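# The OpenAI-compatible client appends /chat/completions to this base URL,
# so it must include the /v1 prefix that the mock server routes on.
MOCK_BASE = "http://localhost:8765/v1"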

MOCK_DEPLOYMENTS = [
    {
        "model_name": "fast-model",
        "litellm_params": {
            "model": "openai/mock-fast",
            "api_key": "mock-key",
            "api_base": MOCK_BASE,
            "cost_per_1k_input_tokens": 0.00015,
            "cost_per_1k_output_tokens": 0.00060,
        },
        "rpm": 500,
        "tpm": 200_000,
    },
    {
        "model_name": "capable-model",
        "litellm_params": {
            "model": "openai/mock-capable",
            "api_key": "mock-key",
            "api_base": MOCK_BASE,
            "cost_per_1k_input_tokens": 0.00080,
            "cost_per_1k_output_tokens": 0.00400,
        },
        "rpm": 100,
        "tpm": 100_000,
    },
]

tracer_provider = build_tracer_provider()
otel_logger = OtelCostLogger(tracer_provider=tracer_provider)
litellm.callbacks = [otel_logger]

router = Router(
    model_list=MOCK_DEPLOYMENTS,
    routing_strategy="latency-based-routing",
    num_retries=1,
    retry_after=0,
)


def route_and_call(messages, label=""):
    fast_params = MOCK_DEPLOYMENTS[0]["litellm_params"]
    estimated = estimate_request_cost(
        messages,
        cost_per_1k_input=fast_params["cost_per_1k_input_tokens"],
        cost_per_1k_output=fast_params["cost_per_1k_output_tokens"],
    )
    model_name = "fast-model" if estimated < COST_THRESHOLD_USD else "capable-model"
    print(f"[{label}] estimated_cost=${estimated:.6f} -> routing to '{model_name}'")

    response = router.completion(
        model=model_name,
        messages=messages,
        metadata={"routing_strategy": "cost-aware"},
    )
    content = response.choices[0].message.content
    print(f"[{label}] response: {content[:60]}")
    return response


if __name__ == "__main__":
    # Request 1: short prompt, should route to fast-model
    route_and_call(
        [{"role": "user", "content": "What is 2+2?"}],
        label="short",
    )

    # Request 2: medium prompt
    route_and_call(
        [{"role": "user", "content": "Summarize the history of the Roman Empire in one paragraph."}],
        label="medium",
    )

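    # Request 3: simulate a longer context by repeating text. At roughly 800
    # words the estimate is still under COST_THRESHOLD_USD, so this also
    # routes to fast-model; a prompt of several thousand tokens is needed to
    # reach capable-model with these rates.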
    long_prompt = "Analyze this text: " + ("word " * 800)
    route_and_call(
        [{"role": "user", "content": long_prompt}],
        label="long",
    )

    print("routing_complete")
```

```bash
python run_router.py 2>/dev/null
```

## Verify it works

The verification script imports the OTel machinery directly, calls `log_success_event` with a mocked response object instead of issuing a real completion, then checks that the span was emitted with the expected attributes. It passes an `io.StringIO` buffer to `ConsoleSpanExporter` so the output can be inspected in-process.

```python
import io
import sys
import time
import json
from datetime import datetime
from unittest.mock import MagicMock

from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import SimpleSpanProcessor, ConsoleSpanExporter
from opentelemetry.sdk.resources import Resource
from opentelemetry import trace

from otel_cost_logger import OtelCostLogger

# ---- Build an isolated tracer that writes to a StringIO buffer ----
buf = io.StringIO()
exporter = ConsoleSpanExporter(out=buf)
resource = Resource.create({"service.name": "test"})
provider = TracerProvider(resource=resource)
provider.add_span_processor(SimpleSpanProcessor(exporter))
trace.set_tracer_provider(provider)

logger = OtelCostLogger(tracer_provider=provider)

# ---- Simulate a LiteLLM success event ----
usage = MagicMock()
usage.prompt_tokens = 100
usage.completion_tokens = 50

response_obj = MagicMock()
response_obj.usage = usage

kwargs = {
    "model": "openai/gpt-4o-mini",
    "litellm_params": {
        "custom_llm_provider": "openai",
        "cost_per_1k_input_tokens": 0.00015,
        "cost_per_1k_output_tokens": 0.00060,
    },
    "metadata": {"routing_strategy": "cost-aware"},
}

# datetime.utcnow() is deprecated as of Python 3.12; only the difference matters here
start = datetime.now()
time.sleep(0.01)
end = datetime.now()

logger.log_success_event(kwargs, response_obj, start, end)

# ---- Assert span was emitted ----
span_output = buf.getvalue()
assert "llm.completion" in span_output, f"Expected span name not found. Got:\n{span_output[:500]}"
assert "llm.cost_usd" in span_output, "Expected llm.cost_usd attribute not found"
assert "llm.input_tokens" in span_output, "Expected llm.input_tokens attribute not found"
assert "llm.routing_strategy" in span_output, "Expected llm.routing_strategy attribute not found"

print("verify_otel_span_ok")
```

## Connecting to Grafana Tempo (optional)

The `ConsoleSpanExporter` is useful for local development. To ship spans to a real backend, replace it with the OTLP gRPC exporter. The swap is a small change inside `build_tracer_provider` in `otel_cost_logger.py`, where `provider` is the `TracerProvider` being configured:

```python
# Swap ConsoleSpanExporter for OTLP in production
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.trace.export import BatchSpanProcessor

otlp_exporter = OTLPSpanExporter(endpoint="http://tempo:4317", insecure=True)
provider.add_span_processor(BatchSpanProcessor(otlp_exporter))
```

A minimal `docker-compose.yml` for local Tempo + Grafana is shown below for reference. It requires Docker and a `tempo.yaml` configuration file next to it, which this tutorial does not provide:

```yaml
# docker-compose.yml
version: "3.9"
services:
  tempo:
    image: grafana/tempo:latest
    command: ["-config.file=/etc/tempo.yaml"]
    volumes:
      - ./tempo.yaml:/etc/tempo.yaml
    ports:
      - "4317:4317"   # OTLP gRPC
      - "3200:3200"   # Tempo query API
  grafana:
    image: grafana/grafana:latest
    ports:
      - "3000:3000"
    environment:
      - GF_AUTH_ANONYMOUS_ENABLED=true
      - GF_AUTH_ANONYMOUS_ORG_ROLE=Admin
```

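Once Tempo is running on your machine, point the exporter at `http://localhost:4317`. An explicit `endpoint=` argument takes precedence, so either change it or omit it and set `OTEL_EXPORTER_OTLP_ENDPOINT=http://localhost:4317` instead. After you add Tempo as a data source in Grafana, the spans appear in the Explore view. Filter by `llm.provider` or `llm.routing_strategy` to build cost-attribution panels.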

## Troubleshooting

**`ModuleNotFoundError: No module named 'litellm.integrations.custom_logger'`**: The installed LiteLLM is likely older than the version this tutorial targets. Run `uv pip install "litellm>=1.40.0"` and confirm with `importlib.metadata.version("litellm")`. The `CustomLogger` class lives at `litellm.integrations.custom_logger.CustomLogger` in current releases.

**Router raises `openai.AuthenticationError` even with mock deployments**: This usually means the request was sent without a key, or reached the real OpenAI API because `api_base` was missing. Use `"api_key": "mock-key"` (any non-empty string) and set `"api_base"` to your mock server URL. The `openai/` prefix in the model string tells LiteLLM to use the OpenAI-compatible client, which respects `api_base`.

**Spans appear in stdout but `buf.getvalue()` is empty in the verify block**: This happens when `BatchSpanProcessor` is used instead of `SimpleSpanProcessor`. `BatchSpanProcessor` exports from a background thread on a timer (every 5 seconds by default) or at shutdown, so the assertion can run before anything is written. The tutorial uses `SimpleSpanProcessor` throughout for this reason; alternatively, call `provider.force_flush()` before reading the buffer. In production, switch to `BatchSpanProcessor`.

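**`token_counter` returns 0 or implausible counts**: `token_counter` relies on `tiktoken`, which is normally installed as a LiteLLM dependency. Confirm that `import tiktoken` succeeds in the same environment (reinstall with `uv pip install tiktoken` if not) and that `messages` is a non-empty list of dicts with string `content`. The estimate in this tutorial is only a pre-flight approximation, so small tokenizer differences between models do not change the routing decision much.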

**Mock server returns an error for hand-written `curl` requests**: The handler parses the raw body with `request.json()`, so an empty or malformed body makes the request fail (as a 500 rather than a 422, because the mock declares no request model for FastAPI to validate). Include `-H 'Content-Type: application/json'` and a valid body: `'{"model": "mock", "messages": [{"role": "user", "content": "hi"}]}'`.

**`cooldown_time` causes all deployments to be marked unhealthy**: During testing with a mock that returns errors, the router may cool down both deployments simultaneously. Set `cooldown_time=0` in the `Router` constructor while debugging, then restore it for production.

## Next steps

- **Add a fallback chain**: LiteLLM's `Router` supports `fallbacks` as a list of model names. Configure `fallbacks=[{"fast-model": ["capable-model"]}]` so a 429 on the cheap model automatically retries on the capable one without application-level code changes.
- **Emit budget alerts**: Add a rolling cost accumulator in `OtelCostLogger` that emits a `llm.budget.exceeded` span attribute when the hourly spend crosses a threshold. Wire a Grafana alert on that attribute.
- **Integrate with vLLM self-hosted deployments**: Replace one of the cloud provider entries with a vLLM endpoint (for example `api_base: http://your-vllm-host:8000/v1`, since vLLM serves its OpenAI-compatible routes under `/v1`) to route between a self-hosted model and a cloud fallback. Because vLLM exposes an OpenAI-compatible API surface, the rest of the router config should need little or no change [1].
- **Structured logging alongside spans**: The `log_success_event` hook can also write a JSON line to a file or a Kafka topic. Pairing structured logs with OTel traces (linked by `trace_id`) gives you both real-time alerting and historical cost analytics without a separate billing pipeline.
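
The budget-alert idea above can be sketched with a sliding-window accumulator. This is an illustrative design, not a LiteLLM or OpenTelemetry feature: the class name `RollingBudget`, its parameters, and the limit values are all assumptions made for the example.

```python
import time
from collections import deque
from typing import Deque, Optional, Tuple


class RollingBudget:
    """Tracks spend over a sliding window and reports when a limit is crossed."""

    def __init__(self, limit_usd: float, window_s: float = 3600.0):
        self.limit_usd = limit_usd
        self.window_s = window_s
        self._events: Deque[Tuple[float, float]] = deque()  # (timestamp, cost)
        self._total = 0.0

    def add(self, cost_usd: float, now: Optional[float] = None) -> bool:
        """Record one request's cost; return True if the window total exceeds the limit."""
        now = time.monotonic() if now is None else now
        self._events.append((now, cost_usd))
        self._total += cost_usd
        # Drop events that have aged out of the window
        while self._events and now - self._events[0][0] > self.window_s:
            _, old_cost = self._events.popleft()
            self._total -= old_cost
        return self._total > self.limit_usd


budget = RollingBudget(limit_usd=0.01)
first = budget.add(0.006, now=0.0)      # 0.006 in window: under the limit
second = budget.add(0.006, now=10.0)    # 0.012 in window: over the limit
third = budget.add(0.001, now=3700.0)   # earlier events expired: under again
print(first, second, third)  # False True False
```

Inside `log_success_event`, one way to use it would be `span.set_attribute("llm.budget.exceeded", budget.add(cost_usd))`, with a single `RollingBudget` instance held on the logger. If callbacks can fire from several threads, guard `add` with a lock.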

## FAQ

### How does the cost-aware router decide which model to use?

The dispatcher estimates the request cost from the prompt token count, an assumed 256 output tokens, and the `fast-model` rate card. If the estimate is below `COST_THRESHOLD_USD`, the request goes to `fast-model`; otherwise it goes to `capable-model`, on the assumption that a long or complex request justifies the more capable model.

### What attributes does the OpenTelemetry span carry for each request?

Each span includes llm.provider, llm.model, llm.input_tokens, llm.output_tokens, llm.cost_usd, llm.latency_ms, and llm.routing_strategy. These attributes survive backend migrations intact and index the same way on Datadog or Honeycomb.

### Why use SimpleSpanProcessor instead of BatchSpanProcessor in the tutorial?

SimpleSpanProcessor flushes spans synchronously, which is necessary for short-lived scripts and for the verification step to capture spans in a StringIO buffer. In production, BatchSpanProcessor is preferred for performance.

### How do you swap from local console output to a real OTLP backend like Grafana Tempo?

Replace ConsoleSpanExporter with OTLPSpanExporter and set the endpoint to your Tempo instance (e.g., http://tempo:4317). The span structure and attributes remain unchanged, so filtering by llm.provider or llm.routing_strategy works identically in Grafana Explore.

### What happens if the estimated cost of the cheap model exceeds the cost threshold?

The request goes to `capable-model`. The tutorial treats a high estimate as a sign of a long-context request where quality matters more than price. Note that `capable-model` costs more per token than `fast-model` at every prompt length in the example rate cards, so this rule deliberately spends more on large requests rather than saving money on them.

## References

1. https://github.com/vllm-project/vllm
2. https://opentelemetry.io/blog/2026/kotlin-multiplatform-opentelemetry/
