LiteLLM Proxy with vLLM Backend and OpenTelemetry Cost Attribution

Why this matters

Sovereign-cloud LLM deployments are no longer optional for operators serving EU public-sector or financial clients. The AI Act’s transparency requirements mean every inference call needs a durable, attributable record: which model, how many tokens, what cost, which tenant. Running a managed API (OpenAI, Anthropic) makes that record someone else’s problem. Running your own stack makes it yours.

vLLM provides the high-throughput inference engine. LiteLLM’s proxy layer adds a unified OpenAI-compatible API surface, per-model cost tables, and a callback hook system. OpenTelemetry ties them together with vendor-neutral span emission that any collector (Grafana Tempo, SigNoz, Jaeger) can ingest.

The gap most operators hit: LiteLLM’s built-in logging callbacks emit cost data to its own database or to LangSmith, neither of which is acceptable in a data-residency-constrained environment. This tutorial wires a custom OTel callback directly into LiteLLM’s CustomLogger interface, emitting token counts and cost as span attributes on every completion, with no data leaving your network.

Prerequisites

Python 3.11 or 3.12
An NVIDIA GPU for production vLLM use (the tutorial mocks the vLLM backend for sandbox execution; GPU notes are included for real deployments)
Basic familiarity with LiteLLM config YAML
curl for smoke-testing the proxy
Docker and Docker Compose if you want to run the full Grafana Tempo stack (optional; the tutorial uses a console exporter that runs without Docker)

Setup

Install LiteLLM, the OpenTelemetry SDK, httpx, and the mock backend’s dependencies (FastAPI and uvicorn):

uv pip install "litellm>=1.40.0" opentelemetry-sdk opentelemetry-api httpx fastapi uvicorn

Verify the key packages are present:

from importlib.metadata import version
for pkg in ["litellm", "opentelemetry-sdk", "opentelemetry-api", "fastapi", "uvicorn"]:
    print(f"{pkg}: {version(pkg)}")
print("env_check_ok")

Step 1: Build the OTel cost-attribution callback

LiteLLM’s CustomLogger exposes log_success_event and log_failure_event hooks. Each hook receives a kwargs dict containing the full request context and a response_obj with usage statistics. The callback below extracts token counts, looks up the per-token cost from LiteLLM’s built-in cost map, and records everything as span attributes.

# filename: otel_cost_logger.py
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import SimpleSpanProcessor, ConsoleSpanExporter
from opentelemetry.sdk.resources import Resource
from litellm.integrations.custom_logger import CustomLogger
import litellm

# ---------------------------------------------------------------------------
# Tracer setup — console exporter so every span is immediately visible.
# In production, swap ConsoleSpanExporter for an OTLPSpanExporter pointed at
# your Grafana Tempo or SigNoz collector endpoint.
# ---------------------------------------------------------------------------
_resource = Resource.create({"service.name": "litellm-proxy", "service.version": "1.0.0"})
_provider = TracerProvider(resource=_resource)
_provider.add_span_processor(SimpleSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(_provider)
_tracer = trace.get_tracer("litellm.cost_attribution")


class OtelCostLogger(CustomLogger):
    """Emit one OTel span per LiteLLM completion with token-cost attributes."""

    def log_success_event(self, kwargs, response_obj, start_time, end_time):
        model = kwargs.get("model", "unknown")
        call_type = kwargs.get("call_type", "completion")
        metadata = kwargs.get("litellm_params", {}).get("metadata") or {}
        tenant = metadata.get("tenant_id", "default")

        usage = getattr(response_obj, "usage", None)
        prompt_tokens = getattr(usage, "prompt_tokens", 0) or 0
        completion_tokens = getattr(usage, "completion_tokens", 0) or 0
        total_tokens = getattr(usage, "total_tokens", 0) or (prompt_tokens + completion_tokens)

        # LiteLLM ships a cost map; fall back to 0.0 if the model isn't listed.
        try:
            cost = litellm.completion_cost(completion_response=response_obj, model=model)
        except Exception:
            cost = 0.0

        duration_ms = (end_time - start_time).total_seconds() * 1000

        with _tracer.start_as_current_span(f"llm.completion.{call_type}") as span:
            span.set_attribute("llm.model", model)
            span.set_attribute("llm.call_type", call_type)
            span.set_attribute("llm.tenant_id", tenant)
            span.set_attribute("llm.prompt_tokens", prompt_tokens)
            span.set_attribute("llm.completion_tokens", completion_tokens)
            span.set_attribute("llm.total_tokens", total_tokens)
            span.set_attribute("llm.cost_usd", round(cost, 8))
            span.set_attribute("llm.duration_ms", round(duration_ms, 2))
            span.set_attribute("llm.success", True)

    def log_failure_event(self, kwargs, response_obj, start_time, end_time):
        model = kwargs.get("model", "unknown")
        error_str = str(kwargs.get("exception", "unknown error"))

        with _tracer.start_as_current_span("llm.completion.failure") as span:
            span.set_attribute("llm.model", model)
            span.set_attribute("llm.success", False)
            span.set_attribute("llm.error", error_str)


def get_tracer_provider() -> TracerProvider:
    return _provider
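
The tenant lookup in log_success_event depends on the nested shape of kwargs. The sketch below isolates that lookup so the fallback behaviour is easy to see. It uses only the standard library; extract_tenant and the three sample dicts are hypothetical stand-ins written for illustration, not LiteLLM payloads captured from a real call.

```python
from typing import Any, Mapping


def extract_tenant(kwargs: Mapping[str, Any], default: str = "default") -> str:
    """Mirror the callback's lookup: litellm_params -> metadata -> tenant_id."""
    litellm_params = kwargs.get("litellm_params") or {}
    # metadata can be present but None, hence the `or {}` guard.
    metadata = litellm_params.get("metadata") or {}
    return str(metadata.get("tenant_id") or default)


# Hand-written stand-ins for the kwargs dict; real payloads carry many more keys.
with_tenant = {"model": "mistral-7b", "litellm_params": {"metadata": {"tenant_id": "tenant-acme"}}}
null_metadata = {"model": "mistral-7b", "litellm_params": {"metadata": None}}
no_params = {"model": "mistral-7b"}

print(extract_tenant(with_tenant))    # tenant-acme
print(extract_tenant(null_metadata))  # default
print(extract_tenant(no_params))      # default
```

Falling back to a fixed "default" tenant keeps the span attribute populated, at the price of lumping unattributed calls together; a stricter deployment might prefer to flag them instead.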

Step 2: Write a mock vLLM-compatible backend

In production you run vllm serve mistralai/Mistral-7B-Instruct-v0.2 --port 8001. Because the sandbox has no GPU, this step creates a FastAPI app that speaks the OpenAI /v1/chat/completions wire format. The LiteLLM proxy treats it identically to a real vLLM endpoint.

# filename: mock_vllm.py
import time
from fastapi import FastAPI, Request
from fastapi.responses import JSONResponse

app = FastAPI()

@app.get("/health")
def health():
    return {"status": "ok"}

@app.post("/v1/chat/completions")
async def chat_completions(request: Request):
    body = await request.json()
    model = body.get("model", "mistral-7b")
    messages = body.get("messages", [])
    last_user = next(
        (m["content"] for m in reversed(messages) if m.get("role") == "user"),
        "Hello",
    )
    reply = f"[mock-vllm] Echo: {last_user[:80]}"
    prompt_tokens = sum(len(m.get("content", "").split()) for m in messages)
    completion_tokens = len(reply.split())
    return JSONResponse({
        "id": f"chatcmpl-mock-{int(time.time())}",
        "object": "chat.completion",
        "created": int(time.time()),
        "model": model,
        "choices": [{
            "index": 0,
            "message": {"role": "assistant", "content": reply},
            "finish_reason": "stop",
        }],
        "usage": {
            "prompt_tokens": prompt_tokens,
            "completion_tokens": completion_tokens,
            "total_tokens": prompt_tokens + completion_tokens,
        },
    })

Start the mock backend as a detached process:

nohup uvicorn mock_vllm:app --host 0.0.0.0 --port 8001 > /tmp/mock_vllm.log 2>&1 & disown
sleep 3
curl -sf http://localhost:8001/health || (echo "mock vLLM failed to start" >&2; cat /tmp/mock_vllm.log; exit 1)
echo "mock_vllm_started"
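
The fixed sleep 3 above works for the demo but can be too short on a slow machine. As an alternative, here is a minimal polling sketch using only the standard library. wait_until_ready and http_health_probe are hypothetical helper names, and the timeout and interval values are arbitrary choices.

```python
import time
import urllib.request
from typing import Callable


def wait_until_ready(probe: Callable[[], bool], timeout_s: float = 10.0, interval_s: float = 0.2) -> bool:
    """Call probe() until it returns True or timeout_s elapses."""
    deadline = time.monotonic() + timeout_s
    while True:
        try:
            if probe():
                return True
        except OSError:
            pass  # e.g. connection refused while the server is still booting
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval_s)


def http_health_probe(url: str = "http://localhost:8001/health") -> bool:
    """Return True when the health endpoint answers with HTTP 200."""
    with urllib.request.urlopen(url, timeout=1) as resp:
        return resp.status == 200


# Demo with a fake probe that succeeds on the third attempt (no server needed).
attempts = iter([False, False, True])
print(wait_until_ready(lambda: next(attempts), timeout_s=2.0, interval_s=0.01))  # True

# Against the mock backend you would call: wait_until_ready(http_health_probe)
```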

Step 3: Configure LiteLLM to route to the mock vLLM backend

LiteLLM’s proxy reads a YAML config that maps model aliases to upstream providers. The openai/ prefix tells LiteLLM to use the OpenAI-compatible client, and api_base points to the local vLLM (or mock) endpoint.

# filename: litellm_config.yaml
model_list:
  - model_name: mistral-7b
    litellm_params:
      model: openai/mistral-7b
      api_base: http://localhost:8001/v1
      api_key: "not-needed"
      # Real vLLM deployment: set api_base to http://your-gpu-host:8001/v1
      # and model to openai/mistralai/Mistral-7B-Instruct-v0.2

litellm_settings:
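  # No cost map entry is defined for the mock model here, so llm.cost_usd
  # reports 0.0 in the demo (see Troubleshooting for registering a price).
  # For a real deployment, check whether LiteLLM's built-in map covers your model name.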
  success_callback: []
  failure_callback: []
  request_timeout: 30

general_settings:
  master_key: "sk-local-dev-key"

Step 4: Wire the callback and call the proxy programmatically

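For the sandbox, drive LiteLLM directly in library mode rather than spawning the proxy server. This exercises the same CustomLogger interface the proxy server uses, although the proxy handles requests asynchronously and may call the async variants of these hooks instead (see Next steps).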

# filename: run_proxy_demo.py
import sys
sys.path.insert(0, "/workspace")

import litellm
from otel_cost_logger import OtelCostLogger, get_tracer_provider

# Register the OTel callback
cost_logger = OtelCostLogger()
litellm.callbacks = [cost_logger]

# Clear any global api_base so the per-call api_base below is the one that applies
litellm.api_base = None

def run_completion(prompt: str, tenant_id: str = "tenant-acme") -> str:
    response = litellm.completion(
        model="openai/mistral-7b",
        messages=[{"role": "user", "content": prompt}],
        api_base="http://localhost:8001/v1",
        api_key="not-needed",
        metadata={"tenant_id": tenant_id},
    )
    return response.choices[0].message.content

if __name__ == "__main__":
    result = run_completion("Explain token cost attribution in one sentence.", tenant_id="tenant-acme")
    print("LLM response:", result)
    # Flush any buffered spans before the process exits
    get_tracer_provider().force_flush()
    print("demo_complete")

Step 5: Run the demo and observe the spans

cd /workspace && python run_proxy_demo.py

The console exporter prints each span as a JSON-like block to stdout. Look for lines containing llm.model, llm.prompt_tokens, llm.cost_usd, and llm.tenant_id. These are the attributes your collector would index for cost dashboards.
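
If you want to pull those attributes out of captured output programmatically rather than by eye, the sketch below scans text for JSON objects that carry an attributes key. It assumes the exporter writes each span as a standalone JSON object; the sample text is hand-written and heavily abbreviated, and iter_span_dicts is a hypothetical helper name.

```python
import json
from typing import Any, Iterator


def iter_span_dicts(text: str) -> Iterator[dict[str, Any]]:
    """Yield every top-level JSON object in text that has an 'attributes' key."""
    decoder = json.JSONDecoder()
    pos = text.find("{")
    while pos != -1:
        try:
            obj, end = decoder.raw_decode(text, pos)
        except json.JSONDecodeError:
            # Not valid JSON from this brace; try the next one.
            pos = text.find("{", pos + 1)
            continue
        if isinstance(obj, dict) and "attributes" in obj:
            yield obj
        pos = text.find("{", end)


# Hand-written, abbreviated sample: real spans include context, timestamps, etc.
sample_output = """LLM response: [mock-vllm] Echo: ping
{
    "name": "llm.completion.completion",
    "attributes": {
        "llm.model": "mistral-7b",
        "llm.tenant_id": "tenant-acme",
        "llm.prompt_tokens": 1,
        "llm.completion_tokens": 3,
        "llm.cost_usd": 0.0
    }
}
demo_complete
"""

for span in iter_span_dicts(sample_output):
    attrs = span["attributes"]
    print(attrs["llm.tenant_id"], attrs["llm.prompt_tokens"], attrs["llm.cost_usd"])
```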

Step 6: Understand the span schema

Every successful completion emits a span with this attribute set:

Attribute	Type	Description
`llm.model`	string	Model alias as sent to LiteLLM
`llm.call_type`	string	`completion`, `acompletion`, etc.
`llm.tenant_id`	string	Passed via `metadata.tenant_id`
`llm.prompt_tokens`	int	Tokens in the prompt
`llm.completion_tokens`	int	Tokens in the response
`llm.total_tokens`	int	Sum of prompt and completion
`llm.cost_usd`	float	Estimated cost from LiteLLM’s cost map
`llm.duration_ms`	float	Wall-clock time for the call
`llm.success`	bool	`true` on success, `false` on failure

In a production Grafana Tempo deployment, you’d query these attributes with TraceQL:

{ span.llm.tenant_id = "tenant-acme" } | sum(span.llm.cost_usd)

The same span structure indexes the same way on Datadog or Honeycomb. Only the exporter endpoint changes.
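
To make the intent of a per-tenant cost rollup concrete, here is a small standard-library sketch that sums llm.cost_usd by llm.tenant_id over a list of attribute dicts. The span data is invented for illustration and cost_by_tenant is a hypothetical helper; a collector would do this server-side.

```python
from collections import defaultdict
from typing import Any, Iterable, Mapping


def cost_by_tenant(spans: Iterable[Mapping[str, Any]]) -> dict[str, float]:
    """Sum llm.cost_usd per llm.tenant_id, counting successful spans only."""
    totals: dict[str, float] = defaultdict(float)
    for attrs in spans:
        if attrs.get("llm.success") is not True:
            continue  # failure spans carry no cost attribute
        tenant = attrs.get("llm.tenant_id", "default")
        totals[tenant] += float(attrs.get("llm.cost_usd", 0.0))
    return dict(totals)


# Invented attribute sets, shaped like the schema table above.
spans = [
    {"llm.tenant_id": "tenant-acme", "llm.cost_usd": 0.25, "llm.success": True},
    {"llm.tenant_id": "tenant-acme", "llm.cost_usd": 0.5, "llm.success": True},
    {"llm.tenant_id": "tenant-globex", "llm.cost_usd": 0.125, "llm.success": True},
    {"llm.model": "mistral-7b", "llm.success": False, "llm.error": "timeout"},
]

print(cost_by_tenant(spans))
```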

Step 7: Swap in a real vLLM backend (GPU deployment notes)

On a machine with an NVIDIA GPU, replace the mock backend with:

# skip_execution: GPU not available in sandbox
vllm serve mistralai/Mistral-7B-Instruct-v0.2 \
  --host 0.0.0.0 \
  --port 8001 \
  --dtype bfloat16 \
  --max-model-len 8192

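The api_base in litellm_config.yaml already points to http://localhost:8001/v1, so the only config change is the model name: set model to openai/mistralai/Mistral-7B-Instruct-v0.2 (as noted in the config comments) so it matches the name vLLM serves. For multi-node deployments, replace localhost with the vLLM node’s internal IP.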

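To run the LiteLLM proxy server (rather than library mode), register the callback via the config. LiteLLM’s proxy documentation registers a module-level instance rather than a class, so if the class reference below is rejected, add a line such as otel_cost_logger_instance = OtelCostLogger() to otel_cost_logger.py and reference that name instead: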

# filename: litellm_config_prod.yaml
model_list:
  - model_name: mistral-7b
    litellm_params:
      model: openai/mistralai/Mistral-7B-Instruct-v0.2
      api_base: http://vllm-host:8001/v1
      api_key: "not-needed"

litellm_settings:
  callbacks:
    - otel_cost_logger.OtelCostLogger

general_settings:
  master_key: "sk-your-key-here"

Then start the proxy:

# skip_execution: requires GPU vLLM backend and production config
litellm --config litellm_config_prod.yaml --port 4000

Clients send requests to http://proxy-host:4000/v1/chat/completions with the Authorization: Bearer sk-your-key-here header. The proxy authenticates the request, routes it to vLLM, and fires the OTel callback once the completion succeeds or fails.
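
A client request of that shape can be sketched with the standard library alone. build_chat_request is a hypothetical helper, and the host, key, and model alias are the placeholders used in this tutorial. Whether the proxy forwards a request-body metadata field to the callback depends on the LiteLLM version, so treat that field as an assumption to verify.

```python
import json
import urllib.request


def build_chat_request(
    prompt: str,
    tenant_id: str,
    base_url: str = "http://proxy-host:4000",
    api_key: str = "sk-your-key-here",
    model: str = "mistral-7b",
) -> urllib.request.Request:
    """Build (but do not send) an OpenAI-style chat completion request."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        # Assumption: the proxy passes this through to the callback's metadata.
        "metadata": {"tenant_id": tenant_id},
    }
    return urllib.request.Request(
        url=f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


request = build_chat_request("ping", tenant_id="tenant-acme")
print(request.get_method(), request.full_url)

# To actually send it once the proxy is running:
# with urllib.request.urlopen(request, timeout=30) as resp:
#     print(json.load(resp)["choices"][0]["message"]["content"])
```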

Step 8: Connect to Grafana Tempo (optional)

If you have Docker Compose available, a minimal Tempo stack accepts OTLP on port 4317. In otel_cost_logger.py, replace the line that adds SimpleSpanProcessor(ConsoleSpanExporter()) to _provider with:

# Grafana Tempo OTLP exporter — replace ConsoleSpanExporter in production
# Requires: uv pip install opentelemetry-exporter-otlp-proto-grpc
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.trace.export import BatchSpanProcessor

_otlp_exporter = OTLPSpanExporter(endpoint="http://tempo:4317", insecure=True)
_provider.add_span_processor(BatchSpanProcessor(_otlp_exporter))

The BatchSpanProcessor is appropriate for production throughput. The SimpleSpanProcessor used in the runnable demo flushes synchronously per span, which is correct for testing but adds latency under load.

Verify it works

import io
import sys
import time
sys.path.insert(0, "/workspace")

import litellm
from opentelemetry.sdk.trace.export import SimpleSpanProcessor, ConsoleSpanExporter
from otel_cost_logger import OtelCostLogger, get_tracer_provider

# Capture console exporter output in memory via a second span processor
captured = io.StringIO()
provider = get_tracer_provider()
provider.add_span_processor(SimpleSpanProcessor(ConsoleSpanExporter(out=captured)))

litellm.callbacks = [OtelCostLogger()]

response = litellm.completion(
    model="openai/mistral-7b",
    messages=[{"role": "user", "content": "ping"}],
    api_base="http://localhost:8001/v1",
    api_key="not-needed",
    metadata={"tenant_id": "verify-tenant"},
)

# LiteLLM may run success callbacks on a background thread, so wait briefly
# for the span to appear before flushing (the 2-second cap is arbitrary).
deadline = time.monotonic() + 2.0
while "llm.model" not in captured.getvalue() and time.monotonic() < deadline:
    time.sleep(0.05)
provider.force_flush()

output = captured.getvalue()
assert "llm.model" in output, f"Expected llm.model in span output, got:\n{output[:500]}"
assert "llm.tenant_id" in output, f"Expected llm.tenant_id in span output"
assert "llm.prompt_tokens" in output, f"Expected llm.prompt_tokens in span output"
print("verification_passed")
print(f"Span attributes found in output ({len(output)} chars total)")

Troubleshooting

ModuleNotFoundError: No module named 'litellm.integrations.custom_logger' — This path changed in LiteLLM 1.x. Run uv pip install "litellm>=1.40.0" to ensure you have a version that exposes CustomLogger at that import path. Check with python -c "from litellm.integrations.custom_logger import CustomLogger; print('ok')".

litellm.exceptions.APIConnectionError when calling the mock backend — The mock vLLM server may not have started. Check /tmp/mock_vllm.log for startup errors. Port 8001 may already be in use: lsof -i :8001 and kill the conflicting process.

cost_usd is always 0.0 — LiteLLM’s cost map doesn’t have an entry for openai/mistral-7b (the mock model name). This is expected in the demo. For real Mistral model names like mistralai/Mistral-7B-Instruct-v0.2, the cost map has entries. You can also set a custom cost: litellm.register_model({"mistral-7b": {"input_cost_per_token": 0.0000002, "output_cost_per_token": 0.0000002}}).
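
The per-token prices above translate into a cost by simple multiplication. The sketch below shows that arithmetic under the assumption that cost is just tokens times price for input and output; LiteLLM’s own calculation may account for more than this, and estimate_cost_usd is a hypothetical helper, not a LiteLLM function.

```python
def estimate_cost_usd(
    prompt_tokens: int,
    completion_tokens: int,
    input_cost_per_token: float,
    output_cost_per_token: float,
) -> float:
    """Tokens times per-token price, summed over input and output."""
    return prompt_tokens * input_cost_per_token + completion_tokens * output_cost_per_token


# 1,200 prompt tokens and 300 completion tokens at the example price of
# 0.0000002 USD per token in both directions.
cost = estimate_cost_usd(1200, 300, 0.0000002, 0.0000002)
print(f"{cost:.8f}")  # 0.00030000
```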

Spans appear in the console but not in Tempo — Verify the OTLP exporter endpoint is reachable: curl -v http://tempo:4317. Tempo’s OTLP gRPC port is 4317 by default; the HTTP port is 4318. If you’re using the HTTP exporter (OTLPSpanExporter from opentelemetry-exporter-otlp-proto-http), point it at port 4318.

litellm.callbacks resets between calls — LiteLLM’s global callbacks list is module-level state. If you’re running in a long-lived process and callbacks disappear, another part of your code may be reassigning litellm.callbacks = []. Use litellm.callbacks.append(cost_logger) instead of assignment to avoid clobbering other registered callbacks.
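
One way to make registration safe to repeat is a small guard that appends only when no logger of the same class is present. This is a sketch on a plain list; register_once and DummyLogger are hypothetical names, and in the tutorial’s setup you would pass litellm.callbacks and an OtelCostLogger instance.

```python
def register_once(callbacks: list, logger: object) -> bool:
    """Append logger unless an instance of the same class is already registered."""
    if any(type(existing) is type(logger) for existing in callbacks):
        return False
    callbacks.append(logger)
    return True


class DummyLogger:
    """Stand-in for OtelCostLogger so the sketch runs without LiteLLM."""


callbacks: list = []
print(register_once(callbacks, DummyLogger()))  # True
print(register_once(callbacks, DummyLogger()))  # False
print(len(callbacks))                           # 1

# In the tutorial's setup: register_once(litellm.callbacks, OtelCostLogger())
```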

The proxy server returns 401 on every request — The master_key in litellm_config.yaml must match the Authorization: Bearer <key> header sent by clients. In library mode (this tutorial’s demo), no key is required. In proxy server mode, set LITELLM_MASTER_KEY as an environment variable rather than hardcoding it in the YAML.

Next steps

Per-tenant cost budgets: LiteLLM’s BudgetManager can enforce per-tenant token budgets. Combine it with the OTel callback to emit a llm.budget_remaining_usd attribute on each span, giving your alerting system a direct signal before a tenant exhausts their quota.
Async completions: Replace litellm.completion with litellm.acompletion and implement async_log_success_event in OtelCostLogger for non-blocking callback execution in high-throughput deployments.
Semantic conventions alignment: The OpenTelemetry GenAI working group is standardizing span attribute names under the gen_ai.* namespace. Migrating llm.prompt_tokens to gen_ai.usage.input_tokens will make your spans compatible with emerging collector pipelines without changing the underlying data.
Sidecar deployment: Package otel_cost_logger.py as a pip-installable plugin and mount it into the LiteLLM proxy Docker image via a custom Dockerfile FROM ghcr.io/berriai/litellm:main. This keeps the callback versioned independently of the proxy config.
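
For the semantic-conventions migration, a rename table applied at emit time is one low-risk approach. In the sketch below, only gen_ai.usage.input_tokens comes from the text above; the other two target names reflect my understanding of the GenAI conventions and should be checked against the current specification. Attributes with no obvious standard counterpart (tenant, cost) are left unchanged.

```python
from typing import Any, Mapping

# Assumed mapping; verify the target names against the GenAI conventions.
GEN_AI_RENAMES = {
    "llm.model": "gen_ai.request.model",
    "llm.prompt_tokens": "gen_ai.usage.input_tokens",
    "llm.completion_tokens": "gen_ai.usage.output_tokens",
}


def to_gen_ai_attributes(attrs: Mapping[str, Any]) -> dict[str, Any]:
    """Rename known llm.* keys to gen_ai.* and pass everything else through."""
    return {GEN_AI_RENAMES.get(key, key): value for key, value in attrs.items()}


legacy = {
    "llm.model": "mistral-7b",
    "llm.prompt_tokens": 12,
    "llm.completion_tokens": 30,
    "llm.tenant_id": "tenant-acme",
    "llm.cost_usd": 0.0,
}

for key, value in to_gen_ai_attributes(legacy).items():
    print(f"{key} = {value}")
```

During a transition you could emit both the old and new names on each span, so existing dashboards keep working while new queries move to gen_ai.*.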

FAQ

How does the OTel callback extract token costs from LiteLLM?

The callback implements LiteLLM’s CustomLogger interface, receiving the response object and request context in log_success_event and log_failure_event hooks. It extracts prompt and completion token counts from the response usage object, then calls litellm.completion_cost() to look up the per-token cost from LiteLLM’s built-in cost map, falling back to 0.0 if the model is not listed.

What span attributes are emitted for each LLM completion?

Each successful completion emits a span with attributes including llm.model, llm.tenant_id, llm.prompt_tokens, llm.completion_tokens, llm.total_tokens, llm.cost_usd, llm.duration_ms, llm.call_type, and llm.success. Failures emit a span with llm.model, llm.success set to false, and llm.error containing the error message.

Can the stack work with a real vLLM GPU deployment instead of the mock backend?

Yes. Replace the mock FastAPI backend with vllm serve on a GPU machine, pointing to a real model like mistralai/Mistral-7B-Instruct-v0.2. Update the api_base in litellm_config.yaml to the vLLM host IP and port, and set model to the name vLLM serves (openai/mistralai/Mistral-7B-Instruct-v0.2). The callback code needs no changes because LiteLLM treats both the mock and real vLLM endpoints identically via the OpenAI-compatible API.

How do you swap the console exporter for a production collector like Grafana Tempo?

Replace ConsoleSpanExporter with OTLPSpanExporter from opentelemetry-exporter-otlp-proto-grpc, pointing to your Tempo endpoint on port 4317. Use BatchSpanProcessor instead of SimpleSpanProcessor for production throughput, as SimpleSpanProcessor flushes synchronously per span and adds latency under load.

What happens if LiteLLM’s cost map does not have an entry for a model?

The callback catches the exception from litellm.completion_cost() and sets cost to 0.0. You can register custom costs using litellm.register_model() with input_cost_per_token and output_cost_per_token values, or ensure the model name matches an entry in LiteLLM’s built-in cost map.