# Self-Hosted vLLM with OpenTelemetry Tracing for Production Inference

> Build a production-ready vLLM inference server instrumented with OpenTelemetry spans that capture per-request latency, token counts, and cost attribution. Every inference call exports structured traces to a local OpenTelemetry Collector with console and file exporters, giving you audit-grade visibility without a commercial backend.

- Canonical URL: https://agentry.press/tutorial/self-hosted-vllm-with-opentelemetry-tracing-for-production-inference/
- Type: Tutorial
- Published: 2026-05-25
- By: agentry
- Tags: vllm, opentelemetry, inference, observability, self-hosted, tracing

---

## Why this matters

vLLM [1] has become the default inference engine for teams running open-weight models in production. Its PagedAttention memory management and continuous batching deliver throughput that naive HuggingFace serving can't match. But throughput numbers alone don't tell you why a specific request took 4 seconds instead of 400 milliseconds, which tenant is burning your GPU budget, or whether a latency spike came from the prefill phase or the decode phase.

Without structured tracing, operators are left correlating log lines by timestamp and guessing. OpenTelemetry spans solve this: each request becomes a trace with child spans for tokenization, prefill, and decode, carrying attributes like `llm.model`, `llm.prompt_tokens`, `llm.completion_tokens`, and a derived `llm.estimated_cost_usd`. When something goes wrong at 3 AM, you open the trace and see exactly what happened.

This tutorial wires OpenTelemetry instrumentation directly into a vLLM-compatible HTTP server, prints spans to the console, writes them to a newline-delimited JSON file, optionally forwards them over OTLP to a local OTel Collector, and shows you how to query and verify the span data. The same span structure works with Grafana Tempo, SigNoz, Datadog, or Honeycomb: only the exporter endpoint changes.

> [!PULLQUOTE]
> The same span structure works with Grafana Tempo, SigNoz, Datadog, or Honeycomb: only the exporter endpoint changes.

## Prerequisites

- Python 3.11 or 3.12
- Familiarity with HTTP APIs and basic OpenTelemetry concepts (traces, spans, exporters)
- A machine with at least 4 GB RAM (CPU-only mode is used throughout; swap in a CUDA device if available)
- Docker and Docker Compose (optional; needed only for the collector in "Connecting to a real collector". The runnable tutorial uses in-process exporters only)

## Setup

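Install the required packages. The core dependencies are `opentelemetry-sdk` for the tracing pipeline, `opentelemetry-exporter-otlp-proto-grpc` for forwarding to a collector, `fastapi` and `uvicorn` for the inference server, and `tiktoken` for token counting. `httpx` and `opentelemetry-instrumentation-fastapi` are installed for convenience (a Python test client and automatic HTTP spans, respectively); the listings below use `curl` and manual spans instead.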

```bash
uv pip install opentelemetry-sdk \
  opentelemetry-exporter-otlp-proto-grpc \
  opentelemetry-instrumentation-fastapi \
  fastapi \
  uvicorn \
  httpx \
  tiktoken
```

Verify the key packages are present:

```python
from importlib.metadata import version
for pkg in ["opentelemetry-sdk", "fastapi", "uvicorn", "httpx", "tiktoken"]:
    print(f"{pkg}: {version(pkg)}")
print("env_check_ok")
```

## Step 1: Design the tracing pipeline

The instrumentation strategy has three layers:

1. **A `TracerProvider`** configured with a `SimpleSpanProcessor` (synchronous, so spans flush before the test script exits) and a `ConsoleSpanExporter` for local verification.
2. **A request span** wrapping the full lifecycle of each `/v1/completions` call, with attributes for model name, prompt token count, completion token count, and estimated cost.
3. **Child spans** for the phases that matter most operationally: `prefill` (processing the prompt and populating the KV cache) and `decode` (autoregressive generation). The server in Step 3 adds a third child span for tokenization.

Write the tracing configuration module:

```python
# filename: otel_setup.py
import os
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import SimpleSpanProcessor, ConsoleSpanExporter
from opentelemetry.sdk.resources import Resource

# Optional: also export to an OTLP endpoint (e.g. local OTel Collector, SigNoz, Tempo)
# Set OTEL_EXPORTER_OTLP_ENDPOINT=http://localhost:4317 to enable.
OTLP_ENDPOINT = os.environ.get("OTEL_EXPORTER_OTLP_ENDPOINT", "")


def build_tracer_provider(service_name: str = "vllm-inference") -> TracerProvider:
    resource = Resource.create({
        "service.name": service_name,
        "service.version": "0.1.0",
        "deployment.environment": os.environ.get("DEPLOY_ENV", "local"),
    })
    provider = TracerProvider(resource=resource)

    # Always add the console exporter so spans are visible locally.
    provider.add_span_processor(SimpleSpanProcessor(ConsoleSpanExporter()))

    if OTLP_ENDPOINT:
        # Lazy import: only needed when an OTLP endpoint is configured.
        from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
        otlp_exporter = OTLPSpanExporter(endpoint=OTLP_ENDPOINT, insecure=True)
        provider.add_span_processor(SimpleSpanProcessor(otlp_exporter))
        print(f"[otel_setup] OTLP exporter configured -> {OTLP_ENDPOINT}")

    trace.set_tracer_provider(provider)
    return provider
```

## Step 2: Build the cost attribution helper

Cost attribution requires knowing the price per token for each model. Store a simple lookup table and expose a function that returns the estimated USD cost for a request. Real deployments extend this with GPU-hour amortization; this version uses public API pricing as a proxy so the span attribute has a meaningful value.

```python
# filename: cost_model.py
from typing import Optional

# Prices in USD per 1 000 tokens (input, output).
# Adjust these to reflect your actual GPU cost or the upstream API you're proxying.
PRICING: dict[str, tuple[float, float]] = {
    "gpt-3.5-turbo": (0.0005, 0.0015),
    "gpt-4o-mini": (0.00015, 0.0006),
    "mistral-7b": (0.00010, 0.00020),
    "llama-3-8b": (0.00010, 0.00020),
    "default": (0.00020, 0.00040),
}


def estimate_cost_usd(
    model: str,
    prompt_tokens: int,
    completion_tokens: int,
) -> float:
    input_price, output_price = PRICING.get(model, PRICING["default"])
    cost = (prompt_tokens / 1000) * input_price + (completion_tokens / 1000) * output_price
    return round(cost, 8)


def count_tokens(text: str, model: str = "gpt-3.5-turbo") -> int:
    """Approximate token count using tiktoken. Falls back to word-split estimate."""
    try:
        import tiktoken
        # tiktoken encoding names don't always match model names; use cl100k_base as default.
        enc = tiktoken.get_encoding("cl100k_base")
        return len(enc.encode(text))
    except Exception:
        return max(1, len(text.split()))
```

Quick sanity check:

```python
from cost_model import estimate_cost_usd, count_tokens

cost = estimate_cost_usd("mistral-7b", prompt_tokens=120, completion_tokens=40)
print(f"estimated cost: ${cost}")

tokens = count_tokens("The quick brown fox jumps over the lazy dog.")
print(f"token count: {tokens}")
print("cost_model_ok")
```
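The table above uses public API prices as a proxy. For self-hosted hardware, one rough way to derive per-token prices is to amortize the GPU's hourly cost over sustained throughput. The sketch below assumes a hypothetical hourly rate and throughput figures (illustrative numbers, not benchmarks), and the helper name `amortized_price_per_1k` is made up for this example.

```python
def amortized_price_per_1k(gpu_usd_per_hour: float, tokens_per_second: float) -> float:
    """USD per 1,000 tokens if the GPU sustains `tokens_per_second` for a full hour."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    tokens_per_hour = tokens_per_second * 3600
    return gpu_usd_per_hour / tokens_per_hour * 1000


# Illustrative assumptions only: a $2.00/hour GPU, prefill at 10,000 tokens/s,
# decode at 1,000 tokens/s aggregated across the batch.
GPU_USD_PER_HOUR = 2.00
input_price = amortized_price_per_1k(GPU_USD_PER_HOUR, tokens_per_second=10_000)
output_price = amortized_price_per_1k(GPU_USD_PER_HOUR, tokens_per_second=1_000)

# Same shape as a PRICING entry: (input, output) in USD per 1,000 tokens.
pricing_entry = (round(input_price, 8), round(output_price, 8))
print(pricing_entry)
```

This assumes the GPU is fully utilized for the whole hour; idle time raises the real cost per token, so a measured utilization factor would need to be folded in for billing-grade numbers.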

## Step 3: Implement the instrumented inference server

The server simulates the vLLM `/v1/completions` endpoint. In a real deployment you'd proxy to `vllm serve` or embed the vLLM `AsyncLLMEngine` directly. Here the "generation" is a deterministic stub so the tutorial runs without a GPU or model weights, but the span structure and attribute names are identical to what you'd attach to a real engine.

Each request creates a root span (`inference.request`) with three child spans:
- `inference.tokenize` — measures prompt tokenization time
- `inference.prefill` — simulates KV-cache prefill latency (scales with prompt length)
- `inference.decode` — simulates autoregressive decode latency (scales with output tokens)

```python
# filename: inference_server.py
import time
import uuid
import asyncio
from typing import Optional

from fastapi import FastAPI, HTTPException
from pydantic import BaseModel
from opentelemetry import trace

from otel_setup import build_tracer_provider
from cost_model import estimate_cost_usd, count_tokens

# Initialise the tracing pipeline once at import time.
build_tracer_provider(service_name="vllm-inference")
tracer = trace.get_tracer("vllm.inference", schema_url="https://opentelemetry.io/schemas/1.24.0")

app = FastAPI(title="vLLM-OTel Demo", version="0.1.0")


class CompletionRequest(BaseModel):
    model: str = "mistral-7b"
    prompt: str
    max_tokens: int = 64
    temperature: float = 0.7
    user: Optional[str] = None  # tenant / user ID for cost attribution


class CompletionResponse(BaseModel):
    id: str
    model: str
    prompt_tokens: int
    completion_tokens: int
    text: str
    estimated_cost_usd: float
    trace_id: str


def _simulate_prefill(prompt_tokens: int) -> float:
    """Simulate prefill: ~0.5 ms per token, capped at 200 ms."""
    latency = min(prompt_tokens * 0.0005, 0.2)
    time.sleep(latency)
    return latency


def _simulate_decode(max_tokens: int, temperature: float) -> tuple[str, int]:
    """Simulate decode: ~2 ms per token."""
    # Deterministic output length: use 60 % of max_tokens.
    completion_tokens = max(1, int(max_tokens * 0.6))
    time.sleep(completion_tokens * 0.002)
    text = " ".join(["token"] * completion_tokens)  # placeholder text
    return text, completion_tokens


@app.post("/v1/completions", response_model=CompletionResponse)
async def completions(req: CompletionRequest):
    request_id = str(uuid.uuid4())

    with tracer.start_as_current_span("inference.request") as root_span:
        # Attach high-cardinality request attributes.
        root_span.set_attribute("llm.model", req.model)
        root_span.set_attribute("llm.request_id", request_id)
        root_span.set_attribute("llm.max_tokens", req.max_tokens)
        root_span.set_attribute("llm.temperature", req.temperature)
        if req.user:
            root_span.set_attribute("llm.user", req.user)

        # --- Tokenize ---
        with tracer.start_as_current_span("inference.tokenize") as tok_span:
            prompt_tokens = count_tokens(req.prompt, req.model)
            tok_span.set_attribute("llm.prompt_tokens", prompt_tokens)
            tok_span.set_attribute("llm.prompt_length_chars", len(req.prompt))

        # --- Prefill ---
        with tracer.start_as_current_span("inference.prefill") as pre_span:
            prefill_latency = await asyncio.get_running_loop().run_in_executor(
                None, _simulate_prefill, prompt_tokens
            )
            pre_span.set_attribute("llm.prefill_latency_ms", round(prefill_latency * 1000, 2))
            pre_span.set_attribute("llm.prompt_tokens", prompt_tokens)

        # --- Decode ---
        with tracer.start_as_current_span("inference.decode") as dec_span:
            decode_start = time.perf_counter()
            text, completion_tokens = await asyncio.get_running_loop().run_in_executor(
                None, _simulate_decode, req.max_tokens, req.temperature
            )
            # Record measured wall-clock time rather than the nominal 2 ms/token.
            decode_latency_ms = (time.perf_counter() - decode_start) * 1000
            dec_span.set_attribute("llm.completion_tokens", completion_tokens)
            dec_span.set_attribute("llm.decode_latency_ms", round(decode_latency_ms, 2))

        # Cost attribution on the root span.
        cost = estimate_cost_usd(req.model, prompt_tokens, completion_tokens)
        root_span.set_attribute("llm.prompt_tokens", prompt_tokens)
        root_span.set_attribute("llm.completion_tokens", completion_tokens)
        root_span.set_attribute("llm.total_tokens", prompt_tokens + completion_tokens)
        root_span.set_attribute("llm.estimated_cost_usd", cost)

        # Expose the trace ID so callers can correlate logs.
        ctx = root_span.get_span_context()
        trace_id_hex = format(ctx.trace_id, "032x")
        root_span.set_attribute("llm.trace_id", trace_id_hex)

    return CompletionResponse(
        id=request_id,
        model=req.model,
        prompt_tokens=prompt_tokens,
        completion_tokens=completion_tokens,
        text=text,
        estimated_cost_usd=cost,
        trace_id=trace_id_hex,
    )


@app.get("/health")
async def health():
    return {"status": "ok"}
```
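The response exposes `trace_id` so callers can correlate logs. An OpenTelemetry trace ID is a 128-bit integer and a span ID is a 64-bit integer; `format(..., "032x")` and `format(..., "016x")` render them as the fixed-width lowercase hex that the W3C `traceparent` header also uses. The sketch below uses made-up ID values and a hypothetical helper name, `to_traceparent`.

```python
def to_traceparent(trace_id: int, span_id: int, sampled: bool = True) -> str:
    """Render IDs in W3C Trace Context form: version-traceid-spanid-flags."""
    flags = 0x01 if sampled else 0x00
    return f"00-{trace_id:032x}-{span_id:016x}-{flags:02x}"


# Made-up IDs for illustration; real ones come from span.get_span_context().
example_trace_id = 0x4BF92F3577B34DA6A3CE929D0E0E4736
example_span_id = 0x00F067AA0BA902B7

header = to_traceparent(example_trace_id, example_span_id)
print(header)
# 00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01

# Zero-padding matters: small integers still render at full width.
padded = format(255, "032x")
print(len(padded), padded)
```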

## Step 4: Start the server and send test requests

Launch the server as a background process using `nohup` so it survives the shell block exit:

```bash
nohup uvicorn inference_server:app --host 0.0.0.0 --port 8080 > /tmp/uvicorn.log 2>&1 & disown
sleep 3
curl -sf http://localhost:8080/health || (echo "server failed to start" >&2; cat /tmp/uvicorn.log; exit 1)
echo "server_started_ok"
```

Send a completion request and capture the response:

```bash
curl -s -X POST http://localhost:8080/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "mistral-7b", "prompt": "Explain PagedAttention in two sentences.", "max_tokens": 80, "user": "tenant-acme"}' \
  | python3 -c "import sys, json; d=json.load(sys.stdin); print(json.dumps(d, indent=2))"
echo "request_complete"
```

## Step 5: Parse and verify span attributes

The console exporter writes JSON span records to stdout, which the server launch in Step 4 redirects to `/tmp/uvicorn.log`. Rather than parse that log, this verification script replays the same span logic in-process against an isolated `TracerProvider` with an `InMemorySpanExporter`, then asserts that the expected spans and key attributes are present.

```python
# filename: verify_spans.py
import json

from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import SimpleSpanProcessor, ConsoleSpanExporter
from opentelemetry.sdk.trace.export.in_memory_span_exporter import InMemorySpanExporter
from opentelemetry.sdk.resources import Resource

from cost_model import estimate_cost_usd, count_tokens

# Build an isolated provider with an in-memory exporter for assertion.
mem_exporter = InMemorySpanExporter()
resource = Resource.create({"service.name": "vllm-verify"})
provider = TracerProvider(resource=resource)
provider.add_span_processor(SimpleSpanProcessor(mem_exporter))
provider.add_span_processor(SimpleSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)

tracer = trace.get_tracer("vllm.verify")

# Replay a minimal request through the span logic directly.
prompt = "What is PagedAttention?"
model = "mistral-7b"
max_tokens = 32

with tracer.start_as_current_span("inference.request") as root:
    root.set_attribute("llm.model", model)
    prompt_tokens = count_tokens(prompt, model)
    completion_tokens = max(1, int(max_tokens * 0.6))
    cost = estimate_cost_usd(model, prompt_tokens, completion_tokens)
    root.set_attribute("llm.prompt_tokens", prompt_tokens)
    root.set_attribute("llm.completion_tokens", completion_tokens)
    root.set_attribute("llm.total_tokens", prompt_tokens + completion_tokens)
    root.set_attribute("llm.estimated_cost_usd", cost)

    with tracer.start_as_current_span("inference.prefill") as pre:
        pre.set_attribute("llm.prompt_tokens", prompt_tokens)

    with tracer.start_as_current_span("inference.decode") as dec:
        dec.set_attribute("llm.completion_tokens", completion_tokens)

# Spans are flushed synchronously by SimpleSpanProcessor.
spans = mem_exporter.get_finished_spans()
span_names = [s.name for s in spans]
print(f"Captured {len(spans)} spans: {span_names}")

assert "inference.request" in span_names, "Missing root span"
assert "inference.prefill" in span_names, "Missing prefill span"
assert "inference.decode" in span_names, "Missing decode span"

root_span = next(s for s in spans if s.name == "inference.request")
attrs = dict(root_span.attributes)
print(f"Root span attributes: {json.dumps(attrs, indent=2)}")

assert attrs["llm.model"] == model
assert attrs["llm.prompt_tokens"] > 0
assert attrs["llm.completion_tokens"] > 0
assert attrs["llm.estimated_cost_usd"] > 0.0

print(f"Cost for this request: ${attrs['llm.estimated_cost_usd']:.8f}")
print("span_assertions_passed")
```

```python
import runpy

# Run the verification script from the directory that contains it.
runpy.run_path("verify_spans.py", run_name="__main__")
```

## Step 6: Write spans to a file for offline analysis

For production use you'll forward spans to a collector. As a lightweight alternative that works without Docker, write spans to a newline-delimited JSON file using a custom exporter. This gives you an audit log you can grep, ship to S3, or ingest into any log aggregator.

```python
# filename: file_exporter.py
import json
import os
from typing import Sequence
from opentelemetry.sdk.trace.export import SpanExporter, SpanExportResult
from opentelemetry.sdk.trace import ReadableSpan


class NdJsonFileExporter(SpanExporter):
    """Appends one JSON object per span to a newline-delimited file."""

    def __init__(self, path: str = "/tmp/spans.ndjson"):
        self.path = path

    def export(self, spans: Sequence[ReadableSpan]) -> SpanExportResult:
        try:
            with open(self.path, "a") as f:
                for span in spans:
                    ctx = span.get_span_context()
                    record = {
                        "trace_id": format(ctx.trace_id, "032x"),
                        "span_id": format(ctx.span_id, "016x"),
                        "name": span.name,
                        "start_time_ns": span.start_time,
                        "end_time_ns": span.end_time,
                        "duration_ms": round(
                            (span.end_time - span.start_time) / 1_000_000, 3
                        ) if span.end_time and span.start_time else None,
                        "attributes": dict(span.attributes or {}),
                        "status": span.status.status_code.name,
                        "service": span.resource.attributes.get("service.name", ""),
                    }
                    f.write(json.dumps(record) + "\n")
            return SpanExportResult.SUCCESS
        except Exception as exc:
            print(f"[NdJsonFileExporter] export failed: {exc}")
            return SpanExportResult.FAILURE

    def shutdown(self):
        pass
```

Wire the file exporter into a standalone test:

```python
import json
import os
from opentelemetry import trace as otel_trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import SimpleSpanProcessor
from opentelemetry.sdk.resources import Resource

from file_exporter import NdJsonFileExporter
from cost_model import estimate_cost_usd, count_tokens

SPAN_FILE = "/tmp/spans.ndjson"
if os.path.exists(SPAN_FILE):
    os.remove(SPAN_FILE)

resource = Resource.create({"service.name": "vllm-file-test"})
provider = TracerProvider(resource=resource)
provider.add_span_processor(SimpleSpanProcessor(NdJsonFileExporter(SPAN_FILE)))
otel_trace.set_tracer_provider(provider)

tracer = otel_trace.get_tracer("vllm.file_test")

requests = [
    {"prompt": "Summarise the Transformer architecture.", "model": "mistral-7b", "max_tokens": 60, "user": "tenant-a"},
    {"prompt": "What is RLHF?", "model": "llama-3-8b", "max_tokens": 40, "user": "tenant-b"},
    {"prompt": "Describe vLLM PagedAttention.", "model": "mistral-7b", "max_tokens": 80, "user": "tenant-a"},
]

for req in requests:
    with tracer.start_as_current_span("inference.request") as span:
        pt = count_tokens(req["prompt"], req["model"])
        ct = max(1, int(req["max_tokens"] * 0.6))
        cost = estimate_cost_usd(req["model"], pt, ct)
        span.set_attribute("llm.model", req["model"])
        span.set_attribute("llm.user", req["user"])
        span.set_attribute("llm.prompt_tokens", pt)
        span.set_attribute("llm.completion_tokens", ct)
        span.set_attribute("llm.total_tokens", pt + ct)
        span.set_attribute("llm.estimated_cost_usd", cost)

print(f"Spans written to {SPAN_FILE}")
with open(SPAN_FILE) as f:
    records = [json.loads(line) for line in f]

print(f"Total spans: {len(records)}")
total_cost = sum(r["attributes"].get("llm.estimated_cost_usd", 0) for r in records)
print(f"Total estimated cost across {len(records)} requests: ${total_cost:.8f}")

by_user: dict[str, float] = {}
for r in records:
    user = r["attributes"].get("llm.user", "unknown")
    by_user[user] = by_user.get(user, 0) + r["attributes"].get("llm.estimated_cost_usd", 0)

print("Cost by tenant:")
for user, cost_sum in sorted(by_user.items()):
    print(f"  {user}: ${cost_sum:.8f}")

print("file_exporter_ok")
```
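Because each line is an independent JSON object, offline analysis needs only the standard library. The sketch below writes a few made-up records, shaped like `NdJsonFileExporter` output, to a temporary file and then summarizes durations by span name. The record values, the placeholder trace IDs, and the helper name `summarize_durations` are illustrative assumptions.

```python
import json
import os
import tempfile
from collections import defaultdict


def summarize_durations(path: str) -> dict[str, dict[str, float]]:
    """Return count, mean and max duration_ms per span name."""
    durations: dict[str, list[float]] = defaultdict(list)
    with open(path) as f:
        for line in f:
            if not line.strip():
                continue  # tolerate blank lines
            record = json.loads(line)
            if record.get("duration_ms") is not None:
                durations[record["name"]].append(record["duration_ms"])
    return {
        name: {
            "count": len(values),
            "mean_ms": sum(values) / len(values),
            "max_ms": max(values),
        }
        for name, values in durations.items()
    }


# Made-up sample records, shaped like the exporter's output.
sample = [
    {"trace_id": "a" * 32, "name": "inference.prefill", "duration_ms": 4.0},
    {"trace_id": "a" * 32, "name": "inference.decode", "duration_ms": 96.0},
    {"trace_id": "b" * 32, "name": "inference.prefill", "duration_ms": 6.0},
    {"trace_id": "b" * 32, "name": "inference.decode", "duration_ms": 64.0},
]

with tempfile.NamedTemporaryFile("w", suffix=".ndjson", delete=False) as tmp:
    for record in sample:
        tmp.write(json.dumps(record) + "\n")

summary = summarize_durations(tmp.name)
os.remove(tmp.name)

for name, stats in sorted(summary.items()):
    print(name, stats)
```

Pointing `summarize_durations` at `/tmp/spans.ndjson` would apply the same summary to the spans written by the test above.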

## Verify it works

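Run a final check that confirms the live server still responds and that the span file written in Step 6 contains records. The server itself only uses the console exporter (plus OTLP when configured), so the file check reads the spans produced by the standalone Step 6 script: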

```bash
# Confirm the server is still running.
curl -sf http://localhost:8080/health && echo "health_ok"

# Send one more request with a different tenant.
curl -s -X POST http://localhost:8080/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "llama-3-8b", "prompt": "What is continuous batching?", "max_tokens": 50, "user": "tenant-x"}' \
  | python3 -c "import sys,json; d=json.load(sys.stdin); print('trace_id:', d['trace_id'])"

# Confirm span file has records.
python3 -c "
import json
with open('/tmp/spans.ndjson') as f:
    lines = f.readlines()
print(f'span_file_lines: {len(lines)}')
first = json.loads(lines[0])
print('first_span_name:', first['name'])
print('first_span_model:', first['attributes'].get('llm.model', 'n/a'))
print('verify_complete')
"
```

## Connecting to a real collector (optional)

To forward spans to a local OpenTelemetry Collector, SigNoz, or Grafana Tempo, set the environment variable before starting the server:

```bash
export OTEL_EXPORTER_OTLP_ENDPOINT=http://localhost:4317
# Then restart: uvicorn inference_server:app --host 0.0.0.0 --port 8080
echo "otlp_env_set"
```

The `build_tracer_provider` function in `otel_setup.py` detects this variable and adds an `OTLPSpanExporter` alongside the console exporter. SigNoz and Grafana Tempo both accept OTLP gRPC on port 4317 by default, so `http://localhost:4317` works for either. For Datadog or Honeycomb, replace the endpoint and add the appropriate auth header via `OTEL_EXPORTER_OTLP_HEADERS`. Note that `otel_setup.py` passes `insecure=True`, which suits a local plaintext collector but has to change for a TLS endpoint.
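Since `otel_setup.py` passes `insecure=True` unconditionally, a hosted endpoint that requires TLS would need that flag turned off. One possible approach is to derive it from the endpoint's URL scheme. The helper name `otlp_exporter_kwargs` is hypothetical, and `otlp.example.com` is a placeholder host, not a real backend.

```python
from urllib.parse import urlparse


def otlp_exporter_kwargs(endpoint: str) -> dict:
    """Hypothetical helper: choose `insecure` from the endpoint's URL scheme."""
    scheme = urlparse(endpoint).scheme.lower()
    if scheme not in ("http", "https"):
        raise ValueError(f"expected an http:// or https:// endpoint, got {endpoint!r}")
    # Plaintext gRPC for http://, TLS for https://.
    return {"endpoint": endpoint, "insecure": scheme == "http"}


local_kwargs = otlp_exporter_kwargs("http://localhost:4317")
hosted_kwargs = otlp_exporter_kwargs("https://otlp.example.com:443")
print(local_kwargs)
print(hosted_kwargs)
```

In `otel_setup.py` the result could be passed as `OTLPSpanExporter(**otlp_exporter_kwargs(OTLP_ENDPOINT))`.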

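A minimal `docker-compose.yml` for a local OTel Collector looks like this. It is an illustration only: it requires Docker, is not part of the runnable steps, and expects you to supply an `otel-collector-config.yaml` (for example, one with an OTLP receiver and a file exporter) next to it: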

```yaml
# filename: docker-compose.yml (illustration only — requires Docker)
version: "3.9"
services:
  otel-collector:
    image: otel/opentelemetry-collector-contrib:0.100.0
    ports:
      - "4317:4317"   # gRPC OTLP
      - "4318:4318"   # HTTP OTLP
    volumes:
      - ./otel-collector-config.yaml:/etc/otelcol-contrib/config.yaml
```

## Troubleshooting

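**No spans appear in the console.** The most common cause is using `BatchSpanProcessor` instead of `SimpleSpanProcessor`. The batch processor queues spans and exports them from a background thread, by default every 5 seconds or once 512 spans are queued, so output lags the request and can be lost if the process is killed before the queue drains. Use `SimpleSpanProcessor` for scripts and tests, where immediate output matters. For long-running servers, `BatchSpanProcessor` keeps export off the request path, and calling `provider.force_flush()` or `provider.shutdown()` on shutdown drains the queue.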

**`ModuleNotFoundError: No module named 'opentelemetry'`** after install. This happens when the install ran in a different virtual environment than the Python interpreter being used. Confirm with `python3 -c "import opentelemetry; print('ok')"`. If it fails, re-run `uv pip install opentelemetry-sdk` in the same shell session.

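**`tiktoken` raises a network error on first use.** `tiktoken` downloads the `cl100k_base` encoding on first call. Note that `count_tokens` catches the error and silently falls back to a word-split estimate, so in this tutorial the symptom is a different token count rather than a traceback. In air-gapped environments, set `TIKTOKEN_CACHE_DIR` to a known directory, pre-download with `python3 -c "import tiktoken; tiktoken.get_encoding('cl100k_base')"` on a machine with internet access, and copy that directory to the target host with the same variable set.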

**OTLP exporter raises `StatusCode.UNAVAILABLE`.** The collector isn't running or the endpoint is wrong. Check `OTEL_EXPORTER_OTLP_ENDPOINT` and confirm the collector is listening with `nc -zv localhost 4317`. The `NdJsonFileExporter` is a zero-dependency fallback that works without any collector.

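**Cost attribution shows an unexpected value for a model.** `estimate_cost_usd` never fails on an unknown model: a name that doesn't match any key in `PRICING` silently falls back to the `"default"` entry, so a typo or a different naming scheme produces a plausible but wrong cost. Check the `llm.model` attribute on the root span to confirm the exact string being used, then add that key to the dict. A cost of exactly `0.0` points to a zero price in the table.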

**Server port 8080 already in use.** Kill the existing process with `lsof -ti:8080 | xargs kill -9` or change the port in the `uvicorn` launch command and update the `curl` URLs accordingly.

## Next steps

- **Attach spans to real vLLM engine calls.** Replace `_simulate_prefill` and `_simulate_decode` with calls to `vllm.AsyncLLMEngine`. Wrap `engine.generate()` in the same `inference.prefill` / `inference.decode` span pattern; it is an async generator that yields incremental outputs, so you can record first-token latency (TTFT) as a span event.
- **Add a Prometheus metrics bridge.** Use `opentelemetry-exporter-prometheus` to expose `llm_total_tokens_total` and `llm_estimated_cost_usd_total` as Prometheus counters, then scrape them from Grafana for a cost dashboard alongside your trace data.
- **Implement per-tenant rate limiting.** The `llm.user` span attribute is already set. Feed it into a Redis-backed token bucket (one key per tenant) checked at the start of the `inference.request` span; set a span event `rate_limit.applied` when the bucket is exhausted.
- **Export to SigNoz or Grafana Tempo.** Both accept standard OTLP gRPC on port 4317. Start the stack with `docker compose up -d`, set `OTEL_EXPORTER_OTLP_ENDPOINT=http://localhost:4317`, and the same span structure explored here appears in the UI with zero schema changes.
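The TTFT idea from the first bullet can be sketched without a real engine. The example below uses `fake_token_stream`, a hypothetical stand-in for an engine's streaming output, and `consume_with_ttft`, a hypothetical helper that times the first yielded item. The delays are arbitrary illustrative values.

```python
import asyncio
import time
from typing import AsyncIterator


async def fake_token_stream(n_tokens: int, first_delay: float, step_delay: float) -> AsyncIterator[str]:
    """Hypothetical stand-in for an engine's streaming output."""
    await asyncio.sleep(first_delay)  # models queue wait plus prefill
    for i in range(n_tokens):
        if i:
            await asyncio.sleep(step_delay)  # models per-token decode time
        yield f"tok{i}"


async def consume_with_ttft(stream: AsyncIterator[str]) -> dict:
    """Consume a stream, recording time to first token and total time."""
    start = time.perf_counter()
    ttft_ms = None
    tokens = []
    async for token in stream:
        if ttft_ms is None:
            ttft_ms = (time.perf_counter() - start) * 1000
            # In the server, a span event could be recorded here, e.g.
            # span.add_event("first_token", {"llm.ttft_ms": ttft_ms})
        tokens.append(token)
    total_ms = (time.perf_counter() - start) * 1000
    return {"tokens": len(tokens), "ttft_ms": ttft_ms, "total_ms": total_ms}


result = asyncio.run(
    consume_with_ttft(fake_token_stream(5, first_delay=0.05, step_delay=0.01))
)
print(result)
```

Recording TTFT as an event on the existing `inference.request` span, rather than as a separate span, keeps the span count per request fixed while still placing the first-token moment on the trace timeline.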

## FAQ

### How does OpenTelemetry tracing help with vLLM inference debugging?

OpenTelemetry spans capture per-request latency, token counts, and cost attribution in structured traces. Each inference call creates a root span with child spans for prefill and decode phases, allowing operators to see exactly which phase caused a latency spike or which tenant is consuming GPU budget.

### What span attributes does the instrumented server attach to each request?

The server attaches model name, request ID, prompt and completion token counts, total tokens, estimated cost in USD, and per-phase latencies. Child spans for tokenize, prefill, and decode phases carry their own attributes like prompt token count and prefill latency in milliseconds.

### Can the same span structure work with different tracing backends?

Yes. The span structure is backend-agnostic; only the exporter endpoint changes. The tutorial uses a local console exporter and file exporter, but the same code works with Grafana Tempo, SigNoz, Datadog, or Honeycomb by setting the OTEL_EXPORTER_OTLP_ENDPOINT environment variable.

### How does cost attribution work in the traced spans?

The cost_model module stores pricing per token for each model and calculates estimated USD cost based on prompt and completion token counts. This cost is attached as the llm.estimated_cost_usd attribute on the root span, enabling per-tenant cost tracking by grouping spans by the llm.user attribute.

### What is the difference between SimpleSpanProcessor and BatchSpanProcessor?

SimpleSpanProcessor exports each span synchronously as it ends, making it suitable for short-lived scripts and tests. BatchSpanProcessor queues spans and exports them in the background on a timer or when the batch fills, so spans can be lost if a process is killed before the queue drains; it is preferred for long-running servers, where you call provider.force_flush() on shutdown.

## References

1. https://github.com/vllm-project/vllm
