Self-Hosted vLLM with OpenTelemetry Tracing for Production Inference

Why this matters

vLLM [1] has become the default inference engine for teams running open-weight models in production. Its PagedAttention memory management and continuous batching deliver throughput that naive HuggingFace serving can’t match. But throughput numbers alone don’t tell you why a specific request took 4 seconds instead of 400 milliseconds, which tenant is burning your GPU budget, or whether a latency spike came from the prefill phase or the decode phase.

Without structured tracing, operators are left correlating log lines by timestamp and guessing. OpenTelemetry spans solve this: each request becomes a trace with child spans for phases such as tokenization, prefill, and decode, carrying attributes like llm.model, llm.prompt_tokens, llm.completion_tokens, and a derived llm.estimated_cost_usd. When something goes wrong at 3 AM, you open the trace and see which phase was slow and for which tenant.

This tutorial wires OpenTelemetry instrumentation directly into a vLLM-compatible HTTP server, exports spans through in-process console and file exporters (with an optional OTLP hop to a local OTel Collector), and shows you how to query and verify the span data. The same span structure works with Grafana Tempo, SigNoz, Datadog, or Honeycomb: only the exporter endpoint changes.

Prerequisites

Python 3.11 or 3.12
Familiarity with HTTP APIs and basic OpenTelemetry concepts (traces, spans, exporters)
A machine with at least 4 GB RAM (CPU-only mode is used throughout; swap in a CUDA device if available)
Docker and Docker Compose (optional; only needed for the collector section near the end, since the runnable tutorial uses in-process exporters only)

Setup

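Install the required packages. The core dependencies are opentelemetry-sdk for the tracing pipeline, opentelemetry-exporter-otlp-proto-grpc for forwarding to a collector, fastapi and uvicorn for the inference server, and tiktoken for token counting. The command also installs httpx and opentelemetry-instrumentation-fastapi; the steps below send requests with curl and create spans manually, so neither is imported.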

uv pip install opentelemetry-sdk \
  opentelemetry-exporter-otlp-proto-grpc \
  opentelemetry-instrumentation-fastapi \
  fastapi \
  uvicorn \
  httpx \
  tiktoken

Verify the key packages are present:

from importlib.metadata import version
for pkg in ["opentelemetry-sdk", "fastapi", "uvicorn", "httpx", "tiktoken"]:
    print(f"{pkg}: {version(pkg)}")
print("env_check_ok")

Step 1: Design the tracing pipeline

The instrumentation strategy has three layers:

A TracerProvider configured with a SimpleSpanProcessor (synchronous, so spans flush before the test script exits) and a ConsoleSpanExporter for local verification.
A request span wrapping the full lifecycle of each /v1/completions call, with attributes for model name, prompt token count, completion token count, and estimated cost.
Child spans for the two phases that matter most operationally: prefill (processing the prompt to populate the KV cache) and decode (autoregressive generation).

Write the tracing configuration module:

# filename: otel_setup.py
import os
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import SimpleSpanProcessor, ConsoleSpanExporter
from opentelemetry.sdk.resources import Resource

# Optional: also export to an OTLP endpoint (e.g. local OTel Collector, SigNoz, Tempo)
# Set OTEL_EXPORTER_OTLP_ENDPOINT=http://localhost:4317 to enable.
OTLP_ENDPOINT = os.environ.get("OTEL_EXPORTER_OTLP_ENDPOINT", "")


def build_tracer_provider(service_name: str = "vllm-inference") -> TracerProvider:
    resource = Resource.create({
        "service.name": service_name,
        "service.version": "0.1.0",
        "deployment.environment": os.environ.get("DEPLOY_ENV", "local"),
    })
    provider = TracerProvider(resource=resource)

    # Always add the console exporter so spans are visible in the sandbox.
    provider.add_span_processor(SimpleSpanProcessor(ConsoleSpanExporter()))

    if OTLP_ENDPOINT:
        # Lazy import: only needed when an OTLP endpoint is configured.
        from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
        otlp_exporter = OTLPSpanExporter(endpoint=OTLP_ENDPOINT, insecure=True)
        provider.add_span_processor(SimpleSpanProcessor(otlp_exporter))
        print(f"[otel_setup] OTLP exporter configured -> {OTLP_ENDPOINT}")

    trace.set_tracer_provider(provider)
    return provider

Step 2: Build the cost attribution helper

Cost attribution requires knowing the price per token for each model. Store a simple lookup table and expose a function that returns the estimated USD cost for a request. Real deployments extend this with GPU-hour amortization; this version uses public API pricing as a proxy so the span attribute has a meaningful value.

# filename: cost_model.py

# Prices in USD per 1 000 tokens (input, output).
# Adjust these to reflect your actual GPU cost or the upstream API you're proxying.
PRICING: dict[str, tuple[float, float]] = {
    "gpt-3.5-turbo": (0.0005, 0.0015),
    "gpt-4o-mini": (0.00015, 0.0006),
    "mistral-7b": (0.00010, 0.00020),
    "llama-3-8b": (0.00010, 0.00020),
    "default": (0.00020, 0.00040),
}


def estimate_cost_usd(
    model: str,
    prompt_tokens: int,
    completion_tokens: int,
) -> float:
    input_price, output_price = PRICING.get(model, PRICING["default"])
    cost = (prompt_tokens / 1000) * input_price + (completion_tokens / 1000) * output_price
    return round(cost, 8)


def count_tokens(text: str, model: str = "gpt-3.5-turbo") -> int:
    """Approximate token count using tiktoken. Falls back to word-split estimate."""
    try:
        import tiktoken
        # tiktoken encoding names don't always match model names; use cl100k_base as default.
        enc = tiktoken.get_encoding("cl100k_base")
        return len(enc.encode(text))
    except Exception:
        return max(1, len(text.split()))

Quick sanity check:

from cost_model import estimate_cost_usd, count_tokens

cost = estimate_cost_usd("mistral-7b", prompt_tokens=120, completion_tokens=40)
print(f"estimated cost: ${cost}")

tokens = count_tokens("The quick brown fox jumps over the lazy dog.")
print(f"token count: {tokens}")
print("cost_model_ok")
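The GPU-hour amortization mentioned above comes down to a unit conversion from dollars per GPU-hour to dollars per 1 000 tokens, the unit PRICING uses. The following is a minimal sketch of that conversion; the function name, the hourly rate, the throughput, and the utilization figure are all hypothetical placeholders, not measurements.

```python
def price_per_1k_tokens(
    gpu_hourly_usd: float,
    tokens_per_second: float,
    utilization: float = 1.0,
) -> float:
    """USD per 1 000 tokens for a GPU billed by the hour.

    utilization is the fraction of each hour spent serving traffic (0 < u <= 1).
    """
    if gpu_hourly_usd < 0 or tokens_per_second <= 0 or not 0 < utilization <= 1:
        raise ValueError("expected non-negative cost, positive throughput, 0 < utilization <= 1")
    tokens_per_hour = tokens_per_second * 3600 * utilization
    return gpu_hourly_usd / tokens_per_hour * 1000


# Hypothetical inputs: a $2.00/hour GPU sustaining 1 500 tokens/s, busy half the time.
rate = price_per_1k_tokens(gpu_hourly_usd=2.00, tokens_per_second=1500, utilization=0.5)
print(f"{rate:.6f} USD per 1k tokens")
```

A value computed this way could replace the tuple for a model in PRICING, using separate throughput figures for prefill and decode if you have them.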

Step 3: Implement the instrumented inference server

The server simulates the vLLM /v1/completions endpoint. In a real deployment you’d proxy to vllm serve or embed the vLLM AsyncLLMEngine directly. Here the “generation” is a deterministic stub so the tutorial runs without a GPU or model weights, but the span structure and attribute names are identical to what you’d attach to a real engine.

Each request creates a root span (inference.request) with three child spans:

inference.tokenize — measures prompt tokenization time
inference.prefill — simulates KV-cache prefill latency (scales with prompt length)
inference.decode — simulates autoregressive decode latency (scales with output tokens)

# filename: inference_server.py
import time
import uuid
import asyncio
from typing import Optional

from fastapi import FastAPI, HTTPException
from pydantic import BaseModel
from opentelemetry import trace

from otel_setup import build_tracer_provider
from cost_model import estimate_cost_usd, count_tokens

# Initialise the tracing pipeline once at import time.
build_tracer_provider(service_name="vllm-inference")
tracer = trace.get_tracer("vllm.inference", schema_url="https://opentelemetry.io/schemas/1.24.0")

app = FastAPI(title="vLLM-OTel Demo", version="0.1.0")


class CompletionRequest(BaseModel):
    model: str = "mistral-7b"
    prompt: str
    max_tokens: int = 64
    temperature: float = 0.7
    user: Optional[str] = None  # tenant / user ID for cost attribution


class CompletionResponse(BaseModel):
    id: str
    model: str
    prompt_tokens: int
    completion_tokens: int
    text: str
    estimated_cost_usd: float
    trace_id: str


def _simulate_prefill(prompt_tokens: int) -> float:
    """Simulate prefill: ~0.5 ms per token, capped at 200 ms."""
    latency = min(prompt_tokens * 0.0005, 0.2)
    time.sleep(latency)
    return latency


def _simulate_decode(max_tokens: int, temperature: float) -> tuple[str, int]:
    """Simulate decode: ~2 ms per token."""
    # Deterministic output length: use 60 % of max_tokens.
    completion_tokens = max(1, int(max_tokens * 0.6))
    time.sleep(completion_tokens * 0.002)
    text = " ".join(["token"] * completion_tokens)  # placeholder text
    return text, completion_tokens


@app.post("/v1/completions", response_model=CompletionResponse)
async def completions(req: CompletionRequest):
    request_id = str(uuid.uuid4())

    with tracer.start_as_current_span("inference.request") as root_span:
        # Attach high-cardinality request attributes.
        root_span.set_attribute("llm.model", req.model)
        root_span.set_attribute("llm.request_id", request_id)
        root_span.set_attribute("llm.max_tokens", req.max_tokens)
        root_span.set_attribute("llm.temperature", req.temperature)
        if req.user:
            root_span.set_attribute("llm.user", req.user)

        # --- Tokenize ---
        with tracer.start_as_current_span("inference.tokenize") as tok_span:
            prompt_tokens = count_tokens(req.prompt, req.model)
            tok_span.set_attribute("llm.prompt_tokens", prompt_tokens)
            tok_span.set_attribute("llm.prompt_length_chars", len(req.prompt))

        # --- Prefill ---
        with tracer.start_as_current_span("inference.prefill") as pre_span:
            # Run the blocking stub in a worker thread so the event loop stays responsive.
            prefill_latency = await asyncio.to_thread(_simulate_prefill, prompt_tokens)
            pre_span.set_attribute("llm.prefill_latency_ms", round(prefill_latency * 1000, 2))
            pre_span.set_attribute("llm.prompt_tokens", prompt_tokens)

        # --- Decode ---
        with tracer.start_as_current_span("inference.decode") as dec_span:
            decode_start = time.perf_counter()
            text, completion_tokens = await asyncio.to_thread(
                _simulate_decode, req.max_tokens, req.temperature
            )
            # Record the measured wall-clock latency rather than a derived constant.
            decode_latency_ms = (time.perf_counter() - decode_start) * 1000
            dec_span.set_attribute("llm.completion_tokens", completion_tokens)
            dec_span.set_attribute("llm.decode_latency_ms", round(decode_latency_ms, 2))

        # Cost attribution on the root span.
        cost = estimate_cost_usd(req.model, prompt_tokens, completion_tokens)
        root_span.set_attribute("llm.prompt_tokens", prompt_tokens)
        root_span.set_attribute("llm.completion_tokens", completion_tokens)
        root_span.set_attribute("llm.total_tokens", prompt_tokens + completion_tokens)
        root_span.set_attribute("llm.estimated_cost_usd", cost)

        # Expose the trace ID so callers can correlate logs.
        ctx = root_span.get_span_context()
        trace_id_hex = format(ctx.trace_id, "032x")
        root_span.set_attribute("llm.trace_id", trace_id_hex)

    return CompletionResponse(
        id=request_id,
        model=req.model,
        prompt_tokens=prompt_tokens,
        completion_tokens=completion_tokens,
        text=text,
        estimated_cost_usd=cost,
        trace_id=trace_id_hex,
    )


@app.get("/health")
async def health():
    return {"status": "ok"}
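Because the stub's sleeps are deterministic, you can predict roughly what the prefill and decode spans should show before opening a trace. The sketch below mirrors the constants in _simulate_prefill and _simulate_decode; the helper name expected_phase_latency_ms is illustrative and not part of the server. Real span durations will run somewhat longer because of thread hand-off and scheduling overhead.

```python
def expected_phase_latency_ms(prompt_tokens: int, max_tokens: int) -> dict[str, float]:
    """Nominal sleep time per phase for the stub server, in milliseconds."""
    # Prefill: 0.5 ms per prompt token, capped at 200 ms.
    prefill_ms = min(prompt_tokens * 0.5, 200.0)
    # Decode: the stub emits 60 % of max_tokens (at least 1) at 2 ms per token.
    completion_tokens = max(1, int(max_tokens * 0.6))
    decode_ms = completion_tokens * 2.0
    return {
        "prefill_ms": prefill_ms,
        "decode_ms": decode_ms,
        "total_ms": prefill_ms + decode_ms,
    }


budget = expected_phase_latency_ms(prompt_tokens=10, max_tokens=80)
print(budget)
```

If a decode span is far above its nominal value, the extra time came from somewhere other than the stub, which is the kind of question the per-phase spans are meant to answer.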

Step 4: Start the server and send test requests

Launch the server as a background process using nohup so it survives the shell block exit:

nohup uvicorn inference_server:app --host 0.0.0.0 --port 8080 > /tmp/uvicorn.log 2>&1 & disown
sleep 3
curl -sf http://localhost:8080/health || (echo "server failed to start" >&2; cat /tmp/uvicorn.log; exit 1)
echo "server_started_ok"

Send a completion request and capture the response:

curl -s -X POST http://localhost:8080/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "mistral-7b", "prompt": "Explain PagedAttention in two sentences.", "max_tokens": 80, "user": "tenant-acme"}' \
  | python3 -c "import sys, json; d=json.load(sys.stdin); print(json.dumps(d, indent=2))"
echo "request_complete"
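The trace_id field in the response is the span context's 128-bit integer rendered with format(ctx.trace_id, "032x"), so it should always be 32 lowercase hex characters, zero-padded on the left. A client might sanity-check the value before using it to search a tracing backend. This is a minimal sketch; the helper name is_valid_trace_id is hypothetical.

```python
import re

_TRACE_ID_RE = re.compile(r"[0-9a-f]{32}")


def is_valid_trace_id(value: str) -> bool:
    """True for 32 lowercase hex characters that are not all zeros.

    An all-zero trace ID is the invalid/absent value in OpenTelemetry.
    """
    return bool(_TRACE_ID_RE.fullmatch(value)) and int(value, 16) != 0


# The same formatting the server uses, applied to a small integer to show the padding.
example = format(0xABC123, "032x")
print(example, is_valid_trace_id(example))
```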

Step 5: Parse and verify span attributes

The console exporter writes JSON span records to stdout, which the Step 4 launch command redirects to /tmp/uvicorn.log. Rather than parse that log, this verification script replays the same span logic in-process against an InMemorySpanExporter and asserts that the key attributes are present.

# filename: verify_spans.py
import json

from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import SimpleSpanProcessor, ConsoleSpanExporter
from opentelemetry.sdk.trace.export.in_memory_span_exporter import InMemorySpanExporter
from opentelemetry.sdk.resources import Resource

from cost_model import estimate_cost_usd, count_tokens

# Build an isolated provider with an in-memory exporter for assertion.
mem_exporter = InMemorySpanExporter()
resource = Resource.create({"service.name": "vllm-verify"})
provider = TracerProvider(resource=resource)
provider.add_span_processor(SimpleSpanProcessor(mem_exporter))
provider.add_span_processor(SimpleSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)

tracer = trace.get_tracer("vllm.verify")

# Replay a minimal request through the span logic directly.
prompt = "What is PagedAttention?"
model = "mistral-7b"
max_tokens = 32

with tracer.start_as_current_span("inference.request") as root:
    root.set_attribute("llm.model", model)
    prompt_tokens = count_tokens(prompt, model)
    completion_tokens = max(1, int(max_tokens * 0.6))
    cost = estimate_cost_usd(model, prompt_tokens, completion_tokens)
    root.set_attribute("llm.prompt_tokens", prompt_tokens)
    root.set_attribute("llm.completion_tokens", completion_tokens)
    root.set_attribute("llm.total_tokens", prompt_tokens + completion_tokens)
    root.set_attribute("llm.estimated_cost_usd", cost)

    with tracer.start_as_current_span("inference.prefill") as pre:
        pre.set_attribute("llm.prompt_tokens", prompt_tokens)

    with tracer.start_as_current_span("inference.decode") as dec:
        dec.set_attribute("llm.completion_tokens", completion_tokens)

# Spans are flushed synchronously by SimpleSpanProcessor.
spans = mem_exporter.get_finished_spans()
span_names = [s.name for s in spans]
print(f"Captured {len(spans)} spans: {span_names}")

assert "inference.request" in span_names, "Missing root span"
assert "inference.prefill" in span_names, "Missing prefill span"
assert "inference.decode" in span_names, "Missing decode span"

root_span = next(s for s in spans if s.name == "inference.request")
attrs = dict(root_span.attributes)
print(f"Root span attributes: {json.dumps(attrs, indent=2)}")

assert attrs["llm.model"] == model
assert attrs["llm.prompt_tokens"] > 0
assert attrs["llm.completion_tokens"] > 0
assert attrs["llm.estimated_cost_usd"] > 0.0

print(f"Cost for this request: ${attrs['llm.estimated_cost_usd']:.8f}")
print("span_assertions_passed")

Run the script:

python3 verify_spans.py

Step 6: Write spans to a file for offline analysis

For production use you’ll forward spans to a collector. As a lightweight alternative that works without Docker, write spans to a newline-delimited JSON file using a custom exporter. This gives you an audit log you can grep, ship to S3, or ingest into any log aggregator.

# filename: file_exporter.py
import json
import os
from typing import Sequence
from opentelemetry.sdk.trace.export import SpanExporter, SpanExportResult
from opentelemetry.sdk.trace import ReadableSpan


class NdJsonFileExporter(SpanExporter):
    """Appends one JSON object per span to a newline-delimited file."""

    def __init__(self, path: str = "/tmp/spans.ndjson"):
        self.path = path

    def export(self, spans: Sequence[ReadableSpan]) -> SpanExportResult:
        try:
            with open(self.path, "a") as f:
                for span in spans:
                    ctx = span.get_span_context()
                    record = {
                        "trace_id": format(ctx.trace_id, "032x"),
                        "span_id": format(ctx.span_id, "016x"),
                        "name": span.name,
                        "start_time_ns": span.start_time,
                        "end_time_ns": span.end_time,
                        "duration_ms": round(
                            (span.end_time - span.start_time) / 1_000_000, 3
                        ) if span.end_time and span.start_time else None,
                        "attributes": dict(span.attributes or {}),
                        "status": span.status.status_code.name,
                        "service": span.resource.attributes.get("service.name", ""),
                    }
                    f.write(json.dumps(record) + "\n")
            return SpanExportResult.SUCCESS
        except Exception as exc:
            print(f"[NdJsonFileExporter] export failed: {exc}")
            return SpanExportResult.FAILURE

    def shutdown(self):
        pass

Wire the file exporter into a standalone test:

import json
import os
from opentelemetry import trace as otel_trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import SimpleSpanProcessor
from opentelemetry.sdk.resources import Resource

from file_exporter import NdJsonFileExporter
from cost_model import estimate_cost_usd, count_tokens

SPAN_FILE = "/tmp/spans.ndjson"
if os.path.exists(SPAN_FILE):
    os.remove(SPAN_FILE)

resource = Resource.create({"service.name": "vllm-file-test"})
provider = TracerProvider(resource=resource)
provider.add_span_processor(SimpleSpanProcessor(NdJsonFileExporter(SPAN_FILE)))
otel_trace.set_tracer_provider(provider)

tracer = otel_trace.get_tracer("vllm.file_test")

requests = [
    {"prompt": "Summarise the Transformer architecture.", "model": "mistral-7b", "max_tokens": 60, "user": "tenant-a"},
    {"prompt": "What is RLHF?", "model": "llama-3-8b", "max_tokens": 40, "user": "tenant-b"},
    {"prompt": "Describe vLLM PagedAttention.", "model": "mistral-7b", "max_tokens": 80, "user": "tenant-a"},
]

for req in requests:
    with tracer.start_as_current_span("inference.request") as span:
        pt = count_tokens(req["prompt"], req["model"])
        ct = max(1, int(req["max_tokens"] * 0.6))
        cost = estimate_cost_usd(req["model"], pt, ct)
        span.set_attribute("llm.model", req["model"])
        span.set_attribute("llm.user", req["user"])
        span.set_attribute("llm.prompt_tokens", pt)
        span.set_attribute("llm.completion_tokens", ct)
        span.set_attribute("llm.total_tokens", pt + ct)
        span.set_attribute("llm.estimated_cost_usd", cost)

print(f"Spans written to {SPAN_FILE}")
with open(SPAN_FILE) as f:
    records = [json.loads(line) for line in f]

print(f"Total spans: {len(records)}")
total_cost = sum(r["attributes"].get("llm.estimated_cost_usd", 0) for r in records)
print(f"Total estimated cost across {len(records)} requests: ${total_cost:.8f}")

by_user: dict[str, float] = {}
for r in records:
    user = r["attributes"].get("llm.user", "unknown")
    by_user[user] = by_user.get(user, 0) + r["attributes"].get("llm.estimated_cost_usd", 0)

print("Cost by tenant:")
for user, cost_sum in sorted(by_user.items()):
    print(f"  {user}: ${cost_sum:.8f}")

print("file_exporter_ok")
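Since each NDJSON record carries a duration_ms field, a per-span latency summary needs only the standard library. The sketch below uses inline records with made-up durations so it runs without the file; the function name duration_summary is illustrative. To use it on real data, pass the lines of /tmp/spans.ndjson instead of sample_lines.

```python
import json
import statistics
from collections import defaultdict

# Synthetic records in the shape NdJsonFileExporter writes (durations are made up).
sample_lines = [
    json.dumps({"name": "inference.prefill", "duration_ms": 5.0}),
    json.dumps({"name": "inference.prefill", "duration_ms": 7.0}),
    json.dumps({"name": "inference.decode", "duration_ms": 96.0}),
    json.dumps({"name": "inference.decode", "duration_ms": 120.0}),
    json.dumps({"name": "inference.decode", "duration_ms": 100.0}),
]


def duration_summary(lines):
    """Group span durations by span name and report count, median, and max."""
    by_name = defaultdict(list)
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines in the file
        record = json.loads(line)
        if record.get("duration_ms") is not None:
            by_name[record["name"]].append(record["duration_ms"])
    return {
        name: {
            "count": len(values),
            "median_ms": statistics.median(values),
            "max_ms": max(values),
        }
        for name, values in by_name.items()
    }


summary = duration_summary(sample_lines)
for name, stats in sorted(summary.items()):
    print(name, stats)
```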

Verify it works

Run a final check that confirms the live server still responds, sends one more traced request, and reads back the span file written in Step 6:

# Confirm the server is still running.
curl -sf http://localhost:8080/health && echo "health_ok"

# Send one more request for a different tenant.
curl -s -X POST http://localhost:8080/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "llama-3-8b", "prompt": "What is continuous batching?", "max_tokens": 50, "user": "tenant-x"}' \
  | python3 -c "import sys,json; d=json.load(sys.stdin); print('trace_id:', d['trace_id'])"

# Confirm span file has records.
python3 -c "
import json
with open('/tmp/spans.ndjson') as f:
    lines = f.readlines()
print(f'span_file_lines: {len(lines)}')
first = json.loads(lines[0])
print('first_span_name:', first['name'])
print('first_span_model:', first['attributes'].get('llm.model', 'n/a'))
print('verify_complete')
"

Connecting to a real collector (optional)

To forward spans to a local OpenTelemetry Collector, SigNoz, or Grafana Tempo, set the environment variable before starting the server:

export OTEL_EXPORTER_OTLP_ENDPOINT=http://localhost:4317
# Then restart: uvicorn inference_server:app --host 0.0.0.0 --port 8080
echo "otlp_env_set"

The build_tracer_provider function in otel_setup.py detects this variable and adds an OTLPSpanExporter alongside the console exporter. For SigNoz, the default gRPC endpoint is http://localhost:4317; for Grafana Tempo it’s the same port by default. For Datadog or Honeycomb, replace the endpoint and add the appropriate auth header via OTEL_EXPORTER_OTLP_HEADERS. Note that otel_setup.py passes insecure=True, which suits a plaintext local collector; for a hosted endpoint that requires TLS, remove that argument.

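A minimal docker-compose.yml for a local OTel Collector with a file exporter looks like this (not executed in the sandbox, requires Docker; the mounted otel-collector-config.yaml that defines the OTLP receiver and file exporter is not shown):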

# filename: docker-compose.yml (illustration only — requires Docker)
version: "3.9"
services:
  otel-collector:
    image: otel/opentelemetry-collector-contrib:0.100.0
    ports:
      - "4317:4317"   # gRPC OTLP
      - "4318:4318"   # HTTP OTLP
    volumes:
      - ./otel-collector-config.yaml:/etc/otelcol-contrib/config.yaml

Troubleshooting

No spans appear in the console. The most common cause is using BatchSpanProcessor instead of SimpleSpanProcessor. The batch processor queues finished spans and exports them from a background thread, either on a timer (every 5 seconds by default) or when a batch fills, so a short-lived script can finish, or be killed, before anything is written. Use SimpleSpanProcessor for scripts and tests, as this tutorial does; use BatchSpanProcessor for long-running servers and call provider.force_flush() or provider.shutdown() when the process stops.

ModuleNotFoundError: No module named 'opentelemetry' after install. This happens when the install ran in a different virtual environment than the Python interpreter being used. Confirm with python3 -c "import opentelemetry; print('ok')". If it fails, re-run uv pip install opentelemetry-sdk in the same shell session.

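tiktoken can’t download its encoding. tiktoken downloads the cl100k_base encoding on first call. Because count_tokens catches every exception, a failed download shows up as word-split token counts instead of an error. In air-gapped environments, set TIKTOKEN_CACHE_DIR to a known directory, pre-download with python3 -c "import tiktoken; tiktoken.get_encoding('cl100k_base')" on a machine with internet access, and copy that directory to the target host with the same variable set. Without the variable, tiktoken caches under a data-gym-cache folder in the system temp directory.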

OTLP exporter raises StatusCode.UNAVAILABLE. The collector isn’t running or the endpoint is wrong. Check OTEL_EXPORTER_OTLP_ENDPOINT and confirm the collector is listening with nc -zv localhost 4317. The NdJsonFileExporter is a zero-dependency fallback that works without any collector.
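If nc isn't available on the host, the same TCP-level check can be done from Python. This is a minimal sketch; the function name collector_reachable is hypothetical, and a successful connection only shows that something is listening on the port, not that it speaks OTLP.

```python
import socket


def collector_reachable(host: str = "localhost", port: int = 4317, timeout: float = 1.0) -> bool:
    """TCP-level check only: True if something accepts connections on host:port."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts, and name resolution failures.
        return False


print("collector reachable:", collector_reachable())
```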

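Cost attribution looks wrong for a model. When the model name in the request doesn’t match any key in PRICING, estimate_cost_usd silently falls back to the "default" entry, so a typo or a differently cased name produces a plausible but incorrect cost instead of an error. A cost of exactly 0.0 means the matched entry’s prices (or both token counts) are zero. Check the llm.model attribute on the span to confirm the exact string being used, then add that string to the dict.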
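One way to make pricing lookups easier to debug is to report whether the model name actually matched. The sketch below is a hypothetical variant of the lookup in cost_model.py, not the tutorial's implementation; resolve_pricing and the trimmed PRICING table exist only for this example.

```python
import logging

logger = logging.getLogger("cost_model")

# Same shape as PRICING in cost_model.py, trimmed to two entries for the example.
PRICING: dict[str, tuple[float, float]] = {
    "mistral-7b": (0.00010, 0.00020),
    "default": (0.00020, 0.00040),
}


def resolve_pricing(model: str) -> tuple[tuple[float, float], bool]:
    """Return ((input_price, output_price), matched) for a model name."""
    if model in PRICING:
        return PRICING[model], True
    # Lookups are case-sensitive, so "Mistral-7B" lands here.
    logger.warning("no pricing entry for model %r; using 'default'", model)
    return PRICING["default"], False


prices, matched = resolve_pricing("Mistral-7B")
print(prices, matched)
```

The matched flag could also be recorded on the root span under an attribute name of your choosing, so fallback pricing is visible in the trace itself.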

Server port 8080 already in use. Kill the existing process with lsof -ti:8080 | xargs kill -9 or change the port in the uvicorn launch command and update the curl URLs accordingly.

Next steps

Attach spans to real vLLM engine calls. Replace _simulate_prefill and _simulate_decode with calls to vllm.AsyncLLMEngine. Wrap engine.generate() in the same inference.prefill / inference.decode span pattern; vLLM’s async generator yields incremental outputs as tokens are produced, so you can record first-token latency (TTFT) as a span event.
Add a Prometheus metrics bridge. Use opentelemetry-exporter-prometheus to expose llm_total_tokens_total and llm_estimated_cost_usd_total as Prometheus counters, then scrape them from Grafana for a cost dashboard alongside your trace data.
Implement per-tenant rate limiting. The llm.user span attribute is already set. Feed it into a Redis-backed token bucket (one key per tenant) checked at the start of the inference.request span; set a span event rate_limit.applied when the bucket is exhausted.
Export to SigNoz or Grafana Tempo. Both accept standard OTLP gRPC on port 4317. Start the stack with docker compose up -d, set OTEL_EXPORTER_OTLP_ENDPOINT=http://localhost:4317, and the same span structure explored here appears in the UI with zero schema changes.
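For the first-token latency idea in the list above, the measurement itself is independent of vLLM: time the gap between starting to consume an async generator and receiving its first item. The sketch below uses a stub generator in place of a real engine; stub_token_stream, consume_with_ttft, and the delay values are all hypothetical. In an instrumented server, the marked line is where a span event would be recorded.

```python
import asyncio
import time
from typing import AsyncIterator, Optional


async def stub_token_stream(
    n_tokens: int, first_delay_s: float = 0.05, step_delay_s: float = 0.01
) -> AsyncIterator[str]:
    """Stand-in for a streaming engine: slow first token, then steady decode."""
    await asyncio.sleep(first_delay_s)
    for i in range(n_tokens):
        if i:
            await asyncio.sleep(step_delay_s)
        yield f"tok{i}"


async def consume_with_ttft(stream: AsyncIterator[str]) -> dict:
    """Drain the stream and report token count, time to first token, and total time."""
    start = time.perf_counter()
    ttft_s: Optional[float] = None
    count = 0
    async for _ in stream:
        if ttft_s is None:
            # In a traced server, a span event such as "first_token" would go here.
            ttft_s = time.perf_counter() - start
        count += 1
    total_s = time.perf_counter() - start
    return {
        "tokens": count,
        "ttft_ms": None if ttft_s is None else round(ttft_s * 1000, 2),
        "total_ms": round(total_s * 1000, 2),
    }


result = asyncio.run(consume_with_ttft(stub_token_stream(5)))
print(result)
```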

FAQ

How does OpenTelemetry tracing help with vLLM inference debugging?

OpenTelemetry spans capture per-request latency, token counts, and cost attribution in structured traces. Each inference call creates a root span with child spans for prefill and decode phases, allowing operators to see exactly which phase caused a latency spike or which tenant is consuming GPU budget.

What span attributes does the instrumented server attach to each request?

The server attaches model name, request ID, prompt and completion token counts, total tokens, estimated cost in USD, and per-phase latencies. Child spans for tokenize, prefill, and decode phases carry their own attributes like prompt token count and prefill latency in milliseconds.

Can the same span structure work with different tracing backends?

Yes. The span structure is backend-agnostic; only the exporter endpoint changes. The tutorial uses a local console exporter and file exporter, but the same code works with Grafana Tempo, SigNoz, Datadog, or Honeycomb by setting the OTEL_EXPORTER_OTLP_ENDPOINT environment variable.

How does cost attribution work in the traced spans?

The cost_model module stores pricing per token for each model and calculates estimated USD cost based on prompt and completion token counts. This cost is attached as the llm.estimated_cost_usd attribute on the root span, enabling per-tenant cost tracking by grouping spans by the llm.user attribute.

What is the difference between SimpleSpanProcessor and BatchSpanProcessor?

SimpleSpanProcessor exports each span synchronously as it ends, making it suitable for short-lived scripts and tests. BatchSpanProcessor queues spans and exports them from a background thread on a timer or when a batch fills, so a short-lived script may exit before the export happens; it is preferred for long-running servers, where you call provider.force_flush() on shutdown.