Self-Hosted vLLM Inference with Audit-Grade OTel Spans on Hetzner

Why this matters

The EU AI Act’s transparency obligations require operators to produce auditable records of model invocations: which model ran, on what input size, and how it terminated. Without structured span data attached to each inference call, answering a regulator’s question about a specific request means grepping unstructured logs, which is slow and error-prone at scale.

vLLM’s OpenAI-compatible server emits OpenTelemetry traces natively when an OTLP endpoint is configured. Pairing that with an OTel Collector sidecar and a self-hosted SigNoz instance gives you a complete, queryable audit trail where every span carries gen_ai.model.id, gen_ai.usage.prompt_tokens, gen_ai.usage.completion_tokens, and gen_ai.response.finish_reasons as first-class attributes. Because Hetzner’s GPU instances (AX52, GX2, and the newer CCX-series with A100 access) are physically located in Nuremberg and Falkenstein, the data never crosses an EU border.

This tutorial walks through the full stack: Hetzner instance provisioning, vLLM server startup, OTel Collector configuration as a Docker Compose sidecar, and a Python client that fires test requests and then queries SigNoz’s API to confirm the spans landed.

Prerequisites

A Hetzner Cloud account with GPU quota approved (request via the Hetzner Cloud Console under “Limits”)
hcloud CLI installed and authenticated (hcloud context create my-project)
Docker and Docker Compose on the remote instance (the tutorial installs them via cloud-init)
Python 3.11 or 3.12 on your local machine
Familiarity with the vLLM CLI (vllm serve)
A SigNoz instance reachable from the Hetzner instance. The tutorial covers running SigNoz on a separate Hetzner CX22 (2 vCPU, 4 GB RAM) using Docker Compose. If you already have a SigNoz deployment, skip that section and substitute your OTLP endpoint.

Setup

Install the Python packages used in the local verification and span-query scripts.

uv pip install openai opentelemetry-sdk opentelemetry-exporter-otlp-proto-grpc requests

Export the addresses you’ll fill in after provisioning. The tutorial references these variables throughout.

export HETZNER_GPU_IP="203.0.113.10"        # replace after provisioning
export SIGNOZ_IP="203.0.113.20"             # replace after provisioning
export VLLM_PORT=8000
export OTLP_GRPC_PORT=4317
export MODEL_ID="facebook/opt-125m"         # swap for your target model
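Because every later script reads these variables, it can help to fail fast when one is unset. The following is a minimal sketch, not part of the tutorial's own scripts: the StackSettings class and load_settings function are hypothetical names, and the defaults mirror the exports above.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class StackSettings:
    vllm_base_url: str
    otlp_endpoint: str
    model_id: str


def load_settings(env=os.environ) -> StackSettings:
    """Build endpoint strings from the exported variables; fail fast on missing IPs."""
    missing = [k for k in ("HETZNER_GPU_IP", "SIGNOZ_IP") if not env.get(k)]
    if missing:
        raise KeyError(f"unset environment variables: {', '.join(missing)}")
    return StackSettings(
        vllm_base_url=f"http://{env['HETZNER_GPU_IP']}:{env.get('VLLM_PORT', '8000')}/v1",
        otlp_endpoint=f"{env['SIGNOZ_IP']}:{env.get('OTLP_GRPC_PORT', '4317')}",
        model_id=env.get("MODEL_ID", "facebook/opt-125m"),
    )
```

Calling load_settings() before any network request turns a forgotten export into an immediate, readable error instead of a connection timeout against 127.0.0.1.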

Step 1: Provision the Hetzner GPU instance

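The hcloud CLI can pass a cloud-init script to a new instance. The script below installs Docker and the NVIDIA container toolkit, then pulls the vLLM Docker image on first boot. It does not install the NVIDIA driver, which the toolkit needs on the host; confirm the driver is present with nvidia-smi before starting vLLM. Paste the script into a file, then create the server.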

# filename: cloud-init-gpu.yaml

#cloud-config
packages:
  - apt-transport-https
  - ca-certificates
  - curl
  - gnupg

runcmd:
  # Docker
  - curl -fsSL https://download.docker.com/linux/ubuntu/gpg | gpg --dearmor -o /usr/share/keyrings/docker-archive-keyring.gpg
  - echo "deb [arch=amd64 signed-by=/usr/share/keyrings/docker-archive-keyring.gpg] https://download.docker.com/linux/ubuntu jammy stable" > /etc/apt/sources.list.d/docker.list
  - apt-get update -y
  - apt-get install -y docker-ce docker-ce-cli containerd.io docker-compose-plugin
  - systemctl enable docker
  - systemctl start docker
  # NVIDIA container toolkit
  - curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg
  - curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' > /etc/apt/sources.list.d/nvidia-container-toolkit.list
  - apt-get update -y
  - apt-get install -y nvidia-container-toolkit
  - nvidia-ctk runtime configure --runtime=docker
  - systemctl restart docker
  # Pull vLLM image (large download; cloud-init blocks until it finishes)
  - docker pull vllm/vllm-openai:latest

Create the server (adjust --type to the GPU type available in your quota):

hcloud server create \
  --name vllm-eu \
  --type gx2-8 \
  --image ubuntu-22.04 \
  --location nbg1 \
  --user-data-from-file cloud-init-gpu.yaml \
  --ssh-key my-key
# After creation, note the public IPv4 and set HETZNER_GPU_IP

This block is illustrative. Run it from your local machine after installing hcloud.

Step 2: Deploy SigNoz on a companion CX22 instance

SigNoz ships a Docker Compose stack. Provision a second, CPU-only instance, install Docker on it, and bring the stack up.

hcloud server create \
  --name signoz-eu \
  --type cx22 \
  --image ubuntu-22.04 \
  --location nbg1 \
  --ssh-key my-key
# SSH in, then:
# git clone -b main https://github.com/SigNoz/signoz.git
# cd signoz/deploy && docker compose -f docker/clickhouse-setup/docker-compose.yaml up -d

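SigNoz exposes its OTLP gRPC receiver on port 4317 and its UI on port 3301. In the Hetzner firewall for this instance, allow 4317 from the GPU instance’s IP only, and allow 3301 from the GPU instance and from the machine where you will run the audit client in Step 4.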

This block is illustrative. Run it from your local machine.
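Firewall mistakes are easier to find before any telemetry is involved. Here is a small sketch of a TCP reachability probe using only the standard library; port_open is a hypothetical helper name, and the check only proves that a TCP handshake completes, not that the service behind the port is healthy.

```python
import socket


def port_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts
        return False
```

Run port_open(signoz_ip, 4317) from the GPU instance and port_open(signoz_ip, 3301) from your workstation; a False result on either points to the firewall rules rather than to vLLM or the Collector.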

Step 3: Configure the OTel Collector sidecar

The Collector runs as a Docker container on the GPU instance alongside vLLM. It receives OTLP spans from vLLM on port 4317 (vLLM reaches it over the Compose network as otel-collector:4317), enriches them with deployment.region and deployment.environment resource attributes, and forwards them to SigNoz.

Write the Collector config to a file you’ll scp to the GPU instance:

# filename: otel-collector-config.yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318

processors:
  resource:
    attributes:
      - key: deployment.region
        value: "hetzner-nbg1-eu"
        action: upsert
      - key: deployment.environment
        value: "production"
        action: upsert
  batch:
    timeout: 5s
    send_batch_size: 512

exporters:
  otlp:
    endpoint: "${SIGNOZ_IP}:4317"
    tls:
      insecure: true
  logging:
    verbosity: detailed

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [resource, batch]
      exporters: [otlp, logging]
    metrics:
      receivers: [otlp]
      processors: [resource, batch]
      exporters: [otlp]

The logging exporter writes every span to the Collector’s stdout, which is useful during initial setup. Remove it once you confirm spans are landing in SigNoz.
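Removing an exporter such as logging means deleting it in two places: its definition under exporters and every pipeline that references it. The sketch below illustrates that consistency rule on a plain dict that mirrors the YAML structure, so it needs no YAML parser. The undefined_components function is a hypothetical helper, not a Collector API, and it does not replace the Collector's own validation.

```python
def undefined_components(config: dict) -> list[str]:
    """List pipeline references that have no matching top-level definition."""
    problems = []
    pipelines = config.get("service", {}).get("pipelines", {})
    for pipe_name, pipe in pipelines.items():
        for kind in ("receivers", "processors", "exporters"):
            defined = config.get(kind) or {}
            for ref in pipe.get(kind, []):
                if ref not in defined:
                    problems.append(f"{pipe_name}: {kind[:-1]} '{ref}' is not defined")
    return problems


# Same shape as otel-collector-config.yaml, with component settings elided
config = {
    "receivers": {"otlp": {}},
    "processors": {"resource": {}, "batch": {}},
    "exporters": {"otlp": {}, "logging": {}},
    "service": {
        "pipelines": {
            "traces": {
                "receivers": ["otlp"],
                "processors": ["resource", "batch"],
                "exporters": ["otlp", "logging"],
            },
            "metrics": {
                "receivers": ["otlp"],
                "processors": ["resource", "batch"],
                "exporters": ["otlp"],
            },
        }
    },
}

print(undefined_components(config))  # []
```

Deleting the logging entry from exporters while leaving it in the traces pipeline makes the function report that one dangling reference.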

Write the Docker Compose file that starts both vLLM and the Collector:

# filename: docker-compose-vllm.yaml
services:
  otel-collector:
    image: otel/opentelemetry-collector-contrib:0.100.0
    volumes:
      - ./otel-collector-config.yaml:/etc/otelcol-contrib/config.yaml
    ports:
      - "4317:4317"
      - "4318:4318"
    environment:
      - SIGNOZ_IP=${SIGNOZ_IP}
    restart: unless-stopped

  vllm:
    image: vllm/vllm-openai:latest
    runtime: nvidia
    environment:
      - NVIDIA_VISIBLE_DEVICES=all
      # Standard OTel SDK variables; the service name is what the audit client queries
      - OTEL_SERVICE_NAME=vllm
      - OTEL_EXPORTER_OTLP_TRACES_INSECURE=true
    command: >
      --model ${MODEL_ID}
      --port 8000
      --otlp-traces-endpoint http://otel-collector:4317
      --collect-detailed-traces all
    ports:
      - "8000:8000"
    depends_on:
      - otel-collector
    restart: unless-stopped
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]

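The --otlp-traces-endpoint flag tells vLLM’s built-in OTel instrumentation where to send spans; vLLM emits one span per request once it is set. The --collect-detailed-traces all flag adds model- and worker-level timing detail to those spans. Collecting that detail can slow inference, so consider dropping the flag once the pipeline is verified.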
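Because vLLM and the Collector run in separate containers, a loopback address in the endpoint would point back at the vLLM container itself. The following sketch shows a pre-deploy check for that mistake; is_loopback_endpoint is a hypothetical helper, and it assumes the endpoint is written with a scheme (for example http://), since urlsplit cannot find the host otherwise.

```python
import ipaddress
from urllib.parse import urlsplit


def is_loopback_endpoint(endpoint: str) -> bool:
    """True if the endpoint's host is localhost or a loopback IP address."""
    host = urlsplit(endpoint).hostname or ""
    if host == "localhost":
        return True
    try:
        return ipaddress.ip_address(host).is_loopback
    except ValueError:
        # Not an IP literal: a DNS name such as a Compose service name
        return False
```

A deploy script could refuse to start the stack when is_loopback_endpoint returns True for the value passed to --otlp-traces-endpoint.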

Deploy to the GPU instance:

# Run these on your local machine. Hetzner's Ubuntu images log in as root by default;
# substitute your own user if you created one.
ssh root@${HETZNER_GPU_IP} "mkdir -p /opt/vllm"
scp otel-collector-config.yaml docker-compose-vllm.yaml root@${HETZNER_GPU_IP}:/opt/vllm/
ssh root@${HETZNER_GPU_IP} \
  "cd /opt/vllm && SIGNOZ_IP=${SIGNOZ_IP} MODEL_ID=${MODEL_ID} docker compose -f docker-compose-vllm.yaml up -d"

This block is illustrative. Run it from your local machine.
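The containers start quickly, but vLLM only accepts requests after the model weights are loaded, so a client that fires immediately will see connection errors. Here is a sketch of a readiness poll using only the standard library. wait_until_ready is a hypothetical helper, and it assumes the server answers HTTP 200 on a health route such as /health once it is ready.

```python
import time
import urllib.error
import urllib.request


def wait_until_ready(url: str, attempts: int = 30, delay: float = 10.0) -> bool:
    """Poll url until it answers HTTP 200, or give up after `attempts` tries."""
    for _ in range(attempts):
        try:
            with urllib.request.urlopen(url, timeout=5) as resp:
                if resp.status == 200:
                    return True
        except (urllib.error.URLError, OSError):
            pass  # not listening yet, or still loading weights
        time.sleep(delay)
    return False
```

For example, wait_until_ready(f"http://{gpu_ip}:8000/health") returns True once the server is up, or False after roughly five minutes with the defaults.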

Step 4: Write the audit-span client

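This Python module fires a test completion request through vLLM’s OpenAI-compatible endpoint and then queries SigNoz’s trace API to confirm the span attributes landed correctly. Treat the /api/v1/spans path and the response shape in query_signoz_spans as placeholders to check against the API of the SigNoz release you run.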

# filename: audit_client.py
import os
import time
import json
import requests
from openai import OpenAI

VLLM_BASE_URL = f"http://{os.environ.get('HETZNER_GPU_IP', '127.0.0.1')}:{os.environ.get('VLLM_PORT', '8000')}/v1"
SIGNOZ_BASE_URL = f"http://{os.environ.get('SIGNOZ_IP', '127.0.0.1')}:3301"
MODEL_ID = os.environ.get("MODEL_ID", "facebook/opt-125m")

# Required span attributes for AI Act audit records
REQUIRED_SPAN_ATTRS = [
    "gen_ai.model.id",
    "gen_ai.usage.prompt_tokens",
    "gen_ai.usage.completion_tokens",
    "gen_ai.response.finish_reasons",
    "deployment.region",
]


def send_completion(prompt: str) -> dict:
    """Send a single text completion and return the fields relevant to the audit record."""
    client = OpenAI(base_url=VLLM_BASE_URL, api_key="not-needed")
    # facebook/opt-125m ships no chat template, so use the plain completions
    # endpoint; switch to client.chat.completions.create for a chat-tuned model.
    response = client.completions.create(
        model=MODEL_ID,
        prompt=prompt,
        max_tokens=64,
    )
    return {
        "id": response.id,
        "model": response.model,
        "prompt_tokens": response.usage.prompt_tokens,
        "completion_tokens": response.usage.completion_tokens,
        "finish_reason": response.choices[0].finish_reason,
    }


def query_signoz_spans(service_name: str = "vllm", lookback_minutes: int = 5) -> list:
    """Query SigNoz for recent spans from the vLLM service."""
    now_ms = int(time.time() * 1000)
    start_ms = now_ms - lookback_minutes * 60 * 1000
    url = f"{SIGNOZ_BASE_URL}/api/v1/spans"
    params = {
        "service": service_name,
        "start": start_ms,
        "end": now_ms,
        "limit": 20,
    }
    resp = requests.get(url, params=params, timeout=10)
    resp.raise_for_status()
    return resp.json().get("spans", [])


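def verify_audit_attributes(spans: list) -> dict:
    """Return attribute coverage for the span carrying the most required attributes.

    An audit record needs every attribute on the same span, so coverage is
    computed per span rather than pooled across spans.
    """
    best = {attr: False for attr in REQUIRED_SPAN_ATTRS}
    for span in spans:
        attrs = span.get("attributes", {})
        found = {attr: attr in attrs for attr in REQUIRED_SPAN_ATTRS}
        if sum(found.values()) > sum(best.values()):
            best = found
    return best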


if __name__ == "__main__":
    print(f"Sending test completion to {VLLM_BASE_URL} ...")
    try:
        result = send_completion("Summarize the EU AI Act in one sentence.")
        print(f"Response received: {json.dumps(result, indent=2)}")
    except Exception as exc:
        print(f"vLLM call failed (expected in sandbox without GPU): {exc}")

    print("\nQuerying SigNoz for audit spans ...")
    try:
        spans = query_signoz_spans()
        print(f"Found {len(spans)} recent spans")
        audit_check = verify_audit_attributes(spans)
        print("Audit attribute coverage:")
        for attr, present in audit_check.items():
            status = "PASS" if present else "MISSING"
            print(f"  {status}  {attr}")
        missing = [a for a, ok in audit_check.items() if not ok]
        if missing:
            print(f"\nWARNING: {len(missing)} required attributes not found in recent spans.")
        else:
            print("\nAll required audit attributes confirmed in SigNoz.")
    except Exception as exc:
        print(f"SigNoz query failed (expected in sandbox without live deployment): {exc}")
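The Collector's batch processor holds spans for up to 5 seconds before exporting, so a query issued right after the completion can legitimately come back empty. The sketch below shows a simple retry loop under that assumption. poll_for_spans is a hypothetical helper that takes any zero-argument fetch function, such as query_signoz_spans from the module above.

```python
import time
from typing import Callable, Sequence


def poll_for_spans(
    fetch: Callable[[], Sequence[dict]],
    attempts: int = 6,
    delay: float = 5.0,
) -> Sequence[dict]:
    """Call fetch() until it returns a non-empty result or attempts run out."""
    spans: Sequence[dict] = []
    for attempt in range(attempts):
        spans = fetch()
        if spans:
            break
        if attempt < attempts - 1:
            time.sleep(delay)  # give the batch processor time to flush
    return spans
```

Replacing the direct call with spans = poll_for_spans(query_signoz_spans) keeps the rest of the script unchanged while tolerating about half a minute of export delay.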

Step 5: Emit and verify spans locally with the OTel SDK

Before deploying to Hetzner, verify that your span schema is correct by emitting a synthetic span locally using the OTel Python SDK. This confirms the attribute names and types match what SigNoz expects.

# filename: emit_test_span.py
import time
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import SimpleSpanProcessor, ConsoleSpanExporter
from opentelemetry.sdk.resources import Resource

# Build a tracer that mimics what vLLM emits
resource = Resource.create({
    "service.name": "vllm",
    "deployment.region": "hetzner-nbg1-eu",
    "deployment.environment": "production",
})

provider = TracerProvider(resource=resource)
provider.add_span_processor(SimpleSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)
tracer = trace.get_tracer("vllm.inference")


def emit_inference_span(
    model_id: str,
    prompt_tokens: int,
    completion_tokens: int,
    finish_reason: str,
) -> str:
    """Emit a single inference span with all required audit attributes."""
    with tracer.start_as_current_span("llm.generate") as span:
        span.set_attribute("gen_ai.model.id", model_id)
        span.set_attribute("gen_ai.usage.prompt_tokens", prompt_tokens)
        span.set_attribute("gen_ai.usage.completion_tokens", completion_tokens)
        span.set_attribute("gen_ai.response.finish_reasons", [finish_reason])
        span.set_attribute("gen_ai.system", "vllm")
        span.set_attribute("gen_ai.operation.name", "chat")
        # Simulate inference latency
        time.sleep(0.01)
        return trace.format_trace_id(span.get_span_context().trace_id)


if __name__ == "__main__":
    trace_id = emit_inference_span(
        model_id="facebook/opt-125m",
        prompt_tokens=42,
        completion_tokens=17,
        finish_reason="stop",
    )
    print(f"trace_id={trace_id}")

Run the emitter and capture the console output to confirm the span structure:

import io
import sys
import importlib
import importlib.util

# Run emit_test_span as a module and capture its ConsoleSpanExporter output
captured = io.StringIO()
old_stdout = sys.stdout
sys.stdout = captured

# Import and run
spec = importlib.util.spec_from_file_location("emit_test_span", "/workspace/emit_test_span.py")
mod = importlib.util.module_from_spec(spec)
spec.loader.exec_module(mod)
trace_id = mod.emit_inference_span(
    model_id="facebook/opt-125m",
    prompt_tokens=42,
    completion_tokens=17,
    finish_reason="stop",
)

sys.stdout = old_stdout
output = captured.getvalue()

# Verify required audit attributes appear in the span output
required = [
    "gen_ai.model.id",
    "gen_ai.usage.prompt_tokens",
    "gen_ai.usage.completion_tokens",
    "gen_ai.response.finish_reasons",
    "deployment.region",
]
missing = [attr for attr in required if attr not in output]
if missing:
    print(f"FAIL: missing attributes in span output: {missing}")
    print("--- captured output ---")
    print(output[:2000])
else:
    print("PASS: all required audit attributes present in span")
print(f"trace_id={trace_id}")
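Substring matching on the captured text passes even if an attribute name only appears inside another value. A stricter variant is to decode the exporter output and inspect the attribute maps. The sketch below assumes ConsoleSpanExporter writes one JSON object per span, with span attributes under "attributes" and resource attributes under "resource" then "attributes"; check this against the SDK version you have installed. The function names are hypothetical, and the sample text is hand-written and heavily abbreviated, not real exporter output.

```python
import json

REQUIRED = (
    "gen_ai.model.id",
    "gen_ai.usage.prompt_tokens",
    "gen_ai.usage.completion_tokens",
    "gen_ai.response.finish_reasons",
    "deployment.region",
)


def parse_console_spans(text: str) -> list[dict]:
    """Decode a run of concatenated JSON objects separated by whitespace."""
    decoder = json.JSONDecoder()
    spans, pos = [], 0
    while pos < len(text):
        if text[pos].isspace():  # raw_decode does not skip leading whitespace
            pos += 1
            continue
        obj, pos = decoder.raw_decode(text, pos)
        spans.append(obj)
    return spans


def missing_audit_attrs(span: dict) -> list[str]:
    """Return required keys absent from both span and resource attributes."""
    present = set(span.get("attributes", {}))
    present |= set(span.get("resource", {}).get("attributes", {}))
    return [key for key in REQUIRED if key not in present]


# Hand-written stand-in for two exported spans; the second is incomplete
sample = """
{
    "name": "llm.generate",
    "attributes": {
        "gen_ai.model.id": "facebook/opt-125m",
        "gen_ai.usage.prompt_tokens": 42,
        "gen_ai.usage.completion_tokens": 17,
        "gen_ai.response.finish_reasons": ["stop"]
    },
    "resource": {"attributes": {"service.name": "vllm", "deployment.region": "hetzner-nbg1-eu"}}
}
{
    "name": "llm.generate",
    "attributes": {"gen_ai.model.id": "facebook/opt-125m"},
    "resource": {"attributes": {"service.name": "vllm"}}
}
"""

for span in parse_console_spans(sample):
    print(span["name"], missing_audit_attrs(span))
```

Passing the captured output string to parse_console_spans in place of sample gives a per-span list of missing attributes rather than a single pass or fail.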

Step 6: Validate the Collector config locally

The OTel Collector binary can validate a config file without a running backend. Pull the contrib image and run validate against the config you wrote in Step 3.

docker run --rm \
  -v /workspace/otel-collector-config.yaml:/etc/otelcol-contrib/config.yaml \
  -e SIGNOZ_IP=203.0.113.20 \
  otel/opentelemetry-collector-contrib:0.100.0 \
  validate --config /etc/otelcol-contrib/config.yaml

Verify it works

With the full stack running on Hetzner, run the audit client against the live deployment:

import subprocess, sys
result = subprocess.run(
    [sys.executable, "/workspace/audit_client.py"],
    capture_output=True, text=True
)
print(result.stdout)
if result.returncode != 0:
    print(result.stderr)
print("audit_client_ran=true")

Expected output when the stack is live:

Sending test completion to http://203.0.113.10:8000/v1 ...
Response received: {
  "id": "cmpl-abc123",
  "model": "facebook/opt-125m",
  "prompt_tokens": 14,
  "completion_tokens": 64,
  "finish_reason": "length"
}

Querying SigNoz for audit spans ...
Found 3 recent spans
Audit attribute coverage:
  PASS  gen_ai.model.id
  PASS  gen_ai.usage.prompt_tokens
  PASS  gen_ai.usage.completion_tokens
  PASS  gen_ai.response.finish_reasons
  PASS  deployment.region

All required audit attributes confirmed in SigNoz.

In the sandbox (no GPU, no live Hetzner instance), the vLLM and SigNoz calls fail gracefully and the script prints the exception messages instead.

Troubleshooting

vLLM starts but no spans appear in SigNoz. Check that --otlp-traces-endpoint points to the Collector container name (http://otel-collector:4317), not localhost. Inside Docker Compose, localhost resolves to the vLLM container itself, not the Collector sidecar. Confirm with docker compose logs otel-collector that the Collector is receiving connections.

Collector exits with invalid configuration: no exporters defined. The ${SIGNOZ_IP} variable is not being substituted. Pass it explicitly with -e SIGNOZ_IP=... in the Docker run command, or add it to a .env file in the same directory as docker-compose-vllm.yaml. Docker Compose reads .env automatically.

SigNoz UI shows spans but gen_ai.* attributes are missing. Your vLLM version predates the GenAI semantic conventions support. Upgrade to vLLM 0.4.0 or later, which added gen_ai.usage.* attributes. Check the running version with docker exec vllm-vllm-1 python -c "import importlib.metadata; print(importlib.metadata.version('vllm'))" .

Hetzner GPU quota request is pending. GPU quota is not granted automatically. Submit the request in the Hetzner Cloud Console under Account > Limits, describe your workload, and expect 1-3 business days. In the meantime, test the OTel pipeline on a CPU-only CX31 instance with a small model (facebook/opt-125m) by removing the runtime: nvidia and deploy.resources keys from the Compose file and swapping in a CPU build of vLLM, since the vllm/vllm-openai image targets CUDA GPUs.

SigNoz ClickHouse container runs out of disk. The default SigNoz Docker Compose stack retains 30 days of trace data. On a CX22 (40 GB disk), a busy inference endpoint can fill this in days. Set STORAGE_DURATION_HOURS=168 (7 days) in SigNoz’s .env before starting the stack, or mount a Hetzner Volume for the ClickHouse data directory.
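A rough estimate shows why the disk fills so fast. The sketch below is back-of-the-envelope arithmetic only: the request rate, spans per request, and stored bytes per span are all assumed figures, and it ignores ClickHouse compression and the space used by the operating system and other SigNoz data.

```python
def days_until_full(
    disk_gb: float,
    requests_per_second: float,
    spans_per_request: float,
    bytes_per_span: float,
) -> float:
    """Days of trace data that fit on disk at a steady request rate."""
    bytes_per_day = requests_per_second * 86_400 * spans_per_request * bytes_per_span
    return disk_gb * 1e9 / bytes_per_day


# Assumed workload: 50 requests/s, 3 spans per request, 1 kB stored per span
print(round(days_until_full(40, 50, 3, 1000), 1))  # 3.1
```

Substitute measured values from your own deployment before choosing a retention period or a volume size.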

docker compose command not found on the GPU instance. The cloud-init script installs docker-compose-plugin, which exposes the command as docker compose (with a space), not docker-compose (with a hyphen). If you have scripts that call the hyphenated form, install the standalone binary: apt-get install -y docker-compose.

Next steps

Add a Prometheus metrics receiver to the Collector config and scrape vLLM’s /metrics endpoint for GPU utilization, KV cache hit rate, and queue depth alongside the trace data.
Write a Grafana dashboard that overlays SigNoz trace latency with the Prometheus metrics on a shared time axis, or links the two through exemplars that carry trace_id, to correlate latency spikes with cache miss events.
Implement a span processor that redacts PII from gen_ai.prompt attributes before they leave the GPU instance, satisfying GDPR Article 25 data-minimisation requirements.
Extend the audit_client.py verification script into a nightly CI job that queries SigNoz and fails the build if any required audit attribute is absent from the previous 24 hours of spans.
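For the nightly CI idea, the missing piece is an exit code: audit_client.py prints a warning but always exits successfully. Here is a minimal sketch of a gate function, assuming it receives the coverage dict returned by verify_audit_attributes; audit_exit_code is a hypothetical name.

```python
import sys


def audit_exit_code(coverage: dict[str, bool]) -> int:
    """Return 0 when every required attribute was found, 1 otherwise."""
    missing = sorted(attr for attr, ok in coverage.items() if not ok)
    for attr in missing:
        print(f"MISSING {attr}", file=sys.stderr)
    return 1 if missing else 0
```

Ending the script with sys.exit(audit_exit_code(audit_check)) lets a CI runner fail the job whenever an attribute is absent.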

FAQ

What span attributes does vLLM emit for audit compliance?

vLLM emits gen_ai.model.id, gen_ai.usage.prompt_tokens, gen_ai.usage.completion_tokens, and gen_ai.response.finish_reasons as first-class span attributes when an OTLP endpoint is configured. The OTel Collector adds deployment.region and deployment.environment attributes for full audit context.

How does the OTel Collector connect vLLM to SigNoz?

The Collector runs as a Docker sidecar on the same Hetzner instance, receives OTLP spans from vLLM at otel-collector:4317 on the Compose network, enriches them with deployment metadata, and forwards them to SigNoz’s OTLP gRPC endpoint on port 4317. vLLM is configured with --otlp-traces-endpoint pointing to the Collector container by name.

Why must vLLM use the Collector container name instead of localhost?

Inside Docker Compose, localhost resolves to the vLLM container itself, not the Collector sidecar. Using the container name (http://otel-collector:4317) ensures vLLM reaches the Collector through Docker’s internal DNS.

What Hetzner instance types support this setup?

GPU instances AX52, GX2, and CCX-series with A100 access in Nuremberg (nbg1) and Falkenstein (fsn1) locations are suitable. The tutorial uses gx2-8 as an example. SigNoz runs on a separate CX22 CPU instance.

How can I verify spans are landing in SigNoz before deploying to production?

Run emit_test_span.py locally to emit a synthetic span with the OTel SDK and ConsoleSpanExporter, confirming the attribute schema matches SigNoz expectations. Then validate the Collector config with docker run otel/opentelemetry-collector-contrib validate before deploying to Hetzner.