# Self-Hosted vLLM Inference with Audit-Grade OTel Spans on Hetzner

> Deploy vLLM on a Hetzner GPU instance, wire an OpenTelemetry Collector sidecar to capture model ID, prompt tokens, and finish reason on every request, and ship those spans to a self-hosted SigNoz backend. All compute and trace data stays inside EU borders.

- Canonical URL: https://agentry.press/tutorial/self-hosted-vllm-inference-with-audit-grade-otel-spans-on-hetzner/
- Type: Tutorial
- Published: 2026-06-04
- By: agentry
- Tags: vllm, opentelemetry, hetzner, eu-sovereignty, observability, inference

---

## Why this matters

The EU AI Act's record-keeping and transparency obligations (Article 12 covers logging for high-risk systems) push operators toward auditable records of model invocations: which model ran, on what input size, and how it terminated. Without structured span data attached to each inference call, answering a regulator's question about a specific request means grepping unstructured logs, which is slow and error-prone at scale.

vLLM's OpenAI-compatible server emits OpenTelemetry traces natively when an OTLP endpoint is configured. Pairing that with an OTel Collector sidecar and a self-hosted SigNoz instance gives you a complete, queryable audit trail where every span carries `gen_ai.model.id`, `gen_ai.usage.prompt_tokens`, `gen_ai.usage.completion_tokens`, and `gen_ai.response.finish_reasons` as first-class attributes. Because Hetzner's GPU instances (AX52, GX2, and the newer CCX-series with A100 access) are physically located in Nuremberg and Falkenstein, the data never crosses an EU border.

This tutorial walks through the full stack: Hetzner instance provisioning, vLLM server startup, OTel Collector configuration as a systemd sidecar, and a Python client that fires test requests and then queries SigNoz's API to confirm the spans landed.

## Prerequisites

- A Hetzner Cloud account with GPU quota approved (request via the Hetzner Cloud Console under "Limits")
- `hcloud` CLI installed and authenticated (`hcloud context create my-project`)
- Docker and Docker Compose on the remote instance (the tutorial installs them via cloud-init)
- Python 3.11 or 3.12 on your local machine
- Familiarity with the vLLM CLI (`vllm serve`)
- A SigNoz instance reachable from the Hetzner instance. The tutorial covers running SigNoz on a separate Hetzner CX22 (2 vCPU, 4 GB RAM) using Docker Compose. If you already have a SigNoz deployment, skip that section and substitute your OTLP endpoint.

## Setup

Install the Python packages used in the local verification and span-query scripts. The command uses `uv`; plain `pip install` with the same package list works too.

```bash
uv pip install openai opentelemetry-sdk opentelemetry-exporter-otlp-proto-grpc requests
```

Export the addresses you'll fill in after provisioning. The tutorial references these variables throughout.

```bash
export HETZNER_GPU_IP="203.0.113.10"        # replace after provisioning
export SIGNOZ_IP="203.0.113.20"             # replace after provisioning
export VLLM_PORT=8000
export OTLP_GRPC_PORT=4317
export MODEL_ID="facebook/opt-125m"         # swap for your target model
```

## Step 1: Provision the Hetzner GPU instance

Hetzner's `hcloud` CLI provisions instances with a cloud-init script. The script below installs the NVIDIA container toolkit, Docker, and pulls the vLLM Docker image on first boot. Paste it into a file, then create the server.

Save the following as `cloud-init-gpu.yaml`. The `#cloud-config` header has to be the first line of the file.

```yaml
#cloud-config
packages:
  - apt-transport-https
  - ca-certificates
  - curl
  - gnupg

runcmd:
  # Docker
  - curl -fsSL https://download.docker.com/linux/ubuntu/gpg | gpg --dearmor -o /usr/share/keyrings/docker-archive-keyring.gpg
  - echo "deb [arch=amd64 signed-by=/usr/share/keyrings/docker-archive-keyring.gpg] https://download.docker.com/linux/ubuntu jammy stable" > /etc/apt/sources.list.d/docker.list
  - apt-get update -y
  - apt-get install -y docker-ce docker-ce-cli containerd.io docker-compose-plugin
  - systemctl enable docker
  - systemctl start docker
  # NVIDIA container toolkit
  - curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg
  - curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' > /etc/apt/sources.list.d/nvidia-container-toolkit.list
  - apt-get update -y
  - apt-get install -y nvidia-container-toolkit
  - nvidia-ctk runtime configure --runtime=docker
  - systemctl restart docker
  # Pull vLLM image (large download; cloud-init finishes only after it completes)
  - docker pull vllm/vllm-openai:latest
```

Create the server (adjust `--type` to the GPU type available in your quota):

```bash
hcloud server create \
  --name vllm-eu \
  --type gx2-8 \
  --image ubuntu-22.04 \
  --location nbg1 \
  --user-data-from-file cloud-init-gpu.yaml \
  --ssh-key my-key
# After creation, note the public IPv4 and set HETZNER_GPU_IP
```

This block is illustrative. Run it from your local machine after installing `hcloud`.

## Step 2: Deploy SigNoz on a companion CX22 instance

SigNoz ships a Docker Compose stack. Provision a second, CPU-only instance and run the installer.

```bash
hcloud server create \
  --name signoz-eu \
  --type cx22 \
  --image ubuntu-22.04 \
  --location nbg1 \
  --ssh-key my-key
# SSH in, then:
# git clone -b main https://github.com/SigNoz/signoz.git
# cd signoz/deploy && docker compose -f docker/clickhouse-setup/docker-compose.yaml up -d
```

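SigNoz exposes its OTLP gRPC receiver on port 4317 and its UI on port 3301. In Hetzner's firewall rules, open 4317 to the GPU instance's IP only, and open 3301 only to the machine that runs the verification script in Step 4 (your workstation, in this tutorial).

The `hcloud` command above is illustrative. Run it from your local machine, then run the commented commands over SSH on the new instance.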
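
Once the firewall rules are in place, a quick TCP reachability check can confirm them before any spans are sent. The sketch below uses only the standard library. `port_is_open` is a hypothetical helper name, and the two ports are the ones this tutorial assumes (4317 and 3301). Run it from the GPU instance to test the OTLP port and from your workstation to test the UI port.

```python
import os
import socket


def port_is_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unroutable hosts
        return False


if __name__ == "__main__":
    signoz_ip = os.environ.get("SIGNOZ_IP", "127.0.0.1")
    for label, port in (("OTLP gRPC", 4317), ("SigNoz UI", 3301)):
        state = "open" if port_is_open(signoz_ip, port) else "unreachable"
        print(f"{label} {signoz_ip}:{port} -> {state}")
```

An open port only shows that something accepts TCP connections there; it does not prove that the listener is SigNoz or that it accepts OTLP.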

## Step 3: Configure the OTel Collector sidecar

The Collector runs as a Docker container on the GPU instance alongside vLLM. It receives OTLP spans from vLLM on port 4317 (reachable from the vLLM container as `otel-collector:4317`), enriches them with `deployment.region` and `deployment.environment` resource attributes, and forwards them to SigNoz.

Write the Collector config to a file you'll `scp` to the GPU instance:

```yaml
# filename: otel-collector-config.yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318

processors:
  resource:
    attributes:
      - key: deployment.region
        value: "hetzner-nbg1-eu"
        action: upsert
      - key: deployment.environment
        value: "production"
        action: upsert
  batch:
    timeout: 5s
    send_batch_size: 512

exporters:
  otlp:
    endpoint: "${env:SIGNOZ_IP}:4317"
    tls:
      insecure: true
  logging:
    verbosity: detailed

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [resource, batch]
      exporters: [otlp, logging]
    metrics:
      receivers: [otlp]
      processors: [resource, batch]
      exporters: [otlp]
```

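The `logging` exporter writes every span to the Collector's stdout, which is useful during initial setup. Remove it from the `traces` pipeline once you confirm spans are landing in SigNoz. Newer Collector releases replace `logging` with the `debug` exporter, so rename it if you move past the pinned 0.100.0 image.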
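
A Collector pipeline may only reference components that are declared in the top-level `receivers`, `processors`, and `exporters` sections, and a typo there is a common reason for a failed startup. As a rough illustration, the sketch below checks that rule against an already-parsed config dictionary. `undefined_pipeline_components` is a hypothetical helper, and the inline dictionary is a trimmed, hand-written mirror of the config above. Loading the real file would need a YAML parser such as PyYAML, which this tutorial does not install.

```python
def undefined_pipeline_components(config: dict) -> list[str]:
    """List pipeline references that have no matching top-level definition."""
    problems = []
    pipelines = config.get("service", {}).get("pipelines", {})
    for pipeline_name, pipeline in pipelines.items():
        for section in ("receivers", "processors", "exporters"):
            defined = set(config.get(section) or {})
            for component in pipeline.get(section, []):
                if component not in defined:
                    problems.append(f"{pipeline_name}.{section}: {component}")
    return problems


# Trimmed mirror of otel-collector-config.yaml (component settings omitted).
config = {
    "receivers": {"otlp": {}},
    "processors": {"resource": {}, "batch": {}},
    "exporters": {"otlp": {}, "logging": {}},
    "service": {
        "pipelines": {
            "traces": {
                "receivers": ["otlp"],
                "processors": ["resource", "batch"],
                "exporters": ["otlp", "logging"],
            },
            "metrics": {
                "receivers": ["otlp"],
                "processors": ["resource", "batch"],
                "exporters": ["otlp"],
            },
        }
    },
}

print(undefined_pipeline_components(config) or "all pipeline components are defined")
```

This only covers name resolution. The Collector's own `validate` subcommand, used in Step 6, remains the authoritative check.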

Write the Docker Compose file that starts both vLLM and the Collector:

```yaml
# filename: docker-compose-vllm.yaml
services:
  otel-collector:
    image: otel/opentelemetry-collector-contrib:0.100.0
    volumes:
      - ./otel-collector-config.yaml:/etc/otelcol-contrib/config.yaml
    ports:
      - "4317:4317"
      - "4318:4318"
    environment:
      - SIGNOZ_IP=${SIGNOZ_IP}
    restart: unless-stopped

  vllm:
    image: vllm/vllm-openai:latest
    runtime: nvidia
    environment:
      - NVIDIA_VISIBLE_DEVICES=all
      # Service name that audit_client.py filters on
      - OTEL_SERVICE_NAME=vllm
      # The Collector listens without TLS inside the Compose network
      - OTEL_EXPORTER_OTLP_TRACES_INSECURE=true
    command: >
      --model ${MODEL_ID}
      --port 8000
      --otlp-traces-endpoint http://otel-collector:4317
      --collect-detailed-traces all
    ports:
      - "8000:8000"
    depends_on:
      - otel-collector
    restart: unless-stopped
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
```

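The `--otlp-traces-endpoint` flag tells vLLM's built-in OTel instrumentation where to send spans. The `--collect-detailed-traces all` flag adds finer-grained timing for model execution to those spans. vLLM's help text warns that this can affect performance, so consider dropping the flag once the pipeline is verified.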

Deploy to the GPU instance:

```bash
# Run these on your local machine. Hetzner's Ubuntu images log in as root by default.
ssh root@${HETZNER_GPU_IP} "mkdir -p /opt/vllm"
scp otel-collector-config.yaml docker-compose-vllm.yaml root@${HETZNER_GPU_IP}:/opt/vllm/
ssh root@${HETZNER_GPU_IP} \
  "cd /opt/vllm && SIGNOZ_IP=${SIGNOZ_IP} MODEL_ID=${MODEL_ID} docker compose -f docker-compose-vllm.yaml up -d"
```

This block is illustrative. Run it from your local machine.
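
Model loading can continue for a while after `docker compose up -d` returns, so an immediate test request may be refused. The sketch below polls vLLM's `/health` endpoint until it answers with HTTP 200. `http_status` and `wait_until_healthy` are hypothetical helper names, and the attempt count and delay are arbitrary defaults rather than recommended values.

```python
import time
import urllib.error
import urllib.request
from typing import Callable


def http_status(url: str, timeout: float = 5.0) -> int:
    """Return the HTTP status code for a GET request, or 0 if unreachable."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as exc:
        # The server answered, but with an error status
        return exc.code
    except OSError:
        # Connection refused, DNS failure, or timeout
        return 0


def wait_until_healthy(
    url: str,
    attempts: int = 60,
    delay_s: float = 10.0,
    probe: Callable[[str], int] = http_status,
) -> bool:
    """Poll url until it answers 200 or the attempts run out."""
    for attempt in range(1, attempts + 1):
        if probe(url) == 200:
            return True
        if attempt < attempts:
            time.sleep(delay_s)
    return False
```

From your workstation, a call such as `wait_until_healthy(f"http://{HETZNER_GPU_IP}:8000/health")` (with the IP taken from your environment) returns `True` once the server is ready. The `probe` parameter exists so that the loop can be exercised without a network.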

## Step 4: Write the audit-span client

This Python module sends a completion request through vLLM's OpenAI-compatible endpoint and then queries SigNoz's trace API to confirm that the span attributes landed.

```python
# filename: audit_client.py
import os
import time
import json
import requests
from openai import OpenAI

VLLM_BASE_URL = f"http://{os.environ.get('HETZNER_GPU_IP', '127.0.0.1')}:{os.environ.get('VLLM_PORT', '8000')}/v1"
SIGNOZ_BASE_URL = f"http://{os.environ.get('SIGNOZ_IP', '127.0.0.1')}:3301"
MODEL_ID = os.environ.get("MODEL_ID", "facebook/opt-125m")

# Required span attributes for AI Act audit records
REQUIRED_SPAN_ATTRS = [
    "gen_ai.model.id",
    "gen_ai.usage.prompt_tokens",
    "gen_ai.usage.completion_tokens",
    "gen_ai.response.finish_reasons",
    "deployment.region",
]


def send_completion(prompt: str) -> dict:
    """Send a single text completion and return the audit-relevant fields."""
    client = OpenAI(base_url=VLLM_BASE_URL, api_key="not-needed")
    # Use the plain completions API: base models such as facebook/opt-125m
    # ship no chat template, so the chat endpoint may reject the request.
    response = client.completions.create(
        model=MODEL_ID,
        prompt=prompt,
        max_tokens=64,
    )
    return {
        "id": response.id,
        "model": response.model,
        "prompt_tokens": response.usage.prompt_tokens,
        "completion_tokens": response.usage.completion_tokens,
        "finish_reason": response.choices[0].finish_reason,
    }


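def query_signoz_spans(service_name: str = "vllm", lookback_minutes: int = 5) -> list:
    """Query SigNoz for recent spans from the vLLM service.

    The endpoint path, query parameters, and response shape used here are
    assumptions. SigNoz's query API differs between versions and may require
    an API key, so adjust them to match your deployment.
    """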
    now_ms = int(time.time() * 1000)
    start_ms = now_ms - lookback_minutes * 60 * 1000
    url = f"{SIGNOZ_BASE_URL}/api/v1/spans"
    params = {
        "service": service_name,
        "start": start_ms,
        "end": now_ms,
        "limit": 20,
    }
    resp = requests.get(url, params=params, timeout=10)
    resp.raise_for_status()
    return resp.json().get("spans", [])


def verify_audit_attributes(spans: list) -> dict:
    """Report, for each required attribute, whether any recent span carries it."""
    results = {attr: False for attr in REQUIRED_SPAN_ATTRS}
    for span in spans:
        attrs = span.get("attributes", {})
        for attr in REQUIRED_SPAN_ATTRS:
            if attr in attrs:
                results[attr] = True
    return results


if __name__ == "__main__":
    print(f"Sending test completion to {VLLM_BASE_URL} ...")
    try:
        result = send_completion("Summarize the EU AI Act in one sentence.")
        print(f"Response received: {json.dumps(result, indent=2)}")
    except Exception as exc:
        print(f"vLLM call failed (expected in sandbox without GPU): {exc}")

    print("\nQuerying SigNoz for audit spans ...")
    try:
        spans = query_signoz_spans()
        print(f"Found {len(spans)} recent spans")
        audit_check = verify_audit_attributes(spans)
        print("Audit attribute coverage:")
        for attr, present in audit_check.items():
            status = "PASS" if present else "MISSING"
            print(f"  {status}  {attr}")
        missing = [a for a, ok in audit_check.items() if not ok]
        if missing:
            print(f"\nWARNING: {len(missing)} required attributes not found in recent spans.")
        else:
            print("\nAll required audit attributes confirmed in SigNoz.")
    except Exception as exc:
        print(f"SigNoz query failed (expected in sandbox without live deployment): {exc}")
```
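
The check in `verify_audit_attributes` passes as soon as each attribute shows up somewhere in the batch, even if no single span carries the full set. For an audit record, a per-span check is stricter. A minimal sketch of that variant follows. It assumes each span is a dictionary with an `attributes` mapping, which is the same shape `audit_client.py` assumes, and `spans_with_gaps` is a hypothetical helper name.

```python
REQUIRED_SPAN_ATTRS = (
    "gen_ai.model.id",
    "gen_ai.usage.prompt_tokens",
    "gen_ai.usage.completion_tokens",
    "gen_ai.response.finish_reasons",
    "deployment.region",
)


def spans_with_gaps(spans: list[dict], required=REQUIRED_SPAN_ATTRS) -> dict[int, list[str]]:
    """Map the index of each incomplete span to the attributes it lacks."""
    gaps = {}
    for index, span in enumerate(spans):
        attrs = span.get("attributes") or {}
        missing = [name for name in required if name not in attrs]
        if missing:
            gaps[index] = missing
    return gaps


# Hand-written sample spans, not real SigNoz output.
complete = {"attributes": {name: "x" for name in REQUIRED_SPAN_ATTRS}}
partial = {"attributes": {"gen_ai.model.id": "facebook/opt-125m"}}

print(spans_with_gaps([complete, partial]))
```

An empty result means every span in the batch is complete. A scheduled job could exit non-zero whenever the result is non-empty.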

## Step 5: Emit and verify spans locally with the OTel SDK

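Before pointing production traffic at the stack, check your span schema by emitting a synthetic span locally with the OTel Python SDK. The console exporter prints the span as the SDK recorded it, so you can confirm the attribute names and types. It does not prove that vLLM itself emits these attributes; the client from Step 4 checks that against the live deployment.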

```python
# filename: emit_test_span.py
import time
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import SimpleSpanProcessor, ConsoleSpanExporter
from opentelemetry.sdk.resources import Resource

# Build a tracer that mimics what vLLM emits
resource = Resource.create({
    "service.name": "vllm",
    "deployment.region": "hetzner-nbg1-eu",
    "deployment.environment": "production",
})

provider = TracerProvider(resource=resource)
provider.add_span_processor(SimpleSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)
tracer = trace.get_tracer("vllm.inference")


def emit_inference_span(
    model_id: str,
    prompt_tokens: int,
    completion_tokens: int,
    finish_reason: str,
) -> str:
    """Emit a single inference span with all required audit attributes."""
    with tracer.start_as_current_span("llm.generate") as span:
        span.set_attribute("gen_ai.model.id", model_id)
        span.set_attribute("gen_ai.usage.prompt_tokens", prompt_tokens)
        span.set_attribute("gen_ai.usage.completion_tokens", completion_tokens)
        span.set_attribute("gen_ai.response.finish_reasons", [finish_reason])
        span.set_attribute("gen_ai.system", "vllm")
        span.set_attribute("gen_ai.operation.name", "chat")
        # Simulate inference latency
        time.sleep(0.01)
        return trace.format_trace_id(span.get_span_context().trace_id)


if __name__ == "__main__":
    trace_id = emit_inference_span(
        model_id="facebook/opt-125m",
        prompt_tokens=42,
        completion_tokens=17,
        finish_reason="stop",
    )
    print(f"trace_id={trace_id}")
```

Run the emitter and capture the console output to confirm the span structure:

```python
import contextlib
import importlib.util
import io

# Load emit_test_span.py from the current directory while stdout is redirected.
# ConsoleSpanExporter binds to sys.stdout when the SDK is first imported, so
# run this in a fresh interpreter for the capture to work.
captured = io.StringIO()
with contextlib.redirect_stdout(captured):
    spec = importlib.util.spec_from_file_location("emit_test_span", "emit_test_span.py")
    mod = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(mod)
    trace_id = mod.emit_inference_span(
        model_id="facebook/opt-125m",
        prompt_tokens=42,
        completion_tokens=17,
        finish_reason="stop",
    )

output = captured.getvalue()

# Verify required audit attributes appear in the span output
required = [
    "gen_ai.model.id",
    "gen_ai.usage.prompt_tokens",
    "gen_ai.usage.completion_tokens",
    "gen_ai.response.finish_reasons",
    "deployment.region",
]
missing = [attr for attr in required if attr not in output]
if missing:
    print(f"FAIL: missing attributes in span output: {missing}")
    print("--- captured output ---")
    print(output[:2000])
else:
    print("PASS: all required audit attributes present in span")
print(f"trace_id={trace_id}")
```
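
Substring matching also passes when an attribute name appears in the wrong place, for example inside a value. A somewhat stricter option is to parse the captured text and look the keys up directly. The sketch below assumes the console exporter's default format, in which each span is printed as a JSON object with span attributes under `attributes` and resource attributes under `resource.attributes`. `parse_console_spans` and `all_attributes` are hypothetical helpers, and the sample string is hand-written rather than real exporter output.

```python
import json


def parse_console_spans(output: str) -> list[dict]:
    """Decode the back-to-back JSON objects that the console exporter prints."""
    decoder = json.JSONDecoder()
    spans, pos = [], 0
    while pos < len(output):
        # raw_decode does not skip leading whitespace, so do it here
        while pos < len(output) and output[pos].isspace():
            pos += 1
        if pos >= len(output):
            break
        obj, pos = decoder.raw_decode(output, pos)
        spans.append(obj)
    return spans


def all_attributes(span: dict) -> dict:
    """Merge resource-level and span-level attributes into one mapping."""
    merged = dict((span.get("resource") or {}).get("attributes") or {})
    merged.update(span.get("attributes") or {})
    return merged


# Hand-written sample in the assumed exporter format, trimmed to two keys.
sample = """
{"name": "llm.generate",
 "attributes": {"gen_ai.usage.prompt_tokens": 42},
 "resource": {"attributes": {"deployment.region": "hetzner-nbg1-eu"}}}
"""

for span in parse_console_spans(sample):
    print(span["name"], sorted(all_attributes(span)))
```

Passing the `output` string from the capture script to `parse_console_spans` gives one dictionary per emitted span, which also makes type checks possible (for instance, that the token counts are integers).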

## Step 6: Validate the Collector config locally

The OTel Collector binary can validate a config file without a running backend. Pull the contrib image and run `validate` against the config you wrote in Step 3.

```bash
docker run --rm \
  -v "$(pwd)/otel-collector-config.yaml":/etc/otelcol-contrib/config.yaml \
  -e SIGNOZ_IP=203.0.113.20 \
  otel/opentelemetry-collector-contrib:0.100.0 \
  validate --config /etc/otelcol-contrib/config.yaml
```

## Verify it works

With the full stack running on Hetzner, run the audit client against the live deployment:

```python
import subprocess
import sys

# Run the client from the directory that contains audit_client.py
result = subprocess.run(
    [sys.executable, "audit_client.py"],
    capture_output=True, text=True
)
print(result.stdout)
if result.returncode != 0:
    print(result.stderr)
```

Expected output when the stack is live:

```
Sending test completion to http://203.0.113.10:8000/v1 ...
Response received: {
  "id": "cmpl-abc123",
  "model": "facebook/opt-125m",
  "prompt_tokens": 14,
  "completion_tokens": 64,
  "finish_reason": "length"
}

Querying SigNoz for audit spans ...
Found 3 recent spans
Audit attribute coverage:
  PASS  gen_ai.model.id
  PASS  gen_ai.usage.prompt_tokens
  PASS  gen_ai.usage.completion_tokens
  PASS  gen_ai.response.finish_reasons
  PASS  deployment.region

All required audit attributes confirmed in SigNoz.
```

Without a live deployment (no GPU instance, no SigNoz), the vLLM and SigNoz calls fail, and the script catches the exceptions and prints their messages instead.

## Troubleshooting

**vLLM starts but no spans appear in SigNoz.** Check that `--otlp-traces-endpoint` points to the Collector container name (`http://otel-collector:4317`), not `localhost`. Inside Docker Compose, `localhost` resolves to the vLLM container itself, not the Collector sidecar. Confirm with `docker compose logs otel-collector` that the Collector is receiving connections.

**Collector starts but cannot reach SigNoz, or exits with an `otlp` exporter configuration error.** A likely cause is that the `SIGNOZ_IP` variable is empty inside the container, which leaves the exporter endpoint without a host. Pass it explicitly with `-e SIGNOZ_IP=...` in the Docker run command, or add it to a `.env` file in the same directory as `docker-compose-vllm.yaml`. Docker Compose reads `.env` automatically.

**SigNoz UI shows spans but `gen_ai.*` attributes are missing.** Your vLLM version may predate its GenAI span attributes, or it may emit them under different keys than the ones this tutorial checks. Upgrade to a recent vLLM release, then compare the attribute keys printed in the Collector's stdout against `REQUIRED_SPAN_ATTRS` in `audit_client.py`. Check the running version with `docker exec vllm-vllm-1 python -c "import importlib.metadata; print(importlib.metadata.version('vllm'))"`.

**Hetzner GPU quota request is pending.** GPU quota is not granted automatically. Submit the request in the Hetzner Cloud Console under Account > Limits, describe your workload, and expect 1-3 business days. In the meantime, test the OTel pipeline on a CPU-only instance with a small model (`facebook/opt-125m`): remove the `runtime: nvidia` and `deploy.resources` keys from the Compose file and swap in a CPU build of vLLM, because the default `vllm/vllm-openai` image expects CUDA.

**SigNoz ClickHouse container runs out of disk.** Trace data accumulates for the full retention period, and on a CX22 (40 GB disk) a busy inference endpoint can fill the disk in days. Shorten the trace retention period in SigNoz's settings before sending production traffic, or mount a Hetzner Volume for the ClickHouse data directory.

**`docker-compose` command not found on the GPU instance.** The cloud-init script installs `docker-compose-plugin`, which exposes the command as `docker compose` (with a space), not `docker-compose` (with a hyphen). If you have scripts that call the hyphenated form, either update them or install the legacy standalone binary: `apt-get install -y docker-compose`.
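
For the disk-space item above, a back-of-the-envelope estimate shows how quickly a small disk fills. The sketch below is plain arithmetic. Every input (usable disk, request rate, spans per request, bytes per span) is an illustrative assumption rather than a measurement, and `days_until_full` is a hypothetical helper. ClickHouse compresses data on disk, so treat the result as a pessimistic bound.

```python
def days_until_full(
    disk_gb: float,
    requests_per_second: float,
    spans_per_request: float,
    bytes_per_span: float,
) -> float:
    """Estimate how many days of traces fit on a disk, ignoring compression."""
    bytes_per_day = requests_per_second * spans_per_request * bytes_per_span * 86_400
    return disk_gb * 1e9 / bytes_per_day


# Illustrative inputs only: 30 GB usable, 50 req/s, 3 spans each, 2 KB per span.
estimate = days_until_full(30, 50, 3, 2_000)
print(f"about {estimate:.1f} days of traces fit")
```

Replace the inputs with figures measured on your own deployment before choosing a retention period or a volume size.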

## Next steps

- Add a Prometheus receiver to the Collector config and scrape vLLM's `/metrics` endpoint for KV cache usage, cache hit rate, and queue depth alongside the trace data.
- Write a Grafana dashboard that lines up SigNoz trace latency with those Prometheus metrics over the same time window, so that latency spikes can be correlated with cache misses. Joining directly on `trace_id` would require the metrics to carry exemplars.
- Implement a span processor that redacts PII from `gen_ai.prompt` attributes before they leave the GPU instance, satisfying GDPR Article 25 data-minimisation requirements.
- Extend the `audit_client.py` verification script into a nightly CI job that queries SigNoz and fails the build if any required audit attribute is absent from the previous 24 hours of spans.

## FAQ

### What span attributes does vLLM emit for audit compliance?

vLLM emits gen_ai.model.id, gen_ai.usage.prompt_tokens, gen_ai.usage.completion_tokens, and gen_ai.response.finish_reasons as first-class span attributes when an OTLP endpoint is configured. The OTel Collector adds deployment.region and deployment.environment attributes for full audit context.

### How does the OTel Collector connect vLLM to SigNoz?

The Collector runs as a Docker sidecar on the same Hetzner instance, receives OTLP spans from vLLM on port 4317, enriches them with deployment metadata, and forwards them to SigNoz's OTLP gRPC endpoint on port 4317. vLLM is configured with --otlp-traces-endpoint pointing to the Collector container by name.

### Why must vLLM use the Collector container name instead of localhost?

Inside Docker Compose, localhost resolves to the vLLM container itself, not the Collector sidecar. Using the container name (http://otel-collector:4317) ensures vLLM reaches the Collector through Docker's internal DNS.

### What Hetzner instance types support this setup?

GPU instances AX52, GX2, and CCX-series with A100 access in Nuremberg (nbg1) and Falkenstein (fsn1) locations are suitable. The tutorial uses gx2-8 as an example. SigNoz runs on a separate CX22 CPU instance.

### How can I verify spans are landing in SigNoz before deploying to production?

Run emit_test_span.py locally to emit a synthetic span with the OTel SDK and ConsoleSpanExporter, which confirms the attribute names and types before anything is deployed. Then validate the Collector config with docker run otel/opentelemetry-collector-contrib validate before deploying to Hetzner.

