# Self-Hosted vLLM on Hetzner with Audit-Grade OpenTelemetry Tracing

> Deploy a production-ready LLM inference stack on a Hetzner GPU node in Falkenstein or Helsinki, with an OpenTelemetry Collector sidecar forwarding structured LLM spans to a self-hosted SigNoz instance. Every token stays inside EU borders, and every request is traceable to the millisecond.

- Canonical URL: https://agentry.press/tutorial/self-hosted-vllm-on-hetzner-with-audit-grade-opentelemetry-tracing/
- Type: Tutorial
- Published: 2026-06-06
- By: agentry
- Tags: vllm, hetzner, opentelemetry, signoz, sovereign-cloud, llm-observability

---

## Why this matters

EU data-residency requirements are no longer optional for many regulated industries. GDPR Article 44 prohibits transferring personal data to third countries without adequate safeguards, and LLM inference logs, which contain prompt text and completions, are increasingly treated as personal data by data protection authorities. Managed inference APIs from US-headquartered providers route traffic through US regions by default, and even "EU endpoints" often replicate logs to US-based control planes.

Hetzner Cloud operates data centers in Nuremberg, Falkenstein, and Helsinki with no US parent company, making it a natural fit for sovereign-cloud LLM deployments. vLLM's OpenAI-compatible server exposes a `/metrics` endpoint and supports OpenTelemetry tracing natively as of v0.4.x, but wiring those signals into a durable, queryable backend requires an explicit collector pipeline. Without it, operators running multi-turn agent workloads have no way to correlate TTFT regressions with cache-miss rates, token budgets, or downstream tool calls.

This tutorial builds that pipeline end-to-end: vLLM serving a small open-weight model, an OTel Collector sidecar, and SigNoz as the trace and metrics backend, all on a single Hetzner node with a Docker Compose manifest you can commit to your infrastructure repo.

## Prerequisites

- A Hetzner Cloud account with access to GPU instances (GX series in any EU region; the CCX line is dedicated-vCPU only and carries no GPU)
- Docker Engine 24+ and Docker Compose v2 installed on the Hetzner node
- Python 3.11 or 3.12 on your local machine (for the verification scripts in this tutorial)
- Familiarity with vLLM CLI flags and the OpenAI chat completions API
- An SSH key added to your Hetzner project
- `curl` and `jq` available on the node

## Setup

Install the Python packages used for local verification and span inspection. These run on your laptop or CI machine, not on the Hetzner node itself.

```bash
uv pip install openai opentelemetry-sdk opentelemetry-exporter-otlp-proto-grpc requests
```

Export the node's public IP and the ports you'll expose. Adjust these to match your Hetzner firewall rules.

```bash
export HETZNER_NODE_IP="203.0.113.42"
export VLLM_PORT=8000
export OTEL_COLLECTOR_GRPC_PORT=4317
export SIGNOZ_UI_PORT=3301
export MODEL_ID="Qwen/Qwen2.5-1.5B-Instruct"
```

## Step 1: Provision the Hetzner Node

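Create a GPU-enabled server via the Hetzner Cloud Console or the `hcloud` CLI. The GX2-120 (NVIDIA A30) in Falkenstein (`fsn1`) is the smallest GPU instance available as of mid-2025; server types change, so confirm the exact name with `hcloud server-type list` before provisioning. For a 7B-parameter model in FP16 the weights alone take about 14 GB (2 bytes per parameter), so you need at least 16 GB VRAM, and more for long contexts. For the 1.5B model used in this tutorial, any GX instance works.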

```bash
# Run this on your local machine with hcloud CLI installed
# hcloud server create \
#   --name vllm-eu-node \
#   --type gx2-120 \
#   --location fsn1 \
#   --image ubuntu-24.04 \
#   --ssh-key your-key-name
#
# After creation, SSH in:
# ssh root@<node-ip>
echo "Hetzner provisioning commands shown above — run locally with hcloud CLI"
```
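The VRAM guidance in this step follows from simple arithmetic: FP16 stores each parameter in 2 bytes. The sketch below estimates weight memory only. It ignores KV cache, activations, and CUDA overhead, so treat the result as a lower bound. The function name `weight_memory_gib` is illustrative and not part of vLLM.

```python
def weight_memory_gib(params_billion: float, bytes_per_param: int = 2) -> float:
    """Memory needed for model weights alone, in GiB (FP16/BF16 = 2 bytes/param)."""
    return params_billion * 1e9 * bytes_per_param / 2**30


# Nominal parameter counts; real checkpoints differ slightly.
for name, size in [("1.5B model", 1.5), ("7B model", 7.0)]:
    print(f"{name}: ~{weight_memory_gib(size):.1f} GiB of FP16 weights")
# 1.5B model: ~2.8 GiB of FP16 weights
# 7B model: ~13.0 GiB of FP16 weights
```

Whatever is left after the weights is what vLLM can use for KV cache, which is why a 16 GB card is the practical floor for a 7B model.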

Once on the node, install Docker and the NVIDIA Container Toolkit:

```bash
# Run on the Hetzner node after SSH
# curl -fsSL https://get.docker.com | sh
# NVIDIA publishes one distribution-independent apt repository (stable/deb)
# curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | \
#   gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg
# curl -fsSL https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
#   sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
#   tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
# apt-get update && apt-get install -y nvidia-container-toolkit
# nvidia-ctk runtime configure --runtime=docker
# systemctl restart docker
echo "Node setup commands shown above — run on the Hetzner node"
```

## Step 2: Write the Docker Compose Stack

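The stack has three services: `vllm`, `otel-collector`, and `signoz`. A full SigNoz deployment is itself several containers (ClickHouse, SigNoz's own OTel collector for ingest, the query service, and the UI), shipped as a compose file that you embed as a Git submodule or copy inline. For clarity this tutorial collapses that into a single `signoz` service. Treat it as a placeholder and substitute the services from SigNoz's official compose file, keeping an OTLP gRPC listener reachable at `signoz:4317`.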

Write the main compose file:

```yaml
# filename: docker-compose.yml (save on the Hetzner node at /opt/vllm-stack/)
version: "3.9"

services:
  vllm:
    image: vllm/vllm-openai:latest
    runtime: nvidia
    environment:
      - NVIDIA_VISIBLE_DEVICES=all
      - VLLM_OTLP_TRACES_ENDPOINT=http://otel-collector:4317
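      # The collector listens on plaintext gRPC inside the compose network
      - OTEL_EXPORTER_OTLP_TRACES_INSECURE=true
      # Sets service.name on exported spans (used when filtering in SigNoz)
      - OTEL_SERVICE_NAME=vllm
      # Do not set VLLM_TRACE_FUNCTION here: it is a debugging switch that logs
      # Python function calls and slows inference; it is unrelated to OTel.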
    command: >
      --model Qwen/Qwen2.5-1.5B-Instruct
      --served-model-name qwen2.5-1.5b
      --max-model-len 8192
      --enable-chunked-prefill
      --otlp-traces-endpoint http://otel-collector:4317
    ports:
      - "8000:8000"
    depends_on:
      - otel-collector
    volumes:
      - hf-cache:/root/.cache/huggingface
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]

  otel-collector:
    image: otel/opentelemetry-collector-contrib:0.102.0
    volumes:
      - ./otel-collector-config.yaml:/etc/otelcol-contrib/config.yaml
    ports:
      - "4317:4317"
      - "4318:4318"
      - "8888:8888"
    command: ["--config=/etc/otelcol-contrib/config.yaml"]
    depends_on:
      - signoz

  signoz:
    image: signoz/signoz:latest
    ports:
      - "3301:3301"
    environment:
      - SIGNOZ_TELEMETRY_ENABLED=false
    volumes:
      - signoz-data:/var/lib/signoz

volumes:
  hf-cache:
  signoz-data:
```

Now write the OTel Collector configuration. This is the critical piece: it receives OTLP spans from vLLM, enriches them with resource attributes, and forwards them to SigNoz's OTLP ingest endpoint.

```yaml
# filename: otel-collector-config.yaml (save alongside docker-compose.yml)
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318

processors:
  batch:
    timeout: 5s
    send_batch_size: 512

  resource:
    attributes:
      - key: deployment.region
        value: "hetzner-fsn1"
        action: upsert
      - key: deployment.environment
        value: "production"
        action: upsert
      - key: cloud.provider
        value: "hetzner"
        action: upsert
      - key: cloud.region
        value: "eu-central"
        action: upsert

  # Drop any spans that accidentally contain PII field names
  # Adjust the pattern to match your data classification policy
  filter/pii:
    error_mode: ignore
    traces:
      span:
        - 'attributes["user.email"] != nil'

exporters:
  otlp/signoz:
    endpoint: signoz:4317
    tls:
      insecure: true

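  # Secondary: write a JSON line per span to a local audit log.
  # /tmp sits in the container's writable layer; mount a volume over this path
  # if the log must survive container recreation.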
  file:
    path: /tmp/audit-spans.jsonl
    rotation:
      max_megabytes: 100
      max_days: 30
      max_backups: 10

  debug:
    verbosity: basic

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [resource, filter/pii, batch]
      exporters: [otlp/signoz, file, debug]
    metrics:
      receivers: [otlp]
      processors: [resource, batch]
      exporters: [otlp/signoz]
```
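To make the pipeline's behaviour concrete, here is a plain-Python model of what the `resource` and `filter/pii` processors do to incoming data. This is not collector code and the function names are illustrative. It only mirrors the semantics configured above: `upsert` inserts or overwrites a resource attribute, and a span is dropped when the filter condition matches.

```python
# Mirrors the `resource` processor block in otel-collector-config.yaml
RESOURCE_UPSERTS = {
    "deployment.region": "hetzner-fsn1",
    "deployment.environment": "production",
    "cloud.provider": "hetzner",
    "cloud.region": "eu-central",
}


def apply_resource(resource_attrs: dict) -> dict:
    """`upsert` adds the key if absent and overwrites it if present."""
    return {**resource_attrs, **RESOURCE_UPSERTS}


def keep_span(span_attrs: dict) -> bool:
    """filter/pii drops a span when `attributes["user.email"] != nil` is true."""
    return span_attrs.get("user.email") is None


spans = [
    {"gen_ai.request.model": "qwen2.5-1.5b"},
    {"gen_ai.request.model": "qwen2.5-1.5b", "user.email": "a@example.com"},
]
kept = [s for s in spans if keep_span(s)]

print(apply_resource({"service.name": "vllm", "cloud.provider": "unknown"}))
print(f"kept {len(kept)} of {len(spans)} spans")
```

Note that the filter removes the whole span, not just the offending attribute. If you would rather keep the span and strip the field, that is a different processor and a different policy decision.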

> [!PULLQUOTE]
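> The file exporter writes one JSON line per span to a local audit log, giving you a node-resident record of every request that can support Article 30 record-keeping obligations without shipping data off the node. Tamper evidence requires an integrity mechanism on top, such as hashing or write-once storage.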
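A plain JSONL file only becomes tamper-evident when something outside the file records what it should contain. One common approach is a hash chain: each line's digest also covers the previous digest, so editing or deleting any line invalidates everything after it. The sketch below is a minimal illustration of that idea, not a feature of the OTel Collector. The names `GENESIS`, `chain_digests`, and `first_mismatch` are hypothetical.

```python
import hashlib
import json

GENESIS = "0" * 64  # hypothetical fixed starting value for the chain


def chain_digests(lines: list[str]) -> list[str]:
    """Return one SHA-256 hex digest per line, each covering the previous digest."""
    digests, prev = [], GENESIS
    for line in lines:
        prev = hashlib.sha256((prev + line.rstrip("\n")).encode("utf-8")).hexdigest()
        digests.append(prev)
    return digests


def first_mismatch(lines: list[str], recorded: list[str]) -> int:
    """Index of the first line whose digest differs from the record, or -1 if intact."""
    recomputed = chain_digests(lines)
    for i, (expected, actual) in enumerate(zip(recorded, recomputed)):
        if expected != actual:
            return i
    if len(recomputed) != len(recorded):
        return min(len(recomputed), len(recorded))  # lines were added or removed
    return -1


log = [json.dumps({"span": i}) for i in range(3)]
recorded = chain_digests(log)          # store these somewhere the node cannot rewrite
tampered = [log[0], json.dumps({"span": 99}), log[2]]

print(first_mismatch(log, recorded))       # -1
print(first_mismatch(tampered, recorded))  # 1
```

The digests are only useful if they live somewhere the node operator cannot silently rewrite, for example a separate system or write-once storage.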

## Step 3: Understand the vLLM Span Structure

vLLM emits OpenTelemetry traces for each request through its `--otlp-traces-endpoint` flag. Each top-level span covers the full request lifecycle and carries semantic attributes from the OpenTelemetry GenAI semantic conventions:

- `gen_ai.system`: always `"vllm"`
- `gen_ai.request.model`: the served model name
- `gen_ai.request.max_tokens`: from the request body
- `gen_ai.usage.prompt_tokens` and `gen_ai.usage.completion_tokens`: token counts
- `gen_ai.response.finish_reasons`: array of finish reasons

Child spans cover prefill scheduling, CUDA kernel execution, and detokenization. The collector's `resource` processor adds your Hetzner-specific attributes so every span in SigNoz is queryable by region and environment without modifying vLLM's source.

The QLoRA fine-tuning research in [1] shows that smaller models like Qwen3-4B run 2.5x faster than Gemma 4B equivalents while using 62% less memory. That makes a 1.5B-parameter Qwen2.5 model a practical choice for a single-GPU Hetzner node, and the span-level token counts let you verify those efficiency claims against your own workload in real time.

## Step 4: Deploy and Smoke-Test the Stack

On the Hetzner node, start the stack:

```bash
# Run on the Hetzner node
# cd /opt/vllm-stack
# docker compose up -d
# docker compose ps
# docker compose logs vllm --tail 40
echo "Docker Compose commands shown above — run on the Hetzner node after saving the compose files"
```

vLLM downloads the model weights on first start (roughly 3 GB for Qwen2.5-1.5B). The `hf-cache` volume persists them across restarts. Wait for the log line `Application startup complete` before sending requests.
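Rather than watching the logs by hand, you can poll the server until it answers. The sketch below assumes vLLM's OpenAI-compatible server exposes a `/health` route that returns HTTP 200 once the model is loaded. The function name, timeout, and polling interval are illustrative choices, and the `fetch` parameter exists only so the loop can be exercised without a live server.

```python
import time
import urllib.request


def wait_until_ready(url: str, timeout_s: float = 600.0, interval_s: float = 5.0,
                     fetch=None) -> bool:
    """Poll `url` until it returns HTTP 200 or `timeout_s` elapses."""
    def default_fetch(target: str) -> int:
        with urllib.request.urlopen(target, timeout=5) as resp:
            return resp.status

    fetch = fetch or default_fetch
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            if fetch(url) == 200:
                return True
        except OSError:
            # Covers refused connections, timeouts, and urllib's HTTP errors
            pass
        time.sleep(interval_s)
    return False
```

Call it as `wait_until_ready(f"http://{node_ip}:8000/health")` before running the request script in the next step. The generous default timeout allows for the first-start model download.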

## Step 5: Send a Traced Request and Inspect the Span

This Python script sends a chat completion request to vLLM's OpenAI-compatible endpoint and prints the response. It reads the target host from `HETZNER_NODE_IP` and falls back to `localhost` when the variable is unset, so export your node IP first and ensure your Hetzner firewall allows port 8000 from your client IP.

```python
# filename: send_request.py
import os
import json
from openai import OpenAI

def send_traced_request(base_url: str, model: str, prompt: str) -> dict:
    client = OpenAI(
        base_url=base_url,
        api_key="not-needed-for-local",
    )
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": prompt},
        ],
        max_tokens=128,
        temperature=0.1,
        extra_headers={
            # Correlation ID for matching client logs to server logs. This is
            # not W3C trace context; linking spans needs a `traceparent` header.
            "X-Request-ID": "audit-demo-001",
        },
    )
    return {
        "id": response.id,
        "model": response.model,
        "prompt_tokens": response.usage.prompt_tokens,
        "completion_tokens": response.usage.completion_tokens,
        "finish_reason": response.choices[0].finish_reason,
        "content": response.choices[0].message.content,
    }

if __name__ == "__main__":
    node_ip = os.environ.get("HETZNER_NODE_IP", "localhost")
    result = send_traced_request(
        base_url=f"http://{node_ip}:8000/v1",
        model="qwen2.5-1.5b",
        prompt="List three EU data-residency best practices for LLM deployments.",
    )
    print(json.dumps(result, indent=2))
```
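`X-Request-ID` is a correlation ID, not trace context. To have the client choose the trace ID, so that an auditor can search SigNoz for a request the client logged, send a W3C `traceparent` header instead. The sketch below builds one with the standard library. It assumes vLLM honours incoming `traceparent` headers when tracing is enabled, which you should confirm for your vLLM version. The helper name `new_traceparent` is hypothetical.

```python
import secrets


def new_traceparent() -> tuple[str, str]:
    """Return (trace_id, header_value) in W3C Trace Context format.

    Layout: version "00", 16-byte trace ID, 8-byte parent span ID,
    flags "01" (sampled), all lowercase hex, joined by hyphens.
    """
    trace_id = secrets.token_hex(16)   # 32 hex characters
    parent_id = secrets.token_hex(8)   # 16 hex characters
    return trace_id, f"00-{trace_id}-{parent_id}-01"


trace_id, header = new_traceparent()
print(f"log this trace_id client-side: {trace_id}")
print(f"traceparent: {header}")
```

Pass the value through `extra_headers={"traceparent": header}` in the `client.chat.completions.create` call. If the client already runs the OpenTelemetry SDK, `opentelemetry.propagate.inject(headers)` writes the same header from the active span.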

## Step 6: Build a Local OTel Verification Harness

This block demonstrates the span structure your collector will receive, using an in-process OTel SDK with a console exporter. It does not require the Hetzner node to be running and executes entirely on your local machine.

```python
# filename: span_demo.py
import sys
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import SimpleSpanProcessor, ConsoleSpanExporter
from opentelemetry.sdk.resources import Resource

def build_provider(service_name: str, region: str) -> TracerProvider:
    resource = Resource.create({
        "service.name": service_name,
        "cloud.provider": "hetzner",
        "cloud.region": region,
        "deployment.environment": "production",
    })
    provider = TracerProvider(resource=resource)
    # Resolve sys.stdout at call time. ConsoleSpanExporter binds its default
    # `out` when the SDK is imported, so a later sys.stdout redirect is missed.
    provider.add_span_processor(
        SimpleSpanProcessor(ConsoleSpanExporter(out=sys.stdout))
    )
    return provider

def simulate_vllm_span(
    tracer: trace.Tracer,
    model: str,
    prompt_tokens: int,
    completion_tokens: int,
    finish_reason: str,
) -> str:
    with tracer.start_as_current_span("llm.generate") as span:
        span.set_attribute("gen_ai.system", "vllm")
        span.set_attribute("gen_ai.request.model", model)
        span.set_attribute("gen_ai.usage.prompt_tokens", prompt_tokens)
        span.set_attribute("gen_ai.usage.completion_tokens", completion_tokens)
        span.set_attribute("gen_ai.response.finish_reasons", [finish_reason])
        span.set_attribute("cloud.region", "eu-central")
        span.set_attribute("cloud.provider", "hetzner")
        ctx = trace.get_current_span().get_span_context()
        return format(ctx.trace_id, "032x")
```

Now run the demo and capture the span output:

```python
import io
import sys
from span_demo import build_provider, simulate_vllm_span
from opentelemetry import trace

buf = io.StringIO()
sys.stdout = buf

provider = build_provider("vllm-eu", "hetzner-fsn1")
tracer = provider.get_tracer("vllm-demo")
trace_id = simulate_vllm_span(
    tracer,
    model="qwen2.5-1.5b",
    prompt_tokens=47,
    completion_tokens=128,
    finish_reason="stop",
)

sys.stdout = sys.__stdout__
output = buf.getvalue()

assert "gen_ai.system" in output, "Missing gen_ai.system attribute"
assert "vllm" in output, "Missing vllm system value"
assert "hetzner" in output, "Missing cloud.provider attribute"
assert "eu-central" in output, "Missing cloud.region attribute"
assert "prompt_tokens" in output, "Missing prompt_tokens"

print("span_structure_verified")
print(f"trace_id_length={len(trace_id)}")
print(f"trace_id_sample={trace_id[:8]}...")
```

## Step 7: Configure Hetzner Firewall Rules

Lock down the node so only your CIDR block reaches the inference and collector ports. The SigNoz UI (3301) should be accessible only from your office or VPN IP.

```bash
# Run with hcloud CLI from your local machine
# Replace 198.51.100.0/24 with your actual CIDR

# hcloud firewall create --name vllm-stack-fw
# hcloud firewall add-rule vllm-stack-fw \
#   --direction in --protocol tcp --port 22 \
#   --source-ips 198.51.100.0/24
# hcloud firewall add-rule vllm-stack-fw \
#   --direction in --protocol tcp --port 8000 \
#   --source-ips 198.51.100.0/24
# hcloud firewall add-rule vllm-stack-fw \
#   --direction in --protocol tcp --port 3301 \
#   --source-ips 198.51.100.0/24
# hcloud firewall add-rule vllm-stack-fw \
#   --direction in --protocol tcp --port 4317 \
#   --source-ips 198.51.100.0/24
# hcloud firewall apply-to-resource vllm-stack-fw --type server --server vllm-eu-node
echo "Firewall rules shown above — run locally with hcloud CLI"
```

## Verify it works

This block verifies the local OTel SDK wiring and the span attribute schema without requiring the Hetzner node:

```python
import importlib.metadata
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import SimpleSpanProcessor, ConsoleSpanExporter
from opentelemetry.sdk.resources import Resource
from opentelemetry import trace

# Print installed versions
for pkg in ["opentelemetry-sdk", "opentelemetry-exporter-otlp-proto-grpc", "openai"]:
    try:
        ver = importlib.metadata.version(pkg)
        print(f"{pkg}=={ver}")
    except importlib.metadata.PackageNotFoundError:
        print(f"{pkg}==NOT_FOUND")

# Verify span attribute round-trip
resource = Resource.create({"service.name": "vllm-verify", "cloud.provider": "hetzner"})
provider = TracerProvider(resource=resource)

import io
buf = io.StringIO()
# Pass the buffer explicitly: ConsoleSpanExporter binds its default `out` to
# sys.stdout at import time, so reassigning sys.stdout would capture nothing.
provider.add_span_processor(SimpleSpanProcessor(ConsoleSpanExporter(out=buf)))
tracer = provider.get_tracer("verify")
with tracer.start_as_current_span("verify.span") as span:
    span.set_attribute("gen_ai.system", "vllm")
    span.set_attribute("gen_ai.usage.prompt_tokens", 10)
    span.set_attribute("gen_ai.usage.completion_tokens", 20)
out = buf.getvalue()

assert "gen_ai.system" in out
assert "vllm" in out
assert "prompt_tokens" in out
print("verification_passed")
```

On the Hetzner node, check the audit log file the OTel Collector writes:

```bash
# Run on the Hetzner node
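# The collector image ships without a shell or coreutils, so copy the file out first
# docker compose cp otel-collector:/tmp/audit-spans.jsonl ./audit-spans.jsonl
# tail -n 5 ./audit-spans.jsonl | jq '.resourceSpans[0].resource.attributes[] | select(.key == "cloud.provider")'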
echo "Audit log check shown above — run on the Hetzner node"
```

Open `http://<node-ip>:3301` in your browser to reach the SigNoz UI. Navigate to Traces, filter by `service.name = vllm`, and you should see one span per request with the full GenAI attribute set.

## Troubleshooting

**vLLM container exits immediately with `CUDA error: no kernel image is available`.** The vLLM Docker image is compiled for specific CUDA compute capabilities. Check your GPU's compute capability with `nvidia-smi --query-gpu=compute_cap --format=csv,noheader` and pull the matching vLLM image tag (e.g., `vllm/vllm-openai:v0.5.0` for A30 GPUs on CUDA 12.1).

**No spans appear in SigNoz after sending requests.** Check the collector logs with `docker compose logs otel-collector`. The most common cause is a DNS resolution failure: the `vllm` container cannot reach `otel-collector` by hostname. Ensure both services are on the same Docker network (the default compose network handles this) and that the `--otlp-traces-endpoint` value uses the service name `otel-collector`, not `localhost`.

**The `filter/pii` processor drops all spans.** The filter expression syntax is strict. Validate the config before deploying with the collector's `validate` subcommand. The contrib binary is named `otelcol-contrib`, so with the image used here that is `docker compose run --rm --no-deps otel-collector validate --config=/etc/otelcol-contrib/config.yaml`. A malformed OTTL expression causes the collector to drop the entire batch silently in some versions; upgrade to `otel/opentelemetry-collector-contrib:0.102.0` or later where filter errors surface as log warnings.

**SigNoz UI shows spans but `gen_ai.*` attributes are missing.** vLLM's OTel integration requires `--otlp-traces-endpoint` to be set at startup, not just the `VLLM_OTLP_TRACES_ENDPOINT` environment variable. Pass the flag explicitly in the `command:` block of the compose file as shown in Step 2.

**Model download stalls or fails inside the container.** Hetzner nodes have outbound internet access, but HuggingFace rate-limits unauthenticated downloads. Set `HUGGING_FACE_HUB_TOKEN` (newer `huggingface_hub` releases prefer `HF_TOKEN`) in the vLLM service's `environment:` block, sourcing the value from an untracked `.env` file or a Docker secret rather than committing it. Also ensure the model is not gated (Qwen2.5-1.5B-Instruct is publicly available without a license gate).

**Audit log file grows unbounded.** The `file` exporter's `rotation` block requires `opentelemetry-collector-contrib` version 0.96.0 or later. On older images the rotation keys are silently ignored. Pin the collector image to `0.102.0` as shown in the compose file.

## Next steps

- **Add W&B or MLflow experiment tracking.** The span's `gen_ai.usage.prompt_tokens` and `gen_ai.usage.completion_tokens` attributes map directly to cost metrics. A small Python consumer reading the audit JSONL file can push per-request cost to any experiment tracker without touching the inference path.
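- **Enable vLLM's prefix caching and measure cache hit rate.** Pass `--enable-prefix-caching` to vLLM and track the prefix-cache metrics it publishes on its Prometheus `/metrics` endpoint (exact metric names vary between vLLM versions). The collector config in Step 2 only receives OTLP, so add a `prometheus` receiver that scrapes `vllm:8000/metrics` to the metrics pipeline before expecting these series in SigNoz. For multi-turn agent workloads with fixed system prompts, cache hit rates above 60% are achievable and directly reduce TTFT.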
- **Swap in a QLoRA fine-tuned model.** The research in [1] shows that fine-tuned 1.5-4B models can outperform larger prompted baselines on tool-use tasks while running faster. Upload your fine-tuned adapter to a private HuggingFace repo, mount it into the vLLM container, and pass `--lora-modules` to serve it alongside the base model. The OTel spans will carry the served model name, letting you A/B compare base vs. fine-tuned latency and token counts in SigNoz without any instrumentation changes.
- **Set up Grafana Tempo as an alternative trace backend.** The same OTel Collector config works with Tempo by changing the `otlp/signoz` exporter endpoint to your Tempo instance. This is useful if your team already operates a Grafana stack and wants to consolidate dashboards.
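For the experiment-tracking idea in the first bullet, the consumer can be very small. The sketch below totals tokens per model from the audit JSONL. It assumes each line is an OTLP/JSON batch (`resourceSpans` → `scopeSpans` → `spans`) and that spans carry the `gen_ai.*` attributes listed in Step 3. The sample record is hand-written for illustration, not captured from a real run.

```python
import json
from collections import defaultdict


def span_attributes(span: dict) -> dict:
    """Flatten OTLP/JSON attributes; int64 values arrive as JSON strings."""
    flat = {}
    for attr in span.get("attributes", []):
        value = attr.get("value", {})
        if "intValue" in value:
            flat[attr["key"]] = int(value["intValue"])
        elif "stringValue" in value:
            flat[attr["key"]] = value["stringValue"]
    return flat


def token_totals(lines) -> dict:
    """Sum prompt and completion tokens per model across audit-log lines."""
    totals = defaultdict(lambda: {"requests": 0, "prompt_tokens": 0, "completion_tokens": 0})
    for line in lines:
        if not line.strip():
            continue
        batch = json.loads(line)
        for resource_spans in batch.get("resourceSpans", []):
            for scope_spans in resource_spans.get("scopeSpans", []):
                for span in scope_spans.get("spans", []):
                    attrs = span_attributes(span)
                    if "gen_ai.usage.prompt_tokens" not in attrs:
                        continue  # not an LLM request span
                    row = totals[attrs.get("gen_ai.request.model", "unknown")]
                    row["requests"] += 1
                    row["prompt_tokens"] += attrs["gen_ai.usage.prompt_tokens"]
                    row["completion_tokens"] += attrs.get("gen_ai.usage.completion_tokens", 0)
    return dict(totals)


# Hand-written sample line in the assumed OTLP/JSON shape
sample = json.dumps({"resourceSpans": [{"scopeSpans": [{"spans": [{
    "name": "llm_request",
    "attributes": [
        {"key": "gen_ai.request.model", "value": {"stringValue": "qwen2.5-1.5b"}},
        {"key": "gen_ai.usage.prompt_tokens", "value": {"intValue": "47"}},
        {"key": "gen_ai.usage.completion_tokens", "value": {"intValue": "128"}},
    ],
}]}]}]})

print(token_totals([sample, sample]))
```

Because `token_totals` accepts any iterable of lines, an open file handle on a copy of `audit-spans.jsonl` works directly. Multiply the totals by your own per-token cost before pushing them to a tracker.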

## FAQ

### Why run LLM inference on Hetzner instead of a managed API?

Hetzner operates EU-only data centers with no US parent company, ensuring LLM inference logs and prompt text remain within EU borders to satisfy GDPR Article 44 data-residency requirements. Managed inference APIs from US providers route logs through US control planes by default, creating compliance risk.

### How does vLLM emit OpenTelemetry traces?

vLLM exposes an `--otlp-traces-endpoint` flag that sends structured spans following the OpenTelemetry GenAI semantic conventions to an OTel Collector. Each span carries attributes like `gen_ai.usage.prompt_tokens`, `gen_ai.usage.completion_tokens`, and `gen_ai.response.finish_reasons`, enabling token-level observability.

### What does the OTel Collector do in this setup?

The Collector receives OTLP spans from vLLM, enriches them with Hetzner-specific resource attributes (region, cloud provider, environment), drops spans that match the PII filter, and forwards the rest to SigNoz for querying. It also writes a local JSONL audit log that stays on the node for record-keeping.

### Can this setup handle multi-turn agent workloads?

Yes. The span-level token counts and child spans covering prefill, CUDA execution, and detokenization let operators correlate TTFT regressions with cache-miss rates and token budgets. Enabling vLLM's prefix caching with `--enable-prefix-caching` can achieve cache hit rates above 60% for fixed system prompts in multi-turn conversations.
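To check a hit-rate figure against your own workload, you can compute it from the counters on vLLM's Prometheus `/metrics` endpoint. The sketch below parses the text exposition format and divides hits by queries. The two metric names in the sample are assumptions that differ between vLLM versions, so substitute whatever your deployment actually exposes. The sample values are invented for illustration.

```python
def sum_metric(exposition: str, name: str) -> float:
    """Sum all samples of `name` across label sets in Prometheus text format."""
    total = 0.0
    for line in exposition.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        metric = line.split("{", 1)[0].split(" ", 1)[0]
        if metric != name:
            continue
        # The value follows the closing brace when labels are present
        value_part = line.rsplit("}", 1)[-1] if "{" in line else line[len(metric):]
        total += float(value_part.split()[0])
    return total


def hit_rate(exposition: str, hits_name: str, queries_name: str) -> float:
    """Return hits / queries, or 0.0 when no queries have been recorded."""
    queries = sum_metric(exposition, queries_name)
    return sum_metric(exposition, hits_name) / queries if queries else 0.0


# Invented sample; metric names are assumptions, check your own /metrics output
SAMPLE = """\
# TYPE vllm:prefix_cache_queries_total counter
vllm:prefix_cache_queries_total{model_name="qwen2.5-1.5b"} 1000.0
# TYPE vllm:prefix_cache_hits_total counter
vllm:prefix_cache_hits_total{model_name="qwen2.5-1.5b"} 620.0
"""

rate = hit_rate(SAMPLE, "vllm:prefix_cache_hits_total", "vllm:prefix_cache_queries_total")
print(f"prefix cache hit rate: {rate:.0%}")
```

Counters are cumulative since process start, so for a rate over a specific window, scrape twice and apply the same division to the differences.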

### What model size fits on a single Hetzner GPU node?

A 1.5B-parameter model like Qwen2.5-1.5B-Instruct runs on any GX instance (e.g., GX2-120 with NVIDIA A30). For 7B models in FP16 you need at least 16 GB VRAM. The tutorial uses 1.5B to demonstrate a practical, cost-efficient setup.

## References

1. https://arxiv.org/abs/2605.17774v1
