Why this matters
EU data-residency requirements are no longer optional for many regulated industries. GDPR Article 44 prohibits transferring personal data to third countries without adequate safeguards, and LLM inference logs, which contain prompt text and completions, are increasingly treated as personal data by data protection authorities. Managed inference APIs from US-headquartered providers route traffic through US regions by default, and even “EU endpoints” often replicate logs to US-based control planes.
Hetzner Cloud operates data centers in Nuremberg, Falkenstein, and Helsinki with no US parent company, making it a natural fit for sovereign-cloud LLM deployments. vLLM’s OpenAI-compatible server exposes a /metrics endpoint and supports OpenTelemetry tracing natively as of v0.4.x, but wiring those signals into a durable, queryable backend requires an explicit collector pipeline. Without it, operators running multi-turn agent workloads have no way to correlate TTFT regressions with cache-miss rates, token budgets, or downstream tool calls.
This tutorial builds that pipeline end-to-end: vLLM serving a small open-weight model, an OTel Collector sidecar, and SigNoz as the trace and metrics backend, all on a single Hetzner node with a Docker Compose manifest you can commit to your infrastructure repo.
Prerequisites
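- A Hetzner Cloud account with access to GPU instances (this tutorial assumes a GX-series instance in an EU region; the CCX line is CPU-only, so check current GPU availability before you start)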
- Docker Engine 24+ and Docker Compose v2 installed on the Hetzner node
- Python 3.11 or 3.12 on your local machine (for the verification scripts in this tutorial)
- Familiarity with vLLM CLI flags and the OpenAI chat completions API
- An SSH key added to your Hetzner project
- curl and jq available on the node
Setup
Install the Python packages used for local verification and span inspection. These run on your laptop or CI machine, not on the Hetzner node itself.
uv pip install openai opentelemetry-sdk opentelemetry-exporter-otlp-proto-grpc requests
Export the node’s public IP and the ports you’ll expose. Adjust these to match your Hetzner firewall rules.
export HETZNER_NODE_IP="203.0.113.42"
export VLLM_PORT=8000
export OTEL_COLLECTOR_GRPC_PORT=4317
export SIGNOZ_UI_PORT=3301
export MODEL_ID="Qwen/Qwen2.5-1.5B-Instruct"
Step 1: Provision the Hetzner Node
Create a GPU-enabled server via the Hetzner Cloud Console or the hcloud CLI. The GX2-120 (NVIDIA A30) in Falkenstein (fsn1) is the smallest GPU instance available as of mid-2025. For a 7B-parameter model in FP16 you need at least 16 GB VRAM; for the 1.5B model used in this tutorial, any GX instance works.
# Run this on your local machine with hcloud CLI installed
# hcloud server create \
# --name vllm-eu-node \
# --type gx2-120 \
# --location fsn1 \
# --image ubuntu-24.04 \
# --ssh-key your-key-name
#
# After creation, SSH in:
# ssh root@<node-ip>
echo "Hetzner provisioning commands shown above — run locally with hcloud CLI"
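As a rough check on the VRAM figures quoted above, weight memory is approximately the parameter count times the bytes per parameter. The sketch below covers weights only; KV cache, activations, and CUDA overhead come on top. The helper name is illustrative and not part of vLLM.

```python
def weight_memory_gib(num_params: float, bytes_per_param: int = 2) -> float:
    """Approximate memory for model weights alone, in GiB (FP16 = 2 bytes)."""
    return num_params * bytes_per_param / 1024**3


for name, params in [("1.5B", 1.5e9), ("7B", 7e9)]:
    print(f"{name} FP16 weights: ~{weight_memory_gib(params):.1f} GiB")
```

A 7B model in FP16 needs about 13 GiB for weights alone, which is why 16 GB is a practical floor once the KV cache is added.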
Once on the node, install Docker and the NVIDIA Container Toolkit:
# Run on the Hetzner node after SSH
# curl -fsSL https://get.docker.com | sh
# distribution=$(. /etc/os-release; echo $ID$VERSION_ID)
# curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | \
# gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg
# curl -fsSL https://nvidia.github.io/libnvidia-container/$distribution/libnvidia-container.list | \
# sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
# tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
# apt-get update && apt-get install -y nvidia-container-toolkit
# nvidia-ctk runtime configure --runtime=docker
# systemctl restart docker
echo "Node setup commands shown above — run on the Hetzner node"
Step 2: Write the Docker Compose Stack
The stack has three services: vllm, otel-collector, and signoz. SigNoz bundles ClickHouse, its query service, and the UI into a single compose file that you embed as a Git submodule or copy inline. For clarity this tutorial uses a minimal SigNoz deployment.
Write the main compose file:
# filename: docker-compose.yml (save on the Hetzner node at /opt/vllm-stack/)
version: "3.9"
services:
  vllm:
    image: vllm/vllm-openai:latest
    runtime: nvidia
    environment:
      - NVIDIA_VISIBLE_DEVICES=all
      - VLLM_OTLP_TRACES_ENDPOINT=http://otel-collector:4317
      # Debug-only function-call tracing; it adds overhead, so consider
      # removing this line outside of troubleshooting sessions
      - VLLM_TRACE_FUNCTION=1
    command: >
      --model Qwen/Qwen2.5-1.5B-Instruct
      --served-model-name qwen2.5-1.5b
      --max-model-len 8192
      --enable-chunked-prefill
      --otlp-traces-endpoint http://otel-collector:4317
    ports:
      - "8000:8000"
    depends_on:
      - otel-collector
    volumes:
      - hf-cache:/root/.cache/huggingface
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
  otel-collector:
    image: otel/opentelemetry-collector-contrib:0.102.0
    volumes:
      - ./otel-collector-config.yaml:/etc/otelcol-contrib/config.yaml
    ports:
      - "4317:4317"
      - "4318:4318"
      - "8888:8888"
    command: ["--config=/etc/otelcol-contrib/config.yaml"]
    depends_on:
      - signoz
  signoz:
    image: signoz/signoz:latest
    ports:
      - "3301:3301"
    environment:
      - SIGNOZ_TELEMETRY_ENABLED=false
    volumes:
      - signoz-data:/var/lib/signoz
volumes:
  hf-cache:
  signoz-data:
Now write the OTel Collector configuration. This is the critical piece: it receives OTLP spans from vLLM, enriches them with resource attributes, and forwards them to SigNoz’s OTLP ingest endpoint.
# filename: otel-collector-config.yaml (save alongside docker-compose.yml)
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318
processors:
  batch:
    timeout: 5s
    send_batch_size: 512
  resource:
    attributes:
      - key: deployment.region
        value: "hetzner-fsn1"
        action: upsert
      - key: deployment.environment
        value: "production"
        action: upsert
      - key: cloud.provider
        value: "hetzner"
        action: upsert
      - key: cloud.region
        value: "eu-central"
        action: upsert
  # Drop any spans that accidentally contain PII field names
  # Adjust the pattern to match your data classification policy
  filter/pii:
    error_mode: ignore
    traces:
      span:
        - 'attributes["user.email"] != nil'
exporters:
  otlp/signoz:
    endpoint: signoz:4317
    tls:
      insecure: true
  # Secondary: write exported spans as JSON lines to a local audit log
  file:
    path: /tmp/audit-spans.jsonl
    rotation:
      max_megabytes: 100
      max_days: 30
      max_backups: 10
  debug:
    verbosity: basic
service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [resource, filter/pii, batch]
      exporters: [otlp/signoz, file, debug]
    metrics:
      receivers: [otlp]
      processors: [resource, batch]
      exporters: [otlp/signoz]
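The file exporter appends exported spans as JSON lines to a local audit log, giving you an on-node record that can support Article 30 record-keeping obligations without shipping data off the node. A plain file is not tamper-evident on its own: pair it with append-only storage or hash chaining if you need that property, and mount the path on a volume so the log survives container recreation.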
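One way to make an audit log tamper-evident is to hash-chain its lines, so that editing any earlier line invalidates every later hash. The sketch below is a minimal illustration that would run as a separate process over the JSONL file; the collector's file exporter does not do this itself, and the function names and the all-zero genesis value are assumptions made for the example.

```python
import hashlib

GENESIS = "0" * 64  # hypothetical starting value for the chain


def _digest(prev_hash: str, line: str) -> str:
    return hashlib.sha256((prev_hash + line).encode("utf-8")).hexdigest()


def chain_lines(lines: list[str], prev_hash: str = GENESIS) -> list[dict]:
    """Wrap each audit line with a hash that also covers the previous hash."""
    records = []
    for line in lines:
        digest = _digest(prev_hash, line)
        records.append({"prev": prev_hash, "hash": digest, "line": line})
        prev_hash = digest
    return records


def verify_chain(records: list[dict], prev_hash: str = GENESIS) -> bool:
    """Recompute every hash; any edited or reordered line breaks the chain."""
    for rec in records:
        if rec["prev"] != prev_hash or rec["hash"] != _digest(prev_hash, rec["line"]):
            return False
        prev_hash = rec["hash"]
    return True


sample = ['{"resourceSpans": []}', '{"resourceSpans": [{}]}']
chained = chain_lines(sample)
print(verify_chain(chained))   # True: untouched chain
chained[0]["line"] = '{"resourceSpans": ["edited"]}'
print(verify_chain(chained))   # False: first line was modified
```

Storing the latest hash somewhere the node cannot rewrite (for example, a separate system) is what turns this from a checksum into evidence.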
Step 3: Understand the vLLM Span Structure
vLLM emits OpenTelemetry traces for each request through its --otlp-traces-endpoint flag. Each top-level span covers the full request lifecycle and carries semantic attributes from the OpenTelemetry GenAI semantic conventions [1]:
- gen_ai.system: always "vllm"
- gen_ai.request.model: the served model name
- gen_ai.request.max_tokens: from the request body
- gen_ai.usage.prompt_tokens and gen_ai.usage.completion_tokens: token counts
- gen_ai.response.finish_reasons: array of finish reasons
Child spans cover prefill scheduling, CUDA kernel execution, and detokenization. The collector’s resource processor adds your Hetzner-specific attributes so every span in SigNoz is queryable by region and environment without modifying vLLM’s source.
The QLoRA fine-tuning research in [1] shows that smaller models like Qwen3-4B run 2.5x faster than Gemma 4B equivalents while using 62% less memory. That makes a 1.5B-parameter Qwen2.5 model a practical choice for a single-GPU Hetzner node, and the span-level token counts let you verify those efficiency claims against your own workload in real time.
Step 4: Deploy and Smoke-Test the Stack
On the Hetzner node, start the stack:
# Run on the Hetzner node
# cd /opt/vllm-stack
# docker compose up -d
# docker compose ps
# docker compose logs vllm --tail 40
echo "Docker Compose commands shown above — run on the Hetzner node after saving the compose files"
vLLM downloads the model weights on first start (roughly 3 GB for Qwen2.5-1.5B). The hf-cache volume persists them across restarts. Wait for the log line Application startup complete before sending requests.
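Rather than watching the logs by hand, you can poll the server until it answers. The sketch below assumes vLLM's OpenAI-compatible server responds with HTTP 200 on /health once it is ready; the function names, retry counts, and delays are illustrative choices, not vLLM defaults.

```python
import time
import urllib.error
import urllib.request
from typing import Callable


def http_ok(url: str, timeout: float = 3.0) -> bool:
    """Return True if the URL answers with HTTP 200, False on any error."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False


def wait_until_ready(
    url: str,
    probe: Callable[[str], bool] = http_ok,
    attempts: int = 60,
    delay_s: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll the URL until the probe succeeds or the attempts run out."""
    for _ in range(attempts):
        if probe(url):
            return True
        sleep(delay_s)
    return False


# Example call (needs network access to the node, so it is left commented out):
# wait_until_ready("http://203.0.113.42:8000/health")
```

The probe and sleep functions are injectable so the retry logic can be exercised in CI without a running server.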
Step 5: Send a Traced Request and Inspect the Span
This Python script sends a chat completion request to vLLM’s OpenAI-compatible endpoint and prints the response. It reads the node address from HETZNER_NODE_IP and falls back to localhost; make sure your Hetzner firewall allows port 8000 from your client IP.
# filename: send_request.py
import json
import os

from openai import OpenAI


def send_traced_request(base_url: str, model: str, prompt: str) -> dict:
    # vLLM does not check the API key unless it was started with one
    client = OpenAI(
        base_url=base_url,
        api_key="not-needed-for-local",
    )
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": prompt},
        ],
        max_tokens=128,
        temperature=0.1,
        extra_headers={
            # Correlation ID for matching client logs to server logs.
            # This is not W3C trace context; linking client spans to vLLM
            # spans needs a traceparent header instead.
            "X-Request-ID": "audit-demo-001",
        },
    )
    # Return only the fields needed to cross-check against the span attributes
    return {
        "id": response.id,
        "model": response.model,
        "prompt_tokens": response.usage.prompt_tokens,
        "completion_tokens": response.usage.completion_tokens,
        "finish_reason": response.choices[0].finish_reason,
        "content": response.choices[0].message.content,
    }


if __name__ == "__main__":
    node_ip = os.environ.get("HETZNER_NODE_IP", "localhost")
    result = send_traced_request(
        base_url=f"http://{node_ip}:8000/v1",
        model="qwen2.5-1.5b",
        prompt="List three EU data-residency best practices for LLM deployments.",
    )
    print(json.dumps(result, indent=2))
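To link a client-side trace to the server-side span, the request needs a W3C Trace Context header rather than a free-form request ID. The sketch below builds a traceparent value by hand with the standard library; the helper name is hypothetical, and whether your vLLM version picks up incoming trace headers is an assumption to verify against its documentation.

```python
import secrets


def make_traceparent(sampled: bool = True) -> str:
    """Build a W3C traceparent value: version-traceid-parentid-flags."""
    trace_id = secrets.token_hex(16)   # 16 random bytes -> 32 hex characters
    parent_id = secrets.token_hex(8)   # 8 random bytes -> 16 hex characters
    flags = "01" if sampled else "00"  # 01 marks the trace as sampled
    return f"00-{trace_id}-{parent_id}-{flags}"


header = make_traceparent()
print(header)
```

With the OpenAI client, the value can be passed as extra_headers={"traceparent": header}. In an application that already uses the OpenTelemetry SDK, the SDK's propagator is the better source for this header, since it reuses the active span's IDs.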
Step 6: Build a Local OTel Verification Harness
This block demonstrates the span structure your collector will receive, using an in-process OTel SDK with a console exporter. It does not require the Hetzner node to be running and runs entirely on your local machine.
# filename: span_demo.py
from opentelemetry import trace
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import ConsoleSpanExporter, SimpleSpanProcessor


def build_provider(service_name: str, region: str) -> TracerProvider:
    # Resource attributes mirror what the collector's resource processor adds
    resource = Resource.create({
        "service.name": service_name,
        "cloud.provider": "hetzner",
        "cloud.region": region,
        "deployment.environment": "production",
    })
    provider = TracerProvider(resource=resource)
    provider.add_span_processor(SimpleSpanProcessor(ConsoleSpanExporter()))
    return provider


def simulate_vllm_span(
    tracer: trace.Tracer,
    model: str,
    prompt_tokens: int,
    completion_tokens: int,
    finish_reason: str,
) -> str:
    with tracer.start_as_current_span("llm.generate") as span:
        span.set_attribute("gen_ai.system", "vllm")
        span.set_attribute("gen_ai.request.model", model)
        span.set_attribute("gen_ai.usage.prompt_tokens", prompt_tokens)
        span.set_attribute("gen_ai.usage.completion_tokens", completion_tokens)
        span.set_attribute("gen_ai.response.finish_reasons", [finish_reason])
        span.set_attribute("cloud.region", "eu-central")
        span.set_attribute("cloud.provider", "hetzner")
        # Read the context while the span is still active; after the with
        # block exits, the current span is invalid and its trace ID is zero
        ctx = span.get_span_context()
        return format(ctx.trace_id, "032x")
Now run the demo and capture the span output:
import io

from opentelemetry.sdk.trace.export import ConsoleSpanExporter, SimpleSpanProcessor
from span_demo import build_provider, simulate_vllm_span

# Pass the buffer to the exporter explicitly. Reassigning sys.stdout does not
# capture the output, because the exporter's default stream is fixed earlier.
buf = io.StringIO()
provider = build_provider("vllm-eu", "hetzner-fsn1")
provider.add_span_processor(SimpleSpanProcessor(ConsoleSpanExporter(out=buf)))
tracer = provider.get_tracer("vllm-demo")
trace_id = simulate_vllm_span(
    tracer,
    model="qwen2.5-1.5b",
    prompt_tokens=47,
    completion_tokens=128,
    finish_reason="stop",
)
output = buf.getvalue()
assert "gen_ai.system" in output, "Missing gen_ai.system attribute"
assert "vllm" in output, "Missing vllm system value"
assert "hetzner" in output, "Missing cloud.provider attribute"
assert "eu-central" in output, "Missing cloud.region attribute"
assert "prompt_tokens" in output, "Missing prompt_tokens"
print("span_structure_verified")
print(f"trace_id_length={len(trace_id)}")
print(f"trace_id_sample={trace_id[:8]}...")
Step 7: Configure Hetzner Firewall Rules
Lock down the node so only your CIDR block reaches the inference and collector ports. The SigNoz UI (3301) should be accessible only from your office or VPN IP.
# Run with hcloud CLI from your local machine
# Replace 198.51.100.0/24 with your actual CIDR
# hcloud firewall create --name vllm-stack-fw
# hcloud firewall add-rule vllm-stack-fw \
# --direction in --protocol tcp --port 22 \
# --source-ips 198.51.100.0/24
# hcloud firewall add-rule vllm-stack-fw \
# --direction in --protocol tcp --port 8000 \
# --source-ips 198.51.100.0/24
# hcloud firewall add-rule vllm-stack-fw \
# --direction in --protocol tcp --port 3301 \
# --source-ips 198.51.100.0/24
# hcloud firewall add-rule vllm-stack-fw \
# --direction in --protocol tcp --port 4317 \
# --source-ips 198.51.100.0/24
# hcloud firewall apply-to-resource vllm-stack-fw --type server --server vllm-eu-node
echo "Firewall rules shown above — run locally with hcloud CLI"
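If you manage several ports or CIDR blocks, generating the rule commands from a list avoids copy-paste mistakes. The sketch below validates the CIDR with the standard library and emits the same hcloud firewall add-rule invocations used above; the helper name is hypothetical and it only prints commands, it does not call the Hetzner API.

```python
import ipaddress
import shlex


def firewall_rule_commands(firewall: str, cidr: str, ports: list[int]) -> list[str]:
    """Return one 'hcloud firewall add-rule' command per TCP port."""
    # strict=True rejects CIDRs with host bits set, e.g. 198.51.100.5/24
    network = ipaddress.ip_network(cidr, strict=True)
    commands = []
    for port in ports:
        if not 1 <= port <= 65535:
            raise ValueError(f"invalid TCP port: {port}")
        commands.append(
            f"hcloud firewall add-rule {shlex.quote(firewall)} "
            f"--direction in --protocol tcp --port {port} "
            f"--source-ips {network}"
        )
    return commands


for cmd in firewall_rule_commands("vllm-stack-fw", "198.51.100.0/24", [22, 8000, 3301, 4317]):
    print(cmd)
```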
Verify it works
This block verifies the local OTel SDK wiring and the span attribute schema without requiring the Hetzner node:
import importlib.metadata
import io

from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import ConsoleSpanExporter, SimpleSpanProcessor

# Print installed versions
for pkg in ["opentelemetry-sdk", "opentelemetry-exporter-otlp-proto-grpc", "openai"]:
    try:
        ver = importlib.metadata.version(pkg)
        print(f"{pkg}=={ver}")
    except importlib.metadata.PackageNotFoundError:
        print(f"{pkg}==NOT_FOUND")

# Verify span attribute round-trip.
# The exporter writes to the buffer passed via out=; reassigning sys.stdout
# would not capture its output.
resource = Resource.create({"service.name": "vllm-verify", "cloud.provider": "hetzner"})
provider = TracerProvider(resource=resource)
buf = io.StringIO()
provider.add_span_processor(SimpleSpanProcessor(ConsoleSpanExporter(out=buf)))
tracer = provider.get_tracer("verify")
with tracer.start_as_current_span("verify.span") as span:
    span.set_attribute("gen_ai.system", "vllm")
    span.set_attribute("gen_ai.usage.prompt_tokens", 10)
    span.set_attribute("gen_ai.usage.completion_tokens", 20)
# SimpleSpanProcessor exports synchronously when the span ends
out = buf.getvalue()
assert "gen_ai.system" in out
assert "vllm" in out
assert "prompt_tokens" in out
print("verification_passed")
On the Hetzner node, check the audit log file the OTel Collector writes:
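# Run on the Hetzner node
# The collector image may not include a shell or tail, so copy the file out first
# docker compose cp otel-collector:/tmp/audit-spans.jsonl ./audit-spans.jsonl
# tail -n 5 ./audit-spans.jsonl | jq '.resourceSpans[0].resource.attributes[] | select(.key == "cloud.provider")'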
echo "Audit log check shown above — run on the Hetzner node"
Open http://<node-ip>:3301 in your browser to reach the SigNoz UI. Navigate to Traces, filter by service.name = vllm, and you should see one span per request with the full GenAI attribute set.
Troubleshooting
vLLM container exits immediately with CUDA error: no kernel image is available. The vLLM Docker image is compiled for specific CUDA compute capabilities. Check your GPU’s compute capability with nvidia-smi --query-gpu=compute_cap --format=csv,noheader and pull the matching vLLM image tag (e.g., vllm/vllm-openai:v0.5.0 for A30 GPUs on CUDA 12.1).
No spans appear in SigNoz after sending requests. Check the collector logs with docker compose logs otel-collector. The most common cause is a DNS resolution failure: the vllm container cannot reach otel-collector by hostname. Ensure both services are on the same Docker network (the default compose network handles this) and that the endpoint passed to --otlp-traces-endpoint uses the service name, not localhost.
The filter/pii processor drops all spans. The filter expression syntax is strict. Test your filter with otelcol-contrib validate --config=otel-collector-config.yaml before deploying. A malformed OTTL expression causes the collector to drop the entire batch silently in some versions; upgrade to otel/opentelemetry-collector-contrib:0.102.0 or later where filter errors surface as log warnings.
SigNoz UI shows spans but gen_ai.* attributes are missing. vLLM’s OTel integration requires --otlp-traces-endpoint to be set at startup, not just the VLLM_OTLP_TRACES_ENDPOINT environment variable. Pass the flag explicitly in the command: block of the compose file as shown in Step 2.
Model download stalls or fails inside the container. Hetzner nodes have outbound internet access, but HuggingFace rate-limits unauthenticated downloads. Set HUGGING_FACE_HUB_TOKEN in the vLLM service’s environment: block, supplying the value through a Docker secret or an env file kept out of version control, and ensure the model is not gated (Qwen2.5-1.5B-Instruct is publicly available without a license gate).
Audit log file grows unbounded. The file exporter’s rotation block requires opentelemetry-collector-contrib version 0.96.0 or later. On older images the rotation keys are silently ignored. Pin the collector image to 0.102.0 as shown in the compose file.
Next steps
- Add W&B or MLflow experiment tracking. The span’s gen_ai.usage.prompt_tokens and gen_ai.usage.completion_tokens attributes map directly to cost metrics. A small Python consumer reading the audit JSONL file can push per-request cost to any experiment tracker without touching the inference path.
- Enable vLLM’s prefix caching and measure cache hit rate. Pass --enable-prefix-caching to vLLM and watch the vllm.cache_hit_rate metric in SigNoz. For multi-turn agent workloads with fixed system prompts, cache hit rates above 60% are achievable and directly reduce TTFT.
- Swap in a QLoRA fine-tuned model. The research in [1] shows that fine-tuned 1.5-4B models can outperform larger prompted baselines on tool-use tasks while running faster. Upload your fine-tuned adapter to a private HuggingFace repo, mount it into the vLLM container, and pass --lora-modules to serve it alongside the base model. The OTel spans will carry the served model name, letting you A/B compare base vs. fine-tuned latency and token counts in SigNoz without any instrumentation changes.
- Set up Grafana Tempo as an alternative trace backend. The same OTel Collector config works with Tempo by changing the otlp/signoz exporter endpoint to your Tempo instance. This is useful if your team already operates a Grafana stack and wants to consolidate dashboards.
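The audit-log consumer mentioned in the first item can start as a few lines of Python. The sketch below assumes each line of the audit file is an OTLP JSON object with the resourceSpans / scopeSpans / spans nesting, and that integer attribute values are encoded as strings under intValue; the function names and the sample record are illustrative, so check them against a real line from your own log.

```python
import json
from typing import Iterable


def attrs_to_dict(attributes: list[dict]) -> dict:
    """Flatten OTLP JSON attributes, keeping only int and string values."""
    flat = {}
    for attr in attributes:
        value = attr.get("value", {})
        if "intValue" in value:
            flat[attr["key"]] = int(value["intValue"])  # OTLP JSON encodes ints as strings
        elif "stringValue" in value:
            flat[attr["key"]] = value["stringValue"]
    return flat


def token_totals(jsonl_lines: Iterable[str]) -> dict:
    """Sum prompt and completion tokens across every span in the log."""
    totals = {"prompt_tokens": 0, "completion_tokens": 0, "spans": 0}
    for line in jsonl_lines:
        if not line.strip():
            continue
        batch = json.loads(line)
        for resource_spans in batch.get("resourceSpans", []):
            for scope_spans in resource_spans.get("scopeSpans", []):
                for span in scope_spans.get("spans", []):
                    attrs = attrs_to_dict(span.get("attributes", []))
                    totals["spans"] += 1
                    for field in ("prompt_tokens", "completion_tokens"):
                        count = attrs.get(f"gen_ai.usage.{field}")
                        if isinstance(count, int):
                            totals[field] += count
    return totals


sample_line = json.dumps({"resourceSpans": [{"scopeSpans": [{"spans": [{
    "name": "llm.generate",
    "attributes": [
        {"key": "gen_ai.usage.prompt_tokens", "value": {"intValue": "47"}},
        {"key": "gen_ai.usage.completion_tokens", "value": {"intValue": "128"}},
    ],
}]}]}]})
print(token_totals([sample_line]))
```

Multiplying the totals by your per-token cost gives a figure you can push to an experiment tracker on whatever schedule suits you.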
FAQ
Why run LLM inference on Hetzner instead of a managed API?
Hetzner is a German company with no US parent and operates data centers in Germany and Finland, which helps keep LLM inference logs and prompt text within EU borders and supports compliance with the GDPR Article 44 transfer rules. Managed inference APIs from US providers route logs through US control planes by default, creating compliance risk.
How does vLLM emit OpenTelemetry traces?
vLLM exposes an --otlp-traces-endpoint flag that sends structured spans following the OpenTelemetry GenAI semantic conventions to an OTel Collector. Each span carries attributes like gen_ai.usage.prompt_tokens, gen_ai.usage.completion_tokens, and gen_ai.response.finish_reasons, enabling token-level observability.
What does the OTel Collector do in this setup?
The Collector receives OTLP spans from vLLM, enriches them with Hetzner-specific resource attributes (region, cloud provider, environment), drops spans that carry PII attributes, and forwards the rest to SigNoz for querying. It also writes a local JSONL audit log for on-node record-keeping.
Can this setup handle multi-turn agent workloads?
Yes. The span-level token counts and child spans covering prefill, CUDA execution, and detokenization let operators correlate TTFT regressions with cache-miss rates and token budgets. Enabling vLLM’s prefix caching with --enable-prefix-caching can achieve cache hit rates above 60% for fixed system prompts in multi-turn conversations.
What model size fits on a single Hetzner GPU node?
A 1.5B-parameter model like Qwen2.5-1.5B-Instruct runs on any GX instance (e.g., GX2-120 with NVIDIA A30). For 7B models in FP16 you need at least 16 GB VRAM. The tutorial uses 1.5B to demonstrate a practical, cost-efficient setup.