Prerequisites

  • Python 3.11 or later
  • Docker Engine 24+ with the NVIDIA Container Toolkit installed
  • An NVIDIA GPU (A10G, A100, or H100 recommended) or a GPU cloud instance (Lambda Labs, RunPod, CoreWeave)
  • A Hugging Face account and token with access to the model you intend to serve
  • Familiarity with bash and basic HTTP APIs
  • curl and jq available on your host machine

Setup

Install the Python dependencies used by the monitoring client. The command below uses uv; plain pip install with the same package list works as well. The client talks to the vLLM server’s HTTP API, so the machine running it needs no GPU.

uv pip install requests matplotlib numpy rich

Export the environment variables the scripts will reference. Replace the placeholder values with your own.

export HF_TOKEN="hf_your_token_here"
export VLLM_HOST="http://localhost:8000"
export MODEL_ID="mistralai/Mistral-7B-Instruct-v0.3"

Step 1: Write the vLLM Server Launch Script

vLLM exposes KV cache statistics through its Prometheus /metrics endpoint and, in recent releases, per request through the usage field of OpenAI-compatible responses [1]. The per-request counts appear only when the server is configured to return prompt token details, so this tutorial relies on /metrics. The flags below enable prefix caching and set a high GPU memory utilization so the cache has room to grow across turns.

Save the following as launch_vllm.sh:

#!/usr/bin/env bash
# filename: launch_vllm.sh
# Start a vLLM server configured for agentic prefix caching.
set -euo pipefail

MODEL="${MODEL_ID:-mistralai/Mistral-7B-Instruct-v0.3}"
HF_TOKEN="${HF_TOKEN:-}"
GPU_MEM_UTIL="${GPU_MEM_UTIL:-0.90}"
MAX_MODEL_LEN="${MAX_MODEL_LEN:-8192}"
PORT="${PORT:-8000}"

if [[ -z "$HF_TOKEN" ]]; then
  echo "ERROR: HF_TOKEN is not set." >&2
  exit 1
fi

echo "Launching vLLM server for model: $MODEL"
echo "  GPU memory utilization : $GPU_MEM_UTIL"
echo "  Max model length       : $MAX_MODEL_LEN tokens"
echo "  Prefix caching         : enabled"
echo "  Port                   : $PORT"

docker run --rm --gpus all \
  --name vllm-agentic \
  -p "${PORT}:8000" \
  -e "HUGGING_FACE_HUB_TOKEN=${HF_TOKEN}" \
  -v "${HOME}/.cache/huggingface:/root/.cache/huggingface" \
  vllm/vllm-openai:latest \
    --model "$MODEL" \
    --gpu-memory-utilization "$GPU_MEM_UTIL" \
    --max-model-len "$MAX_MODEL_LEN" \
    --enable-prefix-caching \
    --disable-log-requests \
    --port 8000

Key flags explained:

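  • --enable-prefix-caching: activates vLLM’s automatic prefix caching, which hashes each KV cache block by its tokens and the prefix that precedes it. Prompt tokens whose blocks match a cached hash reuse the stored KV entries in VRAM instead of being recomputed.
  • --gpu-memory-utilization 0.90: lets this vLLM instance use up to 90% of VRAM for weights, activations, and the KV cache pool combined. Whatever remains after weights and activations becomes cache blocks, so a higher value leaves more room for multi-turn sessions.
  • --max-model-len 8192: caps the context length of a single request so that a full-length sequence still fits in the KV cache on smaller GPUs.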
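
One way to picture block-level prefix reuse is the toy sketch below. It is not vLLM's implementation: the 16-token block size is an assumption (check your server's configuration), and block_hashes and cached_prefix_tokens are hypothetical helper names introduced only for illustration. The point it shows is that reuse happens in whole blocks and stops at the first block that differs.

```python
import hashlib
from typing import List, Set

BLOCK_SIZE = 16  # assumed tokens per KV block; verify against your server


def block_hashes(token_ids: List[int], block_size: int = BLOCK_SIZE) -> List[str]:
    """Hash each full block together with the hash of the block before it."""
    hashes: List[str] = []
    parent = ""
    full_blocks = len(token_ids) // block_size  # a partial trailing block is skipped
    for b in range(full_blocks):
        block = token_ids[b * block_size:(b + 1) * block_size]
        payload = parent + ":" + ",".join(map(str, block))
        parent = hashlib.sha256(payload.encode()).hexdigest()
        hashes.append(parent)
    return hashes


def cached_prefix_tokens(
    token_ids: List[int], cache: Set[str], block_size: int = BLOCK_SIZE
) -> int:
    """Count leading tokens whose blocks are already present in the cache."""
    hit_blocks = 0
    for h in block_hashes(token_ids, block_size):
        if h not in cache:
            break  # the chain is broken, so later blocks cannot match either
        hit_blocks += 1
    return hit_blocks * block_size


turn1 = list(range(40))                   # 40 tokens: 2 full blocks + 8 left over
cache = set(block_hashes(turn1))          # pretend turn 1 populated the cache
turn2 = turn1 + list(range(100, 130))     # same prefix plus 30 new tokens
print(cached_prefix_tokens(turn2, cache))  # 32

changed = [999] + turn1[1:]               # one different token at the start
print(cached_prefix_tokens(changed, cache))  # 0
```

Because each hash includes its parent, changing a single early token (for example a timestamp at the top of the system prompt) invalidates every block after it.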

To start the server on your GPU machine, run:

bash launch_vllm.sh
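
Model download and weight loading can take several minutes, so it helps to wait for the server before sending traffic. The sketch below polls a readiness probe until it succeeds or a deadline passes. The helper names http_ok and wait_until_healthy, the 600-second default, and the 5-second interval are assumptions for illustration; the /health path is the same one monitor_cache.py checks later.

```python
import time
import urllib.error
import urllib.request
from typing import Callable


def http_ok(url: str, timeout: float = 3.0) -> bool:
    """Return True if a GET request to url answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError, ValueError):
        return False


def wait_until_healthy(
    probe: Callable[[], bool],
    timeout_s: float = 600.0,
    interval_s: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Call probe repeatedly until it returns True or timeout_s has elapsed."""
    deadline = clock() + timeout_s
    while True:
        if probe():
            return True
        if clock() >= deadline:
            return False
        sleep(interval_s)


# Offline demonstration with a fake probe that succeeds on the third attempt.
attempts = iter([False, False, True])
print(wait_until_healthy(lambda: next(attempts), sleep=lambda s: None))  # True

# Against a real server you would pass an HTTP probe instead, for example:
#   wait_until_healthy(lambda: http_ok("http://localhost:8000/health"))
```

Passing the probe, sleep, and clock as arguments keeps the loop testable without a running server.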

Step 2: Write the Prometheus Metrics Scraper

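vLLM exposes a /metrics endpoint in Prometheus text format. The gauge vllm:gpu_prefix_cache_hit_rate reports the prefix cache hit rate aggregated across all requests, not per request. Metric names have changed between vLLM releases, so if a field comes back empty, compare the output of curl -s "$VLLM_HOST/metrics" with the names in _PATTERNS. Save the following as metrics_scraper.py: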

# filename: metrics_scraper.py
"""metrics_scraper.py — parse vLLM Prometheus metrics for KV cache stats."""
from __future__ import annotations
import re
from dataclasses import dataclass
from typing import Optional

import requests


@dataclass
class CacheMetrics:
    gpu_prefix_cache_hit_rate: Optional[float] = None
    gpu_cache_usage_perc: Optional[float] = None
    num_running_requests: Optional[int] = None
    num_waiting_requests: Optional[int] = None


_PATTERNS = {
    "gpu_prefix_cache_hit_rate": re.compile(
        r'^vllm:gpu_prefix_cache_hit_rate\{[^}]*\}\s+([\d.eE+\-]+)', re.M
    ),
    "gpu_cache_usage_perc": re.compile(
        r'^vllm:gpu_cache_usage_perc\{[^}]*\}\s+([\d.eE+\-]+)', re.M
    ),
    "num_running": re.compile(
        r'^vllm:num_requests_running\{[^}]*\}\s+([\d.eE+\-]+)', re.M
    ),
    "num_waiting": re.compile(
        r'^vllm:num_requests_waiting\{[^}]*\}\s+([\d.eE+\-]+)', re.M
    ),
}


def scrape(host: str, timeout: float = 5.0) -> CacheMetrics:
    """Fetch /metrics from a running vLLM server and parse KV cache fields."""
    url = f"{host.rstrip('/')}/metrics"
    resp = requests.get(url, timeout=timeout)
    resp.raise_for_status()
    text = resp.text

    def _float(key: str) -> Optional[float]:
        m = _PATTERNS[key].search(text)
        return float(m.group(1)) if m else None

    return CacheMetrics(
        gpu_prefix_cache_hit_rate=_float("gpu_prefix_cache_hit_rate"),
        gpu_cache_usage_perc=_float("gpu_cache_usage_perc"),
        num_running_requests=int(_float("num_running") or 0),
        num_waiting_requests=int(_float("num_waiting") or 0),
    )
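
When a field comes back as None, it is useful to see every metric name the server actually emits. The sketch below is a small generic parser for the Prometheus text format, run against a hand-written sample rather than real server output. parse_samples is a hypothetical helper, and the regex is a simplification that does not handle a closing brace inside a quoted label value.

```python
import re
from typing import Dict, List, Tuple

# name, optional {labels}, then the sample value; comment lines start with '#'
# and therefore never match because a metric name cannot start with '#'.
_SAMPLE = re.compile(
    r'^([A-Za-z_:][A-Za-z0-9_:]*)(?:\{([^}]*)\})?[ \t]+(\S+)', re.M
)


def parse_samples(text: str) -> Dict[str, List[Tuple[str, float]]]:
    """Map each metric name to a list of (raw_label_string, value) pairs."""
    out: Dict[str, List[Tuple[str, float]]] = {}
    for name, labels, value in _SAMPLE.findall(text):
        try:
            out.setdefault(name, []).append((labels, float(value)))
        except ValueError:
            continue  # skip anything that is not a numeric sample
    return out


# Hand-written sample for illustration only, not captured from a server.
SAMPLE_TEXT = """\
# HELP vllm:gpu_prefix_cache_hit_rate Example help text.
# TYPE vllm:gpu_prefix_cache_hit_rate gauge
vllm:gpu_prefix_cache_hit_rate{model_name="demo"} 0.62
vllm:num_requests_running{model_name="demo"} 1.0
process_open_fds 42
"""

samples = parse_samples(SAMPLE_TEXT)
print(sorted(samples))
print(samples["vllm:gpu_prefix_cache_hit_rate"][0][1])  # 0.62
```

Printing sorted(samples) against your own /metrics output shows at a glance whether the names in _PATTERNS match what your vLLM version exposes.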

Step 3: Write the Multi-Turn Session Simulator

Agentic workloads replay long shared prefixes on every turn, which is exactly where prefix caching pays off and where cache misses cause the TTFT regressions documented in [1]. The simulator below sends a series of chat completions over a conversation that grows behind a fixed system prompt, records each request's prompt token count and end-to-end latency, and scrapes the Prometheus hit rate after each turn. The prompt token count shows how much prefix is eligible for reuse; it does not by itself show how many tokens were actually served from cache. Save the following as session_simulator.py:

# filename: session_simulator.py
"""session_simulator.py — simulate a multi-turn agentic session and collect KV cache metrics."""
from __future__ import annotations
import time
from dataclasses import dataclass
from typing import Any, Dict, List

import requests

from metrics_scraper import scrape, CacheMetrics


@dataclass
class TurnRecord:
    turn: int
    prompt_tokens: int
    completion_tokens: int
    latency_s: float
    cache_hit_rate: float
    cache_usage_perc: float


# A realistic agentic system prompt that stays constant across turns.
# In production this would be a tool schema + retrieved context block.
SYSTEM_PROMPT = """
You are a precise coding assistant. You have access to the following tools:

1. search_codebase(query: str) -> List[str]: Returns file paths matching query.
2. read_file(path: str) -> str: Returns file contents.
3. write_file(path: str, content: str) -> bool: Writes content to path.
4. run_tests(test_path: str) -> dict: Runs tests and returns results.
5. git_diff() -> str: Returns current working-tree diff.

Always reason step by step before calling a tool. Prefer minimal diffs.
When uncertain, ask a clarifying question rather than guessing.
""".strip()

# Simulated user turns that build on each other (agentic pattern).
USER_TURNS = [
    "List all Python files in the repository.",
    "Read the contents of src/main.py and summarize what it does.",
    "Find all functions in src/main.py that lack docstrings.",
    "Add a docstring to the `process_batch` function. Show me the diff.",
    "Run the unit tests for src/main.py and report any failures.",
    "Fix the first failing test and show the corrected code.",
    "Commit the changes with an appropriate message. What would you write?",
    "Now check if there are similar undocumented functions in src/utils.py.",
]


def chat_completion(
    host: str,
    model: str,
    messages: List[Dict[str, str]],
    max_tokens: int = 256,
) -> Dict[str, Any]:
    url = f"{host.rstrip('/')}/v1/chat/completions"
    payload = {
        "model": model,
        "messages": messages,
        "max_tokens": max_tokens,
        "temperature": 0.0,
    }
    resp = requests.post(url, json=payload, timeout=120)
    resp.raise_for_status()
    return resp.json()


def run_session(
    host: str,
    model: str,
    turns: List[str] = USER_TURNS,
    inter_turn_delay: float = 0.5,
) -> List[TurnRecord]:
    messages: List[Dict[str, str]] = [
        {"role": "system", "content": SYSTEM_PROMPT}
    ]
    records: List[TurnRecord] = []

    for i, user_msg in enumerate(turns):
        messages.append({"role": "user", "content": user_msg})

        t0 = time.perf_counter()
        result = chat_completion(host, model, messages, max_tokens=256)
        latency = time.perf_counter() - t0

        usage = result.get("usage", {})
        prompt_tokens = usage.get("prompt_tokens", 0)
        completion_tokens = usage.get("completion_tokens", 0)

        # Append assistant reply so next turn sees full history.
        assistant_content = result["choices"][0]["message"]["content"]
        messages.append({"role": "assistant", "content": assistant_content})

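        # Scrape Prometheus metrics right after the request completes.
        # A failed scrape or a missing metric is recorded as 0.0, so a flat
        # zero can mean "metric not found" rather than "no cache hits".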
        try:
            cm: CacheMetrics = scrape(host)
            hit_rate = cm.gpu_prefix_cache_hit_rate or 0.0
            usage_perc = cm.gpu_cache_usage_perc or 0.0
        except Exception:
            hit_rate, usage_perc = 0.0, 0.0

        rec = TurnRecord(
            turn=i + 1,
            prompt_tokens=prompt_tokens,
            completion_tokens=completion_tokens,
            latency_s=latency,
            cache_hit_rate=hit_rate,
            cache_usage_perc=usage_perc,
        )
        records.append(rec)
        print(
            f"Turn {rec.turn:2d} | prompt_tokens={rec.prompt_tokens:5d} "
            f"| latency={rec.latency_s:.2f}s "
            f"| cache_hit_rate={rec.cache_hit_rate:.3f} "
            f"| cache_usage={rec.cache_usage_perc:.3f}"
        )
        time.sleep(inter_turn_delay)

    return records
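
If your server returns per-request prompt token details, the usage object gives a direct measurement of reuse for each turn. The sketch below reads usage["prompt_tokens_details"]["cached_tokens"] when it is present. Treat the field's availability as an assumption: recent vLLM releases appear to expose it through the --enable-prompt-tokens-details server flag, which you should verify against the help output of your version. cached_tokens and reuse_fraction are hypothetical helper names, and the dictionaries are hand-written examples.

```python
from typing import Any, Dict, Optional


def cached_tokens(usage: Dict[str, Any]) -> Optional[int]:
    """Return the cached prompt token count, or None if the server omits it."""
    details = usage.get("prompt_tokens_details") or {}
    value = details.get("cached_tokens")
    return int(value) if value is not None else None


def reuse_fraction(usage: Dict[str, Any]) -> Optional[float]:
    """Fraction of this request's prompt tokens that were served from cache."""
    prompt = usage.get("prompt_tokens") or 0
    cached = cached_tokens(usage)
    if cached is None or prompt <= 0:
        return None
    return cached / prompt


# Hand-written usage payloads for illustration only.
with_details = {
    "prompt_tokens": 400,
    "completion_tokens": 50,
    "prompt_tokens_details": {"cached_tokens": 320},
}
without_details = {"prompt_tokens": 400, "completion_tokens": 50}

print(reuse_fraction(with_details))     # 0.8
print(reuse_fraction(without_details))  # None
```

Returning None instead of 0.0 when the field is absent keeps "not reported" distinguishable from "no reuse".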

Step 4: Write the Plotting and Reporting Module

This module takes the list of TurnRecord objects and produces a four-panel chart: cache hit rate per turn, prompt token growth (how much prefix is eligible for reuse), end-to-end latency, and an estimate of prompt tokens served from cache. It also prints a Rich summary table. Save the following as reporter.py:

# filename: reporter.py
"""reporter.py — plot KV cache efficiency metrics from a simulated session."""
from __future__ import annotations
from typing import List

import matplotlib
matplotlib.use("Agg")  # headless backend for servers
import matplotlib.pyplot as plt
import matplotlib.gridspec as gridspec

try:
    from rich.console import Console
    from rich.table import Table
    _RICH = True
except ImportError:
    _RICH = False

from session_simulator import TurnRecord


def print_summary_table(records: List[TurnRecord]) -> None:
    if not _RICH:
        for r in records:
            print(r)
        return
    console = Console()
    table = Table(title="KV Cache Efficiency — Multi-Turn Session", show_lines=True)
    table.add_column("Turn", justify="right")
    table.add_column("Prompt Tokens", justify="right")
    table.add_column("Completion Tokens", justify="right")
    table.add_column("Latency (s)", justify="right")
    table.add_column("Cache Hit Rate", justify="right")
    table.add_column("Cache Usage", justify="right")
    for r in records:
        hit_color = "green" if r.cache_hit_rate > 0.5 else "yellow" if r.cache_hit_rate > 0.2 else "red"
        table.add_row(
            str(r.turn),
            str(r.prompt_tokens),
            str(r.completion_tokens),
            f"{r.latency_s:.2f}",
            f"[{hit_color}]{r.cache_hit_rate:.3f}[/{hit_color}]",
            f"{r.cache_usage_perc:.3f}",
        )
    console.print(table)


def plot_session(records: List[TurnRecord], output_path: str = "cache_efficiency.png") -> str:
    turns = [r.turn for r in records]
    hit_rates = [r.cache_hit_rate for r in records]
    prompt_tokens = [r.prompt_tokens for r in records]
    latencies = [r.latency_s for r in records]

    fig = plt.figure(figsize=(12, 8))
    fig.suptitle("vLLM KV Cache Efficiency — Agentic Multi-Turn Session", fontsize=14, fontweight="bold")
    gs = gridspec.GridSpec(2, 2, figure=fig, hspace=0.4, wspace=0.35)

    # Panel 1: Cache hit rate over turns
    ax1 = fig.add_subplot(gs[0, 0])
    ax1.plot(turns, hit_rates, marker="o", color="steelblue", linewidth=2)
    ax1.axhline(0.5, color="orange", linestyle="--", linewidth=1, label="50% threshold")
    ax1.set_xlabel("Turn")
    ax1.set_ylabel("Cache Hit Rate")
    ax1.set_title("Prefix Cache Hit Rate per Turn")
    ax1.set_ylim(0, 1.05)
    ax1.legend(fontsize=8)
    ax1.grid(True, alpha=0.3)

    # Panel 2: Prompt token growth
    ax2 = fig.add_subplot(gs[0, 1])
    ax2.bar(turns, prompt_tokens, color="mediumseagreen", alpha=0.8)
    ax2.set_xlabel("Turn")
    ax2.set_ylabel("Prompt Tokens")
    ax2.set_title("Prompt Token Count (prefix growth)")
    ax2.grid(True, alpha=0.3, axis="y")

    # Panel 3: Latency over turns
    ax3 = fig.add_subplot(gs[1, 0])
    ax3.plot(turns, latencies, marker="s", color="tomato", linewidth=2)
    ax3.set_xlabel("Turn")
    ax3.set_ylabel("Latency (s)")
    ax3.set_title("End-to-End Request Latency")
    ax3.grid(True, alpha=0.3)

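    # Panel 4: Estimated tokens saved by cache.
    # Rough estimate: the scraped hit rate is aggregated over all requests,
    # not measured for this turn alone.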
    ax4 = fig.add_subplot(gs[1, 1])
    tokens_saved = [int(r.prompt_tokens * r.cache_hit_rate) for r in records]
    ax4.bar(turns, tokens_saved, color="mediumpurple", alpha=0.8)
    ax4.set_xlabel("Turn")
    ax4.set_ylabel("Tokens Saved")
    ax4.set_title("Estimated Prompt Tokens Served from Cache")
    ax4.grid(True, alpha=0.3, axis="y")

    plt.savefig(output_path, dpi=150, bbox_inches="tight")
    plt.close(fig)
    return output_path

Step 5: Write the Main Entry Point

This script ties everything together. When a live vLLM server is not reachable, it falls back to synthetic data so you can verify the plotting pipeline locally without a GPU. Save the following as monitor_cache.py:

# filename: monitor_cache.py
"""monitor_cache.py — entry point for KV cache monitoring."""
from __future__ import annotations
import os
import random

VLLM_HOST = os.environ.get("VLLM_HOST", "http://localhost:8000")
MODEL_ID = os.environ.get("MODEL_ID", "mistralai/Mistral-7B-Instruct-v0.3")


def _synthetic_records(n_turns: int = 8):
    """Generate plausible synthetic TurnRecords for offline testing."""
    from session_simulator import TurnRecord
    records = []
    base_tokens = 180  # system prompt tokens
    for i in range(1, n_turns + 1):
        prompt_tokens = base_tokens + i * 95 + random.randint(-10, 10)
        # Hit rate rises as the shared prefix grows — mirrors real agentic behaviour.
        hit_rate = min(0.95, 0.05 + 0.12 * i + random.uniform(-0.03, 0.03))
        # Latency drops as cache warms up (fewer prefill FLOPs).
        latency = max(0.4, 3.5 - hit_rate * 2.8 + random.uniform(-0.1, 0.1))
        records.append(TurnRecord(
            turn=i,
            prompt_tokens=prompt_tokens,
            completion_tokens=random.randint(60, 200),
            latency_s=round(latency, 3),
            cache_hit_rate=round(hit_rate, 4),
            cache_usage_perc=round(min(0.85, 0.05 * i + random.uniform(0, 0.02)), 4),
        ))
    return records


def _server_reachable(host: str) -> bool:
    import requests
    try:
        r = requests.get(f"{host.rstrip('/')}/health", timeout=3)
        return r.status_code == 200
    except Exception:
        return False


def main():
    from reporter import print_summary_table, plot_session

    if _server_reachable(VLLM_HOST):
        print(f"vLLM server reachable at {VLLM_HOST}. Running live session...")
        from session_simulator import run_session
        records = run_session(VLLM_HOST, MODEL_ID)
    else:
        print(
            f"vLLM server not reachable at {VLLM_HOST}. "
            "Using synthetic data for demonstration."
        )
        records = _synthetic_records(n_turns=8)

    print_summary_table(records)
    out = plot_session(records)
    print(f"Chart saved to: {out}")
    print("monitoring_complete")


if __name__ == "__main__":
    main()

Verify It Works

Run the monitoring script from the directory where you saved the five files. Because no vLLM server is running yet, it automatically uses synthetic data that mirrors the cache warm-up curve you would observe in a real agentic session. The chart is written to cache_efficiency.png in the same directory.

python monitor_cache.py

Verify the chart file was written:

ls -lh cache_efficiency.png

When you connect to a live GPU instance, start the server and run the monitor in two terminals:

# Terminal 1 — start vLLM
export HF_TOKEN=hf_...
export MODEL_ID=mistralai/Mistral-7B-Instruct-v0.3
bash launch_vllm.sh

# Terminal 2 — run the monitor once the server is healthy
python monitor_cache.py

Reading the Output

The four-panel chart shows:

  1. Prefix Cache Hit Rate per Turn: should climb from near 0 on turn 1 (cold cache) toward 0.7-0.9 by turn 5+ as the shared system prompt and conversation history populate the prefix cache. Because the gauge aggregates over every request the server has handled, it lags behind the reuse achieved on the latest turn. The Irminsul paper [1] reports recovery of up to 83% of prompt tokens above exact-prefix on agentic traffic, so values in this range are expected on a warm cache.
  2. Prompt Token Count: grows linearly because each turn appends the assistant reply. A flat or slow-growing curve here means the conversation history is being truncated, which would also reset the cache.
  3. End-to-End Latency: tends to decrease as the hit rate rises, although it also includes decode time for up to 256 generated tokens. Cache misses force full prefill recomputation, which is the source of the 10-16 second TTFT spikes documented in [1].
  4. Tokens Served from Cache: the product of prompt tokens and hit rate, an estimate of the prefill work the GPU skipped. It is approximate because the hit rate is aggregated over all requests rather than measured per turn. With [1] reporting 63% prefill energy savings per cache hit, this panel is a rough indicator of cost reduction.
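
To judge whether the measured curve is plausible, it can help to compute what reuse a simple model predicts. The sketch below assumes that each turn's prompt begins with the previous turn's prompt token for token, that reuse happens in whole 16-token blocks, and that generated tokens are not reused. All three are simplifying assumptions, the token counts are made-up inputs, and expected_reuse is a hypothetical helper name.

```python
from typing import List

BLOCK_SIZE = 16  # assumed tokens per KV block


def expected_reuse(prompt_tokens: List[int], block_size: int = BLOCK_SIZE) -> List[float]:
    """Predicted fraction of each turn's prompt that could come from cache."""
    fractions: List[float] = []
    previous = 0  # nothing is cached before the first turn
    for current in prompt_tokens:
        reusable = (previous // block_size) * block_size  # whole blocks only
        reusable = min(reusable, current)
        fractions.append(reusable / current if current > 0 else 0.0)
        previous = current
    return fractions


# Made-up prompt sizes for three consecutive turns.
fractions = expected_reuse([180, 420, 650])
print([round(f, 3) for f in fractions])  # [0.0, 0.419, 0.64]
```

A measured curve far below this prediction suggests the prefix is changing between turns, for example through history truncation or a varying system prompt.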

Troubleshooting

docker: Error response from daemon: could not select device driver "nvidia": The NVIDIA Container Toolkit is not installed or not configured. Follow the NVIDIA Container Toolkit installation guide and run sudo nvidia-ctk runtime configure --runtime=docker && sudo systemctl restart docker.

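CUDA out of memory during server startup: Reduce --gpu-memory-utilization to 0.80 or lower --max-model-len to 4096. On a 24 GB GPU at 0.90 utilization, vLLM budgets about 21.6 GB. Mistral-7B needs roughly 14 GB for 16-bit weights, which leaves about 7 GB for activations and the KV cache pool, not the 10 GB that subtracting from the full 24 GB would suggest.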
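
The memory arithmetic can be sketched as follows. The model shape (32 layers, 8 KV heads, head dimension 128, 16-bit cache) matches the published Mistral-7B configuration; the 1.5 GiB overhead for activations and CUDA context is an assumed placeholder, and the result is an order-of-magnitude estimate rather than what vLLM will report at startup. Both function names are hypothetical.

```python
GIB = 1024 ** 3


def kv_bytes_per_token(
    num_layers: int, num_kv_heads: int, head_dim: int, dtype_bytes: int = 2
) -> int:
    """Bytes of KV cache per token: one key and one value tensor per layer."""
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes


def kv_token_capacity(
    total_vram_gib: float,
    gpu_mem_util: float,
    weights_gib: float,
    overhead_gib: float,
    bytes_per_token: int,
) -> int:
    """Approximate number of tokens the KV cache pool can hold."""
    budget_gib = total_vram_gib * gpu_mem_util - weights_gib - overhead_gib
    if budget_gib <= 0:
        return 0
    return int(budget_gib * GIB // bytes_per_token)


per_token = kv_bytes_per_token(num_layers=32, num_kv_heads=8, head_dim=128)
print(per_token)  # 131072 bytes, i.e. 128 KiB per token

capacity = kv_token_capacity(
    total_vram_gib=24.0,
    gpu_mem_util=0.90,
    weights_gib=14.0,   # figure used in this tutorial
    overhead_gib=1.5,   # assumed placeholder for activations and CUDA context
    bytes_per_token=per_token,
)
print(capacity)  # on the order of 50,000 tokens under these assumptions
```

Dropping gpu_mem_util to 0.80 in this model removes about 2.4 GiB from the pool, which is why lowering it trades cache capacity for startup headroom.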

Cache hit rate stays at 0.0 after several turns: Confirm --enable-prefix-caching is present in the Docker command. Then confirm that vllm:gpu_prefix_cache_hit_rate actually appears in the /metrics output, because the simulator records a missing metric as 0.0. Sampling temperature does not affect prefix caching, so there is no need to change it. Finally, verify the model is not being reloaded between requests by checking the Docker logs for Loading model weights.
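
If your vLLM version exposes cumulative hit and query counters instead of a hit-rate gauge (an assumption to check against your own /metrics output), a per-turn rate can be recovered by differencing consecutive scrapes. The sketch below works on plain lists of counter values, so it does not depend on any particular metric name; per_turn_hit_rates is a hypothetical helper and the numbers are made up.

```python
from typing import List, Optional


def per_turn_hit_rates(
    queries: List[float], hits: List[float]
) -> List[Optional[float]]:
    """Per-turn hit rate from cumulative counters.

    queries[0] and hits[0] are the baseline scraped before turn 1;
    element i is the value scraped after turn i.
    """
    if len(queries) != len(hits):
        raise ValueError("queries and hits must have the same length")
    rates: List[Optional[float]] = []
    for i in range(1, len(queries)):
        dq = queries[i] - queries[i - 1]
        dh = hits[i] - hits[i - 1]
        # No new queries, or a counter reset after a restart: rate is unknown.
        rates.append(dh / dq if dq > 0 else None)
    return rates


# Made-up cumulative values: baseline, then after turns 1, 2 and 3.
queries = [0, 180, 600, 1250]
hits = [0, 0, 176, 592]
rates = per_turn_hit_rates(queries, hits)
print([None if r is None else round(r, 3) for r in rates])  # [0.0, 0.419, 0.64]
```

Unlike an aggregated gauge, the differenced rate reflects only the requests issued between two scrapes, provided no other client is using the server at the same time.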

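/metrics returns 404: vLLM's OpenAI-compatible server normally serves /metrics on the same port as the API, so a 404 usually means VLLM_HOST points at a proxy or a different service. Test the endpoint directly with curl -s "$VLLM_HOST/metrics" | head. If you are running a very old image, upgrade to a recent vllm/vllm-openai release.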

Latency does not decrease despite rising hit rate: The cache is warming but the bottleneck may be decode (token generation), not prefill. Set max_tokens to a small value such as 64 in the request payload to isolate prefill latency, or inspect the vllm:time_to_first_token_seconds histogram in the Prometheus output directly.
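
Prometheus histograms publish a running _sum and _count alongside their buckets, so the mean TTFT over a window is the change in sum divided by the change in count between two scrapes. The sketch below applies that to two hand-written scrape snippets, not real server output. The helper names are hypothetical, and the regex is a simplification that takes the first matching series and ignores label differences.

```python
import re
from typing import Optional, Tuple

METRIC = "vllm:time_to_first_token_seconds"


def _value(text: str, name: str) -> Optional[float]:
    """Return the first sample value for an exact metric name, if present."""
    pattern = "^" + re.escape(name) + r'(?:\{[^}]*\})?[ \t]+(\S+)'
    m = re.search(pattern, text, re.M)
    return float(m.group(1)) if m else None


def sum_and_count(text: str) -> Optional[Tuple[float, float]]:
    """Extract the histogram's running sum and observation count."""
    s = _value(text, METRIC + "_sum")
    c = _value(text, METRIC + "_count")
    return (s, c) if s is not None and c is not None else None


def mean_ttft(before: str, after: str) -> Optional[float]:
    """Mean TTFT in seconds for requests observed between two scrapes."""
    a, b = sum_and_count(before), sum_and_count(after)
    if a is None or b is None or b[1] <= a[1]:
        return None
    return (b[0] - a[0]) / (b[1] - a[1])


# Hand-written scrape snippets for illustration only.
BEFORE = """\
vllm:time_to_first_token_seconds_sum{model_name="demo"} 12.5
vllm:time_to_first_token_seconds_count{model_name="demo"} 50.0
"""
AFTER = """\
vllm:time_to_first_token_seconds_sum{model_name="demo"} 14.0
vllm:time_to_first_token_seconds_count{model_name="demo"} 55.0
"""

print(mean_ttft(BEFORE, AFTER))  # 0.3
```

Scraping once before and once after a session gives the session's mean TTFT without any client-side streaming instrumentation.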

ModuleNotFoundError: No module named 'metrics_scraper': Run monitor_cache.py from the directory where you saved the five files, or add that directory to PYTHONPATH: PYTHONPATH=. python monitor_cache.py.

Next Steps

  • Integrate with Grafana: scrape the vLLM /metrics endpoint with a Prometheus server and build a dashboard that alerts when vllm:gpu_prefix_cache_hit_rate drops below 0.4 for more than 60 seconds.
  • Test with MLA-based models: DeepSeek-V2-Lite and Kimi Moonlight are the models evaluated in [1]. Their Multi-Head Latent Attention architecture makes content-addressed caching especially effective; swap MODEL_ID to one of these and compare hit-rate curves against a GQA model.
  • Benchmark cache miss cost: add a --no-enable-prefix-caching run of the same session and compare TTFT distributions. This quantifies the latency penalty your users pay when cache eviction occurs under load.
  • Extend to tool-call traces: replace USER_TURNS with real tool-call/result pairs from your agent framework. The system prompt plus tool schemas form a long shared prefix that is the primary beneficiary of prefix caching in production agentic deployments.
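
For the cache miss benchmark, one simple way to compare the two runs is to summarize each TTFT sample set and report the ratio. The sketch below uses the median, the mean, and a nearest-rank 95th percentile. The sample values are invented placeholders, not measurements, and summarize and slowdown are hypothetical helper names.

```python
import math
import statistics
from typing import Dict, List


def summarize(samples: List[float]) -> Dict[str, float]:
    """Median, mean and nearest-rank p95 of a list of TTFT samples (seconds)."""
    if not samples:
        raise ValueError("samples must not be empty")
    ordered = sorted(samples)
    idx = max(0, math.ceil(0.95 * len(ordered)) - 1)  # nearest-rank method
    return {
        "median": statistics.median(ordered),
        "mean": statistics.fmean(ordered),
        "p95": ordered[idx],
    }


def slowdown(cached: List[float], uncached: List[float]) -> Dict[str, float]:
    """Ratio of uncached to cached statistics; values above 1 mean slower."""
    a, b = summarize(cached), summarize(uncached)
    return {k: b[k] / a[k] for k in a if a[k] > 0}


# Invented placeholder samples, not real measurements.
cached_ttft = [0.21, 0.25, 0.19, 0.30, 0.22]
uncached_ttft = [0.80, 1.10, 0.95, 1.40, 0.90]

print(summarize(cached_ttft)["median"])   # 0.22
print(summarize(uncached_ttft)["p95"])    # 1.4
print(round(slowdown(cached_ttft, uncached_ttft)["median"], 2))
```

With only eight turns per session the p95 is effectively the maximum, so repeat the session several times before drawing conclusions from the tail.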