Build a five-agent economy simulation where each agent runs LLM calls against a local vLLM server, and every decision is captured as an OpenTelemetry span viewable in Phoenix. The result is a reusable trace harness for any multi-agent system you deploy.

Why this matters

The Thousand Token Wood project [1] demonstrated something operationally important: a 3B model served with vLLM can power a real-time multi-agent simulation at a cost and latency that frontier models can’t match. Every creature decides in a single batched GPU call per turn [1]. But running many agents in parallel creates a new problem: when something goes wrong (an agent posts a buy order for the good it produces, prices crash, one agent dominates), you have no visibility into which LLM call produced the bad decision, what the prompt looked like, or how long each agent took.

OpenTelemetry spans solve this. Each agent’s LLM call becomes a child span under a parent simulation-tick span, so you can trace the full causal chain from tick to agent to prompt to response. Without this, debugging emergent behavior in a multi-agent system means adding print statements and re-running, which is slow and doesn’t scale to production deployments.

This tutorial builds the simulation and the trace harness together, so you can adapt both for your own projects.

Prerequisites

  • Python 3.11 or 3.12
  • A machine with a CUDA-capable GPU, or a vLLM-compatible endpoint you can point the client at (Modal, RunPod, etc.)
  • Familiarity with async Python (asyncio, await)
  • Basic OpenTelemetry concepts (spans, traces, exporters) are helpful but not required
  • Arize Phoenix running locally for trace visualization (optional for the sandbox portion; the tutorial shows console export as a fallback)

Setup

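Install the required packages. The simulation uses openai as the vLLM client (vLLM exposes an OpenAI-compatible API), opentelemetry-sdk for instrumentation, opentelemetry-exporter-otlp-proto-grpc to ship spans over OTLP, and arize-phoenix for the local trace UI. openinference-instrumentation-openai is only needed for the auto-instrumentation option under Next steps.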

uv pip install openai opentelemetry-sdk opentelemetry-exporter-otlp-proto-grpc arize-phoenix openinference-instrumentation-openai

Export your vLLM endpoint. In the sandbox the simulation runs against a mock server defined later in this tutorial. On your own machine, point this at your actual vLLM instance:

export VLLM_BASE_URL="http://localhost:8000/v1"
export VLLM_MODEL="Qwen/Qwen2.5-3B-Instruct"
export OTEL_SERVICE_NAME="thousand-token-wood"

Step 1: Build a mock vLLM server for local testing

Because the sandbox has no GPU, you’ll build a lightweight FastAPI server that mimics the OpenAI-compatible vLLM API. On your own machine with a real vLLM instance, skip this step and point VLLM_BASE_URL at your server.

uv pip install fastapi uvicorn

# filename: mock_vllm.py
import json
import random
from fastapi import FastAPI, Request
from fastapi.responses import JSONResponse

app = FastAPI()

GOODS = ["acorns", "berries", "mushrooms", "firewood", "honey"]

AGENT_ROLES = {
    "squirrel": {"produces": "acorns", "needs": ["berries", "firewood"]},
    "rabbit":   {"produces": "berries", "needs": ["mushrooms", "firewood"]},
    "fox":      {"produces": "mushrooms", "needs": ["acorns", "honey"]},
    "beaver":   {"produces": "firewood", "needs": ["acorns", "berries"]},
    "bear":     {"produces": "honey", "needs": ["mushrooms", "firewood"]},
}

def make_decision(agent_name: str) -> dict:
    role = AGENT_ROLES.get(agent_name, AGENT_ROLES["squirrel"])
    buy_good = random.choice(role["needs"])
    sell_good = role["produces"]
    buy_price = round(random.uniform(1.0, 5.0), 2)
    sell_price = round(random.uniform(1.0, 5.0), 2)
    return {
        "buy": {"good": buy_good, "price": buy_price, "quantity": random.randint(1, 3)},
        "sell": {"good": sell_good, "price": sell_price, "quantity": random.randint(1, 3)},
        "mood": round(random.uniform(0.3, 1.0), 2),
    }

@app.get("/health")
def health():
    return {"status": "ok"}

@app.get("/v1/models")
def list_models():
    return {
        "object": "list",
        "data": [{"id": "Qwen/Qwen2.5-3B-Instruct", "object": "model"}],
    }

@app.post("/v1/chat/completions")
async def chat_completions(request: Request):
    body = await request.json()
    messages = body.get("messages", [])
    agent_name = "squirrel"
    for msg in messages:
        content = msg.get("content", "")
        for name in AGENT_ROLES:
            if name in content.lower():
                agent_name = name
                break
    decision = make_decision(agent_name)
    response_text = json.dumps(decision)
    return JSONResponse({
        "id": "chatcmpl-mock",
        "object": "chat.completion",
        "model": "Qwen/Qwen2.5-3B-Instruct",
        "choices": [{
            "index": 0,
            "message": {"role": "assistant", "content": response_text},
            "finish_reason": "stop",
        }],
        "usage": {"prompt_tokens": 120, "completion_tokens": 40, "total_tokens": 160},
    })

Launch the mock server in the background:

nohup uvicorn mock_vllm:app --host 0.0.0.0 --port 8000 > /tmp/mock_vllm.log 2>&1 & disown
sleep 2
curl -sf http://localhost:8000/health || (echo "mock server failed to start" >&2; cat /tmp/mock_vllm.log; exit 1)
echo "mock vllm server ready"

Override the endpoint to point at the mock:

export VLLM_BASE_URL="http://localhost:8000/v1"
export VLLM_MODEL="Qwen/Qwen2.5-3B-Instruct"

Step 2: Define the agent and economy model

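Each agent holds an inventory, a pebble balance, and a mood score. The Agent class builds the prompt and applies a parsed decision; the LLM call itself lives in the agent_decide() coroutine in Step 4. The design follows the Thousand Token Wood approach [1], in which each creature produces one good, needs others, and must buy firewood every turn as a survival cost. The simplified model below keeps the produce/need roles but does not enforce the firewood cost or add new production each tick.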

# filename: economy.py
import json
import os
import re
from dataclasses import dataclass, field
from typing import Optional

from openai import AsyncOpenAI

GOODS = ["acorns", "berries", "mushrooms", "firewood", "honey"]

AGENT_CONFIGS = [
    {"name": "squirrel", "produces": "acorns",    "needs": ["berries", "firewood"]},
    {"name": "rabbit",   "produces": "berries",   "needs": ["mushrooms", "firewood"]},
    {"name": "fox",      "produces": "mushrooms", "needs": ["acorns", "honey"]},
    {"name": "beaver",   "produces": "firewood",  "needs": ["acorns", "berries"]},
    {"name": "bear",     "produces": "honey",     "needs": ["mushrooms", "firewood"]},
]


@dataclass
class Agent:
    name: str
    produces: str
    needs: list
    inventory: dict = field(default_factory=lambda: {g: 3 for g in GOODS})
    pebbles: float = 10.0
    mood: float = 0.8

    def build_prompt(self, tick: int) -> str:
        inv_str = ", ".join(f"{g}:{v}" for g, v in self.inventory.items())
        needs_str = ", ".join(self.needs)
        return (
            f"You are {self.name}, a woodland creature in a small economy. "
            f"You produce {self.produces} (never buy it). "
            f"You need: {needs_str}. "
            f"Inventory: {inv_str}. Pebbles: {self.pebbles:.1f}. Mood: {self.mood:.2f}. "
            f"Tick: {tick}.\n"
            f"Respond with ONLY valid JSON: "
            '{"buy":{"good":"<name>","price":<float>,"quantity":<int>},'
            '"sell":{"good":"<name>","price":<float>,"quantity":<int>},'
            '"mood":<float 0-1>}'
        )

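    def apply_decision(self, decision: dict) -> None:
        """Apply one parsed decision; malformed or invalid orders are skipped."""
        buy = decision.get("buy") or {}
        sell = decision.get("sell") or {}
        try:
            buy_good = buy.get("good", "")
            sell_good = sell.get("good", "")
            buy_qty = int(buy.get("quantity", 0))
            sell_qty = int(sell.get("quantity", 0))
            buy_price = float(buy.get("price", 0.0))
            sell_price = float(sell.get("price", 0.0))
            mood = float(decision.get("mood", self.mood))
        except (AttributeError, TypeError, ValueError):
            return  # the model returned non-numeric or mis-shaped fields
        # Buy: never the agent's own product, and only positive, affordable orders.
        if buy_good in GOODS and buy_good != self.produces and buy_qty > 0 and buy_price >= 0:
            cost = buy_price * buy_qty
            if cost <= self.pebbles:
                self.inventory[buy_good] = self.inventory.get(buy_good, 0) + buy_qty
                self.pebbles -= cost
        # Sell: only positive quantities the agent actually holds.
        if sell_good in GOODS and sell_price >= 0 and 0 < sell_qty <= self.inventory.get(sell_good, 0):
            self.inventory[sell_good] -= sell_qty
            self.pebbles += sell_price * sell_qty
        # Clamp mood to [0.1, 1.0].
        self.mood = max(0.1, min(1.0, mood))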


def parse_decision(raw: str) -> Optional[dict]:
    """Extract JSON from LLM response, tolerating extra text."""
    try:
        return json.loads(raw)
    except json.JSONDecodeError:
        pass
    match = re.search(r'\{.*\}', raw, re.DOTALL)
    if match:
        try:
            return json.loads(match.group())
        except json.JSONDecodeError:
            pass
    return None


def make_client() -> AsyncOpenAI:
    base_url = os.environ.get("VLLM_BASE_URL", "http://localhost:8000/v1")
    return AsyncOpenAI(base_url=base_url, api_key="not-needed")

Step 3: Add OpenTelemetry instrumentation

The tracer creates a parent span per simulation tick and a child span per agent decision. Each span records the agent name, prompt token count, completion token count, and parsed mood. This structure lets you filter by agent or by tick in Phoenix.


# filename: tracing.py
import os
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import SimpleSpanProcessor, ConsoleSpanExporter
from opentelemetry.sdk.resources import Resource


def build_tracer_provider(use_console: bool = True) -> TracerProvider:
    resource = Resource.create({
        "service.name": os.environ.get("OTEL_SERVICE_NAME", "thousand-token-wood"),
    })
    provider = TracerProvider(resource=resource)
    if use_console:
        provider.add_span_processor(SimpleSpanProcessor(ConsoleSpanExporter()))
    else:
        # OTLP export to Phoenix (grpc default: localhost:4317)
        try:
            from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
            from opentelemetry.sdk.trace.export import BatchSpanProcessor
            otlp = OTLPSpanExporter(endpoint="http://localhost:4317", insecure=True)
            provider.add_span_processor(BatchSpanProcessor(otlp))
        except Exception as exc:
            print(f"OTLP exporter unavailable, falling back to console: {exc}")
            provider.add_span_processor(SimpleSpanProcessor(ConsoleSpanExporter()))
    trace.set_tracer_provider(provider)
    return provider


def get_tracer() -> trace.Tracer:
    return trace.get_tracer("thousand-token-wood")

Step 4: Write the simulation loop

The simulation runs a configurable number of ticks. Each tick, all five agents decide concurrently via asyncio.gather, which returns results in the same order as the agents list. Every LLM call is wrapped in a child span. Token usage, latency, and mood are recorded as span attributes, giving you per-agent cost and sentiment data across the full run.

# filename: simulation.py
import asyncio
import json
import os
import time
from opentelemetry import trace

from economy import Agent, AGENT_CONFIGS, GOODS, make_client, parse_decision
from tracing import build_tracer_provider, get_tracer


async def agent_decide(agent: Agent, tick: int, client, tracer: trace.Tracer) -> dict:
    model = os.environ.get("VLLM_MODEL", "Qwen/Qwen2.5-3B-Instruct")
    prompt = agent.build_prompt(tick)
    with tracer.start_as_current_span(f"agent.decide/{agent.name}") as span:
        span.set_attribute("agent.name", agent.name)
        span.set_attribute("agent.produces", agent.produces)
        span.set_attribute("agent.pebbles", agent.pebbles)
        span.set_attribute("simulation.tick", tick)
        t0 = time.perf_counter()
        try:
            response = await client.chat.completions.create(
                model=model,
                messages=[{"role": "user", "content": prompt}],
                temperature=0.7,
                max_tokens=128,
            )
            elapsed = time.perf_counter() - t0
            raw = response.choices[0].message.content
            usage = response.usage
            span.set_attribute("llm.prompt_tokens", usage.prompt_tokens if usage else 0)
            span.set_attribute("llm.completion_tokens", usage.completion_tokens if usage else 0)
            span.set_attribute("llm.latency_ms", round(elapsed * 1000, 1))
            decision = parse_decision(raw)
            if decision is None:
                span.set_attribute("parse.error", True)
                span.set_attribute("parse.raw", raw[:200])
                return {}
            span.set_attribute("agent.mood_after", float(decision.get("mood", agent.mood)))
            span.set_attribute("decision.buy_good", decision.get("buy", {}).get("good", ""))
            span.set_attribute("decision.sell_good", decision.get("sell", {}).get("good", ""))
            return decision
        except Exception as exc:
            span.record_exception(exc)
            span.set_attribute("error", True)
            return {}


async def run_simulation(ticks: int = 3, use_console_exporter: bool = True) -> list:
    provider = build_tracer_provider(use_console=use_console_exporter)
    tracer = get_tracer()
    client = make_client()
    agents = [Agent(**cfg) for cfg in AGENT_CONFIGS]
    history = []
    for tick in range(1, ticks + 1):
        with tracer.start_as_current_span(f"simulation.tick/{tick}") as tick_span:
            tick_span.set_attribute("simulation.tick", tick)
            decisions = await asyncio.gather(
                *[agent_decide(a, tick, client, tracer) for a in agents]
            )
            tick_state = {}
            for agent, decision in zip(agents, decisions):
                if decision:
                    agent.apply_decision(decision)
                tick_state[agent.name] = {
                    "pebbles": round(agent.pebbles, 2),
                    "mood": round(agent.mood, 2),
                    "inventory": dict(agent.inventory),
                }
            total_pebbles = sum(s["pebbles"] for s in tick_state.values())
            tick_span.set_attribute("economy.total_pebbles", round(total_pebbles, 2))
            history.append({"tick": tick, "state": tick_state})
            print(f"Tick {tick}: " + ", ".join(
                f"{n}={s['pebbles']}p" for n, s in tick_state.items()
            ))
    await client.close()
    provider.force_flush()  # export any spans still buffered by a BatchSpanProcessor
    return history
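
The zip(agents, decisions) pairing in run_simulation relies on asyncio.gather returning results in the order the awaitables were passed in, not in the order they finish. The stdlib-only sketch below illustrates that property; fake_decide is a hypothetical stand-in for agent_decide, and the delays are made up.

```python
import asyncio


async def fake_decide(name: str, delay: float) -> dict:
    """Hypothetical stand-in for agent_decide: sleep, then return a tagged result."""
    await asyncio.sleep(delay)
    return {"agent": name}


async def main() -> list:
    names = ["squirrel", "rabbit", "fox"]
    delays = [0.03, 0.01, 0.02]  # rabbit finishes first, squirrel last
    results = await asyncio.gather(
        *(fake_decide(n, d) for n, d in zip(names, delays))
    )
    # Results come back in input order, regardless of completion order.
    return [r["agent"] for r in results]


order = asyncio.run(main())
print(order)  # ['squirrel', 'rabbit', 'fox']
```

Because the order is stable, a slow agent never causes its decision to be applied to a different agent's state.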

Step 5: Run the simulation

This entry point runs three ticks with the console span exporter so you can see the trace output inline. To send spans to Phoenix instead, set use_console_exporter=False after starting Phoenix locally.

import asyncio
from simulation import run_simulation

history = asyncio.run(run_simulation(ticks=3, use_console_exporter=True))
print("\n=== Final state ===")
for tick_record in history:
    tick = tick_record["tick"]
    for name, state in tick_record["state"].items():
        print(f"  tick={tick} {name}: {state['pebbles']} pebbles, mood={state['mood']}")
print("simulation_complete")
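
The history list is plain data, so it can be summarized without touching the traces. As a rough sketch, the hypothetical helper pebble_deltas below reports how each agent's balance changed between the first and last recorded tick. The sample history is invented and only mirrors the shape that run_simulation returns.

```python
def pebble_deltas(history: list) -> dict:
    """Pebble change per agent between the first and last recorded tick."""
    first = history[0]["state"]
    last = history[-1]["state"]
    return {
        name: round(last[name]["pebbles"] - first[name]["pebbles"], 2)
        for name in first
    }


# Hypothetical two-agent history in the shape run_simulation returns.
sample_history = [
    {"tick": 1, "state": {"fox": {"pebbles": 9.5, "mood": 0.8},
                          "bear": {"pebbles": 12.0, "mood": 0.6}}},
    {"tick": 2, "state": {"fox": {"pebbles": 7.25, "mood": 0.7},
                          "bear": {"pebbles": 14.5, "mood": 0.9}}},
]
deltas = pebble_deltas(sample_history)
print(deltas)  # {'fox': -2.25, 'bear': 2.5}
```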

Step 6: Connect to Phoenix for visual trace inspection

Phoenix runs as a local process and accepts OTLP spans over gRPC on port 4317. Start it, then re-run the simulation with use_console_exporter=False.

# Run Phoenix in the background (requires arize-phoenix installed).
# Skip this block in the sandbox; the console exporter output above is enough there.
# On your own machine (recent Phoenix releases expect the "serve" subcommand):
#   python -m phoenix.server.main serve &
#   open http://localhost:6006
echo "Phoenix launch is skipped in the sandbox; use console exporter output above."
echo "On your own machine: python -m phoenix.server.main serve &"
echo "Then set use_console_exporter=False in run_simulation()"

When Phoenix is running, the simulation’s spans appear in the Phoenix UI with service.name set to thousand-token-wood (under the default project unless you configure a project name). Each tick is a root span with five child spans, one per agent. You can filter by agent.name, sort by llm.latency_ms, and spot which agent’s decisions had parse.error=True.
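
The same filtering can be approximated offline when only console output is available. The sketch below assumes each span has been reduced to a plain dict with name and attributes keys, which are among the fields ConsoleSpanExporter prints as JSON. The sample records and the helper name summarize_spans are hypothetical.

```python
from collections import defaultdict


def summarize_spans(spans: list) -> dict:
    """Group agent spans by agent.name: call count, mean latency, parse errors."""
    grouped = defaultdict(list)
    for span in spans:
        attrs = span.get("attributes", {})
        if "agent.name" in attrs:  # tick spans carry no agent.name
            grouped[attrs["agent.name"]].append(attrs)
    summary = {}
    for agent, rows in grouped.items():
        latencies = [r["llm.latency_ms"] for r in rows if "llm.latency_ms" in r]
        mean = round(sum(latencies) / len(latencies), 1) if latencies else None
        summary[agent] = {
            "calls": len(rows),
            "mean_latency_ms": mean,
            "parse_errors": sum(1 for r in rows if r.get("parse.error")),
        }
    return summary


# Hypothetical span records using the attribute names from agent_decide.
sample = [
    {"name": "simulation.tick/1", "attributes": {"simulation.tick": 1}},
    {"name": "agent.decide/fox",
     "attributes": {"agent.name": "fox", "llm.latency_ms": 40.0}},
    {"name": "agent.decide/fox",
     "attributes": {"agent.name": "fox", "llm.latency_ms": 60.0, "parse.error": True}},
    {"name": "agent.decide/bear",
     "attributes": {"agent.name": "bear", "llm.latency_ms": 25.0}},
]
summary = summarize_spans(sample)
slowest = max(summary, key=lambda a: summary[a]["mean_latency_ms"] or 0.0)
print(slowest, summary[slowest])
# fox {'calls': 2, 'mean_latency_ms': 50.0, 'parse_errors': 1}
```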

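The same span structure carries over to other OTLP-compatible backends such as Datadog or Honeycomb. Only the exporter configuration changes: replace the OTLPSpanExporter endpoint in tracing.py with your vendor’s OTLP ingest URL, pass the vendor’s auth header through the exporter’s headers argument, and drop insecure=True when the endpoint uses TLS.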

Verify it works

import asyncio
import json
from simulation import run_simulation

history = asyncio.run(run_simulation(ticks=2, use_console_exporter=True))

assert len(history) == 2, f"Expected 2 ticks, got {len(history)}"
for tick_record in history:
    state = tick_record["state"]
    assert set(state.keys()) == {"squirrel", "rabbit", "fox", "beaver", "bear"}, \
        f"Missing agents in tick {tick_record['tick']}"
    for name, s in state.items():
        assert "pebbles" in s and "mood" in s and "inventory" in s, \
            f"Incomplete state for {name}"
        assert 0.0 <= s["mood"] <= 1.0, f"Mood out of range for {name}: {s['mood']}"

print("all_assertions_passed")

Troubleshooting

Connection refused when calling the vLLM endpoint. The mock server or your real vLLM instance isn’t running. Check /tmp/mock_vllm.log for startup errors. For a real vLLM instance, verify with curl http://localhost:8000/health (the health route sits at the server root, not under /v1) or with curl $VLLM_BASE_URL/models.

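parse.error=True spans appear for every agent. The LLM is returning something parse_decision cannot turn into JSON. The regex fallback in economy.py grabs everything from the first { to the last }, so markdown fences and leading prose are tolerated, but it fails when the surrounding text contains its own braces, when the model emits more than one JSON object, or when the output is cut off by max_tokens=128 before the closing brace. Inspect the parse.raw span attribute to see which case applies. Then tighten the prompt’s instruction to output only JSON, raise max_tokens, or add a response_format={"type": "json_object"} parameter if your vLLM version supports it.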
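
When stray braces in the surrounding prose are the cause, one possible alternative to the greedy regex is to scan for the first position at which a complete JSON object decodes, using the standard library’s json.JSONDecoder.raw_decode. This is a sketch under that assumption, not the parser used in economy.py; the name extract_first_json is hypothetical.

```python
import json
from typing import Optional


def extract_first_json(raw: str) -> Optional[dict]:
    """Return the first JSON object embedded in raw, or None if none decodes."""
    decoder = json.JSONDecoder()
    start = raw.find("{")
    while start != -1:
        try:
            # raw_decode parses one JSON value starting at index `start`
            # and ignores whatever text follows it.
            value, _end = decoder.raw_decode(raw, start)
        except json.JSONDecodeError:
            value = None
        if isinstance(value, dict):
            return value
        start = raw.find("{", start + 1)
    return None


# Trailing prose with braces defeats a greedy first-to-last-brace regex.
reply = 'Here is my order: {"buy": {"good": "berries"}, "mood": 0.7} Good luck {friend}!'
print(extract_first_json(reply))           # {'buy': {'good': 'berries'}, 'mood': 0.7}
print(extract_first_json("no json here"))  # None
```

Truncated output still returns None here, so raising max_tokens remains the fix for that case.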

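Agents buy the good they produce. This is the core small-model reasoning failure described in [1]. The prompt already tells the agent never to buy its own product, but a 3B model may still violate that occasionally. apply_decision in economy.py discards such orders through its buy_good != self.produces check, so they never change the agent’s state. They still show up in the decision.buy_good span attribute, which you can compare against agent.produces to count violations.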

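Spans appear in the console but not in Phoenix. First check that run_simulation is called with use_console_exporter=False; with the default of True, tracing.py never creates the OTLP exporter. If the flag is correct, confirm Phoenix is up with curl http://localhost:6006/healthz and that it is listening for OTLP gRPC on port 4317, then re-run the simulation. The BatchSpanProcessor exports in the background, so spans from a run that started while Phoenix was down may be dropped once the exporter gives up retrying.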

asyncio.run() raises RuntimeError: asyncio.run() cannot be called from a running event loop. You’re running inside a Jupyter notebook, which already has an event loop. Replace asyncio.run(run_simulation(...)) with await run_simulation(...) in a notebook cell, or install the separate nest_asyncio package and call nest_asyncio.apply() at the top of the notebook.
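
If the same entry point has to work both as a script and in a notebook, a small check can tell the two situations apart. The sketch below uses asyncio.get_running_loop, which raises RuntimeError when no loop is running; the helper name loop_is_running is hypothetical.

```python
import asyncio


def loop_is_running() -> bool:
    """True when the caller is executing inside a running asyncio event loop."""
    try:
        asyncio.get_running_loop()
    except RuntimeError:
        return False
    return True


async def check_inside_loop() -> bool:
    return loop_is_running()


outside = loop_is_running()                # plain script: no loop yet
inside = asyncio.run(check_inside_loop())  # inside a coroutine: loop is running
print(outside, inside)  # False True
```

In a script the check returns False and asyncio.run is safe; in a notebook cell it returns True, which is the signal to use await instead.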

Total pebbles drift significantly across ticks. apply_decision executes every buy and sell immediately, with no counterparty, so pebbles are created and destroyed each tick regardless of whether the mock server or a real model produced the orders. To conserve pebbles, add a clear_market(agents, decisions) function that matches compatible buy and sell orders, and apply only the matched trades.
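
A minimal sketch of such a clearing step follows. It is one possible design, not the mechanism used in [1]: it assumes trades execute at the seller’s asking price, that the highest bids are served first, and that decisions is a dict keyed by agent name. Trader is a hypothetical stand-in for the tutorial’s Agent.

```python
from dataclasses import dataclass, field


@dataclass
class Trader:
    """Hypothetical stand-in for Agent: only the fields clearing needs."""
    name: str
    pebbles: float
    inventory: dict = field(default_factory=dict)


def clear_market(traders: dict, decisions: dict) -> list:
    """Match bids to asks per good; pebbles only move between traders."""
    asks = []
    for name, d in decisions.items():
        sell = d.get("sell")
        if sell and sell["quantity"] > 0:
            held = traders[name].inventory.get(sell["good"], 0)
            asks.append({"seller": name, "good": sell["good"],
                         "price": sell["price"],
                         "left": min(sell["quantity"], held)})
    asks.sort(key=lambda a: a["price"])  # cheapest asks first
    bids = [(n, d["buy"]) for n, d in decisions.items() if d.get("buy")]
    bids.sort(key=lambda b: b[1]["price"], reverse=True)  # highest bids first
    trades = []
    for buyer_name, bid in bids:
        buyer, wanted = traders[buyer_name], bid["quantity"]
        for ask in asks:
            if wanted <= 0:
                break
            if (ask["seller"] == buyer_name or ask["good"] != bid["good"]
                    or ask["price"] > bid["price"] or ask["left"] <= 0):
                continue
            price = ask["price"]
            affordable = int(buyer.pebbles // price) if price > 0 else wanted
            qty = min(wanted, ask["left"], affordable)
            if qty <= 0:
                continue
            seller = traders[ask["seller"]]
            buyer.pebbles -= price * qty
            seller.pebbles += price * qty
            buyer.inventory[bid["good"]] = buyer.inventory.get(bid["good"], 0) + qty
            seller.inventory[bid["good"]] -= qty
            ask["left"] -= qty
            wanted -= qty
            trades.append((buyer_name, ask["seller"], bid["good"], qty, price))
    return trades


traders = {
    "squirrel": Trader("squirrel", 10.0, {"acorns": 3}),
    "rabbit": Trader("rabbit", 10.0, {"berries": 3}),
}
decisions = {
    "squirrel": {"buy": {"good": "berries", "price": 2.0, "quantity": 2},
                 "sell": {"good": "acorns", "price": 1.5, "quantity": 1}},
    "rabbit": {"buy": {"good": "acorns", "price": 1.0, "quantity": 1},
               "sell": {"good": "berries", "price": 1.5, "quantity": 3}},
}
trades = clear_market(traders, decisions)
total = sum(t.pebbles for t in traders.values())
print(trades, total)  # [('squirrel', 'rabbit', 'berries', 2, 1.5)] 20.0
```

The rabbit’s bid for acorns goes unfilled because it is below the squirrel’s asking price, and the pebble total stays at 20.0 because every payment has a matching receipt.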

Next steps

  • Add a market-clearing step. Collect all buy and sell orders from a tick, match them by good and price, and only execute matched trades. This makes the economy conserved and produces the price dynamics described in [1].
  • Instrument with openinference-instrumentation-openai. The openinference library auto-instruments the openai client and records prompt content, token counts, and model name as standard semantic conventions, reducing the manual span attribute code in agent_decide.
  • Deploy on Modal with real vLLM. The Thousand Token Wood project [1] serves Qwen2.5-3B via vLLM on Modal. Point VLLM_BASE_URL at a Modal-hosted endpoint and remove the mock server to run the simulation against the real model.
  • Add a Gradio dashboard. Expose the history list from run_simulation to a Gradio gr.JSON component updated each tick, giving you a live view of pebble balances and mood alongside the Phoenix trace view.

FAQ

How does OpenTelemetry tracing help debug multi-agent systems?

Each agent’s LLM call becomes a child span under a parent simulation-tick span, creating a causal chain from tick to agent to prompt to response. This structure lets you filter by agent or tick in Phoenix and identify which LLM call produced a bad decision without re-running with print statements.

What model and infrastructure does this tutorial use?

The tutorial uses Qwen2.5-3B served via vLLM on a local GPU or remote endpoint (Modal, RunPod). For testing in a sandbox without GPU, a mock FastAPI server mimics the OpenAI-compatible vLLM API.

What span attributes are recorded for each agent decision?

Each agent span records agent name, production good, pebble balance, tick number, LLM prompt and completion token counts, latency in milliseconds, parsed mood, and buy/sell goods. Parse errors and exceptions are also captured as span attributes.

How do agents decide what to buy and sell?

Each agent receives a prompt describing its role, inventory, pebble balance, and mood, then calls the LLM to return a JSON decision with buy order, sell order, and updated mood. The simulation parses the JSON and applies the decision to the agent’s state.

Can spans be sent to observability platforms other than Phoenix?

Yes. The same span structure works with any OTLP-compatible backend (Datadog, Honeycomb, etc.) by changing the exporter endpoint and auth header in the OTLPSpanExporter configuration in tracing.py.