Thousand Token Wood Runs Five Agents on Qwen2.5-3B

What Thousand Token Wood Is

Thousand Token Wood is a multi-agent economic simulation built for Hugging Face’s Build Small Hackathon. Five woodland-creature agents, each powered by Qwen2.5-3B, trade five goods for pebbles, gossip, hoard, and panic inside a small synthetic economy [1]. The model is served via vLLM on Modal, and a Gradio application provides the runtime interface through which observers watch market bubbles, crashes, and wealth concentration emerge without scripted direction [1].

The project is open: agent traces are published alongside the interactive Space. A second version, v2, rebuilt the simulation into an interactive game in which a human operator acts as a shadow financier, lending at interest, planting tips, shorting the market, and bribing participants while a magistrate character tracks suspicious activity [2]. The architectural change underneath that gameplay shift was equally significant: v2 replaced the single shared model with four distinct small models from different labs.

Why Small Models Are the Architecture, Not a Compromise

Running a council of agents that each decide on every simulation tick imposes strict latency and cost constraints. Frontier models are too slow and too costly to serve a group of traders on every turn, making real-time multi-agent simulation impractical at that tier [1]. A small model resolves both problems: in Thousand Token Wood, every creature’s decision executes in a single batched GPU call per turn [1].

The framing in the project’s own engineering report is direct: small is the design, not the limit. The choice of a 3B-class model was not a resource compromise forced by hackathon constraints but the correct tool for a workload that requires many agents thinking many times per run.

Engineering Scarcity and Emergent Behavior

The first implementation of the economy was, by the developer’s account, dead on arrival. Production outpaced consumption, every creature became self-sufficient, and the market cleared once and went silent [1]. No trade emerged because no agent had a reason to trade.

Three scarcity constraints fixed the problem. First, diet variety rules limited each creature to consuming only one unit of any single food per meal, forcing agents to buy goods they do not themselves produce [1]. Second, spoilage mechanics caused perishable food to rot if hoarded, which pushed surplus holders to sell before value decayed [1]. Third, a winter fuel crisis mechanic required every creature to burn firewood each turn, with demand rising over time and only one creature producing firewood [1].

That third constraint drives the simulation’s central drama. A single supplier cannot meet compounding demand, so the woodcutter accumulates wealth while all other agents compete for warmth. The wealth gap and price volatility that observers see are outputs of these designed constraints, not of model sophistication.

V2: Heterogeneous Models as Market Participants

Version 2 replaced the single-model architecture with four distinct models: gpt-oss-20b from OpenAI, MiniCPM3-4B from OpenBMB, Nemotron-Mini-4B from NVIDIA, and a fine-tuned Qwen 0.5B developed by the project’s author [2]. The rationale was that a market becomes interesting when participants genuinely differ, and models trained by different labs on different data with different post-training pipelines produce meaningfully different agent behavior. The owl hoards differently than the fox speculates, in the project’s own framing [2].

Standing four distinct models up on a single platform surfaced friction concentrated almost entirely at the serving layer rather than the modeling layer [2]. vLLM version 0.22.1 JIT-compiles kernels at load and requires the CUDA toolkit, specifically nvcc, to be present. A lean base image does not include it, and all four models failed identically with a “could not find nvcc” error until the deployment switched to a CUDA devel image [2]. One image fix unblocked the entire fleet.

On the hardware side, gpt-oss-20b runs in its native MXFP4 quantization and fits on a 24GB L4 GPU with capacity to spare [2]. Prompt normalization and output parsing across four different model families required additional engineering to keep agent behavior consistent at the interface layer while preserving the behavioral differences that motivated the heterogeneous architecture.

What Small Models Can and Cannot Do in This Role

The project’s engineering report characterizes 3B-class models as reliable format generators and unreliable reasoners [1]. With scarcity constraints in place, agents produce valid structured outputs consistently enough to sustain a running simulation. The models handle the formatting contract.

Reasoning quality is a different matter. The models follow prompt instructions on output structure but do not reliably execute multi-step economic logic. Trade decisions that would require planning across several turns, or responses calibrated to market-wide conditions rather than local inventory, fall outside what these models do dependably. The emergent behaviors that make the simulation interesting, the bubbles, the crashes, the wealth concentration, are products of the constraint design rather than of agent cognition.

FAQ

Q. What inference infrastructure does Thousand Token Wood use? The first version serves Qwen2.5-3B via vLLM on Modal [1]. Version 2 runs four heterogeneous models on the same platform, with a fix required to supply the CUDA toolkit to vLLM 0.22.1 [2].

Q. What caused the “could not find nvcc” failure in v2, and how was it resolved? vLLM 0.22.1 JIT-compiles kernels at load time and requires nvcc from the CUDA toolkit. A lean base image does not include it, causing all four models to fail at startup. Switching to a CUDA devel base image resolved the issue for all models simultaneously [2].

Q. Why did the first version of the economy produce no trade? Production outran consumption, making every agent self-sufficient. Without a reason to exchange goods, the market cleared once and stopped. Designed scarcity constraints, including diet variety rules, spoilage, and a firewood monopoly, introduced the conditions necessary for ongoing trade [1].

Q. Does model heterogeneity in v2 produce observable behavioral differences between agents? The project reports that four models trained by different labs with different post-training produce genuinely different agent behavior, described as the owl hoarding differently than the fox speculates [2]. The behavioral diversity is the stated motivation for the multi-model architecture.

Q. What are the known reasoning limitations of 3B-class models in this simulation role? The project characterizes these models as reliable format generators but unreliable reasoners [1]. They consistently produce valid structured outputs but do not dependably execute multi-step economic logic or market-wide planning.

Key Takeaways

Thousand Token Wood runs five autonomous agents on Qwen2.5-3B served via vLLM on Modal, with a Gradio interface, built for Hugging Face’s Build Small Hackathon [1].
Small models are the architecturally correct choice for real-time multi-agent simulations because frontier models are too slow and costly to serve a council of agents on every tick [1].
Emergent economic behavior required engineered scarcity: diet variety rules, spoilage mechanics, and a firewood monopoly forced trade where a naive implementation produced none [1].
Version 2 introduced four heterogeneous models from OpenAI, OpenBMB, NVIDIA, and a fine-tuned Qwen 0.5B; the primary deployment friction was a missing CUDA toolkit in vLLM 0.22.1, resolved with a single base-image change [2].
3B-class models function reliably as format generators but not as multi-step reasoners; simulation complexity must be encoded in world constraints rather than delegated to model cognition [1].