Local Tool-Calling Inference with Needle 26M, llama.cpp, and OpenTelemetry

Why this matters

Needle is a 26M parameter model released by Cactus Compute under the MIT license, trained on 200B tokens and post-trained on 2B tokens of synthesized function-calling data [1]. It achieves 6000 tok/s prefill and 1200 tok/s decode on consumer hardware [2], outperforming models an order of magnitude larger (FunctionGemma-270M, Qwen-0.6B, Granite-350M) on single-shot function calling benchmarks.
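
As a rough back-of-envelope check, the quoted throughput figures can be turned into a per-call latency estimate. The token counts below are illustrative assumptions, not measurements from the Needle repository, and the estimate ignores model load time and tokenizer overhead.

```python
# Back-of-envelope latency estimate from the throughput figures quoted above.
PREFILL_TOK_PER_S = 6000  # prefill rate cited in the text
DECODE_TOK_PER_S = 1200   # decode rate cited in the text


def estimate_latency_ms(prompt_tokens: int, output_tokens: int) -> float:
    """Return the estimated milliseconds for one tool call (prefill + decode)."""
    prefill_s = prompt_tokens / PREFILL_TOK_PER_S
    decode_s = output_tokens / DECODE_TOK_PER_S
    return (prefill_s + decode_s) * 1000


# Assumed workload: a 300-token prompt (tool schemas plus query) and a 30-token JSON reply.
# That is about 50 ms of prefill plus about 25 ms of decode.
print(round(estimate_latency_ms(300, 30), 1))
```

Under these assumptions a single dispatch lands well under 100 ms, which is why per-stage spans (rather than one coarse request span) are useful for spotting where time actually goes.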

The architecture is deliberate: tool calling is retrieval-and-assembly, not reasoning. Needle drops feed-forward layers entirely, using only attention and gating (Simple Attention Networks) [1]. The model doesn’t need FFN weights to memorize facts when those facts are injected as structured tool schemas in the prompt.

For production agent systems, this matters operationally. A 26M model fits in under 100 MB of RAM, runs on a phone or edge device, and produces deterministic JSON tool calls without the latency tail of a 7B model. The gap in observability tooling for these small, fast models is real: when a tool dispatch fails at 1200 tok/s, you need span-level visibility into which tool was selected, what arguments were extracted, and how long each stage took. This tutorial closes that gap.

Prerequisites

Python 3.11 or newer
llama-cpp-python installable via pip (CPU build is sufficient; the model is 26M parameters)
Basic familiarity with JSON tool-call schemas (OpenAI function-calling format)
Optional: Phoenix or SigNoz running locally to receive OTLP spans (the tutorial uses a console exporter so no external service is required to run the code)

Setup

Install the Python dependencies. llama-cpp-python ships pre-built CPU wheels for most platforms, so the install is fast. opentelemetry-sdk and opentelemetry-exporter-otlp-proto-grpc cover the tracing pipeline; huggingface-hub handles model weight download.

uv pip install llama-cpp-python huggingface-hub opentelemetry-sdk opentelemetry-exporter-otlp-proto-grpc

Verify the key packages are present:

from importlib.metadata import version
# importlib.metadata expects distribution names (as passed to pip), not import names
for pkg in ["llama-cpp-python", "huggingface-hub", "opentelemetry-sdk"]:
    try:
        print(pkg, version(pkg))
    except Exception as e:
        print(pkg, "not found:", e)

Step 1: Download the Needle GGUF weights

Needle’s weights are published on Hugging Face at Cactus-Compute/needle [2]. The GGUF quantized file is under 60 MB, well within the sandbox memory budget.

# filename: download_model.py
import os
from huggingface_hub import hf_hub_download

MODEL_REPO = "Cactus-Compute/needle"
GGUF_FILE = "needle-q8_0.gguf"
MODEL_PATH = "/workspace/needle-q8_0.gguf"

if not os.path.exists(MODEL_PATH):
    print(f"Downloading {GGUF_FILE} from {MODEL_REPO} ...")
    downloaded = hf_hub_download(
        repo_id=MODEL_REPO,
        filename=GGUF_FILE,
        local_dir="/workspace",
    )
    print(f"Saved to {downloaded}")
else:
    print(f"Model already present at {MODEL_PATH}")

python /workspace/download_model.py

Step 2: Understand Needle’s prompt format

Needle uses a structured prompt that injects tool schemas as JSON and expects a single JSON object back. The format, taken from the project repository [2], wraps the user query and tool list in a specific template:

<tools>
[{"name": "...", "description": "...", "parameters": {...}}, ...]
</tools>
<query>
User query text here
</query>

The model responds with a single JSON object: {"name": "tool_name", "arguments": {"param": "value"}}.

Write a helper module that formats prompts and parses responses:

# filename: needle_format.py
import json
from typing import Any


def build_prompt(query: str, tools: list[dict]) -> str:
    """Format a query and tool list into Needle's expected prompt."""
    tools_json = json.dumps(tools, indent=2)
    return f"<tools>\n{tools_json}\n</tools>\n<query>\n{query}\n</query>\n"


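def parse_tool_call(raw: str) -> dict[str, Any]:
    """Extract the first JSON object from the model's raw output."""
    start = raw.find("{")
    if start == -1:
        raise ValueError(f"No JSON object found in model output: {raw!r}")
    try:
        # raw_decode parses exactly one JSON value beginning at `start` and
        # ignores anything after it, so trailing text or a second object is harmless.
        tool_call, _ = json.JSONDecoder().raw_decode(raw, start)
    except json.JSONDecodeError as exc:
        raise ValueError(f"No JSON object found in model output (malformed JSON): {raw!r}") from exc
    return tool_call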
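
Parsing only guarantees valid JSON, not that the object has the name/arguments shape described above. The following is a minimal sketch of a shape check; check_tool_call_shape is a hypothetical helper introduced here for illustration and is not part of Needle or llama-cpp-python.

```python
from typing import Any


def check_tool_call_shape(tool_call: Any) -> tuple[str, dict[str, Any]]:
    """Return (name, arguments), or raise ValueError if the shape is wrong.

    Hypothetical helper for illustration only.
    """
    if not isinstance(tool_call, dict):
        raise ValueError(f"Expected a JSON object, got {type(tool_call).__name__}")
    name = tool_call.get("name")
    if not isinstance(name, str) or not name:
        raise ValueError("Tool call is missing a non-empty string 'name'")
    # A call with no arguments is treated as an empty argument object.
    arguments = tool_call.get("arguments", {})
    if not isinstance(arguments, dict):
        raise ValueError("'arguments' must be a JSON object")
    return name, arguments


name, args = check_tool_call_shape(
    {"name": "set_timer", "arguments": {"duration_seconds": 300}}
)
print(name, args)
```

Failing early here gives a clearer error than a KeyError or TypeError raised later during dispatch.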

Step 3: Build the OTel tracing layer

The tracing module sets up a TracerProvider with a SimpleSpanProcessor backed by a ConsoleSpanExporter. Using SimpleSpanProcessor is intentional here: it flushes each span synchronously when it ends, so verification blocks can assert on stdout immediately rather than waiting for a batch flush.

If you have Phoenix or SigNoz running locally, swap the exporter for OTLPSpanExporter(endpoint="http://localhost:4317", insecure=True) and the spans will appear in your UI without any other code changes.

# filename: tracing.py
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import SimpleSpanProcessor, ConsoleSpanExporter

_provider: TracerProvider | None = None


def get_tracer(service_name: str = "needle-agent") -> trace.Tracer:
    """Return a module-level tracer, initialising the provider once."""
    global _provider
    if _provider is None:
        _provider = TracerProvider()
        _provider.add_span_processor(SimpleSpanProcessor(ConsoleSpanExporter()))
        trace.set_tracer_provider(_provider)
    return trace.get_tracer(service_name)

The same span structure carries over to Datadog or Honeycomb; only the exporter endpoint changes.

Step 4: Write the agent dispatcher

The dispatcher loads the model once, then for each query it:

Formats the prompt with the tool schemas.
Runs inference via llama-cpp-python.
Parses the JSON tool call from the output.
Dispatches to the matching Python function.
Wraps every stage in an OTel span with relevant attributes.

# filename: agent.py
import json
import time
from typing import Any, Callable

from llama_cpp import Llama

from needle_format import build_prompt, parse_tool_call
from tracing import get_tracer

tracer = get_tracer("needle-agent")

MODEL_PATH = "/workspace/needle-q8_0.gguf"


def load_model(model_path: str = MODEL_PATH) -> Llama:
    """Load the Needle GGUF model via llama-cpp-python."""
    return Llama(
        model_path=model_path,
        n_ctx=512,       # Needle is a single-shot model; small context is fine
        n_threads=2,
        verbose=False,
    )


def run_tool_call(
    llm: Llama,
    query: str,
    tools: list[dict],
    registry: dict[str, Callable],
) -> dict[str, Any]:
    """Run a single tool-calling turn with full OTel instrumentation."""
    with tracer.start_as_current_span("needle.tool_call") as root_span:
        root_span.set_attribute("needle.query", query)
        root_span.set_attribute("needle.tool_count", len(tools))

        # --- Inference ---
        with tracer.start_as_current_span("needle.inference") as inf_span:
            prompt = build_prompt(query, tools)
            inf_span.set_attribute("needle.prompt_chars", len(prompt))

            t0 = time.perf_counter()
            result = llm(
                prompt,
                max_tokens=128,
                temperature=0.0,
                stop=["\n\n", "</s>"],
            )
            elapsed_ms = (time.perf_counter() - t0) * 1000

            raw_output: str = result["choices"][0]["text"]
            tokens_generated: int = result["usage"]["completion_tokens"]

            inf_span.set_attribute("needle.output_tokens", tokens_generated)
            inf_span.set_attribute("needle.latency_ms", round(elapsed_ms, 2))
            inf_span.set_attribute("needle.raw_output", raw_output[:200])

        # --- Parse ---
        with tracer.start_as_current_span("needle.parse") as parse_span:
            tool_call = parse_tool_call(raw_output)
            tool_name = tool_call.get("name", "")
            tool_args = tool_call.get("arguments", {})
            parse_span.set_attribute("needle.tool_name", tool_name)
            parse_span.set_attribute("needle.tool_args", json.dumps(tool_args))

        # --- Dispatch ---
        with tracer.start_as_current_span("needle.dispatch") as disp_span:
            disp_span.set_attribute("needle.tool_name", tool_name)
            if tool_name not in registry:
                disp_span.set_attribute("needle.dispatch_error", "unknown_tool")
                raise KeyError(f"Tool '{tool_name}' not found in registry")
            handler = registry[tool_name]
            dispatch_result = handler(**tool_args)
            disp_span.set_attribute("needle.dispatch_result", str(dispatch_result)[:200])

        root_span.set_attribute("needle.success", True)
        return {"tool": tool_name, "args": tool_args, "result": dispatch_result}
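
To exercise the parse and dispatch stages without loading the weights, one option is a small test double in place of the model. The sketch below assumes only what run_tool_call reads from the response: choices[0]["text"] and usage["completion_tokens"]. FakeLlama is a hypothetical name, and its token count is a placeholder rather than a real tokenizer count.

```python
from typing import Any


class FakeLlama:
    """Hypothetical test double that returns a canned completion.

    It mimics only the response fields that run_tool_call reads.
    """

    def __init__(self, canned_text: str) -> None:
        self.canned_text = canned_text
        self.prompts: list[str] = []  # every prompt received, for later inspection

    def __call__(self, prompt: str, **kwargs: Any) -> dict[str, Any]:
        self.prompts.append(prompt)
        return {
            "choices": [{"text": self.canned_text}],
            # Placeholder: whitespace-separated word count, not a real token count.
            "usage": {"completion_tokens": len(self.canned_text.split())},
        }


fake = FakeLlama('{"name": "set_timer", "arguments": {"duration_seconds": 300}}')
result = fake(
    "<tools>\n[]\n</tools>\n<query>\nSet a 5 minute timer\n</query>\n",
    max_tokens=128,
)
print(result["choices"][0]["text"])
```

Because run_tool_call only calls llm(...) and indexes into the returned dict, passing a FakeLlama instance as the llm argument should drive the same spans with a known output, which makes dispatch failures reproducible.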

Step 5: Define example tools and run the agent

Define a small tool registry with two functions: a timer setter and a smart-home light controller. These match the categories Needle was post-trained on [1].

# filename: tools.py
def set_timer(duration_seconds: int, label: str = "") -> str:
    """Simulate setting a timer."""
    return f"Timer set for {duration_seconds}s" + (f" ({label})" if label else "")


def control_light(room: str, action: str, brightness: int = 100) -> str:
    """Simulate controlling a smart-home light."""
    return f"Light in {room} turned {action} at {brightness}% brightness"

Now wire everything together in a driver script. This script loads the model, runs two queries, and prints the dispatched results alongside the OTel spans.

# filename: run_agent.py
import json
from agent import load_model, run_tool_call
from tools import set_timer, control_light

TOOL_SCHEMAS = [
    {
        "name": "set_timer",
        "description": "Set a countdown timer for a given number of seconds.",
        "parameters": {
            "type": "object",
            "properties": {
                "duration_seconds": {"type": "integer", "description": "Duration in seconds"},
                "label": {"type": "string", "description": "Optional label for the timer"},
            },
            "required": ["duration_seconds"],
        },
    },
    {
        "name": "control_light",
        "description": "Turn a smart-home light on or off, with optional brightness.",
        "parameters": {
            "type": "object",
            "properties": {
                "room": {"type": "string", "description": "Room name"},
                "action": {"type": "string", "enum": ["on", "off"]},
                "brightness": {"type": "integer", "description": "Brightness 0-100"},
            },
            "required": ["room", "action"],
        },
    },
]

REGISTRY = {
    "set_timer": set_timer,
    "control_light": control_light,
}

QUERIES = [
    "Set a 5 minute pasta timer",
    "Turn off the bedroom lights",
]


def main():
    print("Loading Needle model...")
    llm = load_model()
    print("Model loaded. Running queries...\n")

    for query in QUERIES:
        print(f"Query: {query}")
        outcome = run_tool_call(llm, query, TOOL_SCHEMAS, REGISTRY)
        print(f"  Tool   : {outcome['tool']}")
        print(f"  Args   : {json.dumps(outcome['args'])}")
        print(f"  Result : {outcome['result']}")
        print()

    print("agent_run_complete")


if __name__ == "__main__":
    main()
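
Because the schemas and the registry are maintained by hand, they can drift apart. The following sketch checks that every schema has a handler and that handler signatures agree with the schema's properties and required list. check_registry is a hypothetical helper, and it assumes handlers use plain named parameters as in tools.py.

```python
import inspect
from typing import Callable


def check_registry(schemas: list[dict], registry: dict[str, Callable]) -> list[str]:
    """Return human-readable mismatches between schemas and handlers (empty if none)."""
    problems: list[str] = []
    for schema in schemas:
        name = schema["name"]
        handler = registry.get(name)
        if handler is None:
            problems.append(f"{name}: no handler in registry")
            continue
        params = inspect.signature(handler).parameters
        accepts_kwargs = any(p.kind is p.VAR_KEYWORD for p in params.values())
        spec = schema.get("parameters", {})
        required = set(spec.get("required", []))
        # Every schema property must be something the handler can accept.
        for prop in spec.get("properties", {}):
            if prop not in params and not accepts_kwargs:
                problems.append(f"{name}: '{prop}' is not a handler parameter")
        # Every handler parameter without a default must be required by the schema.
        for pname, p in params.items():
            if p.kind in (p.VAR_POSITIONAL, p.VAR_KEYWORD):
                continue
            if p.default is p.empty and pname not in required:
                problems.append(f"{name}: '{pname}' has no default but is not required")
    return problems


def set_timer(duration_seconds: int, label: str = "") -> str:
    return f"Timer set for {duration_seconds}s"


schemas = [
    {
        "name": "set_timer",
        "parameters": {
            "type": "object",
            "properties": {
                "duration_seconds": {"type": "integer"},
                "label": {"type": "string"},
            },
            "required": ["duration_seconds"],
        },
    },
    {"name": "control_light", "parameters": {"type": "object", "properties": {}}},
]

# control_light is deliberately left out of the registry to show a reported mismatch.
print(check_registry(schemas, {"set_timer": set_timer}))
```

Running a check like this once at startup, before the first query, turns a runtime KeyError in the dispatch span into an immediate configuration error.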

Verify it works

Run the driver. The console exporter will print span JSON to stdout alongside the agent output. Look for needle.tool_call, needle.inference, needle.parse, and needle.dispatch spans, each with their attributes.

python /workspace/run_agent.py

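After the run, check the tracing pipeline in isolation with a focused verification script. It emits synthetic spans with the same names and attributes as the agent and asserts on them through an in-memory exporter: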

from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import SimpleSpanProcessor
from opentelemetry.sdk.trace.export.in_memory_span_exporter import InMemorySpanExporter

# Use an in-memory exporter for deterministic assertion
mem_exporter = InMemorySpanExporter()
provider = TracerProvider()
provider.add_span_processor(SimpleSpanProcessor(mem_exporter))
trace.set_tracer_provider(provider)

# Re-import tracer with fresh provider
import tracing as tracing_mod
tracing_mod._provider = provider
test_tracer = provider.get_tracer("needle-agent")

# Emit a synthetic span to verify the pipeline
with test_tracer.start_as_current_span("needle.tool_call") as span:
    span.set_attribute("needle.query", "test query")
    span.set_attribute("needle.tool_count", 2)
    with test_tracer.start_as_current_span("needle.inference") as inf:
        inf.set_attribute("needle.output_tokens", 12)
        inf.set_attribute("needle.latency_ms", 8.5)
    with test_tracer.start_as_current_span("needle.parse") as p:
        p.set_attribute("needle.tool_name", "set_timer")
    with test_tracer.start_as_current_span("needle.dispatch") as d:
        d.set_attribute("needle.tool_name", "set_timer")
        d.set_attribute("needle.dispatch_result", "Timer set for 300s")

spans = mem_exporter.get_finished_spans()
span_names = [s.name for s in spans]
print("Finished spans:", span_names)

assert "needle.tool_call" in span_names, "root span missing"
assert "needle.inference" in span_names, "inference span missing"
assert "needle.parse" in span_names, "parse span missing"
assert "needle.dispatch" in span_names, "dispatch span missing"

# Verify attributes on the inference span
inf_span = next(s for s in spans if s.name == "needle.inference")
assert inf_span.attributes["needle.output_tokens"] == 12
assert inf_span.attributes["needle.latency_ms"] == 8.5

print("All span assertions passed")
print("otel_verification_ok")

Routing spans to Phoenix or SigNoz

The console exporter is sufficient for local debugging. To route spans to a running Phoenix or SigNoz instance, replace the exporter in tracing.py:

# filename: tracing_otlp.py
"""
Drop-in replacement for tracing.py that sends spans via OTLP/gRPC.
Set OTEL_EXPORTER_OTLP_ENDPOINT before importing (default: localhost:4317).
"""
import os
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

_provider: TracerProvider | None = None


def get_tracer(service_name: str = "needle-agent") -> trace.Tracer:
    global _provider
    if _provider is None:
        endpoint = os.environ.get("OTEL_EXPORTER_OTLP_ENDPOINT", "localhost:4317")
        exporter = OTLPSpanExporter(endpoint=endpoint, insecure=True)
        _provider = TracerProvider()
        _provider.add_span_processor(BatchSpanProcessor(exporter))
        trace.set_tracer_provider(_provider)
    return trace.get_tracer(service_name)

For Phoenix, start it with python -m phoenix.server.main serve and set OTEL_EXPORTER_OTLP_ENDPOINT=localhost:4317. For SigNoz, the collector listens on the same port by default. The span names and attributes are identical; only the exporter changes.

Troubleshooting

llama_cpp import fails with OSError: libllama.so not found

The pre-built wheel bundles the shared library. If you built llama.cpp from source and installed the Python bindings manually, ensure CMAKE_ARGS="-DLLAMA_BLAS=OFF" was set during the build, or reinstall the wheel with uv pip install llama-cpp-python --force-reinstall.

huggingface_hub download times out or returns a 404

The GGUF filename on the Needle repository may change as the project evolves. Run huggingface_hub.list_repo_files("Cactus-Compute/needle") to list available files and update GGUF_FILE in download_model.py accordingly.

Model output is empty or contains only whitespace

Needle is sensitive to the exact prompt format. Confirm the prompt contains both <tools> and <query> tags with valid JSON inside <tools>. Print the raw prompt before inference to inspect it.

parse_tool_call raises ValueError: No JSON object found

The model occasionally emits a partial response if max_tokens is too low or the tool schema is very large. Increase max_tokens to 256 and reduce the number of tools in the schema to the minimum needed for the query.

Spans appear in the console but not in Phoenix/SigNoz

Confirm the collector is listening on port 4317 with curl -v telnet://localhost:4317. If using BatchSpanProcessor, call provider.force_flush() before process exit to drain the buffer. The SimpleSpanProcessor used in tracing.py flushes synchronously and does not need this.

Memory error when loading the model

The Q8 quantization of a 26M model uses roughly 30-50 MB. If you’re on a very constrained device, try the Q4 variant if available, or reduce n_ctx to 256.
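
For a sense of where the memory figure comes from, the weight footprint can be estimated from the ggml block layouts: Q8_0 stores 32 weights in 34 bytes and Q4_0 stores 32 weights in 18 bytes (each block carries one fp16 scale). The sketch below assumes every tensor uses the same quantization type, which real GGUF files often do not, and it excludes the KV cache and runtime overhead.

```python
PARAMS = 26_000_000  # parameter count cited in the text

# ggml block layouts: 32 weights per block plus one fp16 scale.
WEIGHTS_PER_BLOCK = 32
BYTES_PER_BLOCK = {"Q8_0": 34, "Q4_0": 18}


def weight_megabytes(params: int, quant: str) -> float:
    """Estimate weight storage in MB, assuming all tensors use `quant`."""
    blocks = params / WEIGHTS_PER_BLOCK
    return blocks * BYTES_PER_BLOCK[quant] / 1_000_000


for quant in BYTES_PER_BLOCK:
    print(quant, round(weight_megabytes(PARAMS, quant), 1), "MB")
```

Under these assumptions the weights alone come to roughly 28 MB at Q8_0 and roughly 15 MB at Q4_0; the remainder of the observed footprint would be context buffers and runtime overhead, which is why lowering n_ctx also helps.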

Next steps

Multi-turn agent loop: Needle is designed for single-shot dispatch [1], but you can chain calls by feeding the dispatch result back into a second query. Wrap run_tool_call in a loop and add a needle.turn attribute to each root span to track conversation depth.
Custom tool fine-tuning: The Needle repository includes fine-tuning scripts for Mac and PC [2]. Train on your own tool schemas and export to GGUF for drop-in replacement in this pipeline.
Edge deployment via Cactus: The Cactus inference engine (https://github.com/cactus-compute/cactus) targets mobile and wearable hardware. The same GGUF weights and tool schemas work there; replace the llama-cpp-python calls with the Cactus SDK.
Structured output validation: Add a Pydantic model for each tool’s argument schema and validate tool_args before dispatch. Emit a needle.validation_error span event when validation fails to track schema drift over time.
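
The validation step in the last item can be prototyped without extra dependencies. The sketch below is a stdlib stand-in for the Pydantic approach, not a replacement for it: validate_args is a hypothetical helper that checks required fields, basic JSON Schema types, and enum values against a tool's parameters block.

```python
from typing import Any

# JSON Schema type names mapped to Python types (only the subset used in this tutorial).
_JSON_TYPES = {"string": str, "integer": int, "number": (int, float), "boolean": bool}


def validate_args(args: dict[str, Any], parameters: dict[str, Any]) -> list[str]:
    """Return validation errors for `args` against a tool's `parameters` schema."""
    errors: list[str] = []
    props = parameters.get("properties", {})
    for name in parameters.get("required", []):
        if name not in args:
            errors.append(f"missing required argument '{name}'")
    for name, value in args.items():
        spec = props.get(name)
        if spec is None:
            errors.append(f"unexpected argument '{name}'")
            continue
        type_name = spec.get("type")
        expected = _JSON_TYPES.get(type_name)
        # bool is a subclass of int in Python, so reject it for non-boolean fields.
        is_stray_bool = isinstance(value, bool) and type_name != "boolean"
        if expected is not None and (is_stray_bool or not isinstance(value, expected)):
            errors.append(f"argument '{name}' should be of type {type_name}")
        elif "enum" in spec and value not in spec["enum"]:
            errors.append(f"argument '{name}' must be one of {spec['enum']}")
    return errors


# Schema fragment mirroring the control_light tool defined earlier.
light_params = {
    "type": "object",
    "properties": {
        "room": {"type": "string"},
        "action": {"type": "string", "enum": ["on", "off"]},
        "brightness": {"type": "integer"},
    },
    "required": ["room", "action"],
}

print(validate_args({"room": "bedroom", "action": "off"}, light_params))
print(validate_args({"action": "dim", "brightness": "high"}, light_params))
```

In the dispatcher, each returned error could be attached to the current span with span.add_event("needle.validation_error", {...}) before the call is rejected, so that schema drift shows up in the trace rather than as a handler exception.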

FAQ

What is Needle and why use it for tool calling?

Needle is a 26M parameter model from Cactus Compute, trained on 200B tokens and post-trained on 2B tokens of synthesized function-calling data. It achieves 6000 tok/s prefill and 1200 tok/s decode on consumer hardware while outperforming much larger models on single-shot function calling benchmarks. The model uses only attention and gating layers, dropping feed-forward layers entirely, on the premise that tool calling is retrieval-and-assembly rather than reasoning.

How does the agent dispatcher work with OpenTelemetry?

The dispatcher opens a root needle.tool_call span for each query and nests three child spans under it: needle.inference (which also covers prompt formatting), needle.parse, and needle.dispatch. Each span captures relevant attributes such as query text, output tokens, latency, tool name, and dispatch result. The spans can be exported to the console, Phoenix, or SigNoz by swapping the exporter without changing any other code.

What prompt format does Needle expect?

Needle uses a structured prompt with <tools> and <query> tags. The tools section contains a JSON array of tool schemas with name, description, and parameters. The query section contains the user request. The model responds with a single JSON object containing the selected tool name and arguments.

Can spans be routed to Phoenix or SigNoz instead of the console?

Yes. Replace the ConsoleSpanExporter in tracing.py with OTLPSpanExporter pointing to localhost:4317 and use BatchSpanProcessor instead of SimpleSpanProcessor. The span names and attributes remain identical; only the exporter endpoint changes.

What are the memory and performance requirements?

The Q8 quantized Needle model is under 60 MB on disk and uses roughly 30-50 MB of RAM when loaded. It runs on consumer hardware and phones, achieving 1200 tok/s decode speed. The 512-token context configured in this tutorial (n_ctx=512) is sufficient for single-shot tool calling with a small set of tool schemas.