# Local Tool-Calling Inference with Needle 26M, llama.cpp, and OpenTelemetry

> Build a local inference pipeline that runs Needle, a 26M parameter tool-calling model distilled from Gemini, through llama.cpp's Python bindings, then wire a small Python agent dispatcher to emit OpenTelemetry spans you can inspect in a console exporter or route to Phoenix or SigNoz.

- Canonical URL: https://agentry.press/tutorial/local-tool-calling-inference-with-needle-26m-llama-cpp-and-opentelemetry/
- Type: Tutorial
- Published: 2026-06-03
- By: agentry
- Tags: llama-cpp, tool-calling, opentelemetry, local-inference, agents

---

## Why this matters

Needle is a 26M parameter model released by Cactus Compute under the MIT license, trained on 200B tokens and post-trained on 2B tokens of synthesized function-calling data [1]. It achieves 6000 tok/s prefill and 1200 tok/s decode on consumer hardware [2], outperforming models an order of magnitude larger (FunctionGemma-270M, Qwen-0.6B, Granite-350M) on single-shot function calling benchmarks.

The architecture is a deliberate bet: tool calling is retrieval-and-assembly, not reasoning. Needle drops feed-forward layers entirely, using only attention and gating (Simple Attention Networks) [1]. The model doesn't need FFN weights to memorize facts when those facts are injected as structured tool schemas in the prompt.

For production agent systems, this matters operationally. A 26M model fits in under 100 MB of RAM, runs on a phone or edge device, and produces deterministic JSON tool calls without the latency tail of a 7B model. The gap in observability tooling for these small, fast models is real: when a tool dispatch fails at 1200 tok/s, you need span-level visibility into which tool was selected, what arguments were extracted, and how long each stage took. This tutorial closes that gap.

## Prerequisites

- Python 3.11 or newer
- `llama-cpp-python` installable via pip (CPU build is sufficient; the model is 26M parameters)
- Basic familiarity with JSON tool-call schemas (OpenAI function-calling format)
- Optional: Phoenix or SigNoz running locally to receive OTLP spans (the tutorial uses a console exporter so no external service is required to run the code)

## Setup

Install the Python dependencies. `llama-cpp-python` ships pre-built CPU wheels for most platforms, so the install is fast. `opentelemetry-sdk` and `opentelemetry-exporter-otlp-proto-grpc` cover the tracing pipeline; `huggingface-hub` handles model weight download.

```bash
uv pip install llama-cpp-python huggingface-hub opentelemetry-sdk opentelemetry-exporter-otlp-proto-grpc
```

Verify the key packages are present:

```python
from importlib.metadata import version
# importlib.metadata looks up distribution names, not import names
for pkg in ["llama-cpp-python", "huggingface-hub", "opentelemetry-sdk"]:
    try:
        print(pkg, version(pkg))
    except Exception as e:
        print(pkg, "not found:", e)
```

## Step 1: Download the Needle GGUF weights

Needle's weights are published on Hugging Face at `Cactus-Compute/needle` [2]. The GGUF quantized file is under 60 MB, so it downloads quickly and fits comfortably in memory.

```python
# filename: download_model.py
import os
from huggingface_hub import hf_hub_download

MODEL_REPO = "Cactus-Compute/needle"
GGUF_FILE = "needle-q8_0.gguf"
MODEL_PATH = "/workspace/needle-q8_0.gguf"

if not os.path.exists(MODEL_PATH):
    print(f"Downloading {GGUF_FILE} from {MODEL_REPO} ...")
    downloaded = hf_hub_download(
        repo_id=MODEL_REPO,
        filename=GGUF_FILE,
        local_dir="/workspace",
    )
    print(f"Saved to {downloaded}")
else:
    print(f"Model already present at {MODEL_PATH}")
```

```bash
python /workspace/download_model.py
```

## Step 2: Understand Needle's prompt format

Needle uses a structured prompt that injects tool schemas as JSON and expects a single JSON object back. The format, taken from the project repository [2], wraps the user query and tool list in a specific template:

```
<tools>
[{"name": "...", "description": "...", "parameters": {...}}, ...]
</tools>
<query>
User query text here
</query>
```

The model responds with a single JSON object: `{"name": "tool_name", "arguments": {"param": "value"}}`.

Write a helper module that formats prompts and parses responses:

```python
# filename: needle_format.py
import json
from typing import Any


def build_prompt(query: str, tools: list[dict]) -> str:
    """Format a query and tool list into Needle's expected prompt."""
    tools_json = json.dumps(tools, indent=2)
    return f"<tools>\n{tools_json}\n</tools>\n<query>\n{query}\n</query>\n"


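def parse_tool_call(raw: str) -> dict[str, Any]:
    """Extract the first JSON object from the model's raw output."""
    start = raw.find("{")
    if start == -1:
        raise ValueError(f"No JSON object found in model output: {raw!r}")
    # raw_decode parses one complete JSON value starting at `start` and
    # ignores any trailing text, so extra tokens after the object are harmless.
    try:
        tool_call, _ = json.JSONDecoder().raw_decode(raw, start)
    except json.JSONDecodeError as exc:
        raise ValueError(f"No JSON object found in model output: {raw!r}") from exc
    return tool_call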
```

## Step 3: Build the OTel tracing layer

The tracing module sets up a `TracerProvider` with a `SimpleSpanProcessor` backed by a `ConsoleSpanExporter`. Using `SimpleSpanProcessor` is intentional here: it flushes each span synchronously when it ends, so verification blocks can assert on stdout immediately rather than waiting for a batch flush.

If you have Phoenix or SigNoz running locally, swap the exporter for `OTLPSpanExporter(endpoint="http://localhost:4317", insecure=True)` and the spans will appear in your UI without any other code changes.

```python
# filename: tracing.py
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import SimpleSpanProcessor, ConsoleSpanExporter

_provider: TracerProvider | None = None


def get_tracer(service_name: str = "needle-agent") -> trace.Tracer:
    """Return a module-level tracer, initialising the provider once."""
    global _provider
    if _provider is None:
        _provider = TracerProvider()
        _provider.add_span_processor(SimpleSpanProcessor(ConsoleSpanExporter()))
        trace.set_tracer_provider(_provider)
    return trace.get_tracer(service_name)
```

> [!PULLQUOTE]
> The same span structure indexes the same way on Datadog or Honeycomb; only the exporter endpoint changes.

## Step 4: Write the agent dispatcher

The dispatcher loads the model once, then for each query it:

1. Formats the prompt with the tool schemas.
2. Runs inference via `llama-cpp-python`.
3. Parses the JSON tool call from the output.
4. Dispatches to the matching Python function.
5. Wraps every stage in an OTel span with relevant attributes.

```python
# filename: agent.py
import json
import time
from typing import Any, Callable

from llama_cpp import Llama

from needle_format import build_prompt, parse_tool_call
from tracing import get_tracer

tracer = get_tracer("needle-agent")

MODEL_PATH = "/workspace/needle-q8_0.gguf"


def load_model(model_path: str = MODEL_PATH) -> Llama:
    """Load the Needle GGUF model via llama-cpp-python."""
    return Llama(
        model_path=model_path,
        n_ctx=512,       # Needle is a single-shot model; small context is fine
        n_threads=2,
        verbose=False,
    )


def run_tool_call(
    llm: Llama,
    query: str,
    tools: list[dict],
    registry: dict[str, Callable],
) -> dict[str, Any]:
    """Run a single tool-calling turn with full OTel instrumentation."""
    with tracer.start_as_current_span("needle.tool_call") as root_span:
        root_span.set_attribute("needle.query", query)
        root_span.set_attribute("needle.tool_count", len(tools))

        # --- Inference ---
        with tracer.start_as_current_span("needle.inference") as inf_span:
            prompt = build_prompt(query, tools)
            inf_span.set_attribute("needle.prompt_chars", len(prompt))

            t0 = time.perf_counter()
            result = llm(
                prompt,
                max_tokens=128,
                temperature=0.0,
                stop=["\n\n", "</s>"],
            )
            elapsed_ms = (time.perf_counter() - t0) * 1000

            raw_output: str = result["choices"][0]["text"]
            tokens_generated: int = result["usage"]["completion_tokens"]

            inf_span.set_attribute("needle.output_tokens", tokens_generated)
            inf_span.set_attribute("needle.latency_ms", round(elapsed_ms, 2))
            inf_span.set_attribute("needle.raw_output", raw_output[:200])

        # --- Parse ---
        with tracer.start_as_current_span("needle.parse") as parse_span:
            tool_call = parse_tool_call(raw_output)
            tool_name = tool_call.get("name", "")
            tool_args = tool_call.get("arguments", {})
            parse_span.set_attribute("needle.tool_name", tool_name)
            parse_span.set_attribute("needle.tool_args", json.dumps(tool_args))

        # --- Dispatch ---
        with tracer.start_as_current_span("needle.dispatch") as disp_span:
            disp_span.set_attribute("needle.tool_name", tool_name)
            if tool_name not in registry:
                disp_span.set_attribute("needle.dispatch_error", "unknown_tool")
                raise KeyError(f"Tool '{tool_name}' not found in registry")
            handler = registry[tool_name]
            dispatch_result = handler(**tool_args)
            disp_span.set_attribute("needle.dispatch_result", str(dispatch_result)[:200])

        root_span.set_attribute("needle.success", True)
        return {"tool": tool_name, "args": tool_args, "result": dispatch_result}
```
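
One failure mode worth guarding against: `handler(**tool_args)` raises `TypeError` if the model emits an argument name the Python function does not accept. The following is a minimal sketch of one way to filter arguments before dispatch, not part of the Needle project. `filter_kwargs` is a hypothetical helper, and `set_timer` here is a stand-in with the same signature as the timer tool defined in Step 5.

```python
import inspect
from typing import Any, Callable


def filter_kwargs(
    handler: Callable, args: dict[str, Any]
) -> tuple[dict[str, Any], list[str]]:
    """Split model-supplied args into those the handler accepts and those it does not."""
    params = inspect.signature(handler).parameters
    # A handler that declares **kwargs accepts any argument name.
    if any(p.kind is inspect.Parameter.VAR_KEYWORD for p in params.values()):
        return dict(args), []
    # Only these parameter kinds can be passed by keyword.
    keyword_ok = {
        name
        for name, p in params.items()
        if p.kind in (inspect.Parameter.POSITIONAL_OR_KEYWORD, inspect.Parameter.KEYWORD_ONLY)
    }
    accepted = {k: v for k, v in args.items() if k in keyword_ok}
    dropped = [k for k in args if k not in keyword_ok]
    return accepted, dropped


# Stand-in handler (assumption: mirrors the tutorial's timer tool).
def set_timer(duration_seconds: int, label: str = "") -> str:
    return f"Timer set for {duration_seconds}s" + (f" ({label})" if label else "")


# "unit" is an argument the model invented; it is dropped rather than passed through.
accepted, dropped = filter_kwargs(set_timer, {"duration_seconds": 300, "unit": "s"})
print(accepted, dropped)
print(set_timer(**accepted))
```

In the dispatcher, the dropped names could be recorded on the dispatch span (for example under a hypothetical `needle.dropped_args` attribute) so that hallucinated arguments show up in traces instead of crashing the call.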

## Step 5: Define example tools and run the agent

Define a small tool registry with two functions: a timer setter and a smart-home light controller. These match the categories Needle was post-trained on [1].

```python
# filename: tools.py
def set_timer(duration_seconds: int, label: str = "") -> str:
    """Simulate setting a timer."""
    return f"Timer set for {duration_seconds}s" + (f" ({label})" if label else "")


def control_light(room: str, action: str, brightness: int = 100) -> str:
    """Simulate controlling a smart-home light."""
    return f"Light in {room} turned {action} at {brightness}% brightness"
```

Now wire everything together in a driver script. This script loads the model, runs two queries, and prints the dispatched results alongside the OTel spans.

```python
# filename: run_agent.py
import json
from agent import load_model, run_tool_call
from tools import set_timer, control_light

TOOL_SCHEMAS = [
    {
        "name": "set_timer",
        "description": "Set a countdown timer for a given number of seconds.",
        "parameters": {
            "type": "object",
            "properties": {
                "duration_seconds": {"type": "integer", "description": "Duration in seconds"},
                "label": {"type": "string", "description": "Optional label for the timer"},
            },
            "required": ["duration_seconds"],
        },
    },
    {
        "name": "control_light",
        "description": "Turn a smart-home light on or off, with optional brightness.",
        "parameters": {
            "type": "object",
            "properties": {
                "room": {"type": "string", "description": "Room name"},
                "action": {"type": "string", "enum": ["on", "off"]},
                "brightness": {"type": "integer", "description": "Brightness 0-100"},
            },
            "required": ["room", "action"],
        },
    },
]

REGISTRY = {
    "set_timer": set_timer,
    "control_light": control_light,
}

QUERIES = [
    "Set a 5 minute pasta timer",
    "Turn off the bedroom lights",
]


def main():
    print("Loading Needle model...")
    llm = load_model()
    print("Model loaded. Running queries...\n")

    for query in QUERIES:
        print(f"Query: {query}")
        outcome = run_tool_call(llm, query, TOOL_SCHEMAS, REGISTRY)
        print(f"  Tool   : {outcome['tool']}")
        print(f"  Args   : {json.dumps(outcome['args'])}")
        print(f"  Result : {outcome['result']}")
        print()

    print("agent_run_complete")


if __name__ == "__main__":
    main()
```

## Verify it works

Run the driver. The console exporter will print span JSON to stdout alongside the agent output. Look for `needle.tool_call`, `needle.inference`, `needle.parse`, and `needle.dispatch` spans, each with their attributes.

```bash
python /workspace/run_agent.py
```

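To check the span structure without loading the model, run a focused verification script. It emits synthetic spans with the same names and attributes through an in-memory exporter and asserts on the result: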

```python
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import SimpleSpanProcessor
from opentelemetry.sdk.trace.export.in_memory_span_exporter import InMemorySpanExporter

# Use an in-memory exporter for deterministic assertion
mem_exporter = InMemorySpanExporter()
provider = TracerProvider()
provider.add_span_processor(SimpleSpanProcessor(mem_exporter))
trace.set_tracer_provider(provider)

# Point tracing.py at the test provider so get_tracer() reuses it
import tracing as tracing_mod
tracing_mod._provider = provider
test_tracer = provider.get_tracer("needle-agent")

# Emit a synthetic span to verify the pipeline
with test_tracer.start_as_current_span("needle.tool_call") as span:
    span.set_attribute("needle.query", "test query")
    span.set_attribute("needle.tool_count", 2)
    with test_tracer.start_as_current_span("needle.inference") as inf:
        inf.set_attribute("needle.output_tokens", 12)
        inf.set_attribute("needle.latency_ms", 8.5)
    with test_tracer.start_as_current_span("needle.parse") as p:
        p.set_attribute("needle.tool_name", "set_timer")
    with test_tracer.start_as_current_span("needle.dispatch") as d:
        d.set_attribute("needle.tool_name", "set_timer")
        d.set_attribute("needle.dispatch_result", "Timer set for 300s")

spans = mem_exporter.get_finished_spans()
span_names = [s.name for s in spans]
print("Finished spans:", span_names)

assert "needle.tool_call" in span_names, "root span missing"
assert "needle.inference" in span_names, "inference span missing"
assert "needle.parse" in span_names, "parse span missing"
assert "needle.dispatch" in span_names, "dispatch span missing"

# Verify attributes on the inference span
inf_span = next(s for s in spans if s.name == "needle.inference")
assert inf_span.attributes["needle.output_tokens"] == 12
assert inf_span.attributes["needle.latency_ms"] == 8.5

print("All span assertions passed")
print("otel_verification_ok")
```
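
The `needle.output_tokens` and `needle.latency_ms` attributes are enough to derive a rough generation rate per call. The sketch below assumes the inference span's attributes have been copied into a plain dict (something like `dict(inf_span.attributes)` with the in-memory exporter); `tokens_per_second` is a hypothetical helper name. Because the measured latency covers prefill as well as decode, the result understates pure decode speed.

```python
def tokens_per_second(attrs: dict) -> float:
    """Derive an end-to-end generation rate from inference-span attributes."""
    latency_s = attrs["needle.latency_ms"] / 1000
    if latency_s <= 0:
        raise ValueError("latency must be positive")
    return attrs["needle.output_tokens"] / latency_s


# Values taken from the synthetic inference span in the verification script.
attrs = {"needle.output_tokens": 12, "needle.latency_ms": 8.5}
print(round(tokens_per_second(attrs)))  # 12 tokens / 0.0085 s, about 1412 tok/s
```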

## Routing spans to Phoenix or SigNoz

The console exporter is sufficient for local debugging. To route spans to a running Phoenix or SigNoz instance, replace the exporter in `tracing.py`:

```python
# filename: tracing_otlp.py
"""
Drop-in replacement for tracing.py that sends spans via OTLP/gRPC.
Set OTEL_EXPORTER_OTLP_ENDPOINT before importing (default: localhost:4317).
"""
import os
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

_provider: TracerProvider | None = None


def get_tracer(service_name: str = "needle-agent") -> trace.Tracer:
    global _provider
    if _provider is None:
        endpoint = os.environ.get("OTEL_EXPORTER_OTLP_ENDPOINT", "localhost:4317")
        exporter = OTLPSpanExporter(endpoint=endpoint, insecure=True)
        _provider = TracerProvider()
        _provider.add_span_processor(BatchSpanProcessor(exporter))
        trace.set_tracer_provider(_provider)
    return trace.get_tracer(service_name)
```

For Phoenix, start it with `python -m phoenix.server.main serve` and set `OTEL_EXPORTER_OTLP_ENDPOINT=localhost:4317`. For SigNoz, the collector listens on the same port by default. The span names and attributes are identical; only the exporter changes.

## Troubleshooting

**`llama_cpp` import fails with `OSError: libllama.so not found`**
The pre-built wheel bundles the shared library. If you built llama.cpp from source and installed the Python bindings manually, ensure `CMAKE_ARGS="-DLLAMA_BLAS=OFF"` was set during the build, or reinstall the wheel with `uv pip install llama-cpp-python --force-reinstall`.

**`huggingface_hub` download times out or returns a 404**
The GGUF filename on the Needle repository may change as the project evolves. Run `huggingface_hub.list_repo_files("Cactus-Compute/needle")` to list available files and update `GGUF_FILE` in `download_model.py` accordingly.
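
A small sketch of that lookup, assuming `huggingface_hub` is installed and the machine has network access. `pick_gguf_files` and `list_needle_gguf` are hypothetical helper names, and the sample listing is made up for illustration; the real repository contents may differ.

```python
def pick_gguf_files(filenames: list[str]) -> list[str]:
    """Return only the GGUF files from a repository listing, sorted by name."""
    return sorted(f for f in filenames if f.lower().endswith(".gguf"))


def list_needle_gguf(repo_id: str = "Cactus-Compute/needle") -> list[str]:
    """Query the Hub for the repo's GGUF files (needs network access)."""
    # Imported lazily so pick_gguf_files stays usable offline.
    from huggingface_hub import list_repo_files

    return pick_gguf_files(list_repo_files(repo_id))


# Offline illustration with an invented listing.
sample = ["README.md", "needle-q8_0.gguf", "config.json", "needle-q4_0.gguf"]
print(pick_gguf_files(sample))
```

Calling `list_needle_gguf()` returns the actual candidates; copy the chosen name into `GGUF_FILE`.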

**Model output is empty or contains only whitespace**
Needle is sensitive to the exact prompt format. Confirm the prompt contains both `<tools>` and `<query>` tags with valid JSON inside `<tools>`. Print the raw prompt before inference to inspect it.
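
A quick structural check can catch a malformed prompt before it reaches the model. This is a minimal sketch that assumes the exact layout produced by `build_prompt` in Step 2 (tags on their own lines); `check_prompt` is a hypothetical helper, not part of Needle or llama.cpp.

```python
import json
import re

# These patterns assume each tag sits on its own line, as build_prompt emits them.
TOOLS_RE = re.compile(r"<tools>\n(.*?)\n</tools>", re.DOTALL)
QUERY_RE = re.compile(r"<query>\n(.*?)\n</query>", re.DOTALL)


def check_prompt(prompt: str) -> list[str]:
    """Return a list of problems found in a Needle-style prompt (empty if none)."""
    problems = []
    tools_match = TOOLS_RE.search(prompt)
    if tools_match is None:
        problems.append("missing <tools> block")
    else:
        try:
            tools = json.loads(tools_match.group(1))
            if not isinstance(tools, list) or not tools:
                problems.append("<tools> must contain a non-empty JSON array")
        except json.JSONDecodeError as exc:
            problems.append(f"invalid JSON in <tools>: {exc.msg}")
    query_match = QUERY_RE.search(prompt)
    if query_match is None or not query_match.group(1).strip():
        problems.append("missing or empty <query> block")
    return problems


good = '<tools>\n[{"name": "set_timer"}]\n</tools>\n<query>\nSet a timer\n</query>\n'
bad = '<tools>\n[{"name": "set_timer",]\n</tools>\n<query>\n\n</query>\n'
print(check_prompt(good))  # no problems reported
print(check_prompt(bad))   # reports the broken JSON and the empty query
```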

**`parse_tool_call` raises `ValueError: No JSON object found`**
The model occasionally emits a partial response if `max_tokens` is too low or the tool schema is very large. Increase `max_tokens` to 256 and reduce the number of tools in the schema to the minimum needed for the query.

**Spans appear in the console but not in Phoenix/SigNoz**
Confirm the collector is listening on port 4317 with `curl -v telnet://localhost:4317`. If using `BatchSpanProcessor`, call `provider.force_flush()` before process exit to drain the buffer. The `SimpleSpanProcessor` used in `tracing.py` flushes synchronously and does not need this.

**Memory error when loading the model**
The Q8 quantization of a 26M model uses roughly 30-50 MB. If you're on a very constrained device, try the Q4 variant if available, or reduce `n_ctx` to 256.

## Next steps

- **Multi-turn agent loop**: Needle is designed for single-shot dispatch [1], but you can chain calls by feeding the dispatch result back into a second query. Wrap `run_tool_call` in a loop and add a `needle.turn` attribute to each root span to track conversation depth.
- **Custom tool fine-tuning**: The Needle repository includes fine-tuning scripts for Mac and PC [2]. Train on your own tool schemas and export to GGUF for drop-in replacement in this pipeline.
- **Edge deployment via Cactus**: The Cactus inference engine (https://github.com/cactus-compute/cactus) targets mobile and wearable hardware. The same GGUF weights and tool schemas work there; replace the `llama-cpp-python` calls with the Cactus SDK.
- **Structured output validation**: Add a Pydantic model for each tool's argument schema and validate `tool_args` before dispatch. Emit a `needle.validation_error` span event when validation fails to track schema drift over time.
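
As a starting point for the structured-output validation idea, here is a dependency-free sketch that checks arguments against a tool's `parameters` schema. It deliberately swaps Pydantic for plain Python so it runs anywhere, and it covers only `required`, basic `type`, and `enum`; `validate_args` and `JSON_TYPES` are hypothetical names. A Pydantic model per tool would give stricter guarantees.

```python
from typing import Any

# Map JSON Schema type names to Python types. bool needs special handling
# below because bool is a subclass of int in Python.
JSON_TYPES = {"string": str, "integer": int, "number": (int, float), "boolean": bool}


def validate_args(schema: dict[str, Any], args: dict[str, Any]) -> list[str]:
    """Check args against a tool's `parameters` schema; return error strings."""
    errors = []
    props = schema.get("properties", {})
    for name in schema.get("required", []):
        if name not in args:
            errors.append(f"missing required argument: {name}")
    for name, value in args.items():
        if name not in props:
            errors.append(f"unexpected argument: {name}")
            continue
        type_name = props[name].get("type")
        expected = JSON_TYPES.get(type_name)
        # Reject True/False where a number or string was requested.
        bool_misuse = isinstance(value, bool) and type_name != "boolean"
        if expected and (not isinstance(value, expected) or bool_misuse):
            errors.append(f"{name}: expected {type_name}")
        if "enum" in props[name] and value not in props[name]["enum"]:
            errors.append(f"{name}: {value!r} not in {props[name]['enum']}")
    return errors


# Same shape as the control_light schema used in this tutorial.
light_schema = {
    "type": "object",
    "properties": {
        "room": {"type": "string"},
        "action": {"type": "string", "enum": ["on", "off"]},
        "brightness": {"type": "integer"},
    },
    "required": ["room", "action"],
}

print(validate_args(light_schema, {"room": "bedroom", "action": "off"}))
print(validate_args(light_schema, {"action": "dim", "brightness": "50"}))
```

The first call returns an empty list; the second reports the missing `room`, the out-of-enum `action`, and the string passed for `brightness`. Each error string could be attached to the dispatch span as an event before the call is rejected.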

## FAQ

### What is Needle and why use it for tool calling?

Needle is a 26M parameter model from Cactus Compute, trained on 200B tokens and post-trained on 2B tokens of synthesized function-calling data. It achieves 6000 tok/s prefill and 1200 tok/s decode on consumer hardware while outperforming much larger models on single-shot function calling benchmarks. The model uses only attention and gating layers, dropping feed-forward layers entirely because tool calling is retrieval-and-assembly, not reasoning.

### How does the agent dispatcher work with OpenTelemetry?

The dispatcher opens a root `needle.tool_call` span and wraps each stage in a child span: inference (which includes prompt formatting), JSON parsing, and tool dispatch. Each span captures relevant attributes like query text, output tokens, latency, tool name, and dispatch result. The spans can be exported to a console, Phoenix, or SigNoz by swapping the exporter without changing any other code.

### What prompt format does Needle expect?

Needle uses a structured prompt with `<tools>` and `<query>` tags. The tools section contains a JSON array of tool schemas with name, description, and parameters. The query section contains the user request. The model responds with a single JSON object containing the selected tool name and arguments.

### Can spans be routed to Phoenix or SigNoz instead of the console?

Yes. Replace the `ConsoleSpanExporter` in `tracing.py` with an `OTLPSpanExporter` pointing to `localhost:4317`, and use `BatchSpanProcessor` instead of `SimpleSpanProcessor`. The span names and attributes remain identical; only the export pipeline changes.

### What are the memory and performance requirements?

The Q8 quantized Needle model is under 60 MB on disk and uses roughly 30-50 MB of RAM when loaded. It runs on consumer hardware and phones, achieving 1200 tok/s decode speed. The 512-token context configured in this tutorial (`n_ctx=512`) is sufficient for single-shot tool calling.

## References

1. https://www.reddit.com/r/LocalLLaMA/comments/1tb9b0r/needle_we_distilled_gemini_tool_calling_into_a/
2. https://github.com/cactus-compute/needle
