# Self-Hosted Pixtral-12B Inference with Structured OTel Span Tracing

> Build a Python service that loads Pixtral-12B via mistral-inference v1.4.0, wraps every multimodal inference call in OpenTelemetry spans that capture image token counts and wall-clock latency, and exports traces to a local SigNoz collector. No cloud API keys required.

- Canonical URL: https://agentry.press/tutorial/self-hosted-pixtral-12b-inference-with-structured-otel-span-tracing/
- Type: Tutorial
- Published: 2026-06-02
- By: agentry
- Tags: pixtral, mistral-inference, opentelemetry, vision-models, self-hosted, tracing

---

Pixtral-12B is Mistral's first vision-language model, and mistral-inference v1.4.0 [1] is the first release of their reference inference library to support it. Running it yourself means full control over data residency, no per-token billing, and the ability to instrument the inference path exactly as your audit requirements demand.

This tutorial wires three things together: the model loading and generation API from mistral-inference [1], a thin Python service layer that exposes a `PixtralService.run` method, and an OpenTelemetry tracing harness that records image token counts, total prompt token counts, output token counts, and end-to-end latency on every call. Traces export to a local SigNoz instance via OTLP/gRPC.

## Why this matters

mistral-inference v1.4.0 [1] added multimodal support to a library that previously handled only text models. The change is non-trivial operationally: image inputs are tokenized separately from text, the token budget is split across two modalities, and latency now depends on image resolution in ways that pure-text profiling misses entirely. Operators running Pixtral in sovereign-cloud environments (on-prem, air-gapped, or EU-region-only deployments) have no managed observability layer to fall back on. Without structured span data capturing per-modality token counts, a TTFT regression could come from image encoding, from prompt length, or from KV cache pressure, and there is no way to distinguish them from aggregate metrics alone. This tutorial gives you the span schema and the wiring to answer that question from your first production request.

## Prerequisites

- Python 3.11 or 3.12
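- A CUDA-capable GPU with at least 24 GB VRAM, and preferably more: the Pixtral-12B weights alone take roughly 24 GB in bfloat16 (about 12 billion parameters at 2 bytes each), so a 24 GB card leaves little headroom for the KV cache and activations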
- Docker, for running SigNoz (the collector and UI)
- A Hugging Face account with access granted to `mistralai/Pixtral-12B-2409`
- `HUGGING_FACE_HUB_TOKEN` set in your shell
- Basic familiarity with OpenTelemetry concepts (spans, exporters, trace context)
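
The VRAM figure in the list above follows from simple arithmetic on parameter count and dtype width. The sketch below assumes roughly 12 billion decoder parameters plus a vision encoder of about 0.4 billion; both numbers are approximations used for illustration, not values read from the checkpoint. It ignores the KV cache and activations, which need memory on top of the weights.

```python
def weight_memory_gb(num_params: float, bytes_per_param: int = 2) -> float:
    """Return the memory needed for the weights alone, in decimal gigabytes."""
    return num_params * bytes_per_param / 1e9


# Assumed, approximate parameter counts (not read from the checkpoint)
DECODER_PARAMS = 12e9
VISION_ENCODER_PARAMS = 0.4e9

total_gb = weight_memory_gb(DECODER_PARAMS + VISION_ENCODER_PARAMS)
print(f"bfloat16 weights: ~{total_gb:.1f} GB before KV cache and activations")
```

If the estimate is already close to your card's capacity, expect out-of-memory errors once generation starts allocating cache.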

## Setup

Install the Python dependencies. `mistral-inference` must be at least 1.4.0 for Pixtral support [1]. The OTel packages are the standard SDK plus the OTLP gRPC exporter. The command below uses `uv`; plain `pip install` with the same arguments works as well.

```bash
uv pip install "mistral-inference>=1.4.0" mistral-common huggingface_hub opentelemetry-sdk opentelemetry-exporter-otlp-proto-grpc opentelemetry-api pillow requests
```

Verify the installed versions of the key packages:

```python
from importlib.metadata import version
print("mistral-inference:", version("mistral-inference"))
print("opentelemetry-sdk:", version("opentelemetry-sdk"))
print("mistral-common:", version("mistral-common"))
print("deps ok")
```

### Start SigNoz locally

SigNoz ships as a Docker Compose stack. Clone it and bring it up. The OTLP gRPC endpoint will be available at `localhost:4317`.

```bash
# Run on your own machine (Docker required)
git clone https://github.com/SigNoz/signoz.git /tmp/signoz
cd /tmp/signoz && git checkout main
docker compose -f deploy/docker/clickhouse-setup/docker-compose.yaml up -d
```

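Once the stack is healthy (usually 60-90 seconds), open `http://localhost:3301` to reach the SigNoz UI. The OTLP gRPC receiver listens on port 4317 by default. Because the commands above track `main`, the compose file path and UI port may differ from what is shown here; if either is missing, consult the install instructions for the SigNoz release you checked out.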
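
Before sending traces, it can help to confirm that the collector port accepts connections from the Python process. The helper below is a minimal standard-library sketch; `port_is_open` is a hypothetical name, and a successful TCP connect only shows that something is listening on the port, not that it speaks OTLP.

```python
import socket


def port_is_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures
        return False


if __name__ == "__main__":
    print("OTLP gRPC port reachable:", port_is_open("localhost", 4317))
```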

### Download the model weights

This step requires your Hugging Face token and downloads roughly 24 GB. Run it on the machine with the GPU.

```bash
# Run on your GPU machine (requires a Hugging Face token and a ~24 GB download)
export HUGGING_FACE_HUB_TOKEN="hf_YOUR_TOKEN_HERE"
python - <<'EOF'
from huggingface_hub import snapshot_download
from pathlib import Path

model_path = Path.home() / "mistral_models" / "Pixtral"
model_path.mkdir(parents=True, exist_ok=True)

snapshot_download(
    repo_id="mistralai/Pixtral-12B-2409",
    allow_patterns=["params.json", "consolidated.safetensors", "tekken.json"],
    local_dir=model_path,
    token=True,
)
print("Download complete:", model_path)
EOF
```
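
An interrupted download can leave the model directory incomplete, which only shows up later as a load failure. The following sketch checks for the three files requested in `allow_patterns` above; the helper name `missing_checkpoint_files` is hypothetical, and the check covers presence and non-zero size only, not file integrity.

```python
from pathlib import Path

# The same three files requested via allow_patterns in the download step
REQUIRED_FILES = ("params.json", "consolidated.safetensors", "tekken.json")


def missing_checkpoint_files(model_dir) -> list:
    """Return the required file names that are absent or empty in model_dir."""
    root = Path(model_dir)
    return [
        name
        for name in REQUIRED_FILES
        if not (root / name).is_file() or (root / name).stat().st_size == 0
    ]


if __name__ == "__main__":
    missing = missing_checkpoint_files(Path.home() / "mistral_models" / "Pixtral")
    print("missing files:", missing or "none")
```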

## Step 1: Define the OTel tracer configuration

Create a module that builds a `TracerProvider` wired to either the OTLP gRPC exporter (for production use with SigNoz) or a `ConsoleSpanExporter` (for local verification without a running collector). The `build_tracer` function accepts an `otlp_endpoint` argument; pass `None` to fall back to console output.

```python
# filename: tracer_config.py
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import SimpleSpanProcessor, ConsoleSpanExporter
from opentelemetry.sdk.resources import Resource

def build_tracer(service_name: str = "pixtral-service", otlp_endpoint: str | None = None) -> trace.Tracer:
    """Return a Tracer backed by OTLP/gRPC or console, depending on otlp_endpoint."""
    resource = Resource.create({"service.name": service_name})
    provider = TracerProvider(resource=resource)

    if otlp_endpoint:
        # Import lazily so the module loads even without grpc installed
        from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
        from opentelemetry.sdk.trace.export import BatchSpanProcessor
        exporter = OTLPSpanExporter(endpoint=otlp_endpoint, insecure=True)
        processor = BatchSpanProcessor(exporter)
    else:
        exporter = ConsoleSpanExporter()
        processor = SimpleSpanProcessor(exporter)

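    provider.add_span_processor(processor)
    # The global provider can only be set once per process; later attempts are
    # ignored with a warning. Return a tracer from this provider directly so the
    # caller always gets the exporter configured above.
    trace.set_tracer_provider(provider)
    return provider.get_tracer(service_name)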
```

## Step 2: Build the inference service with span instrumentation

This module wraps the mistral-inference generation API [1] in a `PixtralService` class. The `run` method opens an OTel span before tokenization and closes it after decoding. Alongside the request parameters, it records four measured attributes:

- `inference.image_token_count`: tokens attributable to image inputs
- `inference.prompt_token_count`: total tokens in the encoded prompt (text plus image)
- `inference.output_token_count`: generated tokens
- `inference.latency_ms`: wall-clock time from tokenization start to decode end

The total prompt count is the length of the token list that `MistralTokenizer.encode_chat_completion` returns, so it reflects actual tokenizer output rather than an estimate. The image count is derived from it by subtracting the length of a text-only encoding of the same prompt.

```python
# filename: pixtral_service.py
import time
from pathlib import Path
from opentelemetry import trace
from opentelemetry.trace import Span, StatusCode

from mistral_common.tokens.tokenizers.mistral import MistralTokenizer
from mistral_common.protocol.instruct.messages import UserMessage, TextChunk, ImageURLChunk
from mistral_common.protocol.instruct.request import ChatCompletionRequest


class PixtralService:
    """
    Wraps Pixtral-12B inference with OTel span tracing.

    The model is loaded lazily on first call to avoid GPU allocation
    during import-time structural checks.
    """

    def __init__(self, model_path: str | Path, tracer: trace.Tracer):
        self.model_path = Path(model_path)
        self.tracer = tracer
        self._tokenizer: MistralTokenizer | None = None
        self._model = None  # loaded lazily

    def _load(self):
        """Load tokenizer and model weights from disk (GPU required)."""
        from mistral_inference.transformer import Transformer

        self._tokenizer = MistralTokenizer.from_file(
            str(self.model_path / "tekken.json")
        )
        self._model = Transformer.from_folder(str(self.model_path))

    def _ensure_loaded(self):
        if self._model is None:
            self._load()

    def run(
        self,
        prompt: str,
        image_urls: list[str],
        max_tokens: int = 256,
        temperature: float = 0.35,
    ) -> str:
        """
        Run a multimodal inference call and emit a structured OTel span.

        Parameters
        ----------
        prompt:
            Text instruction accompanying the images.
        image_urls:
            Zero or more image URLs. Each is encoded as an ImageURLChunk [1].
        max_tokens:
            Maximum tokens to generate.
        temperature:
            Sampling temperature.

        Returns
        -------
        str
            Decoded model output.
        """
        self._ensure_loaded()

        with self.tracer.start_as_current_span("pixtral.inference") as span:
            span: Span
            span.set_attribute("inference.model", "Pixtral-12B-2409")
            span.set_attribute("inference.max_tokens", max_tokens)
            span.set_attribute("inference.temperature", temperature)
            span.set_attribute("inference.num_images", len(image_urls))

            t0 = time.perf_counter()

            # Build the multimodal request [1]
            content = [ImageURLChunk(image_url=url) for url in image_urls]
            content.append(TextChunk(text=prompt))
            request = ChatCompletionRequest(
                messages=[UserMessage(content=content)]
            )

            # Tokenize — this is where image pixels become tokens
            encoded = self._tokenizer.encode_chat_completion(request)
            tokens = encoded.tokens
            images = encoded.images

            # Record token counts before generation.
            # encoded.tokens includes the image placeholder tokens, so its
            # length is the total prompt size (text + image).
            prompt_token_count = len(tokens)
            # Derive the image share by re-encoding the same prompt without
            # images and taking the difference.
            text_only_request = ChatCompletionRequest(
                messages=[UserMessage(content=[TextChunk(text=prompt)])]
            )
            text_only_encoded = self._tokenizer.encode_chat_completion(text_only_request)
            text_token_count = len(text_only_encoded.tokens)
            image_token_count = prompt_token_count - text_token_count

            span.set_attribute("inference.prompt_token_count", prompt_token_count)
            span.set_attribute("inference.image_token_count", max(image_token_count, 0))

            try:
                from mistral_inference.generate import generate

                out_tokens, _ = generate(
                    [tokens],
                    self._model,
                    images=[images],
                    max_tokens=max_tokens,
                    temperature=temperature,
                    eos_id=self._tokenizer.instruct_tokenizer.tokenizer.eos_id,
                )
                result = self._tokenizer.decode(out_tokens[0])

                output_token_count = len(out_tokens[0])
                latency_ms = (time.perf_counter() - t0) * 1000

                span.set_attribute("inference.output_token_count", output_token_count)
                span.set_attribute("inference.latency_ms", round(latency_ms, 2))
                span.set_status(StatusCode.OK)
                return result

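            except Exception as exc:
                # Record latency on failures too, so slow errors stay visible
                latency_ms = (time.perf_counter() - t0) * 1000
                span.set_attribute("inference.latency_ms", round(latency_ms, 2))
                span.set_status(StatusCode.ERROR, str(exc))
                # start_as_current_span records the exception event by default
                # when the error propagates, so it is not recorded again here.
                raise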
```
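
As a cross-check on the differencing logic, the expected image token count can be estimated from the image dimensions. The sketch below rests on assumptions about Pixtral's image tokenization that this tutorial does not state: 16x16-pixel patches, one break or end token per patch row, and downscaling of images whose longer side exceeds 1024 pixels. The function name and defaults are hypothetical, so compare the result with `inference.image_token_count` on your own install before relying on it.

```python
import math


def estimate_image_tokens(
    width: int, height: int, patch_size: int = 16, max_side: int = 1024
) -> int:
    """Estimate image tokens under the assumptions stated in the lead-in."""
    # Assumed: oversized images are scaled down, preserving aspect ratio
    ratio = max(width / max_side, height / max_side)
    if ratio > 1:
        width = round(width / ratio)
        height = round(height / ratio)

    cols = math.ceil(width / patch_size)
    rows = math.ceil(height / patch_size)
    # One token per patch, plus one row-terminating token per patch row
    return rows * cols + rows


print(estimate_image_tokens(1024, 1024))
```

If the estimate and the recorded attribute diverge consistently, the assumptions above do not match your installed tokenizer and the recorded value should be trusted.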

## Step 3: Write the entry-point script

This script ties the tracer and service together. When `OTLP_ENDPOINT` is set in the environment, traces go to SigNoz. When it is absent, they print to the console so you can verify the span schema without a running collector.

```python
# filename: run_inference.py
import os
from pathlib import Path
from tracer_config import build_tracer
from pixtral_service import PixtralService

MODEL_PATH = os.environ.get(
    "PIXTRAL_MODEL_PATH",
    str(Path.home() / "mistral_models" / "Pixtral"),
)
OTLP_ENDPOINT = os.environ.get("OTLP_ENDPOINT", None)  # e.g. "localhost:4317"

tracer = build_tracer(service_name="pixtral-service", otlp_endpoint=OTLP_ENDPOINT)
service = PixtralService(model_path=MODEL_PATH, tracer=tracer)

if __name__ == "__main__":
    result = service.run(
        prompt="Describe what you see in this image.",
        image_urls=[
            "https://huggingface.co/datasets/patrickvonplaten/random_img/resolve/main/yosemite.png"
        ],
        max_tokens=256,
        temperature=0.35,
    )
    print(result)
```

To run against SigNoz:

```bash
export PIXTRAL_MODEL_PATH="$HOME/mistral_models/Pixtral"
export OTLP_ENDPOINT="localhost:4317"
python run_inference.py
```

## Step 4: Understand the span schema

Every call to `service.run(...)` produces one span named `pixtral.inference` with these attributes:

| Attribute | Type | Description |
|---|---|---|
| `inference.model` | string | Model identifier |
| `inference.num_images` | int | Images in this request |
| `inference.max_tokens` | int | Generation budget |
| `inference.temperature` | float | Sampling temperature |
| `inference.prompt_token_count` | int | Total tokens in the encoded prompt (text + image) |
| `inference.image_token_count` | int | Tokens attributable to image inputs |
| `inference.output_token_count` | int | Tokens in the generated response |
| `inference.latency_ms` | float | Wall-clock ms from tokenization to decode |

The image token count is computed by differencing the full encoded prompt against a text-only re-encoding of the same prompt. This is a reliable approach because the tokenizer is deterministic and the overhead of a second tokenization call is negligible compared to GPU inference time.

> [!PULLQUOTE]
> Without structured span data capturing per-modality token counts, a TTFT regression could come from image encoding, from prompt length, or from KV cache pressure, and there is no way to distinguish them from aggregate metrics alone.

## Verify it works

The verification block below runs without a GPU, model weights, or a running collector. It does not load `PixtralService`. Instead, it builds a tracer whose `ConsoleSpanExporter` writes to an in-memory buffer, emits one span carrying the same attribute schema that `PixtralService.run` uses (with placeholder values), and asserts that the span name and key attributes appear in the exported output.

```python
import io
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import SimpleSpanProcessor, ConsoleSpanExporter
from opentelemetry.sdk.resources import Resource

# Build a tracer whose console exporter writes into an in-memory buffer
buf = io.StringIO()
resource = Resource.create({"service.name": "pixtral-test"})
provider = TracerProvider(resource=resource)
exporter = ConsoleSpanExporter(out=buf)
provider.add_span_processor(SimpleSpanProcessor(exporter))
# Use this provider's tracer directly. The global provider can only be set once
# per process, so relying on it would make this check order-dependent.
tracer = provider.get_tracer("pixtral-test")

# Emit a span with the same attribute schema used by PixtralService
with tracer.start_as_current_span("pixtral.inference") as span:
    span.set_attribute("inference.model", "Pixtral-12B-2409")
    span.set_attribute("inference.num_images", 1)
    span.set_attribute("inference.max_tokens", 256)
    span.set_attribute("inference.temperature", 0.35)
    span.set_attribute("inference.prompt_token_count", 512)
    span.set_attribute("inference.image_token_count", 480)
    span.set_attribute("inference.output_token_count", 64)
    span.set_attribute("inference.latency_ms", 1234.56)

output = buf.getvalue()
assert "pixtral.inference" in output, "span name missing from output"
assert "inference.image_token_count" in output, "image token count attribute missing"
assert "inference.latency_ms" in output, "latency attribute missing"
assert "Pixtral-12B-2409" in output, "model attribute missing"
print("span schema verification passed")
```

You should see `span schema verification passed`. Because the exporter writes to `buf` rather than stdout, add `print(output)` if you want to inspect the full span JSON and confirm every attribute is present before connecting to a live SigNoz instance.

### Verify module imports

```python
# Confirm all service modules import cleanly
from tracer_config import build_tracer
from pixtral_service import PixtralService
print("module imports ok")
```

### Verify the tracer builder

```python
from tracer_config import build_tracer
from opentelemetry import trace

t = build_tracer(service_name="verify-svc", otlp_endpoint=None)
assert isinstance(t, trace.Tracer)
print("tracer builder ok")
```

## Querying traces in SigNoz

Once the SigNoz stack is running and you have executed at least one real inference call, open `http://localhost:3301`, navigate to **Traces**, and filter by `service.name = pixtral-service`. Each trace contains one root span (`pixtral.inference`). Use the attribute filter panel to group by `inference.num_images` or plot `inference.latency_ms` as a histogram to identify latency outliers correlated with image count.
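
The same grouping can be reproduced offline once span attributes have been exported from the backend. The sketch below assumes the spans are already available as plain attribute dictionaries (how they are exported is outside the scope of this tutorial), and the sample values are placeholders, not measurements.

```python
from collections import defaultdict
from statistics import median


def median_latency_by_image_count(spans):
    """Map inference.num_images to the median inference.latency_ms."""
    buckets = defaultdict(list)
    for attrs in spans:
        buckets[attrs["inference.num_images"]].append(attrs["inference.latency_ms"])
    return {count: median(values) for count, values in sorted(buckets.items())}


# Placeholder attribute dicts for illustration only
sample_spans = [
    {"inference.num_images": 1, "inference.latency_ms": 900.0},
    {"inference.num_images": 1, "inference.latency_ms": 1100.0},
    {"inference.num_images": 2, "inference.latency_ms": 1800.0},
]
summary = median_latency_by_image_count(sample_spans)
print(summary)
```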

The same OTLP payload works with any OpenTelemetry-compatible backend. Switching to Grafana Tempo, Jaeger, or a commercial vendor (Honeycomb, Datadog, New Relic) requires changing the `otlp_endpoint` value and, for vendors that require authentication, setting `OTEL_EXPORTER_OTLP_HEADERS` with the appropriate credentials. Endpoints that expect TLS also need `insecure=True` removed from the `OTLPSpanExporter` constructor in `tracer_config.py`. The span schema and attribute names remain identical.

## Troubleshooting

**`ModuleNotFoundError: No module named 'mistral_inference'`** — The package name on PyPI is `mistral-inference` (hyphen), but the import uses an underscore. Confirm the install with `uv pip install "mistral-inference>=1.4.0"` and check `importlib.metadata.version("mistral-inference")` returns 1.4.0 or higher [1].

**`RuntimeError: CUDA out of memory` during model load** — The Pixtral-12B weights alone take approximately 24 GB in bfloat16, so a 24 GB card has little room left for the KV cache and activations. Note that `torch_dtype` is a Hugging Face Transformers argument rather than a mistral-inference one, and the reference implementation in mistral-inference does not currently expose a built-in quantization flag. To fit a smaller GPU, you would need to apply quantization before saving the weights and serve the result through a runtime that supports that format.

**`huggingface_hub.utils._errors.GatedRepoError`** — You must request access to `mistralai/Pixtral-12B-2409` on the Hugging Face model page before `snapshot_download` will succeed. Approval is typically instant. Also confirm `HUGGING_FACE_HUB_TOKEN` is exported in your shell.

**Spans appear in the console but not in SigNoz** — Confirm the SigNoz stack is healthy with `docker compose ps` and that port 4317 is reachable from your Python process (`nc -zv localhost 4317`). The `OTLPSpanExporter` is constructed with `insecure=True`, which disables TLS. If your SigNoz deployment uses TLS, remove that flag and configure certificates.

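**`image_token_count` is always 0** — The count is the difference between the full encoding and a text-only encoding, so a zero means no image tokens reached the tokenizer. Check first that `image_urls` is not empty. An unreachable URL is more likely to surface as an exception during `encode_chat_completion` than as a silent zero, although this may vary with the installed `mistral-common` version. Either way, adding a `requests.head(url, timeout=5)` check before building the `ChatCompletionRequest` surfaces connectivity issues early.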
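
A pre-flight reachability check along those lines might look like the following. This sketch swaps `requests.head` for the standard library's `urllib` so it runs without extra dependencies; the helper name `image_url_reachable` is hypothetical. Some servers reject HEAD requests even when GET works, so treat a `False` result as a prompt to investigate rather than proof that the image is unavailable.

```python
import http.client
import urllib.request


def image_url_reachable(url: str, timeout: float = 5.0) -> bool:
    """Return True if a HEAD request to url completes without an HTTP error."""
    if not url.startswith(("http://", "https://")):
        return False
    request = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return 200 <= response.status < 400
    except (OSError, ValueError, http.client.HTTPException):
        # URLError and HTTPError (raised for 4xx/5xx) both derive from OSError
        return False
```

Calling it for each entry in `image_urls` before building the request lets the service fail fast with a clear message instead of failing inside tokenization.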

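**`BatchSpanProcessor` drops spans on short-lived scripts** — The batch processor exports on a background thread. The SDK `TracerProvider` flushes on normal interpreter exit by default, but spans can still be lost if the process is killed or exits abruptly. For scripts that exit immediately after inference, call `trace.get_tracer_provider().force_flush()` before the process exits, or switch to `SimpleSpanProcessor` during development.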

## Next steps

- **Add a span per image**: wrap each `ImageURLChunk` encoding in a child span to get per-image latency breakdowns, useful when requests contain variable numbers of images.
- **Expose a FastAPI endpoint**: wrap `PixtralService.run` in a POST handler and add OTel middleware so each HTTP request span becomes the parent of the `pixtral.inference` span, with W3C trace context propagation linking both to the caller's trace.
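- **Track KV cache pressure**: the code above discards the second return value of `generate`. In the reference implementation it carries per-token log-probabilities rather than cache statistics, so cache metrics would have to come from your own instrumentation, for example GPU memory readings taken before and after generation and recorded as additional span attributes to correlate TTFT with cache pressure.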
- **Benchmark image resolution vs. latency**: log `inference.image_token_count` alongside image dimensions (fetch from PIL before encoding) to build a regression model for capacity planning.

## FAQ

### What does mistral-inference v1.4.0 add for Pixtral support?

Version 1.4.0 is the first release of mistral-inference to support multimodal inference with Pixtral-12B, enabling image tokenization alongside text prompts in a single encoded request.

### How are image token counts calculated in the span attributes?

Image token count is computed by differencing the full encoded prompt (text plus images) against a text-only re-encoding of the same prompt, yielding the tokens attributable to image inputs.

### What are the hardware requirements to run Pixtral-12B locally?

A CUDA-capable GPU with at least 24 GB VRAM is required. The Pixtral-12B weights alone consume approximately 24 GB in bfloat16, so additional headroom for the KV cache and activations is advisable.

### Can the OTLP traces be sent to backends other than SigNoz?

Yes, the same OTLP/gRPC payload works with any OpenTelemetry-compatible backend such as Grafana Tempo, Jaeger, Honeycomb, Datadog, or New Relic by changing the `otlp_endpoint` value and, where the backend requires them, enabling TLS and adding authentication headers.

### Why is structured span tracing important for multimodal inference?

Without per-modality token counts in spans, a latency regression cannot be attributed to image encoding, prompt length, or KV cache pressure; structured spans enable diagnosis from the first production request.

## References

1. https://github.com/mistralai/mistral-inference/releases/tag/v1.4.0
2. https://github.com/mistralai/mistral-inference/releases/tag/v1.1.0
