What Changed in v1.4.0

Mistral AI published mistral-inference v1.4.0 to its GitHub repository, marking the library’s first release with multimodal vision capabilities [1]. The update centers on support for Pixtral, a new model family that accepts image inputs alongside text prompts. Prior versions of the library handled only text-based interactions; the v1.4.0 release extends the core inference pipeline to process visual data through both the command-line interface and the Python API.

The upgrade path is a standard pip command, pip install --upgrade "mistral_inference>=1.4.0", where the >=1.4.0 specifier sets a minimum version rather than an exact pin [1].
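
To confirm whether an environment already meets the 1.4.0 minimum, a check along these lines could be used. This is a minimal sketch: the helper names are illustrative, the distribution name mistral_inference is taken from the pip command above, and the parser handles only plain numeric versions (a pre-release suffix such as "rc1" is ignored).

```python
import re
from importlib import metadata

MIN_VERSION = (1, 4, 0)  # first release with Pixtral support, per the release notes


def parse_version(text):
    """Turn a string like '1.4.0' into (1, 4, 0), keeping only leading digits."""
    parts = []
    for piece in text.split(".")[:3]:
        match = re.match(r"\d+", piece)
        parts.append(int(match.group()) if match else 0)
    parts += [0] * (3 - len(parts))  # pad short versions such as '1.4'
    return tuple(parts)


def supports_pixtral(version_text):
    """True when the version is at least 1.4.0."""
    return parse_version(version_text) >= MIN_VERSION


def installed_version(dist_name="mistral_inference"):
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


found = installed_version()
if found is None:
    print("mistral_inference is not installed")
else:
    print(found, "is sufficient" if supports_pixtral(found) else "needs an upgrade")
```

Comparing tuples rather than strings avoids the usual pitfall where "1.10.0" sorts before "1.4.0" lexicographically.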

The Pixtral-12B-2409 Model

The specific model introduced in this release is Pixtral-12B-2409, a 12-billion-parameter vision-language model hosted on Hugging Face under the mistralai/Pixtral-12B-2409 repository [1]. Downloading the model requires three files: params.json, consolidated.safetensors, and tekken.json. The Hugging Face snapshot_download utility handles retrieval, storing the files to a local directory such as ~/mistral_models/Pixtral.
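
The download step can be sketched as follows. The snapshot_download function and its repo_id, allow_patterns, and local_dir parameters belong to the huggingface_hub package; the helper names and the post-download file check are illustrative assumptions rather than anything shown in the release notes.

```python
from pathlib import Path

REPO_ID = "mistralai/Pixtral-12B-2409"
REQUIRED_FILES = ("params.json", "consolidated.safetensors", "tekken.json")


def missing_files(model_dir):
    """Return the required files that are not present in model_dir."""
    model_dir = Path(model_dir)
    return [name for name in REQUIRED_FILES if not (model_dir / name).is_file()]


def download_pixtral(model_dir):
    """Fetch only the three required files, then report anything still missing."""
    # Imported lazily so missing_files() works without huggingface_hub installed.
    from huggingface_hub import snapshot_download

    model_dir = Path(model_dir)
    model_dir.mkdir(parents=True, exist_ok=True)
    snapshot_download(
        repo_id=REPO_ID,
        allow_patterns=list(REQUIRED_FILES),
        local_dir=model_dir,
    )
    return missing_files(model_dir)


# Example call (performs a large download, so it is left commented out):
# download_pixtral(Path.home() / "mistral_models" / "Pixtral")
```

Restricting the download with allow_patterns keeps any other files in the repository from being pulled, and the final check gives an early signal if the repository requires access approval that has not yet been granted.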

The .safetensors format is consistent with how Mistral AI has distributed other recent weights, and the tekken.json file serves as the tokenizer configuration specific to this model.

How Image Input Works

Once the model is downloaded, the CLI entry point accepts image inputs interactively. After a user submits a text prompt, the interface prompts for zero or more image paths or URLs before generating a response [1]. A session might pass a public URL such as https://picsum.photos/id/237/200/300, after which the model returns a description of the image content.

The CLI invocation follows the same pattern used for text-only models:

mistral-chat $HOME/mistral_models/Pixtral --instruct --max_tokens 256 --temperature 0.35

Local file paths and remote URLs are both accepted at the image input prompt, giving operators flexibility to test with on-disk assets or publicly accessible images without additional preprocessing.

Python API and Tokenizer Changes

The Python API introduces two new message chunk types from the mistral_common package: ImageURLChunk and TextChunk [1]. These are composed inside a UserMessage content list and passed to a ChatCompletionRequest, replacing the plain string content used in text-only workflows.

A minimal example constructs the request as follows:

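from mistral_common.protocol.instruct.messages import ImageURLChunk, TextChunk, UserMessage
from mistral_common.protocol.instruct.request import ChatCompletionRequest
# url (image location) and prompt (question text) are strings supplied by the caller
completion_request = ChatCompletionRequest(
    messages=[UserMessage(content=[ImageURLChunk(image_url=url), TextChunk(text=prompt)])]
)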

The MistralTokenizer is loaded from the model-specific tekken.json file rather than from a model name string, a change from the pattern used in v1.3.0 [1][2]. The tokenizer’s encode_chat_completion method returns an encoded result that exposes both the token IDs and an images object, the latter carrying the preprocessed visual data that the Transformer model consumes during generation.
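
Put together, the tokenizer and generation steps might look like the sketch below. The function name describe_image is hypothetical. The import paths, MistralTokenizer.from_file, Transformer.from_folder, the images and eos_id arguments to generate, and tokenizer.decode are not spelled out in this article; they reflect the mistral_inference and mistral_common packages as understood at the time of writing and should be checked against the installed versions.

```python
from pathlib import Path


def describe_image(model_dir, image_url, prompt, max_tokens=256, temperature=0.35):
    """Run one image-plus-text request against a local Pixtral directory."""
    # Imported lazily: these packages are only needed when the function is called.
    from mistral_common.protocol.instruct.messages import (
        ImageURLChunk,
        TextChunk,
        UserMessage,
    )
    from mistral_common.protocol.instruct.request import ChatCompletionRequest
    from mistral_common.tokens.tokenizers.mistral import MistralTokenizer
    from mistral_inference.generate import generate
    from mistral_inference.transformer import Transformer

    model_dir = Path(model_dir)
    # The tokenizer comes from the model's own tekken.json, not a model name.
    tokenizer = MistralTokenizer.from_file(str(model_dir / "tekken.json"))
    model = Transformer.from_folder(model_dir)

    request = ChatCompletionRequest(
        messages=[
            UserMessage(
                content=[ImageURLChunk(image_url=image_url), TextChunk(text=prompt)]
            )
        ]
    )
    encoded = tokenizer.encode_chat_completion(request)

    # Token IDs and preprocessed images travel to the model as parallel batches.
    out_tokens, _ = generate(
        [encoded.tokens],
        model,
        images=[encoded.images],
        max_tokens=max_tokens,
        temperature=temperature,
        eos_id=tokenizer.instruct_tokenizer.tokenizer.eos_id,
    )
    return tokenizer.decode(out_tokens[0])


# Example call (requires the downloaded weights and a suitable GPU):
# describe_image(Path.home() / "mistral_models" / "Pixtral",
#                "https://picsum.photos/id/237/200/300", "Describe the image.")
```

The default max_tokens and temperature values mirror the CLI example rather than any recommended production setting.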

Context: Prior Release and Progression

The immediately preceding release, v1.3.0, introduced Mistral-Nemo, a 12-billion-parameter text model developed in collaboration with NVIDIA [2]. That release added function-calling support and used MistralTokenizer.from_model("mistral-nemo") for tokenizer initialization. Taken together, the two releases suggest a pattern of one major capability per minor version, with the tokenizer API evolving to accommodate new modalities.

Who Can Use This and How to Get Started

Running Pixtral-12B-2409 locally requires hardware capable of loading a 12-billion-parameter model in safetensors format, which in practice means a GPU with sufficient VRAM for the consolidated weights file [1]. Operators already running other Mistral 12B models on their infrastructure should find the hardware requirements comparable.
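
A back-of-the-envelope estimate makes "sufficient VRAM" more concrete. The sketch below assumes the weights are held at 16 bits (2 bytes) per parameter, which is an assumption rather than a figure from the release notes, and it counts the weights only; activations, the key-value cache, and image preprocessing add to the total.

```python
def weight_memory_gb(num_params, bytes_per_param=2):
    """Approximate memory for model weights alone, in decimal gigabytes."""
    return num_params * bytes_per_param / 1e9


pixtral_params = 12e9  # the "12-billion-parameter" figure quoted for the model

print(weight_memory_gb(pixtral_params))     # assumed 16-bit weights
print(weight_memory_gb(pixtral_params, 1))  # hypothetical 8-bit copy, for comparison
```

Under the 16-bit assumption the weights alone come to roughly 24 GB, so the practical requirement sits somewhat above that once runtime overhead is included.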

The setup sequence involves three steps: upgrading mistral-inference to v1.4.0 or later, downloading the model files from Hugging Face using snapshot_download, and then invoking either the CLI or the Python API with the local model directory path. No additional vision-specific dependencies beyond the standard mistral-inference package are listed in the release notes.

FAQ

Q. Can images be passed as local file paths rather than URLs in the Python API? The CLI explicitly accepts both local paths and URLs at its image input prompt [1]. The Python API’s ImageURLChunk type accepts a URL field, so operators using local files through the Python interface may need to construct a file URI or load image bytes depending on how the underlying library resolves the field.
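
One way to experiment with local files is to embed the image bytes in a base64 data URI and pass that string as the image_url value. The helpers below are a standard-library sketch with hypothetical names; whether ImageURLChunk accepts data URIs depends on the installed mistral_common version, so treat that part as an assumption to verify.

```python
import base64
import mimetypes
from pathlib import Path


def image_bytes_to_data_uri(data, mime_type):
    """Encode raw image bytes as a base64 data URI."""
    encoded = base64.b64encode(data).decode("ascii")
    return f"data:{mime_type};base64,{encoded}"


def file_to_data_uri(path):
    """Read a local image file and return it as a data URI."""
    path = Path(path)
    # The MIME type is guessed from the file extension, not the file contents.
    mime_type, _ = mimetypes.guess_type(path.name)
    if mime_type is None or not mime_type.startswith("image/"):
        raise ValueError(f"not a recognised image type: {path}")
    return image_bytes_to_data_uri(path.read_bytes(), mime_type)


# Example (hypothetical file name):
# uri = file_to_data_uri("photo.png")
# chunk = ImageURLChunk(image_url=uri)
```

Base64 encoding inflates the payload by about a third, which is usually acceptable for single test images but worth keeping in mind for large batches.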

Q. Does v1.4.0 break compatibility with text-only workflows from v1.3.0? The core Transformer and generate imports remain unchanged between v1.3.0 and v1.4.0 [1][2]. The primary difference is that the tokenizer for Pixtral must be initialized from a tekken.json file rather than a model name string, which affects only code targeting the new model.

Q. Where are the Pixtral-12B-2409 weights hosted, and are they freely downloadable? The weights are hosted on Hugging Face at mistralai/Pixtral-12B-2409 and are retrieved via the snapshot_download function from the huggingface_hub package [1]. Access is subject to Hugging Face’s standard repository access controls and any terms Mistral AI attaches to the model repository.

Q. What temperature and token settings does Mistral AI suggest for Pixtral? The release notes show a CLI example using --max_tokens 256 and --temperature 0.35 [1]. These are illustrative defaults from the documentation rather than formally recommended production settings.

Key Takeaways

  • mistral-inference v1.4.0 adds multimodal vision support for the first time, centered on the Pixtral-12B-2409 model [1].
  • Images can be supplied as URLs or local paths at the CLI prompt, while the Python API takes an image URL through ImageURLChunk; no separate vision package is required [1].
  • The Python API introduces ImageURLChunk and TextChunk types that compose inside UserMessage content lists, and the tokenizer is now loaded from a model-specific tekken.json file [1].
  • The release follows v1.3.0’s introduction of Mistral-Nemo, continuing a pattern of one major capability addition per minor version [2].
  • Hardware requirements are comparable to other Mistral 12B deployments, and installation requires only a standard pip upgrade.

Frequently Asked Questions

How does the Python API handle image inputs in mistral-inference v1.4.0?

The Python API introduces ImageURLChunk and TextChunk types that are composed inside a UserMessage content list and passed to a ChatCompletionRequest. The tokenizer’s encode_chat_completion method returns an encoded result that exposes both the token IDs and an images object carrying preprocessed visual data for the model.

What files are needed to run Pixtral-12B-2409 locally?

Three files are required: params.json, consolidated.safetensors, and tekken.json. These are downloaded from the mistralai/Pixtral-12B-2409 repository on Hugging Face using the snapshot_download utility.

Does v1.4.0 break existing text-only workflows from v1.3.0?

The core Transformer and generate imports remain unchanged between versions. The primary difference is that Pixtral’s tokenizer must be initialized from a tekken.json file rather than a model name string, which affects only code targeting the new model.

Can images be passed as local file paths in the CLI?

Yes, the CLI explicitly accepts both local file paths and remote URLs at the image input prompt. The Python API’s ImageURLChunk type accepts a URL field, so local file handling may require constructing a file URI depending on the implementation.

What are the hardware requirements for running Pixtral-12B-2409?

A GPU with sufficient VRAM to load the 12-billion-parameter model in safetensors format is required. Hardware requirements are comparable to other Mistral 12B model deployments.