Ministral-3-8B-Instruct (Vision-Language & vLLM Compatible)
Ministral-3-8B-Instruct is a vision-aware, instruction-tuned multimodal language model developed by Mistral AI. It combines textual and visual understanding with strong reasoning capabilities and reliable instruction adherence.
This repository provides Q4_K_M and Q5_K_M quantized variants of the model, optimized for efficient local inference. These quantized formats reduce memory usage and improve inference performance while retaining support for vision-language interaction with both text and image inputs.
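To fetch one of the quantized files locally, the standard Hugging Face CLI downloader can be used. This is a sketch, assuming the repo id shown on this page (`SandLogicTechnologies/Ministral-3-8B-Instruct-2512-GGUF`) and that only the Q4_K_M file is wanted:

```shell
# Download only the Q4_K_M GGUF file into ./models
# (repo id taken from this page; adjust the --include pattern for Q5_K_M).
huggingface-cli download SandLogicTechnologies/Ministral-3-8B-Instruct-2512-GGUF \
    --include "*Q4_K_M*.gguf" \
    --local-dir ./models
```

The `--include` glob avoids pulling both quantizations when only one is needed.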
Model Overview
- Model Name: Ministral-3-8B-Instruct
- Base Model: mistralai/Ministral-3-8B-Instruct-2512
- Architecture: Transformer-based multimodal model
- Parameter Count: 8 Billion
- Supported Inputs: Text & Images
- Developer: Mistral AI
- License: Apache 2.0
Quantization Formats
Q4_K_M
- Approx. 71% size reduction (4.84 GB)
- Substantial reduction in model size
- Designed for low-memory environments
- Faster inference on CPU-based systems
- Suitable for lightweight and edge use cases
Q5_K_M
- Approx. 66% size reduction (5.64 GB)
- Higher precision than the Q4 variant
- Improved response consistency and reasoning depth
- Recommended for balanced performance and quality
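As a quick consistency check on the figures above (the percentages are approximate), each quantized size together with its stated reduction implies roughly the same full-precision size, which is what we would expect if both variants come from the same base model:

```python
# Quoted figures: Q4_K_M is 4.84 GB at ~71% reduction, Q5_K_M is 5.64 GB
# at ~66% reduction. Recover the implied original size from each variant.
q4_gb, q4_reduction = 4.84, 0.71
q5_gb, q5_reduction = 5.64, 0.66

orig_from_q4 = q4_gb / (1 - q4_reduction)  # implied full-precision size, ≈ 16.7 GB
orig_from_q5 = q5_gb / (1 - q5_reduction)  # implied full-precision size, ≈ 16.6 GB

print(f"Implied original size: {orig_from_q4:.1f} GB (from Q4), "
      f"{orig_from_q5:.1f} GB (from Q5)")
```

Both back-calculations land near 16–17 GB, consistent with an 8B-parameter model stored at 16-bit precision.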
Vision-Language Capabilities
Ministral-3-8B-Instruct supports multimodal inputs, allowing users to provide both text and images within the same prompt. This enables applications such as:
- Image captioning and explanation
- Visual question answering
- Instruction following grounded in vision
- Contextual multimodal analysis
The model processes textual and visual information jointly, producing coherent responses that factor in both modalities.
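When the model is served behind an OpenAI-compatible endpoint (as vLLM and llama.cpp's server both provide), a mixed text-and-image prompt is typically expressed as a chat message whose content is a list of text and `image_url` parts, with the image inlined as a base64 data URL. A minimal sketch of building such a payload (the helper name is hypothetical; only the message shape follows the OpenAI chat format):

```python
import base64
import json


def build_multimodal_message(question: str, image_bytes: bytes) -> dict:
    """Build one OpenAI-style user message mixing text and an inline image."""
    b64 = base64.b64encode(image_bytes).decode("ascii")
    return {
        "role": "user",
        "content": [
            {"type": "text", "text": question},
            {
                "type": "image_url",
                "image_url": {"url": f"data:image/png;base64,{b64}"},
            },
        ],
    }


# Placeholder bytes stand in for a real PNG file read from disk.
msg = build_multimodal_message(
    "Explain what is happening in this image.", b"\x89PNG\r\n\x1a\n"
)
print(json.dumps(msg)[:60])
```

In a real request this message list would be posted to the server's `/v1/chat/completions` endpoint.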
Training Background
The base model was pretrained on a large mixture of text and visual data, followed by instruction tuning that emphasizes reliable multimodal reasoning and instruction compliance.
Pretraining
- Large-scale multimodal pretraining
- Joint text-image representation learning
- Optimization for robust, coherent generation
Instruction Tuning
- Fine-tuned with multimodal instruction datasets
- Trained for clarity, task adherence, and visual reasoning
- Enhanced for conversational quality across modalities
Key Capabilities
Multimodal Input Understanding
Incorporates image content and text together to produce aligned responses.

Instruction Compliance
Follows detailed user directives, including ones involving visual context.

Reasoning & Analysis
Supports step-by-step explanation and problem solving, integrating visual evidence.

Conversational Dialogue
Maintains fluid dialogue across mixed text-image interactions.

Efficient vLLM Serving
Works well with vLLM inference for scalable deployment.
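A minimal sketch of serving the model with vLLM's OpenAI-compatible server and querying it (the model id and flags are assumptions; exact options depend on your vLLM version and hardware):

```shell
# Launch an OpenAI-compatible server for the base model.
vllm serve mistralai/Ministral-3-8B-Instruct-2512 \
    --max-model-len 8192

# In another terminal, query the chat completions endpoint.
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "mistralai/Ministral-3-8B-Instruct-2512",
       "messages": [{"role": "user", "content": "Hello"}]}'
```

The same endpoint accepts multimodal messages (text plus `image_url` parts) for vision-language requests.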
Usage Examples
llama.cpp Usage

```shell
./llama-cli \
  -m SandLogicTechnologies/ministral-3-8b-instruct_Q4_K_M.gguf \
  --image ./example.png \
  -p "Explain what is happening in this image."
```
Recommended Applications
- Multimodal Assistants: Build systems that understand and respond to both images and text.
- Visual QA Tools: Create applications that answer questions grounded in image context.
- Content Understanding: Summarize or reason over documents with associated images.
- Conversational AI: Serve rich, multimodal dialogues in high-throughput environments.
Acknowledgments
This repository is based on the Ministral-3-8B-Instruct model, developed by Mistral AI.
Thanks to:
- The Mistral AI team for releasing multimodal capabilities
- The llama.cpp community for enabling efficient GGUF inference
Contact
For questions, feedback, or support, please reach out at support@sandlogic.com or visit https://www.sandlogic.com/