Mistral-Medium-3.5-128B-W4A16

Multimodal build. This release keeps the vision tower in BF16 and quantizes only the language layers. Vision is fully active — unlike the previous text-only preview, you can pass images without restriction.

Summary

A W4A16 / W4G128 weight-only quantization of Mistral-Medium-3.5-128B, produced with AutoRound (auto-round-mllm) and exported in the llm_compressor format (compressed-tensors pack-quantized).

  • Weights: 4-bit (group-size 128, symmetric)
  • Activations: 16-bit (BF16)
  • Vision tower: kept in BF16, fully active
  • Export format: compressed-tensors (HF) + Mistral consolidated format (vLLM-native), shipped side-by-side in the same repo
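
For intuition on the storage cost: with 4-bit symmetric weights and one shared scale per group of 128, the quantized layers land at roughly 4.1 bits per weight. A back-of-the-envelope sketch (it assumes one 16-bit scale per group, which is how compressed-tensors typically stores symmetric W4G128):

bits_per_weight = 4 + 16 / 128           # 4.125 bits/weight for the quantized LLM layers
bytes_per_weight = bits_per_weight / 8   # ~0.52 bytes/weight
print(f"{bits_per_weight=} ({bytes_per_weight:.3f} B/weight)")
# The BF16 layers (vision tower, lm_head, embeddings, norms) stay at 2 bytes/weight,
# which is roughly consistent with each weight set weighing in around ~70 GB.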

Quantization Details

Parameter                 Value
Method                    AutoRound (auto-round-mllm)
Bits                      4
Group size                128
Symmetric                 Yes
Calibration iterations    200
Calibration samples       512
Sequence length           2048
Calibration batch size    8
Export format             llm_compressor (compressed-tensors pack-quantized)
Layers kept in BF16       full vision tower (48 layers), multi_modal_projector, lm_head, embeddings, norms

The observer: memoryless_minmax entry visible in quantization_config is only a metadata tag of the llm_compressor export format, not the actual calibration method. AutoRound performs block-wise, gradient-based optimization of the rounding values, comparable in quality to GPTQ/AWQ.
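
To see the exported metadata for yourself, you can read quantization_config straight from config.json. A minimal sketch (it assumes the standard compressed-tensors keys such as format, ignore and config_groups, and downloads only the config file):

import json

from huggingface_hub import hf_hub_download

# Fetch only config.json from the quantized repo and inspect the
# compressed-tensors quantization metadata.
config_path = hf_hub_download("plezan/Mistral-Medium-3.5-128B-W4A16", "config.json")
with open(config_path) as f:
    config = json.load(f)

qcfg = config["quantization_config"]
print(qcfg.get("format"))    # e.g. "pack-quantized"
print(qcfg.get("ignore"))    # layers kept in BF16 (vision tower, lm_head, ...)
# The 4-bit / group-size-128 / symmetric scheme lives under config_groups.
print(json.dumps(qcfg.get("config_groups", {}), indent=2))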

Reproduction command
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 auto-round-mllm \
  --model mistralai/Mistral-Medium-3.5-128B \
  --device 0,1,2,3,4,5,6,7 \
  --bits 4 --group_size 128 \
  --iters 200 --nsamples 512 --seqlen 2048 \
  --batch_size 8 --gradient_accumulate_steps 1 \
  --low_gpu_mem_usage \
  --format "llm_compressor" \
  --output_dir ./Mistral-Medium-3.5-128B-W4A16
Tuning summary (timings & per-block loss)
  • Total tuning time: ~25 h 25 min on 8 GPUs (91 596 s, ~17 min/block)

  • Layers quantized: 616 / 956 (LLM q/k/v/o + gate/up/down over 88 blocks). The remaining 340 layers — full vision tower (48 × 7), multi_modal_projector (3), and lm_head — are kept in BF16, matching the ignore list in config.json. A quick arithmetic check of these counts appears after the loss table below.

  • Per-block loss after AutoRound optimization (final iter, log scale, lower is better):

    Block range    Best loss range
    0 – 9          ~1 × 10⁻⁶
    10 – 29        1 × 10⁻⁶ – 2 × 10⁻⁵
    30 – 49        3 × 10⁻⁵ – 2 × 10⁻⁴
    50 – 69        2 × 10⁻⁴ – 6 × 10⁻⁴
    70 – 87        6 × 10⁻⁴ – 6 × 10⁻³

    The monotonic increase with depth is the typical AutoRound block-wise distillation pattern: later blocks accumulate residual quantization error from earlier ones. All 88 blocks converged within 200 iterations without divergence. The final block (87) finished at 5.6 × 10⁻³, the largest per-block residual error of the run.
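
A quick sanity check of the layer counts and per-block timing quoted above (plain arithmetic, no model required):

# LLM side: 88 transformer blocks, each with q/k/v/o + gate/up/down projections.
quantized = 88 * 7                  # 616 quantized linear layers
# BF16 side: 48 vision blocks x 7 projections, 3 projector layers, lm_head.
kept_bf16 = 48 * 7 + 3 + 1          # 340 layers left in BF16
assert quantized + kept_bf16 == 956

# Timing: 91 596 s total over 88 LLM blocks.
seconds_per_block = 91_596 / 88     # ~1041 s, i.e. ~17 min per block
print(quantized, kept_bf16, round(seconds_per_block))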

Recommended Hardware

  • 192 GB of VRAM: 8 × 24 GB (RTX 3090) or equivalent
  • Can run on ≥ 96 GB VRAM (RTX PRO 6000, H100 NVL, MI300X, ...)

vLLM (Recommended)

We recommend using Mistral Medium 3.5 with the vLLM library for production-ready inference.

To further speed up local inference with vLLM, check out the released EAGLE draft model (see the Speculative decoding (EAGLE) section below).

Installation

Make sure to install the vLLM nightly build:

uv pip install -U vllm \
   --torch-backend=auto \
   --extra-index-url https://wheels.vllm.ai/nightly

Doing so should automatically install mistral_common >= 1.11.1 and transformers >= 5.4.0.

To check:

python -c "import mistral_common; print(mistral_common.__version__)"
python -c "import transformers; print(transformers.__version__)"

You can also make use of a ready-to-go Docker image from Docker Hub.

Serve the Model

We recommend a server/client setup:

# 192 GB VRAM - Full context
vllm serve plezan/Mistral-Medium-3.5-128B-W4A16 \
  --tokenizer-mode mistral \
  --tool-call-parser mistral \
  --reasoning-parser mistral \
  --enable-auto-tool-choice \
  --max-num-seqs 8 \
  --tensor-parallel-size 8

# 96 GB VRAM - 32k context
vllm serve plezan/Mistral-Medium-3.5-128B-W4A16 \
  --tokenizer-mode mistral \
  --tool-call-parser mistral \
  --reasoning-parser mistral \
  --enable-auto-tool-choice \
  --max-num-seqs 8 \
  --tensor-parallel-size 4 \
  --max-model-len 32000

Ping the Server
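
Once the server is up, a minimal Python check (a sketch, assuming the default localhost:8000 endpoint used by the commands above) is to list the served models and send a one-line request:

from openai import OpenAI

# vLLM exposes an OpenAI-compatible API; the key can be any non-empty string.
client = OpenAI(api_key="EMPTY", base_url="http://localhost:8000/v1")

# Should list plezan/Mistral-Medium-3.5-128B-W4A16.
model = client.models.list().data[0].id
print(model)

# One-line smoke test.
resp = client.chat.completions.create(
    model=model,
    messages=[{"role": "user", "content": "Say 'pong'."}],
    max_tokens=8,
)
print(resp.choices[0].message.content)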

Instruction Following

Mistral Medium 3.5 can follow your instructions to the letter.

from datetime import datetime, timedelta

from huggingface_hub import hf_hub_download
from openai import OpenAI

# Modify OpenAI's API key and API base to use vLLM's API server.
openai_api_key = "EMPTY"
openai_api_base = "http://localhost:8000/v1"

REASONING_EFFORT = "none" # Toggle reasoning with 'high'.

match REASONING_EFFORT:
    case "none":
        TEMP = 0.1
        TOP_P = None
    case "high":
        TEMP = 0.7
        TOP_P = 0.95
    case _:
        raise ValueError("Only REASONING_EFFORT in ['none', 'high'] is supported.")

client = OpenAI(
    api_key=openai_api_key,
    base_url=openai_api_base,
)

models = client.models.list()
model = models.data[0].id


def load_system_prompt(repo_id: str, filename: str) -> str:
    file_path = hf_hub_download(repo_id=repo_id, filename=filename)
    with open(file_path, "r") as file:
        system_prompt = file.read()
    today = datetime.today().strftime("%Y-%m-%d")
    yesterday = (datetime.today() - timedelta(days=1)).strftime("%Y-%m-%d")
    model_name = repo_id.split("/")[-1]
    return system_prompt.format(name=model_name, today=today, yesterday=yesterday)


SYSTEM_PROMPT = load_system_prompt(model, "SYSTEM_PROMPT.txt")

messages = [
    {"role": "system", "content": SYSTEM_PROMPT},
    {
        "role": "user",
        "content": "Write me a sentence where every word starts with the next letter in the alphabet - start with 'a' and end with 'z'.",
    },
]

response = client.chat.completions.create(
    model=model,
    messages=messages,
    reasoning_effort=REASONING_EFFORT,
    temperature=TEMP,
    top_p=TOP_P,
)

print("==============================================================")
print(f"Request with {REASONING_EFFORT=}, {TEMP=} and {TOP_P=}.")
print("==============================================================")
print("REASONING")
print("~~~~~~~~~")
print(response.choices[0].message.reasoning)
print("==============================================================")
print("CONTENT")
print("~~~~~~~")
print(response.choices[0].message.content)
Tool Call

Let's solve some equations thanks to our simple Python calculator tool.

import json
from datetime import datetime, timedelta

from openai import OpenAI
from huggingface_hub import hf_hub_download

# Modify OpenAI's API key and API base to use vLLM's API server.
openai_api_key = "EMPTY"
openai_api_base = "http://localhost:8000/v1"

REASONING_EFFORT = "none" # Toggle reasoning with 'high'.

match REASONING_EFFORT:
    case "none":
        TEMP = 0.1
        TOP_P = None
    case "high":
        TEMP = 0.7
        TOP_P = 0.95
    case _:
        raise ValueError("Only REASONING_EFFORT in ['none', 'high'] is supported.")

client = OpenAI(
    api_key=openai_api_key,
    base_url=openai_api_base,
)

models = client.models.list()
model = models.data[0].id


def load_system_prompt(repo_id: str, filename: str) -> str:
    file_path = hf_hub_download(repo_id=repo_id, filename=filename)
    with open(file_path, "r") as file:
        system_prompt = file.read()
    today = datetime.today().strftime("%Y-%m-%d")
    yesterday = (datetime.today() - timedelta(days=1)).strftime("%Y-%m-%d")
    model_name = repo_id.split("/")[-1]
    return system_prompt.format(name=model_name, today=today, yesterday=yesterday)


SYSTEM_PROMPT = load_system_prompt(model, "SYSTEM_PROMPT.txt")

image_url = "https://math-coaching.com/img/fiche/46/expressions-mathematiques.jpg"


def my_calculator(expression: str) -> str:
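    # Demo only: eval() on model-supplied input is unsafe outside a sandboxed environment.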
    return str(eval(expression))


tools = [
    {
        "type": "function",
        "function": {
            "name": "my_calculator",
            "description": "A calculator that can evaluate a mathematical expression.",
            "parameters": {
                "type": "object",
                "properties": {
                    "expression": {
                        "type": "string",
                        "description": "The mathematical expression to evaluate.",
                    },
                },
                "required": ["expression"],
            },
        },
    },
    {
        "type": "function",
        "function": {
            "name": "rewrite",
            "description": "Rewrite a given text for improved clarity",
            "parameters": {
                "type": "object",
                "properties": {
                    "text": {
                        "type": "string",
                        "description": "The input text to rewrite",
                    }
                },
            },
        },
    },
]

messages = [
    {"role": "system", "content": SYSTEM_PROMPT},
    {
        "role": "user",
        "content": [
            {
                "type": "text",
                "text": "Thanks to your calculator, compute the results for the equations that involve numbers displayed in the image.",
            },
            {
                "type": "image_url",
                "image_url": {
                    "url": image_url,
                },
            },
        ],
    },
]

response = client.chat.completions.create(
    model=model,
    messages=messages,
    tools=tools,
    tool_choice="auto",
    reasoning_effort=REASONING_EFFORT,
    temperature=TEMP,
    top_p=TOP_P,
)

tool_calls = response.choices[0].message.tool_calls

results = []
for tool_call in tool_calls:
    function_name = tool_call.function.name
    function_args = tool_call.function.arguments
    if function_name == "my_calculator":
        result = my_calculator(**json.loads(function_args))
        results.append(result)

messages.append({"role": "assistant", "tool_calls": tool_calls})
for tool_call, result in zip(tool_calls, results):
    messages.append(
        {
            "role": "tool",
            "tool_call_id": tool_call.id,
            "name": tool_call.function.name,
            "content": result,
        }
    )


response = client.chat.completions.create(
    model=model,
    messages=messages,
    reasoning_effort=REASONING_EFFORT,
    temperature=TEMP,
    top_p=TOP_P,
)

print("==============================================================")
print(f"Request with {REASONING_EFFORT=}, {TEMP=} and {TOP_P=}.")
print("==============================================================")
print("REASONING")
print("~~~~~~~~~")
print(response.choices[0].message.reasoning)
print("==============================================================")
print("CONTENT")
print("~~~~~~~")
print(response.choices[0].message.content)
Vision Reasoning

Let's see if Mistral Medium 3.5 knows when to pick a fight!

from datetime import datetime, timedelta

from openai import OpenAI
from huggingface_hub import hf_hub_download

# Modify OpenAI's API key and API base to use vLLM's API server.
openai_api_key = "EMPTY"
openai_api_base = "http://localhost:8000/v1"

REASONING_EFFORT = "high" # Remove reasoning with 'none'.

match REASONING_EFFORT:
    case "none":
        TEMP = 0.1
        TOP_P = None
    case "high":
        TEMP = 0.7
        TOP_P = 0.95
    case _:
        raise ValueError("Only REASONING_EFFORT in ['none', 'high'] is supported.")

client = OpenAI(
    api_key=openai_api_key,
    base_url=openai_api_base,
)

models = client.models.list()
model = models.data[0].id


def load_system_prompt(repo_id: str, filename: str) -> str:
    file_path = hf_hub_download(repo_id=repo_id, filename=filename)
    with open(file_path, "r") as file:
        system_prompt = file.read()
    today = datetime.today().strftime("%Y-%m-%d")
    yesterday = (datetime.today() - timedelta(days=1)).strftime("%Y-%m-%d")
    model_name = repo_id.split("/")[-1]
    return system_prompt.format(name=model_name, today=today, yesterday=yesterday)


SYSTEM_PROMPT = load_system_prompt(model, "SYSTEM_PROMPT.txt")
image_url = "https://static.wikia.nocookie.net/essentialsdocs/images/7/70/Battle.png/revision/latest?cb=20220523172438"

messages = [
    {"role": "system", "content": SYSTEM_PROMPT},
    {
        "role": "user",
        "content": [
            {
                "type": "text",
                "text": "What action do you think I should take in this situation? List all the possible actions and explain why you think they are good or bad.",
            },
            {"type": "image_url", "image_url": {"url": image_url}},
        ],
    },
]


response = client.chat.completions.create(
    model=model,
    messages=messages,
    reasoning_effort=REASONING_EFFORT,
    temperature=TEMP,
    top_p=TOP_P,
)

print("==============================================================")
print(f"Request with {REASONING_EFFORT=}, {TEMP=} and {TOP_P=}.")
print("==============================================================")
print("REASONING")
print("~~~~~~~~~")
print(response.choices[0].message.reasoning)
print("==============================================================")
print("CONTENT")
print("~~~~~~~")
print(response.choices[0].message.content)
Speculative decoding (EAGLE)

Append the --speculative-config block to either of the commands above:

  --speculative-config '{
    "model": "mistralai/Mistral-Medium-3.5-128B-EAGLE",
    "num_speculative_tokens": 3,
    "method": "eagle",
    "max_model_len": 65536
  }'

Inference with transformers

from transformers import AutoModelForImageTextToText, AutoProcessor

model_id = "plezan/Mistral-Medium-3.5-128B-W4A16"
model = AutoModelForImageTextToText.from_pretrained(model_id, dtype="auto", device_map="auto")
processor = AutoProcessor.from_pretrained(model_id)

messages = [{"role": "user", "content": [{"type": "text", "text": "Bonjour !"}]}]
inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,  # return a dict of tensors so they can be unpacked into generate()
    return_tensors="pt",
).to(model.device)
out = model.generate(**inputs, max_new_tokens=256)
print(processor.decode(out[0], skip_special_tokens=True))

Requires a recent transformers build with the Mistral3ForConditionalGeneration architecture.
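
Since the vision tower is kept active, the same processor also accepts images. A minimal sketch (the "image" content entry follows the transformers chat-template convention; the URL is the example image reused from the tool-call section above, and neither is prescribed by this repo):

messages = [
    {
        "role": "user",
        "content": [
            # The chat template resolves "image" entries through the processor's image loader.
            {"type": "image", "url": "https://math-coaching.com/img/fiche/46/expressions-mathematiques.jpg"},
            {"type": "text", "text": "What equations are shown in this image?"},
        ],
    }
]
inputs = processor.apply_chat_template(
    messages, add_generation_prompt=True, tokenize=True, return_dict=True, return_tensors="pt"
).to(model.device)
out = model.generate(**inputs, max_new_tokens=256)
print(processor.decode(out[0], skip_special_tokens=True))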

Repository internals

This repo ships two parallel sets of weights for the same parameters: HF (model-*.safetensors + config.json) for transformers, and Mistral consolidated (consolidated-*.safetensors + params.json) for vLLM. Both are functionally identical — pick the one your loader prefers.

Why dual format?

The consolidated format lets vLLM use --load-format auto (or mistral), which dispatches the Pixtral class and the native MistralCommonImageProcessor. This path avoids the incompatibility between transformers.PixtralProcessor and MistralCommonBackend that causes the string "[IMG]" to tokenize into 3 tokens instead of 1, triggering ValueError: Mismatch in image token count.
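
A rough way to check the token count from Python (a diagnostic sketch only; it assumes the HF tokenizer in this repo loads through AutoTokenizer and registers [IMG] as a special token, which this card does not guarantee):

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("plezan/Mistral-Medium-3.5-128B-W4A16")
ids = tok.encode("[IMG]", add_special_tokens=False)
# A correctly configured image token encodes to a single id; the mismatch
# described above makes it split into 3 ids instead.
print(len(ids), ids)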

If you only use transformers, the HF set is enough; if you only use vLLM, the consolidated set is enough.
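
To avoid pulling both copies, you can filter the download by filename pattern. A sketch with huggingface_hub (the patterns reflect the file naming described above; adjust them if your setup needs additional config files):

from huggingface_hub import snapshot_download

# HF / transformers set only (model-*.safetensors + HF configs).
# Note: the broad *.json pattern also grabs params.json, which is harmless.
snapshot_download(
    "plezan/Mistral-Medium-3.5-128B-W4A16",
    allow_patterns=["model-*.safetensors", "*.json", "*.txt"],
    local_dir="./Mistral-Medium-3.5-128B-W4A16-hf",
)

# Mistral consolidated set only (consolidated-*.safetensors + params.json) for vLLM.
snapshot_download(
    "plezan/Mistral-Medium-3.5-128B-W4A16",
    allow_patterns=["consolidated-*.safetensors", "*.json", "*.txt"],
    local_dir="./Mistral-Medium-3.5-128B-W4A16-mistral",
)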

W4A16 storage convention (HF/Llama vs Mistral permutation)

For the W4A16 LLM layers (weight_packed, weight_scale, weight_shape), tensors are stored in HF/Llama convention (not in classic Mistral convention). This is intentional: vLLM only knows how to apply the Q/K permutation to suffixes weight and qscale_weight, so compressed-tensors tensors must be pre-permuted. The vision part (BF16) is stored in classic Mistral convention. If a future vLLM version extends the permutation to compressed-tensors suffixes, this repo will need to be regenerated.
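
For context, the HF/Llama vs Mistral difference is the rotary-interleave permutation applied to the rows of the Q/K projections. A sketch of the standard permutation used by the Llama-to-HF conversion scripts (the shapes below are illustrative, not this model's real dimensions):

import torch

def permute_qk(w: torch.Tensor, n_heads: int) -> torch.Tensor:
    # Reorder rows of a Q/K projection from the interleaved rotary-pair layout
    # (Mistral/Meta convention) to the half-split-per-head layout (HF/Llama convention).
    out_dim, in_dim = w.shape
    head_dim = out_dim // n_heads
    return (
        w.view(n_heads, head_dim // 2, 2, in_dim)
        .transpose(1, 2)
        .reshape(out_dim, in_dim)
    )

q = torch.randn(32 * 128, 4096)   # illustrative shape only
print(permute_qk(q, n_heads=32).shape)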

Known Limitations

  • Quality: W4A16 weight-only quantization may introduce slight degradation on tasks requiring high numerical precision or multi-step reasoning. The 200-iteration AutoRound run reduces this vs naive RTN, but it does not fully match BF16.
  • vLLM dependency: the tokenizer and parsers require --tokenizer-mode mistral.
  • Dual-format footprint: the repo carries two copies of the weights (HF + consolidated, ~70 GB each) to avoid the [IMG] tokenization bug in transformers.PixtralProcessor. See the Repository internals section above.
  • Preview: weights may be updated in-place without creating a new repository (same URL, new revision).

Evaluation

Not yet evaluated. PPL (wikitext), MMLU and multimodal benchmarks (DocVQA, MMMU) are planned and will be reported here.

License

Model Weights

This quantized model is a derivative of mistralai/Mistral-Medium-3.5-128B. The weights are licensed under the same Modified MIT License as the original model.

You must not use this model in a manner that infringes, misappropriates, or otherwise violates any third party's rights, including intellectual property rights.

Repository Code

The AutoRound quantization configuration, model card, and inference examples provided in this repository are released under the Apache 2.0 License.
