Mistral-Medium-3.5-128B-W4A16
Multimodal build. This release keeps the vision tower in BF16 and quantizes only the language layers. Vision is fully active — unlike the previous text-only preview, you can pass images without restriction.
Summary
A W4A16 / W4G128 weight-only quantization of Mistral-Medium-3.5-128B, produced with AutoRound (auto-round-mllm) and exported in the llm_compressor format (compressed-tensors pack-quantized).
- Weights: 4-bit (group-size 128, symmetric)
- Activations: 16-bit (BF16)
- Vision tower: kept in BF16, fully active
- Export format: compressed-tensors (HF) + Mistral consolidated format (vLLM-native), shipped side-by-side in the same repo
Quantization Details
| Parameter | Value |
|---|---|
| Method | AutoRound (auto-round-mllm) |
| Bits | 4 |
| Group size | 128 |
| Symmetric | Yes |
| Calibration iterations | 200 |
| Calibration samples | 512 |
| Sequence length | 2048 |
| Calibration batch size | 8 |
| Export format | llm_compressor (compressed-tensors pack-quantized) |
| Layers kept in BF16 | full vision tower (48 layers), multi_modal_projector, lm_head, embeddings, norms |
The `observer: memoryless_minmax` field visible in `quantization_config` is just a metadata tag of the `llm_compressor` export format, not the actual calibration method. AutoRound performs block-wise gradient-based optimization of the rounding values, comparable in quality to GPTQ/AWQ.
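For orientation, the relevant entry in `config.json` looks roughly like the sketch below. Field names follow the compressed-tensors schema as I understand it; the exact contents shipped in this repo may differ, so treat this as illustrative only:

```json
"quantization_config": {
  "quant_method": "compressed-tensors",
  "format": "pack-quantized",
  "config_groups": {
    "group_0": {
      "targets": ["Linear"],
      "weights": {
        "num_bits": 4,
        "type": "int",
        "symmetric": true,
        "strategy": "group",
        "group_size": 128,
        "observer": "memoryless_minmax"
      }
    }
  },
  "ignore": ["lm_head"]
}
```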
Reproduction command
```sh
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 auto-round-mllm \
    --model mistralai/Mistral-Medium-3.5-128B \
    --device 0,1,2,3,4,5,6,7 \
    --bits 4 --group_size 128 \
    --iters 200 --nsamples 512 --seqlen 2048 \
    --batch_size 8 --gradient_accumulate_steps 1 \
    --low_gpu_mem_usage \
    --format "llm_compressor" \
    --output_dir ./Mistral-Medium-3.5-128B-W4A16
```
Tuning summary (timings & per-block loss)
Total tuning time: ~25 h 25 min on 8 GPUs (91 596 s, ~17 min/block)
Layers quantized: 616 / 956 (LLM `q/k/v/o` + `gate/up/down` over 88 blocks). The remaining 340 layers, i.e. the full vision tower (48 × 7), `multi_modal_projector` (3), and `lm_head` (1), are kept in BF16, matching the `ignore` list in `config.json`.

Per-block loss after AutoRound optimization (final iter, log scale, lower is better):
| Block range | Best loss range |
|---|---|
| 0 – 9 | ~1 × 10⁻⁶ |
| 10 – 29 | 1 × 10⁻⁶ – 2 × 10⁻⁵ |
| 30 – 49 | 3 × 10⁻⁵ – 2 × 10⁻⁴ |
| 50 – 69 | 2 × 10⁻⁴ – 6 × 10⁻⁴ |
| 70 – 87 | 6 × 10⁻⁴ – 6 × 10⁻³ |

The monotonic increase with depth is a typical AutoRound block-wise distillation pattern (later layers accumulate residual quantization error from earlier ones). All 88 blocks converged within 200 iterations without divergence. The final block (87) finished at 5.6 × 10⁻³, which is the upper bound of the residual error budget for this run.
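The layer counts and timings above can be sanity-checked with a few lines of plain arithmetic (nothing model-specific):

```python
# 88 transformer blocks, 7 quantized linears each (q/k/v/o + gate/up/down)
llm_layers = 88 * 7                 # 616 quantized layers
# BF16 remainder: vision tower (48 blocks x 7), projector (3), lm_head (1)
bf16_layers = 48 * 7 + 3 + 1        # 340 layers kept in BF16
total = llm_layers + bf16_layers    # 956 layers overall

# Timing: 91 596 s total over 88 blocks
per_block_min = 91_596 / 88 / 60    # roughly 17 min per block
total_hours = 91_596 / 3600         # roughly 25.4 h

print(llm_layers, bf16_layers, total)
print(round(per_block_min, 1), round(total_hours, 2))
```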
Recommended Hardware
- 192 GB of VRAM: 8 × 24 GB (RTX 3090) or equivalent
- Can run on ≥ 96 GB VRAM (RTX PRO 6000, H100 NVL, MI300X, ...)
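As a rough, illustrative back-of-envelope (it ignores the BF16 vision tower, embeddings, and KV cache, so real usage is higher): 128 B parameters at 4 bits per weight is about 64 GB of packed weights, which is why the model can fit on ≥ 96 GB setups with a reduced context length.

```python
# Back-of-envelope only: 128e9 params, 4-bit packed weights.
params = 128e9
w4_gb = params * 4 / 8 / 1e9  # bits -> bytes -> GB
print(w4_gb)                  # 64.0
```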
vLLM (Recommended)
We recommend using Mistral Medium 3.5 with the vLLM library for production-ready inference.
To speed up local inference with vLLM, check out our released EAGLE model.
Installation
Make sure to install the vLLM nightly build:

```sh
uv pip install -U vllm \
    --torch-backend=auto \
    --extra-index-url https://wheels.vllm.ai/nightly
```
Doing so should automatically install `mistral_common >= 1.11.1` and `transformers >= 5.4.0`. To check:

```sh
python -c "import mistral_common; print(mistral_common.__version__)"
python -c "import transformers; print(transformers.__version__)"
```
You can also use a ready-to-go Docker image from Docker Hub.
Serve the Model
We recommend a server/client setup:
```sh
# 192 GB VRAM - full context
vllm serve plezan/Mistral-Medium-3.5-128B-W4A16 \
    --tokenizer-mode mistral \
    --tool-call-parser mistral \
    --reasoning-parser mistral \
    --enable-auto-tool-choice \
    --max-num-seqs 8 \
    --tensor-parallel-size 8

# 96 GB VRAM - 32k context
vllm serve plezan/Mistral-Medium-3.5-128B-W4A16 \
    --tokenizer-mode mistral \
    --tool-call-parser mistral \
    --reasoning-parser mistral \
    --enable-auto-tool-choice \
    --max-num-seqs 8 \
    --tensor-parallel-size 4 \
    --max-model-len 32000
```
Ping the Server
Instruction Following
Mistral Medium 3.5 can follow your instructions to the letter.
```python
from datetime import datetime, timedelta

from huggingface_hub import hf_hub_download
from openai import OpenAI

# Modify OpenAI's API key and API base to use vLLM's API server.
openai_api_key = "EMPTY"
openai_api_base = "http://localhost:8000/v1"

REASONING_EFFORT = "none"  # Toggle reasoning with 'high'.
match REASONING_EFFORT:
    case "none":
        TEMP = 0.1
        TOP_P = None
    case "high":
        TEMP = 0.7
        TOP_P = 0.95
    case _:
        raise ValueError("Only REASONING_EFFORT in ['none', 'high'] is supported.")

client = OpenAI(
    api_key=openai_api_key,
    base_url=openai_api_base,
)

models = client.models.list()
model = models.data[0].id


def load_system_prompt(repo_id: str, filename: str) -> str:
    file_path = hf_hub_download(repo_id=repo_id, filename=filename)
    with open(file_path, "r") as file:
        system_prompt = file.read()
    today = datetime.today().strftime("%Y-%m-%d")
    yesterday = (datetime.today() - timedelta(days=1)).strftime("%Y-%m-%d")
    model_name = repo_id.split("/")[-1]
    return system_prompt.format(name=model_name, today=today, yesterday=yesterday)


SYSTEM_PROMPT = load_system_prompt(model, "SYSTEM_PROMPT.txt")

messages = [
    {"role": "system", "content": SYSTEM_PROMPT},
    {
        "role": "user",
        "content": "Write me a sentence where every word starts with the next letter in the alphabet - start with 'a' and end with 'z'.",
    },
]

response = client.chat.completions.create(
    model=model,
    messages=messages,
    reasoning_effort=REASONING_EFFORT,
    temperature=TEMP,
    top_p=TOP_P,
)

print("==============================================================")
print(f"Request with {REASONING_EFFORT=}, {TEMP=} and {TOP_P=}.")
print("==============================================================")
print("REASONING")
print("~~~~~~~~~")
print(response.choices[0].message.reasoning)
print("==============================================================")
print("CONTENT")
print("~~~~~~~")
print(response.choices[0].message.content)
```
Tool Call
Let's solve some equations with the help of a simple Python calculator tool.
```python
import json
from datetime import datetime, timedelta

from huggingface_hub import hf_hub_download
from openai import OpenAI

# Modify OpenAI's API key and API base to use vLLM's API server.
openai_api_key = "EMPTY"
openai_api_base = "http://localhost:8000/v1"

REASONING_EFFORT = "none"  # Toggle reasoning with 'high'.
match REASONING_EFFORT:
    case "none":
        TEMP = 0.1
        TOP_P = None
    case "high":
        TEMP = 0.7
        TOP_P = 0.95
    case _:
        raise ValueError("Only REASONING_EFFORT in ['none', 'high'] is supported.")

client = OpenAI(
    api_key=openai_api_key,
    base_url=openai_api_base,
)

models = client.models.list()
model = models.data[0].id


def load_system_prompt(repo_id: str, filename: str) -> str:
    file_path = hf_hub_download(repo_id=repo_id, filename=filename)
    with open(file_path, "r") as file:
        system_prompt = file.read()
    today = datetime.today().strftime("%Y-%m-%d")
    yesterday = (datetime.today() - timedelta(days=1)).strftime("%Y-%m-%d")
    model_name = repo_id.split("/")[-1]
    return system_prompt.format(name=model_name, today=today, yesterday=yesterday)


SYSTEM_PROMPT = load_system_prompt(model, "SYSTEM_PROMPT.txt")

image_url = "https://math-coaching.com/img/fiche/46/expressions-mathematiques.jpg"


def my_calculator(expression: str) -> str:
    # Demo only: eval() on model-generated input is unsafe in production.
    return str(eval(expression))


tools = [
    {
        "type": "function",
        "function": {
            "name": "my_calculator",
            "description": "A calculator that can evaluate a mathematical expression.",
            "parameters": {
                "type": "object",
                "properties": {
                    "expression": {
                        "type": "string",
                        "description": "The mathematical expression to evaluate.",
                    },
                },
                "required": ["expression"],
            },
        },
    },
    {
        "type": "function",
        "function": {
            "name": "rewrite",
            "description": "Rewrite a given text for improved clarity",
            "parameters": {
                "type": "object",
                "properties": {
                    "text": {
                        "type": "string",
                        "description": "The input text to rewrite",
                    }
                },
            },
        },
    },
]

messages = [
    {"role": "system", "content": SYSTEM_PROMPT},
    {
        "role": "user",
        "content": [
            {
                "type": "text",
                "text": "Thanks to your calculator, compute the results for the equations that involve numbers displayed in the image.",
            },
            {
                "type": "image_url",
                "image_url": {
                    "url": image_url,
                },
            },
        ],
    },
]

response = client.chat.completions.create(
    model=model,
    messages=messages,
    tools=tools,
    tool_choice="auto",
    reasoning_effort=REASONING_EFFORT,
    temperature=TEMP,
    top_p=TOP_P,
)

tool_calls = response.choices[0].message.tool_calls

results = []
for tool_call in tool_calls:
    function_name = tool_call.function.name
    function_args = tool_call.function.arguments
    if function_name == "my_calculator":
        result = my_calculator(**json.loads(function_args))
        results.append(result)

messages.append({"role": "assistant", "tool_calls": tool_calls})
for tool_call, result in zip(tool_calls, results):
    messages.append(
        {
            "role": "tool",
            "tool_call_id": tool_call.id,
            "name": tool_call.function.name,
            "content": result,
        }
    )

response = client.chat.completions.create(
    model=model,
    messages=messages,
    reasoning_effort=REASONING_EFFORT,
    temperature=TEMP,
    top_p=TOP_P,
)

print("==============================================================")
print(f"Request with {REASONING_EFFORT=}, {TEMP=} and {TOP_P=}.")
print("==============================================================")
print("REASONING")
print("~~~~~~~~~")
print(response.choices[0].message.reasoning)
print("==============================================================")
print("CONTENT")
print("~~~~~~~")
print(response.choices[0].message.content)
```
Vision Reasoning
Let's see if Mistral Medium 3.5 knows when to pick a fight!
```python
from datetime import datetime, timedelta

from huggingface_hub import hf_hub_download
from openai import OpenAI

# Modify OpenAI's API key and API base to use vLLM's API server.
openai_api_key = "EMPTY"
openai_api_base = "http://localhost:8000/v1"

REASONING_EFFORT = "high"  # Remove reasoning with 'none'.
match REASONING_EFFORT:
    case "none":
        TEMP = 0.1
        TOP_P = None
    case "high":
        TEMP = 0.7
        TOP_P = 0.95
    case _:
        raise ValueError("Only REASONING_EFFORT in ['none', 'high'] is supported.")

client = OpenAI(
    api_key=openai_api_key,
    base_url=openai_api_base,
)

models = client.models.list()
model = models.data[0].id


def load_system_prompt(repo_id: str, filename: str) -> str:
    file_path = hf_hub_download(repo_id=repo_id, filename=filename)
    with open(file_path, "r") as file:
        system_prompt = file.read()
    today = datetime.today().strftime("%Y-%m-%d")
    yesterday = (datetime.today() - timedelta(days=1)).strftime("%Y-%m-%d")
    model_name = repo_id.split("/")[-1]
    return system_prompt.format(name=model_name, today=today, yesterday=yesterday)


SYSTEM_PROMPT = load_system_prompt(model, "SYSTEM_PROMPT.txt")

image_url = "https://static.wikia.nocookie.net/essentialsdocs/images/7/70/Battle.png/revision/latest?cb=20220523172438"

messages = [
    {"role": "system", "content": SYSTEM_PROMPT},
    {
        "role": "user",
        "content": [
            {
                "type": "text",
                "text": "What action do you think I should take in this situation? List all the possible actions and explain why you think they are good or bad.",
            },
            {"type": "image_url", "image_url": {"url": image_url}},
        ],
    },
]

response = client.chat.completions.create(
    model=model,
    messages=messages,
    reasoning_effort=REASONING_EFFORT,
    temperature=TEMP,
    top_p=TOP_P,
)

print("==============================================================")
print(f"Request with {REASONING_EFFORT=}, {TEMP=} and {TOP_P=}.")
print("==============================================================")
print("REASONING")
print("~~~~~~~~~")
print(response.choices[0].message.reasoning)
print("==============================================================")
print("CONTENT")
print("~~~~~~~")
print(response.choices[0].message.content)
```
Speculative decoding (EAGLE)
Append the `--speculative-config` flag to either of the serve commands above:

```sh
--speculative-config '{
    "model": "mistralai/Mistral-Medium-3.5-128B-EAGLE",
    "num_speculative_tokens": 3,
    "method": "eagle",
    "max_model_len": 65536
}'
```
Inference with transformers
```python
from transformers import AutoModelForImageTextToText, AutoProcessor

model_id = "plezan/Mistral-Medium-3.5-128B-W4A16"

model = AutoModelForImageTextToText.from_pretrained(model_id, dtype="auto", device_map="auto")
processor = AutoProcessor.from_pretrained(model_id)

messages = [{"role": "user", "content": [{"type": "text", "text": "Bonjour !"}]}]
inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,  # needed so **inputs unpacks into generate()
    return_tensors="pt",
).to(model.device)

out = model.generate(**inputs, max_new_tokens=256)
print(processor.decode(out[0], skip_special_tokens=True))
```
Requires a recent `transformers` build that includes the `Mistral3ForConditionalGeneration` architecture.
Repository internals
This repo ships two parallel sets of weights for the same parameters: HF (`model-*.safetensors` + `config.json`) for `transformers`, and Mistral consolidated (`consolidated-*.safetensors` + `params.json`) for vLLM. Both are functionally identical; pick the one your loader prefers.
Why dual format?
The consolidated format lets vLLM use `--load-format auto` (or `mistral`), which dispatches the `Pixtral` class and the native `MistralCommonImageProcessor`. This path avoids the incompatibility between `transformers.PixtralProcessor` and `MistralCommonBackend` that causes the string `"[IMG]"` to tokenize into 3 tokens instead of 1, triggering `ValueError: Mismatch in image token count`.

If you only use `transformers`, the HF set is enough; if you only use vLLM, the consolidated set is enough.
W4A16 storage convention (HF/Llama vs Mistral permutation)
For the W4A16 LLM layers (`weight_packed`, `weight_scale`, `weight_shape`), tensors are stored in the HF/Llama convention, not the classic Mistral convention. This is intentional: vLLM only knows how to apply the Q/K permutation to the suffixes `weight` and `qscale_weight`, so compressed-tensors tensors must be pre-permuted. The vision part (BF16) is stored in the classic Mistral convention. If a future vLLM version extends the permutation to compressed-tensors suffixes, this repo will need to be regenerated.
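For intuition, the Q/K permutation in question reorders the rows of a query/key projection from the interleaved rotary-pair layout (classic Mistral/Meta checkpoints) to the half-split layout that HF/Llama-style code expects. The sketch below is illustrative only; the function name and shapes are my own, not taken from this repo's tooling:

```python
import numpy as np


def permute_qk(w: np.ndarray, n_heads: int) -> np.ndarray:
    """Reorder rows of a Q/K projection from interleaved rotary pairs
    (Mistral convention) to half-split halves (HF/Llama convention).

    w has shape (n_heads * head_dim, in_features).
    """
    dim1, dim2 = w.shape
    head_dim = dim1 // n_heads
    # Group rows into (head, pair_index, 2), then swap the pair and
    # half axes so each head's even rows come first, odd rows second.
    return (
        w.reshape(n_heads, head_dim // 2, 2, dim2)
        .transpose(0, 2, 1, 3)
        .reshape(dim1, dim2)
    )


# Tiny example: 2 heads with head_dim = 4; per head, rows [0, 1, 2, 3]
# are reordered to [0, 2, 1, 3].
w = np.arange(8 * 3).reshape(8, 3)
w_hf = permute_qk(w, n_heads=2)
```

This is a pure row permutation, which is why it can be baked into the packed tensors ahead of time without changing the model's output.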
Known Limitations
- Quality: W4A16 weight-only quantization may introduce slight degradation on tasks requiring high numerical precision or multi-step reasoning. The 200-iteration AutoRound run reduces this vs naive RTN, but it does not fully match BF16.
- vLLM dependency: the tokenizer and parsers require `--tokenizer-mode mistral`.
- Dual-format footprint: the repo carries two copies of the weights (HF + consolidated, ~70 GB each) to avoid the `[IMG]` tokenization bug in `transformers.PixtralProcessor`. See the Repository internals section above.
- Preview: weights may be updated in-place without creating a new repository (same URL, new revision).
Evaluation
Not yet evaluated. PPL (wikitext), MMLU and multimodal benchmarks (DocVQA, MMMU) are planned and will be reported here.
License
Model Weights
This quantized model is a derivative of mistralai/Mistral-Medium-3.5-128B. The weights are licensed under the same Modified MIT License as the original model.
You must not use this model in a manner that infringes, misappropriates, or otherwise violates any third party's rights, including intellectual property rights.
Repository Code
The AutoRound quantization configuration, model card, and inference examples provided in this repository are released under the Apache 2.0 License.