Mistral-Medium-3.5-128B-W4A16

Multimodal build. This release keeps the vision tower in BF16 and quantizes only the language layers. Vision is fully active — unlike the previous text-only preview, you can pass images without restriction.

Summary

A W4A16 / W4G128 weight-only quantization of Mistral-Medium-3.5-128B, produced with AutoRound (auto-round-mllm) and exported in the llm_compressor format (compressed-tensors pack-quantized).

  • Weights: 4-bit (group-size 128, symmetric)
  • Activations: 16-bit (BF16)
  • Vision tower: kept in BF16, fully active
  • Export format: compressed-tensors (HF) + Mistral consolidated format (vLLM-native), shipped side-by-side in the same repo
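
For intuition on the storage cost: with 4-bit symmetric weights and one shared scale per group of 128, the quantized layers land at roughly 4.1 bits per weight. A back-of-the-envelope sketch (it assumes one 16-bit scale per group, which is how compressed-tensors typically stores symmetric W4G128):

bits_per_weight = 4 + 16 / 128           # 4.125 bits/weight for the quantized LLM layers
bytes_per_weight = bits_per_weight / 8   # ~0.52 bytes/weight
print(f"{bits_per_weight=} ({bytes_per_weight:.3f} B/weight)")
# The BF16 layers (vision tower, lm_head, embeddings, norms) stay at 2 bytes/weight,
# which is roughly consistent with each weight set weighing in around ~70 GB.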

Quantization Details

Parameter                 Value
Method                    AutoRound (auto-round-mllm)
Bits                      4
Group size                128
Symmetric                 Yes
Calibration iterations    200
Calibration samples       512
Sequence length           2048
Calibration batch size    8
Export format             llm_compressor (compressed-tensors pack-quantized)
Layers kept in BF16       full vision tower (48 layers), multi_modal_projector, lm_head, embeddings, norms

The observer: memoryless_minmax entry visible in quantization_config is only a metadata tag of the llm_compressor export format, not the actual calibration method. AutoRound performs block-wise, gradient-based optimization of the rounding values, comparable in quality to GPTQ/AWQ.
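
To see the exported metadata for yourself, you can read quantization_config straight from config.json. A minimal sketch (it assumes the standard compressed-tensors keys such as format, ignore and config_groups, and downloads only the config file):

import json

from huggingface_hub import hf_hub_download

# Fetch only config.json from the quantized repo and inspect the
# compressed-tensors quantization metadata.
config_path = hf_hub_download("plezan/Mistral-Medium-3.5-128B-W4A16", "config.json")
with open(config_path) as f:
    config = json.load(f)

qcfg = config["quantization_config"]
print(qcfg.get("format"))    # e.g. "pack-quantized"
print(qcfg.get("ignore"))    # layers kept in BF16 (vision tower, lm_head, ...)
# The 4-bit / group-size-128 / symmetric scheme lives under config_groups.
print(json.dumps(qcfg.get("config_groups", {}), indent=2))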

Reproduction command
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 auto-round-mllm \
  --model mistralai/Mistral-Medium-3.5-128B \
  --device 0,1,2,3,4,5,6,7 \
  --bits 4 --group_size 128 \
  --iters 200 --nsamples 512 --seqlen 2048 \
  --batch_size 8 --gradient_accumulate_steps 1 \
  --low_gpu_mem_usage \
  --format "llm_compressor" \
  --output_dir ./Mistral-Medium-3.5-128B-W4A16
Tuning summary (timings & per-block loss)
  • Total tuning time: ~25 h 25 min on 8 GPUs (91 596 s, ~17 min/block)

  • Layers quantized: 616 / 956 (LLM q/k/v/o + gate/up/down over 88 blocks). The remaining 340 layers — full vision tower (48 × 7), multi_modal_projector (3), and lm_head — are kept in BF16, matching the ignore list in config.json. A quick arithmetic check of these counts appears after the loss table below.

  • Per-block loss after AutoRound optimization (final iter, log scale, lower is better):

    Block range    Best loss range
    0 – 9          ~1 × 10⁻⁶
    10 – 29        1 × 10⁻⁶ – 2 × 10⁻⁵
    30 – 49        3 × 10⁻⁵ – 2 × 10⁻⁴
    50 – 69        2 × 10⁻⁴ – 6 × 10⁻⁴
    70 – 87        6 × 10⁻⁴ – 6 × 10⁻³

    The monotonic increase with depth is the typical AutoRound block-wise distillation pattern: later blocks accumulate residual quantization error from earlier ones. All 88 blocks converged within 200 iterations without divergence. The final block (87) finished at 5.6 × 10⁻³, the largest per-block residual error of the run.
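
A quick sanity check of the layer counts and per-block timing quoted above (plain arithmetic, no model required):

# LLM side: 88 transformer blocks, each with q/k/v/o + gate/up/down projections.
quantized = 88 * 7                  # 616 quantized linear layers
# BF16 side: 48 vision blocks x 7 projections, 3 projector layers, lm_head.
kept_bf16 = 48 * 7 + 3 + 1          # 340 layers left in BF16
assert quantized + kept_bf16 == 956

# Timing: 91 596 s total over 88 LLM blocks.
seconds_per_block = 91_596 / 88     # ~1041 s, i.e. ~17 min per block
print(quantized, kept_bf16, round(seconds_per_block))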

Recommended Hardware

  • 192 GB of VRAM: 8 × 24 GB (RTX 3090) or equivalent
  • Can run on ≥ 96 GB VRAM (RTX PRO 6000, H100 NVL, MI300X, ...)

vLLM (Recommended)

We recommend using Mistral Medium 3.5 with the vLLM library for production-ready inference.

To further speed up local inference with vLLM, check out the released EAGLE draft model (see the Speculative decoding (EAGLE) section below).

Installation

Make sure to install the vLLM nightly build:

uv pip install -U vllm \
   --torch-backend=auto \
   --extra-index-url https://wheels.vllm.ai/nightly

Doing so should automatically install mistral_common >= 1.11.1 and transformers >= 5.4.0.

To check:

python -c "import mistral_common; print(mistral_common.__version__)"
python -c "import transformers; print(transformers.__version__)"

You can also make use of a ready-to-go Docker image from Docker Hub.

Serve the Model

We recommend a server/client setup:

# 192 GB VRAM - Full context
vllm serve plezan/Mistral-Medium-3.5-128B-W4A16 \
  --tokenizer-mode mistral \
  --tool-call-parser mistral \
  --reasoning-parser mistral \
  --enable-auto-tool-choice \
  --max-num-seqs 8 \
  --tensor-parallel-size 8

# 96 GB VRAM - 32k context
vllm serve plezan/Mistral-Medium-3.5-128B-W4A16 \
  --tokenizer-mode mistral \
  --tool-call-parser mistral \
  --reasoning-parser mistral \
  --enable-auto-tool-choice \
  --max-num-seqs 8 \
  --tensor-parallel-size 4 \
  --max-model-len 32000

Ping the Server
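
Once the server is up, a minimal Python check (a sketch, assuming the default localhost:8000 endpoint used by the commands above) is to list the served models and send a one-line request:

from openai import OpenAI

# vLLM exposes an OpenAI-compatible API; the key can be any non-empty string.
client = OpenAI(api_key="EMPTY", base_url="http://localhost:8000/v1")

# Should list plezan/Mistral-Medium-3.5-128B-W4A16.
model = client.models.list().data[0].id
print(model)

# One-line smoke test.
resp = client.chat.completions.create(
    model=model,
    messages=[{"role": "user", "content": "Say 'pong'."}],
    max_tokens=8,
)
print(resp.choices[0].message.content)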

Instruction Following

Mistral Medium 3.5 can follow your instructions to the letter.

from datetime import datetime, timedelta

from huggingface_hub import hf_hub_download
from openai import OpenAI

# Modify OpenAI's API key and API base to use vLLM's API server.
openai_api_key = "EMPTY"
openai_api_base = "http://localhost:8000/v1"

REASONING_EFFORT = "none" # Toggle reasoning with 'high'.

match REASONING_EFFORT:
    case "none":
        TEMP = 0.1
        TOP_P = None
    case "high":
        TEMP = 0.7
        TOP_P = 0.95
    case _:
        raise ValueError("Only REASONING_EFFORT in ['none', 'high'] is supported.")

client = OpenAI(
    api_key=openai_api_key,
    base_url=openai_api_base,
)

models = client.models.list()
model = models.data[0].id


def load_system_prompt(repo_id: str, filename: str) -> str:
    file_path = hf_hub_download(repo_id=repo_id, filename=filename)
    with open(file_path, "r") as file:
        system_prompt = file.read()
    today = datetime.today().strftime("%Y-%m-%d")
    yesterday = (datetime.today() - timedelta(days=1)).strftime("%Y-%m-%d")
    model_name = repo_id.split("/")[-1]
    return system_prompt.format(name=model_name, today=today, yesterday=yesterday)


SYSTEM_PROMPT = load_system_prompt(model, "SYSTEM_PROMPT.txt")

messages = [
    {"role": "system", "content": SYSTEM_PROMPT},
    {
        "role": "user",
        "content": "Write me a sentence where every word starts with the next letter in the alphabet - start with 'a' and end with 'z'.",
    },
]

response = client.chat.completions.create(
    model=model,
    messages=messages,
    reasoning_effort=REASONING_EFFORT,
    temperature=TEMP,
    top_p=TOP_P,
)

print("==============================================================")
print(f"Request with {REASONING_EFFORT=}, {TEMP=} and {TOP_P=}.")
print("==============================================================")
print("REASONING")
print("~~~~~~~~~")
print(response.choices[0].message.reasoning)
print("==============================================================")
print("CONTENT")
print("~~~~~~~")
print(response.choices[0].message.content)
Tool Call

Let's solve some equations thanks to our simple Python calculator tool.

import json
from datetime import datetime, timedelta

from openai import OpenAI
from huggingface_hub import hf_hub_download

# Modify OpenAI's API key and API base to use vLLM's API server.
openai_api_key = "EMPTY"
openai_api_base = "http://localhost:8000/v1"

REASONING_EFFORT = "none" # Toggle reasoning with 'high'.

match REASONING_EFFORT:
    case "none":
        TEMP = 0.1
        TOP_P = None
    case "high":
        TEMP = 0.7
        TOP_P = 0.95
    case _:
        raise ValueError("Only REASONING_EFFORT in ['none', 'high'] is supported.")

client = OpenAI(
    api_key=openai_api_key,
    base_url=openai_api_base,
)

models = client.models.list()
model = models.data[0].id


def load_system_prompt(repo_id: str, filename: str) -> str:
    file_path = hf_hub_download(repo_id=repo_id, filename=filename)
    with open(file_path, "r") as file:
        system_prompt = file.read()
    today = datetime.today().strftime("%Y-%m-%d")
    yesterday = (datetime.today() - timedelta(days=1)).strftime("%Y-%m-%d")
    model_name = repo_id.split("/")[-1]
    return system_prompt.format(name=model_name, today=today, yesterday=yesterday)


SYSTEM_PROMPT = load_system_prompt(model, "SYSTEM_PROMPT.txt")

image_url = "https://math-coaching.com/img/fiche/46/expressions-mathematiques.jpg"


def my_calculator(expression: str) -> str:
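    # Demo only: eval() on model-supplied input is unsafe outside a sandboxed environment.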
    return str(eval(expression))


tools = [
    {
        "type": "function",
        "function": {
            "name": "my_calculator",
            "description": "A calculator that can evaluate a mathematical expression.",
            "parameters": {
                "type": "object",
                "properties": {
                    "expression": {
                        "type": "string",
                        "description": "The mathematical expression to evaluate.",
                    },
                },
                "required": ["expression"],
            },
        },
    },
    {
        "type": "function",
        "function": {
            "name": "rewrite",
            "description": "Rewrite a given text for improved clarity",
            "parameters": {
                "type": "object",
                "properties": {
                    "text": {
                        "type": "string",
                        "description": "The input text to rewrite",
                    }
                },
            },
        },
    },
]

messages = [
    {"role": "system", "content": SYSTEM_PROMPT},
    {
        "role": "user",
        "content": [
            {
                "type": "text",
                "text": "Thanks to your calculator, compute the results for the equations that involve numbers displayed in the image.",
            },
            {
                "type": "image_url",
                "image_url": {
                    "url": image_url,
                },
            },
        ],
    },
]

response = client.chat.completions.create(
    model=model,
    messages=messages,
    tools=tools,
    tool_choice="auto",
    reasoning_effort=REASONING_EFFORT,
    temperature=TEMP,
    top_p=TOP_P,
)

tool_calls = response.choices[0].message.tool_calls

results = []
for tool_call in tool_calls:
    function_name = tool_call.function.name
    function_args = tool_call.function.arguments
    if function_name == "my_calculator":
        result = my_calculator(**json.loads(function_args))
        results.append(result)

messages.append({"role": "assistant", "tool_calls": tool_calls})
for tool_call, result in zip(tool_calls, results):
    messages.append(
        {
            "role": "tool",
            "tool_call_id": tool_call.id,
            "name": tool_call.function.name,
            "content": result,
        }
    )


response = client.chat.completions.create(
    model=model,
    messages=messages,
    reasoning_effort=REASONING_EFFORT,
    temperature=TEMP,
    top_p=TOP_P,
)

print("==============================================================")
print(f"Request with {REASONING_EFFORT=}, {TEMP=} and {TOP_P=}.")
print("==============================================================")
print("REASONING")
print("~~~~~~~~~")
print(response.choices[0].message.reasoning)
print("==============================================================")
print("CONTENT")
print("~~~~~~~")
print(response.choices[0].message.content)
Vision Reasoning

Let's see if Mistral Medium 3.5 knows when to pick a fight!

from datetime import datetime, timedelta

from openai import OpenAI
from huggingface_hub import hf_hub_download

# Modify OpenAI's API key and API base to use vLLM's API server.
openai_api_key = "EMPTY"
openai_api_base = "http://localhost:8000/v1"

REASONING_EFFORT = "high" # Remove reasoning with 'none'.

match REASONING_EFFORT:
    case "none":
        TEMP = 0.1
        TOP_P = None
    case "high":
        TEMP = 0.7
        TOP_P = 0.95
    case _:
        raise ValueError("Only REASONING_EFFORT in ['none', 'high'] is supported.")

client = OpenAI(
    api_key=openai_api_key,
    base_url=openai_api_base,
)

models = client.models.list()
model = models.data[0].id


def load_system_prompt(repo_id: str, filename: str) -> str:
    file_path = hf_hub_download(repo_id=repo_id, filename=filename)
    with open(file_path, "r") as file:
        system_prompt = file.read()
    today = datetime.today().strftime("%Y-%m-%d")
    yesterday = (datetime.today() - timedelta(days=1)).strftime("%Y-%m-%d")
    model_name = repo_id.split("/")[-1]
    return system_prompt.format(name=model_name, today=today, yesterday=yesterday)


SYSTEM_PROMPT = load_system_prompt(model, "SYSTEM_PROMPT.txt")
image_url = "https://static.wikia.nocookie.net/essentialsdocs/images/7/70/Battle.png/revision/latest?cb=20220523172438"

messages = [
    {"role": "system", "content": SYSTEM_PROMPT},
    {
        "role": "user",
        "content": [
            {
                "type": "text",
                "text": "What action do you think I should take in this situation? List all the possible actions and explain why you think they are good or bad.",
            },
            {"type": "image_url", "image_url": {"url": image_url}},
        ],
    },
]


response = client.chat.completions.create(
    model=model,
    messages=messages,
    reasoning_effort=REASONING_EFFORT,
    temperature=TEMP,
    top_p=TOP_P,
)

print("==============================================================")
print(f"Request with {REASONING_EFFORT=}, {TEMP=} and {TOP_P=}.")
print("==============================================================")
print("REASONING")
print("~~~~~~~~~")
print(response.choices[0].message.reasoning)
print("==============================================================")
print("CONTENT")
print("~~~~~~~")
print(response.choices[0].message.content)
Speculative decoding (EAGLE)

Append the --speculative-config block to either of the commands above:

  --speculative-config '{
    "model": "mistralai/Mistral-Medium-3.5-128B-EAGLE",
    "num_speculative_tokens": 3,
    "method": "eagle",
    "max_model_len": 65536
  }'

Inference with transformers

from transformers import AutoModelForImageTextToText, AutoProcessor

model_id = "plezan/Mistral-Medium-3.5-128B-W4A16"
model = AutoModelForImageTextToText.from_pretrained(model_id, dtype="auto", device_map="auto")
processor = AutoProcessor.from_pretrained(model_id)

messages = [{"role": "user", "content": [{"type": "text", "text": "Bonjour !"}]}]
inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,  # return a dict of tensors so they can be unpacked into generate()
    return_tensors="pt",
).to(model.device)
out = model.generate(**inputs, max_new_tokens=256)
print(processor.decode(out[0], skip_special_tokens=True))

Requires a recent transformers build with the Mistral3ForConditionalGeneration architecture.
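
Since the vision tower is kept active, the same processor also accepts images. A minimal sketch (the "image" content entry follows the transformers chat-template convention; the URL is the example image reused from the tool-call section above, and neither is prescribed by this repo):

messages = [
    {
        "role": "user",
        "content": [
            # The chat template resolves "image" entries through the processor's image loader.
            {"type": "image", "url": "https://math-coaching.com/img/fiche/46/expressions-mathematiques.jpg"},
            {"type": "text", "text": "What equations are shown in this image?"},
        ],
    }
]
inputs = processor.apply_chat_template(
    messages, add_generation_prompt=True, tokenize=True, return_dict=True, return_tensors="pt"
).to(model.device)
out = model.generate(**inputs, max_new_tokens=256)
print(processor.decode(out[0], skip_special_tokens=True))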

Repository internals

This repo ships two parallel sets of weights for the same parameters: HF (model-*.safetensors + config.json) for transformers, and Mistral consolidated (consolidated-*.safetensors + params.json) for vLLM. Both are functionally identical — pick the one your loader prefers.

Why dual format?

The consolidated format lets vLLM use --load-format auto (or mistral), which dispatches the Pixtral class and the native MistralCommonImageProcessor. This path avoids the incompatibility between transformers.PixtralProcessor and MistralCommonBackend that causes the string "[IMG]" to tokenize into 3 tokens instead of 1, triggering ValueError: Mismatch in image token count.
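
A rough way to check the token count from Python (a diagnostic sketch only; it assumes the HF tokenizer in this repo loads through AutoTokenizer and registers [IMG] as a special token, which this card does not guarantee):

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("plezan/Mistral-Medium-3.5-128B-W4A16")
ids = tok.encode("[IMG]", add_special_tokens=False)
# A correctly configured image token encodes to a single id; the mismatch
# described above makes it split into 3 ids instead.
print(len(ids), ids)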

If you only use transformers, the HF set is enough; if you only use vLLM, the consolidated set is enough.
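
To avoid pulling both copies, you can filter the download by filename pattern. A sketch with huggingface_hub (the patterns reflect the file naming described above; adjust them if your setup needs additional config files):

from huggingface_hub import snapshot_download

# HF / transformers set only (model-*.safetensors + HF configs).
# Note: the broad *.json pattern also grabs params.json, which is harmless.
snapshot_download(
    "plezan/Mistral-Medium-3.5-128B-W4A16",
    allow_patterns=["model-*.safetensors", "*.json", "*.txt"],
    local_dir="./Mistral-Medium-3.5-128B-W4A16-hf",
)

# Mistral consolidated set only (consolidated-*.safetensors + params.json) for vLLM.
snapshot_download(
    "plezan/Mistral-Medium-3.5-128B-W4A16",
    allow_patterns=["consolidated-*.safetensors", "*.json", "*.txt"],
    local_dir="./Mistral-Medium-3.5-128B-W4A16-mistral",
)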

W4A16 storage convention (HF/Llama vs Mistral permutation)

For the W4A16 LLM layers (weight_packed, weight_scale, weight_shape), tensors are stored in HF/Llama convention (not in classic Mistral convention). This is intentional: vLLM only knows how to apply the Q/K permutation to suffixes weight and qscale_weight, so compressed-tensors tensors must be pre-permuted. The vision part (BF16) is stored in classic Mistral convention. If a future vLLM version extends the permutation to compressed-tensors suffixes, this repo will need to be regenerated.
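
For context, the HF/Llama vs Mistral difference is the rotary-interleave permutation applied to the rows of the Q/K projections. A sketch of the standard permutation used by the Llama-to-HF conversion scripts (the shapes below are illustrative, not this model's real dimensions):

import torch

def permute_qk(w: torch.Tensor, n_heads: int) -> torch.Tensor:
    # Reorder rows of a Q/K projection from the interleaved rotary-pair layout
    # (Mistral/Meta convention) to the half-split-per-head layout (HF/Llama convention).
    out_dim, in_dim = w.shape
    head_dim = out_dim // n_heads
    return (
        w.view(n_heads, head_dim // 2, 2, in_dim)
        .transpose(1, 2)
        .reshape(out_dim, in_dim)
    )

q = torch.randn(32 * 128, 4096)   # illustrative shape only
print(permute_qk(q, n_heads=32).shape)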

Known Limitations

  • Quality: W4A16 weight-only quantization may introduce slight degradation on tasks requiring high numerical precision or multi-step reasoning. The 200-iteration AutoRound run reduces this vs naive RTN, but it does not fully match BF16.
  • vLLM dependency: the tokenizer and parsers require --tokenizer-mode mistral.
  • Dual-format footprint: the repo carries two copies of the weights (HF + consolidated, ~70 GB each) to avoid the [IMG] tokenization bug in transformers.PixtralProcessor. See the Repository internals section above.
  • Preview: weights may be updated in-place without creating a new repository (same URL, new revision).

Evaluation

Not yet evaluated. PPL (wikitext), MMLU and multimodal benchmarks (DocVQA, MMMU) are planned and will be reported here.

License

Model Weights

This quantized model is a derivative of mistralai/Mistral-Medium-3.5-128B. The weights are licensed under the same Modified MIT License as the original model.

You must not use this model in a manner that infringes, misappropriates, or otherwise violates any third party's rights, including intellectual property rights.

Repository Code

The AutoRound quantization configuration, model card, and inference examples provided in this repository are released under the Apache 2.0 License.
