# Qwen2.5-14B-Instruct - MLX mxfp4 Quantized
- Repository: johnlockejrr/Qwen2.5-14B-Instruct-mxfp4
- Base model: https://huggingface.co/Qwen/Qwen2.5-14B-Instruct
- Quantization: MLX mxfp4 (4-bit)
- Quantized by: johnlockejrr
- Framework: MLX + mlx-lm
- Quantization tool: https://github.com/EricFillion/quantize
## Model Summary
This repository contains an MLX-quantized version of Qwen2.5-14B-Instruct, optimized for Apple Silicon (M1/M2/M3/M4) devices.
The model was quantized to mxfp4 (4-bit) using the MLX-based quantization tool by Eric Fillion, reducing memory usage from approximately 29–30 GB (FP16) to approximately 10–11 GB while preserving strong instruction‑following and reasoning performance.
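The memory saving follows directly from the change in bits per weight. A back-of-envelope estimate (the 14.7B parameter count and per-group scale overhead below are approximations, and the on-disk figure excludes unquantized components and the runtime KV cache, which is why the measured footprint lands closer to 10–11 GB):

```python
# Rough memory estimate for a ~14.7B-parameter model at two precisions.
# Parameter count and scale overhead are approximations for illustration.
params = 14.7e9

# FP16: 2 bytes per weight.
fp16_gb = params * 16 / 8 / 1e9

# 4-bit weights plus one shared scale per group of 64 weights
# (scale cost assumed ~16 bits per group, a simplification).
mxfp4_bits_per_weight = 4 + 16 / 64
mxfp4_gb = params * mxfp4_bits_per_weight / 8 / 1e9

print(f"FP16:  ~{fp16_gb:.1f} GB")   # → ~29.4 GB
print(f"mxfp4: ~{mxfp4_gb:.1f} GB")  # → ~7.8 GB (weights only)
```

The gap between the ~7.8 GB weights-only estimate and the observed 10–11 GB comes from higher-precision layers, metadata, and runtime allocations.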
This quantized model is ideal for:
- local assistants
- offline workflows
- VS Code integration
- fast inference on Apple GPUs
- running large models on 16 GB or 24 GB Apple Silicon machines
## Quantization Details
| Setting | Value |
|---|---|
| Quantization mode | mxfp4 |
| Bits per weight | 4 |
| Group size | 64 |
| Activation dtype | bfloat16 |
| Framework | MLX |
| Quantization tool | EricFillion/quantize |
Command used:

```bash
python3 quantize.py \
  --model_name Qwen/Qwen2.5-14B-Instruct \
  --save_model_path models/Qwen2.5-14B-Instruct-mxfp4 \
  --q_mode mxfp4 \
  --q_bits 4 \
  --q_group_size 64
```

Resulting model size: approximately 10–11 GB
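The core idea of group-wise 4-bit quantization can be sketched in a few lines. Note this is a simplified symmetric-integer variant for illustration only, not the actual mxfp4 (microscaling FP4 / E2M1) encoding the tool produces:

```python
import random

# Simplified group-wise 4-bit quantization: each group of 64 weights shares
# one float scale, and each weight is stored as an integer in [-8, 7].
GROUP_SIZE = 64

def quantize_group(weights):
    # Scale so the largest-magnitude weight maps to +/-7.
    scale = max(abs(w) for w in weights) / 7 or 1.0
    q = [max(-8, min(7, round(w / scale))) for w in weights]
    return q, scale

def dequantize_group(q, scale):
    return [v * scale for v in q]

random.seed(0)
group = [random.gauss(0, 0.02) for _ in range(GROUP_SIZE)]
q, scale = quantize_group(group)
restored = dequantize_group(q, scale)

# Rounding error per weight is bounded by half the group scale.
max_err = max(abs(a - b) for a, b in zip(group, restored))
print(f"max reconstruction error: {max_err:.6f}")
```

Real mxfp4 replaces the integer grid with 4-bit floating-point values and a shared power-of-two scale per group, but the memory layout intuition (4 bits per weight plus one scale per group) is the same.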
## Running the Model (MLX)

### CLI (mlx-lm)

```bash
mlx_lm.generate \
  --model johnlockejrr/Qwen2.5-14B-Instruct-mxfp4 \
  --prompt "Explain the concept of recursion with an example." \
  --max-tokens 512
```
### Python API

```python
from mlx_lm import load, generate

model, tokenizer = load("johnlockejrr/Qwen2.5-14B-Instruct-mxfp4")

prompt = "Give me a short explanation of transformers in machine learning."
output = generate(model, tokenizer, prompt, max_tokens=200)
print(output)
```
### Chat Mode

For multi-turn conversations, render the messages with the model's chat template before generating:

```python
from mlx_lm import load, generate

model, tokenizer = load("johnlockejrr/Qwen2.5-14B-Instruct-mxfp4")

messages = [
    {"role": "user", "content": "What is a monad in functional programming?"}
]

# Apply the model's chat template so the prompt matches the training format.
prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)

response = generate(model, tokenizer, prompt, max_tokens=512)
print(response)
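For reference, Qwen2.5's chat template uses the ChatML format. The sketch below is a hand-written approximation of what the template renders (the authoritative version ships in `chat_template.jinja`, which also injects a default system message this sketch omits):

```python
# Hand-written approximation of the ChatML rendering used by Qwen2.5.
# The real template lives in chat_template.jinja and is applied by the
# tokenizer; this sketch omits the default system message it prepends.
def render_chatml(messages, add_generation_prompt=True):
    out = []
    for m in messages:
        out.append(f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n")
    if add_generation_prompt:
        # Leave the assistant turn open so the model completes it.
        out.append("<|im_start|>assistant\n")
    return "".join(out)

messages = [{"role": "user", "content": "Hello!"}]
print(render_chatml(messages))
```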
## Performance (Mac Mini M4, 16 GB)
| Metric | Value |
|---|---|
| Generation speed | approximately 8–12 tokens/sec |
| Peak memory usage | approximately 10.5 GB |
| GPU | Apple M4 GPU |
| Framework | MLX |
This model runs smoothly on:
- Mac Mini M4 (16 GB / 24 GB)
- MacBook Pro M3/M4
- Mac Studio M2/M3/M4
## Repository Contents

```
model-00001-of-00002.safetensors
model-00002-of-00002.safetensors
model.safetensors.index.json
config.json
tokenizer.json
tokenizer_config.json
chat_template.jinja
generation_config.json
README.md
```
## License

This model inherits the license of the original model:
Qwen2.5-14B-Instruct License: https://huggingface.co/Qwen/Qwen2.5-14B-Instruct#license
Please review the license before using this model in commercial applications.
## Limitations and Bias
- The model may generate incorrect or incomplete information.
- It may hallucinate facts or APIs.
- It may produce biased or harmful content if prompted.
- It should not be used for production‑critical tasks without human review.
## Acknowledgements
- Qwen Team for the original Qwen2.5‑14B‑Instruct model
- Apple MLX Team for the MLX framework
- Eric Fillion for the MLX quantization tool
- Hugging Face for hosting the model