Qwen2.5-14B-Instruct - MLX mxfp4 Quantized


Model Summary

This repository contains an MLX-quantized version of Qwen2.5-14B-Instruct, optimized for Apple Silicon (M1/M2/M3/M4) devices.
The model was quantized to mxfp4 (4-bit) using the MLX-based quantization tool by Eric Fillion, reducing memory usage from approximately 29–30 GB (FP16) to approximately 10–11 GB while preserving strong instruction‑following and reasoning performance.
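The saving can be sanity-checked with back-of-the-envelope arithmetic. A sketch, assuming a ~14.7B parameter count and a 16-bit scale per quantization group; MLX typically keeps some tensors (e.g. embeddings) in higher precision, which is why the real footprint lands above the ideal 4-bit figure:

```python
# Rough memory-footprint estimate for a ~14.7B-parameter model (assumption).
PARAMS = 14.7e9
GROUP_SIZE = 64  # weights sharing one quantization scale

fp16_gb = PARAMS * 2 / 1e9  # 2 bytes per weight
# 4-bit weights plus one ~16-bit scale per group of 64 weights
q4_bits_per_weight = 4 + 16 / GROUP_SIZE
q4_gb = PARAMS * q4_bits_per_weight / 8 / 1e9

print(f"fp16:  ~{fp16_gb:.1f} GB")  # ~29.4 GB, matching the 29-30 GB figure
print(f"mxfp4: ~{q4_gb:.1f} GB")    # ~7.8 GB ideal; real size is ~10-11 GB
```

The gap between the ~7.8 GB ideal and the observed 10–11 GB comes from unquantized tensors and quantization metadata.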

This quantized model is ideal for:

  • local assistants
  • offline workflows
  • VS Code integration
  • fast inference on Apple GPUs
  • running large models on 16 GB or 24 GB Apple Silicon machines

Quantization Details

Quantization mode: mxfp4
Bits per weight: 4
Group size: 64
Activation dtype: bfloat16
Framework: MLX
Quantization tool: EricFillion/quantize

Command used:

python3 quantize.py \
  --model_name Qwen/Qwen2.5-14B-Instruct \
  --save_model_path models/Qwen2.5-14B-Instruct-mxfp4 \
  --q_mode mxfp4 \
  --q_bits 4 \
  --q_group_size 64

Resulting model size: approximately 10–11 GB


Running the Model (MLX)

CLI (mlx-lm)

mlx_lm.generate \
  --model johnlockejrr/Qwen2.5-14B-Instruct-mxfp4 \
  --prompt "Explain the concept of recursion with an example." \
  --max-tokens 512

Python API

from mlx_lm import load, generate

model, tokenizer = load("johnlockejrr/Qwen2.5-14B-Instruct-mxfp4")

prompt = "Give me a short explanation of transformers in machine learning."

output = generate(model, tokenizer, prompt, max_tokens=200)
print(output)

Chat Mode

from mlx_lm import load, generate

model, tokenizer = load("johnlockejrr/Qwen2.5-14B-Instruct-mxfp4")

messages = [
    {"role": "user", "content": "What is a monad in functional programming?"}
]

# Render the conversation with the model's chat template
prompt = tokenizer.apply_chat_template(messages, add_generation_prompt=True)

response = generate(model, tokenizer, prompt, max_tokens=200)
print(response)
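Chat prompting relies on the template shipped in chat_template.jinja. For intuition only, the rough shape of Qwen's ChatML format can be sketched by hand; this is a simplified approximation, not a replacement for the tokenizer's own template:

```python
def to_chatml(messages):
    """Approximate Qwen-style ChatML rendering (illustrative only)."""
    parts = [
        f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n"
        for m in messages
    ]
    parts.append("<|im_start|>assistant\n")  # generation prompt
    return "".join(parts)

prompt = to_chatml([{"role": "user", "content": "Hello"}])
print(prompt)
```

In practice, always use tokenizer.apply_chat_template so system-prompt defaults and special tokens match the checkpoint exactly.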

Performance (Mac Mini M4, 16 GB)

Generation speed: approximately 8–12 tokens/sec
Peak memory usage: approximately 10.5 GB
GPU: Apple M4 GPU
Framework: MLX
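At these rates, end-to-end generation time is easy to estimate (simple arithmetic over the measured 8–12 tokens/sec range):

```python
def generation_time(tokens, tok_per_sec):
    """Seconds to generate `tokens` at a given throughput."""
    return tokens / tok_per_sec

# A 512-token response at the measured throughput range:
print(f"{generation_time(512, 12):.0f}-{generation_time(512, 8):.0f} s")  # 43-64 s
```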

This model runs smoothly on:

  • Mac Mini M4 (16 GB / 24 GB)
  • MacBook Pro M3/M4
  • Mac Studio M2/M3/M4

Repository Contents

model-00001-of-00002.safetensors
model-00002-of-00002.safetensors
model.safetensors.index.json
config.json
tokenizer.json
tokenizer_config.json
chat_template.jinja
generation_config.json
README.md
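MLX-quantized checkpoints record their quantization settings in config.json, typically under a "quantization" key with "bits" and "group_size" (exact keys may vary by MLX version). A quick sanity check, sketched with an inline fragment in place of the real file:

```python
import json

def quantization_summary(config: dict) -> str:
    """Summarize the quantization block of an MLX model config."""
    q = config.get("quantization")
    if q is None:
        return "not quantized"
    return f"{q.get('bits')}-bit, group size {q.get('group_size')}"

# Illustrative fragment; in practice: config = json.load(open("config.json"))
config = {"model_type": "qwen2", "quantization": {"bits": 4, "group_size": 64}}
print(quantization_summary(config))  # 4-bit, group size 64
```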

License

This model inherits the license of the original model:

Qwen2.5-14B-Instruct License: https://huggingface.co/Qwen/Qwen2.5-14B-Instruct#license

Please review the license before using this model in commercial applications.


Limitations and Bias

  • The model may generate incorrect or incomplete information.
  • It may hallucinate facts or APIs.
  • It may produce biased or harmful content if prompted.
  • It should not be used for production‑critical tasks without human review.

Acknowledgements

  • Qwen Team for the original Qwen2.5‑14B‑Instruct model
  • Apple MLX Team for the MLX framework
  • Eric Fillion for the MLX quantization tool
  • Hugging Face for hosting the model