# Qwen2.5-14B-Instruct - MLX mxfp4 Quantized
- Repository: johnlockejrr/Qwen2.5-14B-Instruct-mxfp4
- Base model: https://huggingface.co/Qwen/Qwen2.5-14B-Instruct
- Quantization: MLX mxfp4 (4-bit)
- Quantized by: johnlockejrr
- Framework: MLX + mlx-lm
- Quantization tool: https://github.com/EricFillion/quantize
## Model Summary
This repository contains an MLX-quantized version of Qwen2.5-14B-Instruct, optimized for Apple Silicon (M1/M2/M3/M4) devices.
The model was quantized to mxfp4 (4-bit) using the MLX-based quantization tool by Eric Fillion, reducing memory usage from approximately 29–30 GB (FP16) to approximately 10–11 GB while preserving strong instruction‑following and reasoning performance.
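The memory saving follows directly from the change in bits per weight. A back-of-envelope estimate (the 14.7B parameter count and per-group scale overhead below are approximations, and the on-disk figure excludes unquantized components and the runtime KV cache, which is why the measured footprint lands closer to 10–11 GB):

```python
# Rough memory estimate for a ~14.7B-parameter model at two precisions.
# Parameter count and scale overhead are approximations for illustration.
params = 14.7e9

# FP16: 2 bytes per weight.
fp16_gb = params * 16 / 8 / 1e9

# 4-bit weights plus one shared scale per group of 64 weights
# (scale cost assumed ~16 bits per group, a simplification).
mxfp4_bits_per_weight = 4 + 16 / 64
mxfp4_gb = params * mxfp4_bits_per_weight / 8 / 1e9

print(f"FP16:  ~{fp16_gb:.1f} GB")   # → ~29.4 GB
print(f"mxfp4: ~{mxfp4_gb:.1f} GB")  # → ~7.8 GB (weights only)
```

The gap between the ~7.8 GB weights-only estimate and the observed 10–11 GB comes from higher-precision layers, metadata, and runtime allocations.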
This quantized model is ideal for:
- local assistants
- offline workflows
- VS Code integration
- fast inference on Apple GPUs
- running large models on 16 GB or 24 GB Apple Silicon machines
## Quantization Details
| Setting | Value |
|---|---|
| Quantization mode | mxfp4 |
| Bits per weight | 4 |
| Group size | 64 |
| Activation dtype | bfloat16 |
| Framework | MLX |
| Quantization tool | EricFillion/quantize |
Command used:

```bash
python3 quantize.py \
  --model_name Qwen/Qwen2.5-14B-Instruct \
  --save_model_path models/Qwen2.5-14B-Instruct-mxfp4 \
  --q_mode mxfp4 \
  --q_bits 4 \
  --q_group_size 64
```

Resulting model size: approximately 10–11 GB
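The core idea of group-wise 4-bit quantization can be sketched in a few lines. Note this is a simplified symmetric-integer variant for illustration only, not the actual mxfp4 (microscaling FP4 / E2M1) encoding the tool produces:

```python
import random

# Simplified group-wise 4-bit quantization: each group of 64 weights shares
# one float scale, and each weight is stored as an integer in [-8, 7].
GROUP_SIZE = 64

def quantize_group(weights):
    # Scale so the largest-magnitude weight maps to +/-7.
    scale = max(abs(w) for w in weights) / 7 or 1.0
    q = [max(-8, min(7, round(w / scale))) for w in weights]
    return q, scale

def dequantize_group(q, scale):
    return [v * scale for v in q]

random.seed(0)
group = [random.gauss(0, 0.02) for _ in range(GROUP_SIZE)]
q, scale = quantize_group(group)
restored = dequantize_group(q, scale)

# Rounding error per weight is bounded by half the group scale.
max_err = max(abs(a - b) for a, b in zip(group, restored))
print(f"max reconstruction error: {max_err:.6f}")
```

Real mxfp4 replaces the integer grid with 4-bit floating-point values and a shared power-of-two scale per group, but the memory layout intuition (4 bits per weight plus one scale per group) is the same.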
## Running the Model (MLX)

### CLI (mlx-lm)

```bash
mlx_lm.generate \
  --model johnlockejrr/Qwen2.5-14B-Instruct-mxfp4 \
  --prompt "Explain the concept of recursion with an example." \
  --max-tokens 512
```
### Python API

```python
from mlx_lm import load, generate

model, tokenizer = load("johnlockejrr/Qwen2.5-14B-Instruct-mxfp4")

prompt = "Give me a short explanation of transformers in machine learning."
output = generate(model, tokenizer, prompt, max_tokens=200)
print(output)
```
### Chat Mode

For multi-turn conversations, render the messages with the model's chat template before generating:

```python
from mlx_lm import load, generate

model, tokenizer = load("johnlockejrr/Qwen2.5-14B-Instruct-mxfp4")

messages = [
    {"role": "user", "content": "What is a monad in functional programming?"}
]

# Apply the model's chat template so the prompt matches the training format.
prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)

response = generate(model, tokenizer, prompt, max_tokens=512)
print(response)
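For reference, Qwen2.5's chat template uses the ChatML format. The sketch below is a hand-written approximation of what the template renders (the authoritative version ships in `chat_template.jinja`, which also injects a default system message this sketch omits):

```python
# Hand-written approximation of the ChatML rendering used by Qwen2.5.
# The real template lives in chat_template.jinja and is applied by the
# tokenizer; this sketch omits the default system message it prepends.
def render_chatml(messages, add_generation_prompt=True):
    out = []
    for m in messages:
        out.append(f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n")
    if add_generation_prompt:
        # Leave the assistant turn open so the model completes it.
        out.append("<|im_start|>assistant\n")
    return "".join(out)

messages = [{"role": "user", "content": "Hello!"}]
print(render_chatml(messages))
```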
## Performance (Mac Mini M4, 16 GB)
| Metric | Value |
|---|---|
| Generation speed | approximately 8–12 tokens/sec |
| Peak memory usage | approximately 10.5 GB |
| GPU | Apple M4 GPU |
| Framework | MLX |
This model runs smoothly on:
- Mac Mini M4 (16 GB / 24 GB)
- MacBook Pro M3/M4
- Mac Studio M2/M3/M4
## Repository Contents

```
model-00001-of-00002.safetensors
model-00002-of-00002.safetensors
model.safetensors.index.json
config.json
tokenizer.json
tokenizer_config.json
chat_template.jinja
generation_config.json
README.md
```
## License

This model inherits the license of the original model:
Qwen2.5-14B-Instruct License: https://huggingface.co/Qwen/Qwen2.5-14B-Instruct#license
Please review the license before using this model in commercial applications.
## Limitations and Bias
- The model may generate incorrect or incomplete information.
- It may hallucinate facts or APIs.
- It may produce biased or harmful content if prompted.
- It should not be used for production‑critical tasks without human review.
## Acknowledgements
- Qwen Team for the original Qwen2.5‑14B‑Instruct model
- Apple MLX Team for the MLX framework
- Eric Fillion for the MLX quantization tool
- Hugging Face for hosting the model