---
license: mit
license_link: LICENSE
base_model: deepseek-ai/DeepSeek-V4-Flash
base_model_relation: quantized
library_name: mlx
pipeline_tag: text-generation
tags:
- mlx
- apple-silicon
- deepseek
- deepseek-v4
- mixture-of-experts
- moe
- quantized
- 4-bit
- 8-bit
- affine
language:
- en
- zh
inference: false
---
# DeepSeek-V4-Flash-MLX-Q4Q8
A mixed-precision MLX quantization of [`deepseek-ai/DeepSeek-V4-Flash`](https://huggingface.co/deepseek-ai/DeepSeek-V4-Flash)
intended for Apple-Silicon inference via [vMLX](https://vmlx.ai/) (or any
MLX-aware runtime that loads models through `mlx_lm.utils.load`).
- **Architecture**: DeepSeek-V4 — 289.9 B total parameters, 256 routed
experts (top-6 per token), 1 shared expert, 43 layers, MLA attention
with `head_dim=512` and grouped output projection, mHC
(Manifold-Constrained Hyper-Connections, `hc_mult=4`),
sqrtsoftplus + hash routing for the first 3 layers.
- **Quantization**: standard MLX `affine` mode (output of `mx.quantize`,
not TurboQuant). Tensor naming `<module>.{weight, scales, biases}`.
Group size 32. Layout in safetensors:
  - **routed experts** (`layers.N.ffn.experts.E.{w1,w2,w3}`): 4-bit
  - **attention** (`layers.N.attn.{wq_a, wkv, wo_a, wo_b, ...}`): 8-bit
  - **shared expert, embed_tokens, lm_head**: 8-bit
  - **norms, router gate, mHC params**: fp16 (passthrough)
- **On-disk size**: 173 GB across 159 safetensors shards.
- **Context**: 1,048,576 tokens (sliding-window=128 short-prompt-safe).
## Usage with vMLX
The bundle is a drop-in replacement for the upstream FP4/FP8 release in
vMLX 1.3.97+. Two non-obvious considerations:
### 1. Runtime patch required (`jang_tools.load_jangtq`)
vMLX's bundled `jang_tools.load_jangtq._patch_quant_config_inplace`
(`/Applications/vMLX.app/.../jang_tools/load_jangtq.py`) infers
quantization overrides from raw safetensors keys
(`model.layers.N.ffn.experts.E.w1`) — these never match the
post-`sanitize()` module paths the MLX `Model` exposes
(`model.layers.N.mlp.switch_mlp.gate_proj`), so it overwrites this
bundle's correct config with unmatchable disk-keyed entries. After
overwrite, `mlx_lm`'s `class_predicate` falls through to top-level
`bits=8` and the routed experts get wrapped as 8-bit modules. The
4-bit-packed weights then silently fail to load (with `strict=False`)
and the model produces BOS-token loops at inference.
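The fall-through described above can be illustrated with a minimal sketch. `resolve_bits` is a hypothetical stand-in for a per-module lookup, not `mlx_lm`'s actual `class_predicate`; the config values mirror this bundle's layout (global `bits=8`, 4-bit routed experts), and the key names are the two paths quoted above.

```python
# Hypothetical stand-in for a per-module quantization lookup: use the
# override keyed by the module path if present, else the top-level value.
def resolve_bits(module_path: str, quantization: dict) -> int:
    override = quantization.get(module_path)
    if isinstance(override, dict):
        return override["bits"]
    return quantization["bits"]


# Path the MLX Model exposes after sanitize().
MODULE = "model.layers.0.mlp.switch_mlp.gate_proj"

# Config keyed by post-sanitize module paths (what this bundle ships).
bundle_config = {
    "group_size": 32,
    "bits": 8,
    MODULE: {"group_size": 32, "bits": 4},
}

# Config after being overwritten with raw safetensors keys.
overwritten_config = {
    "group_size": 32,
    "bits": 8,
    "model.layers.0.ffn.experts.0.w1": {"group_size": 32, "bits": 4},
}

print(resolve_bits(MODULE, bundle_config))       # per-module override applies
print(resolve_bits(MODULE, overwritten_config))  # falls through to global bits
```

With the disk-keyed config the lookup never matches the module path, so the routed expert is treated as 8-bit even though its packed weights are 4-bit.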
The fix is a 4-line guard at the top of `_patch_quant_config_inplace`
that returns early when the user's config already has post-sanitize
overrides:
```python
# At the top of _patch_quant_config_inplace, before any override inference:
if existing_overrides and any(k.startswith("model.") for k in existing_overrides):
    return {"action": "user_provided", "existing_overrides": len(existing_overrides)}
```
The accompanying [`build_mlx_q4q8.sh`](#building-from-source) script's
`patch_loader` step applies this idempotently. See
[`requantization-plan.md`](#building-from-source) for the full diagnosis.
### 2. SimpleEngine only
vMLX auto-disables `--continuous-batching` for DSV4 because the
batched generator is incompatible with the model's 4-D mHC residual
stream. All requests go through SimpleEngine. Throughput on
Mac Studio M3 Ultra (256 GB unified memory): ~22 tok/s decode,
~75 tok/s prefill.
### Serving
```bash
/Applications/vMLX.app/Contents/Resources/bundled-python/python/bin/python3 \
-m vmlx_engine.cli serve \
/path/to/DeepSeek-V4-Flash-MLX-Q4Q8 \
--served-model-name deepseek-v4-flash-mlx-q4q8 \
--host 127.0.0.1 --port 8010 \
--max-tokens 4096 \
--tool-call-parser deepseek \
--enable-auto-tool-choice
```
Then hit it with the OpenAI-compatible chat-completion API:
```bash
curl -s http://127.0.0.1:8010/v1/chat/completions \
-H 'Content-Type: application/json' \
-d '{
"model": "deepseek-v4-flash-mlx-q4q8",
"messages": [{"role": "user", "content": "What is 17+28?"}],
"max_tokens": 120
}'
```
The model is reasoning-capable (`<think>...</think>` blocks land in
`reasoning_content`; the final answer in `content`).
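The same request can be sent from Python with only the standard library. This is a sketch under stated assumptions: the URL and model name come from the serve command above, and `reasoning_content` is assumed to sit next to `content` on `choices[0].message`, which this card does not spell out. `build_request`, `split_reply`, and `ask` are illustrative names.

```python
import json
import urllib.request

# Endpoint and model name follow the serve command above.
URL = "http://127.0.0.1:8010/v1/chat/completions"
MODEL = "deepseek-v4-flash-mlx-q4q8"


def build_request(prompt: str, max_tokens: int = 120) -> urllib.request.Request:
    """Build a POST request carrying an OpenAI-style chat payload."""
    payload = {
        "model": MODEL,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def split_reply(response: dict) -> tuple:
    """Return (reasoning, answer) from a decoded response body."""
    # Assumption: reasoning_content is a sibling of content on the message.
    message = response["choices"][0]["message"]
    return message.get("reasoning_content"), message.get("content")


def ask(prompt: str) -> tuple:
    """Send the prompt to the local server and split the reply."""
    with urllib.request.urlopen(build_request(prompt)) as resp:
        return split_reply(json.load(resp))
```

With the server running, `reasoning, answer = ask("What is 17+28?")` would return the two fields separately; `reasoning` is `None` when the response carries no reasoning block.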
## Hardware requirements
- Apple Silicon with large unified memory (M2 Ultra / M3 Ultra recommended).
- **Unified memory**: ≥ 192 GB strongly recommended. The bundle's
173 GB working set, the KV cache, and the 70 % wired-limit headroom
(configured automatically by `jang_tools.load_jangtq._apply_wired_limit_safe_default`)
all need room to spare. The bundle will technically load on 128 GB with
reduced max-tokens, but expect SSD pressure.
- macOS 14+ for the Metal kernels used by the routed-expert SwitchGLU.
## Tool calling & reasoning
The bundle ships with the DSML tool-call grammar
(`|DSML|` / `<|tool_calls|>` / `<|invoke|>`); pair it with vMLX's
`--tool-call-parser deepseek --enable-auto-tool-choice`. Reasoning
modes:
- **chat** (default): direct response, no `<think>` block.
- **thinking**: emits `<think>...</think>` wrapped reasoning, parsed
out into `reasoning_content` by `DeepSeekR1ReasoningParser`.
Both modes set the `<|latest_reminder|>` anchor automatically — vMLX
adds a default system prompt (`DSV4: injected default system prompt`
in the load log) to keep multi-turn chat from running away on
reasoning loops.
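A tool-calling request might look like the sketch below. It assumes the server accepts the standard OpenAI `tools` / `tool_choice` fields and returns `tool_calls` on the message, which follows from the OpenAI-compatible API mentioned above but is not confirmed here. The `get_weather` tool and the `extract_tool_calls` helper are hypothetical.

```python
import json

# Hypothetical tool definition in the OpenAI "tools" schema.
get_weather = {
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Look up the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}

# Request body for /v1/chat/completions; the model name follows the
# serve command in this card.
payload = {
    "model": "deepseek-v4-flash-mlx-q4q8",
    "messages": [{"role": "user", "content": "What is the weather in Turin?"}],
    "tools": [get_weather],
    "tool_choice": "auto",
    "max_tokens": 256,
}


def extract_tool_calls(response: dict) -> list:
    """Return (name, arguments) pairs from an OpenAI-style response."""
    message = response["choices"][0]["message"]
    calls = message.get("tool_calls") or []
    return [
        (call["function"]["name"], json.loads(call["function"]["arguments"]))
        for call in calls
    ]
```

`extract_tool_calls` returns an empty list when the model answers directly instead of invoking a tool.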
## Quantization details
This release is the output of:
1. Convert from upstream FP4 (routed experts) + FP8 (others) using
`jang_tools.dsv4.convert_dsv4_jangtq --profile 4 --format jang`.
2. **Re-quantize the routed expert tensors** from the FP4 source
through `mx.quantize(..., group_size=32, bits=4, mode="affine")`.
The upstream converter direct-copies FP4 onto disk in MXFP4 form
(uint8 E8M0 scales, no biases) regardless of `--format`; vMLX's
MXFP4 dispatch is broken at 4-bit and produces gibberish. The
re-quantization step rewrites `.weight + .scales + .biases` for
each of the 33,024 routed expert tensors using MLX's actual affine
formula:
```
scale = max((w_max - w_min) / 15, eps)
side = abs(w_min) > abs(w_max)
scale = side ? scale : -scale
edge = side ? w_min : w_max
q0 = round(edge / scale)
scale = (q0 != 0) ? edge / q0 : scale
bias = (q0 != 0) ? edge : 0
```
(matches `mlx/include/mlx/backend/metal/kernels/quantized.h:2387`).
3. Rebuild `model.safetensors.index.json` to include the
newly-introduced `.biases` keys.
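The affine formula in step 2 can be traced on a single group with plain Python. This is a sketch, not the Metal kernel: the `EPS` value, the final clamp-and-round step, and the 4-element group (real groups hold 32 weights) are assumptions made for illustration.

```python
import math

N_BINS = 15  # 2**4 - 1 codes for 4-bit quantization
EPS = 1e-7   # assumed value; the formula above only names `eps`


def round_half_away(x: float) -> int:
    """C-style round(): halves go away from zero, unlike Python's round()."""
    return int(math.copysign(math.floor(abs(x) + 0.5), x))


def quantize_group(w):
    """Return (codes, scale, bias) for one group, following the formula above."""
    w_min, w_max = min(w), max(w)
    scale = max((w_max - w_min) / N_BINS, EPS)
    side = abs(w_min) > abs(w_max)
    scale = scale if side else -scale
    edge = w_min if side else w_max
    q0 = round_half_away(edge / scale)
    if q0 != 0:
        # Re-deriving scale from edge / q0 gives scale * (-q0) + bias == 0,
        # so 0.0 lands on an integer code when -q0 is within range.
        scale, bias = edge / q0, edge
    else:
        bias = 0.0
    # Assumed final step: nearest code, clamped to [0, N_BINS].
    codes = [min(max(round_half_away((x - bias) / scale), 0), N_BINS) for x in w]
    return codes, scale, bias


def dequantize_group(codes, scale, bias):
    """Reconstruct weights as scale * code + bias."""
    return [scale * q + bias for q in codes]


codes, scale, bias = quantize_group([-1.0, -0.5, 0.0, 0.5])
print(codes, scale, bias)
print(dequantize_group(codes, scale, bias))
```

In this example `w_min` has the larger magnitude, so it becomes the `edge` and the `bias`, and the scale stays positive.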
### Size vs. quality tradeoff
This bundle is **173 GB** on disk vs. **~149 GB** for the upstream
FP8 (non-experts) + FP4 (experts) release — about 24 GB of overhead.
The extra space comes from MLX's affine quantization scheme:
- **group_size = 32** (vs. upstream's 128×128 blocks): finer-grained
scales mean less quantization error per group, but more
scale/bias metadata per tensor.
- **non-experts at Q8 affine** (vs. upstream FP8 block): keeps
attention, the shared expert, and embed/lm_head at 8-bit affine (the
router gate stays fp16, as listed above). These tensors are
quality-sensitive and small in total, so they are cheap to spend
bits on.
- **experts at Q4 affine** (vs. upstream MXFP4): same nominal width,
but affine adds per-group `bias` tensors that MXFP4 doesn't carry.
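The metadata overhead in the list above can be put in per-weight terms with a short calculation. It assumes each group stores one 16-bit scale and one 16-bit bias; the storage dtype is not stated in this card, so `meta_bits` is a parameter, and `affine_bits_per_weight` is an illustrative helper.

```python
def affine_bits_per_weight(bits: int, group_size: int, meta_bits: int = 16) -> float:
    """Effective storage per weight: packed code plus a share of scale and bias."""
    # Each group carries one scale and one bias next to its packed codes.
    return bits + 2 * meta_bits / group_size


for gs in (32, 64, 128):
    q4 = affine_bits_per_weight(4, gs)
    q8 = affine_bits_per_weight(8, gs)
    print(f"group_size={gs}: Q4 -> {q4} bits/weight, Q8 -> {q8} bits/weight")
```

Under that assumption, `group_size=32` costs one extra bit per weight, and doubling the group size halves the overhead.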
The choice is deliberate and quality-leaning rather than
size-leaning. Rough perplexity deltas vs. bf16 (extrapolated from
published llama.cpp / MLX quantization studies — not measured on
V4-Flash specifically):
| Knob | Size saved | Quality cost |
|----------------------------|------------|----------------------------------|
| group_size 32 → 64 | ~6–8 GB | +0.1–0.3 % PPL |
| group_size 32 → 128 | ~10–12 GB | +0.3–0.8 % PPL |
| Non-experts Q8 → Q6 | ~3–5 GB | +0.1–0.3 % PPL |
| Non-experts Q8 → Q4 | ~8–10 GB | +0.5–2 % PPL, noticeable on long-context / reasoning |
| Experts Q4 → Q3 | ~30–40 GB | +2–6 % PPL, real degradation |
By the same extrapolation, the current config should be essentially
lossless (<1 % PPL increase); this has not been measured on V4-Flash.
**A more space-balanced alternative for 192 GB Macs**: keep Q8
non-experts + Q4 experts but bump to `group_size=64` — saves ~6–8 GB,
quality loss is in the noise. Going below Q4 on the experts is where
MoE models fall off a cliff (each token only sees 6 of 256 experts,
so quantization noise does not average out across the population),
and gs=128 starts to bite on 1M-token contexts where small per-token
errors compound.
Net: the 24 GB overhead is the price of (a) MLX compatibility — there
is no MLX kernel for DeepSeek's native FP8-block / MXFP4 layout — and
(b) a config that errs on the side of preserving quality over
shaving space.
The community `mxfp4_to_affine.py` script that ships in some upstream
DSV4 conversion guides uses `scale = (max-min)/15, bias = min`, which
**does not** match MLX's affine convention. Bundles produced that way
load but compound quantization error across the 43 transformer layers
(activations explode by layer ~20, NaN by layer ~29) and emit BOS-loop
gibberish. Do not use that script.
## Files in this bundle
```
.
├── config.json # 132 quantization entries (129 routed-expert per-module + globals)
├── jang_config.json # vMLX chat / reasoning / tool-call schema
├── generation_config.json # eos_token_id = [1, 128803, 128804]
├── tokenizer.json
├── tokenizer_config.json # embedded chat_template + special tokens
├── encoding/ # DSV4 encoding adapter
├── model-00001-of-00159.safetensors # 159 shards, total ~173 GB
│ ...
├── model.safetensors.index.json
├── LICENSE
├── README.md # this file
├── README.upstream.md # upstream DeepSeek-V4 model card
└── DeepSeek_V4.pdf # upstream tech report
```
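The entry count noted for `config.json` is consistent with one override per layer and projection. The sketch below rebuilds such a table under assumptions: only `gate_proj` is quoted in this card, so `up_proj` / `down_proj` and the three global keys are guesses at the layout, not a dump of the real file.

```python
NUM_LAYERS = 43
NUM_EXPERTS = 256
# Assumed projection names; only gate_proj is quoted in this card.
PROJECTIONS = ("gate_proj", "up_proj", "down_proj")


def build_quantization_config() -> dict:
    """Build a quantization table with 4-bit overrides for routed experts."""
    # Assumed globals: a guess at what the "globals" entries contain.
    config = {"group_size": 32, "bits": 8, "mode": "affine"}
    for layer in range(NUM_LAYERS):
        for proj in PROJECTIONS:
            path = f"model.layers.{layer}.mlp.switch_mlp.{proj}"
            config[path] = {"group_size": 32, "bits": 4, "mode": "affine"}
    return config


config = build_quantization_config()
print(len(config))                                  # overrides plus globals
print(NUM_LAYERS * NUM_EXPERTS * len(PROJECTIONS))  # routed expert tensors on disk
```

The two printed counts correspond to the 132 config entries and the 33,024 routed expert tensors mentioned in this card.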
## Building from source
The full pipeline (download → convert → re-quantize → finalize → patch
→ verify) is automated in
[`build_mlx_q4q8.sh`](https://huggingface.co/Deviad/DeepSeek-V4-Flash-MLX-Q4Q8/blob/main/build_mlx_q4q8.sh) (companion script in the
project repo). Quick reference of the steps:
```
./build_mlx_q4q8.sh check # sanity-check disks + tools
./build_mlx_q4q8.sh patch_loader # apply the load_jangtq.py guard
./build_mlx_q4q8.sh download # hf download deepseek-ai/DeepSeek-V4-Flash
./build_mlx_q4q8.sh convert # ~40 min: jang_tools convert_dsv4_jangtq
./build_mlx_q4q8.sh requantize # ~30 min: mx.quantize routed experts
./build_mlx_q4q8.sh finalize # tokenizer / encoding asset copy
./build_mlx_q4q8.sh patch # EOS / chat_template fixes
./build_mlx_q4q8.sh verify # check the bundle
./build_mlx_q4q8.sh serve # launch vMLX
```
`./build_mlx_q4q8.sh all` runs everything in order. Total runtime on
M3 Ultra: ~75 minutes plus the initial download (~160 GB at ~150 MB/s =
~18 minutes on a fast link).
See [`requantization-plan.md`](https://huggingface.co/Deviad/DeepSeek-V4-Flash-MLX-Q4Q8/blob/main/requantization-plan.md) for the
diagnostic write-up of why the requantize step is needed.
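One check that is easy to run by hand after a rebuild is that every quantized module in `model.safetensors.index.json` has a `.biases` entry next to its `.scales`. The sketch below assumes the standard Hugging Face index layout with a top-level `weight_map`; it is not a description of what the script's `verify` step does, and both function names are illustrative.

```python
import json


def modules_missing_biases(weight_map: dict) -> list:
    """List modules that have a .scales tensor but no matching .biases."""
    names = set(weight_map)
    missing = []
    for name in names:
        if name.endswith(".scales"):
            module = name[: -len(".scales")]
            if f"{module}.biases" not in names:
                missing.append(module)
    return sorted(missing)


def check_index(index_path: str) -> list:
    """Load an index file and report modules without .biases keys."""
    with open(index_path, "r", encoding="utf-8") as fh:
        return modules_missing_biases(json.load(fh)["weight_map"])
```

An empty list from `check_index("model.safetensors.index.json")` means every affine-quantized module has its bias tensor indexed.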
## License & attribution
This bundle is licensed under MIT, matching the upstream
[DeepSeek-V4-Flash license](LICENSE).
The original model and tech report are credited to the
[DeepSeek-AI](https://www.deepseek.com/) team. Please cite their work
when using this model:
```
@misc{deepseekv4,
title = {DeepSeek-V4: Towards Highly Efficient Million-Token Context Intelligence},
author = {DeepSeek-AI},
year = {2025},
url = {https://github.com/deepseek-ai/DeepSeek-V4}
}
```
The MLX-Q4Q8 quantization recipe is provided as-is and adds nothing
substantive to the science — it is purely a packaging artifact for
running the model on Apple-Silicon hardware.
## Acknowledgments
- DeepSeek-AI for the base model and the open-source release.
- The MLX team at Apple for the framework and the
`mlx.core.quantize` reference implementation.
- The vMLX team for the `jang_tools` tooling and the `load_jangtq`
loader (modulo the patch noted above).