---
license: mit
license_link: LICENSE
base_model: deepseek-ai/DeepSeek-V4-Flash
base_model_relation: quantized
library_name: mlx
pipeline_tag: text-generation
tags:
  - mlx
  - apple-silicon
  - deepseek
  - deepseek-v4
  - mixture-of-experts
  - moe
  - quantized
  - 4-bit
  - 8-bit
  - affine
language:
  - en
  - zh
inference: false
---

# DeepSeek-V4-Flash-MLX-Q4Q8

A mixed-precision MLX quantization of [`deepseek-ai/DeepSeek-V4-Flash`](https://huggingface.co/deepseek-ai/DeepSeek-V4-Flash)
intended for Apple-Silicon inference via [vMLX](https://vmlx.ai/) (or any
MLX-aware runtime that loads models through `mlx_lm.utils.load`).

- **Architecture**: DeepSeek-V4 — 289.9 B total parameters, 256 routed
  experts (top-6 per token), 1 shared expert, 43 layers, MLA attention
  with `head_dim=512` and grouped output projection, mHC
  (Manifold-Constrained Hyper-Connections, `hc_mult=4`),
  sqrtsoftplus + hash routing for the first 3 layers.
- **Quantization**: standard MLX `affine` mode (output of `mx.quantize`,
  not TurboQuant). Tensor naming `<module>.{weight, scales, biases}`.
  Group size 32. Layout in safetensors:
  - **routed experts** (`layers.N.ffn.experts.E.{w1,w2,w3}`): 4-bit
  - **attention** (`layers.N.attn.{wq_a, wkv, wo_a, wo_b, ...}`): 8-bit
  - **shared expert, embed_tokens, lm_head**: 8-bit
  - **norms, router gate, mHC params**: fp16 (passthrough)
- **On-disk size**: 173 GB across 159 safetensors shards.
- **Context**: 1,048,576 tokens (sliding-window=128 short-prompt-safe).

## Usage with vMLX

The bundle is a drop-in replacement for the upstream FP4/FP8 release in
vMLX 1.3.97+. Two non-obvious considerations:

### 1. Runtime patch required (`jang_tools.load_jangtq`)

vMLX's bundled `jang_tools.load_jangtq._patch_quant_config_inplace`
(`/Applications/vMLX.app/.../jang_tools/load_jangtq.py`) infers
quantization overrides from raw safetensors keys
(`model.layers.N.ffn.experts.E.w1`) — these never match the
post-`sanitize()` module paths the MLX `Model` exposes
(`model.layers.N.mlp.switch_mlp.gate_proj`), so it overwrites this
bundle's correct config with unmatchable disk-keyed entries. After
overwrite, `mlx_lm`'s `class_predicate` falls through to top-level
`bits=8` and the routed experts get wrapped as 8-bit modules. The
4-bit-packed weights then silently fail to load (with `strict=False`)
and the model produces BOS-token loops at inference.

The fix is a small guard at the top of `_patch_quant_config_inplace`
that returns early when the user's config already has post-sanitize
overrides:

```python
if existing_overrides and any(k.startswith("model.") for k in existing_overrides):
    return {"action": "user_provided", "existing_overrides": len(existing_overrides)}
```

The accompanying [`build_mlx_q4q8.sh`](#building-from-source) script's
`patch_loader` step applies this idempotently. See
[`requantization-plan.md`](#building-from-source) for the full diagnosis.
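
The failure mode described in this section can be illustrated with a toy lookup. This is a minimal sketch, not `mlx_lm`'s actual `class_predicate`: `resolve_bits` is a hypothetical helper, and the config dictionaries are simplified stand-ins whose keys follow the two naming schemes quoted above.

```python
# Illustrative only: a simplified stand-in for per-module quantization lookup.
# `resolve_bits` is a hypothetical helper, not part of mlx_lm or jang_tools.

def resolve_bits(module_path, quant_cfg):
    """Return the per-module override if one matches, else the top-level default."""
    override = quant_cfg.get(module_path)
    if isinstance(override, dict):
        return override["bits"]
    return quant_cfg["bits"]

# Path the sanitized model exposes for a routed-expert projection.
module_path = "model.layers.3.mlp.switch_mlp.gate_proj"

# Overrides keyed by post-sanitize module paths: the lookup matches.
good_cfg = {
    "group_size": 32,
    "bits": 8,
    module_path: {"group_size": 32, "bits": 4},
}

# Overrides keyed by raw safetensors names: the lookup can never match.
bad_cfg = {
    "group_size": 32,
    "bits": 8,
    "model.layers.3.ffn.experts.0.w1": {"group_size": 32, "bits": 4},
}

print(resolve_bits(module_path, good_cfg))  # 4
print(resolve_bits(module_path, bad_cfg))   # 8
```

With disk-keyed overrides the routed experts fall back to the top-level `bits=8`, which is the mismatch that leaves the 4-bit-packed weights unloadable.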

### 2. SimpleEngine only

vMLX auto-disables `--continuous-batching` for DSV4 because the
batched generator is incompatible with the model's 4-D mHC residual
stream. All requests go through SimpleEngine. Throughput on
Mac Studio M3 Ultra (256 GB unified memory): ~22 tok/s decode,
~75 tok/s prefill.

### Serving

```bash
/Applications/vMLX.app/Contents/Resources/bundled-python/python/bin/python3 \
  -m vmlx_engine.cli serve \
  /path/to/DeepSeek-V4-Flash-MLX-Q4Q8 \
  --served-model-name deepseek-v4-flash-mlx-q4q8 \
  --host 127.0.0.1 --port 8010 \
  --max-tokens 4096 \
  --tool-call-parser deepseek \
  --enable-auto-tool-choice
```

Then hit it with the OpenAI-compatible chat-completion API:

```bash
curl -s http://127.0.0.1:8010/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "deepseek-v4-flash-mlx-q4q8",
    "messages": [{"role": "user", "content": "What is 17+28?"}],
    "max_tokens": 120
  }'
```

The model is reasoning-capable (`<think>...</think>` blocks land in
`reasoning_content`; the final answer in `content`).
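
The same request can be sent from Python with only the standard library. This is a sketch under assumptions: the host, port, and model name come from the serve command above, and the response is assumed to follow the usual OpenAI-compatible shape with `reasoning_content` sitting next to `content` inside `choices[0].message`. The helper names (`build_request`, `split_reply`, `ask`) are hypothetical.

```python
import json
import urllib.request

# Host and port taken from the serve command above.
BASE_URL = "http://127.0.0.1:8010/v1/chat/completions"


def build_request(prompt, max_tokens=120):
    """Build a POST request carrying an OpenAI-style chat-completion payload."""
    payload = {
        "model": "deepseek-v4-flash-mlx-q4q8",
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        BASE_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def split_reply(response):
    """Return (reasoning, answer); reasoning is None when no <think> block was emitted."""
    message = response["choices"][0]["message"]
    return message.get("reasoning_content"), message.get("content")


def ask(prompt):
    """Send the prompt to the local server and split the reply."""
    with urllib.request.urlopen(build_request(prompt)) as resp:
        return split_reply(json.load(resp))

# Example (requires the server to be running):
#   reasoning, answer = ask("What is 17+28?")
```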

## Hardware requirements

- Apple Silicon (M2 Ultra / M3 Ultra recommended).
- **Unified memory**: ≥ 192 GB strongly recommended. The bundle's
  173 GB working set, the KV cache, and a 70 % wired-limit headroom
  (configured automatically by `jang_tools.load_jangtq._apply_wired_limit_safe_default`)
  all need room to spare. The bundle will technically load on 128 GB
  with reduced max-tokens, but expect SSD pressure.
- macOS 14+ for the Metal kernels used by the routed-expert SwitchGLU.

## Tool calling & reasoning

The bundle ships with the DSML tool-call grammar
(`|DSML|` / `<|tool_calls|>` / `<|invoke|>`); pair it with vMLX's
`--tool-call-parser deepseek --enable-auto-tool-choice`. Reasoning
modes:

- **chat** (default): direct response, no `<think>` block.
- **thinking**: emits `<think>...</think>` wrapped reasoning, parsed
  out into `reasoning_content` by `DeepSeekR1ReasoningParser`.

Both modes set the `<|latest_reminder|>` anchor automatically — vMLX
adds a default system prompt (`DSV4: injected default system prompt`
in the load log) to keep multi-turn chat from running away on
reasoning loops.
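
A tool-calling request might look like the sketch below. It assumes the server accepts the standard OpenAI `tools` field and returns parsed calls under `message.tool_calls` with JSON-encoded `arguments`; neither is confirmed by this card. The `get_weather` tool and both helper functions are hypothetical.

```python
import json


def build_tool_payload(prompt):
    """OpenAI-style chat payload advertising one hypothetical tool."""
    return {
        "model": "deepseek-v4-flash-mlx-q4q8",
        "messages": [{"role": "user", "content": prompt}],
        "tools": [
            {
                "type": "function",
                "function": {
                    "name": "get_weather",  # hypothetical tool name
                    "description": "Look up the current weather for a city.",
                    "parameters": {
                        "type": "object",
                        "properties": {"city": {"type": "string"}},
                        "required": ["city"],
                    },
                },
            }
        ],
        "tool_choice": "auto",
    }


def extract_tool_calls(response):
    """Return [(name, arguments_dict), ...]; assumes the OpenAI response shape."""
    message = response["choices"][0]["message"]
    calls = message.get("tool_calls") or []
    return [
        (call["function"]["name"], json.loads(call["function"]["arguments"]))
        for call in calls
    ]
```

The payload is POSTed to the same `/v1/chat/completions` endpoint as a plain chat request.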

## Quantization details

This release is the output of:

1. Convert from upstream FP4 (routed experts) + FP8 (others) using
   `jang_tools.dsv4.convert_dsv4_jangtq --profile 4 --format jang`.
2. **Re-quantize the routed expert tensors** from the FP4 source
   through `mx.quantize(..., group_size=32, bits=4, mode="affine")`.
   The upstream converter direct-copies FP4 onto disk in MXFP4 form
   (uint8 E8M0 scales, no biases) regardless of `--format`; vMLX's
   MXFP4 dispatch is broken at 4-bit and produces gibberish. The
   re-quantization step rewrites `.weight + .scales + .biases` for
   each of the 33,024 routed expert tensors using MLX's actual affine
   formula:
   ```
   scale = max((w_max - w_min) / 15, eps)
   side  = abs(w_min) > abs(w_max)
   scale = side ? scale : -scale
   edge  = side ? w_min : w_max
   q0    = round(edge / scale)
   scale = (q0 != 0) ? edge / q0 : scale
   bias  = (q0 != 0) ? edge      : 0
   ```
   (matches `mlx/include/mlx/backend/metal/kernels/quantized.h:2387`).
3. Rebuild `model.safetensors.index.json` to include the
   newly-introduced `.biases` keys.
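
The scale/bias formula in step 2 can be traced on a single group in pure Python. This is a minimal sketch, not the build script's code: the function names are hypothetical, the `eps` value is a placeholder, and the encode step `q = clip(round((w - bias) / scale), 0, 15)` is an assumption chosen to be consistent with the affine dequantization `w ≈ scale * q + bias`.

```python
def quantize_group(w, bits=4, eps=1e-7):
    """Quantize one group of floats using the scale/bias rule from step 2."""
    n_bins = (1 << bits) - 1                 # 15 for 4-bit
    w_min, w_max = min(w), max(w)
    scale = max((w_max - w_min) / n_bins, eps)
    side = abs(w_min) > abs(w_max)           # which extreme has the larger magnitude
    scale = scale if side else -scale
    edge = w_min if side else w_max
    q0 = round(edge / scale)
    if q0 != 0:
        scale = edge / q0                    # snap the scale to the chosen edge
        bias = edge
    else:
        bias = 0.0
    # Assumed encode step: nearest code, clipped to the representable range.
    q = [min(max(round((x - bias) / scale), 0), n_bins) for x in w]
    return q, scale, bias


def dequantize_group(q, scale, bias):
    """Affine dequantization: w is approximately scale * q + bias."""
    return [scale * qi + bias for qi in q]


group = [-1.0, -0.5, 0.0, 0.5]
q, scale, bias = quantize_group(group)
print(q)  # [0, 5, 10, 15]
print(dequantize_group(q, scale, bias))
```

Because `scale` is re-derived as `edge / q0`, the code `-q0` dequantizes to zero, so a zero weight in the group appears to survive the round trip exactly.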

### Size vs. quality tradeoff

This bundle is **173 GB** on disk vs. **~149 GB** for the upstream
FP8 (non-experts) + FP4 (experts) release — about 24 GB of overhead.
The extra space comes from MLX's affine quantization scheme:

- **group_size = 32** (vs. upstream's 128×128 blocks): finer-grained
  scales mean less quantization error per group, but more
  scale/bias metadata per tensor.
- **non-experts at Q8 affine** (vs. upstream FP8 block): keeps
  attention, shared expert, and embed/lm_head at 8-bit affine. These
  tensors are quality-sensitive and small in total, so they are cheap
  to spend bits on.
- **experts at Q4 affine** (vs. upstream MXFP4): same nominal width,
  but affine adds per-group `bias` tensors that MXFP4 doesn't carry.
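
The metadata cost behind these bullets can be estimated as effective bits per weight. The sketch below assumes one fp16 scale and one fp16 bias per affine group, and an MXFP4-style layout with one 8-bit scale per 32 values and no bias; both are assumptions for illustration, not measurements of this bundle, and `bits_per_weight` is a hypothetical helper.

```python
def bits_per_weight(bits, group_size, scale_bits=16, bias_bits=16):
    """Payload bits plus per-group scale/bias metadata, amortized over the group."""
    return bits + (scale_bits + bias_bits) / group_size


for gs in (32, 64, 128):
    print(gs, bits_per_weight(4, gs), bits_per_weight(8, gs))
# 32 5.0 9.0
# 64 4.5 8.5
# 128 4.25 8.25

# Assumed MXFP4-style layout: 4-bit values, one 8-bit scale per 32 values, no bias.
print(bits_per_weight(4, 32, scale_bits=8, bias_bits=0))  # 4.25
```

Under these assumptions, a 4-bit affine tensor at `group_size=32` costs about 5 bits per weight, which is where most of the extra space comes from.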

The choice is deliberate and quality-leaning rather than
size-leaning. Rough perplexity deltas vs. bf16 (extrapolated from
published llama.cpp / MLX quantization studies — not measured on
V4-Flash specifically):

| Knob                       | Size saved | Quality cost                     |
|----------------------------|------------|----------------------------------|
| group_size 32 → 64         | ~6–8 GB    | +0.1–0.3 % PPL                   |
| group_size 32 → 128        | ~10–12 GB  | +0.3–0.8 % PPL                   |
| Non-experts Q8 → Q6        | ~3–5 GB    | +0.1–0.3 % PPL                   |
| Non-experts Q8 → Q4        | ~8–10 GB   | +0.5–2 % PPL, noticeable on long-context / reasoning |
| Experts Q4 → Q3            | ~30–40 GB  | +2–6 % PPL, real degradation     |

By the same extrapolation, the current config should be essentially lossless (<1 % PPL increase).
**A more space-balanced alternative for 192 GB Macs**: keep Q8
non-experts + Q4 experts but bump to `group_size=64` — saves ~6–8 GB,
quality loss is in the noise. Going below Q4 on the experts is where
MoE models fall off a cliff (each token only sees 6 of 256 experts,
so quantization noise does not average out across the population),
and gs=128 starts to bite on 1M-token contexts where small per-token
errors compound.

Net: the 24 GB overhead is the price of (a) MLX compatibility — there
is no MLX kernel for DeepSeek's native FP8-block / MXFP4 layout — and
(b) a config that errs on the side of preserving quality over
shaving space.

The community `mxfp4_to_affine.py` script that ships in some upstream
DSV4 conversion guides uses `scale = (max-min)/15, bias = min`, which
**does not** match MLX's affine convention. Bundles produced that way
load but compound quantization error across the 43 transformer layers
(activations explode by layer ~20, NaN by layer ~29) and emit BOS-loop
gibberish. Do not use that script.

## Files in this bundle

```
.
├── config.json                        # 132 quantization entries (129 routed-expert per-module + globals)
├── jang_config.json                   # vMLX chat / reasoning / tool-call schema
├── generation_config.json             # eos_token_id = [1, 128803, 128804]
├── tokenizer.json
├── tokenizer_config.json              # embedded chat_template + special tokens
├── encoding/                          # DSV4 encoding adapter
├── model-00001-of-00159.safetensors  # 159 shards, total ~173 GB
│   ...
├── model.safetensors.index.json
├── LICENSE
├── README.md                          # this file
├── README.upstream.md                 # upstream DeepSeek-V4 model card
└── DeepSeek_V4.pdf                    # upstream tech report
```
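
A quick consistency check on `model.safetensors.index.json` can be sketched as follows. It assumes the standard index layout with a top-level `weight_map` mapping tensor names to shard files, and it flags any quantized module that has `.scales` but no `.biases`. Both function names are hypothetical and not part of the build script.

```python
import json


def missing_biases(weight_map):
    """Return module stems that list `.scales` but no matching `.biases`."""
    names = set(weight_map)
    missing = []
    for name in sorted(names):
        if name.endswith(".scales"):
            stem = name[: -len(".scales")]
            if stem + ".biases" not in names:
                missing.append(stem)
    return missing


def check_index(path):
    """Load an index file and report modules without `.biases` entries."""
    with open(path, "r", encoding="utf-8") as fh:
        index = json.load(fh)
    return missing_biases(index["weight_map"])

# Example:
#   check_index("model.safetensors.index.json")  # an empty list means every module has biases
```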

## Building from source

The full pipeline (download → convert → re-quantize → finalize → patch
→ verify) is automated in
[`build_mlx_q4q8.sh`](https://huggingface.co/Deviad/DeepSeek-V4-Flash-MLX-Q4Q8/blob/main/build_mlx_q4q8.sh) (companion script in the
project repo). Quick reference of the steps:

```bash
./build_mlx_q4q8.sh check         # sanity-check disks + tools
./build_mlx_q4q8.sh patch_loader  # apply the load_jangtq.py guard
./build_mlx_q4q8.sh download      # hf download deepseek-ai/DeepSeek-V4-Flash
./build_mlx_q4q8.sh convert       # ~40 min: jang_tools convert_dsv4_jangtq
./build_mlx_q4q8.sh requantize    # ~30 min: mx.quantize routed experts
./build_mlx_q4q8.sh finalize      # tokenizer / encoding asset copy
./build_mlx_q4q8.sh patch         # EOS / chat_template fixes
./build_mlx_q4q8.sh verify        # check the bundle
./build_mlx_q4q8.sh serve         # launch vMLX
```

`./build_mlx_q4q8.sh all` runs everything in order. Total runtime on
M3 Ultra: ~75 minutes plus the initial download (~160 GB at ~150 MB/s =
~18 minutes on a fast link).
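
The download estimate is plain arithmetic; a small sketch (using decimal units, 1 GB = 1000 MB, with a hypothetical helper name) makes it easy to redo for a different link speed:

```python
def transfer_minutes(size_gb, rate_mb_s):
    """Minutes needed to move size_gb gigabytes at rate_mb_s megabytes per second."""
    return size_gb * 1000 / rate_mb_s / 60


print(round(transfer_minutes(160, 150), 1))  # 17.8
```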

See [`requantization-plan.md`](https://huggingface.co/Deviad/DeepSeek-V4-Flash-MLX-Q4Q8/blob/main/requantization-plan.md) for the
diagnostic write-up of why the requantize step is needed.

## License & attribution

This bundle is licensed under MIT, matching the upstream
[DeepSeek-V4-Flash license](LICENSE).

The original model and tech report are credited to the
[DeepSeek-AI](https://www.deepseek.com/) team. Please cite their work
when using this model:

```bibtex
@misc{deepseekv4,
  title  = {DeepSeek-V4: Towards Highly Efficient Million-Token Context Intelligence},
  author = {DeepSeek-AI},
  year   = {2025},
  url    = {https://github.com/deepseek-ai/DeepSeek-V4}
}
```

The MLX-Q4Q8 quantization recipe is provided as-is and adds nothing
substantive to the science — it is purely a packaging artifact for
running the model on Apple-Silicon hardware.

## Acknowledgments

- DeepSeek-AI for the base model and the open-source release.
- The MLX team at Apple for the framework and the
  `mlx.core.quantize` reference implementation.
- The vMLX team for the `jang_tools` tooling and the `load_jangtq`
  loader (modulo the patch noted above).