---
license: mit
license_link: LICENSE
base_model: deepseek-ai/DeepSeek-V4-Flash
base_model_relation: quantized
library_name: mlx
pipeline_tag: text-generation
tags:
- mlx
- apple-silicon
- deepseek
- deepseek-v4
- mixture-of-experts
- moe
- quantized
- 4-bit
- 8-bit
- affine
language:
- en
- zh
inference: false
---

# DeepSeek-V4-Flash-MLX-Q4Q8

A mixed-precision MLX quantization of [`deepseek-ai/DeepSeek-V4-Flash`](https://huggingface.co/deepseek-ai/DeepSeek-V4-Flash)
intended for Apple-Silicon inference via [vMLX](https://vmlx.ai/) (or any
MLX-aware runtime that loads models through `mlx_lm.utils.load`).
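
As a quick smoke test outside vMLX, the bundle can also be loaded with
`mlx_lm` directly. A minimal sketch, assuming a recent `mlx-lm` install, a
local copy of the bundle, and enough unified memory; the chat template and
reasoning parsing are skipped, so treat it purely as a smoke test:

```python
# Minimal load/generate smoke test with mlx_lm; the path and prompt are placeholders.
from mlx_lm import load, generate

model, tokenizer = load("/path/to/DeepSeek-V4-Flash-MLX-Q4Q8")
print(generate(model, tokenizer, prompt="What is 17+28?", max_tokens=64))
```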

- **Architecture**: DeepSeek-V4 — 289.9 B total parameters, 256 routed
  experts (top-6 per token), 1 shared expert, 43 layers, MLA attention
  with `head_dim=512` and grouped output projection, mHC
  (Manifold-Constrained Hyper-Connections, `hc_mult=4`),
  sqrtsoftplus + hash routing for the first 3 layers.
- **Quantization**: standard MLX `affine` mode (output of `mx.quantize`,
  not TurboQuant). Tensor naming `<module>.{weight, scales, biases}`.
  Group size 32. Layout in safetensors (see the inspection sketch after
  this list):
  - **routed experts** (`layers.N.ffn.experts.E.{w1,w2,w3}`): 4-bit
  - **attention** (`layers.N.attn.{wq_a, wkv, wo_a, wo_b, ...}`): 8-bit
  - **shared expert, embed_tokens, lm_head**: 8-bit
  - **norms, router gate, mHC params**: fp16 (passthrough)
- **On-disk size**: 173 GB across 159 safetensors shards.
- **Context**: 1,048,576 tokens (sliding-window=128, short-prompt-safe).
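
MLX packs quantized values into `uint32` words along the last axis, so the
per-tensor bit width can be recovered from the packed `weight` shape and the
`scales` shape. A minimal, dependency-free sketch of auditing one shard
(illustrative; the shard path is a placeholder and `GROUP_SIZE` matches the
group size above):

```python
# Parse the safetensors JSON header and infer bits per quantized tensor from shapes.
import json
import struct

GROUP_SIZE = 32
path = "model-00001-of-00159.safetensors"  # placeholder shard

with open(path, "rb") as f:
    header_len = struct.unpack("<Q", f.read(8))[0]   # 8-byte little-endian header size
    header = json.loads(f.read(header_len))          # tensor name -> {dtype, shape, offsets}

for name, meta in header.items():
    if not name.endswith(".weight") or name[: -len(".weight")] + ".scales" not in header:
        continue  # fp16 passthrough tensors (norms, router gate, mHC params)
    base = name[: -len(".weight")]
    packed_cols = meta["shape"][-1]                   # uint32-packed columns
    groups = header[base + ".scales"]["shape"][-1]    # one scale per GROUP_SIZE inputs
    print(f"{base}: {32 * packed_cols // (groups * GROUP_SIZE)}-bit")
```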

## Usage with vMLX

The bundle is a drop-in replacement for the upstream FP4/FP8 release in
vMLX 1.3.97+. Two non-obvious considerations:

### 1. Runtime patch required (`jang_tools.load_jangtq`)

vMLX's bundled `jang_tools.load_jangtq._patch_quant_config_inplace`
(`/Applications/vMLX.app/.../jang_tools/load_jangtq.py`) infers
quantization overrides from raw safetensors keys
(`model.layers.N.ffn.experts.E.w1`) — these never match the
post-`sanitize()` module paths the MLX `Model` exposes
(`model.layers.N.mlp.switch_mlp.gate_proj`), so it overwrites this
bundle's correct config with unmatchable disk-keyed entries. After the
overwrite, `mlx_lm`'s `class_predicate` falls through to the top-level
`bits=8` and the routed experts get wrapped as 8-bit modules. The
4-bit-packed weights then silently fail to load (with `strict=False`)
and the model produces BOS-token loops at inference.
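
To see why unmatchable keys matter, this is roughly the shape of the
per-module override lookup performed when the model skeleton is quantized (a
paraphrased sketch, not the upstream `mlx_lm` source; the override dict and
module paths are illustrative):

```python
# Paraphrased sketch of per-module quantization overrides; paths are examples only.
DEFAULT = {"group_size": 32, "bits": 8}   # top-level fallback from config.json

overrides = {
    # post-sanitize module path as written by this bundle: stays 4-bit
    "model.layers.0.mlp.switch_mlp.gate_proj": {"group_size": 32, "bits": 4},
}

def class_predicate(path, module):
    if path in overrides:
        return overrides[path]                 # exact post-sanitize path match wins
    return hasattr(module, "to_quantized")     # otherwise quantize with DEFAULT (8-bit)

# If the loader rewrites `overrides` with raw disk keys such as
# "model.layers.0.ffn.experts.0.w1", nothing ever matches and every routed
# expert falls through to the 8-bit default: the failure mode described above.
```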

The fix is a guard at the top of `_patch_quant_config_inplace` that
returns early when the user's config already has post-sanitize
overrides:

```python
if existing_overrides and any(k.startswith("model.") for k in existing_overrides):
    return {"action": "user_provided", "existing_overrides": len(existing_overrides)}
```

The accompanying [`build_mlx_q4q8.sh`](#building-from-source) script's
`patch_loader` step applies this idempotently. See
[`requantization-plan.md`](#building-from-source) for the full diagnosis.

### 2. SimpleEngine only

vMLX auto-disables `--continuous-batching` for DSV4 because the
batched generator is incompatible with the model's 4-D mHC residual
stream. All requests go through SimpleEngine. Throughput on Mac Studio
M3 Ultra (256 GB unified memory): ~22 tok/s decode, ~75 tok/s prefill.

### Serving

```bash
/Applications/vMLX.app/Contents/Resources/bundled-python/python/bin/python3 \
  -m vmlx_engine.cli serve \
  /path/to/DeepSeek-V4-Flash-MLX-Q4Q8 \
  --served-model-name deepseek-v4-flash-mlx-q4q8 \
  --host 127.0.0.1 --port 8010 \
  --max-tokens 4096 \
  --tool-call-parser deepseek \
  --enable-auto-tool-choice
```

Then hit it with the OpenAI-compatible chat-completion API:

```bash
curl -s http://127.0.0.1:8010/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "deepseek-v4-flash-mlx-q4q8",
    "messages": [{"role": "user", "content": "What is 17+28?"}],
    "max_tokens": 120
  }'
```

The model is reasoning-capable: `<think>...</think>` blocks land in
`reasoning_content`, and the final answer lands in `content`.
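
A minimal sketch of reading both fields from Python with the `openai`
client (the `reasoning_content` field is server-specific, so it is read
defensively; the port and model name match the serve command above):

```python
# Read reasoning_content alongside the standard content field from the local server.
from openai import OpenAI

client = OpenAI(base_url="http://127.0.0.1:8010/v1", api_key="unused")
resp = client.chat.completions.create(
    model="deepseek-v4-flash-mlx-q4q8",
    messages=[{"role": "user", "content": "What is 17+28?"}],
    max_tokens=120,
)
msg = resp.choices[0].message
print("reasoning:", getattr(msg, "reasoning_content", None))  # <think>...</think> body
print("answer:", msg.content)
```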

## Hardware requirements

- Apple Silicon (M2 Ultra / M3 Ultra recommended).
- **Unified memory**: ≥ 192 GB strongly recommended; the bundle's
  173 GB working set plus KV cache plus the 70 % wired-limit headroom
  (configured automatically by `jang_tools.load_jangtq._apply_wired_limit_safe_default`)
  leaves little room to spare (a quick check is sketched below). The
  model will technically load on 128 GB with reduced max-tokens, but
  expect SSD pressure.
- macOS 14+ for the Metal kernels used by the routed-expert SwitchGLU.
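
A quick way to see installed unified memory and the current GPU wired limit
(illustrative; it assumes the `iogpu.wired_limit_mb` sysctl commonly used on
Apple Silicon, where a value of 0 means the macOS default):

```python
# Print installed unified memory and the current GPU wired-memory limit.
import subprocess

def sysctl(name: str) -> int:
    # `sysctl -n` prints only the value, so int() can parse it directly.
    return int(subprocess.run(["sysctl", "-n", name], capture_output=True, text=True).stdout)

print(f"unified memory : {sysctl('hw.memsize') / 1e9:.0f} GB")
print(f"wired limit    : {sysctl('iogpu.wired_limit_mb')} MB (0 = macOS default)")
```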

## Tool calling & reasoning

The bundle ships with the DSML tool-call grammar
(`|DSML|` / `<|tool_calls|>` / `<|invoke|>`); pair it with vMLX's
`--tool-call-parser deepseek --enable-auto-tool-choice`. Reasoning
modes:

- **chat** (default): direct response, no `<think>` block.
- **thinking**: emits `<think>...</think>` wrapped reasoning, parsed
  out into `reasoning_content` by `DeepSeekR1ReasoningParser`.

Both modes set the `<|latest_reminder|>` anchor automatically — vMLX
adds a default system prompt (`DSV4: injected default system prompt`
in the load log) to keep multi-turn chat from running away on
reasoning loops.

## Quantization details

This release is the output of:

1. Convert from upstream FP4 (routed experts) + FP8 (others) using
   `jang_tools.dsv4.convert_dsv4_jangtq --profile 4 --format jang`.
2. **Re-quantize the routed expert tensors** from the FP4 source
   through `mx.quantize(..., group_size=32, bits=4, mode="affine")`.
   The upstream converter direct-copies FP4 onto disk in MXFP4 form
   (uint8 E8M0 scales, no biases) regardless of `--format`; vMLX's
   MXFP4 dispatch is broken at 4-bit and produces gibberish. The
   re-quantization step rewrites `.weight + .scales + .biases` for
   each of the 33,024 routed expert tensors using MLX's actual affine
   formula:
   ```
   scale = max((w_max - w_min) / 15, eps)
   side  = abs(w_min) > abs(w_max)
   scale = side ? scale : -scale
   edge  = side ? w_min : w_max
   q0    = round(edge / scale)
   scale = (q0 != 0) ? edge / q0 : scale
   bias  = (q0 != 0) ? edge : 0
   ```
   (matches `mlx/include/mlx/backend/metal/kernels/quantized.h:2387`;
   a runnable sketch follows this list).
3. Rebuild `model.safetensors.index.json` to include the
   newly-introduced `.biases` keys.
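
As a concrete illustration of step 2, this is roughly what the per-tensor
re-quantization looks like (a sketch, not the build script; it assumes the
FP4 source has already been dequantized to a float array, uses the
`mx.quantize` signature cited above, and the tensor shape is a placeholder):

```python
# Re-quantize one dequantized expert weight with MLX affine mode (group_size=32,
# bits=4), producing the .weight/.scales/.biases triplet described above.
import mlx.core as mx

w = mx.random.normal((2048, 7168)).astype(mx.float16)  # stand-in for a dequantized w1/w2/w3

wq, scales, biases = mx.quantize(w, group_size=32, bits=4, mode="affine")

print(wq.dtype, wq.shape)          # uint32, last dim packed 8 values per word (32 / 4 bits)
print(scales.shape, biases.shape)  # one scale and one bias per group of 32 input values

w_hat = mx.dequantize(wq, scales, biases, group_size=32, bits=4, mode="affine")
print(mx.abs(w - w_hat).max())     # per-group affine reconstruction error
```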

### Size vs. quality tradeoff

This bundle is **173 GB** on disk vs. **~149 GB** for the upstream
FP8 (non-experts) + FP4 (experts) release — about 24 GB of overhead.
The extra space comes from MLX's affine quantization scheme (a rough
bits-per-weight comparison follows the list):

- **group_size = 32** (vs. upstream's 128×128 blocks): finer-grained
  scales mean less quantization error per group, but more
  scale/bias metadata per tensor.
- **non-experts at Q8 affine** (vs. upstream FP8 block): keeps
  attention, router, shared expert, embed/lm_head at 8-bit affine,
  which is quality-sensitive and small in total — cheap to spend
  bits on.
- **experts at Q4 affine** (vs. upstream MXFP4): same nominal width,
  but affine adds per-group `bias` tensors that MXFP4 doesn't carry.
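
A back-of-the-envelope view of the metadata cost (illustrative; it assumes
fp16 scales and biases for the affine layouts and the standard 32-element
MXFP4 block for the upstream expert format):

```python
# Effective bits per weight = payload bits + per-group metadata bits / group size.
def effective_bits(bits, group, scale_bits, bias_bits=0):
    return bits + (scale_bits + bias_bits) / group

print(effective_bits(4, 32, 8))       # upstream MXFP4 experts   -> 4.25 b/w (E8M0 scale only)
print(effective_bits(4, 32, 16, 16))  # this bundle's Q4 affine  -> 5.00 b/w
print(effective_bits(4, 64, 16, 16))  # Q4 affine, group_size 64 -> 4.50 b/w
print(effective_bits(8, 32, 16, 16))  # this bundle's Q8 affine  -> 9.00 b/w
```

Most of the 289.9 B parameters sit in the routed experts, so the jump from
roughly 4.25 to 5.0 effective bits per weight on those tensors is where the
bulk of the ~24 GB delta comes from.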

The choice is deliberate and quality-leaning rather than
size-leaning. Rough perplexity deltas vs. bf16 (extrapolated from
published llama.cpp / MLX quantization studies — not measured on
V4-Flash specifically):

| Knob                 | Size saved | Quality cost                                          |
|----------------------|------------|-------------------------------------------------------|
| group_size 32 → 64   | ~6–8 GB    | +0.1–0.3 % PPL                                        |
| group_size 32 → 128  | ~10–12 GB  | +0.3–0.8 % PPL                                        |
| Non-experts Q8 → Q6  | ~3–5 GB    | +0.1–0.3 % PPL                                        |
| Non-experts Q8 → Q4  | ~8–10 GB   | +0.5–2 % PPL, noticeable on long-context / reasoning  |
| Experts Q4 → Q3      | ~30–40 GB  | +2–6 % PPL, real degradation                          |

The current config is essentially lossless (expected <1 % PPL
increase). **A more space-balanced alternative for 192 GB Macs**: keep
Q8 non-experts + Q4 experts but bump to `group_size=64` — saves
~6–8 GB, and the quality loss is in the noise. Going below Q4 on the
experts is where MoE models fall off a cliff (each token only sees 6
of 256 experts, so quantization noise does not average out across the
expert population), and gs=128 starts to bite on 1M-token contexts
where small per-token errors compound.

Net: the 24 GB overhead is the price of (a) MLX compatibility — there
is no MLX kernel for DeepSeek's native FP8-block / MXFP4 layout — and
(b) a config that errs on the side of preserving quality over
shaving space.

The community `mxfp4_to_affine.py` script that ships in some upstream
DSV4 conversion guides uses `scale = (max-min)/15, bias = min`, which
**does not** match MLX's affine convention. Bundles produced that way
load but compound quantization error across the 43 transformer layers
(activations explode by layer ~20, NaN by layer ~29) and emit BOS-loop
gibberish. Do not use that script.
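
To make the mismatch concrete, here is a toy comparison of the two
conventions on a single 32-value group (illustrative only; it applies the
community script's formula and the affine pseudocode from "Quantization
details" to the same data):

```python
# Same group, two (scale, bias) conventions. Dequantizing MLX-packed values
# with the naive pair garbles the weights, which is what compounds across layers.
import numpy as np

w = np.linspace(-0.5, 1.0, 32, dtype=np.float32)   # toy group where |max| > |min|
w_min, w_max = float(w.min()), float(w.max())

# community script: scale = (max - min) / 15, bias = min
naive_scale, naive_bias = (w_max - w_min) / 15, w_min

# MLX affine convention (pseudocode above)
eps = 1e-7
scale = max((w_max - w_min) / 15, eps)
side = abs(w_min) > abs(w_max)
scale = scale if side else -scale
edge = w_min if side else w_max
q0 = round(edge / scale)
mlx_scale = edge / q0 if q0 != 0 else scale
mlx_bias = edge if q0 != 0 else 0.0

print(f"naive: scale={naive_scale:+.4f} bias={naive_bias:+.4f}")  # +0.1000, -0.5000
print(f"mlx  : scale={mlx_scale:+.4f} bias={mlx_bias:+.4f}")      # -0.1000, +1.0000
```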

## Files in this bundle

```
.
├── config.json                      # 132 quantization entries (129 routed-expert per-module + globals)
├── jang_config.json                 # vMLX chat / reasoning / tool-call schema
├── generation_config.json           # eos_token_id = [1, 128803, 128804]
├── tokenizer.json
├── tokenizer_config.json            # embedded chat_template + special tokens
├── encoding/                        # DSV4 encoding adapter
├── model-00001-of-00159.safetensors # 159 shards, total ~173 GB
│   ...
├── model.safetensors.index.json
├── LICENSE
├── README.md                        # this file
├── README.upstream.md               # upstream DeepSeek-V4 model card
└── DeepSeek_V4.pdf                  # upstream tech report
```

## Building from source

The full pipeline (download → convert → re-quantize → finalize → patch
→ verify) is automated in
[`build_mlx_q4q8.sh`](https://huggingface.co/Deviad/DeepSeek-V4-Flash-MLX-Q4Q8/blob/main/build_mlx_q4q8.sh)
(companion script in the project repo). Quick reference of the steps:

```
./build_mlx_q4q8.sh check         # sanity-check disks + tools
./build_mlx_q4q8.sh patch_loader  # apply the load_jangtq.py guard
./build_mlx_q4q8.sh download      # hf download deepseek-ai/DeepSeek-V4-Flash
./build_mlx_q4q8.sh convert       # ~40 min: jang_tools convert_dsv4_jangtq
./build_mlx_q4q8.sh requantize    # ~30 min: mx.quantize routed experts
./build_mlx_q4q8.sh finalize      # tokenizer / encoding asset copy
./build_mlx_q4q8.sh patch         # EOS / chat_template fixes
./build_mlx_q4q8.sh verify        # check the bundle
./build_mlx_q4q8.sh serve         # launch vMLX
```

`./build_mlx_q4q8.sh all` runs everything in order. Total runtime on
M3 Ultra: ~75 minutes plus the initial download (~160 GB at ~150 MB/s =
~18 minutes on a fast link).

See [`requantization-plan.md`](https://huggingface.co/Deviad/DeepSeek-V4-Flash-MLX-Q4Q8/blob/main/requantization-plan.md) for the
diagnostic write-up of why the requantize step is needed.

## License & attribution

This bundle is licensed under MIT, matching the upstream
[DeepSeek-V4-Flash license](LICENSE).

The original model and tech report are credited to the
[DeepSeek-AI](https://www.deepseek.com/) team. Please cite their work
when using this model:

```
@misc{deepseekv4,
  title  = {DeepSeek-V4: Towards Highly Efficient Million-Token Context Intelligence},
  author = {DeepSeek-AI},
  year   = {2025},
  url    = {https://github.com/deepseek-ai/DeepSeek-V4}
}
```

The MLX-Q4Q8 quantization recipe is provided as-is and adds nothing
substantive to the science — it is purely a packaging artifact for
running the model on Apple-Silicon hardware.

## Acknowledgments

- DeepSeek-AI for the base model and the open-source release.
- The MLX team at Apple for the framework and the
  `mlx.core.quantize` reference implementation.
- The vMLX team for the `jang_tools` tooling and the `load_jangtq`
  loader (modulo the patch noted above).