---
license: mit
license_link: LICENSE
base_model: deepseek-ai/DeepSeek-V4-Flash
base_model_relation: quantized
library_name: mlx
pipeline_tag: text-generation
tags:
- mlx
- apple-silicon
- deepseek
- deepseek-v4
- mixture-of-experts
- moe
- quantized
- 4-bit
- 8-bit
- affine
language:
- en
- zh
inference: false
---
# DeepSeek-V4-Flash-MLX-Q4Q8
A mixed-precision MLX quantization of [`deepseek-ai/DeepSeek-V4-Flash`](https://huggingface.co/deepseek-ai/DeepSeek-V4-Flash)
intended for Apple-Silicon inference via [vMLX](https://vmlx.ai/) (or any
MLX-aware runtime that loads models through `mlx_lm.utils.load`).
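For a quick smoke test outside vMLX, the bundle should also load with the
stock `mlx_lm` Python API (a minimal sketch; the local path is a placeholder
and generation settings are untuned):

```python
# Minimal sanity check with mlx-lm; assumes the bundle is already on disk.
from mlx_lm import load, generate

model, tokenizer = load("/path/to/DeepSeek-V4-Flash-MLX-Q4Q8")
print(generate(model, tokenizer, prompt="What is 17+28?", max_tokens=64))
```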
- **Architecture**: DeepSeek-V4 — 289.9 B total parameters, 256 routed
experts (top-6 per token), 1 shared expert, 43 layers, MLA attention
with `head_dim=512` and grouped output projection, mHC
(Manifold-Constrained Hyper-Connections, `hc_mult=4`),
sqrtsoftplus + hash routing for the first 3 layers.
- **Quantization**: standard MLX `affine` mode (the output of `mx.quantize`,
not TurboQuant). Quantized tensors are stored as `<module>.{weight, scales, biases}`
with group size 32 (see the inspection sketch below this list). Layout in safetensors:
- **routed experts** (`layers.N.ffn.experts.E.{w1,w2,w3}`): 4-bit
- **attention** (`layers.N.attn.{wq_a, wkv, wo_a, wo_b, ...}`): 8-bit
- **shared expert, embed_tokens, lm_head**: 8-bit
- **norms, router gate, mHC params**: fp16 (passthrough)
- **On-disk size**: 173 GB across 159 safetensors shards.
- **Context**: 1,048,576 tokens (sliding-window=128 short-prompt-safe).
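To confirm the on-disk layout, the shard headers can be read with
`safetensors` and the bit width backed out from the packed shapes (a sketch
assuming MLX's usual packing of quantized weights into `uint32`, i.e.
32/bits values per stored element; the shard name is illustrative):

```python
# Print the bit width of each quantized module in one shard (shapes only,
# nothing is materialized in memory).
from safetensors import safe_open

with safe_open("model-00001-of-00159.safetensors", framework="numpy") as f:
    for key in f.keys():
        if not key.endswith(".scales"):
            continue
        base = key[: -len(".scales")]
        w_cols = f.get_slice(base + ".weight").get_shape()[-1]
        groups = f.get_slice(key).get_shape()[-1]
        in_features = groups * 32              # group_size = 32
        bits = 32 * w_cols // in_features      # 4 for routed experts, 8 elsewhere
        print(f"{base}: {bits}-bit")
```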
## Usage with vMLX
The bundle is a drop-in replacement for the upstream FP4/FP8 release in
vMLX 1.3.97+. Two non-obvious considerations:
### 1. Runtime patch required (`jang_tools.load_jangtq`)
vMLX's bundled `jang_tools.load_jangtq._patch_quant_config_inplace`
(`/Applications/vMLX.app/.../jang_tools/load_jangtq.py`) infers
quantization overrides from raw safetensors keys
(`model.layers.N.ffn.experts.E.w1`) — these never match the
post-`sanitize()` module paths the MLX `Model` exposes
(`model.layers.N.mlp.switch_mlp.gate_proj`), so it overwrites this
bundle's correct config with unmatchable disk-keyed entries. After
overwrite, `mlx_lm`'s `class_predicate` falls through to top-level
`bits=8` and the routed experts get wrapped as 8-bit modules. The
4-bit-packed weights then silently fail to load (with `strict=False`)
and the model produces BOS-token loops at inference.
The fix is a 4-line guard at the top of `_patch_quant_config_inplace`
that returns early when the user's config already has post-sanitize
overrides:
```python
if existing_overrides and any(k.startswith("model.") for k in existing_overrides):
    return {"action": "user_provided", "existing_overrides": len(existing_overrides)}
```
The accompanying [`build_mlx_q4q8.sh`](#building-from-source) script's
`patch_loader` step applies this idempotently. See
[`requantization-plan.md`](#building-from-source) for the full diagnosis.
### 2. SimpleEngine only
vMLX auto-disables `--continuous-batching` for DSV4 because the
batched generator is incompatible with the model's 4-D mHC residual
stream. All requests go through SimpleEngine. Throughput on a
Mac Studio M3 Ultra (256 GB unified memory) is roughly 22 tok/s decode
and 75 tok/s prefill.
### Serving
```bash
/Applications/vMLX.app/Contents/Resources/bundled-python/python/bin/python3 \
-m vmlx_engine.cli serve \
/path/to/DeepSeek-V4-Flash-MLX-Q4Q8 \
--served-model-name deepseek-v4-flash-mlx-q4q8 \
--host 127.0.0.1 --port 8010 \
--max-tokens 4096 \
--tool-call-parser deepseek \
--enable-auto-tool-choice
```
Then hit it with the OpenAI-compatible chat-completion API:
```bash
curl -s http://127.0.0.1:8010/v1/chat/completions \
-H 'Content-Type: application/json' \
-d '{
"model": "deepseek-v4-flash-mlx-q4q8",
"messages": [{"role": "user", "content": "What is 17+28?"}],
"max_tokens": 120
}'
```
The model is reasoning-capable: `<think>...</think>` blocks are parsed into
`reasoning_content`, and the final answer lands in `content`.
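A minimal client-side sketch with the `openai` Python package
(`reasoning_content` is a server-side extension of the standard schema, so
it is read defensively here):

```python
# Query the local vMLX endpoint and split reasoning from the final answer.
from openai import OpenAI

client = OpenAI(base_url="http://127.0.0.1:8010/v1", api_key="not-needed")
resp = client.chat.completions.create(
    model="deepseek-v4-flash-mlx-q4q8",
    messages=[{"role": "user", "content": "What is 17+28?"}],
    max_tokens=120,
)
msg = resp.choices[0].message
print("reasoning:", getattr(msg, "reasoning_content", None))
print("answer:", msg.content)
```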
## Hardware requirements
- Apple Silicon (M1 Max / M2 Ultra / M3 Ultra recommended).
- **Unified memory**: ≥ 192 GB strongly recommended. The bundle's 173 GB
  working set plus the KV cache needs comfortable headroom under the ~70 %
  wired-memory limit configured automatically by
  `jang_tools.load_jangtq._apply_wired_limit_safe_default` (see the sysctl
  sketch after this list to inspect the limit manually). It will technically
  load on 128 GB with reduced max-tokens, but expect SSD pressure.
- macOS 14+ for the Metal kernels used by the routed-expert SwitchGLU.
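If you want to inspect or bump the wired limit outside the automatic path,
the usual macOS knob is the `iogpu.wired_limit_mb` sysctl (a sketch; the
value shown is illustrative, not a recommendation, and it resets on reboot):

```bash
# Show the current GPU wired-memory limit (0 means the macOS default).
sysctl iogpu.wired_limit_mb

# Temporarily raise it, e.g. to ~200 GB on a 256 GB machine.
sudo sysctl iogpu.wired_limit_mb=204800
```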
## Tool calling & reasoning
The bundle ships with the DSML tool-call grammar
(`|DSML|` / `<|tool_calls|>` / `<|invoke|>`); pair it with vMLX's
`--tool-call-parser deepseek --enable-auto-tool-choice`. Reasoning
modes:
- **chat** (default): direct response, no `<think>` block.
- **thinking**: emits `<think>...</think>` wrapped reasoning, parsed
out into `reasoning_content` by `DeepSeekR1ReasoningParser`.
Both modes set the `<|latest_reminder|>` anchor automatically — vMLX
adds a default system prompt (`DSV4: injected default system prompt`
in the load log) to keep multi-turn chat from running away on
reasoning loops.
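For tool calling, the endpoint accepts the standard OpenAI `tools` array
when served with `--tool-call-parser deepseek --enable-auto-tool-choice`
(a sketch; the function name and schema are illustrative, not part of the
bundle):

```python
# Let the model decide whether to call the (hypothetical) get_weather tool.
from openai import OpenAI

client = OpenAI(base_url="http://127.0.0.1:8010/v1", api_key="not-needed")
resp = client.chat.completions.create(
    model="deepseek-v4-flash-mlx-q4q8",
    messages=[{"role": "user", "content": "What's the weather in Rome?"}],
    tools=[{
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Look up the current weather for a city",
            "parameters": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
            },
        },
    }],
    tool_choice="auto",
    max_tokens=256,
)
print(resp.choices[0].message.tool_calls)
```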
## Quantization details
This release is the output of:
1. Convert from upstream FP4 (routed experts) + FP8 (others) using
`jang_tools.dsv4.convert_dsv4_jangtq --profile 4 --format jang`.
2. **Re-quantize the routed expert tensors** from the FP4 source
through `mx.quantize(..., group_size=32, bits=4, mode="affine")`
(sketched after this list).
The upstream converter direct-copies FP4 onto disk in MXFP4 form
(uint8 E8M0 scales, no biases) regardless of `--format`; vMLX's
MXFP4 dispatch is broken at 4-bit and produces gibberish. The
re-quantization step rewrites `.weight + .scales + .biases` for
each of the 33,024 routed expert tensors using MLX's actual affine
formula:
```
scale = max((w_max - w_min) / 15, eps)
side = abs(w_min) > abs(w_max)
scale = side ? scale : -scale
edge = side ? w_min : w_max
q0 = round(edge / scale)
scale = (q0 != 0) ? edge / q0 : scale
bias = (q0 != 0) ? edge : 0
```
(matches `mlx/include/mlx/backend/metal/kernels/quantized.h:2387`).
3. Rebuild `model.safetensors.index.json` to include the
newly-introduced `.biases` keys.
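In sketch form, step 2 reduces to the following per-tensor helper (a minimal
illustration, not the actual `build_mlx_q4q8.sh` implementation; `mode="affine"`
is MLX's default and is therefore omitted):

```python
# Re-pack one routed-expert tensor with MLX's own affine quantization.
import mlx.core as mx

def requantize_expert(w_fp16: mx.array):
    """w_fp16: expert weight recovered from the MXFP4 source as fp16."""
    # mx.quantize returns the packed weight plus per-group scales and biases.
    wq, scales, biases = mx.quantize(w_fp16, group_size=32, bits=4)
    # Round-trip check: the dequantized values should track the source closely.
    w_hat = mx.dequantize(wq, scales, biases, group_size=32, bits=4)
    print("max |w - dq(q(w))| =", mx.abs(w_hat - w_fp16).max().item())
    return wq, scales, biases  # stored as <module>.{weight, scales, biases}
```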
### Size vs. quality tradeoff
This bundle is **173 GB** on disk vs. **~149 GB** for the upstream
FP8 (non-experts) + FP4 (experts) release — about 24 GB of overhead.
The extra space comes from MLX's affine quantization scheme:
- **group_size = 32** (vs. upstream's 128×128 blocks): finer-grained
scales mean less quantization error per group, but more
scale/bias metadata per tensor.
- **non-experts at Q8 affine** (vs. upstream FP8 block): keeps
attention, router, shared expert, embed/lm_head at 8-bit affine,
which is quality-sensitive and small in total — cheap to spend
bits on.
- **experts at Q4 affine** (vs. upstream MXFP4): same nominal width,
but affine adds per-group `bias` tensors that MXFP4 doesn't carry.
The choice is deliberate and quality-leaning rather than
size-leaning. Rough perplexity deltas vs. bf16 (extrapolated from
published llama.cpp / MLX quantization studies — not measured on
V4-Flash specifically):
| Knob | Size saved | Quality cost |
|----------------------------|------------|----------------------------------|
| group_size 32 → 64 | ~6–8 GB | +0.1–0.3 % PPL |
| group_size 32 → 128 | ~10–12 GB | +0.3–0.8 % PPL |
| Non-experts Q8 → Q6 | ~3–5 GB | +0.1–0.3 % PPL |
| Non-experts Q8 → Q4 | ~8–10 GB | +0.5–2 % PPL, noticeable on long-context / reasoning |
| Experts Q4 → Q3 | ~30–40 GB | +2–6 % PPL, real degradation |
On that extrapolation, the current config should be essentially lossless (<1 % PPL increase).
**A more space-balanced alternative for 192 GB Macs**: keep Q8
non-experts + Q4 experts but bump to `group_size=64` — saves ~6–8 GB,
quality loss is in the noise. Going below Q4 on the experts is where
MoE models fall off a cliff (each token only sees 6 of 256 experts,
so quantization noise does not average out across the population),
and gs=128 starts to bite on 1M-token contexts where small per-token
errors compound.
Net: the 24 GB overhead is the price of (a) MLX compatibility — there
is no MLX kernel for DeepSeek's native FP8-block / MXFP4 layout — and
(b) a config that errs on the side of preserving quality over
shaving space.
The community `mxfp4_to_affine.py` script that ships in some upstream
DSV4 conversion guides uses `scale = (max-min)/15, bias = min`, which
**does not** match MLX's affine convention. Bundles produced that way
load but compound quantization error across the 43 transformer layers
(activations explode by layer ~20, NaN by layer ~29) and emit BOS-loop
gibberish. Do not use that script.
## Files in this bundle
```
.
├── config.json # 132 quantization entries (129 routed-expert per-module + globals)
├── jang_config.json # vMLX chat / reasoning / tool-call schema
├── generation_config.json # eos_token_id = [1, 128803, 128804]
├── tokenizer.json
├── tokenizer_config.json # embedded chat_template + special tokens
├── encoding/ # DSV4 encoding adapter
├── model-00001-of-00159.safetensors # 159 shards, total ~173 GB
│ ...
├── model.safetensors.index.json
├── LICENSE
├── README.md # this file
├── README.upstream.md # upstream DeepSeek-V4 model card
└── DeepSeek_V4.pdf # upstream tech report
```
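A quick way to confirm the rebuilt index (step 3 under "Quantization
details") picked up the new `.biases` entries is to check that every
`.scales` key has a `.biases` sibling (a small sketch, run from the bundle
root):

```python
# Verify the index: every .scales entry should come with a .biases entry.
import json

with open("model.safetensors.index.json") as f:
    weight_map = json.load(f)["weight_map"]

missing = [k for k in weight_map
           if k.endswith(".scales")
           and k[: -len(".scales")] + ".biases" not in weight_map]
print("quantized tensors missing .biases:", len(missing))
```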
## Building from source
The full pipeline (download → convert → re-quantize → finalize → patch
→ verify) is automated in
[`build_mlx_q4q8.sh`](https://huggingface.co/Deviad/DeepSeek-V4-Flash-MLX-Q4Q8/blob/main/build_mlx_q4q8.sh) (companion script in the
project repo). Quick reference of the steps:
```
./build_mlx_q4q8.sh check # sanity-check disks + tools
./build_mlx_q4q8.sh patch_loader # apply the load_jangtq.py guard
./build_mlx_q4q8.sh download # hf download deepseek-ai/DeepSeek-V4-Flash
./build_mlx_q4q8.sh convert # ~40 min: jang_tools convert_dsv4_jangtq
./build_mlx_q4q8.sh requantize # ~30 min: mx.quantize routed experts
./build_mlx_q4q8.sh finalize # tokenizer / encoding asset copy
./build_mlx_q4q8.sh patch # EOS / chat_template fixes
./build_mlx_q4q8.sh verify # check the bundle
./build_mlx_q4q8.sh serve # launch vMLX
```
`./build_mlx_q4q8.sh all` runs everything in order. Total runtime on
M3 Ultra: ~75 minutes plus the initial download (~160 GB at ~150 MB/s =
~18 minutes on a fast link).
See [`requantization-plan.md`](https://huggingface.co/Deviad/DeepSeek-V4-Flash-MLX-Q4Q8/blob/main/requantization-plan.md) for the
diagnostic write-up of why the requantize step is needed.
## License & attribution
This bundle is licensed under MIT, matching the upstream
[DeepSeek-V4-Flash license](LICENSE).
The original model and tech report are credited to the
[DeepSeek-AI](https://www.deepseek.com/) team. Please cite their work
when using this model:
```
@misc{deepseekv4,
title = {DeepSeek-V4: Towards Highly Efficient Million-Token Context Intelligence},
author = {DeepSeek-AI},
year = {2025},
url = {https://github.com/deepseek-ai/DeepSeek-V4}
}
```
The MLX-Q4Q8 quantization recipe is provided as-is and adds nothing
substantive to the science — it is purely a packaging artifact for
running the model on Apple-Silicon hardware.
## Acknowledgments
- DeepSeek-AI for the base model and the open-source release.
- The MLX team at Apple for the framework and the
`mlx.core.quantize` reference implementation.
- The vMLX team for the `jang_tools` tooling and the `load_jangtq`
loader (modulo the patch noted above).