Document size vs. quality tradeoff

README.md (CHANGED)

@@ -164,6 +164,49 @@ This release is the output of:

3. Rebuild `model.safetensors.index.json` to include the
   newly-introduced `.biases` keys.

### Size vs. quality tradeoff

This bundle is **173 GB** on disk vs. **~149 GB** for the upstream
FP8 (non-experts) + FP4 (experts) release — about 24 GB of overhead.
The extra space comes from MLX's affine quantization scheme:

- **group_size = 32** (vs. upstream's 128×128 blocks): finer-grained
  scales mean less quantization error per group, but more
  scale/bias metadata per tensor.
- **non-experts at Q8 affine** (vs. upstream FP8 block): keeps
  attention, router, shared expert, and embed/lm_head at 8-bit affine,
  which is quality-sensitive and small in total — cheap to spend
  bits on.
- **experts at Q4 affine** (vs. upstream MXFP4): same nominal width,
  but affine adds per-group `bias` tensors that MXFP4 doesn't carry.
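The metadata overhead described above can be made concrete with a small sketch. This is a generic per-group affine quantizer in NumPy, illustrative of the scheme only (not MLX's actual kernel, convention, or storage layout), assuming one fp16 scale and one fp16 bias per group:

```python
import numpy as np

def affine_quantize(w, group_size=32, bits=4):
    # Generic per-group affine quantization: w ~ q * scale + bias.
    # Illustrative sketch only, not MLX's exact convention or layout.
    qmax = 2 ** bits - 1
    groups = w.reshape(-1, group_size)
    lo = groups.min(axis=1, keepdims=True)
    hi = groups.max(axis=1, keepdims=True)
    scale = (hi - lo) / qmax
    scale[scale == 0] = 1.0  # guard against constant groups
    q = np.clip(np.round((groups - lo) / scale), 0, qmax).astype(np.uint8)
    return q, scale, lo  # lo serves as the per-group bias

def bits_per_weight(group_size, bits, meta_bits=16):
    # payload bits plus one fp16 scale and one fp16 bias amortized per group
    return bits + 2 * meta_bits / group_size

w = np.random.default_rng(0).normal(size=4096).astype(np.float32)
q, scale, bias = affine_quantize(w)
w_hat = (q * scale + bias).reshape(-1).astype(np.float32)
print(np.abs(w - w_hat).max() <= scale.max() / 2 + 1e-6)  # rounding error bounded by scale/2
print(bits_per_weight(32, 4))  # 5.0 effective bits/weight at group_size=32
print(bits_per_weight(64, 4))  # 4.5 at group_size=64, i.e. ~0.5 bit/weight saved
```

Under these assumptions, Q4 at group_size 32 spends roughly one extra bit per weight on metadata relative to the 4-bit payload, which is consistent with the overhead this section attributes to the affine scheme.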

The choice is deliberate and quality-leaning rather than
size-leaning. Rough perplexity deltas vs. bf16 (extrapolated from
published llama.cpp / MLX quantization studies — not measured on
V4-Flash specifically):

| Knob                | Size saved | Quality cost |
|---------------------|------------|--------------|
| group_size 32 → 64  | ~6–8 GB    | +0.1–0.3 % PPL |
| group_size 32 → 128 | ~10–12 GB  | +0.3–0.8 % PPL |
| Non-experts Q8 → Q6 | ~3–5 GB    | +0.1–0.3 % PPL |
| Non-experts Q8 → Q4 | ~8–10 GB   | +0.5–2 % PPL, noticeable on long-context / reasoning |
| Experts Q4 → Q3     | ~30–40 GB  | +2–6 % PPL, real degradation |

The current config is essentially lossless (<1 % PPL increase).
**A more space-balanced alternative for 192 GB Macs**: keep Q8
non-experts + Q4 experts but bump to `group_size=64` — this saves
~6–8 GB, and the quality loss is in the noise. Going below Q4 on the
experts is where MoE models fall off a cliff (each token only sees
6 of 256 experts, so quantization noise does not average out across
the population), and gs=128 starts to bite on 1M-token contexts,
where small per-token errors compound.
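The "noise does not average out" point can be illustrated with a toy simulation. The 6-of-256 routing figures come from the text above; the unit-variance Gaussian noise is a stand-in for per-expert quantization error, not a measurement:

```python
import numpy as np

rng = np.random.default_rng(0)
tokens, experts, active = 100_000, 256, 6
# stand-in per-expert quantization error (toy model, not measured)
err = rng.normal(0.0, 1.0, size=(tokens, experts))

sparse = err[:, :active].mean(axis=1)  # each token only averages 6 expert outputs
dense = err.mean(axis=1)               # hypothetical: averaging all 256

print(sparse.std())  # ~0.41, i.e. 1/sqrt(6): little cancellation per token
print(dense.std())   # ~0.06, i.e. 1/sqrt(256): noise mostly cancels
```

With only 6 active experts the residual error per token is roughly 6–7× what it would be if all 256 contributed, which is the intuition behind experts being the wrong place to drop below Q4.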

Net: the 24 GB overhead is the price of (a) MLX compatibility — there
is no MLX kernel for DeepSeek's native FP8-block / MXFP4 layout — and
(b) a config that errs on the side of preserving quality over
shaving space.

The community `mxfp4_to_affine.py` script that ships in some upstream
DSV4 conversion guides uses `scale = (max-min)/15, bias = min`, which
**does not** match MLX's affine convention. Bundles produced that way

@@ -210,8 +253,8 @@ project repo). Quick reference of the steps:

`./build_mlx_q4q8.sh all` runs everything in order. Total runtime on
M3 Ultra: ~75 minutes plus the initial download (~160 GB at ~150 MB/s =
~18 minutes on a fast link).
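The download estimate is simple arithmetic on the figures quoted above (using decimal units, 1 GB = 1000 MB):

```python
size_gb = 160      # quoted bundle download size
rate_mb_s = 150    # quoted link speed
minutes = size_gb * 1000 / rate_mb_s / 60
print(round(minutes))  # 18
```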

See [`requantization-plan.md`](https://github.com/...) for the
diagnostic write-up of why the requantize step is needed.