Document size vs. quality tradeoff

README.md (CHANGED)

@@ -164,6 +164,49 @@ This release is the output of:

3. Rebuild `model.safetensors.index.json` to include the
   newly-introduced `.biases` keys.

### Size vs. quality tradeoff

This bundle is **173 GB** on disk vs. **~149 GB** for the upstream
FP8 (non-experts) + FP4 (experts) release — about 24 GB of overhead.
The extra space comes from MLX's affine quantization scheme:

- **group_size = 32** (vs. upstream's 128×128 blocks): finer-grained
  scales mean less quantization error per group, but more
  scale/bias metadata per tensor.
- **non-experts at Q8 affine** (vs. upstream FP8 block): keeps
  attention, router, shared expert, and embed/lm_head at 8-bit affine,
  which is quality-sensitive and small in total — cheap to spend
  bits on.
- **experts at Q4 affine** (vs. upstream MXFP4): same nominal width,
  but affine adds per-group `bias` tensors that MXFP4 doesn't carry.
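The metadata overhead described above can be made concrete with a small sketch. This is a generic per-group affine quantizer in NumPy, illustrative of the scheme only (not MLX's actual kernel, convention, or storage layout), assuming one fp16 scale and one fp16 bias per group:

```python
import numpy as np

def affine_quantize(w, group_size=32, bits=4):
    # Generic per-group affine quantization: w ~ q * scale + bias.
    # Illustrative sketch only, not MLX's exact convention or layout.
    qmax = 2 ** bits - 1
    groups = w.reshape(-1, group_size)
    lo = groups.min(axis=1, keepdims=True)
    hi = groups.max(axis=1, keepdims=True)
    scale = (hi - lo) / qmax
    scale[scale == 0] = 1.0  # guard against constant groups
    q = np.clip(np.round((groups - lo) / scale), 0, qmax).astype(np.uint8)
    return q, scale, lo  # lo serves as the per-group bias

def bits_per_weight(group_size, bits, meta_bits=16):
    # payload bits plus one fp16 scale and one fp16 bias amortized per group
    return bits + 2 * meta_bits / group_size

w = np.random.default_rng(0).normal(size=4096).astype(np.float32)
q, scale, bias = affine_quantize(w)
w_hat = (q * scale + bias).reshape(-1).astype(np.float32)
print(np.abs(w - w_hat).max() <= scale.max() / 2 + 1e-6)  # rounding error bounded by scale/2
print(bits_per_weight(32, 4))  # 5.0 effective bits/weight at group_size=32
print(bits_per_weight(64, 4))  # 4.5 at group_size=64, i.e. ~0.5 bit/weight saved
```

Under these assumptions, Q4 at group_size 32 spends roughly one extra bit per weight on metadata relative to the 4-bit payload, which is consistent with the overhead this section attributes to the affine scheme.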

The choice is deliberate and quality-leaning rather than
size-leaning. Rough perplexity deltas vs. bf16 (extrapolated from
published llama.cpp / MLX quantization studies — not measured on
V4-Flash specifically):

| Knob                | Size saved | Quality cost |
|---------------------|------------|--------------|
| group_size 32 → 64  | ~6–8 GB    | +0.1–0.3 % PPL |
| group_size 32 → 128 | ~10–12 GB  | +0.3–0.8 % PPL |
| Non-experts Q8 → Q6 | ~3–5 GB    | +0.1–0.3 % PPL |
| Non-experts Q8 → Q4 | ~8–10 GB   | +0.5–2 % PPL, noticeable on long-context / reasoning |
| Experts Q4 → Q3     | ~30–40 GB  | +2–6 % PPL, real degradation |

The current config is essentially lossless (<1 % PPL increase).
**A more space-balanced alternative for 192 GB Macs**: keep Q8
non-experts + Q4 experts but bump to `group_size=64` — this saves
~6–8 GB, and the quality loss is in the noise. Going below Q4 on the
experts is where MoE models fall off a cliff (each token only sees
6 of 256 experts, so quantization noise does not average out across
the population), and gs=128 starts to bite on 1M-token contexts,
where small per-token errors compound.
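The "noise does not average out" point can be illustrated with a toy simulation. The 6-of-256 routing figures come from the text above; the unit-variance Gaussian noise is a stand-in for per-expert quantization error, not a measurement:

```python
import numpy as np

rng = np.random.default_rng(0)
tokens, experts, active = 100_000, 256, 6
# stand-in per-expert quantization error (toy model, not measured)
err = rng.normal(0.0, 1.0, size=(tokens, experts))

sparse = err[:, :active].mean(axis=1)  # each token only averages 6 expert outputs
dense = err.mean(axis=1)               # hypothetical: averaging all 256

print(sparse.std())  # ~0.41, i.e. 1/sqrt(6): little cancellation per token
print(dense.std())   # ~0.06, i.e. 1/sqrt(256): noise mostly cancels
```

With only 6 active experts the residual error per token is roughly 6–7× what it would be if all 256 contributed, which is the intuition behind experts being the wrong place to drop below Q4.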

Net: the 24 GB overhead is the price of (a) MLX compatibility — there
is no MLX kernel for DeepSeek's native FP8-block / MXFP4 layout — and
(b) a config that errs on the side of preserving quality over
shaving space.

The community `mxfp4_to_affine.py` script that ships in some upstream
DSV4 conversion guides uses `scale = (max-min)/15, bias = min`, which
**does not** match MLX's affine convention. Bundles produced that way

@@ -210,8 +253,8 @@ project repo). Quick reference of the steps:

`./build_mlx_q4q8.sh all` runs everything in order. Total runtime on
M3 Ultra: ~75 minutes plus the initial download (~160 GB at ~150 MB/s =
~18 minutes on a fast link).
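The download estimate is simple arithmetic on the figures quoted above (using decimal units, 1 GB = 1000 MB):

```python
size_gb = 160      # quoted bundle download size
rate_mb_s = 150    # quoted link speed
minutes = size_gb * 1000 / rate_mb_s / 60
print(round(minutes))  # 18
```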

See [`requantization-plan.md`](https://github.com/...) for the
diagnostic write-up of why the requantize step is needed.