Deviad committed
Commit 78a20cd · verified · 1 Parent(s): 5a9248a

Document size vs. quality tradeoff

Files changed (1)
  1. README.md +45 -2
README.md CHANGED
@@ -164,6 +164,49 @@ This release is the output of:
  3. Rebuild `model.safetensors.index.json` to include the
  newly-introduced `.biases` keys.

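As an aside on step 3 above, here is a minimal illustrative sketch of regenerating the sharded-checkpoint index so the new `.biases` keys appear in the weight map. The shard filename pattern and the use of on-disk file size as a stand-in for `total_size` are assumptions; the repo's own build script is the authoritative implementation.

```python
# Illustrative sketch (not the repo's script): regenerate
# model.safetensors.index.json so the weight_map lists every key in the
# requantized shards, including the newly-introduced ".biases" entries.
import glob
import json
import os

from safetensors import safe_open

weight_map, total_size = {}, 0
for shard in sorted(glob.glob("model-*-of-*.safetensors")):  # assumed shard naming
    with safe_open(shard, framework="np") as f:
        for key in f.keys():  # quantized weights plus *.scales / *.biases
            weight_map[key] = os.path.basename(shard)
    total_size += os.path.getsize(shard)  # approximation: includes safetensors headers

index = {"metadata": {"total_size": total_size}, "weight_map": weight_map}
with open("model.safetensors.index.json", "w") as f:
    json.dump(index, f, indent=2)
```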
+ ### Size vs. quality tradeoff
+
+ This bundle is **173 GB** on disk vs. **~149 GB** for the upstream
+ FP8 (non-experts) + FP4 (experts) release, a difference of about 24 GB.
+ The extra space comes from MLX's affine quantization scheme (a rough
+ sketch of the per-group math follows this list):
+
+ - **group_size = 32** (vs. upstream's 128×128 blocks): finer-grained
+ scales mean less quantization error per group, but more
+ scale/bias metadata per tensor.
+ - **non-experts at Q8 affine** (vs. upstream FP8 block): keeps
+ attention, router, shared expert, and embed/lm_head at 8-bit affine;
+ these tensors are quality-sensitive yet small in total, so they are
+ cheap to spend bits on.
+ - **experts at Q4 affine** (vs. upstream MXFP4): same nominal width,
+ but affine adds per-group `bias` tensors that MXFP4 doesn't carry.
+
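To make the metadata overhead in the list above concrete, here is a back-of-the-envelope sketch of per-group affine quantization and its storage cost. It uses the generic min/max formulation and assumes fp16 storage for each group's scale and bias; neither detail comes from this repo, and, as the warning further down notes, this exact min/max convention is not necessarily what MLX expects, so treat it purely as an illustration of where the extra bytes come from.

```python
# Illustrative per-group affine quantization (generic min/max variant, NOT
# necessarily MLX's exact convention). Each group of `group_size` weights
# shares one scale and one bias; dequantized value ~= q * scale + bias.
import numpy as np

def affine_quantize(w: np.ndarray, group_size: int = 32, bits: int = 4):
    levels = 2 ** bits - 1
    groups = w.reshape(-1, group_size)                  # one row per group
    lo = groups.min(axis=1, keepdims=True)
    hi = groups.max(axis=1, keepdims=True)
    scale = np.where(hi > lo, (hi - lo) / levels, 1.0)  # guard constant groups
    q = np.clip(np.round((groups - lo) / scale), 0, levels).astype(np.uint8)
    return q, scale.astype(np.float16), lo.astype(np.float16)

def affine_dequantize(q, scale, bias):
    return q.astype(np.float32) * scale.astype(np.float32) + bias.astype(np.float32)

# Storage per 32-weight group, assuming fp16 scale + fp16 bias (an assumption
# about the packing, not a measured fact):
#   Q4 affine : 32 * 4 + 16 + 16 = 160 bits  (~25 % metadata overhead)
#   MXFP4     : 32 * 4 + 8       = 136 bits  (one shared 8-bit E8M0 scale)
# Multiplied across the expert weights, that per-group difference accounts for
# much of the size gap described above.
```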
+ The choice is deliberate and quality-leaning rather than
+ size-leaning. Rough perplexity deltas vs. bf16 (extrapolated from
+ published llama.cpp / MLX quantization studies, not measured on
+ V4-Flash specifically):
+
+ | Knob                 | Size saved | Quality cost                                         |
+ |----------------------|------------|------------------------------------------------------|
+ | group_size 32 → 64   | ~6–8 GB    | +0.1–0.3 % PPL                                       |
+ | group_size 32 → 128  | ~10–12 GB  | +0.3–0.8 % PPL                                       |
+ | Non-experts Q8 → Q6  | ~3–5 GB    | +0.1–0.3 % PPL                                       |
+ | Non-experts Q8 → Q4  | ~8–10 GB   | +0.5–2 % PPL, noticeable on long-context / reasoning |
+ | Experts Q4 → Q3      | ~30–40 GB  | +2–6 % PPL, real degradation                         |
+
+ The current config is essentially lossless (<1 % PPL increase).
+ **A more space-balanced alternative for 192 GB Macs**: keep Q8
+ non-experts + Q4 experts but bump to `group_size=64`, which saves
+ ~6–8 GB at a quality loss that is in the noise. Going below Q4 on the
+ experts is where MoE models fall off a cliff (each token only sees 6
+ of 256 experts, so quantization noise does not average out across the
+ population), and gs=128 starts to bite on 1M-token contexts where
+ small per-token errors compound.
+
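If you want to try the `group_size=64` alternative, a uniform conversion along these lines should work with `mlx_lm`'s Python API. This is a sketch under assumptions: the repo id is hypothetical, the keyword names (`q_group_size`, `q_bits`) follow recent `mlx_lm` releases, and a plain `convert()` call applies one setting everywhere, so reproducing the mixed Q8 non-experts / Q4 experts split still needs the repo's `build_mlx_q4q8.sh`.

```python
# Hedged sketch: re-convert with a coarser group size (64) while keeping 4-bit
# weights. Keyword names follow recent mlx_lm releases and may differ in older
# ones; build_mlx_q4q8.sh remains the authoritative recipe for the mixed
# Q8 (non-experts) / Q4 (experts) layout.
from mlx_lm import convert

convert(
    hf_path="deepseek-ai/DeepSeek-V4-Flash",  # hypothetical upstream repo id
    mlx_path="dsv4-flash-mlx-q4-gs64",
    quantize=True,
    q_group_size=64,  # the space-balanced alternative discussed above
    q_bits=4,
)
```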
+ Net: the 24 GB overhead is the price of (a) MLX compatibility (there
+ is no MLX kernel for DeepSeek's native FP8-block / MXFP4 layout) and
+ (b) a config that errs on the side of preserving quality over
+ shaving space.
+
  The community `mxfp4_to_affine.py` script that ships in some upstream
  DSV4 conversion guides uses `scale = (max-min)/15, bias = min`, which
  **does not** match MLX's affine convention. Bundles produced that way
 
@@ -210,8 +253,8 @@ project repo). Quick reference of the steps:
  ```

  `./build_mlx_q4q8.sh all` runs everything in order. Total runtime on
- M3 Ultra: ~75 minutes plus the initial download (~50 GB at ~150 MB/s =
- ~6 minutes on a fast link).
+ M3 Ultra: ~75 minutes plus the initial download (~160 GB at ~150 MB/s =
+ ~18 minutes on a fast link).

  See [`requantization-plan.md`](https://github.com/...) for the
  diagnostic write-up of why the requantize step is needed.