Correct BF16 source size (V3 carryover 1.3 TB → V4 ~600 GB)
V4-Flash is ~284B params (~13B activated), not the 671B of V3. The 1.3 TB
BF16 source figure was V3-era math (671B × 2 bytes). V4-Flash with the
MTP block is ~600 GB BF16. The 172 GB measured artifact size is unchanged.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
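
The size correction above is parameter count times two bytes per BF16 weight. A minimal sketch of that arithmetic, assuming decimal units (1 GB = 1e9 bytes); the helper name `bf16_size_gb` is hypothetical and used only for illustration. The parameter counts are the rounded figures quoted in the commit message, so the V4 result is a rough estimate that lands a little under the ~600 GB stated for the source with the MTP block.

```python
def bf16_size_gb(n_params: float, bytes_per_param: int = 2) -> float:
    """Approximate size in decimal GB of a checkpoint stored entirely in BF16."""
    return n_params * bytes_per_param / 1e9


# Rounded parameter counts taken from the commit message.
v3_gb = bf16_size_gb(671e9)   # V3-era figure, about 1.3 TB
v4_gb = bf16_size_gb(284e9)   # V4-Flash estimate from the rounded ~284B count

# Measured artifact size from the README, compared with the stated ~600 GB source.
artifact_gb = 172
stated_source_gb = 600
ratio = stated_source_gb / artifact_gb

print(f"V3 BF16:       {v3_gb:.0f} GB")
print(f"V4-Flash BF16: {v4_gb:.0f} GB")
print(f"source / artifact: {ratio:.2f}x")
```

The gap between the 568 GB estimate and the ~600 GB figure is expected from rounding in "~284B"; how the MTP block is counted in that number is not stated in the commit.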
--- a/README.md
+++ b/README.md
@@ -19,7 +19,7 @@ A DeepSeek-V4-Flash NVFP4-FP8 quantization that retains the MTP (multi-token-pre
 
 ## What this is
 
-- 172 GB across 35 safetensors shards (vs
+- 172 GB across 35 safetensors shards (vs ~600 GB BF16 source, MTP block included).
 - Same quantization scheme as `RedHatAI/DeepSeek-V4-Flash-NVFP4-FP8`: NVFP4 (group=16, FP8 e4m3 scales) on routed FFN experts, FP8_BLOCK 128×128 on attention.
 - MTP block (`mtp.0.*`, 799 tensors) kept at BF16 — not dropped at load time, not double-quantized when the MTP draft model is constructed.
 
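
For a rough sense of why the artifact is so much smaller than the BF16 source, the storage cost per weight of the two schemes named in the README can be sketched as value bits plus the per-group scale spread over the group. This is an illustration under stated assumptions, not a description of the actual checkpoint layout: `bits_per_weight` is a hypothetical helper, the 32-bit scale width for the 128×128 FP8 blocks is assumed (the README does not state it), and any per-tensor global scales are ignored.

```python
def bits_per_weight(value_bits: int, scale_bits: int, group_size: int) -> float:
    """Effective bits per weight: element bits plus the group's scale amortized over it."""
    return value_bits + scale_bits / group_size


# NVFP4 as described in the README: 4-bit values, one FP8 (e4m3) scale per group of 16.
nvfp4_bits = bits_per_weight(value_bits=4, scale_bits=8, group_size=16)

# FP8_BLOCK 128x128: 8-bit values, one scale per 128*128 block.
# ASSUMPTION: the block scale is stored in 32 bits.
fp8_block_bits = bits_per_weight(value_bits=8, scale_bits=32, group_size=128 * 128)

bf16_bits = 16

print(f"NVFP4:     {nvfp4_bits} bits/weight ({bf16_bits / nvfp4_bits:.2f}x smaller than BF16)")
print(f"FP8_BLOCK: {fp8_block_bits:.4f} bits/weight ({bf16_bits / fp8_block_bits:.2f}x smaller than BF16)")
```

The overall reduction of the artifact sits between these per-scheme ratios and 1x, because the MTP block and any other unquantized tensors stay at BF16.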