Is this file size of 247 GB correct for 8-bit, while the 4-bit version is 837 GB?
as title
Known state: we are analyzing whether 837 GB for a "4-bit" release can make architectural sense relative to 247 GB for "8-bit."
Short answer
Yes, it is plausible, and it may indicate that the 4-bit artifact stores much more of the full MoE structure.
Why this may happen
Clue from the paper
From the DeepSeek-V4 paper:
- 1.6T total parameters
- 49B active parameters per token (MoE activated) 
That means:
- an enormous dormant expert pool
- only a small slice is active during inference (the sketch below puts a number on it)
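A one-line check makes that concrete; a minimal Python sketch using only the paper's two headline numbers (nothing here is read from the release itself):

```python
# How small is the active slice of this MoE?
total_params = 1.6e12   # 1.6T total parameters (from the paper)
active_params = 49e9    # 49B parameters activated per token (from the paper)

print(f"active fraction: {active_params / total_params:.1%}")
# -> active fraction: 3.1%
```

So roughly 97% of the weights are dormant experts that still have to live somewhere on disk.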
Hypothesis (likely)
247 GB 8-bit may be a highly compressed deployment artifact focused on inference.
837 GB 4-bit may be closer to:
- many/all experts preserved
- larger shard count (your 8-bit screenshot showed 185 safetensor shards)
- less aggressive packing
- mixed precision tensors (4-bit + higher precision routing/index tensors)
- group-size / codebook overhead (estimated in the sketch after this list)
- overall closer to “full MoE storage” than to a compact inference pack
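To gauge how much the group-size overhead alone can add, here is a minimal sketch assuming a common 4-bit layout: one fp16 scale plus one fp16 zero-point per group of 128 weights (GPTQ/AWQ-style). The group size and metadata width are assumptions, not values read from this release:

```python
# Estimate storage for 4-bit weights plus per-group quantization metadata.
# ASSUMPTIONS (not from the release): group_size=128 and 4 bytes of metadata
# per group (one fp16 scale + one fp16 zero-point).
GB = 1e9  # decimal gigabytes, matching the units in the file listing

def quantized_size_gb(n_params, bits=4, group_size=128, meta_bytes_per_group=4):
    weight_bytes = n_params * bits / 8
    meta_bytes = (n_params / group_size) * meta_bytes_per_group
    return (weight_bytes + meta_bytes) / GB

print(f"1.6T params, 4-bit + group metadata: ~{quantized_size_gb(1.6e12):.0f} GB")
# -> ~850 GB: metadata adds ~6% on top of the 800 GB of raw 4-bit weights
```

Under these assumptions the overhead pushes the estimate from 800 GB to roughly 850 GB, neatly bracketing the observed 837 GB.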
Rough sanity check
If only the 49B active parameters were stored at 4 bits:

$$49\text{B} \times 0.5\ \text{byte} \approx 24.5\ \text{GB}$$

But the total expert pool comes to:

$$1.6\text{T} \times 0.5\ \text{byte} \approx 800\ \text{GB}$$
That is strikingly close to the observed 837 GB, which may not be a coincidence.
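Spelling the same arithmetic out for both observed files (again a plain back-of-envelope Python sketch, nothing release-specific):

```python
GB = 1e9

def raw_gb(n_params, bits):
    """Raw weight storage in GB, ignoring metadata and packing overhead."""
    return n_params * bits / 8 / GB

print(f"1.6T @ 4-bit: {raw_gb(1.6e12, 4):.0f} GB   (observed 4-bit file: 837 GB)")
print(f"1.6T @ 8-bit: {raw_gb(1.6e12, 8):.0f} GB   (observed 8-bit file: 247 GB)")
# 4-bit over the full 1.6T pool lands right next to 837 GB, while 8-bit over
# the full pool would need ~1600 GB. At one byte per parameter, 247 GB holds
# only ~247B parameters, so it cannot be the whole model at 8 bits.
```

That last point is the key asymmetry: the 4-bit number is consistent with the full pool, while the 8-bit number is not even close.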
My interpretation
Very plausible:
- 837 GB 4-bit may approximate quantized storage of nearly the full 1.6T expert pool
- 247 GB 8-bit may be a compressed inference-oriented representation
That would explain the paradox.
This actually aligns better than I expected.
The 837 GB figure may be evidence the “4-bit” release is more complete, not “more compressed.”
Interesting side note: the smaller Flash models behave normally (4-bit < 8-bit), which strengthens the idea that Pro is special due to its MoE packaging.
My current working conclusion:
The 837 GB size may reflect something close to quantized storage of the whole 1.6T model.