Is the 247 GB file size correct for the 8-bit version, while the 4-bit version is 837 GB?

#1
by Seth-TW - opened

as title

MLX Community org

Known state: we are asking whether 837 GB for the “4-bit” release can make architectural sense relative to 247 GB for the “8-bit” release.

Short answer

Yes, it is plausible, and it may indicate that the 4-bit artifact stores much more of the full MoE structure.

Why this may happen

Clue from the paper

From the DeepSeek-V4 paper:

  • 1.6T total parameters
  • 49B active parameters per token (MoE activated) 

That means:

  • an enormous dormant expert pool
  • only a small slice (roughly 49B / 1.6T ≈ 3%) is active for any given token

Hypothesis (likely)

247 GB 8-bit may be a highly compressed deployment artifact focused on inference.

837 GB 4-bit may be closer to:

  • many/all experts preserved
  • larger shard count (your 8-bit screenshot showed 185 safetensors shards)
  • less aggressive packing
  • mixed-precision tensors (4-bit weights plus higher-precision routing/index tensors)
  • group-size / codebook overhead (see the sketch after this list)
  • possibly closer to “full MoE storage” than a compact inference pack
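
On the group-size / codebook point, here is a minimal back-of-envelope sketch of how per-group metadata inflates a “quantized” checkpoint. It assumes MLX-style group-wise affine quantization with one fp16 scale and one fp16 bias per group of 64 weights; the group size and the “everything quantized” assumption are mine, not read from the actual files:

```python
# Minimal sketch: storage cost of group-wise affine quantization.
# Assumptions (mine, not read from the checkpoint): one fp16 scale and
# one fp16 bias per group of 64 weights, and every tensor is quantized.

def quantized_bytes(n_params: float, bits: int, group_size: int = 64) -> float:
    packed = n_params * bits / 8                  # packed weight payload
    metadata = (n_params / group_size) * (2 + 2)  # fp16 scale + fp16 bias per group
    return packed + metadata

GB = 1e9
total_params = 1.6e12  # full 1.6T expert pool

for bits in (4, 8):
    size = quantized_bytes(total_params, bits)
    raw = total_params * bits / 8
    print(f"{bits}-bit full pool: {size / GB:,.0f} GB "
          f"(+{size / raw - 1:.1%} group metadata)")
```

Under these assumptions the full 1.6T pool lands around 900 GB at 4 bits and well above 1.6 TB at 8 bits, so a 247 GB 8-bit artifact cannot contain the whole pool at that precision.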

Rough sanity check

If only the 49B active parameters were stored at 4 bits:

$$49\text{B} \times 0.5\ \text{byte} \approx 24.5\ \text{GB raw}$$

But the total expert pool:

$$1.6\text{T} \times 0.5\ \text{byte} \approx 800\ \text{GB}$$

That is strikingly close to:

837 GB

That may not be coincidence.
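
For completeness, the same sanity check as a few lines of Python (pure arithmetic on the paper's numbers; no model files involved):

```python
# Reproduces the sanity check above, using decimal GB (1e9 bytes).
GB = 1e9
active_params = 49e9    # active per token, per the paper
total_params  = 1.6e12  # total expert pool, per the paper

print(f"active @ 4-bit: {active_params * 0.5 / GB:.1f} GB")   # ~24.5 GB
print(f"total  @ 4-bit: {total_params * 0.5 / GB:.0f} GB")    # ~800 GB
print(f"837 GB vs raw : {837e9 / (total_params * 0.5) - 1:.1%} over")  # ~4.6%
```

The observed 837 GB sits only about 4.6% above the raw 4-bit payload of the full pool, which is the right order of magnitude for quantization metadata plus a handful of tensors kept at higher precision (the exact overhead depends on group size and which tensors stay unquantized).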

My interpretation

Very plausible:

  • 837 GB 4-bit may approximate quantized storage of nearly the full 1.6T expert pool
  • 247 GB 8-bit may be a compressed inference-oriented representation

That would explain the paradox.

This actually aligns better than I expected.

The 837 GB figure may be evidence that the “4-bit” release is more complete, not “more compressed.”

Interesting side note: the smaller Flash models behave normally (4-bit < 8-bit), which strengthens the idea that Pro is special due to its MoE packaging.

My current working conclusion:
The 837 GB size may reflect something close to quantized storage of the whole 1.6T model.
