On the parameter counts and VRAM requirements

#2
by PierreLepagnol - opened

Thanks for the great work here!

I had a question about this claim: "GPT-OSS-20B fits in 1.5GB yet matches Champions-league dense models requiring 8.5GB. MoE architecture is the edge AI game-changer."

I may be misunderstanding it, but I'm confused by the 1.5GB figure. For GPT-OSS-20B (21B total parameters, 3.6B active per token), there's an important distinction between:

  • Total parameters stored - at 4-bit precision: 21B × 4/8 = ~10.5 GB
  • Active parameters per token - at 4-bit precision: 3.6B × 4/8 = ~1.8 GB
  • Actual runtime memory footprint - which should also include KV cache, activations, buffers, and overhead

A number around ~1.5–1.8 GB could correspond to active parameters at 4-bit precision, but not to the full memory required to load and run the model - which is published as roughly 16 GB. So it's not immediately clear how to interpret the 1.5 GB claim.
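For concreteness, the arithmetic behind these figures is just parameters times bits per parameter. A minimal sketch (the function name is illustrative, and it assumes exactly 4 bits per parameter with no quantization overhead):

```python
# Back-of-envelope weight memory: params * bits-per-param / 8 bits-per-byte.
# This covers stored weights only - not KV cache, activations, or buffers.

def weight_gb(params_billions: float, bits_per_param: float = 4.0) -> float:
    """Approximate weight storage in GB at a given quantization width."""
    return params_billions * bits_per_param / 8

print(f"total stored (21B @ 4-bit):  ~{weight_gb(21.0):.1f} GB")  # ~10.5 GB
print(f"active/token (3.6B @ 4-bit): ~{weight_gb(3.6):.1f} GB")   # ~1.8 GB
```

Neither number is the full runtime footprint, which adds KV cache and framework overhead on top of the stored weights.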

Could you clarify:

  1. How the 1.5 GB number was computed - does it refer to active expert weights only, or full inference memory?
  2. What the 8.5 GB dense-model baseline is measuring on the other side, so the comparison is apples-to-apples?

I think a short clarification would really help readers interpret the MoE advantage correctly, since "compute used per token" and "memory needed to run the model" are easy to conflate.

Thanks again!

Thank you - this is an excellent point and you're absolutely right to flag it.

The 1.5GB figure in our current presentation refers to the active expert memory per forward pass (3.6B active params at ~4-bit), not the full memory required to load and run the model. That distinction was not made clear, and I agree it's misleading as written. To be precise:

GPT-OSS-20B (MoE):

  • Total model weights (Q4): ~10.5 GB (all experts loaded)
  • Active computation per token: ~1.8 GB (3.6B active params)
  • Full runtime footprint: ~16 GB (weights + KV cache + overhead)

Gemma-3-12B (Dense):

  • Total model weights (Q4): ~6.5 GB
  • Active computation per token: ~6.5 GB (all params active)
  • Full runtime footprint: ~8.5 GB (weights + KV cache + overhead)
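To make the corrected comparison concrete, a tiny sketch of the per-token compute gap, using the active-parameter counts above (the ratio is approximate):

```python
# Active parameters per token, in billions (from the figures above).
moe_active, dense_active = 3.6, 12.0   # GPT-OSS-20B vs Gemma-3-12B

# The MoE edge is in compute per token, not in total memory.
ratio = dense_active / moe_active
print(f"~{ratio:.1f}x fewer active params per token for the MoE")  # ~3.3x
```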
The MoE advantage is real, but it lies in compute efficiency per token (3.6B vs 12B active parameters), not in total memory footprint. We conflated the two, and that's our error.

We will update:

  • The leaderboard RAM column to reflect actual runtime memory, not active-param-only estimates
  • The article and insights text to clearly distinguish "compute per token" from "memory to run"
  • The league classification for MoE models based on the corrected RAM values
The PIR formula uses the RAM (T) column, so correcting this will also shift the WCS rankings, which is the right thing to do.

Thank you for catching this. This is exactly the kind of feedback that makes the benchmark more honest; given that we literally have an Honesty axis, we should hold ourselves to the same standard.
