Why does GGUF conversion add a third linear weight to the MoE FFN?
The official gpt-oss implementation uses 2 linear weights per expert, but the GGUF file has 3. The dimensions and total parameter counts also differ significantly between the two.
Questions:
- How does the GGUF implementation produce equivalent results with a different architecture and fewer parameters?
- How are the original 2-linear weights mapped to the 3-linear GGUF structure?
Let's go through the two implementations; for simplicity, we will ignore the router and the biases of each linear layer.
Official gpt-oss implementation: The official implementation has two linear weights per layer, for a total of 1,061,683,200 FFN parameters:
- `self.mlp1_weight` with shape `(32, 5760, 2880)`
- `self.mlp2_weight` with shape `(32, 2880, 5760)`
Here, `self.mlp1_weight` projects each expert's input up and `self.mlp2_weight` projects it back down.
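Taken at face value, these two weights describe a plain up/down expert FFN. Below is a minimal NumPy sketch with toy dimensions in place of the real ones (32 experts, 2880 hidden, 5760 intermediate), so it runs without allocating gigabytes of weights; the SiLU activation is a stand-in, not necessarily what gpt-oss actually uses:

```python
import numpy as np

# Toy sizes standing in for the real (32, 5760, 2880) / (32, 2880, 5760) tensors.
num_experts, hidden, inter = 4, 8, 16

rng = np.random.default_rng(0)
mlp1_weight = rng.standard_normal((num_experts, inter, hidden)).astype(np.float32)  # up
mlp2_weight = rng.standard_normal((num_experts, hidden, inter)).astype(np.float32)  # down

def silu(x):
    # Stand-in activation; the actual gpt-oss activation may differ.
    return x / (1.0 + np.exp(-x))

def expert_ffn(x, e):
    """One token vector x of shape (hidden,) through expert e, biases ignored."""
    h = mlp1_weight[e] @ x           # up-projection:   (hidden,) -> (inter,)
    return mlp2_weight[e] @ silu(h)  # down-projection: (inter,)  -> (hidden,)

x = rng.standard_normal(hidden).astype(np.float32)
y = expert_ffn(x, 0)
assert y.shape == (hidden,)

# With the real shapes, the per-layer total matches the 1,061,683,200 quoted above:
assert 32 * 5760 * 2880 + 32 * 2880 * 5760 == 1_061_683_200
```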
GGUF implementation: The GGUF version, gpt-oss-20b-Q8_0.gguf, instead has three linear weights per layer, for a total of 796,262,400 FFN parameters:
- `ffn_down_exps.weight` with shape `(2880, 2880, 32)`
- `ffn_gate_exps.weight` with shape `(2880, 2880, 32)`
- `ffn_up_exps.weight` with shape `(2880, 2880, 32)`
According to the GGUF docs, `ffn_gate_exps.weight` is a per-expert gate layer, `ffn_up_exps.weight` projects each expert up, and `ffn_down_exps.weight` projects it back down.
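These three tensor names suggest the usual gated (SwiGLU-style) FFN wiring, `down(silu(gate(x)) * up(x))`. Note also that GGUF lists dimensions innermost-first, so `(2880, 2880, 32)` describes 32 experts of 2880×2880 matrices. A per-expert sketch with toy sizes, assuming that wiring (this is an assumption about the dataflow, not the literal llama.cpp kernel):

```python
import numpy as np

# Toy sizes standing in for the real 2880/2880/32.
num_experts, hidden, inter = 4, 8, 8

rng = np.random.default_rng(1)
ffn_gate = rng.standard_normal((num_experts, inter, hidden)).astype(np.float32)
ffn_up   = rng.standard_normal((num_experts, inter, hidden)).astype(np.float32)
ffn_down = rng.standard_normal((num_experts, hidden, inter)).astype(np.float32)

def silu(x):
    return x / (1.0 + np.exp(-x))

def expert_ffn(x, e):
    # Gated FFN: down( silu(gate @ x) * (up @ x) )
    return ffn_down[e] @ (silu(ffn_gate[e] @ x) * (ffn_up[e] @ x))

x = rng.standard_normal(hidden).astype(np.float32)
y = expert_ffn(x, 0)
assert y.shape == (hidden,)

# With the real shapes: 3 tensors of 2880*2880 per expert, 32 experts.
assert 3 * 32 * 2880 * 2880 == 796_262_400
```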
So how can the official gpt-oss implementation have two linear weights and the GGUF version three, yet produce the same output (up to quantization)? The tensors don't even have the same dimensions!
Does this have something to do with the MXFP4 quantization?
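For reference, the gap between the two quoted totals checks out arithmetically, and it happens to be exactly one 2880×2880 matrix per expert:

```python
# Per-layer totals implied by the shapes quoted above (router and biases ignored).
official = 32 * 5760 * 2880 + 32 * 2880 * 5760  # mlp1 + mlp2
gguf     = 3 * (2880 * 2880 * 32)               # gate + up + down

assert official == 1_061_683_200
assert gguf == 796_262_400
assert official - gguf == 32 * 2880 * 2880      # one 2880x2880 matrix per expert
```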