Why does GGUF conversion add a third linear weight to the MoE FFN?
The official gpt-oss implementation uses 2 linear weights per expert, but the GGUF file has 3. The dimensions and total parameter counts also differ significantly between the two.
Questions:
- How does the GGUF implementation produce equivalent results with a different architecture and fewer parameters?
- How are the original 2-linear weights mapped to the 3-linear GGUF structure?
Let's go through the two implementations; for simplicity, we will ignore the router and the biases of each linear layer.
Official gpt-oss implementation: The official implementation has two linear weights per layer, for a total of 1,061,683,200 FFN parameters:
- `self.mlp1_weight` with shape `(32, 5760, 2880)`
- `self.mlp2_weight` with shape `(32, 2880, 5760)`
Here, `self.mlp1_weight` projects each expert's input up and `self.mlp2_weight` projects it back down.
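Taken at face value, these two weights describe a plain up/down expert FFN. Below is a minimal NumPy sketch with toy dimensions in place of the real ones (32 experts, 2880 hidden, 5760 intermediate), so it runs without allocating gigabytes of weights; the SiLU activation is a stand-in, not necessarily what gpt-oss actually uses:

```python
import numpy as np

# Toy sizes standing in for the real (32, 5760, 2880) / (32, 2880, 5760) tensors.
num_experts, hidden, inter = 4, 8, 16

rng = np.random.default_rng(0)
mlp1_weight = rng.standard_normal((num_experts, inter, hidden)).astype(np.float32)  # up
mlp2_weight = rng.standard_normal((num_experts, hidden, inter)).astype(np.float32)  # down

def silu(x):
    # Stand-in activation; the actual gpt-oss activation may differ.
    return x / (1.0 + np.exp(-x))

def expert_ffn(x, e):
    """One token vector x of shape (hidden,) through expert e, biases ignored."""
    h = mlp1_weight[e] @ x           # up-projection:   (hidden,) -> (inter,)
    return mlp2_weight[e] @ silu(h)  # down-projection: (inter,)  -> (hidden,)

x = rng.standard_normal(hidden).astype(np.float32)
y = expert_ffn(x, 0)
assert y.shape == (hidden,)

# With the real shapes, the per-layer total matches the 1,061,683,200 quoted above:
assert 32 * 5760 * 2880 + 32 * 2880 * 5760 == 1_061_683_200
```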
GGUF implementation: The GGUF version, gpt-oss-20b-Q8_0.gguf, instead has three linear weights per layer, for a total of 796,262,400 FFN parameters:
- `ffn_down_exps.weight` with shape `(2880, 2880, 32)`
- `ffn_gate_exps.weight` with shape `(2880, 2880, 32)`
- `ffn_up_exps.weight` with shape `(2880, 2880, 32)`
According to the GGUF docs, `ffn_gate_exps.weight` is a per-expert gate layer, `ffn_up_exps.weight` projects each expert up, and `ffn_down_exps.weight` projects it back down.
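These three tensor names suggest the usual gated (SwiGLU-style) FFN wiring, `down(silu(gate(x)) * up(x))`. Note also that GGUF lists dimensions innermost-first, so `(2880, 2880, 32)` describes 32 experts of 2880×2880 matrices. A per-expert sketch with toy sizes, assuming that wiring (this is an assumption about the dataflow, not the literal llama.cpp kernel):

```python
import numpy as np

# Toy sizes standing in for the real 2880/2880/32.
num_experts, hidden, inter = 4, 8, 8

rng = np.random.default_rng(1)
ffn_gate = rng.standard_normal((num_experts, inter, hidden)).astype(np.float32)
ffn_up   = rng.standard_normal((num_experts, inter, hidden)).astype(np.float32)
ffn_down = rng.standard_normal((num_experts, hidden, inter)).astype(np.float32)

def silu(x):
    return x / (1.0 + np.exp(-x))

def expert_ffn(x, e):
    # Gated FFN: down( silu(gate @ x) * (up @ x) )
    return ffn_down[e] @ (silu(ffn_gate[e] @ x) * (ffn_up[e] @ x))

x = rng.standard_normal(hidden).astype(np.float32)
y = expert_ffn(x, 0)
assert y.shape == (hidden,)

# With the real shapes: 3 tensors of 2880*2880 per expert, 32 experts.
assert 3 * 32 * 2880 * 2880 == 796_262_400
```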
So how can the official gpt-oss implementation have two linear weights and the GGUF version three, yet produce the same output (up to quantization)? The tensors don't even have the same dimensions!
Does this have something to do with the MXFP4 quantization?
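For reference, the gap between the two quoted totals checks out arithmetically, and it happens to be exactly one 2880×2880 matrix per expert:

```python
# Per-layer totals implied by the shapes quoted above (router and biases ignored).
official = 32 * 5760 * 2880 + 32 * 2880 * 5760  # mlp1 + mlp2
gguf     = 3 * (2880 * 2880 * 32)               # gate + up + down

assert official == 1_061_683_200
assert gguf == 796_262_400
assert official - gguf == 32 * 2880 * 2880      # one 2880x2880 matrix per expert
```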