LLM Surgery Dark Arts: GPT-OSS 40B (64 Experts, Active 8) · BnB 4-bit
| | |
|---|---|
| This edition | BnB 4-bit · 64 experts · top-8 · 24 layers · ~40B total · ~7.2B active |
| Base model | `unsloth/gpt-oss-20b-unsloth-bnb-4bit`: 32 experts · top-4 · 24 layers · ~21B |
| Status | Init checkpoint, mathematically identical to gpt-oss-20b; requires fine-tuning |
All Editions
| Edition | Experts | Active | Format | ~Params | Hub |
|---|---|---|---|---|---|
| 40B-64a8 | 64 | 8 | MXFP4 | 40B | gpt-oss-40b-64a8 |
| 60B-96a12 | 96 | 12 | MXFP4 | 60B | gpt-oss-60b-96a12 |
| 40B-64a8 | 64 | 8 | BnB 4-bit | 40B | gpt-oss-40b-64a8-init |
| 60B-96a12 | 96 | 12 | BnB 4-bit | 60B | gpt-oss-60b-96a12-init |
Init checkpoints, mathematically identical to gpt-oss-20b at initialization. Designed as expanded-capacity bases for fine-tuning on complex reasoning and domain specialization tasks.
1. Context
The Bottleneck
gpt-oss-20b achieves remarkable performance from a compact architecture: 32 experts, 4 active per token, 24 layers. For general use this is sufficient. For research labs pushing domain-specific reasoning (mathematical competition, complex code generation, scientific inference), the expert pool becomes the limiting factor. The hidden dimension and depth are adequate; the number of distinct expert combinations the router can assign is not.
Width Over Depth
Adding layers increases KV cache consumption linearly, directly reducing throughput and maximum context length. Adding experts with the same layer count leaves the KV cache untouched.
- Depth (more layers): linear KV cost, modest capacity gain
- Width (more experts, same layers): zero KV impact, exponential routing growth
These models expand the expert pool while preserving the 24-layer architecture exactly. Attention, positional encoding, vocabulary, sliding window: all unchanged. Only the MoE routing space grows.
The quality of the resulting fine-tuned model depends on dataset preparation and training strategy, specifically on ensuring the expanded expert pool is utilized diversely rather than collapsing back to redundant configurations.
Train Wide, Deploy Narrow
The 64a8 and 96a12 configurations activate more experts per token than the original (8 or 12 vs 4). This is intentional for training: more active experts means each token provides gradient signal to more parameters, accelerating diversification.
After training, the active count can be reduced via router bias adjustment to recover gpt-oss-20b-class throughput while retaining the broader expert pool. Train wide, deploy narrow.
2. Mathematical Foundation: Silent Init
Principle
The expert pool is expanded by factor M (x2 for 64a8, x3 for 96a12). Each new expert is initialized from an existing one. The router is expanded with duplicated structure. The active expert count scales by the same factor M.
Under softmax normalization, the multiplicity cancels exactly.
Proof
Standard MoE forward pass:
FFN(h) = Σ_{i ∈ top-k(s)} p_i(h) · E_i(h)
where s_i = W_r[i] · h + b_r[i] (router logits)
p_i = exp(s_i) / Σ_{j∈S} exp(s_j) (softmax over selected set S)
After expansion with multiplier M (expert count E' = M·E, active count k' = M·k), top-k' selects exactly M copies of each original top-k expert. Within the softmax:
Σ_{copies of i} p'_copy = M · exp(s_i) / (M · Σ_{orig top-k} exp(s_j)) = p_i
Factor M in numerator and denominator cancels. The weighted expert outputs sum identically.
No approximation. No numerical error beyond floating-point identity.
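The cancellation can be checked numerically with a scalar toy model (plain NumPy; the expert outputs here are stand-in scalars rather than real FFNs):

```python
import numpy as np

def moe_output(logits, expert_outs, k):
    """Top-k selection + softmax over the selected set, as in the proof above."""
    top = np.argsort(logits)[-k:]          # indices of the k largest router logits
    w = np.exp(logits[top])
    w /= w.sum()                           # p_i = exp(s_i) / sum_{j in S} exp(s_j)
    return float(w @ expert_outs[top])

rng = np.random.default_rng(0)
E, k, M = 32, 4, 2                         # original pool, active count, multiplier
logits = rng.normal(size=E)                # distinct logits (ties have measure zero)
expert_outs = rng.normal(size=E)           # scalar stand-ins for E_i(h)

# Silent init: tile router logits and expert outputs M times, scale k by M.
# Top-(M*k) then selects exactly the M copies of each original top-k expert.
out_orig = moe_output(logits, expert_outs, k)
out_wide = moe_output(np.tile(logits, M), np.tile(expert_outs, M), M * k)

assert abs(out_orig - out_wide) < 1e-12    # multiplicity cancels exactly
```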
3. Combinatorial Routing Analysis
The number of distinct expert subsets per token per layer is C(E, k). This bounds the model's capacity for input-dependent specialization.
Per-Layer Configurations
| Configuration | E | k | C(E, k) | vs gpt-oss-20b |
|---|---|---|---|---|
| gpt-oss-20b | 32 | 4 | 35,960 | 1x |
| gpt-oss-120b | 128 | 4 | 10,668,000 | 297x |
| 64a8 | 64 | 8 | 4,426,165,368 | 123,086x |
| 96a12 | 96 | 12 | 6.25 x 10^14 | 1.74 x 10^10 x |
| 64a4 (post-training) | 64 | 4 | 635,376 | 17.7x |
| 96a4 (post-training) | 96 | 4 | 3,321,960 | 92x |
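The per-layer counts can be reproduced directly with Python's `math.comb`:

```python
from math import comb

base = comb(32, 4)                         # gpt-oss-20b: 35,960 subsets per layer
assert base == 35_960

for name, E, k in [("gpt-oss-120b", 128, 4), ("64a8", 64, 8),
                   ("96a12", 96, 12), ("64a4", 64, 4), ("96a4", 96, 4)]:
    c = comb(E, k)
    print(f"{name:12s} C({E},{k}) = {c:,}  ({c / base:,.1f}x)")
```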
Over L Layers
Each layer routes independently. Total configuration space over the full network: C(E, k)^L.
| Configuration | L | C(E, k)^L | Order of magnitude |
|---|---|---|---|
| gpt-oss-20b (32a4) | 24 | 35,960^24 | ~10^109 |
| gpt-oss-120b (128a4) | 36 | 10,668,000^36 | ~10^253 |
| 64a8 | 24 | (4.43 x 10^9)^24 | ~10^232 |
| 96a12 | 24 | (6.25 x 10^14)^24 | ~10^355 |
| 64a4 (reduced) | 24 | 635,376^24 | ~10^139 |
| 96a4 (reduced) | 24 | 3,321,960^24 | ~10^157 |
These represent the theoretical configuration ceiling the optimizer can explore during fine-tuning. Practical utilization depends on dataset diversity and training strategy.
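The orders of magnitude in the table follow from `L * log10(C(E, k))`:

```python
from math import comb, log10

# Exponent of the full-network routing space C(E, k)**L, per configuration.
for name, E, k, L in [("32a4", 32, 4, 24), ("128a4", 128, 4, 36),
                      ("64a8", 64, 8, 24), ("96a12", 96, 12, 24),
                      ("64a4", 64, 4, 24), ("96a4", 96, 4, 24)]:
    print(f"{name:6s} ~10^{L * log10(comb(E, k)):.0f}")
```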
Interpretation
Activating 8 or 12 experts per token produces a richer per-token representation: each token is processed through more specialized views simultaneously. This is particularly relevant for tasks with high nonlinear reasoning demands.
Even after reduction to top-4 routing, the expanded models retain 17.7x to 92x more per-layer options than the original 20B, at equivalent inference cost.
4. Architecture
All editions share the same architecture. Only expert count, active count, and quantization format differ from gpt-oss-20b.
Unchanged: hidden_size (2880), num_hidden_layers (24), num_attention_heads (64), num_key_value_heads (8), head_dim (64), sliding_window (128), max_position_embeddings (131072), vocab_size (201088), RoPE (YaRN, factor 32).
Changed:
| | gpt-oss-20b | 64a8 editions | 96a12 editions |
|---|---|---|---|
| `num_local_experts` | 32 | 64 | 96 |
| `experts_per_token` | 4 | 8 | 12 |
Quantization format:
- BnB editions: Expert MLP weights as individual `Linear4bit` modules (BitsAndBytes NF4). Attention, router, embeddings, norms in BF16. Each expert is a distinct module, directly addressable for LoRA or freezing.
- MXFP4 editions: Expert MLP weights in MXFP4 (fused `GptOssExperts` packed tensors). Attention, router, embeddings, norms in BF16.
5. Usage
5.1 Choosing an Edition
| | BnB 4-bit editions | MXFP4 editions |
|---|---|---|
| LoRA / QLoRA on individual experts | native | custom implementation required |
| Full-parameter expert training | supported | supported (selective gradient control) |
| Unsloth `FastLanguageModel` | direct | not available |
| vLLM / SGLang serving | conversion needed | native (tested, identical to gpt-oss MXFP4) |
| EAGLE3 speculative decoding | compatible | compatible |
The BnB editions decompose each expert into separate Linear4bit modules. Standard LoRA applies to any expert projection. This is the simpler path for most fine-tuning workflows.
The MXFP4 editions use GptOssExperts, a fused packed module. LoRA applies to attention projections; expert-level adaptation requires either full-parameter training with gradient control, or custom LoRA adapters operating on the packed tensor structure.
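For the BnB editions, the set of per-expert LoRA targets can be enumerated programmatically. Both the module path pattern and the projection names below are assumptions for illustration, not confirmed checkpoint structure; verify against your model with `model.named_modules()` before use:

```python
# Hypothetical helper: build LoRA target-module names for the expanded experts
# (indices 32-63) of the 64a8 BnB edition. The path pattern
# "model.layers.{L}.mlp.experts.{i}.{proj}" and the projection names are
# assumptions -- confirm via model.named_modules().
def expert_lora_targets(num_layers=24, expert_ids=range(32, 64),
                        projs=("gate_proj", "up_proj", "down_proj")):
    return [f"model.layers.{layer}.mlp.experts.{i}.{proj}"
            for layer in range(num_layers) for i in expert_ids for proj in projs]

targets = expert_lora_targets()
assert len(targets) == 24 * 32 * 3         # 2,304 projections to adapt
```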
5.2 QLoRA via Unsloth
This edition is compatible with Unsloth's GPT-OSS support. Refer to the official Unsloth guide for setup, model loading, and LoRA configuration, then replace the model name with the appropriate BnB edition:
- `khoinguyenbk/llm-surgery-dark-arts-gpt-oss-40b-64a8-init` (64a8)
- `khoinguyenbk/llm-surgery-dark-arts-gpt-oss-60b-96a12-init` (96a12)
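A minimal loading sketch following the shape of Unsloth's published GPT-OSS examples; the sequence length is a placeholder, not a recommendation:

```python
from unsloth import FastLanguageModel

# Load the 64a8 BnB init checkpoint; expert weights are already NF4-quantized.
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="khoinguyenbk/llm-surgery-dark-arts-gpt-oss-40b-64a8-init",
    max_seq_length=4096,       # placeholder; the base supports up to 131072
    load_in_4bit=True,
)
```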
5.3 MXFP4 Editions
For vLLM-native serving or Transformers + TRL fine-tuning with MXFP4 format, see the MXFP4 editions:
- `gpt-oss-40b-64a8` (64a8)
- `gpt-oss-60b-96a12` (96a12)
6. Training Considerations
Expert Preservation and Symmetry Breaking
Expanded experts begin in a symmetric state. Fine-tuning naturally breaks symmetry through stochastic gradient updates, but the process can be guided.
Original expert weights encode the full pretrained capability of gpt-oss-20b. Preserving this knowledge base during early training is the primary concern. Selective gradient control allows original experts to serve as a stability anchor while expanded experts diverge and specialize.
The router must remain trainable throughout. It is the mechanism through which diversification manifests.
In the BnB editions, selective freezing is straightforward: each expert lives at model.model.layers[L].mlp.experts[i] as a distinct module with its own parameters. Freezing original experts (indices 0-31) and training expanded experts (32-63) requires only setting requires_grad or applying gradient hooks per module.
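The pattern can be illustrated on a toy stand-in: a `ModuleList` of plain `Linear` layers in place of the real `Linear4bit` experts, mirroring the per-expert layout at `model.model.layers[L].mlp.experts[i]` described above:

```python
import torch.nn as nn

# Toy stand-in for one MoE layer of the 64a8 BnB edition: 64 per-expert modules.
experts = nn.ModuleList(nn.Linear(8, 8) for _ in range(64))

# Freeze originals (0-31), leave expanded copies (32-63) trainable.
# The router (not shown) must stay trainable throughout.
for i, expert in enumerate(experts):
    for p in expert.parameters():
        p.requires_grad = (i >= 32)

trainable = sum(p.numel() for p in experts.parameters() if p.requires_grad)
total = sum(p.numel() for p in experts.parameters())
assert trainable * 2 == total              # exactly half the expert params train
```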
Diversification Approaches
Several approaches to accelerating symmetry breaking, applicable individually or in combination:
- Gradient-based selective freezing: protect originals, train expanded experts
- Noise injection on expanded expert weights: seed divergence before training begins
- Router-aware scheduling: controlling which experts receive gradient signal at which stages
- Expert-specific regularization: auxiliary terms encouraging divergence
The appropriate combination depends on the target domain, dataset composition, and available compute.
Staged Training
The general principle: protect, diversify, refine.
Early stages prioritize stability of original knowledge. Middle stages allow broad divergence at the new experts. Late stages consolidate specialization with global refinement at reduced learning rate.
The attention layers in gpt-oss-20b are well-converged. Aggressive fine-tuning of attention carries degradation risk. Minimal learning rate or full freeze on attention is worth considering, depending on domain distance from the pretraining distribution.
Stage boundaries, learning rate schedules, and freeze transitions are task-dependent.
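One way to make the schedule concrete is a stage table of per-group learning rates. All values below are illustrative placeholders, not tuned recommendations; `0.0` denotes a frozen group at that stage:

```python
# protect -> diversify -> refine, as per-parameter-group learning rates.
STAGES = [
    {"stage": "protect",   "original_experts": 0.0,  "expanded_experts": 2e-4,
     "router": 2e-4, "attention": 0.0},
    {"stage": "diversify", "original_experts": 1e-5, "expanded_experts": 2e-4,
     "router": 1e-4, "attention": 0.0},
    {"stage": "refine",    "original_experts": 2e-5, "expanded_experts": 2e-5,
     "router": 2e-5, "attention": 1e-6},
]
```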
Active Expert Reduction
Post-diversification, the active count can be reduced (64a8 to 64a4, 96a12 to 96a4) via router bias adjustment. This recovers gpt-oss-20b inference throughput while retaining 17.7x to 92x more routing options.
The 64a8 and 96a12 configurations are designed for high nonlinear reasoning capacity. Reduction is optional. For competition-grade mathematics, multi-step code generation, and adversarial reasoning, the full active count is recommended.
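At the config level, the reduction itself is a single field change (field name `experts_per_token` as listed in section 4; the accompanying router-bias rebalancing is checkpoint-specific and not shown here):

```python
import json

# Hedged sketch: edit the saved config of a fine-tuned 64a8 checkpoint
# to activate 4 experts per token instead of 8 (64a8 -> 64a4).
cfg = {"num_local_experts": 64, "experts_per_token": 8}   # relevant fields only
cfg["experts_per_token"] = 4
print(json.dumps(cfg))
```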
7. Prerequisites
| | BnB editions | MXFP4 editions |
|---|---|---|
| `transformers` | >= 4.55.0 | >= 4.55.0 |
| `bitsandbytes` | required | not needed |
| `triton` | not needed | >= 3.4.0 (for `dequantize=False`) |
| `kernels` | not needed | required (for `dequantize=False`) |
| `peft` | >= 0.17.0 | >= 0.17.0 |
| `trl` | >= 0.20.0 | >= 0.20.0 |
| Response format | Harmony | Harmony |
| GPU (training) | >= 48GB (QLoRA) | >= 80GB (full-param experts) |
8. Lineage
openai/gpt-oss-20b 32 experts · top-4 · 24L · ~21B
|
+--> unsloth/gpt-oss-20b-unsloth-bnb-4bit (BnB 4-bit quantization)
| |
| +--> x2 (64a8) BnB 64 experts · top-8 · 24L · ~40B
| +--> x3 (96a12) BnB 96 experts · top-12 · 24L · ~60B
|
+--> x2 (64a8) MXFP4 64 experts · top-8 · 24L · ~40B
+--> x3 (96a12) MXFP4 96 experts · top-12 · 24L · ~60B
For comparison: openai/gpt-oss-120b 128 experts · top-4 · 36L · ~117B
License
Apache 2.0, inherited from openai/gpt-oss-20b
References
- Unsloth: How to Run and Fine-Tune GPT-OSS
- OpenAI GPT-OSS Fine-Tuning Cookbook (Transformers)
- unsloth/gpt-oss-20b-unsloth-bnb-4bit
- openai/gpt-oss-20b · openai/gpt-oss-120b
- GPT-OSS Model Card, arXiv:2508.10925
- Harmony Response Format
- TRL · PEFT
Citation
@misc{llmsurgery2026gptoss,
title={LLM Surgery Dark Arts: Silent-Init Expert Expansion for GPT-OSS-20B},
author={Unnamed AI Lab},
year={2026},
note={Expert-expanded MoE init checkpoints: 64a8 and 96a12 configurations},
url={https://huggingface.co/khoinguyenbk/llm-surgery-dark-arts-gpt-oss-40b-64a8-init}
}