LLM Surgery Dark Arts: GPT-OSS 40B (64 Experts, Active 8) · BnB 4-bit
| | |
|---|---|
| This edition | BnB 4-bit · 64 experts · top-8 · 24 layers · ~40B total · ~7.2B active |
| Base model | `unsloth/gpt-oss-20b-unsloth-bnb-4bit`: 32 experts · top-4 · 24 layers · ~21B |
| Status | Init checkpoint, mathematically identical to gpt-oss-20b; requires fine-tuning |
All Editions
| Edition | Experts | Active | Format | ~Params | Hub |
|---|---|---|---|---|---|
| 40B-64a8 | 64 | 8 | MXFP4 | 40B | gpt-oss-40b-64a8 |
| 60B-96a12 | 96 | 12 | MXFP4 | 60B | gpt-oss-60b-96a12 |
| 40B-64a8 | 64 | 8 | BnB 4-bit | 40B | gpt-oss-40b-64a8-init |
| 60B-96a12 | 96 | 12 | BnB 4-bit | 60B | gpt-oss-60b-96a12-init |
Init checkpoints, mathematically identical to gpt-oss-20b at initialization. Designed as expanded-capacity bases for fine-tuning on complex reasoning and domain specialization tasks.
1. Context
The Bottleneck
gpt-oss-20b achieves remarkable performance from a compact architecture: 32 experts, 4 active per token, 24 layers. For general use this is sufficient. For research labs pushing domain-specific reasoning (mathematical competition, complex code generation, scientific inference), the expert pool becomes the limiting factor. The hidden dimension and depth are adequate; the number of distinct expert combinations the router can assign is not.
Width Over Depth
Adding layers increases KV cache consumption linearly, directly reducing throughput and maximum context length. Adding experts with the same layer count leaves the KV cache untouched.
- Depth (more layers): linear KV cost, modest capacity gain
- Width (more experts, same layers): zero KV impact, exponential routing growth
These models expand the expert pool while preserving the 24-layer architecture exactly. Attention, positional encoding, vocabulary, sliding window: all unchanged. Only the MoE routing space grows.
The quality of the resulting fine-tuned model depends on dataset preparation and training strategy, specifically on ensuring the expanded expert pool is utilized diversely rather than collapsing back to redundant configurations.
Train Wide, Deploy Narrow
The 64a8 and 96a12 configurations activate more experts per token than the original (8 or 12 vs 4). This is intentional for training: more active experts means each token provides gradient signal to more parameters, accelerating diversification.
After training, the active count can be reduced via router bias adjustment to recover gpt-oss-20b-class throughput while retaining the broader expert pool. Train wide, deploy narrow.
2. Mathematical Foundation: Silent Init
Principle
The expert pool is expanded by factor M (x2 for 64a8, x3 for 96a12). Each new expert is initialized from an existing one. The router is expanded with duplicated structure. The active expert count scales by the same factor M.
Under softmax normalization, the multiplicity cancels exactly.
Proof
Standard MoE forward pass:
FFN(h) = Σ_{i ∈ top-k(s)} p_i(h) · E_i(h)
where s_i = W_r[i] · h + b_r[i] (router logits)
p_i = exp(s_i) / Σ_{j∈S} exp(s_j) (softmax over selected set S)
After expansion with multiplier M (expert count E' = M·E, active count k' = M·k), top-k' selects exactly M copies of each original top-k expert. Within the softmax:
Σ_{copies of i} p'_copy = M · exp(s_i) / (M · Σ_{orig top-k} exp(s_j)) = p_i
Factor M in numerator and denominator cancels. The weighted expert outputs sum identically.
No approximation. No numerical error beyond floating-point identity.
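The cancellation can be checked numerically with a scalar toy model (plain NumPy; the expert outputs here are stand-in scalars rather than real FFNs):

```python
import numpy as np

def moe_output(logits, expert_outs, k):
    """Top-k selection + softmax over the selected set, as in the proof above."""
    top = np.argsort(logits)[-k:]          # indices of the k largest router logits
    w = np.exp(logits[top])
    w /= w.sum()                           # p_i = exp(s_i) / sum_{j in S} exp(s_j)
    return float(w @ expert_outs[top])

rng = np.random.default_rng(0)
E, k, M = 32, 4, 2                         # original pool, active count, multiplier
logits = rng.normal(size=E)                # distinct logits (ties have measure zero)
expert_outs = rng.normal(size=E)           # scalar stand-ins for E_i(h)

# Silent init: tile router logits and expert outputs M times, scale k by M.
# Top-(M*k) then selects exactly the M copies of each original top-k expert.
out_orig = moe_output(logits, expert_outs, k)
out_wide = moe_output(np.tile(logits, M), np.tile(expert_outs, M), M * k)

assert abs(out_orig - out_wide) < 1e-12    # multiplicity cancels exactly
```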
3. Combinatorial Routing Analysis
The number of distinct expert subsets per token per layer is C(E, k). This bounds the model's capacity for input-dependent specialization.
Per-Layer Configurations
| Configuration | E | k | C(E, k) | vs gpt-oss-20b |
|---|---|---|---|---|
| gpt-oss-20b | 32 | 4 | 35,960 | 1x |
| gpt-oss-120b | 128 | 4 | 10,668,000 | 297x |
| 64a8 | 64 | 8 | 4,426,165,368 | 123,086x |
| 96a12 | 96 | 12 | 6.25 x 10^14 | 1.74 x 10^10 x |
| 64a4 (post-training) | 64 | 4 | 635,376 | 17.7x |
| 96a4 (post-training) | 96 | 4 | 3,321,960 | 92x |
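The per-layer counts can be reproduced directly with Python's `math.comb`:

```python
from math import comb

base = comb(32, 4)                         # gpt-oss-20b: 35,960 subsets per layer
assert base == 35_960

for name, E, k in [("gpt-oss-120b", 128, 4), ("64a8", 64, 8),
                   ("96a12", 96, 12), ("64a4", 64, 4), ("96a4", 96, 4)]:
    c = comb(E, k)
    print(f"{name:12s} C({E},{k}) = {c:,}  ({c / base:,.1f}x)")
```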
Over L Layers
Each layer routes independently. Total configuration space over the full network: C(E, k)^L.
| Configuration | L | C(E, k)^L | Order of magnitude |
|---|---|---|---|
| gpt-oss-20b (32a4) | 24 | 35,960^24 | ~10^109 |
| gpt-oss-120b (128a4) | 36 | 10,668,000^36 | ~10^253 |
| 64a8 | 24 | (4.43 x 10^9)^24 | ~10^232 |
| 96a12 | 24 | (6.25 x 10^14)^24 | ~10^355 |
| 64a4 (reduced) | 24 | 635,376^24 | ~10^139 |
| 96a4 (reduced) | 24 | 3,321,960^24 | ~10^157 |
These represent the theoretical configuration ceiling the optimizer can explore during fine-tuning. Practical utilization depends on dataset diversity and training strategy.
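The orders of magnitude in the table follow from `L * log10(C(E, k))`:

```python
from math import comb, log10

# Exponent of the full-network routing space C(E, k)**L, per configuration.
for name, E, k, L in [("32a4", 32, 4, 24), ("128a4", 128, 4, 36),
                      ("64a8", 64, 8, 24), ("96a12", 96, 12, 24),
                      ("64a4", 64, 4, 24), ("96a4", 96, 4, 24)]:
    print(f"{name:6s} ~10^{L * log10(comb(E, k)):.0f}")
```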
Interpretation
Activating 8 or 12 experts per token produces a richer per-token representation: each token is processed through more specialized views simultaneously. This is particularly relevant for tasks with high nonlinear reasoning demands.
Even after reduction to top-4 routing, the expanded models retain 17.7x to 92x more per-layer options than the original 20B, at equivalent inference cost.
4. Architecture
All editions share the same architecture. Only expert count, active count, and quantization format differ from gpt-oss-20b.
Unchanged: hidden_size (2880), num_hidden_layers (24), num_attention_heads (64), num_key_value_heads (8), head_dim (64), sliding_window (128), max_position_embeddings (131072), vocab_size (201088), RoPE (YaRN, factor 32).
Changed:
| | gpt-oss-20b | 64a8 editions | 96a12 editions |
|---|---|---|---|
| `num_local_experts` | 32 | 64 | 96 |
| `experts_per_token` | 4 | 8 | 12 |
Quantization format:
- BnB editions: Expert MLP weights as individual `Linear4bit` modules (BitsAndBytes NF4). Attention, router, embeddings, norms in BF16. Each expert is a distinct module, directly addressable for LoRA or freezing.
- MXFP4 editions: Expert MLP weights in MXFP4 (fused `GptOssExperts` packed tensors). Attention, router, embeddings, norms in BF16.
5. Usage
5.1 Choosing an Edition
| | BnB 4-bit editions | MXFP4 editions |
|---|---|---|
| LoRA / QLoRA on individual experts | native | custom implementation required |
| Full-parameter expert training | supported | supported (selective gradient control) |
| Unsloth `FastLanguageModel` | direct | not available |
| vLLM / SGLang serving | conversion needed | native (tested, identical to gpt-oss MXFP4) |
| EAGLE3 speculative decoding | compatible | compatible |
The BnB editions decompose each expert into separate Linear4bit modules. Standard LoRA applies to any expert projection. This is the simpler path for most fine-tuning workflows.
The MXFP4 editions use GptOssExperts, a fused packed module. LoRA applies to attention projections; expert-level adaptation requires either full-parameter training with gradient control, or custom LoRA adapters operating on the packed tensor structure.
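For the BnB editions, the set of per-expert LoRA targets can be enumerated programmatically. Both the module path pattern and the projection names below are assumptions for illustration, not confirmed checkpoint structure; verify against your model with `model.named_modules()` before use:

```python
# Hypothetical helper: build LoRA target-module names for the expanded experts
# (indices 32-63) of the 64a8 BnB edition. The path pattern
# "model.layers.{L}.mlp.experts.{i}.{proj}" and the projection names are
# assumptions -- confirm via model.named_modules().
def expert_lora_targets(num_layers=24, expert_ids=range(32, 64),
                        projs=("gate_proj", "up_proj", "down_proj")):
    return [f"model.layers.{layer}.mlp.experts.{i}.{proj}"
            for layer in range(num_layers) for i in expert_ids for proj in projs]

targets = expert_lora_targets()
assert len(targets) == 24 * 32 * 3         # 2,304 projections to adapt
```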
5.2 QLoRA via Unsloth
This edition is compatible with Unsloth's GPT-OSS support. Refer to the official Unsloth guide for setup, model loading, and LoRA configuration, then replace the model name with the appropriate BnB edition:
- `khoinguyenbk/llm-surgery-dark-arts-gpt-oss-40b-64a8-init` (64a8)
- `khoinguyenbk/llm-surgery-dark-arts-gpt-oss-60b-96a12-init` (96a12)
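A minimal loading sketch following the shape of Unsloth's published GPT-OSS examples; the sequence length is a placeholder, not a recommendation:

```python
from unsloth import FastLanguageModel

# Load the 64a8 BnB init checkpoint; expert weights are already NF4-quantized.
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="khoinguyenbk/llm-surgery-dark-arts-gpt-oss-40b-64a8-init",
    max_seq_length=4096,       # placeholder; the base supports up to 131072
    load_in_4bit=True,
)
```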
5.3 MXFP4 Editions
For vLLM-native serving or Transformers + TRL fine-tuning with MXFP4 format, see the MXFP4 editions:
- `gpt-oss-40b-64a8` (64a8)
- `gpt-oss-60b-96a12` (96a12)
6. Training Considerations
Expert Preservation and Symmetry Breaking
Expanded experts begin in a symmetric state. Fine-tuning naturally breaks symmetry through stochastic gradient updates, but the process can be guided.
Original expert weights encode the full pretrained capability of gpt-oss-20b. Preserving this knowledge base during early training is the primary concern. Selective gradient control allows original experts to serve as a stability anchor while expanded experts diverge and specialize.
The router must remain trainable throughout. It is the mechanism through which diversification manifests.
In the BnB editions, selective freezing is straightforward: each expert lives at model.model.layers[L].mlp.experts[i] as a distinct module with its own parameters. Freezing original experts (indices 0-31) and training expanded experts (32-63) requires only setting requires_grad or applying gradient hooks per module.
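The pattern can be illustrated on a toy stand-in: a `ModuleList` of plain `Linear` layers in place of the real `Linear4bit` experts, mirroring the per-expert layout at `model.model.layers[L].mlp.experts[i]` described above:

```python
import torch.nn as nn

# Toy stand-in for one MoE layer of the 64a8 BnB edition: 64 per-expert modules.
experts = nn.ModuleList(nn.Linear(8, 8) for _ in range(64))

# Freeze originals (0-31), leave expanded copies (32-63) trainable.
# The router (not shown) must stay trainable throughout.
for i, expert in enumerate(experts):
    for p in expert.parameters():
        p.requires_grad = (i >= 32)

trainable = sum(p.numel() for p in experts.parameters() if p.requires_grad)
total = sum(p.numel() for p in experts.parameters())
assert trainable * 2 == total              # exactly half the expert params train
```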
Diversification Approaches
Several approaches to accelerating symmetry breaking, applicable individually or in combination:
- Gradient-based selective freezing: protect originals, train expanded experts
- Noise injection on expanded expert weights: seed divergence before training begins
- Router-aware scheduling: controlling which experts receive gradient signal at which stages
- Expert-specific regularization: auxiliary terms encouraging divergence
The appropriate combination depends on the target domain, dataset composition, and available compute.
Staged Training
The general principle: protect, diversify, refine.
Early stages prioritize stability of original knowledge. Middle stages allow broad divergence at the new experts. Late stages consolidate specialization with global refinement at reduced learning rate.
The attention layers in gpt-oss-20b are well-converged. Aggressive fine-tuning of attention carries degradation risk. Minimal learning rate or full freeze on attention is worth considering, depending on domain distance from the pretraining distribution.
Stage boundaries, learning rate schedules, and freeze transitions are task-dependent.
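One way to make the schedule concrete is a stage table of per-group learning rates. All values below are illustrative placeholders, not tuned recommendations; `0.0` denotes a frozen group at that stage:

```python
# protect -> diversify -> refine, as per-parameter-group learning rates.
STAGES = [
    {"stage": "protect",   "original_experts": 0.0,  "expanded_experts": 2e-4,
     "router": 2e-4, "attention": 0.0},
    {"stage": "diversify", "original_experts": 1e-5, "expanded_experts": 2e-4,
     "router": 1e-4, "attention": 0.0},
    {"stage": "refine",    "original_experts": 2e-5, "expanded_experts": 2e-5,
     "router": 2e-5, "attention": 1e-6},
]
```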
Active Expert Reduction
Post-diversification, the active count can be reduced (64a8 to 64a4, 96a12 to 96a4) via router bias adjustment. This recovers gpt-oss-20b inference throughput while retaining 17.7x to 92x more routing options.
The 64a8 and 96a12 configurations are designed for high nonlinear reasoning capacity. Reduction is optional. For competition-grade mathematics, multi-step code generation, and adversarial reasoning, the full active count is recommended.
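At the config level, the reduction itself is a single field change (field name `experts_per_token` as listed in section 4; the accompanying router-bias rebalancing is checkpoint-specific and not shown here):

```python
import json

# Hedged sketch: edit the saved config of a fine-tuned 64a8 checkpoint
# to activate 4 experts per token instead of 8 (64a8 -> 64a4).
cfg = {"num_local_experts": 64, "experts_per_token": 8}   # relevant fields only
cfg["experts_per_token"] = 4
print(json.dumps(cfg))
```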
7. Prerequisites
| | BnB editions | MXFP4 editions |
|---|---|---|
| `transformers` | >= 4.55.0 | >= 4.55.0 |
| `bitsandbytes` | required | not needed |
| `triton` | not needed | >= 3.4.0 (for `dequantize=False`) |
| `kernels` | not needed | required (for `dequantize=False`) |
| `peft` | >= 0.17.0 | >= 0.17.0 |
| `trl` | >= 0.20.0 | >= 0.20.0 |
| Response format | Harmony | Harmony |
| GPU (training) | >= 48GB (QLoRA) | >= 80GB (full-param experts) |
8. Lineage
openai/gpt-oss-20b 32 experts · top-4 · 24L · ~21B
|
+--> unsloth/gpt-oss-20b-unsloth-bnb-4bit (BnB 4-bit quantization)
| |
| +--> x2 (64a8) BnB 64 experts · top-8 · 24L · ~40B
| +--> x3 (96a12) BnB 96 experts · top-12 · 24L · ~60B
|
+--> x2 (64a8) MXFP4 64 experts · top-8 · 24L · ~40B
+--> x3 (96a12) MXFP4 96 experts · top-12 · 24L · ~60B
For comparison: openai/gpt-oss-120b 128 experts · top-4 · 36L · ~117B
License
Apache 2.0, inherited from openai/gpt-oss-20b
References
- Unsloth: How to Run and Fine-Tune GPT-OSS
- OpenAI GPT-OSS Fine-Tuning Cookbook (Transformers)
- unsloth/gpt-oss-20b-unsloth-bnb-4bit
- openai/gpt-oss-20b · openai/gpt-oss-120b
- GPT-OSS Model Card, arXiv:2508.10925
- Harmony Response Format
- TRL · PEFT
Citation
@misc{llmsurgery2026gptoss,
title={LLM Surgery Dark Arts: Silent-Init Expert Expansion for GPT-OSS-20B},
author={Unnamed AI Lab},
year={2026},
note={Expert-expanded MoE init checkpoints: 64a8 and 96a12 configurations},
url={https://huggingface.co/khoinguyenbk/llm-surgery-dark-arts-gpt-oss-40b-64a8-init}
}