LLM Surgery Dark Arts: GPT-OSS 40B (64 Experts, Active 8) · BnB 4-bit

This edition BnB 4-bit · 64 experts · top-8 · 24 layers · ~40B total · ~7.2B active
Base model unsloth/gpt-oss-20b-unsloth-bnb-4bit : 32 experts · top-4 · 24 layers · ~21B
Status Init checkpoint, identical to gpt-oss-20b, requires fine-tuning

All Editions

Edition Experts Active Format ~Params Hub
40B-64a8 64 8 MXFP4 40B gpt-oss-40b-64a8
60B-96a12 96 12 MXFP4 60B gpt-oss-60b-96a12
40B-64a8 64 8 BnB 4-bit 40B gpt-oss-40b-64a8-init
60B-96a12 96 12 BnB 4-bit 60B gpt-oss-60b-96a12-init

Init checkpoints, mathematically identical to gpt-oss-20b at initialization. Designed as expanded-capacity bases for fine-tuning on complex reasoning and domain specialization tasks.


1. Context

The Bottleneck

gpt-oss-20b achieves remarkable performance from a compact architecture: 32 experts, 4 active per token, 24 layers. For general use this is sufficient. For research labs pushing domain-specific reasoning (mathematical competition, complex code generation, scientific inference), the expert pool becomes the limiting factor. The hidden dimension and depth are adequate; the number of distinct expert combinations the router can assign is not.

Width Over Depth

Adding layers increases KV cache consumption linearly, directly reducing throughput and maximum context length. Adding experts with the same layer count leaves the KV cache untouched.

  • Depth (more layers): linear KV cost, modest capacity gain
  • Width (more experts, same layers): zero KV impact, combinatorial growth in routing configurations

These models expand the expert pool while preserving the 24-layer architecture exactly. Attention, positional encoding, vocabulary, sliding window: all unchanged. Only the MoE routing space grows.

The quality of the resulting fine-tuned model depends on dataset preparation and training strategy, specifically on ensuring the expanded expert pool is utilized diversely rather than collapsing back to redundant configurations.

Train Wide, Deploy Narrow

The 64a8 and 96a12 configurations activate more experts per token than the original (8 or 12 vs 4). This is intentional for training: more active experts means each token provides gradient signal to more parameters, accelerating diversification.

After training, the active count can be reduced via router bias adjustment to recover gpt-oss-20b-class throughput while retaining the broader expert pool. Train wide, deploy narrow.


2. Mathematical Foundation: Silent Init

Principle

The expert pool is expanded by factor M (x2 for 64a8, x3 for 96a12). Each new expert is initialized from an existing one. The router is expanded with duplicated structure. The active expert count scales by the same factor M.

Under softmax normalization, the multiplicity cancels exactly.

Proof

Standard MoE forward pass:

FFN(h) = Σ_{i ∈ top-k(s)} p_i(h) · E_i(h)

where  s_i = W_r[i] · h + b_r[i]           (router logits)
       p_i = exp(s_i) / Σ_{j∈S} exp(s_j)   (softmax over selected set S)

After expansion with multiplier M (expert count E' = M·E, active count k' = M·k), top-k' selects exactly M copies of each original top-k expert. Within the softmax:

Σ_{copies of i} p'_copy = M · exp(s_i) / (M · Σ_{orig top-k} exp(s_j)) = p_i

Factor M in numerator and denominator cancels. The weighted expert outputs sum identically.

No approximation. No numerical error beyond floating-point identity.
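The cancellation can be checked numerically. A minimal sketch in pure Python, with scalar expert outputs standing in for full expert MLPs and a toy 4-expert / top-2 layer standing in for 32 / top-4:

```python
import math

def top_k(logits, k):
    # Indices of the k largest router logits
    return sorted(range(len(logits)), key=logits.__getitem__, reverse=True)[:k]

def moe_forward(logits, expert_outs, k):
    # FFN(h) = sum over top-k of softmax(s_i over the selected set) * E_i(h)
    sel = top_k(logits, k)
    z = sum(math.exp(logits[i]) for i in sel)
    return sum(math.exp(logits[i]) / z * expert_outs[i] for i in sel)

# Toy layer: 4 experts, top-2 (stand-ins for 32 experts / top-4)
logits = [0.3, -1.2, 0.9, 0.1]   # router logits s_i for one token
outs = [1.0, 2.0, -0.5, 3.0]     # scalar expert outputs E_i(h)

base = moe_forward(logits, outs, k=2)

# Silent init with M = 2: duplicate experts and router rows, double the active count
expanded = moe_forward(logits * 2, outs * 2, k=4)

assert math.isclose(base, expanded)  # the multiplicity cancels exactly
```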


3. Combinatorial Routing Analysis

The number of distinct expert subsets per token per layer is C(E, k). This bounds the model's capacity for input-dependent specialization.

Per-Layer Configurations

Configuration          E    k    C(E, k)          vs gpt-oss-20b
gpt-oss-20b            32   4    35,960           1x
gpt-oss-120b           128  4    10,668,000       297x
64a8                   64   8    4,426,165,368    ~123,086x
96a12                  96   12   ~6.25 x 10^14    ~1.74 x 10^10 x
64a4 (post-training)   64   4    635,376          17.7x
96a4 (post-training)   96   4    3,321,960        92x
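The per-layer counts above can be reproduced directly with Python's `math.comb`:

```python
import math

# Per-layer routing configurations C(E, k), relative to gpt-oss-20b's C(32, 4)
base = math.comb(32, 4)                       # 35,960
for name, E, k in [
    ("gpt-oss-20b", 32, 4),
    ("gpt-oss-120b", 128, 4),
    ("64a8", 64, 8),
    ("96a12", 96, 12),
    ("64a4 (post-training)", 64, 4),
    ("96a4 (post-training)", 96, 4),
]:
    c = math.comb(E, k)
    print(f"{name:22s} C({E},{k}) = {c:,}  ({c / base:,.1f}x)")
```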

Over L Layers

Each layer routes independently. Total configuration space over the full network: C(E, k)^L.

Configuration          L    C(E, k)^L            Order of magnitude
gpt-oss-20b (32a4)     24   35,960^24            ~10^109
gpt-oss-120b (128a4)   36   10,668,000^36        ~10^253
64a8                   24   (4.43 x 10^9)^24     ~10^232
96a12                  24   (6.25 x 10^14)^24    ~10^355
64a4 (reduced)         24   635,376^24           ~10^139
96a4 (reduced)         24   3,321,960^24         ~10^157

These represent the theoretical configuration ceiling the optimizer can explore during fine-tuning. Practical utilization depends on dataset diversity and training strategy.
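The orders of magnitude in the table follow from L · log10 C(E, k); a quick check:

```python
import math

def total_magnitude(E, k, L):
    # log10 of C(E, k)**L, i.e. the exponent of the total configuration space
    return L * math.log10(math.comb(E, k))

for name, E, k, L in [
    ("gpt-oss-20b (32a4)", 32, 4, 24),
    ("gpt-oss-120b (128a4)", 128, 4, 36),
    ("64a8", 64, 8, 24),
    ("96a12", 96, 12, 24),
]:
    print(f"{name:22s} ~10^{total_magnitude(E, k, L):.0f}")
```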

Interpretation

Activating 8 or 12 experts per token produces a richer per-token representation: each token is processed through more specialized views simultaneously. This is particularly relevant for tasks with high nonlinear reasoning demands.

Even after reduction to top-4 routing, the expanded models retain 17.7x to 92x more per-layer options than the original 20B, at equivalent inference cost.


4. Architecture

All editions share the same architecture. Only expert count, active count, and quantization format differ from gpt-oss-20b.

Unchanged: hidden_size (2880), num_hidden_layers (24), num_attention_heads (64), num_key_value_heads (8), head_dim (64), sliding_window (128), max_position_embeddings (131072), vocab_size (201088), RoPE (YaRN, factor 32).

Changed:

gpt-oss-20b 64a8 editions 96a12 editions
num_local_experts 32 64 96
experts_per_token 4 8 12

Quantization format:

  • BnB editions: Expert MLP weights as individual Linear4bit modules (BitsAndBytes NF4). Attention, router, embeddings, norms in BF16. Each expert is a distinct module, directly addressable for LoRA or freezing.
  • MXFP4 editions: Expert MLP weights in MXFP4 (fused GptOssExperts packed tensors). Attention, router, embeddings, norms in BF16.

5. Usage

5.1 Choosing an Edition

                                     BnB 4-bit editions   MXFP4 editions
LoRA / QLoRA on individual experts   native               custom implementation required
Full-parameter expert training       supported            supported (selective gradient control)
Unsloth FastLanguageModel            direct               not available
vLLM / SGLang serving                conversion needed    native (tested, identical to gpt-oss MXFP4)
EAGLE3 speculative decoding          compatible           compatible

The BnB editions decompose each expert into separate Linear4bit modules. Standard LoRA applies to any expert projection. Simpler path for most fine-tuning workflows.
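One way to restrict LoRA to the expanded experts is to enumerate target module names explicitly. A sketch under stated assumptions: the per-expert projection names (gate_up_proj, down_proj) and the exact module path are illustrative, and should be verified against model.named_modules() for the actual checkpoint:

```python
# Hypothetical enumeration of LoRA target modules covering only the
# expanded experts (indices 32-63) of the 64a8 edition. Verify the path
# and projection names against model.named_modules() before use.
n_layers, n_orig, n_total = 24, 32, 64
target_modules = [
    f"model.layers.{layer}.mlp.experts.{e}.{proj}"
    for layer in range(n_layers)
    for e in range(n_orig, n_total)
    for proj in ("gate_up_proj", "down_proj")
]
# e.g. pass to peft.LoraConfig(target_modules=target_modules, r=16, ...)
print(len(target_modules))  # 24 layers x 32 experts x 2 projections = 1536
```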

The MXFP4 editions use GptOssExperts, a fused packed module. LoRA applies to attention projections; expert-level adaptation requires either full-parameter training with gradient control, or custom LoRA adapters operating on the packed tensor structure.

5.2 QLoRA via Unsloth

This edition is compatible with Unsloth's GPT-OSS support. Refer to the official Unsloth guide for setup, model loading, and LoRA configuration:

Unsloth: How to Run and Fine-Tune GPT-OSS

Replace the model name with the appropriate BnB edition:

  • khoinguyenbk/llm-surgery-dark-arts-gpt-oss-40b-64a8-init (64a8)
  • khoinguyenbk/llm-surgery-dark-arts-gpt-oss-60b-96a12-init (96a12)

5.3 MXFP4 Editions

For vLLM-native serving or Transformers + TRL fine-tuning with MXFP4 format, use the MXFP4 editions listed under All Editions above (gpt-oss-40b-64a8 and gpt-oss-60b-96a12).


6. Training Considerations

Expert Preservation and Symmetry Breaking

Expanded experts begin in a symmetric state. Fine-tuning naturally breaks symmetry through stochastic gradient updates, but the process can be guided.

Original expert weights encode the full pretrained capability of gpt-oss-20b. Preserving this knowledge base during early training is the primary concern. Selective gradient control allows original experts to serve as a stability anchor while expanded experts diverge and specialize.

The router must remain trainable throughout. It is the mechanism through which diversification manifests.

In the BnB editions, selective freezing is straightforward: each expert lives at model.model.layers[L].mlp.experts[i] as a distinct module with its own parameters. Freezing original experts (indices 0-31) and training expanded experts (32-63) requires only setting requires_grad or applying gradient hooks per module.
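A minimal sketch of that freezing logic, assuming only that each expert and the router expose parameters() whose elements carry a requires_grad flag (as torch.nn.Module does); the attribute names mlp.experts / mlp.router follow the layout described above and should be checked against the actual module tree:

```python
def freeze_original_experts(mlp, n_original=32):
    """Freeze experts [0, n_original); keep expanded experts trainable.

    The router is left trainable: it is the mechanism through which
    diversification manifests.
    """
    for i, expert in enumerate(mlp.experts):
        trainable = i >= n_original          # original experts act as a stability anchor
        for p in expert.parameters():
            p.requires_grad = trainable
    for p in mlp.router.parameters():
        p.requires_grad = True
```

Applied per layer, e.g. `for layer in model.model.layers: freeze_original_experts(layer.mlp)`.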

Diversification Approaches

Several approaches to accelerating symmetry breaking, applicable individually or in combination:

  • Gradient-based selective freezing: protect originals, train expanded experts
  • Noise injection on expanded expert weights: seed divergence before training begins
  • Router-aware scheduling: controlling which experts receive gradient signal at which stages
  • Expert-specific regularization: auxiliary terms encouraging divergence

The appropriate combination depends on the target domain, dataset composition, and available compute.
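As an illustration of the noise-injection approach, a sketch that perturbs expanded expert weights with small Gaussian noise. Flat float lists stand in for real weight tensors, and the scale is a hypothetical starting point to be tuned so the perturbation stays small relative to the weight magnitudes:

```python
import random

def inject_noise(weights, scale=1e-4, seed=0):
    # Add small Gaussian noise to seed divergence among duplicated experts.
    # Note: this deliberately breaks the silent-init identity, so apply it
    # only to the expanded copies, and only before training begins.
    rng = random.Random(seed)
    return [w + rng.gauss(0.0, scale) for w in weights]
```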

Staged Training

The general principle: protect, diversify, refine.

Early stages prioritize stability of original knowledge. Middle stages allow broad divergence at the new experts. Late stages consolidate specialization with global refinement at reduced learning rate.

The attention layers in gpt-oss-20b are well-converged. Aggressive fine-tuning of attention carries degradation risk. Minimal learning rate or full freeze on attention is worth considering, depending on domain distance from the pretraining distribution.

Stage boundaries, learning rate schedules, and freeze transitions are task-dependent.

Active Expert Reduction

Post-diversification, the active count can be reduced (64a8 to 64a4, 96a12 to 96a4) via router bias adjustment. This recovers gpt-oss-20b inference throughput while retaining 17.7x to 92x more routing options.

The 64a8 and 96a12 configurations are designed for high nonlinear reasoning capacity. Reduction is optional. For competition-grade mathematics, multi-step code generation, and adversarial reasoning, the full active count is recommended.
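Functionally, the reduction is just routing with a smaller k: fewer experts are selected and the softmax is renormalized over the smaller set. A toy sketch (16 logits standing in for 64; the router-bias adjustment itself is omitted):

```python
import math

def route(logits, k):
    # Top-k selection with softmax renormalized over the selected set
    sel = sorted(range(len(logits)), key=logits.__getitem__, reverse=True)[:k]
    z = sum(math.exp(logits[i]) for i in sel)
    return {i: math.exp(logits[i]) / z for i in sel}

logits = [0.9, 0.1, -0.3, 1.2, 0.4, -1.0, 0.7, 0.2,
          -0.5, 1.5, 0.0, 0.8, -0.2, 0.6, 1.1, -0.8]

p8 = route(logits, 8)   # train-time routing (wide)
p4 = route(logits, 4)   # deploy-time routing (narrow)

assert set(p4) <= set(p8)                     # narrow set is a subset of the wide set
assert math.isclose(sum(p4.values()), 1.0)    # weights renormalize to 1
```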


7. Prerequisites

                   BnB editions      MXFP4 editions
transformers       >= 4.55.0         >= 4.55.0
bitsandbytes       required          not needed
triton             not needed        >= 3.4.0 (for dequantize=False)
kernels            not needed        required (for dequantize=False)
peft               >= 0.17.0         >= 0.17.0
trl                >= 0.20.0         >= 0.20.0
Response format    Harmony           Harmony
GPU (training)     >= 48GB (QLoRA)   >= 80GB (full-param experts)

8. Lineage

openai/gpt-oss-20b              32 experts · top-4 · 24L · ~21B
    |
    +--> unsloth/gpt-oss-20b-unsloth-bnb-4bit      (BnB 4-bit quantization)
    |       |
    |       +--> x2 (64a8) BnB                      64 experts · top-8 · 24L · ~40B
    |       +--> x3 (96a12) BnB                     96 experts · top-12 · 24L · ~60B
    |
    +--> x2 (64a8) MXFP4                            64 experts · top-8 · 24L · ~40B
    +--> x3 (96a12) MXFP4                           96 experts · top-12 · 24L · ~60B

openai/gpt-oss-120b (for comparison)    128 experts · top-4 · 36L · ~117B

License

Apache 2.0, inherited from openai/gpt-oss-20b

Citation

@misc{llmsurgery2026gptoss,
  title={LLM Surgery Dark Arts: Silent-Init Expert Expansion for GPT-OSS-20B},
  author={Unnamed AI Lab},
  year={2026},
  note={Expert-expanded MoE init checkpoints: 64a8 and 96a12 configurations},
  url={https://huggingface.co/khoinguyenbk/llm-surgery-dark-arts-gpt-oss-40b-64a8-init}
}