Gemma 4 A4B 98-Expert v3 (20.8B)

Expert-pruned version of google/gemma-4-26B-A4B-it, reduced from 128 to 98 experts per MoE layer using a contribution-weighted importance map aggregated across all task categories (math, logic, code, science, creative).

| | Original (128e) | 109e v3 | This model (98e v3) |
|---|---|---|---|
| Total params | 26B | 22.4B | ~20.8B |
| Experts per layer | 128 | 109 | 98 |
| Experts dropped | 0 | 19/layer | 30/layer |
| MoE capacity removed | — | 14.8% | 23.4% |
| Top-k routing | 8 | 8 | 8 |
| GPQA Diamond (Q6_K) | 75.25% | 71.72% | 75.25% |

Zero GPQA degradation despite dropping 30 experts per layer (23.4% of MoE capacity). This model matches the full 128-expert reference exactly on GPQA Diamond.

GGUF quantizations available at ManniX-ITA/gemma-4-A4B-98e-v3-it-GGUF — includes standard Bartowski quants + ContribDynamic (CD) per-layer quants.

Pruning Method

Contribution-Weighted Expert Analysis

The drop map is derived from expert_neuron_v4.json — a comprehensive per-expert contribution analysis across all task categories (math, logic, code, science, creative) using 128-token teacher-force analysis on the full 128-expert reference model.

Process (scripts/expert_drop.py):

  1. Contribution scoring: For each expert in each layer, the total contribution (tc field) is computed as the sum of weighted output norms across all task categories.
  2. Per-layer ranking: Experts are ranked by total contribution within each layer.
  3. Drop decision: The 30 lowest-contributing experts per layer are dropped (128 → 98).
  4. Router resize: The MoE router proj.weight is resized from [128, hidden] to [98, hidden], keeping only rows for retained experts. The top-8 routing naturally adapts.
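The four steps above can be sketched as a per-layer pruning function. This is an illustrative toy with random tensors, not the actual `scripts/expert_drop.py`; the shapes follow the architecture section (128 experts, hidden size 2816, expert intermediate size 704), and the `prune_layer` name is hypothetical:

```python
import numpy as np

N_EXPERTS, KEEP, HIDDEN, EXPERT_FF = 128, 98, 2816, 704

def prune_layer(contrib, router_weight, expert_weights):
    """Keep the KEEP highest-contribution experts of one MoE layer.

    contrib:        (128,) total-contribution score per expert (the `tc` field)
    router_weight:  (128, HIDDEN) router projection matrix
    expert_weights: list of 128 per-expert weight tensors
    """
    # Rank experts by total contribution, keep the top 98, preserve index order
    keep = np.sort(np.argsort(contrib)[::-1][:KEEP])
    new_router = router_weight[keep]                  # resized to (98, HIDDEN)
    new_experts = [expert_weights[i] for i in keep]   # retained expert tensors
    return keep, new_router, new_experts

# Toy demo with random data
rng = np.random.default_rng(0)
contrib = rng.random(N_EXPERTS)
router = rng.standard_normal((N_EXPERTS, HIDDEN))
experts = [rng.standard_normal((EXPERT_FF, HIDDEN)) for _ in range(N_EXPERTS)]
keep, new_router, new_experts = prune_layer(contrib, router, experts)
print(new_router.shape)  # (98, 2816)
```

Top-8 routing needs no retraining in this scheme: the router softmax is taken over whatever logits remain, so it renormalizes over the 98 surviving rows automatically.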

Why 98e Works Better Than 109e

The 98e v3 model uses a different importance map than 109e v3:

  • 109e v3: Uses a per-question top-16 protection scheme on GPQA Diamond specifically (teacher-force analysis, 196 questions). This over-fits the drop map to GPQA.
  • 98e v3: Uses expert_neuron_v4.json which aggregates contribution across all task categories with 128-token analysis windows. This produces a more generalizable importance ranking.

The result: 98e v3 matches the 128e reference (75.25%) while dropping 30 experts, whereas 109e v3 drops only 19 experts but loses 3.53 pp. The broader importance map makes better pruning decisions.

Key Findings

  • Experts are NOT topic-specialized: Top-32 overlap is 28/32 between math and creative domains. The same experts are important across tasks.
  • Contribution is moderately concentrated: Gini coefficient ~0.38. You need ~75 experts per layer for 80% of the contribution — the bottom 30 carry very little signal.
  • Expert weight similarity is near zero: Max cosine similarity ~0.05 between expert weight matrices. Merging experts by averaging destroys the model. Expert dropping is the only viable structural compression.
  • Early layers matter most: Layer 0 has highest importance (1.0), layers 28-29 are lowest (~0.04-0.05). But the drop is applied uniformly across layers.
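The concentration claims above can be checked with a few lines. The score vectors here are synthetic sanity inputs (the real per-expert scores come from `expert_neuron_v4.json`), and both helper names are illustrative:

```python
import numpy as np

def gini(x):
    """Gini coefficient of a non-negative score vector (0 = even, ->1 = concentrated)."""
    x = np.sort(np.asarray(x, dtype=float))
    n = len(x)
    cum = np.cumsum(x)
    return (n + 1 - 2 * cum.sum() / cum[-1]) / n

def experts_for_share(x, share=0.80):
    """Number of top-ranked experts needed to cover `share` of total contribution."""
    x = np.sort(np.asarray(x, dtype=float))[::-1]
    return int(np.searchsorted(np.cumsum(x) / x.sum(), share) + 1)

# Sanity checks on the metrics themselves
uniform = np.ones(128)                 # perfectly even contribution
onehot = np.zeros(128); onehot[0] = 1  # one expert carries everything
print(gini(uniform))                   # 0.0
print(experts_for_share(uniform))      # 103 of 128 needed for 80%
```

With a real per-layer score vector, a Gini of ~0.38 and an 80%-coverage count of ~75 experts are what justify dropping the bottom 30.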

GPQA Diamond Evaluation

Setup (identical methodology to 109e v3)

  • Quantization: GGUF Q6_K via llama.cpp llama-quantize (imatrix calibration)
  • Inference: llama.cpp llama-server (OpenAI-compatible API)
  • Evaluation: lm-evaluation-harness, task gpqa_diamond_cot_zeroshot
  • GPU: NVIDIA RTX 3090 (24 GB)

Configuration

| Parameter | Value |
|---|---|
| Context size | 32768 tokens |
| Reasoning format | deepseek |
| Reasoning budget | 8192 tokens |
| Temperature | 1.0 (Gemma 4 official) |
| top_p | 0.95 |
| top_k | 64 |
| DRY multiplier | 0.5 |
| Tokenizer | google/gemma-4-26B-A4B-it (original) |

Results

| Model | Experts/Layer | GPQA Diamond (flex) | Delta vs 128e |
|---|---|---|---|
| gemma-4-26B-A4B-it (128e ref) | 128 | 75.25% | — |
| gemma-4-A4B-98e-v3-it (this) | 98 | 75.25% | +0.00 pp |
| gemma-4-A4B-109e-v3-it | 109 | 71.72% | -3.53 pp |
| gemma-4-E4B (small Gemma 4) | — | 57.07% | -18.18 pp |

Architecture

Unchanged from the original except num_experts: 98 (was 128):

  • Layers: 30
  • Hidden size: 2816
  • Expert intermediate size: 704 per expert
  • Dense MLP intermediate size: 2112 (always active)
  • Top-k routing: 8 (of 98 available)
  • Attention: Hybrid sliding (5) + global (1) pattern, head_dim=512
  • Vocabulary: 262,144
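A back-of-envelope check of the ~20.8B figure, using the numbers in the list above and assuming three projection matrices (gate/up/down) per expert. The exact total also includes attention, embeddings, dense MLPs, and router rows, so this is only a rough cross-check against the nominal 26B:

```python
LAYERS, HIDDEN, EXPERT_FF, DROPPED = 30, 2816, 704, 30

per_expert = 3 * HIDDEN * EXPERT_FF       # ~5.95M params per expert (gate/up/down)
removed = DROPPED * per_expert * LAYERS   # params removed across all 30 layers
print(f"removed ~ {removed / 1e9:.2f}B")                  # ~5.35B
print(f"pruned total ~ {(26e9 - removed) / 1e9:.1f}B")    # ~20.6B
```

~20.6B from this rough subtraction is consistent with the reported ~20.8B once the non-expert components are counted precisely.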

Files

  • config.json — Model config with num_experts: 98
  • model-0000N-of-00009.safetensors — Model weights (bf16)
  • expert_drop.py — Deterministic expert pruning script

Usage

Transformers

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model = AutoModelForCausalLM.from_pretrained(
    "ManniX-ITA/gemma-4-A4B-98e-v3-it",
    torch_dtype=torch.bfloat16,
    device_map="auto",
    attn_implementation="eager",  # Gemma 4 head_dim=512 not supported by FA2
)
tok = AutoTokenizer.from_pretrained("ManniX-ITA/gemma-4-A4B-98e-v3-it")

msgs = [{"role": "user", "content": "Explain the Heisenberg uncertainty principle."}]
inputs = tok.apply_chat_template(msgs, return_tensors="pt", add_generation_prompt=True).to(model.device)
out = model.generate(inputs, max_new_tokens=512, do_sample=True, temperature=1.0, top_p=0.95, top_k=64)
print(tok.decode(out[0][inputs.shape[1]:], skip_special_tokens=True))

llama.cpp (recommended)

llama-server -m gemma-4-A4B-98e-v3-it-Q6_K.gguf \
    --port 8099 -c 32768 -ngl 99 --no-warmup \
    --reasoning-format deepseek --reasoning-budget 8192 \
    --temp 1.0 --top-p 0.95 --top-k 64 --dry-multiplier 0.5

GGUF quants: ManniX-ITA/gemma-4-A4B-98e-v3-it-GGUF

Related Models

| Model | Description |
|---|---|
| gemma-4-A4B-109e-v3-it | 109 experts (19 dropped), clean teacher-force map |
| gemma-4-A4B-109e-v3-it-GGUF | GGUF quants for 109e v3 |
| gemma-4-A4B-98e-v3-it-GGUF | GGUF quants for this model |

License

This model inherits the Gemma license from the base model.

Acknowledgements

  • Google for the base Gemma 4 26B-A4B-it model
  • The GPQA Diamond benchmark (Rein et al., 2023)
  • bartowski for the calibration data v5 used in imatrix GGUF quantization