# Gemma 4 A4B 98-Expert v3 (20.8B)
Expert-pruned version of google/gemma-4-26B-A4B-it, reduced from 128 to 98 experts per MoE layer using a contribution-weighted importance map aggregated across all task categories (math, logic, code, science, creative).
| | Original (128e) | 109e v3 | This model (98e v3) |
|---|---|---|---|
| Total params | 26B | 22.4B | ~20.8B |
| Experts per layer | 128 | 109 | 98 |
| Experts dropped | — | 19/layer | 30/layer |
| MoE capacity removed | — | 14.8% | 23.4% |
| Top-k routing | 8 | 8 | 8 |
| GPQA Diamond (Q6_K) | 75.25% | 71.72% | 75.25% |
Zero GPQA degradation despite dropping 30 experts per layer (23.4% of MoE capacity). This model matches the full 128-expert reference exactly on GPQA Diamond.
GGUF quantizations available at ManniX-ITA/gemma-4-A4B-98e-v3-it-GGUF — includes standard Bartowski quants + ContribDynamic (CD) per-layer quants.
## Pruning Method

### Contribution-Weighted Expert Analysis
The drop map is derived from `expert_neuron_v4.json` — a comprehensive per-expert contribution analysis across all task categories (math, logic, code, science, creative) using 128-token teacher-force analysis on the full 128-expert reference model.
Process (`scripts/expert_drop.py`):
- Contribution scoring: For each expert in each layer, the total contribution (the `tc` field) is computed as the sum of weighted output norms across all task categories.
- Per-layer ranking: Experts are ranked by total contribution within each layer.
- Drop decision: The 30 lowest-contributing experts per layer are dropped (128 → 98).
- Router resize: The MoE router `proj.weight` is resized from `[128, hidden]` to `[98, hidden]`, keeping only rows for retained experts. The top-8 routing naturally adapts.
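The per-layer steps above can be sketched in a few lines of NumPy. This is a minimal illustration, not the repo's actual script: `prune_layer` and the toy shapes are hypothetical, and the real pipeline operates on safetensors shards.

```python
import numpy as np

def prune_layer(router_weight, contributions, keep=98):
    """Keep the `keep` highest-contribution experts and slice the router rows.

    router_weight: [num_experts, hidden] matrix
    contributions: length-num_experts vector of total contribution (tc) scores
    """
    order = np.argsort(contributions)[::-1]   # rank experts by contribution, descending
    retained = np.sort(order[:keep])          # top-`keep` experts, original order preserved
    return router_weight[retained], retained

# Toy example: 8 experts, hidden size 4, keep the top 6
rng = np.random.default_rng(0)
w = rng.standard_normal((8, 4))
tc = np.array([0.9, 0.1, 0.5, 0.05, 0.7, 0.3, 0.2, 0.8])
new_w, kept = prune_layer(w, tc, keep=6)
print(kept)          # indices of retained experts
print(new_w.shape)   # (6, 4)
```

The same retained-index list would also be used to drop the corresponding expert weight tensors, so the router rows and the expert stack stay aligned.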
### Why 98e Works Better Than 109e
The 98e v3 model uses a different importance map than 109e v3:
- 109e v3: Uses a per-question top-16 protection scheme on GPQA Diamond specifically (teacher-force analysis, 196 questions). This overfits the drop map to GPQA.
- 98e v3: Uses `expert_neuron_v4.json`, which aggregates contribution across all task categories with 128-token analysis windows. This produces a more generalizable importance ranking.
The result: 98e v3 matches the 128e reference (75.25%) while dropping 30 experts, whereas 109e v3 drops only 19 experts but loses 3.53 pp. The broader importance map makes better pruning decisions.
## Key Findings
- Experts are NOT topic-specialized: Top-32 overlap is 28/32 between math and creative domains. The same experts are important across tasks.
- Contribution is moderately concentrated: Gini coefficient ~0.38. You need ~75 experts per layer for 80% of the contribution — the bottom 30 carry very little signal.
- Expert weight similarity is near zero: Max cosine similarity ~0.05 between expert weight matrices. Merging experts by averaging destroys the model. Expert dropping is the only viable structural compression.
- Early layers matter most: Layer 0 has highest importance (1.0), layers 28-29 are lowest (~0.04-0.05). But the drop is applied uniformly across layers.
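The concentration figure quoted above is the standard Gini coefficient over per-expert contribution scores. A minimal sketch of that computation (the `gini` helper here is illustrative, not taken from the analysis code):

```python
import numpy as np

def gini(x):
    """Gini coefficient of a non-negative 1-D array (0 = uniform, ~1 = fully concentrated)."""
    x = np.sort(np.asarray(x, dtype=float))
    n = x.size
    total = x.sum()
    # G = 2 * sum(i * x_i) / (n * sum(x)) - (n + 1) / n, with x sorted ascending, i = 1..n
    return 2 * np.sum(np.arange(1, n + 1) * x) / (n * total) - (n + 1) / n

print(gini(np.ones(128)))   # perfectly uniform contributions -> 0.0
```

A value around 0.38 sits between uniform (0) and winner-take-all (near 1), which matches the observation that the bottom 30 experts per layer carry little signal while no small subset dominates.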
## GPQA Diamond Evaluation

### Setup (identical methodology to 109e v3)

- Quantization: GGUF Q6_K via llama.cpp `llama-quantize` (imatrix calibration)
- Inference: llama.cpp `llama-server` (OpenAI-compatible API)
- Evaluation: lm-evaluation-harness, task `gpqa_diamond_cot_zeroshot`
- GPU: NVIDIA RTX 3090 (24 GB)
### Configuration
| Parameter | Value |
|---|---|
| Context size | 32768 tokens |
| Reasoning format | deepseek |
| Reasoning budget | 8192 tokens |
| Temperature | 1.0 (Gemma 4 official) |
| top_p | 0.95 |
| top_k | 64 |
| DRY multiplier | 0.5 |
| Tokenizer | google/gemma-4-26B-A4B-it (original) |
### Results
| Model | Experts/Layer | GPQA Diamond (flex) | Delta vs 128e |
|---|---|---|---|
| gemma-4-26B-A4B-it (128e ref) | 128 | 75.25% | — |
| gemma-4-A4B-98e-v3-it (this) | 98 | 75.25% | +0.00 pp |
| gemma-4-A4B-109e-v3-it | 109 | 71.72% | -3.53 pp |
| gemma-4-E4B (small Gemma 4) | — | 57.07% | -18.18 pp |
## Architecture

Unchanged from the original except `num_experts: 98` (was 128):
- Layers: 30
- Hidden size: 2816
- Expert intermediate size: 704 per expert
- Dense MLP intermediate size: 2112 (always active)
- Top-k routing: 8 (of 98 available)
- Attention: Hybrid sliding (5) + global (1) pattern, `head_dim=512`
- Vocabulary: 262,144
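As a rough sanity check on the headline sizes, the parameter reduction from dropping 30 experts per layer can be estimated from the figures above, assuming each expert is a standard gated MLP with three `[hidden x intermediate]` projections (gate, up, down); attention, dense MLP, and embedding weights are untouched by pruning, so this only bounds the delta:

```python
# Back-of-envelope check of the parameter reduction (assumes three
# [hidden x intermediate] projections per expert: gate, up, down).
hidden, inter = 2816, 704
layers, dropped = 30, 30

per_expert = 3 * hidden * inter             # params in one expert
removed = per_expert * dropped * layers     # params removed across the model
print(f"{per_expert:,} params/expert")      # 5,947,392
print(f"{removed / 1e9:.2f}B removed")      # 5.35B removed
print(f"{dropped / 128:.1%} of MoE capacity")  # 23.4% of MoE capacity
```

26B - 5.35B ≈ 20.7B, consistent with the ~20.8B headline figure, and 30/128 reproduces the 23.4% MoE capacity removed in the comparison table.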
## Files

- `config.json` — Model config with `num_experts: 98`
- `model-0000N-of-00009.safetensors` — Model weights (bf16)
- `expert_drop.py` — Deterministic expert pruning script
## Usage

### Transformers
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model = AutoModelForCausalLM.from_pretrained(
    "ManniX-ITA/gemma-4-A4B-98e-v3-it",
    torch_dtype=torch.bfloat16,
    device_map="auto",
    attn_implementation="eager",  # Gemma 4 head_dim=512 not supported by FA2
)
tok = AutoTokenizer.from_pretrained("ManniX-ITA/gemma-4-A4B-98e-v3-it")

msgs = [{"role": "user", "content": "Explain the Heisenberg uncertainty principle."}]
inputs = tok.apply_chat_template(msgs, return_tensors="pt", add_generation_prompt=True).to(model.device)
out = model.generate(inputs, max_new_tokens=512, do_sample=True, temperature=1.0, top_p=0.95, top_k=64)
print(tok.decode(out[0][inputs.shape[1]:], skip_special_tokens=True))
```
### llama.cpp (recommended)
```bash
llama-server -m gemma-4-A4B-98e-v3-it-Q6_K.gguf \
  --port 8099 -c 32768 -ngl 99 --no-warmup \
  --reasoning-format deepseek --reasoning-budget 8192 \
  --temp 1.0 --top-p 0.95 --top-k 64 --dry-multiplier 0.5
```
GGUF quants: ManniX-ITA/gemma-4-A4B-98e-v3-it-GGUF
## Related Models
| Model | Description |
|---|---|
| gemma-4-A4B-109e-v3-it | 109 experts (19 dropped), clean teacher-force map |
| gemma-4-A4B-109e-v3-it-GGUF | GGUF quants for 109e v3 |
| gemma-4-A4B-98e-v3-it-GGUF | GGUF quants for this model |
## License
This model inherits the Gemma license from the base model.
## Acknowledgements
- Google for the base Gemma 4 26B-A4B-it model
- The GPQA Diamond benchmark (Rein et al., 2023)
- bartowski for the calibration data v5 used in imatrix GGUF quantization