Gemma 4 A4B 98-Expert v3 (20.8B)

Expert-pruned version of google/gemma-4-26B-A4B-it, reduced from 128 to 98 experts per MoE layer using a contribution-weighted importance map aggregated across all task categories (math, logic, code, science, creative).

| | Original (128e) | 109e v3 | This model (98e v3) |
|---|---|---|---|
| Total params | 26B | 22.4B | ~20.8B |
| Experts per layer | 128 | 109 | 98 |
| Experts dropped | 0 | 19/layer | 30/layer |
| MoE capacity removed | — | 14.8% | 23.4% |
| Top-k routing | 8 | 8 | 8 |
| GPQA Diamond (Q6_K) | 75.25% | 71.72% | 75.25% |

Zero GPQA degradation despite dropping 30 experts per layer (23.4% of MoE capacity). This model matches the full 128-expert reference exactly on GPQA Diamond.

GGUF quantizations available at ManniX-ITA/gemma-4-A4B-98e-v3-it-GGUF — includes standard Bartowski quants + ContribDynamic (CD) per-layer quants.

Pruning Method

Contribution-Weighted Expert Analysis

The drop map is derived from expert_neuron_v4.json — a comprehensive per-expert contribution analysis across all task categories (math, logic, code, science, creative) using 128-token teacher-force analysis on the full 128-expert reference model.

Process (scripts/expert_drop.py):

  1. Contribution scoring: For each expert in each layer, the total contribution (tc field) is computed as the sum of weighted output norms across all task categories.
  2. Per-layer ranking: Experts are ranked by total contribution within each layer.
  3. Drop decision: The 30 lowest-contributing experts per layer are dropped (128 → 98).
  4. Router resize: The MoE router proj.weight is resized from [128, hidden] to [98, hidden], keeping only rows for retained experts. The top-8 routing naturally adapts.
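The four steps above can be sketched as a per-layer pruning function. This is an illustrative toy with random tensors, not the actual `scripts/expert_drop.py`; the shapes follow the architecture section (128 experts, hidden size 2816, expert intermediate size 704), and the `prune_layer` name is hypothetical:

```python
import numpy as np

N_EXPERTS, KEEP, HIDDEN, EXPERT_FF = 128, 98, 2816, 704

def prune_layer(contrib, router_weight, expert_weights):
    """Keep the KEEP highest-contribution experts of one MoE layer.

    contrib:        (128,) total-contribution score per expert (the `tc` field)
    router_weight:  (128, HIDDEN) router projection matrix
    expert_weights: list of 128 per-expert weight tensors
    """
    # Rank experts by total contribution, keep the top 98, preserve index order
    keep = np.sort(np.argsort(contrib)[::-1][:KEEP])
    new_router = router_weight[keep]                  # resized to (98, HIDDEN)
    new_experts = [expert_weights[i] for i in keep]   # retained expert tensors
    return keep, new_router, new_experts

# Toy demo with random data
rng = np.random.default_rng(0)
contrib = rng.random(N_EXPERTS)
router = rng.standard_normal((N_EXPERTS, HIDDEN))
experts = [rng.standard_normal((EXPERT_FF, HIDDEN)) for _ in range(N_EXPERTS)]
keep, new_router, new_experts = prune_layer(contrib, router, experts)
print(new_router.shape)  # (98, 2816)
```

Top-8 routing needs no retraining in this scheme: the router softmax is taken over whatever logits remain, so it renormalizes over the 98 surviving rows automatically.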

Why 98e Works Better Than 109e

The 98e v3 model uses a different importance map than 109e v3:

  • 109e v3: Uses a per-question top-16 protection scheme on GPQA Diamond specifically (teacher-force analysis, 196 questions). This over-fits the drop map to GPQA.
  • 98e v3: Uses expert_neuron_v4.json which aggregates contribution across all task categories with 128-token analysis windows. This produces a more generalizable importance ranking.

The result: 98e v3 matches the 128e reference (75.25%) while dropping 30 experts, whereas 109e v3 drops only 19 experts but loses 3.53 pp. The broader importance map makes better pruning decisions.

Key Findings

  • Experts are NOT topic-specialized: Top-32 overlap is 28/32 between math and creative domains. The same experts are important across tasks.
  • Contribution is moderately concentrated: Gini coefficient ~0.38. You need ~75 experts per layer for 80% of the contribution — the bottom 30 carry very little signal.
  • Expert weight similarity is near zero: Max cosine similarity ~0.05 between expert weight matrices. Merging experts by averaging destroys the model. Expert dropping is the only viable structural compression.
  • Early layers matter most: Layer 0 has highest importance (1.0), layers 28-29 are lowest (~0.04-0.05). But the drop is applied uniformly across layers.
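The concentration claims above can be checked with a few lines. The score vectors here are synthetic sanity inputs (the real per-expert scores come from `expert_neuron_v4.json`), and both helper names are illustrative:

```python
import numpy as np

def gini(x):
    """Gini coefficient of a non-negative score vector (0 = even, ->1 = concentrated)."""
    x = np.sort(np.asarray(x, dtype=float))
    n = len(x)
    cum = np.cumsum(x)
    return (n + 1 - 2 * cum.sum() / cum[-1]) / n

def experts_for_share(x, share=0.80):
    """Number of top-ranked experts needed to cover `share` of total contribution."""
    x = np.sort(np.asarray(x, dtype=float))[::-1]
    return int(np.searchsorted(np.cumsum(x) / x.sum(), share) + 1)

# Sanity checks on the metrics themselves
uniform = np.ones(128)                 # perfectly even contribution
onehot = np.zeros(128); onehot[0] = 1  # one expert carries everything
print(gini(uniform))                   # 0.0
print(experts_for_share(uniform))      # 103 of 128 needed for 80%
```

With a real per-layer score vector, a Gini of ~0.38 and an 80%-coverage count of ~75 experts are what justify dropping the bottom 30.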

GPQA Diamond Evaluation

Setup (identical methodology to 109e v3)

  • Quantization: GGUF Q6_K via llama.cpp llama-quantize (imatrix calibration)
  • Inference: llama.cpp llama-server (OpenAI-compatible API)
  • Evaluation: lm-evaluation-harness, task gpqa_diamond_cot_zeroshot
  • GPU: NVIDIA RTX 3090 (24 GB)

Configuration

| Parameter | Value |
|---|---|
| Context size | 32768 tokens |
| Reasoning format | deepseek |
| Reasoning budget | 8192 tokens |
| Temperature | 1.0 (Gemma 4 official) |
| top_p | 0.95 |
| top_k | 64 |
| DRY multiplier | 0.5 |
| Tokenizer | google/gemma-4-26B-A4B-it (original) |

Results

| Model | Experts/Layer | GPQA Diamond (flex) | Delta vs 128e |
|---|---|---|---|
| gemma-4-26B-A4B-it (128e ref) | 128 | 75.25% | — |
| gemma-4-A4B-98e-v3-it (this) | 98 | 75.25% | +0.00 pp |
| gemma-4-A4B-109e-v3-it | 109 | 71.72% | -3.53 pp |
| gemma-4-E4B (small Gemma 4) | — | 57.07% | -18.18 pp |

Architecture

Unchanged from the original except num_experts: 98 (was 128):

  • Layers: 30
  • Hidden size: 2816
  • Expert intermediate size: 704 per expert
  • Dense MLP intermediate size: 2112 (always active)
  • Top-k routing: 8 (of 98 available)
  • Attention: Hybrid sliding (5) + global (1) pattern, head_dim=512
  • Vocabulary: 262,144
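A back-of-envelope check of the ~20.8B figure, using the numbers in the list above and assuming three projection matrices (gate/up/down) per expert. The exact total also includes attention, embeddings, dense MLPs, and router rows, so this is only a rough cross-check against the nominal 26B:

```python
LAYERS, HIDDEN, EXPERT_FF, DROPPED = 30, 2816, 704, 30

per_expert = 3 * HIDDEN * EXPERT_FF       # ~5.95M params per expert (gate/up/down)
removed = DROPPED * per_expert * LAYERS   # params removed across all 30 layers
print(f"removed ~ {removed / 1e9:.2f}B")                  # ~5.35B
print(f"pruned total ~ {(26e9 - removed) / 1e9:.1f}B")    # ~20.6B
```

~20.6B from this rough subtraction is consistent with the reported ~20.8B once the non-expert components are counted precisely.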

Files

  • config.json — Model config with num_experts: 98
  • model-0000N-of-00009.safetensors — Model weights (bf16)
  • expert_drop.py — Deterministic expert pruning script

Usage

Transformers

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model = AutoModelForCausalLM.from_pretrained(
    "ManniX-ITA/gemma-4-A4B-98e-v3-it",
    torch_dtype=torch.bfloat16,
    device_map="auto",
    attn_implementation="eager",  # Gemma 4 head_dim=512 not supported by FA2
)
tok = AutoTokenizer.from_pretrained("ManniX-ITA/gemma-4-A4B-98e-v3-it")

msgs = [{"role": "user", "content": "Explain the Heisenberg uncertainty principle."}]
inputs = tok.apply_chat_template(msgs, return_tensors="pt", add_generation_prompt=True).to(model.device)
out = model.generate(inputs, max_new_tokens=512, do_sample=True, temperature=1.0, top_p=0.95, top_k=64)
print(tok.decode(out[0][inputs.shape[1]:], skip_special_tokens=True))

llama.cpp (recommended)

llama-server -m gemma-4-A4B-98e-v3-it-Q6_K.gguf \
    --port 8099 -c 32768 -ngl 99 --no-warmup \
    --reasoning-format deepseek --reasoning-budget 8192 \
    --temp 1.0 --top-p 0.95 --top-k 64 --dry-multiplier 0.5

GGUF quants: ManniX-ITA/gemma-4-A4B-98e-v3-it-GGUF

Related Models

| Model | Description |
|---|---|
| gemma-4-A4B-109e-v3-it | 109 experts (19 dropped), clean teacher-force map |
| gemma-4-A4B-109e-v3-it-GGUF | GGUF quants for 109e v3 |
| gemma-4-A4B-98e-v3-it-GGUF | GGUF quants for this model |

License

This model inherits the Gemma license from the base model.

Acknowledgements

  • Google for the base Gemma 4 26B-A4B-it model
  • The GPQA Diamond benchmark (Rein et al., 2023)
  • bartowski for the calibration data v5 used in imatrix GGUF quantization