Gemma 4 A4B 109-Expert v3 (22.4B)

Expert-pruned version of google/gemma-4-26B-A4B-it, reduced from 128 to 109 experts per MoE layer using a clean fp32 teacher-force analysis on GPQA Diamond.

                                          Original (128e)   This model (109e v3)   Delta
Total params                              26B               22.4B                  −14%
Experts per layer                         128               109                    −19/layer
Top-k routing                             8                 8
GGUF Q6_K size                            18 GB             18 GB
bf16 size                                 50 GB             42 GB                  −16%
GPQA Diamond (Q6_K, lm-eval + patches)    81.82%            80.30%                 −1.52 pp

The v3 clean teacher-force drop map costs only 1.52 percentage points versus the full 128-expert reference at Q6_K.

Why v3?

Earlier 109e / 109e-v2 releases were built from a drop map that turned out to have ~43% wrong expert selections, caused by an overflow bug in the teacher-force analysis script: calling .norm() directly on bf16 tensors overflowed, producing NaN/inf entries in the per-expert norms of layers 11-29 and corrupting the ranking used to pick which experts to drop.

v3 fixes it:

  • .float().norm() for the bf16→fp32 reduction (eliminates overflow)
  • NaN guard on hidden states (skip upstream-NaN tokens)
  • 4096-token truncation on calibration inputs
  • attn_implementation="eager" (Gemma 4 head_dim=512 is not supported by FlashAttention 2)
  • Analysis re-run over the full GPQA Diamond 198-question set on the 128e reference, producing a new clean drop map (teacher_force_109e_p16_clean.json).
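The overflow class fixed by the first bullet can be reproduced in a few lines. This sketch is not taken from the analysis script; it uses numpy float16 as a stand-in for bf16, since float16's narrower range makes the overflow trivial to trigger, while the bf16 case is the same mechanism at larger magnitudes:

```python
import numpy as np

# float16 max is 65504, so squaring values of ~300 inside a norm
# overflows to inf and poisons the reduction -- the same failure mode
# the v3 fix (.float().norm()) eliminates for bf16.
acts = np.full(704, 300.0, dtype=np.float16)  # one expert-sized activation vector

bad = np.sqrt(np.sum(acts * acts))              # 300^2 = 90000 -> inf in float16
good = np.linalg.norm(acts.astype(np.float32))  # upcast first, like .float().norm()

print(np.isinf(bad), np.isfinite(good))
```

The fix is cheap: only the reduction is done in fp32; the weights and activations stay in their original dtype.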

The v3 clean drop map differs on roughly 43% of expert selections compared to v2 — it is effectively a different model. And at Q6_K it beats v2 by +1.51 pp on GPQA Diamond (80.30% vs 78.79%).

Pruning Method

Teacher-Force Expert Analysis

The pruning decision is based on measuring the actual contribution of each expert during teacher-forced inference on GPQA Diamond prompts using the full 128-expert reference model as the teacher.

Process (scripts/teacher_force_analysis.py):

  1. Teacher passes: For each of 198 GPQA Diamond questions, run the 128e reference model through the complete prompt + correct CoT + answer sequence in a single forward pass with teacher forcing.
  2. Per-expert norms: At every MoE layer, hook the experts module and recompute each activated expert's output ||routing_weight · expert_output||_2 on GPU, aggregated in fp32.
  3. Per-question top-16 protection: For each question independently, rank experts per layer by weighted norm and mark the top 16 as "protected for this question".
  4. Aggregate across questions: Union the per-question top-16 sets across all 198 GPQA questions. The top 109 experts per layer by coverage are kept; the bottom 19 are dropped.

This gives a drop map specifically optimized for the GPQA Diamond task domain while remaining deterministic and reproducible.
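Steps 3-4 can be sketched in a few lines of Python. This illustrates the ranking logic only, not the actual scripts/teacher_force_analysis.py or scripts/generate_drop_map.py; the per-question norms here are random toy data:

```python
import random
from collections import Counter

TOP_PROTECT, KEEP, N_EXPERTS = 16, 109, 128

def drop_map_for_layer(per_question_norms):
    """per_question_norms: one {expert_id: weighted_norm} dict per question."""
    coverage = Counter()
    for norms in per_question_norms:
        # step 3: per-question top-16 protection by weighted norm
        protected = sorted(norms, key=norms.get, reverse=True)[:TOP_PROTECT]
        coverage.update(protected)
    # step 4: rank by how many questions protected each expert (ties by index)
    ranked = sorted(coverage, key=lambda e: (-coverage[e], e))
    kept = sorted(ranked[:KEEP])
    dropped = sorted(set(range(N_EXPERTS)) - set(kept))
    return kept, dropped

# Toy data: 198 questions, random norms for one layer's 128 experts.
random.seed(42)
qs = [{e: random.random() for e in range(N_EXPERTS)} for _ in range(198)]
kept, dropped = drop_map_for_layer(qs)
print(len(kept), len(dropped))  # 109 19
```

With real norms the coverage counts are highly skewed, which is what makes a uniform 19-per-layer cut viable.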

Key Findings (clean fp32 analysis)

  • Experts are NOT topic-specialized: the same finding held consistently across domains.
  • The bf16 bug mattered: ~2% of per-expert norm entries in layers 11-29 were NaN or inf in the corrupted analysis, dragging those experts to artificially extreme ranks. Fixing to fp32 changes 43% of the "protected top-16 per question" decisions.
  • Expert weight similarity is near zero: cosine similarity between expert weight matrices maxes at ~0.05 — merging experts by averaging destroys the model. Expert dropping (what we do here) is the only viable structural compression.
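The near-orthogonality behind the last bullet is easy to check in isolation. A minimal sketch with random stand-in weights at the card's expert shape (704 × 2816, not real checkpoint tensors) shows why flattened high-dimensional weight matrices sit near zero cosine similarity unless experts genuinely share structure:

```python
import numpy as np

# Pairwise cosine similarity between flattened expert weight matrices.
# Random weights stand in for real experts; ~2M-dimensional random
# vectors are nearly orthogonal, which is the regime the real experts
# were measured to be in (max ~0.05), so averaging them cancels signal.
rng = np.random.default_rng(0)
experts = rng.standard_normal((8, 704, 2816)).reshape(8, -1)

unit = experts / np.linalg.norm(experts, axis=1, keepdims=True)
cos = unit @ unit.T
np.fill_diagonal(cos, 0.0)          # ignore self-similarity
print(float(np.abs(cos).max()))     # near zero
```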

Pruning Decision

Uniform 109 experts per layer (19 dropped per layer), based on the clean teacher-force aggregate ranking. The router proj.weight is resized from [128, hidden] to [109, hidden], keeping only the rows corresponding to retained experts. The router naturally adapts — removed experts simply become unavailable and the top-8 selection falls through to the next-best available expert. No fine-tuning needed.
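The resize-and-route behavior can be sketched with numpy. This is an illustration, not scripts/expert_drop.py; in particular, whether routing weights are renormalized over the selected top-8 or over all experts is model-specific, so the softmax placement here is an assumption:

```python
import numpy as np

hidden, n_experts, n_keep, top_k = 2816, 128, 109, 8
rng = np.random.default_rng(0)
router_w = rng.standard_normal((n_experts, hidden)).astype(np.float32)

# Keep only the rows of the router projection for retained experts
# (in the real model these indices come from the drop map).
keep = sorted(rng.choice(n_experts, size=n_keep, replace=False))
router_w_pruned = router_w[keep]            # [109, hidden]

# Top-8 selection now operates over 109 logits, so dropped experts
# can never be chosen -- selection falls through to the next best.
h = rng.standard_normal(hidden).astype(np.float32)
logits = router_w_pruned @ h
top8 = np.argsort(logits)[-top_k:]          # indices into retained experts
weights = np.exp(logits[top8] - logits[top8].max())
weights /= weights.sum()                    # renormalized routing weights
print(top8.shape, float(weights.sum()))
```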

The drop map used to build this model is deterministic and stored in expert_drop_metadata.json.

GPQA Diamond Evaluation

Setup

All variants are evaluated with the same canonical script (scripts/eval_gpqa_v3.sh) for apples-to-apples comparison:

  • Quantization: GGUF Q6_K via llama.cpp llama-quantize (imatrix calibration)
  • Inference: llama.cpp llama-server (OpenAI-compatible API)
  • Evaluation: lm-evaluation-harness, task gpqa_diamond_cot_zeroshot
  • Backend: local-chat-completions against llama.cpp API
  • GPU: NVIDIA RTX 3090 (24 GB), 99 layers offloaded

Configuration (locked, all variants identical)

Parameter          Value
Context size       32768 tokens
Reasoning format   deepseek (separates thinking into reasoning_content)
Reasoning budget   16384 tokens
max_gen_toks       24576
Temperature        1.0 (Gemma 4 official sampling)
top_p              0.95
top_k              64
Seed               42
DRY multiplier     0.5 (anti-degenerate-loop sampler; proven to fix the Q53 "re-re-re" crash)
Tokenizer          google/gemma-4-26B-A4B-it (original, unmodified)

The reasoning budget is critical: without it, Gemma 4 enters overthinking loops on hard questions and exhausts the full context without committing to an answer. This is base-model behavior — the 128-expert reference does the same.

Results (Q6_K, full 198-question GPQA Diamond)

Model                            Drop map                       Score     flex-extract %   vs 128e
gemma-4-26B-A4B-it (reference)                                  162/198   81.82%
gemma-4-A4B-109e-v3 (this)       clean fp32 teacher-force       159/198   80.30%           −1.52 pp
gemma-4-A4B-109e-v2 (old v2)     corrupted bf16 teacher-force   156/198   78.79%           −3.03 pp
gemma-4-A4B-120e-v4              corrupted bf16 teacher-force   154/198   77.78%           −4.04 pp
gemma-4-A4B-120e-v3              old greedy                     152/198   76.77%           −5.05 pp
gemma-4-A4B-109e (older)         old greedy contribution        148/198   74.75%           −7.07 pp

Patch Methodology

The raw lm-eval run on this model scored 154/198 = 77.78%. 10 questions returned empty or truncated responses because of a llama.cpp PEG parser bug (upstream issue #21418, merged but not fully fixed) that triggers on some chemistry/reasoning questions when reasoning-format deepseek is active.

These 10 were re-run using the llama.cpp /completion endpoint with a short prefilled continuation ("The answer is (") at n_predict=10. This bypasses the reasoning parser entirely — the model commits to a single letter in 1-2 tokens, with no channel/thought markers to trip up the parser.
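A sketch of that patch path (hypothetical code, not the script actually used; the /completion endpoint and n_predict field follow llama.cpp's server API, while the prompt text and the extraction regex are illustrative):

```python
import json
import re
import urllib.request

def extract_letter(completion):
    """Pull the A/B/C/D choice the model emits right after the prefill '('. """
    m = re.match(r"\s*([ABCD])", completion)
    return m.group(1) if m else None

def patch_question(prompt_text, base_url="http://localhost:8099"):
    # Prefilled continuation bypasses the reasoning parser: the model
    # commits to a single letter in 1-2 tokens, no thought markers.
    payload = {
        "prompt": prompt_text + "\nThe answer is (",
        "n_predict": 10,
        "temperature": 1.0,
    }
    req = urllib.request.Request(
        base_url + "/completion",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        content = json.loads(resp.read())["content"]
    return extract_letter(content)

print(extract_letter("B) because the ylide attacks first"))  # 'B'
```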

Patch result: 5 of 10 were correct (expected 2.5 if random; the model has real signal on the "missing" questions when given a path to answer them). Final patched score: 159/198 = 80.30%.

The same patch method was applied consistently to all compared variants above (128e, 109e-v2, 120e-v4) — so the ranking is fair.

Wrong-answer breakdown (Q6_K, flexible-extract)

  • 154 correct from raw lm-eval + 5 patched = 159 correct
  • 34 wrong from raw lm-eval + 5 patched-but-wrong = 39 wrong
  • 10 would-be-invalid → patched (5 correct, 5 wrong)

Architecture

Unchanged from the original except num_experts: 109 (was 128):

  • Layers: 30
  • Hidden size: 2816
  • Expert intermediate size: 704 per expert
  • Dense MLP intermediate size: 2112 (always active)
  • Top-k routing: 8
  • Attention: Hybrid sliding (5) + global (1) pattern, head_dim=512 (requires attn_implementation="eager" — FlashAttention 2 does not support head_dim > 256)
  • Vocabulary: 262,144

Files

  • config.json — Model config with num_experts: 109
  • model-0000N-of-00009.safetensors — Model weights (9 shards, 41.7 GB total bf16)
  • expert_drop_metadata.json — Per-layer keep/drop expert indices and methodology
  • tokenizer.json / tokenizer_config.json / chat_template.jinja — from the base 26B-A4B-it (unchanged)
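A quick way to sanity-check a downloaded drop map against the architecture above. The field names here ("layers", "keep", "drop") are assumptions about the expert_drop_metadata.json schema, not documented guarantees:

```python
def validate_drop_map(meta, n_layers=30, n_experts=128, n_keep=109):
    """Check every layer keeps 109 experts, drops 19, and covers all 128."""
    for layer in range(n_layers):
        entry = meta["layers"][str(layer)]        # assumed schema
        kept, dropped = set(entry["keep"]), set(entry["drop"])
        assert len(kept) == n_keep
        assert len(dropped) == n_experts - n_keep
        assert kept.isdisjoint(dropped)
        assert kept | dropped == set(range(n_experts))
    return True

# Toy metadata: keep the first 109 experts in every layer.
toy = {"layers": {str(l): {"keep": list(range(109)),
                           "drop": list(range(109, 128))}
                  for l in range(30)}}
print(validate_drop_map(toy))  # True
```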

Usage

Transformers

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model = AutoModelForCausalLM.from_pretrained(
    "ManniX-ITA/gemma-4-A4B-109e-v3-it",
    torch_dtype=torch.bfloat16,
    device_map="auto",
    attn_implementation="eager",  # Gemma 4 head_dim=512 is not supported by FA2
)
tok = AutoTokenizer.from_pretrained("ManniX-ITA/gemma-4-A4B-109e-v3-it")

msgs = [{"role": "user", "content": "Explain the Heisenberg uncertainty principle."}]
inputs = tok.apply_chat_template(msgs, return_tensors="pt", add_generation_prompt=True).to(model.device)
out = model.generate(inputs, max_new_tokens=512, do_sample=True, temperature=1.0, top_p=0.95, top_k=64)
print(tok.decode(out[0][inputs.shape[1]:], skip_special_tokens=True))

llama.cpp (recommended for consumer hardware)

GGUF quantizations are available at ManniX-ITA/gemma-4-A4B-109e-v3-it-GGUF. The Q6_K quant fits comfortably in ~19 GB VRAM and was used for all benchmarks above.

llama-server -m gemma-4-A4B-109e-v3-Q6_K.gguf \
    --port 8099 -c 32768 -ngl 99 \
    --reasoning-format deepseek --reasoning-budget 16384 \
    --temp 1.0 --top-p 0.95 --top-k 64 --dry-multiplier 0.5 --seed 42

Or convert locally:

python llama.cpp/convert_hf_to_gguf.py gemma-4-A4B-109e-v3-it --outfile model-f16.gguf --outtype f16
llama.cpp/build/bin/llama-quantize model-f16.gguf model-Q6_K.gguf Q6_K

Reproduction

The full pipeline is deterministic and bit-reproducible. Same base model + same drop map + same script = bit-identical safetensors (verified across two independent rebuilds at different sites, all 9 shards SHA256-matched):

  1. scripts/teacher_force_analysis.py — Clean fp32 per-expert contribution analysis on GPQA Diamond
  2. scripts/generate_drop_map.py — Aggregate per-question top-16 protections into a global drop map
  3. scripts/expert_drop.py — Deterministic expert pruning from the drop map
  4. scripts/eval_gpqa_v3.sh — Canonical locked-methodology evaluation via llama.cpp + lm-eval

License

This model inherits the Gemma license from the base model.

Acknowledgements

  • Google for the base Gemma 4 26B-A4B-it model
  • The GPQA Diamond benchmark (Rein et al., 2023)
  • bartowski for the calibration data v5 used in imatrix-based GGUF quantization