Gemma 4 A4B 109-Expert v3 (22.4B)
Expert-pruned version of google/gemma-4-26B-A4B-it, reduced from 128 to 109 experts per MoE layer using a clean fp32 teacher-force analysis on GPQA Diamond.
| Original (128e) | This model (109e v3) | Delta | |
|---|---|---|---|
| Total params | 26B | 22.4B | -14% |
| Experts per layer | 128 | 109 | -19/layer |
| Top-k routing | 8 | 8 | — |
| GGUF Q6_K size | 18 GB | 18 GB | — |
| bf16 size | 50 GB | 42 GB | -16% |
| GPQA Diamond (Q6_K, lm-eval + patches) | 81.82% | 80.30% | −1.52 pp |
The v3 clean teacher-force drop only costs 1.52 percentage points vs the full 128-expert reference at Q6_K.
Why v3?
Earlier 109e / 109e-v2 releases were built from a drop map that turned out to have ~43% wrong expert selections due to a bf16 → .norm() overflow bug in the teacher-force analysis script. The bug produced NaN/inf entries in the per-expert norms of layers 11-29 and corrupted the ranking used to pick which experts to drop.
v3 fixes it:
.float().norm()for the bf16→fp32 reduction (eliminates overflow)- NaN guard on hidden states (skip upstream-NaN tokens)
- 4096-token truncation on calibration inputs
attn_implementation="eager"(Gemma 4head_dim=512is not supported by FlashAttention 2)- Analysis re-run over the full GPQA Diamond 196-question set on the 128e reference, producing a new clean drop map (
teacher_force_109e_p16_clean.json).
The v3 clean drop map differs on roughly 43% of expert selections compared to v2 — it is effectively a different model. And at Q6_K it beats v2 by +1.51 pp on GPQA Diamond (80.30% vs 78.79%).
Pruning Method
Teacher-Force Expert Analysis
The pruning decision is based on measuring the actual contribution of each expert during teacher-forced inference on GPQA Diamond prompts using the full 128-expert reference model as the teacher.
Process (scripts/teacher_force_analysis.py):
- Teacher passes: For each of 196 GPQA Diamond questions, run the 128e reference model through the complete
prompt + correct CoT + answersequence in a single forward pass with teacher forcing. - Per-expert norms: At every MoE layer, hook the experts module and recompute each activated expert's output
||routing_weight · expert_output||_2on GPU, aggregated in fp32. - Per-question top-16 protection: For each question independently, rank experts per layer by weighted norm and mark the top 16 as "protected for this question".
- Aggregate across questions: Union the per-question top-16 sets across all 196 GPQA questions. The top 109 experts per layer by coverage are kept; the bottom 19 are dropped.
This gives a drop map specifically optimized for the GPQA Diamond task domain while remaining deterministic and reproducible.
Key Findings (clean fp32 analysis)
- Experts are NOT topic-specialized: confirmed across domains.
- The bf16 bug mattered: ~2% of per-expert norm entries in layers 11-29 were
NaNorinfin the corrupted analysis, dragging those experts to artificially extreme ranks. Fixing to fp32 changes 43% of the "protected top-16 per question" decisions. - Expert weight similarity is near zero: cosine similarity between expert weight matrices maxes at ~0.05 — merging experts by averaging destroys the model. Expert dropping (what we do here) is the only viable structural compression.
Pruning Decision
Uniform 109 experts per layer (19 dropped per layer), based on the clean teacher-force aggregate ranking. The router proj.weight is resized from [128, hidden] to [109, hidden], keeping only the rows corresponding to retained experts. The router naturally adapts — removed experts simply become unavailable and the top-8 selection falls through to the next-best available expert. No fine-tuning needed.
The drop map used to build this model is deterministic and stored in expert_drop_metadata.json.
GPQA Diamond Evaluation
Setup
All variants are evaluated with the same canonical script (scripts/eval_gpqa_v3.sh) for apples-to-apples comparison:
- Quantization: GGUF Q6_K via llama.cpp
llama-quantize(imatrix calibration) - Inference: llama.cpp
llama-server(OpenAI-compatible API) - Evaluation: lm-evaluation-harness, task
gpqa_diamond_cot_zeroshot - Backend:
local-chat-completionsagainst llama.cpp API - GPU: NVIDIA RTX 3090 (24 GB), 99 layers offloaded
Configuration (locked, all variants identical)
| Parameter | Value |
|---|---|
| Context size | 32768 tokens |
| Reasoning format | deepseek (separates thinking into reasoning_content) |
| Reasoning budget | 16384 tokens |
| max_gen_toks | 24576 |
| Temperature | 1.0 (Gemma 4 official sampling) |
| top_p | 0.95 |
| top_k | 64 |
| Seed | 42 |
| DRY multiplier | 0.5 (anti-degenerate-loop sampler, proven to fix Q53 "re-re-re" crash) |
| Tokenizer | google/gemma-4-26B-A4B-it (original, unmodified) |
The reasoning budget is critical: without it, Gemma 4 enters overthinking loops on hard questions and exhausts the full context without committing to an answer. This is base-model behavior — the 128-expert reference does the same.
Results (Q6_K, full 198-question GPQA Diamond)
| Model | Drop map | Score | flex-extract % | vs 128e |
|---|---|---|---|---|
| gemma-4-26B-A4B-it (reference) | — | 162/198 | 81.82% | — |
| gemma-4-A4B-109e-v3 (this) | clean fp32 teacher-force | 159/198 | 80.30% | −1.52 pp |
| gemma-4-A4B-109e-v2 (old v2) | corrupted bf16 teacher-force | 156/198 | 78.79% | −3.03 pp |
| gemma-4-A4B-120e-v4 | corrupted bf16 teacher-force | 154/198 | 77.78% | −4.04 pp |
| gemma-4-A4B-120e-v3 | old greedy | 152/198 | 76.77% | −5.05 pp |
| gemma-4-A4B-109e (older) | old greedy contribution | 148/198 | 74.75% | −7.07 pp |
Patch Methodology
The raw lm-eval run on this model scored 154/198 = 77.78%. 10 questions returned empty or truncated responses because of a llama.cpp PEG parser bug (upstream issue #21418, merged but not fully fixed) that triggers on some chemistry/reasoning questions when reasoning-format deepseek is active.
These 10 were re-run using the llama.cpp /completion endpoint with a short prefilled continuation ("The answer is (") at n_predict=10. This bypasses the reasoning parser entirely — the model commits to a single letter in 1-2 tokens, with no channel/thought markers to trip up the parser.
Patch result: 5 of 10 were correct (expected 2.5 if random; the model has real signal on the "missing" questions when given a path to answer them). Final patched score: 159/198 = 80.30%.
The same patch method was applied consistently to all compared variants above (128e, 109e-v2, 120e-v4) — so the ranking is fair.
Wrong-answer breakdown (Q6_K, flexible-extract)
- 128 correct from raw lm-eval + 5 patched = 159 correct
- 29 wrong
- 10 would-be-invalid → patched (5 correct, 5 wrong)
Architecture
Unchanged from the original except num_experts: 109 (was 128):
- Layers: 30
- Hidden size: 2816
- Expert intermediate size: 704 per expert
- Dense MLP intermediate size: 2112 (always active)
- Top-k routing: 8
- Attention: Hybrid sliding (5) + global (1) pattern,
head_dim=512(requiresattn_implementation="eager"— FlashAttention 2 does not support head_dim > 256) - Vocabulary: 262,144
Files
config.json— Model config withnum_experts: 109model-0000N-of-00009.safetensors— Model weights (9 shards, 41.7 GB total bf16)expert_drop_metadata.json— Per-layer keep/drop expert indices and methodologytokenizer.json/tokenizer_config.json/chat_template.jinja— from the base 26B-A4B-it (unchanged)
Usage
Transformers
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
model = AutoModelForCausalLM.from_pretrained(
"ManniX-ITA/gemma-4-A4B-109e-v3-it",
torch_dtype=torch.bfloat16,
device_map="auto",
attn_implementation="eager", # Gemma 4 head_dim=512 is not supported by FA2
)
tok = AutoTokenizer.from_pretrained("ManniX-ITA/gemma-4-A4B-109e-v3-it")
msgs = [{"role": "user", "content": "Explain the Heisenberg uncertainty principle."}]
inputs = tok.apply_chat_template(msgs, return_tensors="pt", add_generation_prompt=True).to(model.device)
out = model.generate(inputs, max_new_tokens=512, do_sample=True, temperature=1.0, top_p=0.95, top_k=64)
print(tok.decode(out[0][inputs.shape[1]:], skip_special_tokens=True))
llama.cpp (recommended for consumer hardware)
GGUF quantizations are available at ManniX-ITA/gemma-4-A4B-109e-v3-it-GGUF. The Q6_K quant fits comfortably in ~19 GB VRAM and was used for all benchmarks above.
llama-server -m gemma-4-A4B-109e-v3-Q6_K.gguf \
--port 8099 -c 32768 -ngl 99 \
--reasoning-format deepseek --reasoning-budget 16384 \
--temp 1.0 --top-p 0.95 --top-k 64 --dry-multiplier 0.5 --seed 42
Or convert locally:
python llama.cpp/convert_hf_to_gguf.py gemma-4-A4B-109e-v3-it --outfile model-f16.gguf --outtype f16
llama.cpp/build/bin/llama-quantize model-f16.gguf model-Q6_K.gguf Q6_K
Reproduction
The full pipeline is deterministic and bit-reproducible. Same base model + same drop map + same script = bit-identical safetensors (verified across two independent rebuilds at different sites, all 9 shards SHA256-matched):
scripts/teacher_force_analysis.py— Clean fp32 per-expert contribution analysis on GPQA Diamondscripts/generate_drop_map.py— Aggregate per-question top-16 protections into a global drop mapscripts/expert_drop.py— Deterministic expert pruning from the drop mapscripts/eval_gpqa_v3.sh— Canonical locked-methodology evaluation via llama.cpp + lm-eval
License
This model inherits the Gemma license from the base model.
Acknowledgements
- Google for the base Gemma 4 26B-A4B-it model
- The GPQA Diamond benchmark (Rein et al., 2023)
- bartowski for the calibration data v5 used in imatrix-based GGUF quantization
- Downloads last month
- 34