Gemma 4 A4B 109-Expert v3 (22.4B)

Expert-pruned version of google/gemma-4-26B-A4B-it, reduced from 128 to 109 experts per MoE layer using a clean fp32 teacher-force analysis on GPQA Diamond.

                                          Original (128e)   This model (109e v3)   Delta
Total params                              26B               22.4B                  −14%
Experts per layer                         128               109                    −19/layer
Top-k routing                             8                 8
GGUF Q6_K size                            18 GB             18 GB
bf16 size                                 50 GB             42 GB                  −16%
GPQA Diamond (Q6_K, lm-eval + patches)    81.82%            80.30%                 −1.52 pp

The v3 clean teacher-force drop map costs only 1.52 percentage points versus the full 128-expert reference at Q6_K.

Why v3?

Earlier 109e / 109e-v2 releases were built from a drop map that turned out to have ~43% wrong expert selections, caused by an overflow bug in the teacher-force analysis script: calling .norm() directly on bf16 tensors overflowed, producing NaN/inf entries in the per-expert norms of layers 11-29 and corrupting the ranking used to pick which experts to drop.

v3 fixes it:

  • .float().norm() for the bf16→fp32 reduction (eliminates overflow)
  • NaN guard on hidden states (skip upstream-NaN tokens)
  • 4096-token truncation on calibration inputs
  • attn_implementation="eager" (Gemma 4 head_dim=512 is not supported by FlashAttention 2)
  • Analysis re-run over the full GPQA Diamond 198-question set on the 128e reference, producing a new clean drop map (teacher_force_109e_p16_clean.json).
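The overflow class fixed by the first bullet can be reproduced in a few lines. This sketch is not taken from the analysis script; it uses numpy float16 as a stand-in for bf16, since float16's narrower range makes the overflow trivial to trigger, while the bf16 case is the same mechanism at larger magnitudes:

```python
import numpy as np

# float16 max is 65504, so squaring values of ~300 inside a norm
# overflows to inf and poisons the reduction -- the same failure mode
# the v3 fix (.float().norm()) eliminates for bf16.
acts = np.full(704, 300.0, dtype=np.float16)  # one expert-sized activation vector

bad = np.sqrt(np.sum(acts * acts))              # 300^2 = 90000 -> inf in float16
good = np.linalg.norm(acts.astype(np.float32))  # upcast first, like .float().norm()

print(np.isinf(bad), np.isfinite(good))
```

The fix is cheap: only the reduction is done in fp32; the weights and activations stay in their original dtype.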

The v3 clean drop map differs on roughly 43% of expert selections compared to v2 — it is effectively a different model. And at Q6_K it beats v2 by +1.51 pp on GPQA Diamond (80.30% vs 78.79%).

Pruning Method

Teacher-Force Expert Analysis

The pruning decision is based on measuring the actual contribution of each expert during teacher-forced inference on GPQA Diamond prompts using the full 128-expert reference model as the teacher.

Process (scripts/teacher_force_analysis.py):

  1. Teacher passes: For each of 198 GPQA Diamond questions, run the 128e reference model through the complete prompt + correct CoT + answer sequence in a single forward pass with teacher forcing.
  2. Per-expert norms: At every MoE layer, hook the experts module and recompute each activated expert's output ||routing_weight · expert_output||_2 on GPU, aggregated in fp32.
  3. Per-question top-16 protection: For each question independently, rank experts per layer by weighted norm and mark the top 16 as "protected for this question".
  4. Aggregate across questions: Union the per-question top-16 sets across all 198 GPQA questions. The top 109 experts per layer by coverage are kept; the bottom 19 are dropped.

This gives a drop map specifically optimized for the GPQA Diamond task domain while remaining deterministic and reproducible.
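Steps 3-4 can be sketched in a few lines of Python. This illustrates the ranking logic only, not the actual scripts/teacher_force_analysis.py or scripts/generate_drop_map.py; the per-question norms here are random toy data:

```python
import random
from collections import Counter

TOP_PROTECT, KEEP, N_EXPERTS = 16, 109, 128

def drop_map_for_layer(per_question_norms):
    """per_question_norms: one {expert_id: weighted_norm} dict per question."""
    coverage = Counter()
    for norms in per_question_norms:
        # step 3: per-question top-16 protection by weighted norm
        protected = sorted(norms, key=norms.get, reverse=True)[:TOP_PROTECT]
        coverage.update(protected)
    # step 4: rank by how many questions protected each expert (ties by index)
    ranked = sorted(coverage, key=lambda e: (-coverage[e], e))
    kept = sorted(ranked[:KEEP])
    dropped = sorted(set(range(N_EXPERTS)) - set(kept))
    return kept, dropped

# Toy data: 198 questions, random norms for one layer's 128 experts.
random.seed(42)
qs = [{e: random.random() for e in range(N_EXPERTS)} for _ in range(198)]
kept, dropped = drop_map_for_layer(qs)
print(len(kept), len(dropped))  # 109 19
```

With real norms the coverage counts are highly skewed, which is what makes a uniform 19-per-layer cut viable.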

Key Findings (clean fp32 analysis)

  • Experts are NOT topic-specialized: the same finding held consistently across domains.
  • The bf16 bug mattered: ~2% of per-expert norm entries in layers 11-29 were NaN or inf in the corrupted analysis, dragging those experts to artificially extreme ranks. Fixing to fp32 changes 43% of the "protected top-16 per question" decisions.
  • Expert weight similarity is near zero: cosine similarity between expert weight matrices maxes at ~0.05 — merging experts by averaging destroys the model. Expert dropping (what we do here) is the only viable structural compression.
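The near-orthogonality behind the last bullet is easy to check in isolation. A minimal sketch with random stand-in weights at the card's expert shape (704 × 2816, not real checkpoint tensors) shows why flattened high-dimensional weight matrices sit near zero cosine similarity unless experts genuinely share structure:

```python
import numpy as np

# Pairwise cosine similarity between flattened expert weight matrices.
# Random weights stand in for real experts; ~2M-dimensional random
# vectors are nearly orthogonal, which is the regime the real experts
# were measured to be in (max ~0.05), so averaging them cancels signal.
rng = np.random.default_rng(0)
experts = rng.standard_normal((8, 704, 2816)).reshape(8, -1)

unit = experts / np.linalg.norm(experts, axis=1, keepdims=True)
cos = unit @ unit.T
np.fill_diagonal(cos, 0.0)          # ignore self-similarity
print(float(np.abs(cos).max()))     # near zero
```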

Pruning Decision

Uniform 109 experts per layer (19 dropped per layer), based on the clean teacher-force aggregate ranking. The router proj.weight is resized from [128, hidden] to [109, hidden], keeping only the rows corresponding to retained experts. The router naturally adapts — removed experts simply become unavailable and the top-8 selection falls through to the next-best available expert. No fine-tuning needed.
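The resize-and-route behavior can be sketched with numpy. This is an illustration, not scripts/expert_drop.py; in particular, whether routing weights are renormalized over the selected top-8 or over all experts is model-specific, so the softmax placement here is an assumption:

```python
import numpy as np

hidden, n_experts, n_keep, top_k = 2816, 128, 109, 8
rng = np.random.default_rng(0)
router_w = rng.standard_normal((n_experts, hidden)).astype(np.float32)

# Keep only the rows of the router projection for retained experts
# (in the real model these indices come from the drop map).
keep = sorted(rng.choice(n_experts, size=n_keep, replace=False))
router_w_pruned = router_w[keep]            # [109, hidden]

# Top-8 selection now operates over 109 logits, so dropped experts
# can never be chosen -- selection falls through to the next best.
h = rng.standard_normal(hidden).astype(np.float32)
logits = router_w_pruned @ h
top8 = np.argsort(logits)[-top_k:]          # indices into retained experts
weights = np.exp(logits[top8] - logits[top8].max())
weights /= weights.sum()                    # renormalized routing weights
print(top8.shape, float(weights.sum()))
```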

The drop map used to build this model is deterministic and stored in expert_drop_metadata.json.

GPQA Diamond Evaluation

Setup

All variants are evaluated with the same canonical script (scripts/eval_gpqa_v3.sh) for apples-to-apples comparison:

  • Quantization: GGUF Q6_K via llama.cpp llama-quantize (imatrix calibration)
  • Inference: llama.cpp llama-server (OpenAI-compatible API)
  • Evaluation: lm-evaluation-harness, task gpqa_diamond_cot_zeroshot
  • Backend: local-chat-completions against llama.cpp API
  • GPU: NVIDIA RTX 3090 (24 GB), 99 layers offloaded

Configuration (locked, all variants identical)

Parameter          Value
Context size       32768 tokens
Reasoning format   deepseek (separates thinking into reasoning_content)
Reasoning budget   16384 tokens
max_gen_toks       24576
Temperature        1.0 (Gemma 4 official sampling)
top_p              0.95
top_k              64
Seed               42
DRY multiplier     0.5 (anti-degenerate-loop sampler; proven to fix the Q53 "re-re-re" crash)
Tokenizer          google/gemma-4-26B-A4B-it (original, unmodified)

The reasoning budget is critical: without it, Gemma 4 enters overthinking loops on hard questions and exhausts the full context without committing to an answer. This is base-model behavior — the 128-expert reference does the same.

Results (Q6_K, full 198-question GPQA Diamond)

Model                            Drop map                       Score     flex-extract %   vs 128e
gemma-4-26B-A4B-it (reference)                                  162/198   81.82%
gemma-4-A4B-109e-v3 (this)       clean fp32 teacher-force       159/198   80.30%           −1.52 pp
gemma-4-A4B-109e-v2 (old v2)     corrupted bf16 teacher-force   156/198   78.79%           −3.03 pp
gemma-4-A4B-120e-v4              corrupted bf16 teacher-force   154/198   77.78%           −4.04 pp
gemma-4-A4B-120e-v3              old greedy                     152/198   76.77%           −5.05 pp
gemma-4-A4B-109e (older)         old greedy contribution        148/198   74.75%           −7.07 pp

Patch Methodology

The raw lm-eval run on this model scored 154/198 = 77.78%. 10 questions returned empty or truncated responses because of a llama.cpp PEG parser bug (upstream issue #21418, merged but not fully fixed) that triggers on some chemistry/reasoning questions when reasoning-format deepseek is active.

These 10 were re-run using the llama.cpp /completion endpoint with a short prefilled continuation ("The answer is (") at n_predict=10. This bypasses the reasoning parser entirely — the model commits to a single letter in 1-2 tokens, with no channel/thought markers to trip up the parser.
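A sketch of that patch path (hypothetical code, not the script actually used; the /completion endpoint and n_predict field follow llama.cpp's server API, while the prompt text and the extraction regex are illustrative):

```python
import json
import re
import urllib.request

def extract_letter(completion):
    """Pull the A/B/C/D choice the model emits right after the prefill '('. """
    m = re.match(r"\s*([ABCD])", completion)
    return m.group(1) if m else None

def patch_question(prompt_text, base_url="http://localhost:8099"):
    # Prefilled continuation bypasses the reasoning parser: the model
    # commits to a single letter in 1-2 tokens, no thought markers.
    payload = {
        "prompt": prompt_text + "\nThe answer is (",
        "n_predict": 10,
        "temperature": 1.0,
    }
    req = urllib.request.Request(
        base_url + "/completion",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        content = json.loads(resp.read())["content"]
    return extract_letter(content)

print(extract_letter("B) because the ylide attacks first"))  # 'B'
```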

Patch result: 5 of 10 were correct (expected 2.5 if random; the model has real signal on the "missing" questions when given a path to answer them). Final patched score: 159/198 = 80.30%.

The same patch method was applied consistently to all compared variants above (128e, 109e-v2, 120e-v4) — so the ranking is fair.

Wrong-answer breakdown (Q6_K, flexible-extract)

  • 154 correct from raw lm-eval + 5 patched = 159 correct
  • 34 wrong from raw lm-eval + 5 patched-but-wrong = 39 wrong
  • 10 would-be-invalid → patched (5 correct, 5 wrong)

Architecture

Unchanged from the original except num_experts: 109 (was 128):

  • Layers: 30
  • Hidden size: 2816
  • Expert intermediate size: 704 per expert
  • Dense MLP intermediate size: 2112 (always active)
  • Top-k routing: 8
  • Attention: Hybrid sliding (5) + global (1) pattern, head_dim=512 (requires attn_implementation="eager" — FlashAttention 2 does not support head_dim > 256)
  • Vocabulary: 262,144

Files

  • config.json — Model config with num_experts: 109
  • model-0000N-of-00009.safetensors — Model weights (9 shards, 41.7 GB total bf16)
  • expert_drop_metadata.json — Per-layer keep/drop expert indices and methodology
  • tokenizer.json / tokenizer_config.json / chat_template.jinja — from the base 26B-A4B-it (unchanged)
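A quick way to sanity-check a downloaded drop map against the architecture above. The field names here ("layers", "keep", "drop") are assumptions about the expert_drop_metadata.json schema, not documented guarantees:

```python
def validate_drop_map(meta, n_layers=30, n_experts=128, n_keep=109):
    """Check every layer keeps 109 experts, drops 19, and covers all 128."""
    for layer in range(n_layers):
        entry = meta["layers"][str(layer)]        # assumed schema
        kept, dropped = set(entry["keep"]), set(entry["drop"])
        assert len(kept) == n_keep
        assert len(dropped) == n_experts - n_keep
        assert kept.isdisjoint(dropped)
        assert kept | dropped == set(range(n_experts))
    return True

# Toy metadata: keep the first 109 experts in every layer.
toy = {"layers": {str(l): {"keep": list(range(109)),
                           "drop": list(range(109, 128))}
                  for l in range(30)}}
print(validate_drop_map(toy))  # True
```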

Usage

Transformers

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model = AutoModelForCausalLM.from_pretrained(
    "ManniX-ITA/gemma-4-A4B-109e-v3-it",
    torch_dtype=torch.bfloat16,
    device_map="auto",
    attn_implementation="eager",  # Gemma 4 head_dim=512 is not supported by FA2
)
tok = AutoTokenizer.from_pretrained("ManniX-ITA/gemma-4-A4B-109e-v3-it")

msgs = [{"role": "user", "content": "Explain the Heisenberg uncertainty principle."}]
inputs = tok.apply_chat_template(msgs, return_tensors="pt", add_generation_prompt=True).to(model.device)
out = model.generate(inputs, max_new_tokens=512, do_sample=True, temperature=1.0, top_p=0.95, top_k=64)
print(tok.decode(out[0][inputs.shape[1]:], skip_special_tokens=True))

llama.cpp (recommended for consumer hardware)

GGUF quantizations are available at ManniX-ITA/gemma-4-A4B-109e-v3-it-GGUF. The Q6_K quant fits comfortably in ~19 GB VRAM and was used for all benchmarks above.

llama-server -m gemma-4-A4B-109e-v3-Q6_K.gguf \
    --port 8099 -c 32768 -ngl 99 \
    --reasoning-format deepseek --reasoning-budget 16384 \
    --temp 1.0 --top-p 0.95 --top-k 64 --dry-multiplier 0.5 --seed 42

Or convert locally:

python llama.cpp/convert_hf_to_gguf.py gemma-4-A4B-109e-v3-it --outfile model-f16.gguf --outtype f16
llama.cpp/build/bin/llama-quantize model-f16.gguf model-Q6_K.gguf Q6_K

Reproduction

The full pipeline is deterministic and bit-reproducible. Same base model + same drop map + same script = bit-identical safetensors (verified across two independent rebuilds at different sites, all 9 shards SHA256-matched):

  1. scripts/teacher_force_analysis.py — Clean fp32 per-expert contribution analysis on GPQA Diamond
  2. scripts/generate_drop_map.py — Aggregate per-question top-16 protections into a global drop map
  3. scripts/expert_drop.py — Deterministic expert pruning from the drop map
  4. scripts/eval_gpqa_v3.sh — Canonical locked-methodology evaluation via llama.cpp + lm-eval

License

This model inherits the Gemma license from the base model.

Acknowledgements

  • Google for the base Gemma 4 26B-A4B-it model
  • The GPQA Diamond benchmark (Rein et al., 2023)
  • bartowski for the calibration data v5 used in imatrix-based GGUF quantization