Gemma 4 A4B 98-Expert v4 (20.8B)

Expert-pruned version of google/gemma-4-26B-A4B-it, reduced from 128 to 98 experts per MoE layer using a multi-class ContribDynamic (CD) drop map with max-over-classes aggregation.

| | Original (128e) | 98e v3 | 98e v4 (this model) |
|---|---|---|---|
| Total params | 26B | 20.8B | ~20.8B |
| Experts per layer | 128 | 98 | 98 |
| Experts dropped | | 30/layer | 30/layer |
| MoE capacity removed | | 23.4% | 23.4% |
| Top-k routing | 8 | 8 | 8 |
| Drop map source | | TF pooled, p16 | multi-class CD-map (max), p16 |

Eval results for v4 are FINAL (updated 2026-05-14 09:10 CEST). A canonical vLLM-NVFP4A16 9-bench suite has been run locally with apples-to-apples 128e baselines on the same hardware / pipeline / quant family. All 9 v4 benches complete; 128e is fully complete. Full scoreboard, per-bench token distributions, anomaly inspection, and the per-domain analysis of the GPQA-Diamond delta are in § Canonical vLLM NVFP4A16 evaluation below. The previous pod-rented v4 numbers (HE 92.68%, MBPP 76.40%, LCB 61.82%) were discarded due to settings drift and token-budget truncation; only the canonical local runs are cited here.

GGUF quantizations available at ManniX-ITA/gemma-4-A4B-98e-v4-it-GGUF — full Bartowski tier sweep + ContribDynamic (CD) per-layer quants. Built with the same imatrix calibration (bartowski v5) as v3.

What changed vs v3

v3 ranked experts with a single pooled score:

score[layer][expert] = Σ_q (wnorm_q × α + tc_q)    α=2.0

aggregating across all teacher-force questions in scripts/per_question_teacher_force.json. The bottom 30 per layer were dropped, with protect_top=16 shielding the strongest experts. This produced 75.25% GPQA Diamond (Q6_K) — full parity with the 128e reference.

v4 drops the same number of experts (30 per layer) with the same drop applier; only the signal used to rank them is different. Instead of pooling across questions, the 30 experts dropped per layer are chosen via max-over-classes of normalized per-class CD scores from scripts/expert_neuron_v4.json:

# 5 task classes: math / logic / code / science / creative
for cls in classes:                                # `class` is reserved in Python, hence `cls`
    s[cls][layer][expert] = wnorm × α + tc         # per-class score
    s[cls][layer] /= mean(s[cls][layer])           # normalize by per-(class, layer) mean so class scales are comparable
score[layer][expert] = max(s[cls][layer][expert] for cls in classes)

The reasoning: an expert that specializes in one class (e.g. fires only on math) gets a low score under v3's pooled aggregation but a high score under v4's max-over-classes — because at least one class strongly relies on it. The hypothesis: GPQA spans physics / biology / chemistry, so multi-domain coverage from preserved specialists should reward the v4 ranking over v3's averaging.
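The ranking described above can be sketched in plain Python for a single layer. This is a minimal illustration, not the code in scripts/generate_drop_map_multiclass.py: the function names, the toy inputs and the tie-breaking rule are assumptions, and the protect_top shielding is not modeled.

```python
from statistics import mean


def per_class_scores(wnorm, tc, alpha=2.0):
    """Per-class scores for ONE layer.

    wnorm[cls] and tc[cls] are lists with one value per expert.
    Returns s[cls][expert] = (wnorm * alpha + tc) / per-class mean.
    """
    scores = {}
    for cls in wnorm:
        raw = [w * alpha + t for w, t in zip(wnorm[cls], tc[cls])]
        mu = mean(raw)
        # Assumption: an all-zero class is left unnormalized to avoid dividing by 0.
        scores[cls] = [r / mu for r in raw] if mu else raw
    return scores


def max_over_classes(scores):
    """Aggregate score per expert: the best normalized score over all classes."""
    n_experts = len(next(iter(scores.values())))
    return [max(scores[cls][e] for cls in scores) for e in range(n_experts)]


def experts_to_drop(aggregate, n_drop):
    """Ids of the n_drop lowest-scoring experts (ties broken by expert id)."""
    order = sorted(range(len(aggregate)), key=lambda e: (aggregate[e], e))
    return sorted(order[:n_drop])


# Toy layer: 4 experts, 2 classes. Expert 0 is a math specialist, expert 3 a
# code specialist, expert 2 a generalist, expert 1 is weak everywhere.
wnorm = {"math": [1.0, 0.1, 0.5, 0.1], "code": [0.1, 0.1, 0.5, 1.0]}
tc = {"math": [0.0] * 4, "code": [0.0] * 4}
agg = max_over_classes(per_class_scores(wnorm, tc))
print(experts_to_drop(agg, n_drop=1))
```

In this toy layer both specialists outrank the generalist, and the only expert selected for dropping is the one that is weak in every class.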

Recipe scripts:

  • scripts/generate_drop_map_multiclass.py — produces cd_multiclass_98e_max_drop_map.json (this model's drop map). CLI exposes --strategy max|mean|geomean|sum for ablations.
  • scripts/expert_drop.py — the deterministic drop applier (unchanged from v3).
  • scripts/build_98e_cd.sh — end-to-end build pipeline (drop → F16 → Q6_K → GGUF upload).

All scripts are in omnimergekit (the canonical home for OmniMergeKit experiments).

Why max-over-classes?

A geomean ablation (--strategy geomean) was also tested. Result: catastrophic collapse (HE 24.39% vs v4 max's expected 92-95%). Geomean penalizes any expert with a low score in even one class, which conflicts with how MoE specialization works: a token is routed to the experts relevant to its own class, not to experts that are useful for every class at once. Max-over-classes preserves the strongest specialists; geomean preserves only generalists, and there aren't many.
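The difference between the aggregators can be seen on two made-up experts. The sketch below reuses the strategy names of the --strategy flag, but the aggregate function and the scores are hypothetical and chosen only to illustrate the effect.

```python
from statistics import geometric_mean, mean


def aggregate(class_scores, strategy="max"):
    """Collapse one expert's normalized per-class scores into a single number."""
    values = list(class_scores.values())
    if strategy == "max":
        return max(values)
    if strategy == "mean":
        return mean(values)
    if strategy == "geomean":
        return geometric_mean(values)  # requires strictly positive values
    if strategy == "sum":
        return sum(values)
    raise ValueError(f"unknown strategy: {strategy}")


# Hypothetical normalized scores: one expert that fires almost only on math,
# and one that is average on every class.
specialist = {"math": 3.0, "logic": 0.05, "code": 0.05, "science": 0.05, "creative": 0.05}
generalist = {cls: 1.0 for cls in specialist}

for strategy in ("max", "mean", "geomean"):
    print(strategy,
          round(aggregate(specialist, strategy), 3),
          round(aggregate(generalist, strategy), 3))
```

Under max the specialist outranks the generalist; under mean and geomean it falls below it, and geomean pushes it furthest down because four near-zero classes dominate the product.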

This finding is now a project-level rule:

Optimizer / aggregator off-manifold pathology: aggregators that find a deeper-CE or lower-aggregate-score don't always transfer to autoregressive generation. Always gate model surgery on a generation-mode canary; prefer max / percentile over mean / geomean for importance.

(Source: memory/feedback_optimizer_off_manifold.md.)

Pruning Method (mechanical detail)

Identical to v3 with one substitution — a different drop map. The downstream surgery is the same:

  1. Read drop map JSON: {layer_str: [list-of-30-dropped-expert-ids]}.
  2. For each layer, slice the expert tensors (gate_proj, up_proj, down_proj) keeping only the 98 retained experts.
  3. Resize the MoE router proj.weight from [128, hidden] to [98, hidden], keeping only rows for retained experts. Top-8 routing naturally adapts.
  4. Update config.json: num_experts = 98.
  5. Convert HF → F16 GGUF → quantize → bartowski-style imatrix calibration on calibration_datav5.txt.
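Steps 1 to 4 above amount to index selection along the expert axis. The following sketch shows the idea with nested Python lists standing in for tensors; it is not scripts/expert_drop.py, and prune_layer, the toy shapes and the example drop map are assumptions made for illustration.

```python
def prune_layer(expert_weights, router_rows, dropped):
    """Keep only the retained experts of one MoE layer.

    expert_weights: dict name -> list with one entry per expert
                    (stand-in for the gate_proj / up_proj / down_proj tensors).
    router_rows:    list with one row per expert (stand-in for router proj.weight).
    dropped:        expert ids listed for this layer in the drop map.
    """
    drop = set(dropped)
    keep = [e for e in range(len(router_rows)) if e not in drop]
    pruned = {name: [per_expert[e] for e in keep]
              for name, per_expert in expert_weights.items()}
    new_router = [router_rows[e] for e in keep]
    return pruned, new_router, keep


# Toy layer: 128 experts, hidden size 4; each row is filled with its expert id.
n_experts, hidden = 128, 4
experts = {name: [[float(e)] * hidden for e in range(n_experts)]
           for name in ("gate_proj", "up_proj", "down_proj")}
router = [[float(e)] * hidden for e in range(n_experts)]

# Hypothetical drop map entry: layer "0" drops experts 0..29.
drop_map = {"0": list(range(30))}

pruned, new_router, keep = prune_layer(experts, router, drop_map["0"])
config = {"num_experts": len(keep)}
print(config, len(new_router), len(pruned["gate_proj"]))
```

Because the router rows and the expert tensors are filtered with the same keep list, row i of the resized router still scores the expert stored at position i, which is why top-8 routing needs no further change.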

Canonical vLLM NVFP4A16 evaluation (2026-05-13/14) — 9/9 v4 benches complete

Status (2026-05-14 09:10 CEST): All 9 v4 benches complete; 128e baseline fully complete. Same hardware (RTX 3090), same vllm==0.1.dev16519, same NVFP4A16 quant family (modelopt_fp4), apples-to-apples per-template recipe via omk_eval.

Sampling: temperature 0, top_p 1, top_k 0 (greedy). LCB / HumanEval use parallel=2. Greedy at temperature 0 still has a small ~±2pp per-problem variance from continuous-batching scheduler nondeterminism.

Per-variant LCB recipe (asymmetric): 128e uses no reasoning parser (enable_thinking=false) — the canonical recipe that gives 50/55 (90.91%). v4 uses parser=gemma4 + enable_thinking=true + thinking_token_budget=12288: without the budget v4 enters rumination loops on hard problems (lcb/leetcode/3566 → 53k-char output, 63 duplicate python fences, finish_reason=length). With the budget the same problem solves in ~1.2k chars. Asymmetric recipe lives in lcb_medium_55_v4.yaml.

Scores

| Bench (n) | 128e NVFP4A16 | 98e-v4 NVFP4A16 | Δ |
|---|---|---|---|
| ARC-Challenge full (1172) | 96.25% | 95.99% | −0.26pp |
| GPQA Diamond (198) | 73.23% | 69.19% | −4.04pp |
| GSM8K-100 (100) | 95.00% | 93.00% | −2.00pp |
| MATH-500-100 (100) | 92.00% | 92.00% | 0.00pp |
| AIME 2024 (30) | 73.33% | 66.67% | −6.67pp (= 2 problems) |
| IFEval-100 (100, prompt-strict) | 94.00% | 87.00% | −7.00pp |
| HumanEval (164) | 99.39% | 96.95% | −2.44pp |
| HumanEval+ (164) | 90.24% | 91.46% | +1.22pp ← v4 wins |
| LCB-medium (55) | 89.09% (49/55, no-parser) | 78.18% (43/55, budgeted recipe) | −10.91pp |

Per-bench output token distribution (approx, raw-chars ÷ 4)

On every bench except LCB-medium, the completion-token envelope for both variants is closely matched: v4 doesn't truncate, ramble, or shift its reasoning length. Where v4 differs from 128e on those benches, it gets wrong answers in the same length of output. LCB-medium is the exception and is discussed in the note under the table.

| Bench | Variant | p10 | p50 | p90 | max | mean | total tokens |
|---|---|---|---|---|---|---|---|
| ARC-Challenge | 128e | 207 | 315 | 429 | 1674 | 321 | 376,276 |
| ARC-Challenge | v4 | 206 | 310 | 427 | 16351* | 332 | 390,113 |
| GPQA Diamond | 128e | 492 | 625 | 799 | 2814 | 681 | 134,972 |
| GPQA Diamond | v4 | 501 | 617 | 760 | 2662 | 679 | 134,546 |
| GSM8K-100 | 128e | 33 | 64 | 186 | 266 | 82 | 8,228 |
| GSM8K-100 | v4 | 33 | 61 | 147 | 381 | 76 | 7,608 |
| MATH-500-100 | 128e | 142 | 289 | 434 | 1882 | 309 | 30,911 |
| MATH-500-100 | v4 | 145 | 277 | 485 | 2632 | 314 | 31,466 |
| AIME 2024 | 128e | 369 | 490 | 2160 | 2943 | 695 | 20,856 |
| AIME 2024 | v4 | 375 | 509 | 2631 | 3387 | 848 | 25,442 |
| IFEval-100 | 128e | 31 | 218 | 919 | 3265 | 362 | 36,230 |
| IFEval-100 | v4 | 33 | 211 | 942 | 4599 | 428 | 42,877 |
| HumanEval | 128e | 68 | 166 | 314 | 624 | 185 | 30,359 |
| HumanEval | v4 | 68 | 156 | 298 | 2846 | 193 | 31,702 |
| HumanEval+ | 128e | 63 | 155 | 314 | 903 | 176 | 28,938 |
| HumanEval+ | v4 | 68 | 146 | 296 | 1909 | 182 | 29,861 |
| LCB-medium | 128e | 336 | 2094 | 6307 | 16384 | 3397 | 186,845 |
| LCB-medium | v4 | 5477* | 12759 | 16384 | 16384 | 13178 | 724,815 |

* LCB-medium: v4's completion lengths are 3.9× longer than 128e's on average (mean 13,178 vs 3,397 tokens) and about 6.1× longer at the median (12,759 vs 2,094 tokens). p50 lands within 471 tokens of the thinking budget cap (12,288), and 12/55 (21.8%) responses hit the 16,384-token max_gen_toks ceiling (finish_reason=length). v4's −10.91pp deficit on LCB-medium is driven by these "thinking too long, never reaching final answer" cases, exactly the symptom the asymmetric parser=gemma4 + thinking_token_budget=12288 recipe was deployed to bound. The same model at 8k thinking budget (Q6_K supplementary table below) scores identically 43/55, so raising the cap further does not help: the 12 length-capped completions are on wrong reasoning trajectories that don't converge.

* ARC-Challenge: one problem produced a 65,406-char degenerate reasoning loop (1 of 1,172, scored correctly anyway since the answer letter was still emitted). All other 1,171 ARC responses are bounded under 6k chars.
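The chars ÷ 4 approximation behind the table can be reproduced with a few lines of Python. This is a sketch under assumptions: integer floor division and the percentile method of statistics.quantiles are guesses at how the published columns were computed, and summarize is a hypothetical helper.

```python
from statistics import mean, median, quantiles


def approx_tokens(text):
    """Rough token count: raw characters divided by 4, rounded down."""
    return len(text) // 4


def summarize(responses):
    """Length statistics over a list of raw response strings (needs >= 2 items)."""
    toks = sorted(approx_tokens(r) for r in responses)
    deciles = quantiles(toks, n=10)  # 9 cut points: first ~ p10, last ~ p90
    return {
        "p10": deciles[0],
        "p50": median(toks),
        "p90": deciles[-1],
        "max": toks[-1],
        "mean": mean(toks),
        "total": sum(toks),
    }


# The 65,406-char ARC-Challenge outlier maps to the table's max column.
print(approx_tokens("x" * 65406))
print(summarize(["a" * 4, "b" * 8, "c" * 12]))
```

With floor division the 65,406-character outlier comes out at 16,351 tokens, matching the ARC-Challenge max entry for v4.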

Anomalies inspected

A per-bench sanity sweep of the v4 samples_*.jsonl turned up:

| Bench | Issue | Count | Impact |
|---|---|---|---|
| GSM8K-100 | silent-empty output (model emits zero tokens) | 5/100 | −5pp (would otherwise be ~98%) |
| HumanEval | silent-empty output on get_max_triples (doc 147) | 1/164 | −0.6pp |
| HumanEval+ | silent-empty output on HumanEval/107 (different task than HE-164's HumanEval/147; stochastic) | 1/164 | −0.6pp |
| IFEval-100 | constraint-collision rumination on doc 26 ("French, no 'nourriture'"): locked in "l'alimentation de l'alimentation..." loop | 1/100 | −1pp |
| IFEval-100 | format-mismatch shorts (real model errors, not eval bugs) | 3/100 | counted as wrong correctly |

The silent-empty pattern (greedy decoding emits <eos> as the first token) is a real v4 model-behavior failure that does not occur on 128e under the identical recipe. On GSM8K the 5 empty outputs exceed the whole 2-problem deficit to 128e; on HumanEval the empty output is 1 of the 4 problems lost (163/164 vs 159/164). A min_tokens=1 sampler probe is queued.
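A sanity sweep for the silent-empty pattern can be as small as the sketch below. It assumes that each line of samples_*.jsonl is a JSON object with a doc_id field and a resps field holding a list of lists of strings, as lm-evaluation-harness usually writes them; the helper names are hypothetical and the field layout should be checked against the actual files.

```python
import json


def is_silent_empty(record):
    """True if every response stored for this sample is empty or whitespace."""
    flat = []
    for resp in record.get("resps") or []:
        # Assumed layout: resps is a list of lists; tolerate a flat list too.
        if isinstance(resp, list):
            flat.extend(resp)
        else:
            flat.append(resp)
    return all(not str(text).strip() for text in flat)


def find_silent_empty(lines):
    """Return the doc_id of every sample whose output is silent-empty."""
    hits = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        if is_silent_empty(record):
            hits.append(record.get("doc_id"))
    return hits


# In-memory stand-in for a samples_*.jsonl file.
sample_lines = [
    '{"doc_id": 0, "resps": [["The answer is 42."]]}',
    '{"doc_id": 1, "resps": [[""]]}',
    '{"doc_id": 2, "resps": [["  \\n"]]}',
]
print(find_silent_empty(sample_lines))
```

For a real run, the same function can be fed an open file handle, for example find_silent_empty(open(path, encoding="utf-8")).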

GPQA-Diamond domain breakdown: the −4.04pp deficit is almost entirely Chemistry (7 of the 8 net problems lost)

| High-level domain | n | 128e | v4 | Δpp |
|---|---|---|---|---|
| Physics | 86 | 88.4% | 88.4% | 0.0 |
| Chemistry | 93 | 62.4% | 54.8% | −7.5 |
| Biology | 19 | 57.9% | 52.6% | −5.3 (1 problem, noise) |

Subdomain detail

| Subdomain | n | 128e | v4 | Δpp |
|---|---|---|---|---|
| High-energy particle physics | 14 | 100.0% | 100.0% | 0.0 |
| Quantum Mechanics | 25 | 92.0% | 92.0% | 0.0 |
| Physics (general) | 19 | 84.2% | 84.2% | 0.0 |
| Astrophysics | 13 | 76.9% | 84.6% | +7.7 |
| Electromagnetism and Photonics | 6 | 83.3% | 83.3% | 0.0 |
| Molecular Biology | 15 | 60.0% | 60.0% | 0.0 |
| Relativistic Mechanics | 7 | 85.7% | 71.4% | −14.3 (1 problem, noise) |
| Organic Chemistry | 72 | 58.3% | 51.4% | −6.9 |
| Chemistry (general) | 20 | 75.0% | 65.0% | −10.0 |

Per-problem agreement matrix (GPQA-Diamond)

| | both correct | both wrong | only 128e right | only v4 right |
|---|---|---|---|---|
| count | 122 (61.6%) | 38 (19.2%) | 23 (v4 LOST) | 15 (v4 GAINED) |

Of the 23 problems v4 lost: 18 Chemistry, 4 Physics, 1 Biology. Of the 15 v4 rescued: 11 Chemistry, 4 Physics. Physics losses and gains balance exactly (4=4); Chemistry takes 18 losses against 11 gains for the net −7.
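The agreement matrix is a four-way tally over paired per-problem outcomes. The sketch below rebuilds it from two lists of booleans; the lists are synthesized from the published cell counts rather than read from the real sample files, and the function name is an assumption.

```python
from collections import Counter


def agreement(base_correct, pruned_correct):
    """Tally paired per-problem outcomes of the 128e baseline and the pruned model."""
    cells = Counter()
    for base_ok, pruned_ok in zip(base_correct, pruned_correct):
        if base_ok and pruned_ok:
            cells["both_correct"] += 1
        elif not base_ok and not pruned_ok:
            cells["both_wrong"] += 1
        elif base_ok:
            cells["only_base"] += 1
        else:
            cells["only_pruned"] += 1
    return cells


# Synthetic outcome lists that reproduce the published counts (122 / 38 / 23 / 15).
base = [True] * 122 + [False] * 38 + [True] * 23 + [False] * 15
pruned = [True] * 122 + [False] * 38 + [False] * 23 + [True] * 15

cells = agreement(base, pruned)
n = sum(cells.values())
base_acc = 100 * (cells["both_correct"] + cells["only_base"]) / n
pruned_acc = 100 * (cells["both_correct"] + cells["only_pruned"]) / n
print(dict(cells), round(base_acc, 2), round(pruned_acc, 2))
```

The two accuracies recovered from the cells are 73.23% and 69.19%, the same values as the GPQA Diamond row of the scoreboard, so the matrix and the headline scores are mutually consistent.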

Interpretation

The v4 CD-map multi-class drop scored experts via max-over-normalized-classes across math / logic / code / science / creative. Within "science", chemistry and physics knowledge live in distinct expert sets, but the score lumps them together. Physics experts fire broadly across the science class; narrow organic-chemistry specialists score lower in the aggregate even when their peak class-specific contribution is large. The 30-expert-per-layer prune at protect_top=16 then drops a non-negligible fraction of those chemistry specialists.

Practical implications:

  • For physics / quantum / math reasoning, v4 ≈ 128e (zero net degradation on the 86 Physics problems; the astrophysics +7.7pp and relativistic mechanics −14.3pp shifts are one problem each, within noise, and cancel out).
  • For organic / general chemistry, v4 is 7-10pp behind 128e on the 92-problem chemistry subset.
  • A future variant with a finer-grained CD-map (splitting "science" into chemistry / physics / biology classes before the max aggregation) could likely recover most of those 18 chemistry losses without disturbing the physics tie.

Filed as a follow-up experiment in omnimergekit (task T14.3b: subdomain-resolved CD-map ablation).

Supplementary llama.cpp Q6_K snapshot (2026-05-11) — tainted, do not cite

These numbers are appended for archival reference only and DO NOT supersede the previously published evaluations above. They were collected with a llama.cpp build whose Gemma 4 support is incomplete; multiple known issues (reasoning-token leaks, <unused> flood, fence drift, cache-reuse gaps) are still active in this serving path. The proper way to read this section is "what numbers does the llama.cpp pipeline produce today", not "what is the true capability of the model".

  • llama.cpp tag/build: b9095-2-g0b04728 (build 590), CUDA backend on RTX 3090.
  • Serving profile: --reasoning-format deepseek --reasoning-budget 8192, chat-completions endpoint, q8_0 KV cache, --parallel 2.
  • Sampler: greedy for HE/MBPP/LCB (--temp 0 --top-p 1 --top-k 0 --seed 42), gemma4 preset for GPQA (T=1.0 / top_p=0.95 / top_k=64).
  • Scoring: lm-evaluation-harness --apply_chat_template, --use_cache, --log_samples, no fence-strip rescore applied here (chat-completions cleanly returns resps).
  • A vLLM 4-bit cross-validation of these same benchmarks is scheduled; if vLLM agrees, the deltas below are model-level; if not, the deltas are pipeline-level and the canonical (published) numbers stand.

Scores

| Bench (n) | 128e Q6_K | 98e-v3 Q6_K | 98e-v4 (cd-max) Q6_K |
|---|---|---|---|
| HumanEval-chat @3072 (164) | 97.56% | 73.78% | 96.34% |
| MBPP-chat (500) | 79.20% ±1.82 | (no clean chat run) | 76.00% ±1.91 |
| LCB-medium @8k (55) | 87.27% (48/55) | | 78.18% (43/55) |
| LCB-medium @16k (55) | in flight | | 78.18% (43/55) |
| GPQA-Diamond (198) | 65.66% ±3.38 (v4-proto) | 75.25% ±3.07 (legacy proto) | 77.27% ±2.99 |

The 128e GPQA at 65.66% under the v4 protocol is well below the canonical 75.25% Q6_K and is the clearest evidence here that the llama.cpp serving path is the failure mode, not the model. The 75.25% legacy number remains the value to cite for 128e. The 98e-v3 row reuses the published GPQA result (run under the legacy llama.cpp protocol) for completeness; it is not re-measured here.
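The ± values in the table are consistent with a binomial sample standard error on the accuracy, sqrt(p(1−p)/(n−1)). The sketch below assumes that formula (it is not stated in the source) and checks it against the published figures.

```python
import math


def accuracy_stderr(p, n):
    """Sample standard error of an accuracy p measured on n problems.

    Assumption: Bernoulli outcomes with the (n - 1) sample-variance correction.
    """
    return math.sqrt(p * (1.0 - p) / (n - 1))


# (accuracy, n, published ± in percentage points)
published = [
    (0.6566, 198, 3.38),  # GPQA-Diamond, 128e
    (0.7525, 198, 3.07),  # GPQA-Diamond, 98e-v3
    (0.7727, 198, 2.99),  # GPQA-Diamond, 98e-v4
    (0.7920, 500, 1.82),  # MBPP-chat, 128e
    (0.7600, 500, 1.91),  # MBPP-chat, 98e-v4
]

for p, n, reported in published:
    print(f"{100 * p:.2f}% n={n}: computed ±{100 * accuracy_stderr(p, n):.2f}, reported ±{reported}")
```

At n=198 one standard error is about 3pp, which is the scale to keep in mind when reading single-digit GPQA deltas between variants.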

Output tokens per response (samples-derived)

LCB tokens are real (completion_tokens from the patched runner). HE / MBPP / GPQA are approximated as len(resps) / 4 from samples_*.jsonl; this is within ±15% of true Gemma 4 tokens.

| Bench | Variant | median | mean | p95 | max | total (n × mean) |
|---|---|---|---|---|---|---|
| HE-chat | 128e | 313 | 334 | 681 | 917 | 54.9k |
| HE-chat | 98e-v3 | 490 | 512 | 953 | 1013 | 84.0k |
| HE-chat | 98e-v4 | 303 | 340 | 755 | 895 | 55.9k |
| MBPP-chat | 128e | 194 | 224 | 453 | 532 | 112k |
| MBPP-chat | 98e-v3 | 129 | 356 | 1328 | 1892 | 178k (raw-protocol outliers) |
| MBPP-chat | 98e-v4 | 165 | 206 | 455 | 530 | 103k |
| LCB-med @8k | 128e | 1174 | 2167 | 7949 | 8192 | 119k |
| LCB-med @8k | 98e-v4 | 2829 | 3667 | 8192 | 8192 | 202k |
| LCB-med @16k | 98e-v4 | 2829 | 4913 | 15983 | 16064 | 270k |
| GPQA-D | 128e | 749 | 948 | 2098 | 4576 | 375k |
| GPQA-D | 98e-v3 | 648 | 655 | 815 | 1076 | 250k |
| GPQA-D | 98e-v4 | 676 | 783 | 950 | 5495 | 305k |

Raising the LCB cap from 8k to 16k for 98e-v4 did not change the score (43/55 both times). The cap was binding for 22/55 problems at 8k but every truncated answer was already on a wrong trajectory; the failures are real, not truncation.

Usage

Transformers

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model = AutoModelForCausalLM.from_pretrained(
    "ManniX-ITA/gemma-4-A4B-98e-v4-it",
    torch_dtype=torch.bfloat16,
    device_map="auto",
    attn_implementation="eager",  # Gemma 4 head_dim=512 not supported by FA2
)
tok = AutoTokenizer.from_pretrained("ManniX-ITA/gemma-4-A4B-98e-v4-it")

msgs = [{"role": "user", "content": "Explain the Heisenberg uncertainty principle."}]
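# return_dict=True yields input_ids plus attention_mask regardless of the library's default return type
inputs = tok.apply_chat_template(
    msgs, add_generation_prompt=True, return_tensors="pt", return_dict=True
).to(model.device)
out = model.generate(**inputs, max_new_tokens=512, do_sample=True, temperature=1.0, top_p=0.95, top_k=64)
print(tok.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))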

llama.cpp (recommended)

llama-server -m gemma-4-A4B-98e-v4-it-Q6_K.gguf \
    --port 8099 -c 32768 -ngl 99 --no-warmup \
    --jinja --reasoning-format deepseek --reasoning-budget 16384 \
    --temp 1.0 --top-p 0.95 --top-k 64 --dry-multiplier 0.5

For greedy code-generation tasks, replace the sampler with --temp 0 --top-p 1 --top-k 0 --reasoning off.
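Once the server is running, it can be queried over its OpenAI-compatible chat endpoint. The sketch below uses only the standard library and assumes the server listens on port 8099 as in the command above; build_payload and chat are hypothetical helpers, and passing top_k in the request body relies on llama-server accepting that extra sampling field.

```python
import json
import urllib.request


def build_payload(prompt, greedy=False, max_tokens=512):
    """Chat-completions request body with the sampler settings used in this card."""
    payload = {
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    if greedy:
        # Greedy settings for code-generation tasks.
        payload.update({"temperature": 0.0, "top_p": 1.0, "top_k": 0})
    else:
        # Gemma 4 preset: T=1.0 / top_p=0.95 / top_k=64.
        payload.update({"temperature": 1.0, "top_p": 0.95, "top_k": 64})
    return payload


def chat(prompt, base_url="http://localhost:8099", **kwargs):
    """Send one user message and return the assistant's reply text."""
    request = urllib.request.Request(
        base_url + "/v1/chat/completions",
        data=json.dumps(build_payload(prompt, **kwargs)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=600) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]


# Example (requires a running llama-server):
# print(chat("Explain the Heisenberg uncertainty principle."))
```

Per-request sampler fields in the body are meant to override the defaults given on the llama-server command line, so the same server instance can serve both the sampled and the greedy profile.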

GGUF quants: ManniX-ITA/gemma-4-A4B-98e-v4-it-GGUF

Related Models

| Model | Description |
|---|---|
| gemma-4-A4B-98e-v3-it | 98e, pooled TF map (v3 recipe), 75.25% GPQA |
| gemma-4-A4B-98e-v3-it-GGUF | GGUF quants for v3 |
| gemma-4-A4B-109e-v3-it | 109e, 19/layer drop with the same pooled v3 map |
| gemma-4-A4B-98e-v4-it-GGUF | GGUF quants for this model |

Recipe + Code

OmniMergeKit is the canonical home for these experiments. The relevant artifacts are:

  • eval/EVAL_PROTOCOL.md — locked methodology for every eval (HE / MBPP / MPE / LCB / GPQA), with per-task sampler / context / cap settings, mandatory cache + log-samples + post-run sanity checklist.
  • eval/lcb/lcb_llama_server.py — patched LCB runner (resumable, records prompt_tokens / completion_tokens / finish_reason; default max_tokens=8192 after truncation analysis).
  • eval/scripts/eval_gpqa_v3.sh — GPQA Diamond canonical wrapper with multi-arch sampler family presets.
  • eval/scripts/multipl_e_generate.py — MultiPL-E generator with retry-on-5xx (server errors are never silently dropped).

License

This model inherits the Gemma license from the base model.

Acknowledgements

  • Google for the base Gemma 4 26B-A4B-it model
  • The GPQA Diamond benchmark (Rein et al., 2023)
  • bartowski for the calibration data v5 used in imatrix GGUF quantization
  • The OmniMergeKit project for the merge / surgery toolkit