Gemma 4 A4B 98-Expert v4 (20.8B)

Expert-pruned version of google/gemma-4-26B-A4B-it, reduced from 128 to 98 experts per MoE layer using a multi-class ContribDynamic (CD) drop map with max-over-classes aggregation.

	Original (128e)	98e v3	98e v4 (this model)
Total params	26B	20.8B	~20.8B
Experts per layer	128	98	98
Experts dropped	—	30/layer	30/layer
MoE capacity removed	—	23.4%	23.4%
Top-k routing	8	8	8
Drop map source	—	TF pooled, p16	multi-class CD-map (max), p16

Eval results for v4 are FINAL (updated 2026-05-14 09:10 CEST). A canonical vLLM-NVFP4A16 9-bench suite has been run locally with apples-to-apples 128e baselines on the same hardware / pipeline / quant family. All 9 v4 benches complete; 128e is fully complete. Full scoreboard, per-bench token distributions, anomaly inspection, and the per-domain analysis of the GPQA-Diamond delta are in § Canonical vLLM NVFP4A16 evaluation below. The previous pod-rented v4 numbers (HE 92.68%, MBPP 76.40%, LCB 61.82%) were discarded due to settings drift and token-budget truncation; only the canonical local runs are cited here.

GGUF quantizations available at ManniX-ITA/gemma-4-A4B-98e-v4-it-GGUF — full Bartowski tier sweep + ContribDynamic (CD) per-layer quants. Built with the same imatrix calibration (bartowski v5) as v3.

What changed vs v3

v3 ranked experts with a single pooled score:

score[layer][expert] = Σ_q (wnorm_q × α + tc_q)    α=2.0

aggregating across all teacher-force questions in scripts/per_question_teacher_force.json. The bottom 30 per layer were dropped, with protect_top=16 shielding the strongest experts. This produced 75.25% GPQA Diamond (Q6_K) — full parity with the 128e reference.

v4 drops the same number of experts (30 per layer) and applies the same surgery, but the signal used to rank them is different. Instead of pooling across questions, the 30 experts dropped per layer are chosen via max-over-classes of normalized per-class CD scores from scripts/expert_neuron_v4.json:

# 5 task classes: math / logic / code / science / creative
for class in classes:
    s[class][layer][expert] = wnorm × α + tc      # per-class score
    normalize by per-(class, layer) mean          # so class scales are comparable
score[layer][expert] = max over classes(s[class][layer][expert])

The reasoning: an expert that specializes in one class (e.g. fires only on math) gets a low score under v3's pooled aggregation but a high score under v4's max-over-classes — because at least one class strongly relies on it. The hypothesis: GPQA spans physics / biology / chemistry, so multi-domain coverage from preserved specialists should reward the v4 ranking over v3's averaging.
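The difference between the two rankings can be sketched on a toy example. This is a minimal illustration, not the code in scripts/generate_drop_map_multiclass.py: the array layout, the function names, and the toy numbers are assumptions, pooling is simplified to a plain sum over classes, and protect_top is not modelled.

```python
import numpy as np

def max_over_classes(per_class: np.ndarray) -> np.ndarray:
    """per_class has shape [n_classes, n_layers, n_experts] (assumed layout).

    Returns one aggregated score per (layer, expert).
    """
    # Normalize by the per-(class, layer) mean so class scales are comparable.
    layer_mean = per_class.mean(axis=2, keepdims=True)
    normalized = per_class / layer_mean
    # An expert keeps the best normalized score it reaches in any class.
    return normalized.max(axis=0)

def pooled(per_class: np.ndarray) -> np.ndarray:
    """Simplified stand-in for pooled scoring: sum raw scores over classes."""
    return per_class.sum(axis=0)

def bottom_k(scores: np.ndarray, k: int) -> np.ndarray:
    """Indices of the k lowest-scoring experts in each layer."""
    return np.argsort(scores, axis=1, kind="stable")[:, :k]

# Toy data: 2 classes, 1 layer, 4 experts. Expert 3 is a class-1 specialist.
toy = np.array([
    [[4.0, 3.0, 2.8, 0.2]],  # class 0
    [[1.0, 1.5, 1.5, 4.0]],  # class 1
])

print("pooled drops expert", bottom_k(pooled(toy), 1)[0, 0])
print("max-over-classes drops expert", bottom_k(max_over_classes(toy), 1)[0, 0])
```

In this toy case the pooled ranking drops the specialist (expert 3), while max-over-classes keeps it and drops a mid-scoring generalist instead.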

Recipe scripts:

scripts/generate_drop_map_multiclass.py — produces cd_multiclass_98e_max_drop_map.json (this model's drop map). CLI exposes --strategy max|mean|geomean|sum for ablations.
scripts/expert_drop.py — the deterministic drop applier (unchanged from v3).
scripts/build_98e_cd.sh — end-to-end build pipeline (drop → F16 → Q6_K → GGUF upload).

All scripts are in omnimergekit (the canonical home for OmniMergeKit experiments).

Why max-over-classes?

A geomean ablation (--strategy geomean) was also tested. Result: catastrophic collapse (HE 24.39%, against the 92-95% expected of the max variant). Geomean penalizes any expert with a low score in even one class, which conflicts with how MoE specialization actually works: a given input is routed to the specialists for its own class, not to experts that are moderately useful for every class at once. Max-over-classes preserves the strongest specialist; geomean preserves only generalists, and there aren't many.
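The effect of the aggregator on a specialist can be seen with a few lines of arithmetic. The strategy names mirror the --strategy options listed above; the score profiles are invented for illustration and are not taken from expert_neuron_v4.json.

```python
import math

def aggregate(scores, strategy):
    """Combine one expert's normalized per-class scores into a single value."""
    if strategy == "max":
        return max(scores)
    if strategy == "mean":
        return sum(scores) / len(scores)
    if strategy == "sum":
        return sum(scores)
    if strategy == "geomean":
        # Geometric mean via logs; requires strictly positive scores.
        return math.exp(sum(math.log(s) for s in scores) / len(scores))
    raise ValueError(f"unknown strategy: {strategy}")

# Hypothetical profiles over 5 classes (math / logic / code / science / creative).
specialist = [0.05, 0.05, 3.0, 0.05, 0.05]  # fires almost only on one class
generalist = [0.6, 0.6, 0.6, 0.6, 0.6]      # mildly useful everywhere

for strategy in ("max", "mean", "geomean"):
    s = aggregate(specialist, strategy)
    g = aggregate(generalist, strategy)
    print(f"{strategy:8s} specialist={s:.3f} generalist={g:.3f}")
```

Under max the specialist outranks the generalist; under geomean the four near-zero classes pull the specialist far below it, which is the behaviour the ablation result points to.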

This finding is now a project-level rule:

Optimizer / aggregator off-manifold pathology: an aggregator that reaches a lower cross-entropy or a better aggregate score does not necessarily transfer to autoregressive generation. Always gate model surgery on a generation-mode canary; prefer max / percentile over mean / geomean for importance.

(Source: memory/feedback_optimizer_off_manifold.md.)

Pruning Method (mechanical detail)

Identical to v3 with one substitution — a different drop map. The downstream surgery is the same:

Read drop map JSON: {layer_str: [list-of-30-dropped-expert-ids]}.
For each layer, slice the expert tensors (gate_proj, up_proj, down_proj) keeping only the 98 retained experts.
Resize the MoE router proj.weight from [128, hidden] to [98, hidden], keeping only rows for retained experts. Top-8 routing naturally adapts.
Update config.json: num_experts = 98.
Convert HF → F16 GGUF → quantize → bartowski-style imatrix calibration on calibration_datav5.txt.
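The slicing steps above can be sketched with NumPy on toy shapes. This is an illustration under stated assumptions, not scripts/expert_drop.py: it assumes each expert tensor and the router weight carry the expert index on their leading dimension, and prune_layer is a hypothetical helper name.

```python
import numpy as np

def prune_layer(experts, router_weight, dropped):
    """Keep only the retained experts of one MoE layer.

    experts: dict of name -> array with shape [n_experts, ...] (assumed layout)
    router_weight: array with shape [n_experts, hidden]
    dropped: expert ids listed for this layer in the drop map
    """
    n_experts = router_weight.shape[0]
    drop = set(dropped)
    keep = np.array([e for e in range(n_experts) if e not in drop])
    pruned = {name: weight[keep] for name, weight in experts.items()}
    # Router rows are selected with the same index so row i still scores expert i.
    return pruned, router_weight[keep], keep

# Toy layer: 8 experts instead of 128, dropping 2 instead of 30.
rng = np.random.default_rng(0)
n, hidden, inter = 8, 4, 6
experts = {
    "gate_proj": rng.normal(size=(n, inter, hidden)),
    "up_proj": rng.normal(size=(n, inter, hidden)),
    "down_proj": rng.normal(size=(n, hidden, inter)),
}
router = rng.normal(size=(n, hidden))

pruned, new_router, keep = prune_layer(experts, router, dropped=[1, 5])
print(keep.tolist(), new_router.shape, pruned["gate_proj"].shape)
```

Because retained router rows are copied unchanged, the router logits for surviving experts are identical to the original model's; top-8 selection simply runs over fewer candidates.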

Canonical vLLM NVFP4A16 evaluation (2026-05-13/14) — 9/9 v4 benches complete

Status (2026-05-14 09:10 CEST): All 9 v4 benches complete; 128e baseline fully complete. Same hardware (RTX 3090), same vllm==0.1.dev16519, same NVFP4A16 quant family (modelopt_fp4), apples-to-apples per-template recipe via omk_eval.

Sampling: temperature 0, top_p 1, top_k 0 (greedy). LCB / HumanEval use parallel=2. Even greedy decoding at temperature 0 shows roughly ±2pp run-to-run score variance from continuous-batching scheduler nondeterminism.

Per-variant LCB recipe (asymmetric): 128e uses no reasoning parser (enable_thinking=false), the canonical recipe that gives 49/55 (89.09%). v4 uses parser=gemma4 + enable_thinking=true + thinking_token_budget=12288: without the budget, v4 enters rumination loops on hard problems (lcb/leetcode/3566 → 53k-char output, 63 duplicate python fences, finish_reason=length). With the budget, the same problem solves in ~1.2k chars. The asymmetric recipe lives in lcb_medium_55_v4.yaml.

Scores

Bench (n)	128e NVFP4A16	98e-v4 NVFP4A16	Δ
ARC-Challenge full (1172)	96.25%	95.99%	−0.26pp
GPQA Diamond (198)	73.23%	69.19%	−4.04pp
GSM8K-100 (100)	95.00%	93.00%	−2.00pp
MATH-500-100 (100)	92.00%	92.00%	0.00pp
AIME 2024 (30)	73.33%	66.67%	−6.67pp (= 2 problems)
IFEval-100 (100, prompt-strict)	94.00%	87.00%	−7.00pp
HumanEval (164)	99.39%	96.95%	−2.44pp
HumanEval+ (164)	90.24%	91.46%	+1.22pp ← v4 wins
LCB-medium (55)	89.09% (49/55, no-parser)	78.18% (43/55, budgeted recipe)	−10.91pp

Per-bench output token distribution (approx, raw-chars ÷ 4)

The completion-token envelope for both variants is essentially identical on every bench except LCB-medium (see the footnote below the table): elsewhere v4 doesn't truncate, ramble, or shift its reasoning length. On those benches, where v4 differs from 128e it gets wrong answers in the same length of output.

Bench	Variant	p10	p50	p90	max	mean	total tokens
ARC-Challenge	128e	207	315	429	1674	321	376,276
ARC-Challenge	v4	206	310	427	16351*	332	390,113
GPQA Diamond	128e	492	625	799	2814	681	134,972
GPQA Diamond	v4	501	617	760	2662	679	134,546
GSM8K-100	128e	33	64	186	266	82	8,228
GSM8K-100	v4	33	61	147	381	76	7,608
MATH-500-100	128e	142	289	434	1882	309	30,911
MATH-500-100	v4	145	277	485	2632	314	31,466
AIME 2024	128e	369	490	2160	2943	695	20,856
AIME 2024	v4	375	509	2631	3387	848	25,442
IFEval-100	128e	31	218	919	3265	362	36,230
IFEval-100	v4	33	211	942	4599	428	42,877
HumanEval	128e	68	166	314	624	185	30,359
HumanEval	v4	68	156	298	2846	193	31,702
HumanEval+	128e	63	155	314	903	176	28,938
HumanEval+	v4	68	146	296	1909	182	29,861
LCB-medium	128e	336	2094	6307	16384	3397	186,845
LCB-medium	v4	5477†	12759	16384	16384	13178	724,815

† v4's LCB-medium completions are 6.1× longer than 128e's at the median (12,759 vs 2,094 tokens) and 3.9× longer at the mean (13,178 vs 3,397). p50 lands within 471 tokens of the thinking budget cap (12,288), and 12/55 (21.8%) responses hit the 16,384-token max_gen_toks ceiling (finish_reason=length). v4's −10.91pp deficit on LCB-medium is driven by these cases of thinking too long without reaching a final answer, exactly the symptom the asymmetric parser=gemma4 + thinking_token_budget=12288 recipe was deployed to bound. The same model at an 8k thinking budget (Q6_K supplementary table below) scores an identical 43/55, so a larger cap does not help: the 12 length-capped completions are on wrong reasoning trajectories that don't converge.

* One ARC-Challenge problem produced a 65,406-char degenerate reasoning loop (1 of 1,172, scored correctly anyway since the answer letter was still emitted). All other 1,171 ARC responses are bounded under 6k chars.
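The character-based approximation behind this table can be reproduced roughly as follows. This is a sketch only: the nearest-rank percentile rule and the function names are assumptions, since the percentile method used for the table is not stated.

```python
def approx_tokens(text, chars_per_token=4):
    """Approximate token count as raw characters divided by 4, rounded down."""
    return len(text) // chars_per_token

def nearest_rank(sorted_values, pct):
    """Nearest-rank percentile for an integer pct in 1..100."""
    n = len(sorted_values)
    rank = max(1, (pct * n + 99) // 100)  # integer ceil(pct * n / 100)
    return sorted_values[rank - 1]

def summarize(responses):
    """Per-bench length summary from a list of response strings."""
    toks = sorted(approx_tokens(r) for r in responses)
    total = sum(toks)
    return {
        "p10": nearest_rank(toks, 10),
        "p50": nearest_rank(toks, 50),
        "p90": nearest_rank(toks, 90),
        "max": toks[-1],
        "mean": total / len(toks),
        "total": total,
    }

# Synthetic responses of 4, 8, ..., 40 characters (1 to 10 approximate tokens).
demo_stats = summarize(["x" * (4 * i) for i in range(1, 11)])
print(demo_stats)
```

With this rule the 65,406-char ARC outlier maps to 16,351 approximate tokens, matching the max column for v4.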

Anomalies inspected

A per-bench sanity sweep of the v4 samples_*.jsonl turned up:

Bench	Issue	Count	Impact
GSM8K-100	silent-empty output (model emits zero tokens)	5/100	−5pp (would otherwise be ~98%)
HumanEval	silent-empty output on `get_max_triples` (doc 147)	1/164	−0.6pp
HumanEval+	silent-empty output on `HumanEval/107` (different task than HE-164's HumanEval/147 — stochastic)	1/164	−0.6pp
IFEval-100	constraint-collision rumination on `doc 26` ("French, no 'nourriture'") — locked in `"l'alimentation de l'alimentation..."` loop	1/100	−1pp
IFEval-100	format-mismatch shorts (real model errors, not eval bugs)	3/100	counted as wrong correctly

The silent-empty pattern (greedy decoding emits <eos> as first token) is a real v4 model-behavior failure that doesn't exist on 128e under the identical recipe. On GSM8K the 5 empties exceed the entire 2-problem gap to 128e; on HumanEval the single empty covers one of the four problems by which v4 trails. A min_tokens=1 sampler probe is queued.
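A sweep for silent-empty outputs can be sketched as below. It assumes each line of a samples_*.jsonl file is a JSON record with a doc_id and a nested resps list, which is the usual shape of lm-evaluation-harness --log_samples output; find_empty_responses is a hypothetical helper, not part of omk_eval.

```python
import json
import tempfile
from pathlib import Path

def first_response(record):
    """Return the first generated string of a record, or '' if none exists."""
    resps = record.get("resps") or []
    first = resps[0] if resps else ""
    if isinstance(first, list):
        first = first[0] if first else ""
    return first if isinstance(first, str) else str(first)

def find_empty_responses(path):
    """List doc_ids whose first response is empty or whitespace-only."""
    empty = []
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            if not line.strip():
                continue
            record = json.loads(line)
            if not first_response(record).strip():
                empty.append(record.get("doc_id"))
    return empty

# Self-contained demo on a synthetic file.
with tempfile.TemporaryDirectory() as tmp:
    demo = Path(tmp) / "samples_demo.jsonl"
    rows = [
        {"doc_id": 0, "resps": [["The answer is 42."]]},
        {"doc_id": 1, "resps": [[""]]},
        {"doc_id": 2, "resps": [["  \n"]]},
    ]
    demo.write_text("\n".join(json.dumps(r) for r in rows), encoding="utf-8")
    demo_empty = find_empty_responses(demo)

print(demo_empty)
```

Whitespace-only responses are counted as empty here; tighten the check to an exact empty string if only zero-token outputs should be flagged.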

GPQA-Diamond domain breakdown: the −4.04pp deficit is almost entirely Chemistry (7 of the 8 net problems)

High-level domain	n	128e	v4	Δpp
Physics	86	88.4%	88.4%	0.0
Chemistry	93	62.4%	54.8%	−7.5
Biology	19	57.9%	52.6%	−5.3 (1 problem, noise)

Subdomain detail

Subdomain	n	128e	v4	Δpp
High-energy particle physics	14	100.0%	100.0%	0.0
Quantum Mechanics	25	92.0%	92.0%	0.0
Physics (general)	19	84.2%	84.2%	0.0
Astrophysics	13	76.9%	84.6%	+7.7
Electromagnetism and Photonics	6	83.3%	83.3%	0.0
Molecular Biology	15	60.0%	60.0%	0.0
Relativistic Mechanics	7	85.7%	71.4%	−14.3 (1 problem, noise)
Organic Chemistry	72	58.3%	51.4%	−6.9
Chemistry (general)	20	75.0%	65.0%	−10.0

Per-problem agreement matrix (GPQA-Diamond)

	both correct	both wrong	only 128e right	only v4 right
count	122 (61.6%)	38 (19.2%)	23 (v4 LOST)	15 (v4 GAINED)

Of the 23 problems v4 lost: 18 Chemistry, 4 Physics, 1 Biology. Of the 15 v4 rescued: 11 Chemistry, 4 Physics. Physics losses and gains balance exactly (4=4); Chemistry takes 18 losses against 11 gains for the net −7.
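One way to gauge whether a 23 vs 15 split of discordant problems is distinguishable from chance is an exact McNemar (sign) test on the paired outcomes. This check is not part of the original evaluation; it is a sketch that uses only the counts in the agreement matrix above and assumes the problems are independent.

```python
from math import comb

def mcnemar_exact(only_a, only_b):
    """Two-sided exact McNemar p-value from the two discordant counts.

    Under the null hypothesis each discordant problem is equally likely
    to favour either model, so the smaller count is Binomial(n, 0.5).
    """
    n = only_a + only_b
    k = min(only_a, only_b)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)

# 23 problems only 128e solved, 15 problems only v4 solved.
p_value = mcnemar_exact(23, 15)
print(f"two-sided p-value: {p_value:.3f}")
```

A large p-value would mean the overall GPQA gap alone is weak evidence; the per-domain concentration in Chemistry is a separate observation that this test does not address.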

Interpretation

The v4 CD-map multi-class drop scored experts via max-over-normalized-classes across math / logic / code / science / creative. A plausible explanation for the chemistry gap: within "science", chemistry and physics knowledge live in distinct expert sets, but the class score lumps them together. Physics experts fire broadly across the science class, while narrow organic-chemistry specialists score lower in the class aggregate even when their peak contribution on chemistry prompts is large. The 30-expert-per-layer prune at protect_top=16 then drops a non-negligible fraction of those chemistry specialists.

Practical implications:

For physics / quantum / math reasoning, v4 ≈ 128e (zero net degradation on the 86 physics problems; astro +7.7pp is a single problem and within noise).
For organic / general chemistry, v4 is 7-10pp behind 128e on the 92-problem chemistry subset.
A future variant with a finer-grained CD-map (splitting "science" into chemistry / physics / biology classes before the max aggregation) could likely recover most of those 18 chemistry losses without disturbing the physics tie.

Filed as a follow-up experiment in omnimergekit (task T14.3b: subdomain-resolved CD-map ablation).

Supplementary llama.cpp Q6_K snapshot (2026-05-11) — tainted, do not cite

These numbers are appended for archival reference only and DO NOT supersede the previously published evaluations above. They were collected with a llama.cpp build whose Gemma 4 support is incomplete; multiple known issues (reasoning-token leaks, <unused> flood, fence drift, cache-reuse gaps) are still active in this serving path. The proper way to read this section is "what numbers does the llama.cpp pipeline produce today", not "what is the true capability of the model".

llama.cpp tag/build: b9095-2-g0b04728 (build 590), CUDA backend on RTX 3090.
Serving profile: --reasoning-format deepseek --reasoning-budget 8192, chat-completions endpoint, q8_0 KV cache, --parallel 2.
Sampler: greedy for HE/MBPP/LCB (--temp 0 --top-p 1 --top-k 0 --seed 42), gemma4 preset for GPQA (T=1.0 / top_p=0.95 / top_k=64).
Scoring: lm-evaluation-harness --apply_chat_template, --use_cache, --log_samples, no fence-strip rescore applied here (chat-completions cleanly returns resps).
A vLLM 4-bit cross-validation of these same benchmarks is scheduled; if vLLM agrees, the deltas below are model-level; if not, the deltas are pipeline-level and the canonical (published) numbers stand.

Scores

Bench (n)	128e Q6_K	98e-v3 Q6_K	98e-v4 (cd-max) Q6_K
HumanEval-chat @3072 (164)	97.56%	73.78%	96.34%
MBPP-chat (500)	79.20% ±1.82	— (no clean chat run)	76.00% ±1.91
LCB-medium @8k (55)	87.27% (48/55)	—	78.18% (43/55)
LCB-medium @16k (55)	in flight	—	78.18% (43/55)
GPQA-Diamond (198)	65.66% ±3.38 (v4-proto)	75.25% ±3.07 (legacy proto)	77.27% ±2.99

The 128e GPQA at 65.66% under the v4 protocol is well below the canonical 75.25% Q6_K and is the clearest evidence here that the llama.cpp serving path is the failure mode, not the model. The 75.25% legacy number remains the value to cite for 128e. The 98e-v3 row reuses the published GPQA result (run under the legacy llama.cpp protocol) for completeness; it is not re-measured here.
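The ± values in the GPQA row appear consistent with the sample standard error of a binary accuracy using an n−1 denominator. That the harness used exactly this formula is an assumption inferred from the numbers, and the correct counts below (130, 149, 153 of 198) are back-computed from the reported percentages.

```python
import math

def sample_stderr_pp(correct, n):
    """Standard error of an accuracy in percentage points (n-1 denominator)."""
    p = correct / n
    return 100 * math.sqrt(p * (1 - p) / (n - 1))

# Counts inferred from 65.66%, 75.25%, and 77.27% of 198 problems.
for label, correct in (("128e v4-proto", 130), ("98e-v3 legacy", 149), ("98e-v4", 153)):
    acc = 100 * correct / 198
    print(f"{label}: {acc:.2f}% ± {sample_stderr_pp(correct, 198):.2f}")
```

With roughly 3pp of standard error per run at n=198, differences of a few points between single GPQA runs should be read cautiously.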

Output tokens per response (samples-derived)

LCB tokens are real (completion_tokens from the patched runner). HE / MBPP / GPQA are approximated as len(resps) / 4 from samples_*.jsonl; this is within ±15% of true Gemma 4 tokens.

Bench	Variant	median	mean	p95	max	total (n × mean)
HE-chat	128e	313	334	681	917	54.9k
HE-chat	98e-v3	490	512	953	1013	84.0k
HE-chat	98e-v4	303	340	755	895	55.9k
MBPP-chat	128e	194	224	453	532	112k
MBPP-chat	98e-v3	129	356	1328	1892	178k (raw-protocol outliers)
MBPP-chat	98e-v4	165	206	455	530	103k
LCB-med @8k	128e	1174	2167	7949	8192	119k
LCB-med @8k	98e-v4	2829	3667	8192	8192	202k
LCB-med @16k	98e-v4	2829	4913	15983	16064	270k
GPQA-D	128e	749	948	2098	4576	375k
GPQA-D	98e-v3	648	655	815	1076	250k
GPQA-D	98e-v4	676	783	950	5495	305k

Raising the LCB cap from 8k to 16k for 98e-v4 did not change the score (43/55 both times). The cap was binding for 22/55 problems at 8k but every truncated answer was already on a wrong trajectory; the failures are real, not truncation.

Usage

Transformers

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model = AutoModelForCausalLM.from_pretrained(
    "ManniX-ITA/gemma-4-A4B-98e-v4-it",
    torch_dtype=torch.bfloat16,
    device_map="auto",
    attn_implementation="eager",  # Gemma 4 head_dim=512 not supported by FA2
)
tok = AutoTokenizer.from_pretrained("ManniX-ITA/gemma-4-A4B-98e-v4-it")

msgs = [{"role": "user", "content": "Explain the Heisenberg uncertainty principle."}]
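# return_dict=True yields input_ids plus attention_mask regardless of library defaults
inputs = tok.apply_chat_template(
    msgs, add_generation_prompt=True, return_tensors="pt", return_dict=True
).to(model.device)
out = model.generate(**inputs, max_new_tokens=512, do_sample=True, temperature=1.0, top_p=0.95, top_k=64)
prompt_len = inputs["input_ids"].shape[1]  # strip the prompt before decoding
print(tok.decode(out[0][prompt_len:], skip_special_tokens=True))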

llama.cpp (recommended)

llama-server -m gemma-4-A4B-98e-v4-it-Q6_K.gguf \
    --port 8099 -c 32768 -ngl 99 --no-warmup \
    --jinja --reasoning-format deepseek --reasoning-budget 16384 \
    --temp 1.0 --top-p 0.95 --top-k 64 --dry-multiplier 0.5

For greedy code-generation tasks, replace the sampler with --temp 0 --top-p 1 --top-k 0 --reasoning off.
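Once the server is running, it can be queried through its OpenAI-compatible chat endpoint. The sketch below is a minimal client, assuming the server from the command above is listening on 127.0.0.1:8099; build_chat_request and chat are hypothetical helper names, and the per-request sampler fields simply mirror the two sampler settings described above.

```python
import json
import urllib.request

def build_chat_request(prompt, base_url="http://127.0.0.1:8099", greedy=False):
    """Build a POST request for the /v1/chat/completions endpoint."""
    payload = {
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 512,
    }
    if greedy:
        # Greedy settings for code-generation tasks.
        payload.update(temperature=0.0, top_p=1.0, top_k=0)
    else:
        # Default sampling preset from the server command above.
        payload.update(temperature=1.0, top_p=0.95, top_k=64)
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def chat(prompt, **kwargs):
    """Send the request and return the assistant message text."""
    request = build_chat_request(prompt, **kwargs)
    with urllib.request.urlopen(request, timeout=600) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]

# Example (requires a running server):
# print(chat("Write a function that reverses a string.", greedy=True))
```

Per-request sampler fields override the server-side defaults, so the same server instance can serve both the sampled and the greedy profile.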

GGUF quants: ManniX-ITA/gemma-4-A4B-98e-v4-it-GGUF

Related Models

Model	Description
gemma-4-A4B-98e-v3-it	98e — pooled TF map (v3 recipe), 75.25% GPQA
gemma-4-A4B-98e-v3-it-GGUF	GGUF quants for v3
gemma-4-A4B-109e-v3-it	109e — 19/layer drop with the same pooled v3 map
gemma-4-A4B-98e-v4-it-GGUF	GGUF quants for this model

Recipe + Code

OmniMergeKit is the canonical home for these experiments. The relevant artifacts are:

eval/EVAL_PROTOCOL.md — locked methodology for every eval (HE / MBPP / MPE / LCB / GPQA), with per-task sampler / context / cap settings, mandatory cache + log-samples + post-run sanity checklist.
eval/lcb/lcb_llama_server.py — patched LCB runner (resumable, records prompt_tokens / completion_tokens / finish_reason; default max_tokens=8192 after truncation analysis).
eval/scripts/eval_gpqa_v3.sh — GPQA Diamond canonical wrapper with multi-arch sampler family presets.
eval/scripts/multipl_e_generate.py — MultiPL-E generator with retry-on-5xx (server errors are never silently dropped).