Gemma 4 A4B 98-Expert v4 (20.8B)
Expert-pruned version of google/gemma-4-26B-A4B-it, reduced from 128 to 98 experts per MoE layer using a multi-class ContribDynamic (CD) drop map with max-over-classes aggregation.
| | Original (128e) | 98e v3 | 98e v4 (this model) |
|---|---|---|---|
| Total params | 26B | 20.8B | ~20.8B |
| Experts per layer | 128 | 98 | 98 |
| Experts dropped | — | 30/layer | 30/layer |
| MoE capacity removed | — | 23.4% | 23.4% |
| Top-k routing | 8 | 8 | 8 |
| Drop map source | — | TF pooled, p16 | multi-class CD-map (max), p16 |
Eval results for v4 are FINAL (updated 2026-05-14 09:10 CEST). A canonical vLLM-NVFP4A16 9-bench suite has been run locally with apples-to-apples 128e baselines on the same hardware / pipeline / quant family. All 9 v4 benches complete; 128e is fully complete. Full scoreboard, per-bench token distributions, anomaly inspection, and the per-domain analysis of the GPQA-Diamond delta are in § Canonical vLLM NVFP4A16 evaluation below. The previous pod-rented v4 numbers (HE 92.68%, MBPP 76.40%, LCB 61.82%) were discarded due to settings drift and token-budget truncation; only the canonical local runs are cited here.
GGUF quantizations available at ManniX-ITA/gemma-4-A4B-98e-v4-it-GGUF — full Bartowski tier sweep + ContribDynamic (CD) per-layer quants. Built with the same imatrix calibration (bartowski v5) as v3.
What changed vs v3
v3 ranked experts with a single pooled score:
score[layer][expert] = Σ_q (wnorm_q × α + tc_q), with α = 2.0
aggregating across all teacher-force questions in scripts/per_question_teacher_force.json. The bottom 30 per layer were dropped, with protect_top=16 shielding the strongest experts. This produced 75.25% GPQA Diamond (Q6_K) — full parity with the 128e reference.
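As a rough illustration of the pooled ranking, the sketch below sums `wnorm × α + tc` over questions for one layer and picks the lowest-scoring experts. The input layout (a list of per-question dicts), the helper names, and the toy numbers are assumptions made for this example; the `protect_top` shielding is not modeled because its exact rule is not spelled out here.

```python
ALPHA = 2.0  # alpha from the v3 formula


def pooled_scores(per_question, alpha=ALPHA):
    """per_question: list of {expert_id: (wnorm, tc)} dicts for ONE layer.

    Returns {expert_id: sum over questions of (wnorm * alpha + tc)}.
    """
    scores = {}
    for question in per_question:
        for expert, (wnorm, tc) in question.items():
            scores[expert] = scores.get(expert, 0.0) + wnorm * alpha + tc
    return scores


def bottom_k(scores, k):
    """Expert ids with the k lowest scores (ties broken by expert id)."""
    return sorted(sorted(scores), key=lambda e: scores[e])[:k]


# Toy layer: 4 experts, 2 questions (hypothetical numbers).
layer = [
    {0: (0.5, 1.0), 1: (0.1, 0.0), 2: (0.3, 2.0), 3: (0.0, 0.5)},
    {0: (0.4, 0.0), 1: (0.2, 1.0), 2: (0.3, 1.0), 3: (0.0, 0.0)},
]
scores = pooled_scores(layer)
dropped = bottom_k(scores, k=2)
print(dropped)
```

In the real recipe the same selection would run with 128 experts and k=30 per layer.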
v4 keeps v3's drop budget (30 experts per layer, protect_top=16) and the same surgery, but ranks experts with a different signal. Instead of pooling across questions, the 30 experts dropped per layer are chosen via max-over-classes of normalized per-class CD scores from scripts/expert_neuron_v4.json:
# 5 task classes: math / logic / code / science / creative
for class in classes:
    s[class][layer][expert] = wnorm × α + tc    # per-class score
    normalize by per-(class, layer) mean        # so class scales are comparable
score[layer][expert] = max over classes(s[class][layer][expert])
The reasoning: an expert that specializes in one class (e.g. fires only on math) gets a low score under v3's pooled aggregation but a high score under v4's max-over-classes — because at least one class strongly relies on it. The hypothesis: GPQA spans physics / biology / chemistry, so multi-domain coverage from preserved specialists should reward the v4 ranking over v3's averaging.
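The effect described above can be reproduced on a toy layer. The sketch below is a minimal, assumption-laden version: three made-up experts, raw per-class scores chosen by hand, and a plain sum across classes standing in for v3's pooling across questions. Function names are hypothetical and not taken from the recipe scripts.

```python
CLASSES = ["math", "logic", "code", "science", "creative"]


def normalize_by_mean(class_scores):
    """Divide one class's per-expert scores (single layer) by their mean."""
    mean = sum(class_scores.values()) / len(class_scores)
    return {e: s / mean for e, s in class_scores.items()}


def max_over_classes(per_class):
    """per_class: {class_name: {expert_id: raw_score}} for one layer."""
    normed = {c: normalize_by_mean(s) for c, s in per_class.items()}
    experts = list(next(iter(normed.values())))
    return {e: max(normed[c][e] for c in normed) for e in experts}


# Expert 0: generalist. Expert 1: fires only on math. Expert 2: weak everywhere.
per_class = {c: {0: 1.0, 1: 0.0, 2: 0.4} for c in CLASSES}
per_class["math"] = {0: 1.0, 1: 1.5, 2: 0.4}

v4_score = max_over_classes(per_class)
pooled = {e: sum(per_class[c][e] for c in CLASSES) for e in (0, 1, 2)}

first_drop_pooled = min(pooled, key=pooled.get)
first_drop_v4 = min(v4_score, key=v4_score.get)
print(first_drop_pooled, first_drop_v4)
```

With these numbers the pooled score would drop the math specialist first, while the max-over-classes score drops the uniformly weak expert instead.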
Recipe scripts:
- scripts/generate_drop_map_multiclass.py: produces cd_multiclass_98e_max_drop_map.json (this model's drop map). CLI exposes --strategy max|mean|geomean|sum for ablations.
- scripts/expert_drop.py: the deterministic drop applier (unchanged from v3).
- scripts/build_98e_cd.sh: end-to-end build pipeline (drop → F16 → Q6_K → GGUF upload).
All scripts are in omnimergekit (the canonical home for OmniMergeKit experiments).
Why max-over-classes?
A geomean ablation (--strategy geomean) was also tested. Result: catastrophic collapse (HE 24.39% vs v4 max's expected 92-95%). Geomean penalizes any expert with a low score in even one class, which collides with how MoE specialization actually works — each layer routes by class, not all at once. Max-over-classes preserves the strongest specialist; geomean preserves only generalists, and there aren't many.
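A two-expert toy comparison shows why the two aggregators disagree. The per-class scores below are invented for illustration, and `statistics.geometric_mean` is used as a stand-in for whatever the geomean strategy computes internally.

```python
from statistics import geometric_mean

# Normalized per-class scores (math, logic, code, science, creative); toy values.
specialist = [3.0, 0.1, 0.1, 0.1, 0.1]   # strong in one class only
generalist = [0.9, 0.9, 0.9, 0.9, 0.9]   # mediocre everywhere

aggregators = {"max": max, "geomean": geometric_mean}

# For each aggregator, record which expert ranks higher (and is kept first).
preferred = {
    name: "specialist" if agg(specialist) > agg(generalist) else "generalist"
    for name, agg in aggregators.items()
}
print(preferred)
```

One low class score is enough to pull the geometric mean of the specialist below the generalist, which matches the explanation above for why geomean keeps only generalists.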
This finding is now a project-level rule:
Optimizer / aggregator off-manifold pathology: aggregators that find a deeper-CE or lower-aggregate-score don't always transfer to autoregressive generation. Always gate model surgery on a generation-mode canary; prefer max / percentile over mean / geomean for importance.
(Source: memory/feedback_optimizer_off_manifold.md.)
Pruning Method (mechanical detail)
Identical to v3 with one substitution — a different drop map. The downstream surgery is the same:
- Read drop map JSON: {layer_str: [list-of-30-dropped-expert-ids]}.
- For each layer, slice the expert tensors (gate_proj, up_proj, down_proj), keeping only the 98 retained experts.
- Resize the MoE router proj.weight from [128, hidden] to [98, hidden], keeping only rows for retained experts. Top-8 routing naturally adapts.
- Update config.json: num_experts = 98.
- Convert HF → F16 GGUF → quantize → bartowski-style imatrix calibration on calibration_datav5.txt.
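The slicing steps above can be sketched without any tensor library. This is not the code in scripts/expert_drop.py; the layer layout (a dict of per-expert row lists) and the function names are hypothetical, chosen only to show that the expert tensors and the router rows must be filtered with the same retained-id list.

```python
def retained_ids(num_experts, dropped):
    """Ids of experts that survive the drop, in ascending order."""
    drop = set(dropped)
    return [e for e in range(num_experts) if e not in drop]


def slice_experts(rows, keep):
    """Keep only the rows (one per expert) listed in `keep`."""
    return [rows[e] for e in keep]


def prune_layer(layer, dropped):
    """layer: {'gate_proj': [...], 'up_proj': [...], 'down_proj': [...],
    'router': [...]}, each a list with one entry per expert (assumed layout)."""
    keep = retained_ids(len(layer["router"]), dropped)
    return {name: slice_experts(rows, keep) for name, rows in layer.items()}, keep


# Toy layer with 6 experts; every row is tagged with its original expert id.
toy = {
    name: [[float(e)] * 2 for e in range(6)]
    for name in ("gate_proj", "up_proj", "down_proj", "router")
}
pruned, keep = prune_layer(toy, dropped=[1, 4])
config = {"num_experts": len(pruned["router"])}
print(keep, config)
```

After pruning, router row i belongs to the i-th retained expert, so the experts are implicitly renumbered and top-k routing needs no further change.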
Canonical vLLM NVFP4A16 evaluation (2026-05-13/14) — 9/9 v4 benches complete
- Status (2026-05-14 09:10 CEST): All 9 v4 benches complete; 128e baseline fully complete. Same hardware (RTX 3090), same vllm==0.1.dev16519, same NVFP4A16 quant family (modelopt_fp4), apples-to-apples per-template recipe via omk_eval.
- Sampling: temperature 0, top_p 1, top_k 0 (greedy). LCB / HumanEval use parallel=2. Greedy at temperature 0 still has a small ~±2pp per-problem variance from continuous-batching scheduler nondeterminism.
- Per-variant LCB recipe (asymmetric): 128e uses no reasoning parser (enable_thinking=false), the canonical recipe that gives 50/55 (90.91%). v4 uses parser=gemma4 + enable_thinking=true + thinking_token_budget=12288: without the budget v4 enters rumination loops on hard problems (lcb/leetcode/3566 → 53k-char output, 63 duplicate python fences, finish_reason=length). With the budget the same problem solves in ~1.2k chars. The asymmetric recipe lives in lcb_medium_55_v4.yaml.
Scores
| Bench (n) | 128e NVFP4A16 | 98e-v4 NVFP4A16 | Δ |
|---|---|---|---|
| ARC-Challenge full (1172) | 96.25% | 95.99% | −0.26pp |
| GPQA Diamond (198) | 73.23% | 69.19% | −4.04pp |
| GSM8K-100 (100) | 95.00% | 93.00% | −2.00pp |
| MATH-500-100 (100) | 92.00% | 92.00% | 0.00pp |
| AIME 2024 (30) | 73.33% | 66.67% | −6.67pp (= 2 problems) |
| IFEval-100 (100, prompt-strict) | 94.00% | 87.00% | −7.00pp |
| HumanEval (164) | 99.39% | 96.95% | −2.44pp |
| HumanEval+ (164) | 90.24% | 91.46% | +1.22pp ← v4 wins |
| LCB-medium (55) | 89.09% (49/55, no-parser) | 78.18% (43/55, budgeted recipe) | −10.91pp |
Per-bench output token distribution (approx, raw-chars ÷ 4)
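Outside LCB-medium, the completion-token envelope is close for both variants at p10 / p50 / p90: v4 doesn't systematically truncate, ramble, or shift its reasoning length, although a few long outliers raise its max and mean on some benches (ARC-Challenge, AIME 2024, IFEval-100). Where v4 differs from 128e on those benches, it gets wrong answers in about the same length of output. LCB-medium is the exception; see the first footnote under the table.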
| Bench | Variant | p10 | p50 | p90 | max | mean | total tokens |
|---|---|---|---|---|---|---|---|
| ARC-Challenge | 128e | 207 | 315 | 429 | 1674 | 321 | 376,276 |
| ARC-Challenge | v4 | 206 | 310 | 427 | 16351* | 332 | 390,113 |
| GPQA Diamond | 128e | 492 | 625 | 799 | 2814 | 681 | 134,972 |
| GPQA Diamond | v4 | 501 | 617 | 760 | 2662 | 679 | 134,546 |
| GSM8K-100 | 128e | 33 | 64 | 186 | 266 | 82 | 8,228 |
| GSM8K-100 | v4 | 33 | 61 | 147 | 381 | 76 | 7,608 |
| MATH-500-100 | 128e | 142 | 289 | 434 | 1882 | 309 | 30,911 |
| MATH-500-100 | v4 | 145 | 277 | 485 | 2632 | 314 | 31,466 |
| AIME 2024 | 128e | 369 | 490 | 2160 | 2943 | 695 | 20,856 |
| AIME 2024 | v4 | 375 | 509 | 2631 | 3387 | 848 | 25,442 |
| IFEval-100 | 128e | 31 | 218 | 919 | 3265 | 362 | 36,230 |
| IFEval-100 | v4 | 33 | 211 | 942 | 4599 | 428 | 42,877 |
| HumanEval | 128e | 68 | 166 | 314 | 624 | 185 | 30,359 |
| HumanEval | v4 | 68 | 156 | 298 | 2846 | 193 | 31,702 |
| HumanEval+ | 128e | 63 | 155 | 314 | 903 | 176 | 28,938 |
| HumanEval+ | v4 | 68 | 146 | 296 | 1909 | 182 | 29,861 |
| LCB-medium | 128e | 336 | 2094 | 6307 | 16384 | 3397 | 186,845 |
| LCB-medium | v4 | 5477* | 12759 | 16384 | 16384 | 13178 | 724,815 |
* v4's LCB-medium completions are 3.9× longer than 128e's on average (mean 13,178 vs 3,397 tokens) and about 6.1× longer at the median (12,759 vs 2,094 tokens). p50 lands within 471 tokens of the thinking budget cap (12,288), and 12/55 (21.8%) responses hit the 16,384-token max_gen_toks ceiling (finish_reason=length). v4's −10.91pp deficit on LCB-medium is driven by these "thinking too long, never reaching final answer" cases, exactly the symptom the asymmetric parser=gemma4 + thinking_token_budget=12288 recipe was deployed to bound. The same model at 8k thinking budget (Q6_K supplementary table below) scores identically 43/55, so raising the cap further does not help: the 12 length-capped completions are on wrong reasoning trajectories that don't converge.
* One ARC-Challenge problem produced a 65,406-char degenerate reasoning loop (1 of 1,172, scored correctly anyway since the answer letter was still emitted). All other 1,171 ARC responses are bounded under 6k chars.
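For readers who want to reproduce this kind of table from raw responses, here is a minimal sketch of the chars ÷ 4 approximation plus a percentile summary. Integer division and the nearest-rank percentile rule are assumptions; the exact rounding and interpolation used for the table above are not stated.

```python
def approx_tokens(text):
    """Approximate token count as raw characters divided by 4 (floored)."""
    return len(text) // 4


def percentile(values, p):
    """Nearest-rank percentile for an integer p in 1..100."""
    ordered = sorted(values)
    rank = max(1, -(-p * len(ordered) // 100))  # ceil(p * n / 100) in integers
    return ordered[rank - 1]


def summarize(responses):
    """Return p10 / p50 / p90 / max / mean / total over approximate tokens."""
    tokens = [approx_tokens(r) for r in responses]
    return {
        "p10": percentile(tokens, 10),
        "p50": percentile(tokens, 50),
        "p90": percentile(tokens, 90),
        "max": max(tokens),
        "mean": sum(tokens) / len(tokens),
        "total": sum(tokens),
    }


# Toy responses of 40, 80, 120 and 400 characters.
stats = summarize(["a" * n for n in (40, 80, 120, 400)])
print(stats)
```

Under this rule the 65,406-character ARC outlier maps to 16,351 approximate tokens, which agrees with the max column.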
Anomalies inspected
A per-bench sanity sweep of the v4 samples_*.jsonl turned up:
| Bench | Issue | Count | Impact |
|---|---|---|---|
| GSM8K-100 | silent-empty output (model emits zero tokens) | 5/100 | −5pp (would otherwise be ~98%) |
| HumanEval | silent-empty output on get_max_triples (doc 147) | 1/164 | −0.6pp |
| HumanEval+ | silent-empty output on HumanEval/107 (different task than HE-164's HumanEval/147; stochastic) | 1/164 | −0.6pp |
| IFEval-100 | constraint-collision rumination on doc 26 ("French, no 'nourriture'"), locked in "l'alimentation de l'alimentation..." loop | 1/100 | −1pp |
| IFEval-100 | format-mismatch shorts (real model errors, not eval bugs) | 3/100 | counted as wrong correctly |
The silent-empty pattern (greedy decoding emits <eos> as first token) is a real v4 model-behavior failure that doesn't exist on 128e under the identical recipe. On GSM8K the 5 empty outputs exceed the 2-problem gap to 128e; on HumanEval the single empty output is 1 of the 4 problems separating v4 (159/164) from 128e (163/164). A min_tokens=1 sampler probe is queued.
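A sanity sweep for this failure mode can be as simple as flagging blank responses. The sketch below works on in-memory `(doc_id, response_text)` pairs; how those pairs are extracted from the samples files is left out, and the helper names are hypothetical.

```python
def silent_empty_ids(samples):
    """samples: iterable of (doc_id, response_text) pairs.

    Returns the doc ids whose response is empty or whitespace-only.
    """
    return [doc_id for doc_id, text in samples if not text.strip()]


def impact_pp(n_empty, n_total):
    """Score impact in percentage points if every empty output counts as wrong."""
    return 100.0 * n_empty / n_total


# Toy sweep: two of five responses are blank.
toy = [(0, "42"), (1, ""), (2, "The answer is 7"), (3, "  \n"), (4, "x = 3")]
empties = silent_empty_ids(toy)
print(empties, impact_pp(len(empties), len(toy)))
```

Applied to the counts in the table, 5 empties out of 100 cost 5pp and 1 out of 164 costs about 0.6pp.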
GPQA-Diamond domain breakdown: the −4.04pp deficit is almost entirely Chemistry
| High-level domain | n | 128e | v4 | Δpp |
|---|---|---|---|---|
| Physics | 86 | 88.4% | 88.4% | 0.0 |
| Chemistry | 93 | 62.4% | 54.8% | −7.5 |
| Biology | 19 | 57.9% | 52.6% | −5.3 (1 problem, noise) |
Subdomain detail
| Subdomain | n | 128e | v4 | Δpp |
|---|---|---|---|---|
| High-energy particle physics | 14 | 100.0% | 100.0% | 0.0 |
| Quantum Mechanics | 25 | 92.0% | 92.0% | 0.0 |
| Physics (general) | 19 | 84.2% | 84.2% | 0.0 |
| Astrophysics | 13 | 76.9% | 84.6% | +7.7 |
| Electromagnetism and Photonics | 6 | 83.3% | 83.3% | 0.0 |
| Molecular Biology | 15 | 60.0% | 60.0% | 0.0 |
| Relativistic Mechanics | 7 | 85.7% | 71.4% | −14.3 (1 problem, noise) |
| Organic Chemistry | 72 | 58.3% | 51.4% | −6.9 |
| Chemistry (general) | 20 | 75.0% | 65.0% | −10.0 |
Per-problem agreement matrix (GPQA-Diamond)
| | both correct | both wrong | only 128e right | only v4 right |
|---|---|---|---|---|
| count | 122 (61.6%) | 38 (19.2%) | 23 (v4 LOST) | 15 (v4 GAINED) |
Of the 23 problems v4 lost: 18 Chemistry, 4 Physics, 1 Biology. Of the 15 v4 rescued: 11 Chemistry, 4 Physics. Physics losses and gains balance exactly (4=4); Chemistry takes 18 losses against 11 gains for the net −7.
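The agreement matrix and the headline accuracies are tied together by simple counting, which the sketch below checks. The per-problem correctness vectors are synthetic, rebuilt only from the four cell counts in the table; the function name is hypothetical.

```python
from collections import Counter

LABELS = {
    (True, True): "both_correct",
    (False, False): "both_wrong",
    (True, False): "only_128e",
    (False, True): "only_v4",
}


def agreement(base_correct, pruned_correct):
    """Count per-problem outcomes for two aligned lists of booleans."""
    return Counter(LABELS[pair] for pair in zip(base_correct, pruned_correct))


# Synthetic vectors matching the table: 122 / 38 / 23 / 15.
base = [True] * 122 + [False] * 38 + [True] * 23 + [False] * 15
v4 = [True] * 122 + [False] * 38 + [False] * 23 + [True] * 15

cells = agreement(base, v4)
acc_base = round(100 * sum(base) / len(base), 2)
acc_v4 = round(100 * sum(v4) / len(v4), 2)
print(cells, acc_base, acc_v4)
```

The net loss of 23 − 15 = 8 problems out of 198 is the −4.04pp gap, and the Chemistry share of it is 18 − 11 = 7.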
Interpretation
The v4 CD-map multi-class drop scored experts via max-over-normalized-classes across math / logic / code / science / creative. Within "science", chemistry and physics knowledge live in distinct expert sets, but the score lumps them together. Physics experts fire broadly across the science class; narrow organic-chemistry specialists score lower in the aggregate even when their peak class-specific contribution is large. The 30-expert-per-layer prune at protect_top=16 then drops a non-negligible fraction of those chemistry specialists.
Practical implications:
- For physics / quantum / math reasoning, v4 ≈ 128e (zero net degradation on the 86 Physics-domain problems; astro +7.7pp is within noise but indicative).
- For organic / general chemistry, v4 is 7-10pp behind 128e on the 92-problem chemistry subset.
- A future variant with a finer-grained CD-map (splitting "science" into chemistry / physics / biology classes before the max aggregation) could likely recover most of those 18 chemistry losses without disturbing the physics tie.
Filed as a follow-up experiment in omnimergekit (task T14.3b: subdomain-resolved CD-map ablation).
Supplementary llama.cpp Q6_K snapshot (2026-05-11) — tainted, do not cite
These numbers are appended for archival reference only and DO NOT supersede the previously published evaluations above. They were collected with a llama.cpp build whose Gemma 4 support is incomplete; multiple known issues (reasoning-token leaks, <unused> flood, fence drift, cache-reuse gaps) are still active in this serving path. The proper way to read this section is "what numbers does the llama.cpp pipeline produce today", not "what is the true capability of the model".
- llama.cpp tag/build: b9095-2-g0b04728 (build 590), CUDA backend on RTX 3090.
- Serving profile: --reasoning-format deepseek --reasoning-budget 8192, chat-completions endpoint, q8_0 KV cache, --parallel 2.
- Sampler: greedy for HE/MBPP/LCB (--temp 0 --top-p 1 --top-k 0 --seed 42), gemma4 preset for GPQA (T=1.0 / top_p=0.95 / top_k=64).
- Scoring: lm-evaluation-harness --apply_chat_template, --use_cache, --log_samples, no fence-strip rescore applied here (chat-completions cleanly returns resps).
- A vLLM 4-bit cross-validation of these same benchmarks is scheduled; if vLLM agrees, the deltas below are model-level; if not, the deltas are pipeline-level and the canonical (published) numbers stand.
Scores
| Bench (n) | 128e Q6_K | 98e-v3 Q6_K | 98e-v4 (cd-max) Q6_K |
|---|---|---|---|
| HumanEval-chat @3072 (164) | 97.56% | 73.78% | 96.34% |
| MBPP-chat (500) | 79.20% ±1.82 | — (no clean chat run) | 76.00% ±1.91 |
| LCB-medium @8k (55) | 87.27% (48/55) | — | 78.18% (43/55) |
| LCB-medium @16k (55) | in flight | — | 78.18% (43/55) |
| GPQA-Diamond (198) | 65.66% ±3.38 (v4-proto) | 75.25% ±3.07 (legacy proto) | 77.27% ±2.99 |
The 128e GPQA at 65.66% under the v4 protocol is well below the canonical 75.25% Q6_K and is the clearest evidence here that the llama.cpp serving path is the failure mode, not the model. The 75.25% legacy number remains the value to cite for 128e. The 98e-v3 row reuses the published GPQA result (run under the legacy llama.cpp protocol) for completeness; it is not re-measured here.
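The ± values in the table appear consistent with the sample standard error of a mean of 0/1 outcomes (n − 1 in the denominator). That reading is an assumption about how they were produced, and the correct-answer counts below are back-computed from the published percentages rather than taken from logs.

```python
import math


def stderr_pp(correct, n):
    """Sample standard error of an accuracy, in percentage points."""
    p = correct / n
    return 100.0 * math.sqrt(p * (1.0 - p) / (n - 1))


# (label, correct, n); counts inferred from the table's percentages.
rows = [
    ("MBPP-chat 128e", 396, 500),   # 79.20%
    ("MBPP-chat v4", 380, 500),     # 76.00%
    ("GPQA-D 128e", 130, 198),      # 65.66%
    ("GPQA-D v3", 149, 198),        # 75.25%
    ("GPQA-D v4", 153, 198),        # 77.27%
]
errors = {label: round(stderr_pp(c, n), 2) for label, c, n in rows}
print(errors)
```

With roughly ±3pp on a 198-question benchmark, differences of a few points between GPQA runs sit inside one standard error of each other.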
Output tokens per response (samples-derived)
LCB tokens are real (completion_tokens from the patched runner). HE / MBPP / GPQA are approximated as len(resps) / 4 from samples_*.jsonl; this is within ±15% of true Gemma 4 tokens.
| Bench | Variant | median | mean | p95 | max | total (n × mean) |
|---|---|---|---|---|---|---|
| HE-chat | 128e | 313 | 334 | 681 | 917 | 54.9k |
| HE-chat | 98e-v3 | 490 | 512 | 953 | 1013 | 84.0k |
| HE-chat | 98e-v4 | 303 | 340 | 755 | 895 | 55.9k |
| MBPP-chat | 128e | 194 | 224 | 453 | 532 | 112k |
| MBPP-chat | 98e-v3 | 129 | 356 | 1328 | 1892 | 178k (raw-protocol outliers) |
| MBPP-chat | 98e-v4 | 165 | 206 | 455 | 530 | 103k |
| LCB-med @8k | 128e | 1174 | 2167 | 7949 | 8192 | 119k |
| LCB-med @8k | 98e-v4 | 2829 | 3667 | 8192 | 8192 | 202k |
| LCB-med @16k | 98e-v4 | 2829 | 4913 | 15983 | 16064 | 270k |
| GPQA-D | 128e | 749 | 948 | 2098 | 4576 | 375k |
| GPQA-D | 98e-v3 | 648 | 655 | 815 | 1076 | 250k |
| GPQA-D | 98e-v4 | 676 | 783 | 950 | 5495 | 305k |
Raising the LCB cap from 8k to 16k for 98e-v4 did not change the score (43/55 both times). The cap was binding for 22/55 problems at 8k but every truncated answer was already on a wrong trajectory; the failures are real, not truncation.
Usage
Transformers
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
model = AutoModelForCausalLM.from_pretrained(
"ManniX-ITA/gemma-4-A4B-98e-v4-it",
torch_dtype=torch.bfloat16,
device_map="auto",
attn_implementation="eager", # Gemma 4 head_dim=512 not supported by FA2
)
tok = AutoTokenizer.from_pretrained("ManniX-ITA/gemma-4-A4B-98e-v4-it")
msgs = [{"role": "user", "content": "Explain the Heisenberg uncertainty principle."}]
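# return_dict=True returns input_ids plus attention_mask, independent of the library's default return type
inputs = tok.apply_chat_template(msgs, return_tensors="pt", add_generation_prompt=True, return_dict=True).to(model.device)
out = model.generate(**inputs, max_new_tokens=512, do_sample=True, temperature=1.0, top_p=0.95, top_k=64)
# Decode only the newly generated tokens, skipping the prompt
print(tok.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))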
llama.cpp (recommended)
llama-server -m gemma-4-A4B-98e-v4-it-Q6_K.gguf \
--port 8099 -c 32768 -ngl 99 --no-warmup \
--jinja --reasoning-format deepseek --reasoning-budget 16384 \
--temp 1.0 --top-p 0.95 --top-k 64 --dry-multiplier 0.5
For greedy code-generation tasks, replace the sampler with --temp 0 --top-p 1 --top-k 0 --reasoning off.
GGUF quants: ManniX-ITA/gemma-4-A4B-98e-v4-it-GGUF
Related Models
| Model | Description |
|---|---|
| gemma-4-A4B-98e-v3-it | 98e — pooled TF map (v3 recipe), 75.25% GPQA |
| gemma-4-A4B-98e-v3-it-GGUF | GGUF quants for v3 |
| gemma-4-A4B-109e-v3-it | 109e — 19/layer drop with the same pooled v3 map |
| gemma-4-A4B-98e-v4-it-GGUF | GGUF quants for this model |
Recipe + Code
OmniMergeKit is the canonical home for these experiments. The relevant artifacts are:
- eval/EVAL_PROTOCOL.md: locked methodology for every eval (HE / MBPP / MPE / LCB / GPQA), with per-task sampler / context / cap settings, mandatory cache + log-samples + post-run sanity checklist.
- eval/lcb/lcb_llama_server.py: patched LCB runner (resumable, records prompt_tokens / completion_tokens / finish_reason; default max_tokens=8192 after truncation analysis).
- eval/scripts/eval_gpqa_v3.sh: GPQA Diamond canonical wrapper with multi-arch sampler family presets.
- eval/scripts/multipl_e_generate.py: MultiPL-E generator with retry-on-5xx (server errors are never silently dropped).
License
This model inherits the Gemma license from the base model.
Acknowledgements
- Google for the base Gemma 4 26B-A4B-it model
- The GPQA Diamond benchmark (Rein et al., 2023)
- bartowski for the calibration data v5 used in imatrix GGUF quantization
- The OmniMergeKit project for the merge / surgery toolkit