	---
	base_model: ManniX-ITA/gemma-4-A4B-98e-v4-it
	tags:
	- gemma4
	- moe
	- expert-pruning
	- code
	- surgery
	- omnimergekit
	license: gemma
	---

	# Gemma 4 A4B 98-Expert v5-coder (20.8B) — code-leaning prune

	A research checkpoint that takes [98e v4](https://huggingface.co/ManniX-ITA/gemma-4-A4B-98e-v4-it) and replaces its drop map with C6 layer-relevance-weighted v4-floor breadth=50 — a recipe that protects code/math experts more tightly per-layer than v4's multi-class CD-max. No shared FFN scaling. Same 98e shape, same router, same attention, same norms.

	## Quantized formats

	\| Format \| Repo \| Notes \|
	\|---\|---\|---\|
	\| GGUF (llama.cpp / ollama) \| [`ManniX-ITA/gemma-4-A4B-98e-v5-coder-it-GGUF`](https://huggingface.co/ManniX-ITA/gemma-4-A4B-98e-v5-coder-it-GGUF) \| Full Bartowski tier sweep (Q2_K → Q8_0, IQ2-IQ4) + 5 ContribDynamic CD-* per-layer quants. F16 baseline included. \|
	\| NVFP4A16 (vLLM) \| [`ManniX-ITA/gemma-4-A4B-98e-v5-coder-NVFP4A16`](https://huggingface.co/ManniX-ITA/gemma-4-A4B-98e-v5-coder-NVFP4A16) \| ~13 GB, native vLLM, produced via [`modelopt==0.43.0`](https://github.com/NVIDIA/TensorRT-Model-Optimizer). \|
	\| Ollama \| [`mannix/gemma4-98e-v5-coder`](https://ollama.com/mannix/gemma4-98e-v5-coder) \| Same GGUF tier sweep, ready for `ollama pull`. \|


	\| \| 98e v4 \| 98e v5-coder (this model) \|
	\|---\|---\|---\|
	\| Total params \| 20.8B \| ~20.8B \|
	\| Experts per layer \| 98 (30 dropped) \| 98 (30 dropped) \|
	\| Drop map \| multi-class CD-map (max), p16 \| C6 layer-relevance-weighted v4-floor, breadth=50 \|
	\| Shared FFN α \| 1.0 \| 1.0 (none) \|

	> Eval status — complete (9/9). ARC-Challenge was rescored 2026-05-18 on stack-pinned solidpc (stock vLLM 0.20.2 + Fix-A patched lm-eval) → 95.31 % ±0.62 pp, retiring the prior ⚠ from the [silent-empty Fix-A pathology](https://huggingface.co/ManniX-ITA/gemma-4-A4B-98e-v4-it#anomalies-inspected). The original 12.37 % was 87.6 % `content=""` responses because lm-eval's stock `openai_completions.parse_generations` didn't fall back to `reasoning_content`.

	## Scoreboard — NVFP4A16, vLLM, greedy

	NVFP4A16 quant via [nvidia-modelopt 0.43.0](https://github.com/NVIDIA/TensorRT-Model-Optimizer), served via vLLM 0.20.2 with `--reasoning-parser gemma4`, `enable_thinking=true`, `thinking_token_budget=12288`. Sampler: greedy (`T=0, top_p=1, top_k=0, do_sample=false`) — the canonical Gemma 4 9-bench recipe.

	\| Bench (n) \| 128e ref \| 98e v4 \| 98e v5-coder \| Δ (v5-coder − v4) \|
	\|---\|---:\|---:\|---:\|---:\|
	\| ARC-Challenge-chat (1172) \| 95.99% \| 95.99% \| 95.31% \| −0.68 \|
	\| GPQA Diamond flex (198) \| 73.23% \| 69.19% \| 68.69% \| −0.50 \|
	\| GSM8K-100 flex \| 91.00% \| 86.00% \| 86.00% \| 0.00 \|
	\| MATH-500-100 math_verify \| 89.00% \| 89.00% \| 92.00% \| +3.00 \|
	\| AIME 2024 (30) \| 36.67% \| 36.67% \| 36.67% \| 0.00 \|
	\| IFEval-100 (prompt_strict) \| 95.00% \| 93.00% \| 94.00% \| +1.00 \|
	\| HumanEval-164 chat \| 96.95% \| 96.95% \| 98.17% \| +1.22 \|
	\| HumanEval+-164 chat \| 92.07% \| 91.46% \| 92.68% \| +1.22 \|
	\| LCB-medium-55 v4 \| 87.27% \| 78.18% \| 85.45% (47/55) \| +7.27 \|

	Reading the deltas: v5-coder is a deliberate code-leaning rewrite of v4's drop ranking. The C6 drop map protects per-layer code-relevance signal harder than v4's CD-max aggregation does, which shows up cleanly as `+1.22 / +1.22 / +7.27` on the three code benches (HE / HE+ / LCB-medium). MATH-500 also recovers +3.00pp, which suggests math-on-text leans on the same experts as code reasoning more than v4's drop map assumed. Reasoning and general-knowledge benches are essentially flat: ARC −0.68pp, GPQA −0.50pp, GSM8K 0.00, AIME 0.00, IFEval +1.00. The big win is LCB-medium +7.27pp (43/55 → 47/55, four more problems solved): that is well outside the ±2pp single-run noise floor on a 55-problem bench and matches the recipe's design intent (preserve code-specialist experts without a measurable cost on the other benches).

	ARC's prior −83.6pp gap (12.37% vs 95.99%) was not a v5-coder regression; it was the silent-empty Fix-A bug on the unpatched pod that ran it. 87.6% of the 1,172 ARC samples came back with empty `content` because vLLM 0.20.2 + Gemma 4 + reasoning-parser routes the answer to `reasoning_content` when the closing channel token isn't seen, and lm-eval's stock `parse_generations` reads `content` only. The model itself was fine; the eval harness wasn't patched. The stack-pinned rescore on 2026-05-18 landed at 95.31 %, inside the predicted 95–97 % band and 0.68pp below 128e and v4 (both 95.99 %), which is roughly one standard error (±0.62pp).
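The ±0.62pp figure quoted for the ARC rescore can be sanity-checked with a plain binomial standard error. This is a minimal sketch that assumes each of the 1,172 items is an independent pass/fail trial; whether the eval harness uses exactly this formula is an assumption, and `stderr_pp` is a hypothetical helper name.

```python
import math

def stderr_pp(accuracy: float, n: int) -> float:
    """Binomial standard error of a pass rate over n items, in percentage points."""
    return 100.0 * math.sqrt(accuracy * (1.0 - accuracy) / n)

arc_v5 = 0.9531   # rescored v5-coder ARC-Challenge accuracy (from the card)
arc_ref = 0.9599  # 128e and v4 ARC-Challenge accuracy (from the card)
n_arc = 1172      # ARC-Challenge item count

se = stderr_pp(arc_v5, n_arc)
gap = 100.0 * (arc_ref - arc_v5)
print(f"stderr = +/-{se:.2f} pp, gap to 128e/v4 = {gap:.2f} pp")
```

Under that assumption the gap to the reference models is about the size of one standard error, so the rescored ARC number does not separate v5-coder from 128e or v4.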

	## Scoreboard — Q6_K GGUF, llama.cpp, greedy

	A separate measurement from the NVFP4A16/vLLM table above — same model, different quant and different inference engine, so the numbers are not directly comparable to the vLLM column (notably AIME and GPQA differ by backend). This is the full 9-bench llama.cpp Q6_K run on a 3090 (pod 37268930, 2026-05-23), `llama-server --reasoning-format deepseek --reasoning-budget 12288`, greedy (`T=0, top_p=1, top_k=0`), `--parallel 2`. The 128e column is the bartowski Q6_K reference run under the identical recipe, so this table is apples-to-apples within the llama.cpp/Q6_K backend.

	\| Bench (n) \| 128e Q6_K \| v5-coder Q6_K \| Δ (v5-coder − 128e) \|
	\|---\|---:\|---:\|---:\|
	\| HumanEval-164 chat \| 96.34% \| 99.39% \| +3.05 \|
	\| HumanEval+-164 chat \| 90.85% \| 93.29% \| +2.44 \|
	\| LCB-medium-55 v4 \| 94.55% (52/55) \| 85.45% (47/55) \| −9.10 \|
	\| MATH-500-100 math_verify \| 94.00% \| 94.00% \| 0.00 \|
	\| IFEval-100 (prompt_strict) \| 97.00% \| 94.00% \| −3.00 \|
	\| GSM8K-100 flex \| 92.00% \| 87.00% \| −5.00 \|
	\| AIME 2024 (30) \| 83.33% \| 53.33% \| −30.00 \|
	\| ARC-Challenge-chat (1172) \| 97.10% \| 95.73% \| −1.37 \|
	\| GPQA Diamond flex (198) \| 72.73% \| 65.15% \| −7.58 \|

	Reading the Q6_K deltas: the code-leaning design intent shows on llama.cpp too — v5-coder beats the unpruned 128e on HumanEval (+3.05) and HumanEval+ (+2.44). The cost is everything else: pruning 30 experts/layer hits the hardest multi-step reasoning sharply (AIME −30.00, LCB-medium −9.10, GPQA −7.58) and trims general reasoning modestly (GSM8K −5.00, IFEval −3.00, ARC −1.37), while MATH-500 holds at parity. v5-coder is the right pick for HumanEval-style single-function code generation; for competition math / hard algorithmic coding the unpruned base is materially stronger.

	### Extended code benches — LCB-medium-100 + MultiPL-E-100 (Q6_K, 2026-05-24)

	Two broader code benches run under the identical Q6_K / llama.cpp / greedy recipe (pod 37268930), both first-class `omk_eval` backends (sqlite-resumable). The 128e column is the bartowski Q6_K reference under the same recipe — apples-to-apples within the llama.cpp/Q6_K backend.

	LCB-medium-100 — the 100-problem superset of the LCB-medium-55 v4 set above (contains all 55, plus 45 more medium/functional problems on a relaxed date window):

	\| Bench (n) \| 128e Q6_K \| v5-coder Q6_K \| Δ (v5-coder − 128e) \|
	\|---\|---:\|---:\|---:\|
	\| LCB-medium-100 \| 95.00% (95/100) \| 91.00% (91/100) \| −4.00 \|

	MultiPL-E-100 — HumanEval translated to Rust / Java / JavaScript, 100 problems per language (300 total), chat-mode generation + Markdown-code-block extraction (raw-completion mode degenerates on Gemma 4 reasoning models — see [`feedback_gemma4_chat_only_completions_breaks`](https://github.com/mann1x/omnimergekit/blob/main/memory/feedback_gemma4_chat_only_completions_breaks.md)). Macro-averaged over languages:

	\| Language (n=100) \| 128e Q6_K \| v5-coder Q6_K \| Δ (v5-coder − 128e) \|
	\|---\|---:\|---:\|---:\|
	\| Rust \| 83.00% \| 76.00% \| −7.00 \|
	\| Java \| 91.00% \| 81.00% \| −10.00 \|
	\| JavaScript \| 95.00% \| 86.00% \| −9.00 \|
	\| Macro mean \| 89.67% (269/300) \| 81.00% (243/300) \| −8.67 \|

	Reading these: on the broader LCB-medium-100 the pruning cost shrinks to −4.00pp (vs −9.10 on the harder 55 v4 subset). Because the 100-problem set contains all 55, the 45 added problems come out at 44/45 for v5-coder (91 − 47) against 43/45 for 128e (95 − 52), so the deficit is confined to the original 55 (47 vs 52). MultiPL-E exposes a clearer multi-language pruning cost (−7 to −10pp across Rust/Java/JS), consistent with v5-coder's Python/HumanEval-leaning design: the code-specialist experts it preserves are strongest on Python-style single-function tasks. Absolute scores are lowest on Rust and highest on JavaScript for both models, but the pruning penalty itself is largest on Java (−10.00) and smallest on Rust (−7.00).
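For readers who want to reproduce the macro-mean row, the arithmetic is an unweighted average of the three per-language pass rates. The sketch below copies the pass counts from the MultiPL-E-100 table; the dictionary layout and the `macro_mean` helper are illustrative names, not part of `omk_eval`.

```python
# Pass counts per language (n=100 each), copied from the MultiPL-E-100 table.
passes = {
    "128e Q6_K": {"Rust": 83, "Java": 91, "JavaScript": 95},
    "v5-coder Q6_K": {"Rust": 76, "Java": 81, "JavaScript": 86},
}
N_PER_LANG = 100

def macro_mean(per_lang: dict, n: int = N_PER_LANG) -> float:
    """Unweighted mean of per-language pass rates, in percent."""
    rates = [100.0 * k / n for k in per_lang.values()]
    return sum(rates) / len(rates)

ref = macro_mean(passes["128e Q6_K"])
pruned = macro_mean(passes["v5-coder Q6_K"])
per_lang_delta = {
    lang: passes["v5-coder Q6_K"][lang] - passes["128e Q6_K"][lang]
    for lang in passes["128e Q6_K"]
}
print(f"{ref:.2f}% vs {pruned:.2f}% (delta {pruned - ref:+.2f} pp)")
print(per_lang_delta)
```

With equal n per language the macro mean equals the pooled rate (269/300 and 243/300), so the two ways of reporting the row agree.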

	### HumanEval / HumanEval+ sanity audit

	98.17% / 92.68% sits at the top of the 14–22B band (see lazy comparison below), so the samples files were re-audited to rule out a scoring artifact. Audit script: `scripts/audit_v5coder_he.py`.

	\| Bench \| n \| score \| empty \| fenced \| chars p10/p50/p90 \| verbatim-canonical-in-gen \|
	\|---\|---:\|---:\|---:\|---:\|---\|---:\|
	\| HumanEval-164 \| 164 \| 0.9817 \| 1 \| 163 \| 270 / 642 / 1324 \| 3.0% (5/164) \|
	\| HumanEval+-164 \| 164 \| 0.9268 \| 3 \| 161 \| 270 / 620 / 1244 \| 1.8% (3/164) \|

	- Fences are stripped correctly. 163/164 (HE) and 161/164 (HE+) outputs are wrapped in `\`\`\`python` chat fences. lm-eval's chat-aware HE/HE+ scorer (built on the `humaneval_chat` shadow task — see [v4 card §Eval Caveat](https://huggingface.co/ManniX-ITA/gemma-4-A4B-98e-v4-it)) extracts the function body before `exec()`. If fences were leaking through, pass@1 would collapse to ~0 (per [`feedback_gemma4_chat_only_completions_breaks.md`](https://github.com/mann1x/omnimergekit/blob/main/memory/feedback_gemma4_chat_only_completions_breaks.md)); 98.17% is only possible with correct stripping.
	- Empty-response rate is normal. 1/164 HE and 3/164 HE+ are blank — within Gemma 4 reasoning-mode noise; not the 87.6%-empty pathology that hit ARC.
	- No catastrophic contamination. Only 3.0% (HE) and 1.8% (HE+) of generations contain the canonical solution as a verbatim substring. A model that had memorized HE from pretraining would show 30%+; the few verbatim matches here are short structurally-inevitable solutions (e.g. `has_close_elements` O(n²) double-loop).
	- HE → HE+ delta is healthy. −5.49pp drop across the +/− boundary. HE+ adds adversarial test cases that catch brittle solutions which pass the public tests but fail edge cases. A 0pp drop would actually be a memorization red flag; ~−5pp is the expected band for a strong-but-not-memorized model.
	- Failures look real. The 3 HE failures are 1 empty (doc_id 122) plus 2 wrong-logic attempts (doc_id 140 `fix_spaces`, doc_id 145 `order_by_points`). The 12 HE+ failures are mostly "passes basic tests, fails edge cases" — exactly the regime HE+ exists to expose.

	Conclusion: 98.17 / 92.68 is real. Not a scoring artifact, not memorization, not silent-empty.
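The three audit columns (empty, fenced, verbatim-canonical) can be sketched as below. This is not the content of `scripts/audit_v5coder_he.py`, which is not reproduced in this card; the sample keys `generation` and `canonical`, the fence regex, and the whitespace normalization are all assumptions made for illustration.

```python
import re

FENCE = "`" * 3  # triple backtick, built indirectly to keep this listing readable
FENCE_RE = re.compile(FENCE + r"(?:python)?[ \t]*\n(.*?)" + FENCE, re.DOTALL)

def extract_code(generation: str) -> str:
    """Return the first fenced block's body, or the raw text if unfenced."""
    match = FENCE_RE.search(generation)
    return match.group(1) if match else generation

def normalize(code: str) -> str:
    """Collapse whitespace so formatting differences don't hide a match."""
    return " ".join(code.split())

def audit(samples: list) -> dict:
    """Count empty, fenced, and verbatim-canonical generations.

    Each sample is a dict with hypothetical keys 'generation' and 'canonical'.
    """
    stats = {"n": len(samples), "empty": 0, "fenced": 0, "verbatim": 0}
    for sample in samples:
        gen = sample["generation"]
        if not gen.strip():
            stats["empty"] += 1
            continue
        if FENCE_RE.search(gen):
            stats["fenced"] += 1
        canon = normalize(sample["canonical"])
        if canon and canon in normalize(extract_code(gen)):
            stats["verbatim"] += 1
    return stats
```

A substring match after whitespace normalization is a deliberately strict notion of "verbatim"; a looser similarity measure would flag more samples and is a different check.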

	### Lazy comparison vs the 14–22B coder field

	For a sense of scale on whether 98.17 is anomalous. All numbers come from official model cards, papers, or blog posts (linked).

	Coder-specialized 14–22B:

	\| Model \| Params \| HE \| HE+ \| LCB (version) \| Source \|
	\|---\|---\|---:\|---:\|---\|---\|
	\| 98e v5-coder (this) \| 20.8B / 4B MoE \| 98.17 \| 92.68 \| 85.45 (LCB-medium-55 v4) \| _this card_ \|
	\| Qwen2.5-Coder-14B-Instruct \| 14.7B dense \| 89.6 \| 87.2 \| 23.4 (LCB 07/24–11/24, pre-v4) \| [arXiv:2409.12186](https://arxiv.org/pdf/2409.12186) \|
	\| DeepSeek-Coder-V2-Lite-Instruct \| 16B / 2.4B MoE \| 81.1 \| — \| 24.3 (LCB 12/01–06/01) \| [arXiv:2406.11931](https://arxiv.org/pdf/2406.11931) \|
	\| Codestral-22B v1 \| 22B dense \| 81.1 \| — \| (not published) \| [Mistral blog](https://mistral.ai/news/codestral) \|
	\| IBM Granite-20B-Code-Instruct \| 20B dense \| 60.4 \| — \| — \| [arXiv:2405.04324](https://arxiv.org/pdf/2405.04324) \|

	Generalist 14–22B (notable code scores):

	\| Model \| Params \| HE \| MATH \| GPQA-D \| IFEval \| Source \|
	\|---\|---\|---:\|---:\|---:\|---:\|---\|
	\| 98e v5-coder (this) \| 20.8B / 4B MoE \| 98.17 \| 92.00 (MATH-500) \| 68.69 \| 94.00 \| _this card_ \|
	\| Phi-4 \| 14B dense \| 82.6 \| 80.4 (MATH) \| 56.1 \| 63.0 \| [arXiv:2412.08905](https://arxiv.org/html/2412.08905v1) \|
	\| Qwen2.5-14B-Instruct \| 14.7B dense \| 81.7–86.2 \| 73.0 (MATH) \| 40.9 \| 80.0 \| [Qwen blog](https://qwenlm.github.io/blog/qwen2.5-llm/) \|
	\| Mistral-Small-3 (24B, just above band) \| 24B dense \| ~84 \| 70.6 (MATH) \| 45.3 \| 82.1 \| [Mistral blog](https://mistral.ai/news/mistral-small-3) \|

	Where v5-coder sits:

	- HE / HE+: top of the band, +8.57pp (HE) and +5.48pp (HE+) above Qwen2.5-Coder-14B's 89.6 / 87.2 (the published field leader). The audit above rules out scoring artifacts; the gap is real on this run.
	- LCB: not apples-to-apples with Qwen2.5-Coder or DS-Coder-V2-Lite. Those numbers are full LCB on pre-v4 problem windows (07/24–11/24 and 12/01–06/01 respectively, as listed in the table). v5-coder's 85.45% is LCB-medium-55 on v4 problems, which is a different subset and a different problem set. A fair comparison would require running Qwen2.5-Coder-14B on the same LCB-medium-55 v4 split, which nobody has published. Don't read +60pp into the LCB column.
	- MATH-500 92.00 / GSM8K 86 / AIME 36.67: top of the band for math-on-text reasoning. Phi-4's 80.4 MATH is the closest generalist; v5-coder beats it by ~12pp. AIME 36.67 is currently the only published 14–22B AIME score in this comparison set (Qwen2.5-Coder and Codestral don't evaluate AIME).
	- GPQA-Diamond 68.69 / IFEval 94.00: GPQA is materially above Phi-4 (56.1) and Qwen2.5-14B (40.9). IFEval 94 beats Mistral-Small-3 (82.1) by ~12pp and Phi-4 (63.0) by ~31pp; instruction-following is Phi-4's known weakness.

	Caveats on this comparison: different labs use different system prompts, different `temperature`/`top_p`, different "chat vs base" framings, different sampling counts. v5-coder is run greedy (`T=0`); some published numbers (e.g. Phi-4) use multi-sample averaging. Within-card deltas (v4 vs v5-coder) are the cleanest signal; cross-card deltas are noisy by ±2-5pp.

	### Same-Stack GGUF HE+ Sweep — v5-coder vs Qwen2.5-Coder-14B-Instruct

	Head-to-head HumanEval+ (164-question, chat-aware shadow task) on identical hardware (single RTX 3090 24 GB) and identical eval recipe (llama-server `-c 32768 -ngl 99 --parallel 2 --jinja --reasoning off`, omk_eval llama backend, lm-eval `humaneval_plus_chat`, greedy `T=0`, `max_gen_toks=16384`). Qwen GGUFs are bartowski's [`Qwen2.5-Coder-14B-Instruct-GGUF`](https://huggingface.co/bartowski/Qwen2.5-Coder-14B-Instruct-GGUF).

	The "Lazy comparison" table above uses paper-reported numbers; this section is what the same rig and same scorer actually measure.

	#### v5-coder (20.8B total / 4B-active MoE) — plain quants

	\| Tier \| File size \| bpw \| HE+ pass@1 \|
	\|---\|---:\|---:\|---:\|
	\| Q2_K \| 8.40 GB \| 3.23 \| 6.10% (collapse) \|
	\| Q3_K_M \| 10.51 GB \| 4.04 \| 84.15% \|
	\| Q4_K_M \| 13.24 GB \| 5.09 \| 92.07% \|
	\| Q5_K_M \| 15.07 GB \| 5.80 \| 90.85% \|
	\| Q6_K \| 17.81 GB \| 6.85 \| 92.07% \|

	Q4_K_M is the recommended sweet spot. Q3_K_M loses ~8pp but is still usable; Q2_K collapses (an MoE-class artifact, not a v5-coder regression — plain Q2_K bytes are the cohort floor).
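The bpw column appears consistent with file size in decimal gigabytes times 8, divided by the total parameter count. The sketch below checks that reading against the plain-quant table; the 20.8B figure comes from this card, while the formula itself is an inference from the numbers rather than something the quantization tooling reports.

```python
def bits_per_weight(file_size_gb: float, total_params_b: float) -> float:
    """bpw = file bits / parameter count (GB = 1e9 bytes, params in billions)."""
    return file_size_gb * 8.0 / total_params_b

V5_CODER_PARAMS_B = 20.8  # total parameters, as stated in this card

# File sizes in GB, copied from the plain-quant table above.
plain_quants_gb = {
    "Q2_K": 8.40,
    "Q3_K_M": 10.51,
    "Q4_K_M": 13.24,
    "Q5_K_M": 15.07,
    "Q6_K": 17.81,
}

bpw = {tier: round(bits_per_weight(gb, V5_CODER_PARAMS_B), 2)
       for tier, gb in plain_quants_gb.items()}
for tier, value in bpw.items():
    print(f"{tier:7s} {plain_quants_gb[tier]:6.2f} GB -> {value:.2f} bpw")
```

Because the divisor is total parameters, not the 4B active per token, bpw here describes disk and VRAM footprint rather than per-token compute.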

	#### v5-coder low-bpw IQ tier sweep (T64, in flight)

	Sub-4-bpw plain k-quants collapse on this MoE chassis (Q2_K at 3.23 bpw is the demonstration). The i-matrix codebook IQ-quants behave very differently — they survive much further down. Sweep is still landing on the same rig + recipe; rows fill in as tiers complete:

	\| Tier \| File size \| bpw \| HE+ pass@1 \|
	\|---\|---:\|---:\|---:\|
	\| IQ3_XXS \| 8.95 GB \| 3.44 \| 89.02% \|
	\| IQ3_XS \| 9.22 GB \| 3.55 \| 91.46% \|
	\| Q3_K_S \| 9.68 GB \| 3.72 \| 75.00% (collapse) \|
	\| IQ3_M \| 9.82 GB \| 3.78 \| 91.46% \|
	\| Q3_K_L \| 10.94 GB \| 4.21 \| 87.20% \|

	The IQ3_XXS result is the headline: at ~9 GB / 3.44 bpw it lands +4.87pp over plain Q3_K_M (10.51 GB / 4.04 bpw) while running ~1.6 GB smaller and ~0.6 bpw lower. A plausible explanation, not verified here, is that the imatrix codebook protects the value-vector subspace through the sub-4-bpw zone where fixed-block k-quants would lose attention precision.

	#### Qwen2.5-Coder-14B-Instruct (14.7B dense) — bartowski quants

	\| Tier \| File size \| bpw \| HE+ pass@1 \|
	\|---\|---:\|---:\|---:\|
	\| IQ4_XS \| 8.12 GB \| 4.42 \| 84.76% \|
	\| Q4_0 \| 8.54 GB \| 4.65 \| 84.15% \|
	\| Q4_K_M \| 8.99 GB \| 4.89 \| 85.37% \|
	\| Q5_K_M \| 10.51 GB \| 5.72 \| 83.54% \|
	\| Q6_K \| 12.12 GB \| 6.60 \| 84.76% \|
	\| Q8_0 \| 15.70 GB \| 8.54 \| 84.76% \|

	Qwen sits at 83–85% across the whole tier ladder. The paper-reported 87.2 HE+ is ~2pp above what bartowski's GGUFs deliver on this stack — a known llama-server chat-template vs vLLM-temp=0 quirk, not a quant defect.

	#### Head-to-head by file size (v5-coder runs lower bpw at the same disk)

	Pairing by tier name is misleading here: v5-coder is a 20.8B-total MoE and Qwen is a 14.7B dense, so the same tier name maps to different file sizes. The fair comparison is iso-disk: at a given GB budget, which model wins HE+? In the three bands from ~10.5 GB up, v5-coder runs 1.5–2.7 bpw lower than Qwen and still scores higher; at ~9 GB its plain Q2_K collapses.

	\| Disk band \| Qwen2.5-Coder-14B (size / bpw / HE+) \| v5-coder (size / bpw / HE+) \| Δ HE+ \|
	\|---\|---\|---\|---:\|
	\| ~15 GB \| Q8_0 15.70 GB / 8.54 / 84.76% \| Q5_K_M 15.07 GB / 5.80 / 90.85% \| +6.09 \|
	\| ~12–13 GB \| Q6_K 12.12 GB / 6.60 / 84.76% \| Q4_K_M 13.24 GB / 5.09 / 92.07% \| +7.31 \|
	\| ~10.5 GB \| Q5_K_M 10.51 GB / 5.72 / 83.54% \| Q3_K_M 10.51 GB / 4.04 / 84.15% \| +0.61 \|
	\| ~9 GB \| Q4_K_M 8.99 GB / 4.89 / 85.37% \| Q2_K 8.40 GB / 3.23 / 6.10% (collapse) \| −79.27 \|

	The first three rows are the practical story: at ~15 GB Qwen's near-lossless Q8_0 loses 6pp to v5-coder Q5_K_M; at ~13 GB v5-coder Q4_K_M is +7.3pp over Qwen Q6_K; at ~10.5 GB even v5-coder Q3_K_M edges out Qwen Q5_K_M while running at 1.7 bpw less. The 4th row marks the floor for plain k-quants: Q2_K collapses, so the plain-quant v5-coder ladder bottoms out at Q3_K_M / ~10.5 GB. The IQ3 tiers in the sweep above reach further down (IQ3_XXS: 8.95 GB, 89.02%).
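The iso-disk pairing can be reproduced by matching each Qwen quant to the v5-coder plain quant closest in file size. This is a sketch of one reasonable pairing rule (nearest size); the card does not state the rule that was actually used, and `iso_disk_pairs` is a hypothetical helper. The IQ tiers are left out to mirror the table.

```python
# (tier, file size in GB, HE+ pass@1 in %), copied from the tables above.
qwen = [
    ("Q8_0", 15.70, 84.76),
    ("Q6_K", 12.12, 84.76),
    ("Q5_K_M", 10.51, 83.54),
    ("Q4_K_M", 8.99, 85.37),
]
v5_coder = [
    ("Q2_K", 8.40, 6.10),
    ("Q3_K_M", 10.51, 84.15),
    ("Q4_K_M", 13.24, 92.07),
    ("Q5_K_M", 15.07, 90.85),
    ("Q6_K", 17.81, 92.07),
]

def iso_disk_pairs(reference, candidate):
    """Pair each reference quant with the candidate quant closest in file size."""
    rows = []
    for r_tier, r_size, r_score in reference:
        c_tier, c_size, c_score = min(candidate, key=lambda c: abs(c[1] - r_size))
        rows.append((r_tier, c_tier, round(c_size - r_size, 2), round(c_score - r_score, 2)))
    return rows

pairs = iso_disk_pairs(qwen, v5_coder)
for r_tier, c_tier, d_size, d_score in pairs:
    print(f"Qwen {r_tier:7s} vs v5-coder {c_tier:7s} size {d_size:+.2f} GB, HE+ {d_score:+.2f} pp")
```

Nearest-size matching happens to select the same four pairs as the table; a strict "no larger than the reference" rule would pair Qwen Q6_K with Q3_K_M instead of Q4_K_M.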

	> Pure tier-name matching (Qwen Q4_K_M vs v5-coder Q4_K_M etc.) would put v5-coder ~4–6 GB larger at every shared tier (Q4_K_M, Q5_K_M, Q6_K) and ~+7pp ahead. That comparison is symmetric but unfair to Qwen's smaller footprint. The iso-disk view above is the one to plan VRAM around.

	> CD-* (ContribDynamic) tiers are intentionally omitted from this table. Those are mid-rebuild after a 2026-05-19 patch closed a `--tensor-type-file` heuristic gap; they will be added once the rebuilt CD scores are confirmed.

	Run logs and samples live under `eval_results_hep_sweep/humanevalplus_full/` in the project tree.

	## What changed vs v4 (mechanical detail)

	Identical surgery flow to v4 with one substitution — a different drop map.

	1. Same base: `google/gemma-4-26B-A4B-it` (128e).
	2. Same drop count: 30 experts per layer (98 retained).
	3. Same `protect_top=16` shield.
	4. Different ranking signal: instead of `score[layer][expert] = max over normalized classes (wnorm·α + tc)` (v4), v5-coder scores each expert by a layer-relevance-weighted floor against v4's keep set, with breadth=50 controlling how many top-relevance experts get the floor lift before the bottom-30 cutoff is taken. The recipe scripts live in [omnimergekit](https://github.com/mann1x/omnimergekit) (T25 / T28 / T30 / C6 series — see `feedback_*` memory for the ablation history).
	5. Same downstream: slice expert tensors, resize MoE router `proj.weight` from `[128, hidden] → [98, hidden]`, update `config.json: num_experts=98`, GGUF conversion + quant pipeline unchanged.

	No shared FFN scaling (verified: `layers.0.mlp.down_proj.weight` is byte-for-byte identical to v4 in BF16).
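Step 5 of the list above (slice the expert tensors, resize the router, update the config) can be illustrated on a toy layer. This is a pure-Python sketch with placeholder data: the drop indices are invented, the lists stand in for real tensors, and `prune_layer` is a hypothetical name, not the interface of `scripts/expert_drop.py`.

```python
def prune_layer(expert_weights, router_weight, dropped):
    """Remove dropped experts from one MoE layer.

    expert_weights: one entry per expert (stand-in for per-expert tensors).
    router_weight:  one row of length `hidden` per expert.
    dropped:        expert indices to remove for this layer.
    """
    drop = set(dropped)
    keep = [i for i in range(len(expert_weights)) if i not in drop]
    new_experts = [expert_weights[i] for i in keep]
    # Router rows must be sliced with the same indices, in the same order,
    # so that row j of the new router still scores new expert j.
    new_router = [router_weight[i] for i in keep]  # [128, hidden] -> [98, hidden]
    return new_experts, new_router, keep

# Toy layer: 128 experts, hidden size 4.
num_experts, hidden = 128, 4
experts = [f"expert_{i}" for i in range(num_experts)]
router = [[float(i)] * hidden for i in range(num_experts)]
dropped = list(range(0, 120, 4))  # 30 placeholder indices, NOT the real drop map

new_experts, new_router, keep = prune_layer(experts, router, dropped)
config = {"num_experts": len(new_experts)}
print(len(new_router), len(new_router[0]), config)
```

The point the sketch makes is the shared `keep` index list: slicing experts and router rows independently would silently misroute tokens even though the shapes would still match.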

	## When to pick which 98e variant

	\| Variant \| Lean \| Pick when \|
	\|---\|---\|---\|
	\| [v3](https://huggingface.co/ManniX-ITA/gemma-4-A4B-98e-v3-it) \| pooled TF (no class signal) \| reference baseline; you want the original v3 numbers \|
	\| [v4](https://huggingface.co/ManniX-ITA/gemma-4-A4B-98e-v4-it) \| balanced (5-class CD-map) \| general-purpose; first 98e you'd default to \|
	\| [v5](https://huggingface.co/ManniX-ITA/gemma-4-A4B-98e-v5-it) \| v4 + shared FFN α=1.2 \| when you want v4 with a louder expert-mixture residual (research checkpoint) \|
	\| v5-coder (this) \| code-leaning C6 floor \| HumanEval / HumanEval+ / LCB / MultiPL-E workloads; +7pp over v4 on LCB-medium-55 (NVFP4A16) \|

	## Usage

	### Transformers

	```python
	from transformers import AutoModelForCausalLM, AutoTokenizer
	import torch

	model = AutoModelForCausalLM.from_pretrained(
	    "ManniX-ITA/gemma-4-A4B-98e-v5-coder-it",
	    torch_dtype=torch.bfloat16,
	    device_map="auto",
	    attn_implementation="eager",  # Gemma 4 head_dim=512: FA2 not supported
	)
	tok = AutoTokenizer.from_pretrained("ManniX-ITA/gemma-4-A4B-98e-v5-coder-it")

	msgs = [{"role": "user", "content": "Write a Python function that reverses a binary tree in-place."}]
	inputs = tok.apply_chat_template(msgs, return_tensors="pt", add_generation_prompt=True).to(model.device)
	out = model.generate(inputs, max_new_tokens=2048, do_sample=False) # greedy, canonical recipe
	print(tok.decode(out[0][inputs.shape[1]:], skip_special_tokens=True))
	```

	### vLLM (NVFP4A16, canonical eval recipe)

	```bash
	python -m vllm.entrypoints.openai.api_server \
	--model ManniX-ITA/gemma-4-A4B-98e-v5-coder-NVFP4A16 \
	--served-model-name 98e_v5_coder_nvfp4a16 \
	--port 8099 \
	--gpu-memory-utilization 0.55 \
	--max-model-len 32768 \
	--max-num-batched-tokens 8192 \
	--dtype bfloat16 \
	--trust-remote-code \
	--reasoning-parser gemma4 \
	--default-chat-template-kwargs '{"enable_thinking": true}'
	```
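Once the server above is running, a greedy request can be sent to its OpenAI-compatible chat endpoint. The sketch below uses only the standard library; the port and served model name come from the command above, while the helper names and the `max_tokens` default are assumptions. It deliberately leaves out `top_k` and the thinking budget, since how those are passed per request is not shown in this card.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8099/v1/chat/completions"  # port from the command above

def build_payload(prompt: str, max_tokens: int = 2048) -> dict:
    """Greedy chat-completions request body for the served model name above."""
    return {
        "model": "98e_v5_coder_nvfp4a16",
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0,
        "top_p": 1,
        "max_tokens": max_tokens,  # assumed default, adjust to the task
    }

def chat(prompt: str) -> dict:
    """POST the request to the local server and return the decoded JSON response."""
    request = urllib.request.Request(
        BASE_URL,
        data=json.dumps(build_payload(prompt)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

Calling `chat("Write a Python function that reverses a linked list.")` returns the raw response dict; with the reasoning parser enabled, check both `content` and `reasoning_content` on the returned message, for the reason described in the eval caveat.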

	### llama.cpp (GGUF)

	```bash
	llama-server -m gemma-4-A4B-98e-v5-coder-it-Q6_K.gguf \
	--port 8099 -c 32768 -ngl 99 --no-warmup \
	--jinja --reasoning-format deepseek --reasoning-budget 12288 \
	--temp 0 --top-p 1 --top-k 0
	```

	GGUF quants: [ManniX-ITA/gemma-4-A4B-98e-v5-coder-it-GGUF](https://huggingface.co/ManniX-ITA/gemma-4-A4B-98e-v5-coder-it-GGUF) — full Bartowski tier sweep (Q2_K → Q8_0, IQ2-IQ4) plus 5 ContribDynamic CD-* per-layer quants (CD-Q2_K, CD-Q3_K_M, CD-Q4_K_M, CD-Q5_K_M, CD-Q6_K). File naming: `gemma-4-A4B-98e-v5-coder-it-<TIER>.gguf`.

	### ollama

	```bash
	ollama pull mannix/gemma4-98e-v5-coder:Q6_K
	ollama run mannix/gemma4-98e-v5-coder:Q6_K
	```

	Available tiers: [`mannix/gemma4-98e-v5-coder`](https://ollama.com/mannix/gemma4-98e-v5-coder), the same set as the GGUF repo (`Q2_K` … `Q8_0`, `IQ2_*` … `IQ4_*`, `CD-Q2_K` … `CD-Q6_K`). The Modelfile uses the Gemma 4 tool/parser template (matches the `mannix/gemma4-98e-v4` convention).

	## Related Models

	\| Model \| Description \|
	\|---\|---\|
	\| [gemma-4-A4B-98e-v5-coder-NVFP4A16](https://huggingface.co/ManniX-ITA/gemma-4-A4B-98e-v5-coder-NVFP4A16) \| NVFP4A16 quant (~13 GB, vLLM-ready) \|
	\| [gemma-4-A4B-98e-v5-coder-it-GGUF](https://huggingface.co/ManniX-ITA/gemma-4-A4B-98e-v5-coder-it-GGUF) \| GGUF tier sweep + CD per-layer quants (llama.cpp / ollama) \|
	\| [mannix/gemma4-98e-v5-coder (Ollama)](https://ollama.com/mannix/gemma4-98e-v5-coder) \| Ollama-published version of the GGUF tier sweep \|
	\| [gemma-4-A4B-98e-v4-it](https://huggingface.co/ManniX-ITA/gemma-4-A4B-98e-v4-it) \| The apples-to-apples baseline for this model \|
	\| [gemma-4-A4B-98e-v5-it](https://huggingface.co/ManniX-ITA/gemma-4-A4B-98e-v5-it) \| Sibling: v4 + shared FFN α=1.2 \|
	\| [gemma-4-A4B-98e-v3-it](https://huggingface.co/ManniX-ITA/gemma-4-A4B-98e-v3-it) \| Earlier baseline (pooled TF map) \|

	## Recipe + Code

	[OmniMergeKit](https://github.com/mann1x/omnimergekit) is the canonical home. The relevant artifacts for this model:

	- `scripts/v5coder_C6_v4floor_perlayer_breadth50_drop_map.json` — the drop map (embedded in `expert_drop_metadata.json`).
	- `scripts/expert_drop.py` — drop applier (unchanged across v3/v4/v5/v5-coder).
	- `eval/EVAL_PROTOCOL.md` — locked greedy methodology for the 9-bench suite, including the mandatory Fix-A patch for lm-eval's `openai_completions.parse_generations` (without it, ARC and other chat-completions benches silent-empty under Gemma 4 + reasoning-parser).

	## Eval Caveat — Fix-A is mandatory

	This model was originally evaluated on a pod whose lm-eval install was missing the Fix-A `reasoning_content` fallback patch in `openai_completions.parse_generations`. Under vLLM 0.20.2 + `--reasoning-parser gemma4`, Gemma 4 routes the answer to the message's `reasoning_content` field and leaves `content=""` whenever the parser doesn't see the closing channel token. Without Fix-A, lm-eval reads only `content` and scores those responses as empty (= wrong on multiple-choice tasks). On ARC-Challenge this produced 1027 empty / 1172 total (87.6%) → 12.37% pass; the patched rescore is the 95.31% reported in the scoreboard. On the other 8 benches, the silent-empty rate stayed below 10% (because the prompt templates land in a regime where the model emits a content phase reliably), so their scores are within the canonical band.

	The lesson is captured permanently in [`omnimergekit/eval/EVAL_PROTOCOL.md`](https://github.com/mann1x/omnimergekit/blob/main/eval/EVAL_PROTOCOL.md) and the canonical pod bootstrap (`pod_setup_eval_envs.sh`) auto-applies Fix-A — every new eval pod now starts in the patched state.
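The fallback that Fix-A adds can be sketched in a few lines. This is a hypothetical stand-in written against the generic chat-completions response shape, not the actual patch: the real `parse_generations` in lm-eval may take different arguments, and the function names here are illustrative only.

```python
def parse_generation(choice: dict) -> str:
    """Return the answer text from one chat-completions choice.

    Prefer `content`; fall back to `reasoning_content` when `content` is
    empty, which is the silent-empty case described above.
    """
    message = choice.get("message", {})
    content = message.get("content") or ""
    if content.strip():
        return content
    return message.get("reasoning_content") or ""

def parse_generations(response: dict) -> list:
    """Apply the fallback to every choice in a response dict."""
    return [parse_generation(choice) for choice in response.get("choices", [])]

# Example: the second choice mimics a response whose closing channel token
# was never seen, so the answer landed in `reasoning_content`.
example = {
    "choices": [
        {"message": {"content": "B", "reasoning_content": "thinking..."}},
        {"message": {"content": "", "reasoning_content": "The answer is C"}},
    ]
}
print(parse_generations(example))
```

A fallback like this returns the whole reasoning text when `content` is empty, so the downstream answer extractor still has to locate the final answer inside it.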

	## License

	This model inherits the [Gemma license](https://ai.google.dev/gemma/terms) from the base model.

	## Acknowledgements

	- Google for the base Gemma 4 26B-A4B-it model
	- The OmniMergeKit project for the surgery + eval toolkit
	- The vLLM and modelopt teams for the NVFP4A16 serving / quantization pipeline
	- bartowski for the calibration data v5 used in imatrix GGUF quantization