gemma-4-A4B-98e-v5-coder-it-GGUF
GGUF quantizations of ManniX-ITA/gemma-4-A4B-98e-v5-coder-it.
All quants made using imatrix with calibration data v5.
Removed tiers (2026-05-19 / 2026-05-20) — collapsed, broken, or Pareto-dominated
Seven GGUFs were deleted from this repo after the full HE+ sweep landed. They are intentionally absent from the tables below; do not re-quantize at these tiers without the rationale being addressed.
| File (removed) | Size | HE+ | Why removed |
|---|---|---|---|
| Q2_K.gguf | 8.40 GB | 6.10% | plain k-quant collapses on this MoE chassis at 2-bit |
| CD-Q2_K.gguf | 8.40 GB | 4.27% | full-override at 2-bit is worse than plain: no headroom for per-layer differentiation when every body tensor is already at the codebook floor |
| IQ2_XXS.gguf | 7.37 GB | crashed | T66 eval: 12+ consecutive HTTP 500s from llama-server due to a 1-in-10 rumination-lock pattern at this bpw |
| IQ2_XS.gguf | 7.77 GB | 64.63% | borderline-usable but well below IQ2_S (7.83 GB / 85.37%) at +0.06 GB; IQ2_S is the 2-bit pick |
| CD-Q4_K_M.gguf | 10.59 GB | 73.78% | T58 rebuild had bad attn_v handling (confirmed by T61b ablation); superseded by CD-Q4_K_M_L.gguf (13 GB / 93.29%) below |
| CD-IQ3_M_L.gguf | 10.19 GB | 91.46% | dominated by plain IQ3_M (9.82 GB / 91.46%, identical score at -0.37 GB); removed 2026-05-20 after full Q-family rescore |
| CD-Q6_K_L.gguf | 14.95 GB | 91.46% | dominated by full-override CD-Q6_K (15.44 GB / 92.68%) and plain Q6_K (17.81 GB / 92.68%); removed 2026-05-20 |
Available Quantizations
Two CD families are published. The plain k-quants + IQ tiers are also here for direct comparison. HE+ pass@1 is the 164-question humaneval_plus_chat shadow task on the same 3090 rig + greedy recipe used for the rest of the card.
ContribDynamic full-override (CD-*) — per-layer importance from v3 imix; explicit per-tensor type assignment everywhere. Best disk-efficiency. All sizes are HF-displayed decimal GB (10⁹ bytes); ls -lh would show ~7% smaller values in GiB.
| Quantization | Size | HE+ | Notes |
|---|---|---|---|
| CD-Q6_K | 15.44 GB | 92.68% | ties plain Q6_K (17.81 GB) at -2.37 GB |
| CD-Q5_K_M | 13.01 GB | 92.68% | best size/score at this band among full-override |
| CD-IQ4_K_M (Canary W) | 10.29 GB | 92.07% | matches plain Q4_K_M at -22% disk — recommended sub-11 GB CD pick |
| CD-Q3_K_M | 9.94 GB | 89.02% | +4.87pp over plain Q3_K_M |
| CD-Q2_K | (removed) | 4.27% | model collapses at 2-bit (cohort-wide; see the removed tiers table above) |
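The decimal-GB versus GiB note above can be checked with a few lines of Python. This is only a unit-conversion sketch; the helper name `gb_to_gib` is made up for illustration, and the sizes are copied from the table above.

```python
def gb_to_gib(size_gb: float) -> float:
    """Convert decimal gigabytes (10**9 bytes) to binary gibibytes (2**30 bytes)."""
    return size_gb * 10**9 / 2**30


# Sizes as displayed on the card (decimal GB).
for name, size_gb in [("CD-Q6_K", 15.44), ("CD-Q5_K_M", 13.01), ("CD-IQ4_K_M", 10.29)]:
    print(f"{name}: {size_gb:.2f} GB is about {gb_to_gib(size_gb):.2f} GiB")
```

The conversion factor is 10⁹ / 2³⁰ ≈ 0.931, which is where the "~7% smaller" figure comes from.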
CD-mix _L family (2026-05-19, pruned 2026-05-20) — body-only --tensor-type-file override; heuristic protects attn_v / ffn_down / token_embd. Slightly heavier than full-override, cleaner abstraction. CD-Q4_K_M_L ties the cohort practical top (93.29%) at -0.24 GB vs plain Q4_K_M. Two tiers (CD-IQ3_M_L, CD-Q6_K_L) were Pareto-dominated by plain quants on the 2026-05-20 rescore and removed from this repo + ollama.
| Quantization | Size | HE+ | Notes |
|---|---|---|---|
| CD-Q4_K_M_L | 13.00 GB | 93.29% | ties cohort practical top; +1.22pp over plain Q4_K_M at -0.24 GB |
| CD-IQ3_XS_L | 9.27 GB | 90.24% | smallest viable code-grade _L quant; +6.09pp over plain Q3_K_M at -1.24 GB |
| CD-IQ3_M_L (removed) | 10.19 GB | 91.46% | dominated by plain IQ3_M (9.82 GB / 91.46%, same score at -0.37 GB) |
| CD-Q6_K_L (removed) | 14.95 GB | 91.46% | dominated by full-override CD-Q6_K (15.44 GB / 92.68%) and plain Q6_K (17.81 GB / 92.68%) |
CD-mix _h family (2026-05-19, new): same body-only override as _L, but the LOW tier is kept at MID's codebook class instead of being aggressively pushed lower. Recovers +0.61pp at the IQ3_XS band relative to _L (90.85% vs 90.24%).
| Quantization | Size | HE+ | Notes |
|---|---|---|---|
| CD-IQ3_XS_h | 9.72 GB | 90.85% | +0.61pp over CD-IQ3_XS_L at +0.45 GB |
| CD-IQ3_M_h | 11.15 GB | 92.07% | ties plain Q4_K_M (13.24 GB) at -2.09 GB; -1.22pp vs CD-Q4_K_M_L (13.00 GB / 93.29%) at -1.85 GB |
CD-Q4_K_M_h (13.29 GB / 92.07%) and CD-Q5_K_M_h (13.94 GB / 91.46%) were also built and evaluated but are intentionally not published: at their bpw they are Pareto-dominated by CD-IQ3_M_h (-2.14 GB at the same 92.07%) and by CD-Q4_K_M_L (13.00 GB / 93.29% — strictly higher score at smaller disk). CD-Q5_K_M_L (mix at the 5-bit band) is opt-in only — at ~13.9 GB / 90.85% it's heavier and lower than full-override CD-Q5_K_M for v5-coder. All three remain generatable via --recipes.
The full plain + IQ + CD comparison sweep with iso-disk analysis is in the head-to-head table.
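The "Pareto-dominated" removals described above can be illustrated with a small check. This is a sketch under one assumption: a tier counts as dominated when another tier is no larger and scores no lower, with at least one strict inequality. The card may apply a looser reading in places. The `Tier` type and `dominators` helper are hypothetical names; sizes and HE+ scores are copied from this card.

```python
from typing import NamedTuple


class Tier(NamedTuple):
    name: str
    size_gb: float
    he_plus: float


def dominators(candidate: Tier, cohort: list) -> list:
    """Names of tiers that are no larger and score no lower, strictly better on one axis."""
    return [
        t.name
        for t in cohort
        if t.name != candidate.name
        and t.size_gb <= candidate.size_gb
        and t.he_plus >= candidate.he_plus
        and (t.size_gb < candidate.size_gb or t.he_plus > candidate.he_plus)
    ]


# A small subset of tiers named in the text above (size in GB, HE+ pass@1 in %).
COHORT = [
    Tier("IQ3_M", 9.82, 91.46),
    Tier("CD-IQ3_M_L", 10.19, 91.46),
    Tier("CD-IQ3_M_h", 11.15, 92.07),
    Tier("CD-Q4_K_M_L", 13.00, 93.29),
    Tier("CD-Q4_K_M_h", 13.29, 92.07),
]

for tier in COHORT:
    print(tier.name, "dominated by", dominators(tier, COHORT) or "nothing")
```

On this subset the check flags CD-IQ3_M_L and CD-Q4_K_M_h, matching the removal and non-publication decisions described above.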
How to Use
With llama.cpp:
llama-server -m gemma-4-A4B-98e-v5-coder-it-Q4_K_M.gguf -c 8192 -ngl 99
With ollama (requires Modelfile or HF direct load).
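Once the llama-server command above is running, it can be queried over its OpenAI-compatible chat endpoint. The following is a minimal standard-library sketch, assuming the server listens on its default port 8080 on localhost; the helper names `build_chat_request` and `chat` are illustrative and not part of llama.cpp.

```python
import json
import urllib.request


def build_chat_request(prompt, base_url="http://localhost:8080", max_tokens=512):
    """Build a POST request for the /v1/chat/completions endpoint (greedy decoding)."""
    payload = {
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0,
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(prompt, **kwargs):
    """Send the prompt and return the assistant text (requires a running server)."""
    with urllib.request.urlopen(build_chat_request(prompt, **kwargs)) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]


# Example call, commented out because it needs a live server:
# print(chat("Write a Python function that checks whether a string is a palindrome."))
```

If the server was started on another port (for example with `--port 8099`), pass a matching `base_url`.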
Original Model Card
Gemma 4 A4B 98-Expert v5-coder (20.8B) — code-leaning prune
A research checkpoint that takes 98e v4 and replaces its drop map with C6 layer-relevance-weighted v4-floor breadth=50 — a recipe that protects code/math experts more tightly per-layer than v4's multi-class CD-max. No shared FFN scaling. Same 98e shape, same router, same attention, same norms.
Quantized formats
| Format | Repo | Notes |
|---|---|---|
| GGUF (llama.cpp / ollama) | ManniX-ITA/gemma-4-A4B-98e-v5-coder-it-GGUF | Full Bartowski tier sweep (Q2_K → Q8_0, IQ2-IQ4) + 5 ContribDynamic CD-* per-layer quants. F16 baseline included. |
| NVFP4A16 (vLLM) | ManniX-ITA/gemma-4-A4B-98e-v5-coder-NVFP4A16 | ~13 GB, native vLLM, produced via modelopt==0.43.0. |
| Ollama | mannix/gemma4-98e-v5-coder | Same GGUF tier sweep, ready for ollama pull. |
| | 98e v4 | 98e v5-coder (this model) |
|---|---|---|
| Total params | 20.8B | ~20.8B |
| Experts per layer | 98 (30 dropped) | 98 (30 dropped) |
| Drop map | multi-class CD-map (max), p16 | C6 layer-relevance-weighted v4-floor, breadth=50 |
| Shared FFN α | 1.0 | 1.0 (none) |
Eval status: complete (9/9). ARC-Challenge was rescored 2026-05-18 on stack-pinned solidpc (stock vLLM 0.20.2 + Fix-A patched lm-eval) → 95.31 % ±0.62 pp, retiring the prior ⚠ from the silent-empty Fix-A pathology. The original 12.37 % came from a run in which 87.6 % of responses had content="", because lm-eval's stock openai_completions.parse_generations didn't fall back to reasoning_content.
Scoreboard — NVFP4A16, vLLM, greedy
NVFP4A16 quant via nvidia-modelopt 0.43.0, served via vLLM 0.20.2 with --reasoning-parser gemma4, enable_thinking=true, thinking_token_budget=12288. Sampler: greedy (T=0, top_p=1, top_k=0, do_sample=false) — the canonical Gemma 4 9-bench recipe.
| Bench (n) | 128e ref | 98e v4 | 98e v5-coder | Δ (v5-coder − v4) |
|---|---|---|---|---|
| ARC-Challenge-chat (1172) | 95.99% | 95.99% | 95.31% | −0.68 |
| GPQA Diamond flex (198) | 73.23% | 69.19% | 68.69% | −0.50 |
| GSM8K-100 flex | 91.00% | 86.00% | 86.00% | 0.00 |
| MATH-500-100 math_verify | 89.00% | 89.00% | 92.00% | +3.00 |
| AIME 2024 (30) | 36.67% | 36.67% | 36.67% | 0.00 |
| IFEval-100 (prompt_strict) | 95.00% | 93.00% | 94.00% | +1.00 |
| HumanEval-164 chat | 96.95% | 96.95% | 98.17% | +1.22 |
| HumanEval+-164 chat | 92.07% | 91.46% | 92.68% | +1.22 |
| LCB-medium-55 v4 | 87.27% | 78.18% | 85.45% (47/55) | +7.27 |
Reading the deltas: v5-coder is a deliberate code-leaning rewrite of v4's drop ranking. The C6 drop map protects per-layer code-relevance signal harder than v4's CD-max aggregation does — that shows up cleanly as +1.22 / +1.22 / +7.27 on the three code benches (HE / HE+ / LCB-medium), with MATH-500 also recovering +3.00pp (math-on-text is correlated with code reasoning more than v4's drop assumed). Reasoning and general-knowledge benches are essentially flat: GPQA −0.50pp, GSM8K 0.00, AIME 0.00, IFEval +1.00. The big win is LCB-medium +7.27pp — that's well outside the ±2pp single-run noise floor on a 55-problem bench and matches the recipe's design intent (preserve code-specialist experts at the cost of nothing).
ARC's prior −83.6pp gap (12.37% vs 95.99%) was not a v5-coder regression — it was the silent-empty Fix-A bug on the unpatched pod that ran it. 87.6% of the 1,172 ARC samples came back with empty content because vLLM 0.20.2 + Gemma 4 + reasoning-parser routes the answer to reasoning_content when the closing channel token isn't seen, and lm-eval's stock parse_generations reads content only. The model itself was fine; the eval harness wasn't patched. Stack-pinned rescore 2026-05-18 landed at 95.31 % — exactly inside the predicted 95–97 % band and within stderr of 128e (95.99 %) and v4 (95.99 %).
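The ±0.62 pp on the ARC rescore and the granularity of a 55-problem bench can both be reproduced with simple arithmetic. The sketch below assumes the quoted error bar is a plain binomial standard error, which the card does not state explicitly; the function names are illustrative.

```python
import math


def binomial_stderr_pp(accuracy_pct: float, n: int) -> float:
    """Binomial standard error of an accuracy, in percentage points."""
    p = accuracy_pct / 100.0
    return 100.0 * math.sqrt(p * (1.0 - p) / n)


def per_item_step_pp(n: int) -> float:
    """How many percentage points a single problem is worth on an n-item bench."""
    return 100.0 / n


print(f"ARC-Challenge stderr: {binomial_stderr_pp(95.31, 1172):.2f} pp")
print(f"LCB-medium-55 step per problem: {per_item_step_pp(55):.2f} pp")
print(f"LCB delta in problems: {7.27 / per_item_step_pp(55):.1f}")
```

Read this way, the +7.27pp LCB-medium delta corresponds to four additional solved problems (47/55 versus 43/55).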
HumanEval / HumanEval+ sanity audit
98.17% / 92.68% sits at the top of the 14–22B band (see lazy comparison below), so the samples files were re-audited to rule out a scoring artifact. Audit script: scripts/audit_v5coder_he.py.
| Bench | n | score | empty | fenced | chars p10/p50/p90 | verbatim-canonical-in-gen |
|---|---|---|---|---|---|---|
| HumanEval-164 | 164 | 0.9817 | 1 | 163 | 270 / 642 / 1324 | 3.0% (5/164) |
| HumanEval+-164 | 164 | 0.9268 | 3 | 161 | 270 / 620 / 1244 | 1.8% (3/164) |
- Fences are stripped correctly. 163/164 (HE) and 161/164 (HE+) outputs are wrapped in ```python chat fences. lm-eval's chat-aware HE/HE+ scorer (built on the humaneval_chat shadow task; see [v4 card §Eval Caveat](https://huggingface.co/ManniX-ITA/gemma-4-A4B-98e-v4-it)) extracts the function body before exec(). If fences were leaking through, pass@1 would collapse to ~0 (per [feedback_gemma4_chat_only_completions_breaks.md](https://github.com/mann1x/omnimergekit/blob/main/memory/feedback_gemma4_chat_only_completions_breaks.md)); 98.17% is only possible with correct stripping.
- Empty-response rate is normal. 1/164 HE and 3/164 HE+ are blank, which is within Gemma 4 reasoning-mode noise and not the 87.6%-empty pathology that hit ARC.
- No catastrophic contamination. Only 3.0% (HE) and 1.8% (HE+) of generations contain the canonical solution as a verbatim substring. A model that had memorized HE from pretraining would show 30%+; the few verbatim matches here are short structurally-inevitable solutions (e.g. has_close_elements O(n²) double-loop).
- HE → HE+ delta is healthy. −5.49pp drop across the +/− boundary. HE+ adds adversarial test cases that catch brittle solutions which pass the public tests but fail edge cases. A 0pp drop would actually be a memorization red flag; ~−5pp is the expected band for a strong-but-not-memorized model.
- Failures look real. The 3 HE failures are 1 empty (doc_id 122) plus 2 wrong-logic attempts (doc_id 140 fix_spaces, doc_id 145 order_by_points). The 12 HE+ failures are mostly "passes basic tests, fails edge cases", exactly the regime HE+ exists to expose.
Conclusion: 98.17 / 92.68 is real. Not a scoring artifact, not memorization, not silent-empty.
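The audit columns above (empty, fenced, verbatim-canonical) can be approximated with a short script. This is not scripts/audit_v5coder_he.py, whose contents are not shown here; it is a hedged sketch that assumes each sample is a dict with hypothetical keys `generation` and `canonical`.

```python
import re

FENCE = "`" * 3  # built programmatically so this example contains no literal fence
FENCE_RE = re.compile(FENCE + r"(?:python)?[ \t]*\n(.*?)" + FENCE, re.DOTALL)


def extract_code(generation: str) -> str:
    """Return the body of the first fenced block, or the raw text if none is found."""
    match = FENCE_RE.search(generation)
    return match.group(1) if match else generation


def audit(samples: list) -> dict:
    """Count empty, fenced, and verbatim-canonical generations."""
    empty = sum(1 for s in samples if not s["generation"].strip())
    fenced = sum(1 for s in samples if FENCE_RE.search(s["generation"]))
    verbatim = sum(
        1
        for s in samples
        if s["generation"].strip()
        and s["canonical"].strip()
        and s["canonical"].strip() in s["generation"]
    )
    return {"n": len(samples), "empty": empty, "fenced": fenced, "verbatim": verbatim}
```

A substring test is a crude contamination signal: short canonical solutions can match by coincidence, which is the caveat the audit notes for has_close_elements.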
Lazy comparison vs the 14–22B coder field
For sense-of-scale on whether 98.17 is anomalous. All numbers are official model-card / paper / blog (linked).
Coder-specialized 14–22B:
| Model | Params | HE | HE+ | LCB (version) | Source |
|---|---|---|---|---|---|
| 98e v5-coder (this) | 20.8B / 4B MoE | 98.17 | 92.68 | 85.45 (LCB-medium-55 v4) | this card |
| Qwen2.5-Coder-14B-Instruct | 14.7B dense | 89.6 | 87.2 | 23.4 (LCB 07/24–11/24, pre-v4) | arXiv:2409.12186 |
| DeepSeek-Coder-V2-Lite-Instruct | 16B / 2.4B MoE | 81.1 | — | 24.3 (LCB 12/01–06/01) | arXiv:2406.11931 |
| Codestral-22B v1 | 22B dense | 81.1 | — | (not published) | Mistral blog |
| IBM Granite-20B-Code-Instruct | 20B dense | 60.4 | — | — | arXiv:2405.04324 |
Generalist 14–22B (notable code scores):
| Model | Params | HE | MATH | GPQA-D | IFEval | Source |
|---|---|---|---|---|---|---|
| 98e v5-coder (this) | 20.8B / 4B MoE | 98.17 | 92.00 (MATH-500) | 68.69 | 94.00 | this card |
| Phi-4 | 14B dense | 82.6 | 80.4 (MATH) | 56.1 | 63.0 | arXiv:2412.08905 |
| Qwen2.5-14B-Instruct | 14.7B dense | 81.7–86.2 | 73.0 (MATH) | 40.9 | 80.0 | Qwen blog |
| Mistral-Small-3 (24B, just above band) | 24B dense | ~84 | 70.6 (MATH) | 45.3 | 82.1 | Mistral blog |
Where v5-coder sits:
- HE / HE+: top of the band, +8.57pp (HE) and +5.48pp (HE+) above Qwen2.5-Coder-14B's 89.6 / 87.2 (the published field leader). The audit above rules out scoring artifacts; the gap is real on this run.
- LCB: not apples-to-apples with Qwen2.5-Coder or DS-Coder-V2-Lite. Those numbers are full LCB on pre-v4 problem windows (LCB-2024.07–11 and LCB-2024.12–06.01 respectively). v5-coder's 85.45% is LCB-medium-55 on v4 problems — a different subset and a different problem set. A fair comparison would require running Qwen2.5-Coder-14B on the same LCB-medium-55 v4 split, which nobody has published. Don't read +60pp into the LCB column.
- MATH-500 92.00 / GSM8K 86 / AIME 36.67: top of the band for math-on-text reasoning. Phi-4's 80.4 MATH is the closest generalist; v5-coder beats it by ~12pp. AIME 36.67 is currently the only published 14–22B AIME score in this comparison set (Qwen2.5-Coder and Codestral don't evaluate AIME).
- GPQA-Diamond 68.69 / IFEval 94.00: GPQA is materially above Phi-4 (56.1) and Qwen2.5-14B (40.9). IFEval 94 is well ahead of Mistral-Small-3 (82.1) and Phi-4 (63.0); instruction-following is Phi-4's known weakness.
Caveats on this comparison: different labs use different system prompts, different temperature/top_p, different "chat vs base" framings, different sampling counts. v5-coder is run greedy (T=0); some published numbers (e.g. Phi-4) use multi-sample averaging. Within-card deltas (v4 vs v5-coder) are the cleanest signal; cross-card deltas are noisy by ±2-5pp.
Same-Stack GGUF HE+ Sweep — v5-coder vs Qwen2.5-Coder-14B-Instruct
Head-to-head HumanEval+ (164-question, chat-aware shadow task) on identical hardware (single RTX 3090 24 GB) and identical eval recipe (llama-server -c 32768 -ngl 99 --parallel 2 --jinja --reasoning off, omk_eval llama backend, lm-eval humaneval_plus_chat, greedy T=0, max_gen_toks=16384). Qwen GGUFs are bartowski's Qwen2.5-Coder-14B-Instruct-GGUF.
The "Lazy comparison" table above uses paper-reported numbers; this section is what the same rig and same scorer actually measure.
v5-coder (20.8B total / 4B-active MoE) — plain k-quant full sweep (2026-05-20, 15 tiers)
| Tier | File size | bpw | HE+ pass@1 |
|---|---|---|---|
| Q2_K (removed) | 8.40 GB | 3.23 | 6.10% (collapse; see removed tiers) |
| Q3_K_S | 9.68 GB | 3.72 | 76.83% |
| Q3_K_M | 10.51 GB | 4.04 | 84.15% |
| Q3_K_XL | 10.69 GB | 4.11 | 84.15% |
| Q3_K_L | 10.94 GB | 4.21 | 90.24% |
| Q4_0 | 11.42 GB | 4.39 | 90.85% |
| Q4_K_S | 12.21 GB | 4.70 | 93.29% ⭐ |
| Q4_1 | 12.61 GB | 4.85 | 92.07% |
| Q4_K_M | 13.24 GB | 5.09 | 92.07% |
| Q4_K_L | 13.42 GB | 5.16 | 92.07% |
| Q5_K_S | 14.19 GB | 5.46 | 92.07% |
| Q5_K_M | 15.07 GB | 5.80 | 91.46% |
| Q5_K_L | 15.25 GB | 5.87 | 93.29% ⭐ |
| Q6_K | 17.81 GB | 6.85 | 92.68% |
| Q6_K_L | 17.98 GB | 6.92 | 92.68% |
| Q8_0 | 21.16 GB | 8.14 | 93.90% ⭐⭐ |
Three reads from the plain sweep.
- Q4_K_S is the new plain sweet spot. At 12.21 GB / 4.70 bpw it scores 93.29% — +1.22pp over Q4_K_M at -1.03 GB, and ties the cohort practical top. This is the recommended plain quant for v5-coder. Q4_K_M is still fine (92.07%) but Q4_K_S is strictly better.
- Q8_0 is the cohort top at 93.90%: +0.61pp over the 93.29% tier, with a -2.7pp gap to F16. Only pick it if you have the disk; it's not a recommendation.
- Q5_K_L (15.25 GB / 93.29%) and Q3_K_L (10.94 GB / 90.24%) are the L-band winners. Both beat the M/S counterparts at their byte size by clean margins. Q3_K_XL is a 0.18 GB heavier Q3_K_M and scores the same, so skip it.
The big gotcha: Q3_K_S collapses (76.83%) where Q3_K_M is fine (84.15%) — the bpw floor for plain k-quants on this MoE chassis is ~4 bpw. Anything sub-4 bpw needs the IQ codebook (see next table) or the CD-mix _L family above.
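The bpw column in the sweep table is consistent with dividing the file size in bits by the 20.8B total parameter count. The card does not state how bpw was computed, so treat the formula below as an inference that happens to reproduce the tabulated values; `bits_per_weight` is an illustrative helper name.

```python
TOTAL_PARAMS = 20.8e9  # total parameter count stated on this card


def bits_per_weight(file_size_gb: float, total_params: float = TOTAL_PARAMS) -> float:
    """Average bits per weight for a file of file_size_gb decimal gigabytes."""
    return file_size_gb * 1e9 * 8 / total_params


for tier, size_gb in [("Q3_K_S", 9.68), ("Q3_K_M", 10.51), ("Q4_K_S", 12.21), ("Q8_0", 21.16)]:
    print(f"{tier}: {bits_per_weight(size_gb):.2f} bpw")
```

Under this formula the ~4 bpw floor for plain k-quants corresponds to a file of roughly 10.4 GB for this model.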
v5-coder low-bpw IQ tier sweep (T64, complete 2026-05-19)
Sub-4-bpw plain k-quants collapse on this MoE chassis (Q2_K at 3.23 bpw is the demonstration). The i-matrix codebook IQ-quants behave very differently — they survive much further down. Sweep ran on the same rig + recipe as the plain quants above:
| Tier | File size | bpw | HE+ pass@1 |
|---|---|---|---|
| IQ2_XS | 7.77 GB | 2.99 | 64.63% (removed) |
| IQ2_S | 7.83 GB | 3.01 | 85.37% |
| IQ2_M | 8.22 GB | 3.16 | 86.59% |
| IQ3_XXS | 8.95 GB | 3.44 | 89.63% |
| IQ3_XS | 9.22 GB | 3.55 | 89.63% |
| IQ3_M | 9.82 GB | 3.78 | 91.46% |
| IQ4_XS | 11.01 GB | 4.23 | 93.29% ⭐ |
| IQ4_NL | 11.42 GB | 4.39 | 92.68% |
Four reads from the IQ sweep:
- IQ4_XS is the new sub-12 GB top. At 11.01 GB / 4.23 bpw it scores 93.29% — tying the cohort practical top with the smallest disk footprint at that score band. Beats plain Q4_K_M (13.24 GB / 92.07%) by +1.22pp at -2.23 GB. If you have 11 GB of disk, this is the pick.
- IQ2_S is the 2-bit cliff edge. At 7.83 GB / 3.01 bpw it scores 85.37% — only +0.06 GB / +0.02 bpw above IQ2_XS (7.77 GB / 64.63%, removed) but +20.74pp higher. Right next door, IQ2_XXS (7.37 GB / 2.81 bpw, also removed) rumination-locks the server. The codebook quality threshold for v5-coder on HumanEval+ sits between 2.99 and 3.01 bpw.
- IQ codebook ≫ K-quant body at 3-bit. IQ3_M (3.78 bpw) at 91.46% beats Q3_K_L (4.21 bpw) at 90.24% by +1.22pp while running 0.43 bpw lower. Q3_K_S at 3.72 bpw collapses to 76.83% — 14.6pp below IQ3_M at near-identical bpw. The imatrix codebook protects the value-vector subspace through the sub-4-bpw zone where fixed-block k-quants lose attention precision.
- IQ3_XXS at 8.95 GB / 89.63% is the size winner in the sub-9.5 GB band — strict improvement over plain Q3_K_M (10.51 GB / 84.15%) on every axis (-1.6 GB, -0.6 bpw, +5.48pp). IQ3_XS and IQ3_XXS tie at 89.63% — pick IQ3_XXS for the smaller bytes.
Qwen2.5-Coder-14B-Instruct (14.7B dense) — bartowski quants
| Tier | File size | bpw | HE+ pass@1 |
|---|---|---|---|
| IQ4_XS | 8.12 GB | 4.42 | 84.76% |
| Q4_0 | 8.54 GB | 4.65 | 84.15% |
| Q4_K_M | 8.99 GB | 4.89 | 85.37% |
| Q5_K_M | 10.51 GB | 5.72 | 83.54% |
| Q6_K | 12.12 GB | 6.60 | 84.76% |
| Q8_0 | 15.70 GB | 8.54 | 84.76% |
Qwen sits at 83–85% across the whole tier ladder. The paper-reported 87.2 HE+ is ~2pp above what bartowski's GGUFs deliver on this stack — a known llama-server chat-template vs vLLM-temp=0 quirk, not a quant defect.
Head-to-head by file size (v5-coder runs lower bpw at the same disk)
Pairing by tier name is misleading here: v5-coder is a 20.8B-total MoE and Qwen is a 14.7B dense, so the same tier name maps to different file sizes. The fair comparison is iso-disk: at a given GB budget, which model wins HE+? At every band where Qwen has a tier, v5-coder uses roughly 1.2–2.7 bpw less than Qwen and still scores higher.
| Disk band | Qwen2.5-Coder-14B (size / bpw / HE+) | v5-coder best (size / bpw / HE+) | Δ HE+ |
|---|---|---|---|
| ~21 GB | (none — Qwen ceiling is Q8_0 15.7 GB) | Q8_0 21.16 GB / 8.14 / 93.90% ⭐⭐ | new top |
| ~15 GB | Q8_0 15.70 GB / 8.54 / 84.76% | Q5_K_L 15.25 GB / 5.87 / 93.29% | +8.53 |
| ~15 GB | Q8_0 15.70 GB / 8.54 / 84.76% | CD-Q6_K (full) 15.44 GB / 5.94 / 92.68% | +7.92 |
| ~14 GB | (gap) | Q5_K_S 14.19 GB / 5.46 / 92.07% | — |
| ~13 GB | Q6_K 12.12 GB / 6.60 / 84.76% | CD-Q4_K_M_L 13.00 GB / 5.00 / 93.29% | +8.53 |
| ~13 GB | Q6_K 12.12 GB / 6.60 / 84.76% | CD-Q5_K_M (full) 13.01 GB / 5.00 / 92.68% | +7.92 |
| ~12 GB | Q6_K 12.12 GB / 6.60 / 84.76% | Q4_K_S 12.21 GB / 4.70 / 93.29% ⭐ | +8.53 |
| ~11 GB | Q5_K_M 10.51 GB / 5.72 / 83.54% | IQ4_XS 11.01 GB / 4.23 / 93.29% ⭐ | +9.75 |
| ~10.5 GB | Q5_K_M 10.51 GB / 5.72 / 83.54% | Q3_K_L 10.94 GB / 4.21 / 90.24% | +6.70 |
| ~10 GB | Q5_K_M 10.51 GB / 5.72 / 83.54% | IQ3_M 9.82 GB / 3.78 / 91.46% | +7.92 |
| ~10 GB | Q5_K_M 10.51 GB / 5.72 / 83.54% | CD-IQ4_K_M (Canary W) 10.29 GB / 3.96 / 92.07% | +8.53 |
| ~9.7 GB | Q4_K_M 8.99 GB / 4.89 / 85.37% | CD-IQ3_XS_h 9.72 GB / 3.74 / 90.85% | +5.48 |
| ~9.3 GB | Q4_K_M 8.99 GB / 4.89 / 85.37% | CD-IQ3_XS_L 9.27 GB / 3.57 / 90.24% | +4.87 |
| ~9.2 GB | Q4_K_M 8.99 GB / 4.89 / 85.37% | IQ3_XS 9.22 GB / 3.55 / 89.63% | +4.26 |
| ~9 GB | Q4_K_M 8.99 GB / 4.89 / 85.37% | IQ3_XXS 8.95 GB / 3.44 / 89.63% | +4.26 |
| ~8 GB | IQ4_XS 8.12 GB / 4.42 / 84.76% | IQ2_S 7.83 GB / 3.01 / 85.37% | +0.61 |
Four things to read from this table.
- The cohort top is now Q8_0 at 93.90%, but the disk cost (21.16 GB) makes it impractical for most. The practical cohort top is a four-way tie at 93.29%: Q4_K_S (12.21 GB), IQ4_XS (11.01 GB), Q5_K_L (15.25 GB), and CD-Q4_K_M_L (13.00 GB). Pick by disk budget: 11 GB → IQ4_XS, 12 GB → Q4_K_S, 13 GB → CD-Q4_K_M_L, 15 GB → Q5_K_L.
- CD-IQ4_K_M (Canary W) is still the sub-11 GB disk-efficiency winner. 10.29 GB / 92.07%: -22% disk vs plain Q4_K_M (13.24 GB) and only -1.22pp vs the 93.29% band. Below that, plain IQ3_M at 9.82 GB / 91.46%, CD-IQ3_XS_h at 9.72 GB / 90.85%, and CD-IQ3_XS_L at 9.27 GB / 90.24% extend the ladder into a band where Qwen has no comparable tier.
- v5-coder now competes below 8 GB. IQ2_S at 7.83 GB / 3.01 bpw scores 85.37%, narrowly beating Qwen IQ4_XS (8.12 GB / 4.42 bpw / 84.76%) by +0.61pp at -0.29 GB and -1.41 bpw. The previous "v5-coder floor is 8.95 GB" hole is closed; IQ2_S is the smallest publishable code-grade quant in the cohort. Plain Q2_K (8.40 GB) still collapses to 6.10%: the 2-bit cliff is an IQ-codebook phenomenon, not a tier-bytes one.
- vs Qwen the picture is unchanged but sharpened. At every disk band v5-coder uses roughly 1.2–2.7 bpw less than the matching Qwen tier and still wins by +0.61 to +9.75pp. The 11 GB band (v5-coder IQ4_XS vs Qwen Q5_K_M) is the new biggest gap: +9.75pp at -1.49 bpw.
Pure tier-name matching (Qwen Q4_K_M vs v5-coder Q4_K_M etc.) would put v5-coder ~4 GB larger at every tier and ~+7pp ahead. That comparison is symmetric but unfair to Qwen's smaller footprint. The iso-disk view above is the one to plan VRAM around.
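For VRAM or disk planning, the iso-disk idea can be turned into a small lookup. The sketch below is a stricter variant than the pairing used in the table: for each model it picks the highest-scoring file that fits the budget, breaking ties in favour of the smaller file. The dictionaries hold a subset of the (size GB, HE+ %) pairs from this card; `best_under_budget` is a hypothetical helper.

```python
V5_CODER = {
    "IQ2_S": (7.83, 85.37),
    "IQ3_XXS": (8.95, 89.63),
    "IQ3_M": (9.82, 91.46),
    "IQ4_XS": (11.01, 93.29),
    "Q4_K_S": (12.21, 93.29),
    "Q5_K_L": (15.25, 93.29),
    "Q8_0": (21.16, 93.90),
}
QWEN = {
    "IQ4_XS": (8.12, 84.76),
    "Q4_0": (8.54, 84.15),
    "Q4_K_M": (8.99, 85.37),
    "Q5_K_M": (10.51, 83.54),
    "Q6_K": (12.12, 84.76),
    "Q8_0": (15.70, 84.76),
}


def best_under_budget(tiers: dict, budget_gb: float):
    """Return (name, size_gb, he_plus) of the best-scoring tier that fits, or None."""
    fitting = [
        (score, -size, name)
        for name, (size, score) in tiers.items()
        if size <= budget_gb
    ]
    if not fitting:
        return None
    score, neg_size, name = max(fitting)  # highest score, then smallest file
    return name, -neg_size, score


for budget in (8.0, 11.5, 16.0):
    print(budget, best_under_budget(V5_CODER, budget), best_under_budget(QWEN, budget))
```

Because Qwen's scores are nearly flat across tiers, this variant often selects Qwen Q4_K_M even at larger budgets, so its deltas can be slightly smaller than the nearest-size pairing in the table.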
CD recipes, both the full-override CD-* family and the new CD-mix _L family, are open-source. Generator scripts live in omnimergekit/scripts/generate_cd_maps.py (full override) and generate_cd_maps_mix.py (mix family, 13 recipes including _hybrid). Per-layer importance source: v5coder_layer_importance_v3_imix.json (neuron×0.5 + imatrix×0.5).
Run logs and samples live under eval_results_hep_sweep/humanevalplus_full/ and pod_archive/ in the project tree.
What changed vs v4 (mechanical detail)
Identical surgery flow to v4 with one substitution — a different drop map.
- Same base: google/gemma-4-26B-A4B-it (128e).
- Same drop count: 30 experts per layer (98 retained).
- Same protect_top=16 shield.
- Different ranking signal: instead of score[layer][expert] = max over normalized classes (wnorm·α + tc) (v4), v5-coder scores each expert by a layer-relevance-weighted floor against v4's keep set, with breadth=50 controlling how many top-relevance experts get the floor lift before the bottom-30 cutoff is taken. The recipe scripts live in omnimergekit (T25 / T28 / T30 / C6 series; see feedback_* memory for the ablation history).
- Same downstream: slice expert tensors, resize MoE router proj.weight from [128, hidden] → [98, hidden], update config.json: num_experts=98, GGUF conversion + quant pipeline unchanged.
No shared FFN scaling (verified: layers.0.mlp.down_proj.weight is byte-for-byte identical to v4 in BF16).
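The downstream step in the list above (drop the lowest-ranked experts, then slice the router rows to match) can be sketched on toy sizes. This is not scripts/expert_drop.py: it ignores protect_top and the C6 scoring entirely, uses plain Python lists instead of tensors, and the helper names are hypothetical. The real surgery goes from 128 to 98 experts per layer; the toy example goes from 8 to 6.

```python
def select_kept_experts(scores: list, n_drop: int) -> list:
    """Indices of experts kept after dropping the n_drop lowest-scoring ones."""
    ranked = sorted(range(len(scores)), key=lambda e: scores[e])
    dropped = set(ranked[:n_drop])
    return [e for e in range(len(scores)) if e not in dropped]


def slice_rows(weight: list, kept: list) -> list:
    """Keep only the rows for retained experts, e.g. [8, hidden] -> [6, hidden]."""
    return [weight[e] for e in kept]


# Toy layer: 8 experts, hidden size 3, drop the 2 lowest-scoring experts.
scores = [0.9, 0.1, 0.5, 0.7, 0.05, 0.8, 0.6, 0.4]
router = [[float(e)] * 3 for e in range(8)]  # row e is filled with the value e

kept = select_kept_experts(scores, n_drop=2)
new_router = slice_rows(router, kept)
print(kept, len(new_router), len(new_router[0]))
```

The same kept-index list would have to be applied to every per-expert tensor in the layer so that router rows and expert weights stay aligned.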
When to pick which 98e variant
| Variant | Lean | Pick when |
|---|---|---|
| v3 | pooled TF (no class signal) | reference baseline; you want the original v3 numbers |
| v4 | balanced (5-class CD-map) | general-purpose; first 98e you'd default to |
| v5 | v4 + shared FFN α=1.2 | when you want v4 with a louder expert-mixture residual (research checkpoint) |
| v5-coder (this) | code-leaning C6 floor | HumanEval / HumanEval+ / LCB / MultiPL-E workloads; +7pp on LCB-medium |
Usage
Transformers
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
model = AutoModelForCausalLM.from_pretrained(
"ManniX-ITA/gemma-4-A4B-98e-v5-coder-it",
torch_dtype=torch.bfloat16,
device_map="auto",
attn_implementation="eager", # Gemma 4 head_dim=512 — FA2 not supported
)
tok = AutoTokenizer.from_pretrained("ManniX-ITA/gemma-4-A4B-98e-v5-coder-it")
msgs = [{"role": "user", "content": "Write a Python function that reverses a binary tree in-place."}]
inputs = tok.apply_chat_template(msgs, return_tensors="pt", return_dict=True, add_generation_prompt=True).to(model.device)
out = model.generate(**inputs, max_new_tokens=2048, do_sample=False) # greedy, canonical recipe
print(tok.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
vLLM (NVFP4A16, canonical eval recipe)
python -m vllm.entrypoints.openai.api_server \
--model ManniX-ITA/gemma-4-A4B-98e-v5-coder-NVFP4A16 \
--served-model-name 98e_v5_coder_nvfp4a16 \
--port 8099 \
--gpu-memory-utilization 0.55 \
--max-model-len 32768 \
--max-num-batched-tokens 8192 \
--dtype bfloat16 \
--trust-remote-code \
--reasoning-parser gemma4 \
--default-chat-template-kwargs '{"enable_thinking": true}'
llama.cpp (GGUF)
llama-server -m gemma-4-A4B-98e-v5-coder-it-Q6_K.gguf \
--port 8099 -c 32768 -ngl 99 --no-warmup \
--jinja --reasoning-format deepseek --reasoning-budget 12288 \
--temp 0 --top-p 1 --top-k 0
GGUF quants: ManniX-ITA/gemma-4-A4B-98e-v5-coder-it-GGUF — full Bartowski tier sweep (Q2_K → Q8_0, IQ2-IQ4) plus 5 ContribDynamic CD-* per-layer quants (CD-Q2_K, CD-Q3_K_L, CD-Q4_K_M, CD-Q5_K_M, CD-Q6_K). File naming: gemma-4-A4B-98e-v5-coder-it-<TIER>.gguf. (CD-Q3 currently on HF as CD-Q3_K_M.gguf; rename pending — see top of card.)
ollama
ollama pull mannix/gemma4-98e-v5-coder:Q6_K
ollama run mannix/gemma4-98e-v5-coder:Q6_K
Available tiers: mannix/gemma4-98e-v5-coder — same set as the GGUF repo (Q2_K … Q8_0, IQ2_* … IQ4_*, CD-Q2_K … CD-Q6_K). Modelfile uses Gemma 4 tool/parser template (matches mannix/gemma4-98e-v4 convention).
Related Models
| Model | Description |
|---|---|
| gemma-4-A4B-98e-v5-coder-NVFP4A16 | NVFP4A16 quant (~13 GB, vLLM-ready) |
| gemma-4-A4B-98e-v5-coder-it-GGUF | GGUF tier sweep + CD per-layer quants (llama.cpp / ollama) |
| mannix/gemma4-98e-v5-coder (Ollama) | Ollama-published version of the GGUF tier sweep |
| gemma-4-A4B-98e-v4-it | The apples-to-apples baseline for this model |
| gemma-4-A4B-98e-v5-it | Sibling: v4 + shared FFN α=1.2 |
| gemma-4-A4B-98e-v3-it | Earlier baseline (pooled TF map) |
Recipe + Code
OmniMergeKit is the canonical home. The relevant artifacts for this model:
- scripts/v5coder_C6_v4floor_perlayer_breadth50_drop_map.json: the drop map (embedded in expert_drop_metadata.json).
- scripts/expert_drop.py: drop applier (unchanged across v3/v4/v5/v5-coder).
- eval/EVAL_PROTOCOL.md: locked greedy methodology for the 9-bench suite, including the mandatory Fix-A patch for lm-eval's openai_completions.parse_generations (without it, ARC and other chat-completions benches silent-empty under Gemma 4 + reasoning-parser).
Eval Caveat — Fix-A is mandatory
This model was evaluated on a pod whose lm-eval install was missing the Fix-A reasoning_content fallback patch in openai_completions.parse_generations. Under vLLM 0.20.2 + --reasoning-parser gemma4, Gemma 4 emits the answer to the message's reasoning_content field and leaves content="" whenever the parser doesn't see the closing channel token. Without Fix-A, lm-eval reads only content and scores those responses as empty (= wrong on multiple-choice tasks). On ARC-Challenge this produced 1027 empty / 1172 total → 12.37% pass. On the other 8 benches, the silent-empty rate stayed below 10% (because the prompt templates land in a regime where the model emits a content phase reliably), so their scores are within the canonical band.
The lesson is captured permanently in omnimergekit/eval/EVAL_PROTOCOL.md and the canonical pod bootstrap (pod_setup_eval_envs.sh) auto-applies Fix-A — every new eval pod now starts in the patched state.
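The idea behind Fix-A can be sketched as a fallback when reading an OpenAI-style chat response. This is not the actual lm-eval patch, whose code is not reproduced on this card; it is a minimal illustration that assumes the response follows the usual `choices[i].message` layout with an optional `reasoning_content` field, and the function names are hypothetical.

```python
def extract_answer(message: dict) -> str:
    """Prefer 'content'; fall back to 'reasoning_content' when content is empty."""
    content = (message.get("content") or "").strip()
    if content:
        return content
    return (message.get("reasoning_content") or "").strip()


def parse_generations_with_fallback(response: dict) -> list:
    """Extract one answer string per choice from a chat-completions response."""
    return [
        extract_answer(choice.get("message", {}))
        for choice in response.get("choices", [])
    ]


# A response shaped like the silent-empty case described above.
silent_empty = {"choices": [{"message": {"content": "", "reasoning_content": "The answer is B"}}]}
print(parse_generations_with_fallback(silent_empty))
```

A reader that only looks at `content` would score this response as empty, which is the failure mode behind the 12.37% ARC figure.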
License
This model inherits the Gemma license from the base model.
Acknowledgements
- Google for the base Gemma 4 26B-A4B-it model
- The OmniMergeKit project for the surgery + eval toolkit
- The vLLM and modelopt teams for the NVFP4A16 serving / quantization pipeline
- bartowski for the calibration data v5 used in imatrix GGUF quantization
Model tree for ManniX-ITA/gemma-4-A4B-98e-v5-coder-it-GGUF
Base model
google/gemma-4-26B-A4B