gemma-4-A4B-98e-v5-coder-NVFP4A16
NVFP4-A16 quantization (4-bit FP4 weights, BF16 activations) of ManniX-ITA/gemma-4-A4B-98e-v5-coder-it — the code-leaning 98e MoE variant of Gemma 4 26B-A4B.
13 GB on disk, runs on a single 24 GB GPU at 32 k context, native vLLM serving.
For the full design rationale, drop-map recipe, audit data, and 14–22B cohort comparison, see the BF16 base card. This card focuses on the NVFP4A16 quant itself and how to serve it.
Quant recipe
- Quantizer:
omnimergekit/scripts/quantize_any.py --method nvfp4a16 - Engine:
nvidia-modelopt==0.43.0withNVFP4_DEFAULT_CFG(the pinned working version —0.44.0has two Gemma 4 regressions, seememory/feedback_modelopt_pin_0_43.md) - Calibration: 128 samples from
tatsu-lab/alpaca, max_length=512 - Excluded modules:
*vision_tower*,*embed_vision*,*embed_audio*(kept BF16 for vLLM'sgemma4_mmloader) - Producer hardware: vast.ai RTX 6000 Ada 48 GB pod, stack-pinned to vLLM 0.20.2 stock + Fix-A lm-eval patch
- Producer stack: transformers 5.5.0, torch 2.10.0+cu128, safetensors 0.7.0
- Date: 2026-05-17
Eval — canonical 9-bench cohort (NVFP4A16, vLLM 0.20.2 stock, greedy)
Greedy sampler (temperature=0, top_p=1, top_k=0, do_sample=False), thinking_token_budget=12288, max_gen_toks=16384. Bench definitions: omnimergekit/eval/templates/. 128e / v4 numbers reuse the published canonical baseline from the v5-it card.
| Bench (n) | 128e ref | 98e v4 | 98e v5-coder (this model) | 31B-it (dense) | Δ (v5-coder − v4) |
|---|---|---|---|---|---|
| ARC-Challenge chat (1172) | 95.99 % | 95.99 % | 95.31 % | 98.04 % | -0.68 |
| GPQA Diamond (198) | 73.23 % | 69.19 % | 68.69 % | 81.31 % | −0.50 |
| GSM8K-100 | 91.00 % | 86.00 % | 86.00 % | 93.00 % | 0.00 |
| MATH-500-100 | 89.00 % | 89.00 % | 92.00 % | 97.00 % | +3.00 |
| AIME 2024 (30) | 36.67 % | 36.67 % | 36.67 % | 76.67 % | 0.00 |
| IFEval-100 (prompt_strict) | 95.00 % | 93.00 % | 94.00 % | 96.00 % | +1.00 |
| HumanEval-164 chat | 96.95 % | 96.95 % | 98.17 % | 97.56 % | +1.22 |
| HumanEval+ chat (164) | 92.07 % | 91.46 % | 92.68 % | 92.07 % | +1.22 |
| LCB-medium-55 (v4 split) | 87.27 % | 78.18 % | 85.45 % (47/55) | 96.36 % | +7.27 |
Bold = best in 4-column cohort. ARC was rescored 2026-05-18 on stack-pinned solidpc (stock vLLM 0.20.2 + Fix-A patched lm-eval) and landed at 95.31 % ±0.62 pp, in-range with 128e (95.99 %) and v4 (95.99 %) and within stderr of v5 (96.59 %). The original ⚠ on this row was the pre-Fix-A silent-empty bug (87.6 % of responses had empty content; lm-eval's stock parse_generations didn't fall back to reasoning_content). See v5-coder bf16 card §audit for the per-bench taint analysis.
Key takeaways:
- Code frontier at 14-22B: v5-coder beats every published 14-22B coder on HumanEval / HumanEval+ at this run (Qwen2.5-Coder-14B 89.6 / 87.2, DeepSeek-Coder-V2-Lite 81.1 / —, Codestral-22B 81.1 / —). Audit confirmed real, not memorized, not silent-empty.
- +7.27 pp on LCB-medium vs v4 — the largest delta in the cohort, exactly where the C6 drop map was designed to help.
- Math gains too: MATH-500 +3 pp vs v4, GSM8K flat. The C6 layer-relevance signal correlates with code reasoning more than v4's drop assumed.
- Reasoning/general benches flat: GPQA −0.5, AIME 0.0, IFEval +1 — the recipe is a deliberate code rewrite of v4, not a generalist upgrade.
Usage with vLLM
python -m vllm.entrypoints.openai.api_server \
--model ManniX-ITA/gemma-4-A4B-98e-v5-coder-NVFP4A16 \
--served-model-name 98e_v5_coder_nvfp4a16 \
--port 8099 \
--gpu-memory-utilization 0.55 \
--max-model-len 32768 \
--max-num-batched-tokens 8192 \
--dtype bfloat16 \
--trust-remote-code \
--reasoning-parser gemma4 \
--default-chat-template-kwargs '{"enable_thinking": true}'
Per-request:
{"thinking_token_budget": 12288, "temperature": 0.0, "top_p": 1.0, "top_k": 0}
Gemma 4 MoE gotchas (baked into omnimergekit/scripts/pod_eval_31b.sh):
--max-num-batched-tokens 8192is mandatory (Gemma 4 MM encoder budget; default 2048 < max_tokens_per_mm_item 2496 → boot crash).--max-model-len ≥ 32768is recommended for the canonical bench recipe (templates ask formax_gen_toks=49152; if you hit prompt+output > max-model-len, vLLM returns HTTP 400). Use 65536 if you have headroom.preprocessor_config.jsonmust exist in the model dir — present in this repo.
VRAM
| GPU | Runs? | Memory (peak) | Notes |
|---|---|---|---|
| RTX 3090 24 GB | ✅ | ~13 GB | Single-GPU, 32 k ctx; throughput ~22 tok/s (Marlin, weight-only FP4) |
| RTX 4090 24 GB | ✅ | ~13 GB | Same; faster than 3090 |
| RTX 6000 Ada 48 GB | ✅ | ~13 GB | Single-GPU comfortable, room for 65 k ctx |
| L40 / L40S 48 GB | ✅ | ~13 GB | Sustained ~240 tok/s on canonical benches |
| A100 / H100 80 GB | ✅ | ~13 GB | Native FP4 on H100 = much higher throughput |
| 2× RTX 3090 / 4090 | ✅ | ~7 GB / GPU | TP=2, longer ctx |
Related models
| Model | Description |
|---|---|
| gemma-4-A4B-98e-v5-coder-it | BF16 base — full design notes, audit, 14-22B cohort comparison |
| gemma-4-A4B-98e-v5-coder-it-GGUF | GGUF tier sweep (Q2_K → Q8_0, IQ2-IQ4) + 5 ContribDynamic CD-* per-layer quants |
| gemma-4-A4B-98e-v5-it | Sibling v5 (v4 + α=1.2 shared FFN, generalist) |
| gemma-4-A4B-98e-v5-NVFP4A16 | Sibling v5 NVFP4A16 |
| gemma-4-A4B-98e-v4-it | Parent (multi-class CD-map drop, no α) |
| gemma-4-A4B-98e-v4-NVFP4A16 | v4 NVFP4A16 |
| Gemma-4-31B-it-NVFP4A16 | Dense 31B baseline for cohort comparisons |
Ollama
ollama pull mannix/gemma4-98e-v5-coder
The Ollama repo at mannix/gemma4-98e-v5-coder carries the GGUF tier sweep — same recipe, llama.cpp serving, no vLLM dependency.
Source
- BF16 base:
ManniX-ITA/gemma-4-A4B-98e-v5-coder-it - Quant script:
omnimergekit/scripts/quantize_any.py - Eval orchestrator:
omnimergekit/scripts/eval_suite_vllm.sh+eval/templates/ - Eval protocol:
omnimergekit/eval/EVAL_PROTOCOL.md(v2, vLLM-first, sampler-pinned greedy, stack-pinned per cohort)
License
Inherits the Gemma license from the base model.
Acknowledgements
- Google for the base Gemma 4 26B-A4B-it model
- OmniMergeKit for the surgery + eval toolkit
- The vLLM and NVIDIA ModelOpt teams for the NVFP4A16 serving / quantization pipeline
- Downloads last month
- 47
Model tree for ManniX-ITA/gemma-4-A4B-98e-v5-coder-NVFP4A16
Base model
google/gemma-4-26B-A4B