gemma-4-A4B-98e-v5-coder-NVFP4A16

NVFP4-A16 quantization (4-bit FP4 weights, BF16 activations) of ManniX-ITA/gemma-4-A4B-98e-v5-coder-it — the code-leaning 98e MoE variant of Gemma 4 26B-A4B.

13 GB on disk, runs on a single 24 GB GPU at 32 k context, native vLLM serving.

For the full design rationale, drop-map recipe, audit data, and 14–22B cohort comparison, see the BF16 base card. This card focuses on the NVFP4A16 quant itself and how to serve it.

Quant recipe

Quantizer: omnimergekit/scripts/quantize_any.py --method nvfp4a16
Engine: nvidia-modelopt==0.43.0 with NVFP4_DEFAULT_CFG (the pinned working version — 0.44.0 has two Gemma 4 regressions, see memory/feedback_modelopt_pin_0_43.md)
Calibration: 128 samples from tatsu-lab/alpaca, max_length=512
Excluded modules: *vision_tower*, *embed_vision*, *embed_audio* (kept BF16 for vLLM's gemma4_mm loader)
Producer hardware: vast.ai RTX 6000 Ada 48 GB pod, stack-pinned to vLLM 0.20.2 stock + Fix-A lm-eval patch
Producer stack: transformers 5.5.0, torch 2.10.0+cu128, safetensors 0.7.0
Date: 2026-05-17

Eval — canonical 9-bench cohort (NVFP4A16, vLLM 0.20.2 stock, greedy)

Greedy sampler (temperature=0, top_p=1, top_k=0, do_sample=False), thinking_token_budget=12288, max_gen_toks=16384. Bench definitions: omnimergekit/eval/templates/. 128e / v4 numbers reuse the published canonical baseline from the v5-it card.

Bench (n)	128e ref	98e v4	98e v5-coder (this model)	31B-it (dense)	Δ (v5-coder − v4)
ARC-Challenge chat (1172)	95.99 %	95.99 %	95.31 %	98.04 %	-0.68
GPQA Diamond (198)	73.23 %	69.19 %	68.69 %	81.31 %	−0.50
GSM8K-100	91.00 %	86.00 %	86.00 %	93.00 %	0.00
MATH-500-100	89.00 %	89.00 %	92.00 %	97.00 %	+3.00
AIME 2024 (30)	36.67 %	36.67 %	36.67 %	76.67 %	0.00
IFEval-100 (prompt_strict)	95.00 %	93.00 %	94.00 %	96.00 %	+1.00
HumanEval-164 chat	96.95 %	96.95 %	98.17 %	97.56 %	+1.22
HumanEval+ chat (164)	92.07 %	91.46 %	92.68 %	92.07 %	+1.22
LCB-medium-55 (v4 split)	87.27 %	78.18 %	85.45 % (47/55)	96.36 %	+7.27

Bold = best in 4-column cohort. ARC was rescored 2026-05-18 on stack-pinned solidpc (stock vLLM 0.20.2 + Fix-A patched lm-eval) and landed at 95.31 % ±0.62 pp, in-range with 128e (95.99 %) and v4 (95.99 %) and within stderr of v5 (96.59 %). The original ⚠ on this row was the pre-Fix-A silent-empty bug (87.6 % of responses had empty content; lm-eval's stock parse_generations didn't fall back to reasoning_content). See v5-coder bf16 card §audit for the per-bench taint analysis.

Key takeaways:

Code frontier at 14-22B: v5-coder beats every published 14-22B coder on HumanEval / HumanEval+ at this run (Qwen2.5-Coder-14B 89.6 / 87.2, DeepSeek-Coder-V2-Lite 81.1 / —, Codestral-22B 81.1 / —). Audit confirmed real, not memorized, not silent-empty.
+7.27 pp on LCB-medium vs v4 — the largest delta in the cohort, exactly where the C6 drop map was designed to help.
Math gains too: MATH-500 +3 pp vs v4, GSM8K flat. The C6 layer-relevance signal correlates with code reasoning more than v4's drop assumed.
Reasoning/general benches flat: GPQA −0.5, AIME 0.0, IFEval +1 — the recipe is a deliberate code rewrite of v4, not a generalist upgrade.

Usage with vLLM

python -m vllm.entrypoints.openai.api_server \
    --model ManniX-ITA/gemma-4-A4B-98e-v5-coder-NVFP4A16 \
    --served-model-name 98e_v5_coder_nvfp4a16 \
    --port 8099 \
    --gpu-memory-utilization 0.55 \
    --max-model-len 32768 \
    --max-num-batched-tokens 8192 \
    --dtype bfloat16 \
    --trust-remote-code \
    --reasoning-parser gemma4 \
    --default-chat-template-kwargs '{"enable_thinking": true}'

Per-request:

{"thinking_token_budget": 12288, "temperature": 0.0, "top_p": 1.0, "top_k": 0}

Gemma 4 MoE gotchas (baked into omnimergekit/scripts/pod_eval_31b.sh):

--max-num-batched-tokens 8192 is mandatory (Gemma 4 MM encoder budget; default 2048 < max_tokens_per_mm_item 2496 → boot crash).
--max-model-len ≥ 32768 is recommended for the canonical bench recipe (templates ask for max_gen_toks=49152; if you hit prompt+output > max-model-len, vLLM returns HTTP 400). Use 65536 if you have headroom.
preprocessor_config.json must exist in the model dir — present in this repo.

VRAM

GPU	Runs?	Memory (peak)	Notes
RTX 3090 24 GB	✅	~13 GB	Single-GPU, 32 k ctx; throughput ~22 tok/s (Marlin, weight-only FP4)
RTX 4090 24 GB	✅	~13 GB	Same; faster than 3090
RTX 6000 Ada 48 GB	✅	~13 GB	Single-GPU comfortable, room for 65 k ctx
L40 / L40S 48 GB	✅	~13 GB	Sustained ~240 tok/s on canonical benches
A100 / H100 80 GB	✅	~13 GB	Native FP4 on H100 = much higher throughput
2× RTX 3090 / 4090	✅	~7 GB / GPU	TP=2, longer ctx

Related models

Model	Description
gemma-4-A4B-98e-v5-coder-it	BF16 base — full design notes, audit, 14-22B cohort comparison
gemma-4-A4B-98e-v5-coder-it-GGUF	GGUF tier sweep (Q2_K → Q8_0, IQ2-IQ4) + 5 ContribDynamic CD-* per-layer quants
gemma-4-A4B-98e-v5-it	Sibling v5 (v4 + α=1.2 shared FFN, generalist)
gemma-4-A4B-98e-v5-NVFP4A16	Sibling v5 NVFP4A16
gemma-4-A4B-98e-v4-it	Parent (multi-class CD-map drop, no α)
gemma-4-A4B-98e-v4-NVFP4A16	v4 NVFP4A16
Gemma-4-31B-it-NVFP4A16	Dense 31B baseline for cohort comparisons

Ollama

ollama pull mannix/gemma4-98e-v5-coder

The Ollama repo at mannix/gemma4-98e-v5-coder carries the GGUF tier sweep — same recipe, llama.cpp serving, no vLLM dependency.

Source

BF16 base: ManniX-ITA/gemma-4-A4B-98e-v5-coder-it
Quant script: omnimergekit/scripts/quantize_any.py
Eval orchestrator: omnimergekit/scripts/eval_suite_vllm.sh + eval/templates/
Eval protocol: omnimergekit/eval/EVAL_PROTOCOL.md (v2, vLLM-first, sampler-pinned greedy, stack-pinned per cohort)

License

Inherits the Gemma license from the base model.

Acknowledgements

Google for the base Gemma 4 26B-A4B-it model
OmniMergeKit for the surgery + eval toolkit
The vLLM and NVIDIA ModelOpt teams for the NVFP4A16 serving / quantization pipeline

Downloads last month: 47

Safetensors

Model size

11B params

Tensor type

BF16

F8_E4M3

Model tree for ManniX-ITA/gemma-4-A4B-98e-v5-coder-NVFP4A16

Base model

google/gemma-4-26B-A4B

Finetuned

google/gemma-4-26B-A4B-it

Finetuned

ManniX-ITA/gemma-4-A4B-98e-v4-it

Finetuned

ManniX-ITA/gemma-4-A4B-98e-v5-coder-it

Quantized

(2)

this model