gemma-4-A4B-98e-v5-coder-NVFP4A16

NVFP4-A16 quantization (4-bit FP4 weights, BF16 activations) of ManniX-ITA/gemma-4-A4B-98e-v5-coder-it — the code-leaning 98e MoE variant of Gemma 4 26B-A4B.

13 GB on disk, runs on a single 24 GB GPU at 32 k context, native vLLM serving.

For the full design rationale, drop-map recipe, audit data, and 14–22B cohort comparison, see the BF16 base card. This card focuses on the NVFP4A16 quant itself and how to serve it.

Quant recipe

  • Quantizer: omnimergekit/scripts/quantize_any.py --method nvfp4a16
  • Engine: nvidia-modelopt==0.43.0 with NVFP4_DEFAULT_CFG (the pinned working version — 0.44.0 has two Gemma 4 regressions, see memory/feedback_modelopt_pin_0_43.md)
  • Calibration: 128 samples from tatsu-lab/alpaca, max_length=512
  • Excluded modules: *vision_tower*, *embed_vision*, *embed_audio* (kept BF16 for vLLM's gemma4_mm loader)
  • Producer hardware: vast.ai RTX 6000 Ada 48 GB pod, stack-pinned to vLLM 0.20.2 stock + Fix-A lm-eval patch
  • Producer stack: transformers 5.5.0, torch 2.10.0+cu128, safetensors 0.7.0
  • Date: 2026-05-17

Eval — canonical 9-bench cohort (NVFP4A16, vLLM 0.20.2 stock, greedy)

Greedy sampler (temperature=0, top_p=1, top_k=0, do_sample=False), thinking_token_budget=12288, max_gen_toks=16384. Bench definitions: omnimergekit/eval/templates/. 128e / v4 numbers reuse the published canonical baseline from the v5-it card.

Bench (n) 128e ref 98e v4 98e v5-coder (this model) 31B-it (dense) Δ (v5-coder − v4)
ARC-Challenge chat (1172) 95.99 % 95.99 % 95.31 % 98.04 % -0.68
GPQA Diamond (198) 73.23 % 69.19 % 68.69 % 81.31 % −0.50
GSM8K-100 91.00 % 86.00 % 86.00 % 93.00 % 0.00
MATH-500-100 89.00 % 89.00 % 92.00 % 97.00 % +3.00
AIME 2024 (30) 36.67 % 36.67 % 36.67 % 76.67 % 0.00
IFEval-100 (prompt_strict) 95.00 % 93.00 % 94.00 % 96.00 % +1.00
HumanEval-164 chat 96.95 % 96.95 % 98.17 % 97.56 % +1.22
HumanEval+ chat (164) 92.07 % 91.46 % 92.68 % 92.07 % +1.22
LCB-medium-55 (v4 split) 87.27 % 78.18 % 85.45 % (47/55) 96.36 % +7.27

Bold = best in 4-column cohort. ARC was rescored 2026-05-18 on stack-pinned solidpc (stock vLLM 0.20.2 + Fix-A patched lm-eval) and landed at 95.31 % ±0.62 pp, in-range with 128e (95.99 %) and v4 (95.99 %) and within stderr of v5 (96.59 %). The original ⚠ on this row was the pre-Fix-A silent-empty bug (87.6 % of responses had empty content; lm-eval's stock parse_generations didn't fall back to reasoning_content). See v5-coder bf16 card §audit for the per-bench taint analysis.

Key takeaways:

  • Code frontier at 14-22B: v5-coder beats every published 14-22B coder on HumanEval / HumanEval+ at this run (Qwen2.5-Coder-14B 89.6 / 87.2, DeepSeek-Coder-V2-Lite 81.1 / —, Codestral-22B 81.1 / —). Audit confirmed real, not memorized, not silent-empty.
  • +7.27 pp on LCB-medium vs v4 — the largest delta in the cohort, exactly where the C6 drop map was designed to help.
  • Math gains too: MATH-500 +3 pp vs v4, GSM8K flat. The C6 layer-relevance signal correlates with code reasoning more than v4's drop assumed.
  • Reasoning/general benches flat: GPQA −0.5, AIME 0.0, IFEval +1 — the recipe is a deliberate code rewrite of v4, not a generalist upgrade.

Usage with vLLM

python -m vllm.entrypoints.openai.api_server \
    --model ManniX-ITA/gemma-4-A4B-98e-v5-coder-NVFP4A16 \
    --served-model-name 98e_v5_coder_nvfp4a16 \
    --port 8099 \
    --gpu-memory-utilization 0.55 \
    --max-model-len 32768 \
    --max-num-batched-tokens 8192 \
    --dtype bfloat16 \
    --trust-remote-code \
    --reasoning-parser gemma4 \
    --default-chat-template-kwargs '{"enable_thinking": true}'

Per-request:

{"thinking_token_budget": 12288, "temperature": 0.0, "top_p": 1.0, "top_k": 0}

Gemma 4 MoE gotchas (baked into omnimergekit/scripts/pod_eval_31b.sh):

  • --max-num-batched-tokens 8192 is mandatory (Gemma 4 MM encoder budget; default 2048 < max_tokens_per_mm_item 2496 → boot crash).
  • --max-model-len ≥ 32768 is recommended for the canonical bench recipe (templates ask for max_gen_toks=49152; if you hit prompt+output > max-model-len, vLLM returns HTTP 400). Use 65536 if you have headroom.
  • preprocessor_config.json must exist in the model dir — present in this repo.

VRAM

GPU Runs? Memory (peak) Notes
RTX 3090 24 GB ~13 GB Single-GPU, 32 k ctx; throughput ~22 tok/s (Marlin, weight-only FP4)
RTX 4090 24 GB ~13 GB Same; faster than 3090
RTX 6000 Ada 48 GB ~13 GB Single-GPU comfortable, room for 65 k ctx
L40 / L40S 48 GB ~13 GB Sustained ~240 tok/s on canonical benches
A100 / H100 80 GB ~13 GB Native FP4 on H100 = much higher throughput
2× RTX 3090 / 4090 ~7 GB / GPU TP=2, longer ctx

Related models

Model Description
gemma-4-A4B-98e-v5-coder-it BF16 base — full design notes, audit, 14-22B cohort comparison
gemma-4-A4B-98e-v5-coder-it-GGUF GGUF tier sweep (Q2_K → Q8_0, IQ2-IQ4) + 5 ContribDynamic CD-* per-layer quants
gemma-4-A4B-98e-v5-it Sibling v5 (v4 + α=1.2 shared FFN, generalist)
gemma-4-A4B-98e-v5-NVFP4A16 Sibling v5 NVFP4A16
gemma-4-A4B-98e-v4-it Parent (multi-class CD-map drop, no α)
gemma-4-A4B-98e-v4-NVFP4A16 v4 NVFP4A16
Gemma-4-31B-it-NVFP4A16 Dense 31B baseline for cohort comparisons

Ollama

ollama pull mannix/gemma4-98e-v5-coder

The Ollama repo at mannix/gemma4-98e-v5-coder carries the GGUF tier sweep — same recipe, llama.cpp serving, no vLLM dependency.

Source

License

Inherits the Gemma license from the base model.

Acknowledgements

  • Google for the base Gemma 4 26B-A4B-it model
  • OmniMergeKit for the surgery + eval toolkit
  • The vLLM and NVIDIA ModelOpt teams for the NVFP4A16 serving / quantization pipeline
Downloads last month
47
Safetensors
Model size
11B params
Tensor type
BF16
·
F8_E4M3
·
U8
·
Inference Providers NEW
This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Model tree for ManniX-ITA/gemma-4-A4B-98e-v5-coder-NVFP4A16