gemma-4-A4B-109e-v3-it-GGUF

GGUF quantizations of ManniX-ITA/gemma-4-A4B-109e-v3-it.

All quants made using imatrix with calibration data v5.

Available Quantizations

Full precision

| Filename | Quant | Size |
| --- | --- | --- |
| gemma-4-A4B-109e-v3-F16.gguf | F16 | 40.72 GB |

Standard bartowski-style quants (recommended for most users)

| Filename | Quant | Size | Notes |
| --- | --- | --- | --- |
| gemma-4-A4B-109e-v3-Q8_0.gguf | Q8_0 | 21.65 GB | near-lossless, large |
| gemma-4-A4B-109e-v3-Q6_K_L.gguf | Q6_K_L | 18.40 GB | Q6_K with Q8_0 embed/output |
| gemma-4-A4B-109e-v3-Q6_K.gguf | Q6_K | 18.23 GB | very high quality |
| gemma-4-A4B-109e-v3-Q5_K_L.gguf | Q5_K_L | 15.59 GB | Q5_K_M with Q8_0 embed/output |
| gemma-4-A4B-109e-v3-Q5_K_M.gguf | Q5_K_M | 15.42 GB | high quality |
| gemma-4-A4B-109e-v3-Q5_K_S.gguf | Q5_K_S | 14.51 GB | high quality |
| gemma-4-A4B-109e-v3-Q4_K_L.gguf | Q4_K_L | 13.71 GB | Q4_K_M with Q8_0 embed/output |
| gemma-4-A4B-109e-v3-Q4_K_M.gguf | Q4_K_M | 13.54 GB | recommended for most hardware |
| gemma-4-A4B-109e-v3-Q4_K_S.gguf | Q4_K_S | 12.48 GB | slightly worse than Q4_K_M, a bit smaller |
| gemma-4-A4B-109e-v3-Q4_1.gguf | Q4_1 | 12.89 GB | legacy 4-bit |
| gemma-4-A4B-109e-v3-Q4_0.gguf | Q4_0 | 11.67 GB | legacy 4-bit |
| gemma-4-A4B-109e-v3-IQ4_NL.gguf | IQ4_NL | 11.67 GB | i-quant, similar to Q4_0 but better |
| gemma-4-A4B-109e-v3-IQ4_XS.gguf | IQ4_XS | 11.25 GB | i-quant, smaller than Q4_K_S |
| gemma-4-A4B-109e-v3-Q3_K_XL.gguf | Q3_K_XL | 10.90 GB | Q3_K_L with Q8_0 embed/output |
| gemma-4-A4B-109e-v3-Q3_K_L.gguf | Q3_K_L | 11.18 GB | OK quality at 3-bit |
| gemma-4-A4B-109e-v3-Q3_K_M.gguf | Q3_K_M | 10.74 GB | lower quality |
| gemma-4-A4B-109e-v3-Q3_K_S.gguf | Q3_K_S | 9.88 GB | low quality, not recommended |
| gemma-4-A4B-109e-v3-IQ3_M.gguf | IQ3_M | 10.03 GB | i-quant, surprisingly good at 3-bit |
| gemma-4-A4B-109e-v3-IQ3_XS.gguf | IQ3_XS | 9.41 GB | i-quant, aggressive |
| gemma-4-A4B-109e-v3-IQ3_XXS.gguf | IQ3_XXS | 9.14 GB | i-quant, very aggressive |
| gemma-4-A4B-109e-v3-IQ2_M.gguf | IQ2_M | 8.39 GB | i-quant, lossy but usable |
| gemma-4-A4B-109e-v3-IQ2_S.gguf | IQ2_S | 7.99 GB | i-quant, low quality |
| gemma-4-A4B-109e-v3-IQ2_XS.gguf | IQ2_XS | 7.94 GB | i-quant, very low quality |
| gemma-4-A4B-109e-v3-IQ2_XXS.gguf | IQ2_XXS | 7.52 GB | i-quant, smallest, expect quality loss |

ContribDynamic (CD) experimental quants

Per-layer dynamic quantization driven by our clean fp32 expert-contribution analysis: tensors in high-importance layers (L0, L1-6, L10) get higher precision; low-importance layers (L11-29) get lower precision. Inspired by Unsloth's Dynamic (UD) approach but using our own profiling data. See the CD block below for per-layer details.

| Filename | Quant | Size | Notes |
| --- | --- | --- | --- |
| gemma-4-A4B-109e-v3-CD-Q6_K.gguf | CD-Q6_K | 15.80 GB | hybrid Q6_K / Q5_K / Q4_K per layer; same footprint as Q5_K_L, Q6_K quality in the layers that matter |
| gemma-4-A4B-109e-v3-CD-Q5_K_M.gguf | CD-Q5_K_M | 13.51 GB | hybrid Q5_K / Q4_K / Q3_K per layer |
| gemma-4-A4B-109e-v3-CD-Q4_K_M.gguf | CD-Q4_K_M | 11.08 GB | hybrid Q4_K / Q3_K per layer |
| gemma-4-A4B-109e-v3-CD-Q3_K_M.gguf | CD-Q3_K_M | 10.37 GB | hybrid Q3_K / Q2_K per layer |

CD-Q2_K was attempted but not shipped — the low-tier IQ2_S assignment in its tensor-type map requires --imatrix at quantize time, and the initial build pipeline missed that guard. The bug is now fixed in scripts/quantize_gguf.py for future builds.
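
For reference, a minimal sketch of the CD assignment logic and the guard that was missing. The layer groups follow the importance profile described above, but the function names, the tensor-name patterns, and the constants are illustrative rather than the actual scripts/quantize_gguf.py code.

# Illustrative CD assignment: high-importance layers keep a higher-precision type,
# the rest drop a tier. Names and patterns are hypothetical, not the real script.
IMPORTANT_LAYERS = {0, 1, 2, 3, 4, 5, 6, 10}   # high-contribution layers from the fp32 profile
IQ_TYPES = {"IQ2_XXS", "IQ2_XS", "IQ2_S", "IQ2_M", "IQ3_XXS"}  # types that require an imatrix

def build_tensor_type_map(n_layers=30, high="Q6_K", low="Q4_K"):
    """Assign a quant type to each layer's expert tensors (hypothetical tensor-name pattern)."""
    return {
        f"blk.{i}.ffn_*_exps.weight": (high if i in IMPORTANT_LAYERS else low)
        for i in range(n_layers)
    }

def check_imatrix_guard(type_map, imatrix_path):
    """The guard the CD-Q2_K build missed: imatrix-only types cannot be quantized without --imatrix."""
    needed = sorted({t for t in type_map.values() if t in IQ_TYPES})
    if needed and not imatrix_path:
        raise SystemExit(f"tensor types {needed} require --imatrix at quantize time")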

How to Use

With llama.cpp:

llama-server -m gemma-4-A4B-109e-v3-Q4_K_M.gguf -c 8192 -ngl 99 \
    --reasoning-format deepseek --reasoning-budget 16384

Important: always pass --reasoning-format deepseek --reasoning-budget 16384 when serving Gemma 4. Without the budget, chemistry-heavy prompts trigger unbounded chain-of-thought that crashes the server. See the inference settings section below.
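
For a quick check, here is a minimal client-side sketch against the server launched above (assuming llama-server's default port 8080; adjust the URL if you pass --port). With --reasoning-format deepseek the chain-of-thought comes back in the message's reasoning_content field instead of being mixed into content.

import requests

payload = {
    "model": "gemma-4-A4B-109e-v3",
    "messages": [{"role": "user", "content": "Balance C3H8 + O2 -> CO2 + H2O and explain."}],
    "temperature": 1.0,
    "top_p": 0.95,
}
r = requests.post("http://localhost:8080/v1/chat/completions", json=payload, timeout=600)
msg = r.json()["choices"][0]["message"]
print("thinking:", msg.get("reasoning_content", ""))  # thinking, capped by --reasoning-budget
print("answer:", msg["content"])                      # final answer only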

With ollama:

ollama run hf.co/ManniX-ITA/gemma-4-A4B-109e-v3-it-GGUF:CD-Q6_K

This repo ships sidecar files Ollama reads on pull:

  • template: the Gemma 4 native tool-call format, <|tool_call>call:NAME{...}<tool_call|> with <|"|>...<|"|> string delimiters.
  • params: repeat_penalty 1.15 (stops duplicate-block loops), stop sequences <turn|> / <|tool_response>, temperature 0.6, top_p 0.95, num_ctx 131072.

For tool-use / function-calling, Ollama's built-in RENDERER gemma4 / PARSER gemma4 directives are required — these aren't auto-applied to HF pulls. Wrap the pulled model once:

ollama pull hf.co/ManniX-ITA/gemma-4-A4B-109e-v3-it-GGUF:CD-Q6_K
cat <<'EOF' | ollama create gemma4-109e-it -f -
FROM hf.co/ManniX-ITA/gemma-4-A4B-109e-v3-it-GGUF:CD-Q6_K
RENDERER gemma4
PARSER gemma4
EOF
ollama run gemma4-109e-it

After wrapping, ollama show gemma4-109e-it reports capabilities [completion, tools, thinking] and tool calls are parsed into the structured tool_calls response field instead of appearing as raw tokens in content.

Sampler note: without repeat_penalty >= 1.1, Gemma 4 tool-use can loop and emit hundreds of duplicate <tool_call> blocks per response. The params file sets 1.15 — keep it.
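
For reference, a small round trip with the wrapped model through the ollama Python client. The get_weather tool and its schema are made-up placeholders; the tool_calls handling matches the behaviour described above.

import ollama

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",                      # made-up example tool
        "description": "Get the current weather for a city",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

resp = ollama.chat(
    model="gemma4-109e-it",
    messages=[{"role": "user", "content": "What's the weather in Rome right now?"}],
    tools=tools,
)

# With RENDERER/PARSER gemma4 in place, calls arrive structured instead of as raw tokens.
for call in resp.message.tool_calls or []:
    print(call.function.name, call.function.arguments)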


Original Model Card

Gemma 4 A4B 109-Expert v3 (22.4B)

Expert-pruned version of google/gemma-4-26B-A4B-it, reduced from 128 to 109 experts per MoE layer using a clean fp32 teacher-force analysis on GPQA Diamond.

| | Original (128e) | This model (109e v3) | Delta |
| --- | --- | --- | --- |
| Total params | 26B | 22.4B | -14% |
| Experts per layer | 128 | 109 | -19/layer |
| Top-k routing | 8 | 8 | |
| GGUF Q6_K size | 18 GB | 18 GB | |
| bf16 size | 50 GB | 42 GB | -16% |
| GPQA Diamond (Q6_K, lm-eval + patches) | 81.82% | 80.30% | −1.52 pp |

The v3 clean teacher-force drop map costs only 1.52 percentage points vs the full 128-expert reference at Q6_K.

Why v3?

Earlier 109e / 109e-v2 releases were built from a drop map that turned out to have ~43% wrong expert selections, caused by an overflow bug in the bf16 .norm() reduction inside the teacher-force analysis script. The bug produced NaN/inf entries in the per-expert norms of layers 11-29 and corrupted the ranking used to pick which experts to drop.

v3 fixes it:

  • .float().norm() for the bf16→fp32 reduction (eliminates overflow)
  • NaN guard on hidden states (skip upstream-NaN tokens)
  • 4096-token truncation on calibration inputs
  • attn_implementation="eager" (Gemma 4 head_dim=512 is not supported by FlashAttention 2)
  • Analysis re-run over the full GPQA Diamond 196-question set on the 128e reference, producing a new clean drop map (teacher_force_109e_p16_clean.json).

The v3 clean drop map differs on roughly 43% of expert selections compared to v2 — it is effectively a different model. And at Q6_K it beats v2 by +1.51 pp on GPQA Diamond (80.30% vs 78.79%).
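
The core of the fix is the reduction dtype. A minimal illustration of the difference (the hook plumbing is omitted; the real logic lives in scripts/teacher_force_analysis.py):

import torch

def expert_contribution(routing_weight, expert_output):
    """v3: ||routing_weight * expert_output||_2 with the reduction done in fp32."""
    contrib = routing_weight * expert_output   # bf16 on GPU
    return contrib.float().norm(p=2)           # cast before the norm, no overflow

def expert_contribution_buggy(routing_weight, expert_output):
    """v2 behaviour: the same norm reduced in bf16 could overflow and poison the ranking."""
    return (routing_weight * expert_output).norm(p=2)

def token_is_usable(hidden_states):
    """NaN guard from the v3 fix list: skip tokens whose upstream activations are already non-finite."""
    return bool(torch.isfinite(hidden_states).all())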

Pruning Method

Teacher-Force Expert Analysis

The pruning decision is based on measuring the actual contribution of each expert during teacher-forced inference on GPQA Diamond prompts using the full 128-expert reference model as the teacher.

Process (scripts/teacher_force_analysis.py):

  1. Teacher passes: For each of 196 GPQA Diamond questions, run the 128e reference model through the complete prompt + correct CoT + answer sequence in a single forward pass with teacher forcing.
  2. Per-expert norms: At every MoE layer, hook the experts module and recompute each activated expert's output ||routing_weight · expert_output||_2 on GPU, aggregated in fp32.
  3. Per-question top-16 protection: For each question independently, rank experts per layer by weighted norm and mark the top 16 as "protected for this question".
  4. Aggregate across questions: Union the per-question top-16 sets across all 196 GPQA questions. The top 109 experts per layer by coverage are kept; the bottom 19 are dropped.

This gives a drop map specifically optimized for the GPQA Diamond task domain while remaining deterministic and reproducible.
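
A compact sketch of steps 3 and 4 above; the data layout and names are illustrative, and the real aggregation is scripts/generate_drop_map.py.

from collections import Counter

TOP_PROTECT = 16   # experts protected per question per layer (step 3)
KEEP = 109         # experts kept per layer after aggregation (step 4)

def protected_experts(norms_per_expert):
    """Step 3: top-16 experts for one question at one layer, ranked by weighted output norm."""
    ranked = sorted(norms_per_expert, key=norms_per_expert.get, reverse=True)
    return set(ranked[:TOP_PROTECT])

def drop_map_for_layer(per_question_norms, n_experts=128):
    """Step 4: count how many questions protect each expert, keep the 109 best-covered, drop the rest."""
    coverage = Counter()
    for norms in per_question_norms:          # one {expert_id: norm} dict per GPQA question
        coverage.update(protected_experts(norms))
    keep = {expert for expert, _ in coverage.most_common(KEEP)}
    return sorted(e for e in range(n_experts) if e not in keep)   # the 19 experts to drop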

Key Findings (clean fp32 analysis)

  • Experts are NOT topic-specialized: confirmed across domains.
  • The bf16 bug mattered: ~2% of per-expert norm entries in layers 11-29 were NaN or inf in the corrupted analysis, dragging those experts to artificially extreme ranks. Fixing to fp32 changes 43% of the "protected top-16 per question" decisions.
  • Expert weight similarity is near zero: cosine similarity between expert weight matrices maxes at ~0.05 — merging experts by averaging destroys the model. Expert dropping (what we do here) is the only viable structural compression.
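
A small sketch of the similarity check behind the last bullet: flatten each expert's weight matrix and compare pairwise in fp32 (names are illustrative).

import torch
import torch.nn.functional as F

def max_pairwise_cosine(expert_weights):
    """Maximum off-diagonal cosine similarity between flattened expert weight matrices."""
    flat = torch.stack([w.float().flatten() for w in expert_weights])
    flat = F.normalize(flat, dim=-1)
    sims = flat @ flat.T
    sims.fill_diagonal_(float("-inf"))   # ignore self-similarity
    return sims.max().item()             # ~0.05 for this model's experts

# Near-zero similarity means an averaged "merged" expert is almost orthogonal to both
# originals, which is why expert dropping is used instead of merging.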

Pruning Decision

Uniform 109 experts per layer (19 dropped per layer), based on the clean teacher-force aggregate ranking. The router's proj.weight is resized from [128, hidden] to [109, hidden], keeping only the rows corresponding to retained experts. The router adapts naturally: removed experts simply become unavailable, and the top-8 selection falls through to the next-best available expert. No fine-tuning is needed.

The drop map used to build this model is deterministic and stored in expert_drop_metadata.json.
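
In tensor terms the prune is a row slice. A sketch assuming a per-layer router nn.Linear with weight [num_experts, hidden] and an expert ModuleList; module names are assumptions, and the actual implementation is scripts/expert_drop.py.

import torch
import torch.nn as nn

def prune_moe_layer(router, experts, keep):
    """Keep only the experts in `keep`; the router weight goes from [128, hidden] to [109, hidden]."""
    keep = sorted(keep)
    new_router = nn.Linear(router.in_features, len(keep), bias=router.bias is not None)
    with torch.no_grad():
        new_router.weight.copy_(router.weight[keep])          # retained rows only
        if router.bias is not None:
            new_router.bias.copy_(router.bias[keep])
    new_experts = nn.ModuleList(experts[i] for i in keep)     # retained expert modules
    return new_router, new_experts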

GPQA Diamond Evaluation

Setup

All variants are evaluated with the same canonical script (scripts/eval_gpqa_v3.sh) for apples-to-apples comparison:

  • Quantization: GGUF Q6_K via llama.cpp llama-quantize (imatrix calibration)
  • Inference: llama.cpp llama-server (OpenAI-compatible API)
  • Evaluation: lm-evaluation-harness, task gpqa_diamond_cot_zeroshot
  • Backend: local-chat-completions against llama.cpp API
  • GPU: NVIDIA RTX 3090 (24 GB), 99 layers offloaded

Configuration (locked, all variants identical)

| Parameter | Value |
| --- | --- |
| Context size | 32768 tokens |
| Reasoning format | deepseek (separates thinking into reasoning_content) |
| Reasoning budget | 16384 tokens |
| max_gen_toks | 24576 |
| Temperature | 1.0 (Gemma 4 official sampling) |
| top_p | 0.95 |
| top_k | 64 |
| Seed | 42 |
| DRY multiplier | 0.5 (anti-degenerate-loop sampler, proven to fix Q53 "re-re-re" crash) |
| Tokenizer | google/gemma-4-26B-A4B-it (original, unmodified) |

The reasoning budget is critical: without it, Gemma 4 enters overthinking loops on hard questions and exhausts the full context without committing to an answer. This is base-model behavior — the 128-expert reference does the same.
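
For orientation, this is roughly how the locked settings map onto an lm-eval run against the llama.cpp server. Argument spellings are from memory and may vary across lm-evaluation-harness versions, so treat scripts/eval_gpqa_v3.sh as the canonical invocation.

import lm_eval

results = lm_eval.simple_evaluate(
    model="local-chat-completions",
    model_args=(
        "model=gemma-4-A4B-109e-v3,"
        "base_url=http://localhost:8099/v1/chat/completions,"
        "num_concurrent=1"
    ),
    tasks=["gpqa_diamond_cot_zeroshot"],
    gen_kwargs="temperature=1.0,top_p=0.95,max_gen_toks=24576",
    apply_chat_template=True,
)
print(results["results"]["gpqa_diamond_cot_zeroshot"])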

Results (Q6_K, full 198-question GPQA Diamond)

| Model | Drop map | Score | flex-extract % | vs 128e |
| --- | --- | --- | --- | --- |
| gemma-4-26B-A4B-it (reference) | – | 162/198 | 81.82% | – |
| gemma-4-A4B-109e-v3 (this) | clean fp32 teacher-force | 159/198 | 80.30% | −1.52 pp |
| gemma-4-A4B-109e-v2 (old v2) | corrupted bf16 teacher-force | 156/198 | 78.79% | −3.03 pp |
| gemma-4-A4B-120e-v4 | corrupted bf16 teacher-force | 154/198 | 77.78% | −4.04 pp |
| gemma-4-A4B-120e-v3 | old greedy | 152/198 | 76.77% | −5.05 pp |
| gemma-4-A4B-109e (older) | old greedy contribution | 148/198 | 74.75% | −7.07 pp |

Patch Methodology

The raw lm-eval run on this model scored 154/198 = 77.78%. 10 questions returned empty or truncated responses because of a llama.cpp PEG parser bug (upstream issue #21418, merged but not fully fixed) that triggers on some chemistry/reasoning questions when reasoning-format deepseek is active.

These 10 were re-run using the llama.cpp /completion endpoint with a short prefilled continuation ("The answer is (") at n_predict=10. This bypasses the reasoning parser entirely — the model commits to a single letter in 1-2 tokens, with no channel/thought markers to trip up the parser.
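
The re-run amounts to a raw completion request with the answer stem prefilled. A sketch (the prompt assembly is simplified; the real patch run builds the full chat-formatted question first):

import requests

def patch_answer(question_prompt, server="http://localhost:8099"):
    """Ask for just the option letter via llama.cpp's raw /completion endpoint."""
    payload = {
        "prompt": question_prompt + "The answer is (",   # prefilled continuation
        "n_predict": 10,                                  # a letter and a close-paren is enough
    }
    r = requests.post(f"{server}/completion", json=payload, timeout=300)
    return r.json()["content"]   # e.g. "C). ..." -> extract the leading letter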

Patch result: 5 of 10 were correct (expected 2.5 if random; the model has real signal on the "missing" questions when given a path to answer them). Final patched score: 159/198 = 80.30%.

The same patch method was applied consistently to all compared variants above (128e, 109e-v2, 120e-v4) — so the ranking is fair.

Wrong-answer breakdown (Q6_K, flexible-extract)

  • 154 correct from raw lm-eval + 5 patched = 159 correct
  • 39 wrong (34 from the raw run + 5 that stayed wrong after patching)
  • 10 would-be-invalid → patched (5 correct, 5 wrong)

Architecture

Unchanged from the original except num_experts: 109 (was 128):

  • Layers: 30
  • Hidden size: 2816
  • Expert intermediate size: 704 per expert
  • Dense MLP intermediate size: 2112 (always active)
  • Top-k routing: 8
  • Attention: hybrid pattern of 5 sliding-window layers per 1 global layer, head_dim=512 (requires attn_implementation="eager"; FlashAttention 2 does not support head_dim > 256)
  • Vocabulary: 262,144

Files

  • config.json — Model config with num_experts: 109
  • model-0000N-of-00009.safetensors — Model weights (9 shards, 41.7 GB total bf16)
  • expert_drop_metadata.json — Per-layer keep/drop expert indices and methodology
  • tokenizer.json / tokenizer_config.json / chat_template.jinja — from the base 26B-A4B-it (unchanged)

Usage

Transformers

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model = AutoModelForCausalLM.from_pretrained(
    "ManniX-ITA/gemma-4-A4B-109e-v3-it",
    torch_dtype=torch.bfloat16,
    device_map="auto",
    attn_implementation="eager",  # Gemma 4 head_dim=512 is not supported by FA2
)
tok = AutoTokenizer.from_pretrained("ManniX-ITA/gemma-4-A4B-109e-v3-it")

msgs = [{"role": "user", "content": "Explain the Heisenberg uncertainty principle."}]
inputs = tok.apply_chat_template(msgs, return_tensors="pt", add_generation_prompt=True).to(model.device)
out = model.generate(inputs, max_new_tokens=512, do_sample=True, temperature=1.0, top_p=0.95, top_k=64)
print(tok.decode(out[0][inputs.shape[1]:], skip_special_tokens=True))

llama.cpp (recommended for consumer hardware)

GGUF quantizations are available at ManniX-ITA/gemma-4-A4B-109e-v3-it-GGUF. The Q6_K quant fits comfortably in ~19 GB VRAM and was used for all benchmarks above.

llama-server -m gemma-4-A4B-109e-v3-Q6_K.gguf \
    --port 8099 -c 32768 -ngl 99 \
    --reasoning-format deepseek --reasoning-budget 16384 \
    --temp 1.0 --top-p 0.95 --top-k 64 --dry-multiplier 0.5 --seed 42

Or convert locally:

python llama.cpp/convert_hf_to_gguf.py gemma-4-A4B-109e-v3-it --outfile model-f16.gguf --outtype f16
llama.cpp/build/bin/llama-quantize model-f16.gguf model-Q6_K.gguf Q6_K

Reproduction

The full pipeline is deterministic and bit-reproducible. Same base model + same drop map + same script = bit-identical safetensors (verified across two independent rebuilds at different sites, all 9 shards SHA256-matched):

  1. scripts/teacher_force_analysis.py — Clean fp32 per-expert contribution analysis on GPQA Diamond
  2. scripts/generate_drop_map.py — Aggregate per-question top-16 protections into a global drop map
  3. scripts/expert_drop.py — Deterministic expert pruning from the drop map
  4. scripts/eval_gpqa_v3.sh — Canonical locked-methodology evaluation via llama.cpp + lm-eval
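
A small sketch of the shard-identity check behind the bit-reproducibility claim (paths are placeholders):

import hashlib
from pathlib import Path

def sha256_file(path, chunk=1 << 20):
    """SHA256 of one file, streamed in 1 MiB chunks."""
    h = hashlib.sha256()
    with Path(path).open("rb") as f:
        while block := f.read(chunk):
            h.update(block)
    return h.hexdigest()

def shards_match(build_a, build_b):
    """True if every *.safetensors shard hashes identically across two independent rebuilds."""
    names = sorted(p.name for p in Path(build_a).glob("*.safetensors"))
    return all(sha256_file(Path(build_a) / n) == sha256_file(Path(build_b) / n) for n in names)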

License

This model inherits the Gemma license from the base model.

Acknowledgements

  • Google for the base Gemma 4 26B-A4B-it model
  • The GPQA Diamond benchmark (Rein et al., 2023)
  • bartowski for the calibration data v5 used in imatrix-based GGUF quantization