gemma-4-A4B-109e-v3-it-GGUF

GGUF quantizations of ManniX-ITA/gemma-4-A4B-109e-v3-it.

All quants made using imatrix with calibration data v5.

Available Quantizations

Full precision

| Filename | Quant | Size |
| --- | --- | --- |
| gemma-4-A4B-109e-v3-F16.gguf | F16 | 40.72 GB |

Standard bartowski-style quants (recommended for most users)

| Filename | Quant | Size | Notes |
| --- | --- | --- | --- |
| gemma-4-A4B-109e-v3-Q8_0.gguf | Q8_0 | 21.65 GB | near-lossless, large |
| gemma-4-A4B-109e-v3-Q6_K_L.gguf | Q6_K_L | 18.40 GB | Q6_K with Q8_0 embed/output |
| gemma-4-A4B-109e-v3-Q6_K.gguf | Q6_K | 18.23 GB | very high quality |
| gemma-4-A4B-109e-v3-Q5_K_L.gguf | Q5_K_L | 15.59 GB | Q5_K_M with Q8_0 embed/output |
| gemma-4-A4B-109e-v3-Q5_K_M.gguf | Q5_K_M | 15.42 GB | high quality |
| gemma-4-A4B-109e-v3-Q5_K_S.gguf | Q5_K_S | 14.51 GB | high quality |
| gemma-4-A4B-109e-v3-Q4_K_L.gguf | Q4_K_L | 13.71 GB | Q4_K_M with Q8_0 embed/output |
| gemma-4-A4B-109e-v3-Q4_K_M.gguf | Q4_K_M | 13.54 GB | recommended for most hardware |
| gemma-4-A4B-109e-v3-Q4_K_S.gguf | Q4_K_S | 12.48 GB | slightly worse than Q4_K_M, a bit smaller |
| gemma-4-A4B-109e-v3-Q4_1.gguf | Q4_1 | 12.89 GB | legacy 4-bit |
| gemma-4-A4B-109e-v3-Q4_0.gguf | Q4_0 | 11.67 GB | legacy 4-bit |
| gemma-4-A4B-109e-v3-IQ4_NL.gguf | IQ4_NL | 11.67 GB | i-quant, similar to Q4_0 but better |
| gemma-4-A4B-109e-v3-IQ4_XS.gguf | IQ4_XS | 11.25 GB | i-quant, smaller than Q4_K_S |
| gemma-4-A4B-109e-v3-Q3_K_XL.gguf | Q3_K_XL | 10.90 GB | Q3_K_L with Q8_0 embed/output |
| gemma-4-A4B-109e-v3-Q3_K_L.gguf | Q3_K_L | 11.18 GB | OK quality at 3-bit |
| gemma-4-A4B-109e-v3-Q3_K_M.gguf | Q3_K_M | 10.74 GB | lower quality |
| gemma-4-A4B-109e-v3-Q3_K_S.gguf | Q3_K_S | 9.88 GB | low quality, not recommended |
| gemma-4-A4B-109e-v3-IQ3_M.gguf | IQ3_M | 10.03 GB | i-quant, surprisingly good at 3-bit |
| gemma-4-A4B-109e-v3-IQ3_XS.gguf | IQ3_XS | 9.41 GB | i-quant, aggressive |
| gemma-4-A4B-109e-v3-IQ3_XXS.gguf | IQ3_XXS | 9.14 GB | i-quant, very aggressive |
| gemma-4-A4B-109e-v3-IQ2_M.gguf | IQ2_M | 8.39 GB | i-quant, lossy but usable |
| gemma-4-A4B-109e-v3-IQ2_S.gguf | IQ2_S | 7.99 GB | i-quant, low quality |
| gemma-4-A4B-109e-v3-IQ2_XS.gguf | IQ2_XS | 7.94 GB | i-quant, very low quality |
| gemma-4-A4B-109e-v3-IQ2_XXS.gguf | IQ2_XXS | 7.52 GB | i-quant, smallest, expect quality loss |

ContribDynamic (CD) experimental quants

Per-layer dynamic quantization driven by our clean fp32 expert-contribution analysis: tensors in high-importance layers (L0, L1-6, L10) get higher precision; low-importance layers (L11-29) get lower precision. Inspired by Unsloth's Dynamic (UD) approach but using our own profiling data. See the CD block below for per-layer details.

| Filename | Quant | Size | Notes |
| --- | --- | --- | --- |
| gemma-4-A4B-109e-v3-CD-Q6_K.gguf | CD-Q6_K | 15.80 GB | hybrid Q6_K / Q5_K / Q4_K per layer; same footprint as Q5_K_L, Q6_K quality in the layers that matter |
| gemma-4-A4B-109e-v3-CD-Q5_K_M.gguf | CD-Q5_K_M | 13.51 GB | hybrid Q5_K / Q4_K / Q3_K per layer |
| gemma-4-A4B-109e-v3-CD-Q4_K_M.gguf | CD-Q4_K_M | 11.08 GB | hybrid Q4_K / Q3_K per layer |
| gemma-4-A4B-109e-v3-CD-Q3_K_M.gguf | CD-Q3_K_M | 10.37 GB | hybrid Q3_K / Q2_K per layer |

CD-Q2_K was attempted but not shipped — the low-tier IQ2_S assignment in its tensor-type map requires --imatrix at quantize time, and the initial build pipeline missed that guard. The bug is now fixed in scripts/quantize_gguf.py for future builds.
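
For reference, a minimal sketch of the CD assignment logic and the guard that was missing. The layer groups follow the importance profile described above, but the function names, the tensor-name patterns, and the constants are illustrative rather than the actual scripts/quantize_gguf.py code.

# Illustrative CD assignment: high-importance layers keep a higher-precision type,
# the rest drop a tier. Names and patterns are hypothetical, not the real script.
IMPORTANT_LAYERS = {0, 1, 2, 3, 4, 5, 6, 10}   # high-contribution layers from the fp32 profile
IQ_TYPES = {"IQ2_XXS", "IQ2_XS", "IQ2_S", "IQ2_M", "IQ3_XXS"}  # types that require an imatrix

def build_tensor_type_map(n_layers=30, high="Q6_K", low="Q4_K"):
    """Assign a quant type to each layer's expert tensors (hypothetical tensor-name pattern)."""
    return {
        f"blk.{i}.ffn_*_exps.weight": (high if i in IMPORTANT_LAYERS else low)
        for i in range(n_layers)
    }

def check_imatrix_guard(type_map, imatrix_path):
    """The guard the CD-Q2_K build missed: imatrix-only types cannot be quantized without --imatrix."""
    needed = sorted({t for t in type_map.values() if t in IQ_TYPES})
    if needed and not imatrix_path:
        raise SystemExit(f"tensor types {needed} require --imatrix at quantize time")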

How to Use

With llama.cpp:

llama-server -m gemma-4-A4B-109e-v3-Q4_K_M.gguf -c 8192 -ngl 99 \
    --reasoning-format deepseek --reasoning-budget 16384

Important: always pass --reasoning-format deepseek --reasoning-budget 16384 when serving Gemma 4. Without the budget, chemistry-heavy prompts trigger unbounded chain-of-thought that crashes the server. See the inference settings section below.
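
For a quick check, here is a minimal client-side sketch against the server launched above (assuming llama-server's default port 8080; adjust the URL if you pass --port). With --reasoning-format deepseek the chain-of-thought comes back in the message's reasoning_content field instead of being mixed into content.

import requests

payload = {
    "model": "gemma-4-A4B-109e-v3",
    "messages": [{"role": "user", "content": "Balance C3H8 + O2 -> CO2 + H2O and explain."}],
    "temperature": 1.0,
    "top_p": 0.95,
}
r = requests.post("http://localhost:8080/v1/chat/completions", json=payload, timeout=600)
msg = r.json()["choices"][0]["message"]
print("thinking:", msg.get("reasoning_content", ""))  # thinking, capped by --reasoning-budget
print("answer:", msg["content"])                      # final answer only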

With ollama:

ollama run hf.co/ManniX-ITA/gemma-4-A4B-109e-v3-it-GGUF:CD-Q6_K

This repo ships sidecar files Ollama reads on pull:

  • template: the Gemma 4 native tool-call format, <|tool_call>call:NAME{...}<tool_call|> with <|"|>...<|"|> string delimiters.
  • params: repeat_penalty 1.15 (stops duplicate-block loops), stop sequences <turn|> / <|tool_response>, temperature 0.6, top_p 0.95, num_ctx 131072.

For tool-use / function-calling, Ollama's built-in RENDERER gemma4 / PARSER gemma4 directives are required — these aren't auto-applied to HF pulls. Wrap the pulled model once:

ollama pull hf.co/ManniX-ITA/gemma-4-A4B-109e-v3-it-GGUF:CD-Q6_K
cat <<'EOF' | ollama create gemma4-109e-it -f -
FROM hf.co/ManniX-ITA/gemma-4-A4B-109e-v3-it-GGUF:CD-Q6_K
RENDERER gemma4
PARSER gemma4
EOF
ollama run gemma4-109e-it

After wrapping, ollama show gemma4-109e-it reports capabilities [completion, tools, thinking] and tool calls are parsed into the structured tool_calls response field instead of appearing as raw tokens in content.

Sampler note: without repeat_penalty >= 1.1, Gemma 4 tool-use can loop and emit hundreds of duplicate <tool_call> blocks per response. The params file sets 1.15 — keep it.
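
For reference, a small round trip with the wrapped model through the ollama Python client. The get_weather tool and its schema are made-up placeholders; the tool_calls handling matches the behaviour described above.

import ollama

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",                      # made-up example tool
        "description": "Get the current weather for a city",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

resp = ollama.chat(
    model="gemma4-109e-it",
    messages=[{"role": "user", "content": "What's the weather in Rome right now?"}],
    tools=tools,
)

# With RENDERER/PARSER gemma4 in place, calls arrive structured instead of as raw tokens.
for call in resp.message.tool_calls or []:
    print(call.function.name, call.function.arguments)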


Original Model Card

Gemma 4 A4B 109-Expert v3 (22.4B)

Expert-pruned version of google/gemma-4-26B-A4B-it, reduced from 128 to 109 experts per MoE layer using a clean fp32 teacher-force analysis on GPQA Diamond.

| | Original (128e) | This model (109e v3) | Delta |
| --- | --- | --- | --- |
| Total params | 26B | 22.4B | -14% |
| Experts per layer | 128 | 109 | -19/layer |
| Top-k routing | 8 | 8 | |
| GGUF Q6_K size | 18 GB | 18 GB | |
| bf16 size | 50 GB | 42 GB | -16% |
| GPQA Diamond (Q6_K, lm-eval + patches) | 81.82% | 80.30% | −1.52 pp |

The v3 clean teacher-force drop map costs only 1.52 percentage points vs the full 128-expert reference at Q6_K.

Why v3?

Earlier 109e / 109e-v2 releases were built from a drop map that turned out to have ~43% wrong expert selections, caused by an overflow bug in the bf16 .norm() reduction inside the teacher-force analysis script. The bug produced NaN/inf entries in the per-expert norms of layers 11-29 and corrupted the ranking used to pick which experts to drop.

v3 fixes it:

  • .float().norm() for the bf16→fp32 reduction (eliminates overflow)
  • NaN guard on hidden states (skip upstream-NaN tokens)
  • 4096-token truncation on calibration inputs
  • attn_implementation="eager" (Gemma 4 head_dim=512 is not supported by FlashAttention 2)
  • Analysis re-run over the full GPQA Diamond 196-question set on the 128e reference, producing a new clean drop map (teacher_force_109e_p16_clean.json).

The v3 clean drop map differs on roughly 43% of expert selections compared to v2 — it is effectively a different model. And at Q6_K it beats v2 by +1.51 pp on GPQA Diamond (80.30% vs 78.79%).
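
The core of the fix is the reduction dtype. A minimal illustration of the difference (the hook plumbing is omitted; the real logic lives in scripts/teacher_force_analysis.py):

import torch

def expert_contribution(routing_weight, expert_output):
    """v3: ||routing_weight * expert_output||_2 with the reduction done in fp32."""
    contrib = routing_weight * expert_output   # bf16 on GPU
    return contrib.float().norm(p=2)           # cast before the norm, no overflow

def expert_contribution_buggy(routing_weight, expert_output):
    """v2 behaviour: the same norm reduced in bf16 could overflow and poison the ranking."""
    return (routing_weight * expert_output).norm(p=2)

def token_is_usable(hidden_states):
    """NaN guard from the v3 fix list: skip tokens whose upstream activations are already non-finite."""
    return bool(torch.isfinite(hidden_states).all())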

Pruning Method

Teacher-Force Expert Analysis

The pruning decision is based on measuring the actual contribution of each expert during teacher-forced inference on GPQA Diamond prompts using the full 128-expert reference model as the teacher.

Process (scripts/teacher_force_analysis.py):

  1. Teacher passes: For each of 196 GPQA Diamond questions, run the 128e reference model through the complete prompt + correct CoT + answer sequence in a single forward pass with teacher forcing.
  2. Per-expert norms: At every MoE layer, hook the experts module and recompute each activated expert's output ||routing_weight · expert_output||_2 on GPU, aggregated in fp32.
  3. Per-question top-16 protection: For each question independently, rank experts per layer by weighted norm and mark the top 16 as "protected for this question".
  4. Aggregate across questions: Union the per-question top-16 sets across all 196 GPQA questions. The top 109 experts per layer by coverage are kept; the bottom 19 are dropped.

This gives a drop map specifically optimized for the GPQA Diamond task domain while remaining deterministic and reproducible.
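
A compact sketch of steps 3 and 4 above; the data layout and names are illustrative, and the real aggregation is scripts/generate_drop_map.py.

from collections import Counter

TOP_PROTECT = 16   # experts protected per question per layer (step 3)
KEEP = 109         # experts kept per layer after aggregation (step 4)

def protected_experts(norms_per_expert):
    """Step 3: top-16 experts for one question at one layer, ranked by weighted output norm."""
    ranked = sorted(norms_per_expert, key=norms_per_expert.get, reverse=True)
    return set(ranked[:TOP_PROTECT])

def drop_map_for_layer(per_question_norms, n_experts=128):
    """Step 4: count how many questions protect each expert, keep the 109 best-covered, drop the rest."""
    coverage = Counter()
    for norms in per_question_norms:          # one {expert_id: norm} dict per GPQA question
        coverage.update(protected_experts(norms))
    keep = {expert for expert, _ in coverage.most_common(KEEP)}
    return sorted(e for e in range(n_experts) if e not in keep)   # the 19 experts to drop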

Key Findings (clean fp32 analysis)

  • Experts are NOT topic-specialized: confirmed across domains.
  • The bf16 bug mattered: ~2% of per-expert norm entries in layers 11-29 were NaN or inf in the corrupted analysis, dragging those experts to artificially extreme ranks. Fixing to fp32 changes 43% of the "protected top-16 per question" decisions.
  • Expert weight similarity is near zero: cosine similarity between expert weight matrices maxes at ~0.05 — merging experts by averaging destroys the model. Expert dropping (what we do here) is the only viable structural compression.
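
A small sketch of the similarity check behind the last bullet: flatten each expert's weight matrix and compare pairwise in fp32 (names are illustrative).

import torch
import torch.nn.functional as F

def max_pairwise_cosine(expert_weights):
    """Maximum off-diagonal cosine similarity between flattened expert weight matrices."""
    flat = torch.stack([w.float().flatten() for w in expert_weights])
    flat = F.normalize(flat, dim=-1)
    sims = flat @ flat.T
    sims.fill_diagonal_(float("-inf"))   # ignore self-similarity
    return sims.max().item()             # ~0.05 for this model's experts

# Near-zero similarity means an averaged "merged" expert is almost orthogonal to both
# originals, which is why expert dropping is used instead of merging.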

Pruning Decision

Uniform 109 experts per layer (19 dropped per layer), based on the clean teacher-force aggregate ranking. The router's proj.weight is resized from [128, hidden] to [109, hidden], keeping only the rows corresponding to retained experts. The router adapts naturally: removed experts simply become unavailable, and the top-8 selection falls through to the next-best available expert. No fine-tuning is needed.

The drop map used to build this model is deterministic and stored in expert_drop_metadata.json.
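
In tensor terms the prune is a row slice. A sketch assuming a per-layer router nn.Linear with weight [num_experts, hidden] and an expert ModuleList; module names are assumptions, and the actual implementation is scripts/expert_drop.py.

import torch
import torch.nn as nn

def prune_moe_layer(router, experts, keep):
    """Keep only the experts in `keep`; the router weight goes from [128, hidden] to [109, hidden]."""
    keep = sorted(keep)
    new_router = nn.Linear(router.in_features, len(keep), bias=router.bias is not None)
    with torch.no_grad():
        new_router.weight.copy_(router.weight[keep])          # retained rows only
        if router.bias is not None:
            new_router.bias.copy_(router.bias[keep])
    new_experts = nn.ModuleList(experts[i] for i in keep)     # retained expert modules
    return new_router, new_experts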

GPQA Diamond Evaluation

Setup

All variants are evaluated with the same canonical script (scripts/eval_gpqa_v3.sh) for apples-to-apples comparison:

  • Quantization: GGUF Q6_K via llama.cpp llama-quantize (imatrix calibration)
  • Inference: llama.cpp llama-server (OpenAI-compatible API)
  • Evaluation: lm-evaluation-harness, task gpqa_diamond_cot_zeroshot
  • Backend: local-chat-completions against llama.cpp API
  • GPU: NVIDIA RTX 3090 (24 GB), 99 layers offloaded

Configuration (locked, all variants identical)

| Parameter | Value |
| --- | --- |
| Context size | 32768 tokens |
| Reasoning format | deepseek (separates thinking into reasoning_content) |
| Reasoning budget | 16384 tokens |
| max_gen_toks | 24576 |
| Temperature | 1.0 (Gemma 4 official sampling) |
| top_p | 0.95 |
| top_k | 64 |
| Seed | 42 |
| DRY multiplier | 0.5 (anti-degenerate-loop sampler, proven to fix Q53 "re-re-re" crash) |
| Tokenizer | google/gemma-4-26B-A4B-it (original, unmodified) |

The reasoning budget is critical: without it, Gemma 4 enters overthinking loops on hard questions and exhausts the full context without committing to an answer. This is base-model behavior — the 128-expert reference does the same.
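
For orientation, this is roughly how the locked settings map onto an lm-eval run against the llama.cpp server. Argument spellings are from memory and may vary across lm-evaluation-harness versions, so treat scripts/eval_gpqa_v3.sh as the canonical invocation.

import lm_eval

results = lm_eval.simple_evaluate(
    model="local-chat-completions",
    model_args=(
        "model=gemma-4-A4B-109e-v3,"
        "base_url=http://localhost:8099/v1/chat/completions,"
        "num_concurrent=1"
    ),
    tasks=["gpqa_diamond_cot_zeroshot"],
    gen_kwargs="temperature=1.0,top_p=0.95,max_gen_toks=24576",
    apply_chat_template=True,
)
print(results["results"]["gpqa_diamond_cot_zeroshot"])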

Results (Q6_K, full 198-question GPQA Diamond)

| Model | Drop map | Score | flex-extract % | vs 128e |
| --- | --- | --- | --- | --- |
| gemma-4-26B-A4B-it (reference) | – | 162/198 | 81.82% | – |
| gemma-4-A4B-109e-v3 (this) | clean fp32 teacher-force | 159/198 | 80.30% | −1.52 pp |
| gemma-4-A4B-109e-v2 (old v2) | corrupted bf16 teacher-force | 156/198 | 78.79% | −3.03 pp |
| gemma-4-A4B-120e-v4 | corrupted bf16 teacher-force | 154/198 | 77.78% | −4.04 pp |
| gemma-4-A4B-120e-v3 | old greedy | 152/198 | 76.77% | −5.05 pp |
| gemma-4-A4B-109e (older) | old greedy contribution | 148/198 | 74.75% | −7.07 pp |

Patch Methodology

The raw lm-eval run on this model scored 154/198 = 77.78%. 10 questions returned empty or truncated responses because of a llama.cpp PEG parser bug (upstream issue #21418, merged but not fully fixed) that triggers on some chemistry/reasoning questions when reasoning-format deepseek is active.

These 10 were re-run using the llama.cpp /completion endpoint with a short prefilled continuation ("The answer is (") at n_predict=10. This bypasses the reasoning parser entirely — the model commits to a single letter in 1-2 tokens, with no channel/thought markers to trip up the parser.
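
The re-run amounts to a raw completion request with the answer stem prefilled. A sketch (the prompt assembly is simplified; the real patch run builds the full chat-formatted question first):

import requests

def patch_answer(question_prompt, server="http://localhost:8099"):
    """Ask for just the option letter via llama.cpp's raw /completion endpoint."""
    payload = {
        "prompt": question_prompt + "The answer is (",   # prefilled continuation
        "n_predict": 10,                                  # a letter and a close-paren is enough
    }
    r = requests.post(f"{server}/completion", json=payload, timeout=300)
    return r.json()["content"]   # e.g. "C). ..." -> extract the leading letter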

Patch result: 5 of 10 were correct (expected 2.5 if random; the model has real signal on the "missing" questions when given a path to answer them). Final patched score: 159/198 = 80.30%.

The same patch method was applied consistently to all compared variants above (128e, 109e-v2, 120e-v4) — so the ranking is fair.

Wrong-answer breakdown (Q6_K, flexible-extract)

  • 154 correct from raw lm-eval + 5 patched = 159 correct
  • 39 wrong (34 from the raw run + 5 that stayed wrong after patching)
  • 10 would-be-invalid → patched (5 correct, 5 wrong)

Architecture

Unchanged from the original except num_experts: 109 (was 128):

  • Layers: 30
  • Hidden size: 2816
  • Expert intermediate size: 704 per expert
  • Dense MLP intermediate size: 2112 (always active)
  • Top-k routing: 8
  • Attention: hybrid pattern of 5 sliding-window layers per 1 global layer, head_dim=512 (requires attn_implementation="eager"; FlashAttention 2 does not support head_dim > 256)
  • Vocabulary: 262,144

Files

  • config.json — Model config with num_experts: 109
  • model-0000N-of-00009.safetensors — Model weights (9 shards, 41.7 GB total bf16)
  • expert_drop_metadata.json — Per-layer keep/drop expert indices and methodology
  • tokenizer.json / tokenizer_config.json / chat_template.jinja — from the base 26B-A4B-it (unchanged)

Usage

Transformers

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model = AutoModelForCausalLM.from_pretrained(
    "ManniX-ITA/gemma-4-A4B-109e-v3-it",
    torch_dtype=torch.bfloat16,
    device_map="auto",
    attn_implementation="eager",  # Gemma 4 head_dim=512 is not supported by FA2
)
tok = AutoTokenizer.from_pretrained("ManniX-ITA/gemma-4-A4B-109e-v3-it")

msgs = [{"role": "user", "content": "Explain the Heisenberg uncertainty principle."}]
inputs = tok.apply_chat_template(msgs, return_tensors="pt", add_generation_prompt=True).to(model.device)
out = model.generate(inputs, max_new_tokens=512, do_sample=True, temperature=1.0, top_p=0.95, top_k=64)
print(tok.decode(out[0][inputs.shape[1]:], skip_special_tokens=True))

llama.cpp (recommended for consumer hardware)

GGUF quantizations are available at ManniX-ITA/gemma-4-A4B-109e-v3-it-GGUF. The Q6_K quant fits comfortably in ~19 GB VRAM and was used for all benchmarks above.

llama-server -m gemma-4-A4B-109e-v3-Q6_K.gguf \
    --port 8099 -c 32768 -ngl 99 \
    --reasoning-format deepseek --reasoning-budget 16384 \
    --temp 1.0 --top-p 0.95 --top-k 64 --dry-multiplier 0.5 --seed 42

Or convert locally:

python llama.cpp/convert_hf_to_gguf.py gemma-4-A4B-109e-v3-it --outfile model-f16.gguf --outtype f16
llama.cpp/build/bin/llama-quantize model-f16.gguf model-Q6_K.gguf Q6_K

Reproduction

The full pipeline is deterministic and bit-reproducible. Same base model + same drop map + same script = bit-identical safetensors (verified across two independent rebuilds at different sites, all 9 shards SHA256-matched):

  1. scripts/teacher_force_analysis.py — Clean fp32 per-expert contribution analysis on GPQA Diamond
  2. scripts/generate_drop_map.py — Aggregate per-question top-16 protections into a global drop map
  3. scripts/expert_drop.py — Deterministic expert pruning from the drop map
  4. scripts/eval_gpqa_v3.sh — Canonical locked-methodology evaluation via llama.cpp + lm-eval
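
A small sketch of the shard-identity check behind the bit-reproducibility claim (paths are placeholders):

import hashlib
from pathlib import Path

def sha256_file(path, chunk=1 << 20):
    """SHA256 of one file, streamed in 1 MiB chunks."""
    h = hashlib.sha256()
    with Path(path).open("rb") as f:
        while block := f.read(chunk):
            h.update(block)
    return h.hexdigest()

def shards_match(build_a, build_b):
    """True if every *.safetensors shard hashes identically across two independent rebuilds."""
    names = sorted(p.name for p in Path(build_a).glob("*.safetensors"))
    return all(sha256_file(Path(build_a) / n) == sha256_file(Path(build_b) / n) for n in names)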

License

This model inherits the Gemma license from the base model.

Acknowledgements

  • Google for the base Gemma 4 26B-A4B-it model
  • The GPQA Diamond benchmark (Rein et al., 2023)
  • bartowski for the calibration data v5 used in imatrix-based GGUF quantization