gemma-4-31B-he1-it — v2 (2026-05-17)

Partial head-prune of google/gemma-4-31B-it: 12.5% of Q-heads removed in the first 4 sliding-attention layers (L0–L3), with a least-squares (lstsq) heal of those layers' O-projections. L4–L59 are byte-identical to the base. The model matches base-model quality on HumanEval and MBPP under chat-completions evaluation. Because L0–L3 are the only layers that differ from the base, any score difference traces to the pruned Q-heads and the healed O-projections in those four layers.

The build is fully reproducible via OmniMergeKit (recipes/gemma4_31b/prune_local_heal.py).

What's new in v2

This is a new local CPU rebuild of the same partial-prune recipe, replacing the original pod-built v1 weights.

What's actually different from base (verified via per-tensor diff against google/gemma-4-31B-it):

| Tensor | Layers changed | Layers unchanged |
| --- | --- | --- |
| self_attn.q_proj.weight | 0, 1, 2, 3 | 4 – 59 (all 56) |
| self_attn.o_proj.weight | 0, 1, 2, 3 | 4 – 59 (all 56) |
| self_attn.{k_proj, v_proj}.weight | (none) | 0 – 59 |
| FFN gate/up/down, norms, embed, lm_head | (none) | 0 – 59 |

pruned_layers and refit_stats inside prune_manifest.json list all 60 layers because the recipe computed the lstsq math for every layer; only L0–L3's q_proj/o_proj were actually written into the saved safetensors (L4–L59 kept their base values). v1 had the same L0–L3-only footprint; v2 differs from v1 only on q_proj/o_proj of L0–L3 (slightly different lstsq solutions arising from the local CPU calibration capture).
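The per-tensor diff described above can be sketched as follows. This is a minimal illustration, not the project's diff tool: it assumes each checkpoint has already been read into a mapping from tensor name to raw bytes (for example via the safetensors library), and it assumes the usual Hugging Face naming pattern `...layers.<idx>.<suffix>`. The function name `changed_tensors` is hypothetical.

```python
import re
from collections import defaultdict

# Assumed tensor naming ("...layers.<idx>.<suffix>"); the real key prefix may differ.
LAYER_RE = re.compile(r"layers\.(\d+)\.(.+)$")

def changed_tensors(base, candidate):
    """Return {tensor_suffix: sorted layer indices} for tensors whose bytes differ.

    base / candidate: dict mapping tensor name -> raw tensor bytes.
    Tensors outside any layer (embeddings, lm_head) are reported with layer -1.
    """
    changed = defaultdict(list)
    for name, raw in candidate.items():
        if raw == base[name]:
            continue  # byte-identical to the base, nothing to report
        match = LAYER_RE.search(name)
        if match:
            changed[match.group(2)].append(int(match.group(1)))
        else:
            changed[name].append(-1)
    return {suffix: sorted(layers) for suffix, layers in changed.items()}
```

For a checkpoint with the footprint in the table above, the result would contain exactly two keys, the q_proj and o_proj weights, each mapped to layers 0 through 3.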

SHA256 evidence:

  • model-00001-of-00002.safetensors: v2 fdcaa60e…1a409ae9dd vs v1 acce9a7a…06771a9dc; differs (L0–L3 q/o_proj).
  • model-00002-of-00002.safetensors: v2 2bdb2df0…ae2b6b7d = v1 (byte-identical files). Every tensor in this shard is also byte-identical to the base; only the file header is larger, because it carries embedded build metadata.
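Readers who want to check these digests locally can hash each shard with a streaming SHA-256. The sketch below uses only the Python standard library; the helper name `sha256_file` and the chunk size are illustrative choices.

```python
import hashlib

def sha256_file(path, chunk_size=1 << 20):
    """Stream a file through SHA-256 so multi-GB shards never sit fully in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # iter() with a sentinel keeps reading until read() returns empty bytes.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Comparing `sha256_file(...)` on the same shard from the v1 and v2 downloads should show a mismatch for the first shard and a match for the second.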

Quality

llama-server Q4_K_M legacy eval (v1 cycle; v2 9-bench NVFP4A16 re-run results are in the canonical 9-bench cohort table below):

  • HumanEval-chat: 98.78% (164 q) for the v2 local rebuild; +1.22 pp vs v1's 97.56%, +0.61 pp vs base 98.17%.
  • MBPP-chat: 85.20% (500 q) for the v2 local rebuild; −0.40 pp vs v1's 85.60%, +0.20 pp vs base 85.00%.

The deltas against base are smaller than the per-bench uncertainty (roughly ±1 to ±1.6 pp; see the ± values in the Q4_K_M table below); treat as "matches base, doesn't regress."
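To show where a per-bench uncertainty of this size comes from, here is a small sketch of the standard error of an accuracy score. It assumes the usual convention of taking the sample standard deviation (n − 1 denominator) of the 0/1 per-question outcomes and dividing by √n; the function name is illustrative, and the correct-answer counts used below are inferred from the published percentages.

```python
import math

def accuracy_stderr_pp(correct, total):
    """Standard error of an accuracy score, in percentage points.

    For 0/1 outcomes the sample variance is n*p*(1-p)/(n-1), so the
    standard error of the mean reduces to sqrt(p*(1-p)/(n-1)).
    """
    p = correct / total
    return 100.0 * math.sqrt(p * (1.0 - p) / (total - 1))
```

With 162 of 164 HumanEval items correct (98.78%) this gives about ±0.86 pp, and with 426 of 500 MBPP items correct (85.20%) about ±1.59 pp, so a +0.61 pp or +0.20 pp delta against base is well inside one standard error.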

v2 eval status (2026-05-19)

  • NVFP4A16 9-bench canonical eval under the current methodology (vLLM 0.20.2 stock + --reasoning-parser gemma4 + thinking_token_budget=12288 + Fix-A reasoning_content fallback, greedy sampler) is complete: all 9/9 benches landed (see the cohort table below). HumanEval+ landed at 93.90 % (new best in cohort, +1.22 pp over v5-coder's 92.68 %) and LCB-medium-55 at 96.36 % (tied with base 31B-it for best in cohort).
  • Republish GGUF + NVFP4A16 quants at the -GGUF sister repo (imatrix on calibration-v5 corpus, updated imatrix.dat) — still pending the canonical eval pass on the new weights.

The Q4_K_M table below is the legacy-cycle reference; the canonical 9-bench table is the authoritative source for this v2 release.

TL;DR

  • What: L0–L3-only Q-head prune at 12.5% on Gemma 4 31B-it. L4–L59 are byte-identical to google/gemma-4-31B-it.
  • How: Per-head leave-one-out importance ranking on a teacher-force calibration set across all layers; mask-mode head dropping (no reshape); lstsq O-projection heal computed on every layer but only L0–L3's q_proj/o_proj written into the saved safetensors (L4–L59's healed weights were rolled back to base after the full-prune canary failed).
  • Quality: Matches the unpruned base on HumanEval-chat and MBPP-chat under chat-completions Q4_K_M (legacy llama-server eval). v2 HE 98.78% (+0.61 pp vs base), MBPP 85.20% (+0.20 pp vs base) — both inside the per-bench ±1.5 pp CI.
  • Effective parameter savings: 0.83% in attention only (4 layers × 12.5% Q-prune / 60 layers). Embed / FFN / norms unchanged. The intent of the original v1 was a full 12.5% prune across all 60 layers (13% attention savings); that build's canary failed, and only the L0–L3 changes survived into the saved checkpoint. Treat this repo as a partial-prune research artifact, not as the full-prune compression model originally targeted.
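The 0.83% figure is plain arithmetic. The sketch below reproduces it under the simplifying assumption that every layer's attention counts equally (sliding and full-attention layers differ in head_dim, so this is an approximation); the function name is illustrative.

```python
def attention_q_savings(pruned_layers, total_layers, prune_fraction):
    """Model-wide share of Q-heads removed, weighting every layer equally."""
    return pruned_layers * prune_fraction / total_layers

# L0-L3 only: 4 layers out of 60, 12.5% of Q-heads in each.
partial = attention_q_savings(pruned_layers=4, total_layers=60, prune_fraction=0.125)
print(f"{partial:.2%}")  # prints 0.83%
```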

Code Benchmarks: chat-completions @ Q4_K_M (legacy eval cycle)

Evaluated via llama-server (build b9095-2-g0b04728) + lm-evaluation-harness chat-completions endpoint with --reasoning-format deepseek --reasoning-budget 8192, greedy sampler (temp=0), q8_0 KV cache, --parallel 2. Mandatory --use_cache SQLite + --log_samples per project protocol.

| Model | HumanEval-chat (164) | MBPP-chat (500) |
| --- | --- | --- |
| google/gemma-4-31B-it (base) Q4_K_M | 98.17% ±1.05 | 85.00% ±1.60 |
| gemma-4-31B-he1-it v1 (replaced by this upload) Q4_K_M | 97.56% ±1.21 | 85.60% ±1.57 |
| gemma-4-31B-he1-it v2 (this model, local rebuild) Q4_K_M | 98.78% ±0.86 | 85.20% ±1.59 |

Both v1 and v2 are L0–L3-only modifications. The published v1 was originally framed as a full-60-layer prune; per-tensor diff against the base shows that framing was incorrect for the actual saved weights. v2 ships with the same effective footprint and a corrected card.

These llama.cpp Q4_K_M chat-completions numbers are the legacy eval cycle; the full re-run under the current canonical methodology (Gemma 4 reasoning parser + thinking budget 12288) landed on 2026-05-19 — see the canonical 9-bench cohort table below. A vLLM-4bit cross-validation was attempted at v1 time but blocked on upstream vLLM gaps (Gemma 4 31B heterogeneous-head QKV split bug in vLLM ≤ 0.19; vLLM 0.20+ requires CUDA driver and glibc combinations not available on the eval hosts at build time). The v2 NVFP4A16 quant in the sister repo clears that path.

Canonical 9-bench NVFP4A16 cohort — Gemma 4 dense + MoE comparison

The current canonical Gemma 4 cohort uses vLLM 0.20.2 stock + --reasoning-parser gemma4 + thinking_token_budget=12288 with a greedy sampler (T=0, top_p=1, top_k=0, do_sample=False) — see omnimergekit/eval/EVAL_PROTOCOL.md. The 128e/v4 columns reuse the published baseline from the v5-it card.

| Bench (n) | 128e ref | 98e v4 | 98e v5 | 98e v5-coder | 31B-it (base) | 31B-he1 (this model) |
| --- | --- | --- | --- | --- | --- | --- |
| GPQA Diamond (198) | 73.23 % | 69.19 % | 68.18 % | 68.69 % | 81.31 % | **84.34 %** |
| GSM8K-100 | 91.00 % | 86.00 % | 91.00 % | 86.00 % | **93.00 %** | 92.00 % |
| MATH-500-100 | 89.00 % | 89.00 % | 90.00 % | 92.00 % | **97.00 %** | 95.00 % |
| AIME 2024 (30) | 73.33 %† | 66.67 %† | 70.00 % | 36.67 %‡ | **76.67 %** | **76.67 %** |
| IFEval-100 (prompt_strict) | 95.00 % | 93.00 % | 89.00 % | 94.00 % | **96.00 %** | **96.00 %** |
| HumanEval-164 chat | 96.95 % | 96.95 % | 93.29 % | **98.17 %** | 97.56 % | **98.17 %** |
| HumanEval+ chat (164) | 92.07 % | 91.46 % | 87.20 % | 92.68 % | 92.07 % | **93.90 %** |
| LCB-medium-55 (v4 split) | 87.27 % | 78.18 % | 80.00 % | 85.45 % | **96.36 %** | **96.36 %** |
| ARC-Challenge chat (1172) | 95.99 % | 95.99 % | 95.82 % | 95.31 % | **98.04 %** | 97.61 % |

Bold = best in row. Sources: 128e and 98e v4 reuse the published baseline from the v5-it card; 98e v5 from solidpc 3090 9-bench (all 9 final on stack-pinned vLLM); 98e v5-coder from its own card; 31B-it (base) and 31B-he1 from L40 pod 37006213. All on vLLM 0.20.2 stock + --reasoning-parser gemma4 + thinking_token_budget=12288 + Fix-A reasoning_content fallback, greedy sampler. ARC numbers were rescored on stack-pinned hosts after the Fix-A patch to recover responses that vLLM had emitted as empty content with the answer stranded in reasoning_content.

† AIME 2024 correction (2026-05-21): the previously published 128e (36.67 %) and v4 (36.67 %) AIME values were produced against aime24 (non-chat) on an earlier vLLM dev build that under-scored AIME for the 26B-A4B class. Re-eval against the aime24_chat shadow task on stack-pinned stock vLLM 0.20.2 gives 128e = 73.33 % (22/30) and v4 = 66.67 % (20/30). These are the values reflected above.

‡ v5-coder AIME = 36.67 % is from the same earlier-stack run and is currently being re-evaluated on the canary-verified stack (T79). The corrected value will be published on the v5-coder card when the re-eval lands.

Methodology: this cohort is now anchored to EVAL_PROTOCOL v3 — stack lock + structural canary + reference anchor bench, see omnimergekit/eval/stack.lock.yaml and structural_canary.py. Any number published in this table passes the canary against stock vLLM 0.20.2 + Fix-A.

What the he1 column tells us

All 9 canonical benches landed for he1 on 2026-05-18 / -19. The picture against base 31B-it:

  • GPQA Diamond +3.03 pp (84.34 % vs 81.31 %) — the surprise gain. The L0–L3 partial prune (≈0.83 % attention savings) appears to slightly favour multi-domain reasoning paths rather than degrade them.
  • HumanEval+ 93.90 % (+1.83 pp vs base; new best in the cohort, +1.22 pp over v5-coder's 92.68 %) — the second surprise. The same partial-prune that didn't regress anything also lifts the harder HE+ split above every Gemma 4 variant tested so far.
  • LCB-medium-55 96.36 % (tied with base for best in cohort), on par with base on the hardest code split and well above every MoE variant.
  • AIME 2024 76.67 % (tied with base), IFEval 96.00 % (tied), HumanEval 98.17 % (+0.61 pp), on par with or slightly above base on competition math, instruction following, and code generation.
  • GSM8K −1.00 pp, MATH-500 −2.00 pp, ARC −0.43 pp — within their per-bench stderr (±1.7 to ±2.7 pp); not statistically distinguishable from base.

Practical reading: the L0–L3-only modification shows no statistically significant regression on any canonical bench, lifts GPQA Diamond by +3.03 pp, and lifts HumanEval+ by +1.83 pp, which makes he1 the top scorer in this cohort on both of those benches. All 9/9 benches are now complete.

Pruning Method

The mechanical recipe is implemented in recipes/gemma4_31b/prune_local_heal.py (omnimergekit) and runs in five phases. The phases below describe what the recipe attempted — the saved weights ultimately only retained the L0–L3 portion (see "What's new in v2" above).

  1. Phase 0: Calibration capture. Teacher-forced forward pass on a short instruction-following corpus, recording per-head Q/K/V activations and the residual stream at each layer's attention entry point. Phase 0 runs at fp32 with no accelerate offload so the cache is bit-stable.
  2. Phase 1: nf4 importance. Load the model in 4-bit (nf4), run a forward pass over the same prompts, and compute a per-head leave-one-out importance metric (the NLL delta when that head's contribution is zeroed). In each layer, the 12.5% of heads with the lowest importance are tagged for removal; this is done across all layers (50 sliding-attention + 10 full-attention).
  3. Phase 2: lstsq O-projection heal. For every layer, the kept heads' attention output is collected and a new O-projection, O' = (full_attn_out)·pinv(kept_attn_out), is fitted so that the kept heads alone reproduce the full-head post-attention residual on the calibration set in the least-squares sense. Ridge regularization (ridge_rel=0.01) stabilizes the pseudo-inverse. The manifest's refit_stats records this for all 60 layers (mean rel-resid 1.1%, max 3.8%).
  4. Phase 3a/b/c: materialize and save. Full-precision materialization on CPU, mask-mode (num_attention_heads is not reshaped), staged safetensors save with the prune manifest embedded.
  5. Phase 2.5: AR canary. Generation-mode (not teacher-force) sanity check against three known-good prompts. The original v1 full-60-layer canary failed (manifest carries canary_failed: True); the salvage path kept only L0–L3's q_proj/o_proj changes from the build, restoring L4–L59 to base. v2 re-ran the same partial pipeline locally on CPU and produced the same L0–L3-only saved footprint.
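The least-squares heal in Phase 2 can be illustrated with a small NumPy sketch. This is not the recipe's code: it assumes a row-vector layout (activations are tokens × features and the O-projection is stored as features × d_model, i.e. transposed relative to a torch Linear weight), and it assumes ridge_rel scales the penalty by the mean diagonal of the Gram matrix. The name `heal_o_proj` is hypothetical.

```python
import numpy as np

def heal_o_proj(attn_out, w_o, kept_cols, ridge_rel=0.01):
    """Refit the O-projection so kept head columns alone approximate the full output.

    attn_out:  (tokens, n_heads * head_dim) concatenated per-head attention outputs
    w_o:       (n_heads * head_dim, d_model) original O-projection, row-vector layout
    kept_cols: indices of the input columns that belong to surviving heads
    """
    target = attn_out @ w_o                      # full-head post-attention output
    a_kept = attn_out[:, kept_cols]
    gram = a_kept.T @ a_kept
    # Assumed meaning of ridge_rel: penalty relative to the mean Gram diagonal.
    lam = ridge_rel * np.trace(gram) / gram.shape[0]
    w_kept = np.linalg.solve(gram + lam * np.eye(gram.shape[0]), a_kept.T @ target)

    w_healed = np.zeros_like(w_o)                # mask-mode: pruned rows stay zero
    w_healed[kept_cols] = w_kept
    rel_resid = np.linalg.norm(a_kept @ w_kept - target) / np.linalg.norm(target)
    return w_healed, rel_resid
```

The returned relative residual plays the role of the rel-resid figure quoted from refit_stats: the closer the pruned heads are to a linear function of the kept ones on the calibration set, the smaller it gets.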

Why mask-mode and not reshape?

The Gemma 4 31B attention block has per-layer-type structure: sliding-attention layers use head_dim=256, n_kv=16 (GQA 2:1); full-attention layers use head_dim=512, n_kv=4 with attention_k_eq_v=True and no V projection. The per-layer-type Q/K/V/O shapes are intricate. Mask-mode keeps num_attention_heads global and zeros the pruned heads' contributions inside the O-projection, which:

  • avoids fighting LCM constraints on reshape (only Q=24/16/32 are LCM-legal at 12.5% target across all layer types),
  • leaves the model loadable by stock transformers with no config patches,
  • lets the lstsq heal target whatever surviving subspace gives the best post-attention residual.
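A minimal sketch of what mask-mode means for the O-projection, assuming the torch Linear weight layout (d_model, n_heads × head_dim) and contiguous per-head column blocks; the helper name `mask_heads` is hypothetical and the dimensions in any real run would be the model's own (for example head_dim=256 on the sliding-attention layers).

```python
import numpy as np

def mask_heads(o_proj_weight, pruned_heads, head_dim):
    """Zero the o_proj input columns fed by pruned heads; the shape is unchanged."""
    masked = o_proj_weight.copy()
    for head in pruned_heads:
        # Each head owns one contiguous block of head_dim input columns.
        masked[:, head * head_dim:(head + 1) * head_dim] = 0.0
    return masked
```

Because the tensor keeps its original shape, config.json and num_attention_heads stay untouched; the pruned heads are still computed but no longer contribute to the residual stream.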

A reshape variant (T16-Option-B at Q=24) is planned as a follow-up to compare on-disk savings.

The BOS gotcha (and why earlier canary runs failed)

tokenizer.__call__(prompt) does not prepend <bos> for Gemma 4 by default; llama.cpp /completion does. The pruned + healed weights are BOS-sensitive: without a leading <bos> they collapse into token loops (額額額, France is France is) even when the same weights decode perfectly under llama.cpp. The fix is a one-line prepend in capture_canary_baseline and is documented in the recipe README. The AR canary in this build had a stale cache that masked this bug for several iterations.
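The shape of such a prepend can be sketched as below. This is an illustration only, since the actual change inside capture_canary_baseline is not reproduced in this card; `ensure_bos` is a hypothetical helper, and in practice bos_id would come from the tokenizer (for example its bos_token_id attribute).

```python
def ensure_bos(token_ids, bos_id):
    """Return token_ids with bos_id prepended unless it already leads the sequence."""
    if token_ids and token_ids[0] == bos_id:
        return list(token_ids)
    return [bos_id, *token_ids]
```

Making the helper idempotent matters here: chat-template paths may already emit a leading BOS, and a doubled BOS would be a different input from the one llama.cpp sees.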

Files

  • model-00001-of-00002.safetensors, model-00002-of-00002.safetensors — bf16 weights (≈ 59 GB total). Of the 1188 tensors, 8 differ from the base model: q_proj and o_proj for layers 0, 1, 2, 3 (sliding-attention).
  • model.safetensors.index.json — shard index (byte-identical to the base; same shard split).
  • config.json — stock Gemma 4 31B-it config; num_attention_heads unchanged (mask-mode).
  • tokenizer.json, tokenizer_config.json, chat_template.jinja — copied from the base.
  • prune_manifest.json — recipe metadata. Note: pruned_layers and refit_stats list all 60 layers because the recipe computed importance + lstsq math for every layer; only L0–L3's q_proj/o_proj actually made it into the saved safetensors.

Usage

Transformers (bf16 / nf4)

from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
import torch

# 4-bit nf4 quantization with double quantization and bf16 compute.
bnb = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.bfloat16,
                        bnb_4bit_quant_type="nf4", bnb_4bit_use_double_quant=True)
model = AutoModelForCausalLM.from_pretrained(
    "ManniX-ITA/gemma-4-31b-he1-it",
    quantization_config=bnb, device_map={"": 0}, attn_implementation="eager",
)
tok = AutoTokenizer.from_pretrained("ManniX-ITA/gemma-4-31b-he1-it")

# Build the prompt with the model's chat template, then decode greedily.
msgs = [{"role": "user", "content": "Write a Python function to check if a number is prime."}]
ids = tok.apply_chat_template(msgs, return_tensors="pt", add_generation_prompt=True).to(model.device)
out = model.generate(ids, max_new_tokens=400, do_sample=False, pad_token_id=tok.eos_token_id)
# Print only the newly generated tokens, not the prompt.
print(tok.decode(out[0, ids.shape[1]:], skip_special_tokens=True))

llama.cpp (recommended for serving)

llama-server -m gemma-4-31B-he1-it-Q4_K_M.gguf \
    --port 8099 -c 32768 -ngl 99 --no-warmup --parallel 2 \
    --cache-type-k q8_0 --cache-type-v q8_0 \
    --reasoning-format deepseek --reasoning-budget 8192 \
    --temp 1.0 --top-p 0.95 --top-k 64

For greedy code-generation, replace the sampler with --temp 0 --top-p 1 --top-k 0 --reasoning off.

GGUF quants: ManniX-ITA/gemma-4-31b-he1-it-GGUF (v1 quants currently; v2 republish pending).

Recipe + Code

OmniMergeKit is the canonical home:

  • recipes/gemma4_31b/prune_local_heal.py — the full prune + heal + canary pipeline
  • scripts/replay_prune.sh — preflight / mlx neutralization / dependency pin wrapper for pod runs
  • eval/EVAL_PROTOCOL.md — canonical eval methodology (HE / MBPP / GPQA / LCB / MPE)

Related Models

| Model | Description |
| --- | --- |
| gemma-4-31b-he1-it-GGUF | imatrix GGUF quants of this model (v1; v2 republish pending) |

License

This model inherits the Gemma license from the base model.

Acknowledgements

  • Google for the base Gemma 4 31B-it model
  • bartowski for the calibration data v5 used in imatrix GGUF quantization
  • The OmniMergeKit project for the prune / heal / canary toolkit