Libraries

How to use ManniX-ITA/gemma-4-31b-he1-it with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("image-text-to-text", model="ManniX-ITA/gemma-4-31b-he1-it")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"}
        ]
    },
]
pipe(text=messages)

# Load model directly
from transformers import AutoProcessor, AutoModelForImageTextToText

processor = AutoProcessor.from_pretrained("ManniX-ITA/gemma-4-31b-he1-it")
model = AutoModelForImageTextToText.from_pretrained("ManniX-ITA/gemma-4-31b-he1-it")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"}
        ]
    },
]
inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(processor.decode(outputs[0][inputs["input_ids"].shape[-1]:]))

Local Apps

vLLM

How to use ManniX-ITA/gemma-4-31b-he1-it with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "ManniX-ITA/gemma-4-31b-he1-it"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "ManniX-ITA/gemma-4-31b-he1-it",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

SGLang

How to use ManniX-ITA/gemma-4-31b-he1-it with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "ManniX-ITA/gemma-4-31b-he1-it" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "ManniX-ITA/gemma-4-31b-he1-it",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "ManniX-ITA/gemma-4-31b-he1-it" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "ManniX-ITA/gemma-4-31b-he1-it",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Docker Model Runner
How to use ManniX-ITA/gemma-4-31b-he1-it with Docker Model Runner:
```
docker model run hf.co/ManniX-ITA/gemma-4-31b-he1-it
```

gemma-4-31B-he1-it — v2 (2026-05-17)

Partial head-prune of google/gemma-4-31B-it: 12.5% of Q-heads removed in the first 4 sliding-attention layers (L0–L3), with a least-squares (lstsq) heal of those layers' O-projections. L4–L59 are byte-identical to the base. The model matches base-model quality on HumanEval and MBPP under chat-completions evaluation. The only perturbation relative to the base sits in these early layers, as captured by the local rebuild, and any gains come from the heal solution found on those 4 layers.

The build is fully reproducible via OmniMergeKit (recipes/gemma4_31b/prune_local_heal.py).

What's new in v2

This is a new local CPU rebuild of the same partial-prune recipe, replacing the original pod-built v1 weights.

What's actually different from base (verified via per-tensor diff against google/gemma-4-31B-it):

| Tensor | Layers changed | Layers unchanged |
| --- | --- | --- |
| `self_attn.q_proj.weight` | 0, 1, 2, 3 | 4 – 59 (all 56) |
| `self_attn.o_proj.weight` | 0, 1, 2, 3 | 4 – 59 (all 56) |
| `self_attn.{k_proj, v_proj}.weight` | (none) | 0 – 59 |
| FFN gate/up/down, norms, embed, lm_head | (none) | 0 – 59 |

pruned_layers and refit_stats inside prune_manifest.json list all 60 layers because the recipe computed the lstsq math for every layer; only L0–L3's q_proj/o_proj were actually written into the saved safetensors (L4–L59 kept their base values). v1 had the same L0–L3-only footprint; v2 differs from v1 only on q_proj/o_proj of L0–L3 (slightly different lstsq solutions arising from the local CPU calibration capture).
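A per-tensor diff of this kind can be sketched with the standard library alone, assuming the shards follow the published safetensors layout (an 8-byte little-endian header length, a JSON header with per-tensor data_offsets, then the raw byte buffer). The function names below are illustrative and are not part of OmniMergeKit. The sketch compares raw tensor bytes and skips the optional __metadata__ entry, so two shards that differ only in embedded build metadata report no changed tensors.

```python
import hashlib
import json
import struct


def read_header(path):
    """Return (tensor header dict, absolute offset of the byte buffer)."""
    with open(path, "rb") as f:
        (header_len,) = struct.unpack("<Q", f.read(8))
        header = json.loads(f.read(header_len))
    header.pop("__metadata__", None)  # free-form metadata, not a tensor
    return header, 8 + header_len


def tensor_digests(path):
    """Map each tensor name to the SHA256 of its raw bytes."""
    header, base = read_header(path)
    digests = {}
    with open(path, "rb") as f:
        for name, info in header.items():
            start, end = info["data_offsets"]  # relative to the byte buffer
            f.seek(base + start)
            digests[name] = hashlib.sha256(f.read(end - start)).hexdigest()
    return digests


def changed_tensors(path_a, path_b):
    """Names present in both files whose raw bytes differ."""
    a, b = tensor_digests(path_a), tensor_digests(path_b)
    return sorted(name for name in a.keys() & b.keys() if a[name] != b[name])
```

Because the card states that the shard split matches the base, calling changed_tensors on each pair of same-numbered shards would be enough to list the modified q_proj and o_proj tensors.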

SHA256 evidence:

model-00001-of-00002.safetensors: v2 fdcaa60e…1a409ae9dd vs v1 acce9a7a…06771a9dc — differs (L0–L3 q/o_proj).
model-00002-of-00002.safetensors: v2 2bdb2df0…ae2b6b7d = v1 — byte-identical; all tensors in this shard are byte-identical to the base, the file just carries a larger header due to embedded build metadata.
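To check a downloaded shard against the digests listed here, a streaming hash avoids loading a multi-gigabyte file into memory. This is a minimal sketch; the helper name and chunk size are arbitrary choices, not something the recipe prescribes.

```python
import hashlib


def sha256_file(path, chunk_size=1 << 20):
    """Return the hex SHA256 of a file, reading it in fixed-size chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        # iter() with a sentinel stops when read() returns an empty bytes object
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Comparing the first and last characters of the result with the abbreviated values above is a quick sanity check; a full comparison needs the complete digest from the repository's file listing.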

Quality

llama-server Q4_K_M legacy eval (v1 cycle; v2 9-bench NVFP4A16 re-run results are in the canonical 9-bench cohort table below):

HumanEval-chat: 98.78% (164 q) — v2 local rebuild, +1.22 pp vs v1's 97.56%, +0.61 pp vs base 98.17%.
MBPP-chat: 85.20% (500 q) — local rebuild, −0.40 pp vs v1's 85.60%, +0.20 pp vs base 85.00%.

The deltas are within the published per-bench ±1.5 pp CI; treat as "matches base, doesn't regress."

v2 eval status (2026-05-19)

NVFP4A16 9-bench canonical eval under the current methodology (vLLM 0.20.2 stock + --reasoning-parser gemma4 + thinking_token_budget=12288 + Fix-A reasoning_content fallback, greedy sampler) is complete: all 9/9 benches landed (see the cohort table below). HumanEval+ landed at 93.90 % (new best in cohort, +1.22 pp over v5-coder's 92.68 %) and LCB-medium-55 at 96.36 % (tied with base 31B-it for best in cohort).
Republish GGUF + NVFP4A16 quants at the -GGUF sister repo (imatrix on calibration-v5 corpus, updated imatrix.dat) — still pending the canonical eval pass on the new weights.

The Q4_K_M legacy table below remains the v1 reference; the canonical 9-bench table is the authoritative source for this v2 release.

TL;DR

What: L0–L3-only Q-head prune at 12.5% on Gemma 4 31B-it. L4–L59 are byte-identical to google/gemma-4-31B-it.
How: Per-head leave-one-out importance ranking on a teacher-force calibration set across all layers; mask-mode head dropping (no reshape); lstsq O-projection heal computed on every layer but only L0–L3's q_proj/o_proj written into the saved safetensors (L4–L59's healed weights were rolled back to base after the full-prune canary failed).
Quality: Matches the unpruned base on HumanEval-chat and MBPP-chat under chat-completions Q4_K_M (legacy llama-server eval). v2 HE 98.78% (+0.61 pp vs base), MBPP 85.20% (+0.20 pp vs base) — both inside the per-bench ±1.5 pp CI.
Effective parameter savings: ~0.83% in attention only (4 layers × 12.5% Q-prune / 60 layers). Embed / FFN / norms unchanged. The intent of the original v1 was a full 12.5% prune across all 60 layers (~13% attention savings); that build's canary failed, and only the L0–L3 changes survived into the saved checkpoint. Treat this repo as a partial-prune research artifact, not as the full-prune compression model originally targeted.
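The savings figure follows directly from the arithmetic quoted in the last bullet. The sketch below reproduces it under the same simplification the card uses: every layer is weighted equally and only Q-heads are counted. The function name is illustrative.

```python
def attention_savings(pruned_layers, total_layers, prune_frac):
    """Fraction of Q-heads removed across the whole model.

    Simplification taken from the card: each layer counts equally,
    and only the Q-head fraction is considered.
    """
    return pruned_layers * prune_frac / total_layers


partial = attention_savings(pruned_layers=4, total_layers=60, prune_frac=0.125)
full = attention_savings(pruned_layers=60, total_layers=60, prune_frac=0.125)
print(f"L0-L3 only: {partial:.2%}   all 60 layers: {full:.2%}")
```

Since the build uses mask-mode (heads are zeroed, not removed from the tensors), this is a count of inactive heads and not a reduction in on-disk size.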

Code Benchmarks: chat-completions @ Q4_K_M (legacy eval cycle, v1 and v2)

Evaluated via llama-server (build b9095-2-g0b04728) + lm-evaluation-harness chat-completions endpoint with --reasoning-format deepseek --reasoning-budget 8192, greedy sampler (temp=0), q8_0 KV cache, --parallel 2. Mandatory --use_cache SQLite + --log_samples per project protocol.

| Model | HumanEval-chat (164) | MBPP-chat (500) |
| --- | --- | --- |
| `google/gemma-4-31B-it` (base) Q4_K_M | 98.17% ±1.05 | 85.00% ±1.60 |
| `gemma-4-31B-he1-it` v1 (replaced by this upload) Q4_K_M | 97.56% ±1.21 | 85.60% ±1.57 |
| `gemma-4-31B-he1-it` v2 (this model, local rebuild) Q4_K_M | 98.78% ±0.86 | 85.20% ±1.59 |
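The ± values in the table are consistent with a binomial standard error computed with the sample (n − 1) variance. That reading is an assumption: the card does not state the formula, and the pass counts below are inferred from the percentages (for example 160/164 = 97.56%). A small sketch:

```python
import math


def accuracy_stderr(correct, n):
    """Standard error of an accuracy estimate, using sample (n - 1) variance."""
    p = correct / n
    return math.sqrt(p * (1.0 - p) / (n - 1))


def within_noise(acc_a, se_a, acc_b, se_b, z=1.96):
    """True if the gap between two accuracies is below z combined standard errors.

    Treats the two runs as independent; a paired test on per-question
    results would be tighter but needs the logged samples.
    """
    return abs(acc_a - acc_b) < z * math.hypot(se_a, se_b)


# HumanEval-chat pass counts inferred from the table (n = 164)
for label, correct in [("base", 161), ("v1", 160), ("v2", 162)]:
    se = accuracy_stderr(correct, 164)
    print(f"{label}: {100 * correct / 164:.2f}% ±{100 * se:.2f}")
```

Under this reading, the v2 versus base gap on HumanEval-chat (0.61 pp) is well inside the combined uncertainty, which matches the card's "matches base, doesn't regress" framing.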

Both v1 and v2 are L0–L3-only modifications. The published v1 was originally framed as a full-60-layer prune; per-tensor diff against the base shows that framing was incorrect for the actual saved weights. v2 ships with the same effective footprint and a corrected card.

These llama.cpp Q4_K_M chat-completions numbers are the legacy eval cycle; the full re-run under the current canonical methodology (Gemma 4 reasoning parser + thinking budget 12288) landed on 2026-05-19 — see the canonical 9-bench cohort table below. A vLLM-4bit cross-validation was attempted at v1 time but blocked on upstream vLLM gaps (Gemma 4 31B heterogeneous-head QKV split bug in vLLM ≤ 0.19; vLLM 0.20+ requires CUDA driver and glibc combinations not available on the eval hosts at build time). The v2 NVFP4A16 quant in the sister repo clears that path.

Canonical 9-bench NVFP4A16 cohort — Gemma 4 dense + MoE comparison

The current canonical Gemma 4 cohort uses vLLM 0.20.2 stock + --reasoning-parser gemma4 + thinking_token_budget=12288 with a greedy sampler (T=0, top_p=1, top_k=0, do_sample=False) — see omnimergekit/eval/EVAL_PROTOCOL.md. The 128e/v4 columns reuse the published baseline from the v5-it card.

| Bench (n) | 128e ref | 98e v4 | 98e v5 | 98e v5-coder | 31B-it (base) | 31B-he1 (this model) |
| --- | --- | --- | --- | --- | --- | --- |
| GPQA Diamond (198) | 73.23 % | 69.19 % | 68.18 % | 68.69 % | 81.31 % | **84.34 %** |
| GSM8K-100 | 91.00 % | 86.00 % | 91.00 % | 86.00 % | **93.00 %** | 92.00 % |
| MATH-500-100 | 89.00 % | 89.00 % | 90.00 % | 92.00 % | **97.00 %** | 95.00 % |
| AIME 2024 (30) | 73.33 %† | 66.67 %† | 70.00 % | 36.67 %‡ | **76.67 %** | **76.67 %** |
| IFEval-100 (prompt_strict) | 95.00 % | 93.00 % | 89.00 % | 94.00 % | **96.00 %** | **96.00 %** |
| HumanEval-164 chat | 96.95 % | 96.95 % | 93.29 % | **98.17 %** | 97.56 % | **98.17 %** |
| HumanEval+ chat (164) | 92.07 % | 91.46 % | 87.20 % | 92.68 % | 92.07 % | **93.90 %** |
| LCB-medium-55 (v4 split) | 87.27 % | 78.18 % | 80.00 % | 85.45 % | **96.36 %** | **96.36 %** |
| ARC-Challenge chat (1172) | 95.99 % | 95.99 % | 95.82 % | 95.31 % | **98.04 %** | 97.61 % |

Bold = best in row. Sources: 128e and 98e v4 reuse the published baseline from the v5-it card; 98e v5 from solidpc 3090 9-bench (all 9 final on stack-pinned vLLM); 98e v5-coder from its own card; 31B-it (base) and 31B-he1 from L40 pod 37006213. All on vLLM 0.20.2 stock + --reasoning-parser gemma4 + thinking_token_budget=12288 + Fix-A reasoning_content fallback, greedy sampler. ARC numbers were rescored on stack-pinned hosts after the Fix-A patch to recover responses that vLLM had emitted as empty content with the answer stranded in reasoning_content.

† AIME 2024 correction (2026-05-21): the previously published 128e (36.67 %) and v4 (36.67 %) AIME values were produced against aime24 (non-chat) on an earlier vLLM dev build that under-scored AIME for the 26B-A4B class. Re-eval against the aime24_chat shadow task on stack-pinned stock vLLM 0.20.2 gives 128e = 73.33 % (22/30) and v4 = 66.67 % (20/30). These are the values reflected above.

‡ v5-coder AIME = 36.67 % is from the same earlier-stack run and is currently being re-evaluated on the canary-verified stack (T79). The corrected value will be published on the v5-coder card when the re-eval lands.

Methodology: this cohort is now anchored to EVAL_PROTOCOL v3 — stack lock + structural canary + reference anchor bench, see omnimergekit/eval/stack.lock.yaml and structural_canary.py. Any number published in this table passes the canary against stock vLLM 0.20.2 + Fix-A.

What the he1 column tells us

All 9 canonical benches landed for he1 on 2026-05-18 / -19. The picture against base 31B-it:

GPQA Diamond +3.03 pp (84.34 % vs 81.31 %) — the surprise gain. The L0–L3 partial prune (≈0.83 % attention savings) appears to slightly favour multi-domain reasoning paths rather than degrade them.
HumanEval+ 93.90 % (+1.83 pp vs base; new best in the cohort, +1.22 pp over v5-coder's 92.68 %) — the second surprise. The same partial-prune that didn't regress anything also lifts the harder HE+ split above every Gemma 4 variant tested so far.
LCB-medium-55 96.36 % (tied with base for best in cohort) — apples-to-apples with base on the hardest code split; well above every MoE variant.
AIME 2024 76.67 % (tied with base), IFEval 96.00 % (tied), HumanEval 98.17 % (+0.61 pp) — apples-to-apples on instruction-following and code-generation.
GSM8K −1.00 pp, MATH-500 −2.00 pp, ARC −0.43 pp — within their per-bench stderr (±1.7 to ±2.7 pp); not statistically distinguishable from base.

Practical reading: the L0–L3-only modification shows no regression beyond per-bench stderr on any canonical bench, lifts GPQA Diamond by +3.03 pp, and lifts HumanEval+ by +1.83 pp, which makes he1 the top-scoring Gemma 4 variant in this cohort on both of those benches. All 9/9 benches are now complete.
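The per-bench deltas quoted in this section can be recomputed from the cohort table. The values below are transcribed from the 31B-it (base) and 31B-he1 columns; the helper name is illustrative.

```python
BASE = {
    "GPQA Diamond": 81.31, "GSM8K-100": 93.00, "MATH-500-100": 97.00,
    "AIME 2024": 76.67, "IFEval-100": 96.00, "HumanEval-164": 97.56,
    "HumanEval+": 92.07, "LCB-medium-55": 96.36, "ARC-Challenge": 98.04,
}
HE1 = {
    "GPQA Diamond": 84.34, "GSM8K-100": 92.00, "MATH-500-100": 95.00,
    "AIME 2024": 76.67, "IFEval-100": 96.00, "HumanEval-164": 98.17,
    "HumanEval+": 93.90, "LCB-medium-55": 96.36, "ARC-Challenge": 97.61,
}


def deltas_pp(candidate, baseline):
    """Candidate minus baseline, in percentage points, rounded to 2 decimals."""
    return {bench: round(candidate[bench] - baseline[bench], 2) for bench in baseline}


for bench, delta in deltas_pp(HE1, BASE).items():
    print(f"{bench:<16} {delta:+.2f} pp")
```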

Pruning Method

The mechanical recipe is implemented in recipes/gemma4_31b/prune_local_heal.py (omnimergekit) and runs in five phases. The phases below describe what the recipe attempted — the saved weights ultimately only retained the L0–L3 portion (see "What's new in v2" above).

Phase 0: Calibration capture. Teacher-forced forward pass on a short instruction-following corpus; record per-head Q/K/V activations and the residual stream at each layer's attention entry point. Phase 0 runs at fp32 with no accelerate offload so the cache is bit-stable.
Phase 1: nf4 importance. Load the model in 4-bit, run an nf4 forward pass over the same prompts, and compute a per-head leave-one-out importance metric (NLL delta when that head's contribution is zeroed). The 12.5% of heads with the lowest importance in each layer are tagged for removal across all layers (50 sliding-attention + 10 full-attention).
Phase 2: lstsq O-projection heal. For every layer, the kept heads' attention output is held fixed and O' = (full_attn_out)·pinv(kept_attn_out) is solved so that the healed projection reproduces the original post-attention residual on the calibration set in the least-squares sense. Ridge (ridge_rel=0.01) regularizes the pseudo-inverse. The manifest's refit_stats records this for all 60 layers (mean rel-resid 1.1%, max 3.8%).
Phase 3a/b/c: materialize and save. Full-precision materialization on CPU, mask-mode (do not reshape num_attention_heads), staged safetensors save with the prune manifest embedded.
Phase 2.5: AR canary. Generation-mode (not teacher-force) sanity check against three known-good prompts. The original v1 full-60-layer canary failed (manifest carries canary_failed: True); the salvage path kept only L0–L3's q_proj/o_proj changes from the build, restoring L4–L59 to base. v2 re-ran the same partial pipeline locally on CPU and produced the same L0–L3-only saved footprint.
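The heal step in Phase 2 can be illustrated with a small NumPy sketch. It is not the recipe's code: the function names are hypothetical, the weight layout follows the usual (out_features, in_features) linear-layer convention, and the interpretation of ridge_rel as a multiple of the mean Gram diagonal is an assumption. The idea is to refit the O-projection on the kept heads' columns so that it reproduces the original post-attention output on calibration data.

```python
import numpy as np


def heal_o_proj(attn_out, o_weight, kept_cols, ridge_rel=0.01):
    """Ridge least-squares refit of an O-projection onto the kept heads.

    attn_out : (N, H*d) concatenated per-head attention outputs (calibration set)
    o_weight : (D, H*d) original O-projection weight
    kept_cols: column indices belonging to the heads that are kept
    Returns a (D, H*d) weight whose pruned columns are zero (mask-mode).
    """
    target = attn_out @ o_weight.T                # (N, D) original output
    kept = attn_out[:, kept_cols]                 # (N, K) kept-head inputs
    gram = kept.T @ kept                          # (K, K)
    # Assumed meaning of ridge_rel: ridge strength relative to mean diagonal
    lam = ridge_rel * np.trace(gram) / gram.shape[0]
    sol = np.linalg.solve(gram + lam * np.eye(gram.shape[0]), kept.T @ target)
    healed = np.zeros_like(o_weight)
    healed[:, kept_cols] = sol.T
    return healed


def rel_resid(attn_out, o_weight, new_weight):
    """Relative Frobenius error of the new projection against the original."""
    target = attn_out @ o_weight.T
    return float(np.linalg.norm(attn_out @ new_weight.T - target) / np.linalg.norm(target))
```

When the pruned heads' outputs are largely predictable from the kept heads, the healed weight should give a much smaller rel_resid than simply zeroing the pruned columns, which is the quantity the manifest's refit_stats appears to track.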

Why mask-mode and not reshape?

The Gemma 4 31B attention block has per-layer-type structure: sliding-attention layers use head_dim=256, n_kv=16 (GQA 2:1); full-attention layers use head_dim=512, n_kv=4 with attention_k_eq_v=True and no V projection. The per-layer-type Q/K/V/O shapes are intricate. Mask-mode keeps num_attention_heads global and zeros the pruned heads' contributions inside the O-projection, which:

avoids fighting LCM constraints on reshape (only Q=24/16/32 are LCM-legal at 12.5% target across all layer types),
leaves the model loadable by stock transformers with no config patches,
lets the lstsq heal target whatever surviving subspace gives the best post-attention residual.

A reshape variant (T16-Option-B at Q=24) is planned as a follow-up to compare on-disk savings.
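As a rough illustration of mask-mode, the sketch below picks the lowest-importance 12.5% of heads in one layer and zeros the O-projection columns those heads feed, leaving the tensor shape untouched. The helper names, the toy dimensions, and the 32-head layer (implied by n_kv=16 at GQA 2:1) are assumptions for illustration; the actual recipe also rewrites q_proj and then heals the O-projection.

```python
import numpy as np


def lowest_importance_heads(importance, prune_frac=0.125):
    """Indices of the heads with the smallest importance scores."""
    n_prune = int(round(len(importance) * prune_frac))
    order = np.argsort(importance, kind="stable")
    return sorted(order[:n_prune].tolist())


def mask_heads_in_o_proj(o_weight, pruned_heads, head_dim):
    """Zero the O-projection input columns that belong to the pruned heads.

    o_weight has shape (hidden, n_heads * head_dim); the shape is preserved,
    so num_attention_heads in the config does not need to change.
    """
    masked = o_weight.copy()
    for head in pruned_heads:
        masked[:, head * head_dim:(head + 1) * head_dim] = 0.0
    return masked


# Toy example: 32 heads with a small head_dim to keep the arrays tiny
scores = np.linspace(1.0, 0.0, 32)           # later heads matter least here
pruned = lowest_importance_heads(scores)     # 4 heads at 12.5%
weight = np.ones((8, 32 * 2))
masked = mask_heads_in_o_proj(weight, pruned, head_dim=2)
print(pruned, masked.shape)
```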

The BOS gotcha (and why earlier canary runs failed)

tokenizer.__call__(prompt) does not prepend <bos> for Gemma 4 by default; llama.cpp /completion does. The pruned + healed weights are BOS-sensitive: without a leading <bos> they collapse into token loops (額額額, France is France is) even when the same weights decode perfectly under llama.cpp. The fix is a one-line prepend in capture_canary_baseline and is documented in the recipe README. The AR canary in this build had a stale cache that masked this bug for several iterations.
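The two failure modes described here can be guarded against with a pair of small helpers. Both are hypothetical sketches and not the recipe's capture_canary_baseline: one prepends the BOS id when it is missing (the caller supplies the id, for example from tokenizer.bos_token_id), the other flags output whose tail is a short repeating pattern such as "France is France is".

```python
def ensure_bos(token_ids, bos_id):
    """Return token_ids with bos_id prepended unless it already leads."""
    if token_ids and token_ids[0] == bos_id:
        return list(token_ids)
    return [bos_id] + list(token_ids)


def has_token_loop(token_ids, max_period=4, min_repeats=3):
    """True if the sequence ends in a pattern of length <= max_period
    repeated at least min_repeats times."""
    for period in range(1, max_period + 1):
        span = period * min_repeats
        if len(token_ids) < span:
            continue
        tail = token_ids[-span:]
        if all(tail[i] == tail[i % period] for i in range(span)):
            return True
    return False
```

A canary built on these would call ensure_bos before generation and fail the build if has_token_loop fires on any of the known-good prompts; the thresholds are arbitrary and would need tuning on real outputs.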

Files

model-00001-of-00002.safetensors, model-00002-of-00002.safetensors — bf16 weights (≈ 59 GB total). Of the 1188 tensors, 8 differ from the base model: q_proj and o_proj for layers 0, 1, 2, 3 (sliding-attention).
model.safetensors.index.json — shard index (byte-identical to the base; same shard split).
config.json — stock Gemma 4 31B-it config; num_attention_heads unchanged (mask-mode).
tokenizer.json, tokenizer_config.json, chat_template.jinja — copied from the base.
prune_manifest.json — recipe metadata. Note: pruned_layers and refit_stats list all 60 layers because the recipe computed importance + lstsq math for every layer; only L0–L3's q_proj/o_proj actually made it into the saved safetensors.

Usage

Transformers (bf16 / nf4)

from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
import torch

# 4-bit nf4 quantization settings (bitsandbytes)
bnb = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.bfloat16,
                        bnb_4bit_quant_type="nf4", bnb_4bit_use_double_quant=True)
model = AutoModelForCausalLM.from_pretrained(
    "ManniX-ITA/gemma-4-31b-he1-it",
    quantization_config=bnb, device_map={"": 0}, attn_implementation="eager",
)
tok = AutoTokenizer.from_pretrained("ManniX-ITA/gemma-4-31b-he1-it")

msgs = [{"role": "user", "content": "Write a Python function to check if a number is prime."}]
# return_dict=True yields input_ids plus attention_mask regardless of library defaults
inputs = tok.apply_chat_template(msgs, return_tensors="pt", add_generation_prompt=True,
                                 return_dict=True).to(model.device)
out = model.generate(**inputs, max_new_tokens=400, do_sample=False, pad_token_id=tok.eos_token_id)
# Decode only the newly generated tokens
print(tok.decode(out[0, inputs["input_ids"].shape[1]:], skip_special_tokens=True))

llama.cpp (recommended for serving)

llama-server -m gemma-4-31B-he1-it-Q4_K_M.gguf \
    --port 8099 -c 32768 -ngl 99 --no-warmup --parallel 2 \
    --cache-type-k q8_0 --cache-type-v q8_0 \
    --reasoning-format deepseek --reasoning-budget 8192 \
    --temp 1.0 --top-p 0.95 --top-k 64

For greedy code-generation, replace the sampler with --temp 0 --top-p 1 --top-k 0 --reasoning off.
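Once llama-server is running as above, it can be queried from Python through its OpenAI-compatible endpoint. The sketch below uses only the standard library and the port from the command above. The helper names are illustrative, the model string is a placeholder, and passing top_k in the request body is assumed to be accepted as a llama.cpp extension to the OpenAI schema.

```python
import json
import urllib.request


def build_chat_request(prompt, base_url="http://localhost:8099", max_tokens=400):
    """Build a greedy chat-completions request for a local llama-server."""
    payload = {
        "model": "gemma-4-31B-he1-it",   # placeholder label for the loaded GGUF
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0,
        "top_p": 1,
        "top_k": 0,                      # assumed llama.cpp extension field
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        base_url.rstrip("/") + "/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request, timeout=600):
    """Send the request and return the assistant message text."""
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

With the server up, send(build_chat_request("Write a Python function to check if a number is prime.")) would return the generated text.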

GGUF quants: ManniX-ITA/gemma-4-31b-he1-it-GGUF (v1 quants currently; v2 republish pending).

Recipe + Code

OmniMergeKit is the canonical home:

recipes/gemma4_31b/prune_local_heal.py — the full prune + heal + canary pipeline
scripts/replay_prune.sh — preflight / mlx neutralization / dependency pin wrapper for pod runs
eval/EVAL_PROTOCOL.md — canonical eval methodology (HE / MBPP / GPQA / LCB / MPE)

Related Models

| Model | Description |
| --- | --- |
| gemma-4-31b-he1-it-GGUF | imatrix GGUF quants of this model (v1; v2 republish pending) |

License

This model inherits the Gemma license from the base model.

Acknowledgements

Google for the base Gemma 4 31B-it model
bartowski for the calibration data v5 used in imatrix GGUF quantization
The OmniMergeKit project for the prune / heal / canary toolkit
