gemma-4-31B-he1-it — v2 (2026-05-17)
Partial head-prune of google/gemma-4-31B-it: 12.5% of Q-heads removed in the first 4 sliding-attention layers (L0–L3), with an lstsq heal of those layers' O-projections. L4–L59 are byte-identical to the base. The model matches base-model quality on HumanEval and MBPP under chat-completions evaluation; the early-layer perturbation is what the local rebuild captured, and the gains come from the heal solution found on those 4 layers.
The build is fully reproducible via OmniMergeKit (recipes/gemma4_31b/prune_local_heal.py).
What's new in v2
This is a new local CPU rebuild of the same partial-prune recipe, replacing the original pod-built v1 weights.
What's actually different from base (verified via per-tensor diff against google/gemma-4-31B-it):
| Tensor | Layers changed | Layers unchanged |
|---|---|---|
| self_attn.q_proj.weight | 0, 1, 2, 3 | 4 – 59 (all 56) |
| self_attn.o_proj.weight | 0, 1, 2, 3 | 4 – 59 (all 56) |
| self_attn.{k_proj, v_proj}.weight | (none) | 0 – 59 |
| FFN gate/up/down, norms, embed, lm_head | (none) | 0 – 59 |
pruned_layers and refit_stats inside prune_manifest.json list all 60 layers because the recipe computed the lstsq math for every layer; only L0–L3's q_proj/o_proj were actually written into the saved safetensors (L4–L59 kept their base values). v1 had the same L0–L3-only footprint; v2 differs from v1 only on q_proj/o_proj of L0–L3 (slightly different lstsq solutions arising from the local CPU calibration capture).
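A per-tensor diff of this kind can be sketched as follows. This is a minimal illustration, not the script used for the card: it assumes both checkpoints have already been loaded into plain dicts of NumPy arrays (how the bf16 shards are loaded is left out), and the helper names `changed_tensors` and `changed_layers` are hypothetical. The regex only assumes that tensor names contain a `layers.<index>.` segment.

```python
import re
import numpy as np

# Assumption: tensor names contain a "layers.<index>." segment.
LAYER_RE = re.compile(r"layers\.(\d+)\.")

def changed_tensors(base, other):
    """Return the sorted names of tensors whose values differ between two checkpoints."""
    if base.keys() != other.keys():
        raise ValueError("checkpoints do not contain the same tensor names")
    return sorted(n for n in base if not np.array_equal(base[n], other[n]))

def changed_layers(names):
    """Return the sorted layer indices that appear in a list of tensor names."""
    found = set()
    for name in names:
        match = LAYER_RE.search(name)
        if match:
            found.add(int(match.group(1)))
    return sorted(found)

if __name__ == "__main__":
    # Toy stand-in for two checkpoints: only layer 0's q_proj differs.
    base = {
        "model.layers.0.self_attn.q_proj.weight": np.zeros((2, 2)),
        "model.layers.5.self_attn.q_proj.weight": np.ones((2, 2)),
    }
    other = {k: v.copy() for k, v in base.items()}
    other["model.layers.0.self_attn.q_proj.weight"][0, 0] = 1.0
    names = changed_tensors(base, other)
    print(names, changed_layers(names))
```

Run against the real shards, a diff like this would be expected to report only q_proj and o_proj of layers 0 to 3, matching the table above.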
SHA256 evidence:
- model-00001-of-00002.safetensors: v2 fdcaa60e…1a409ae9dd vs v1 acce9a7a…06771a9dc, differs (L0–L3 q/o_proj).
- model-00002-of-00002.safetensors: v2 2bdb2df0…ae2b6b7d = v1, byte-identical; all tensors in this shard are byte-identical to the base, the file just carries a larger header due to embedded build metadata.
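To check a downloaded shard against these digests, a streaming SHA256 is enough. The sketch below uses only the standard library; the function name `sha256_file` and the 1 MiB chunk size are illustrative choices, not part of the recipe.

```python
import hashlib

def sha256_file(path, chunk_size=1 << 20):
    """Stream a file through SHA256 and return the hex digest."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        # Read fixed-size chunks so multi-GB shards never sit fully in memory.
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Calling `sha256_file("model-00001-of-00002.safetensors")` on a local copy and comparing the start and end of the result with the abbreviated digests above is sufficient to tell v1 and v2 apart.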
Quality
llama-server Q4_K_M legacy eval (v1 cycle; v2 9-bench NVFP4A16 re-run results are in the canonical 9-bench cohort table below):
- HumanEval-chat: 98.78% (164 q) — v2 local rebuild, +1.22 pp vs v1's 97.56%, +0.61 pp vs base 98.17%.
- MBPP-chat: 85.20% (500 q) — local rebuild, −0.40 pp vs v1's 85.60%, +0.20 pp vs base 85.00%.
The deltas are within the published per-bench ±1.5 pp CI; treat as "matches base, doesn't regress."
v2 eval status (2026-05-19)
- NVFP4A16 9-bench canonical eval under the current methodology (vLLM 0.20.2 stock + --reasoning-parser gemma4 + thinking_token_budget=12288 + Fix-A reasoning_content fallback, greedy sampler) is complete: all 9/9 benches landed (see the cohort table below). HumanEval+ landed at 93.90 % (new best in cohort, +1.22 pp over v5-coder's 92.68 %) and LCB-medium-55 at 96.36 % (tied with base 31B-it for best in cohort).
- Republish GGUF + NVFP4A16 quants at the -GGUF sister repo (imatrix on calibration-v5 corpus, updated imatrix.dat): still pending the canonical eval pass on the new weights.
The Q4_K_M legacy table below remains the v1 reference; the canonical 9-bench table is the authoritative source for this v2 release.
TL;DR
- What: L0–L3-only Q-head prune at 12.5% on Gemma 4 31B-it. L4–L59 are byte-identical to google/gemma-4-31B-it.
- How: Per-head leave-one-out importance ranking on a teacher-force calibration set across all layers; mask-mode head dropping (no reshape); lstsq O-projection heal computed on every layer but only L0–L3's q_proj/o_proj written into the saved safetensors (L4–L59's healed weights were rolled back to base after the full-prune canary failed).
- Quality: Matches the unpruned base on HumanEval-chat and MBPP-chat under chat-completions Q4_K_M (legacy llama-server eval). v2 HE 98.78% (+0.61 pp vs base), MBPP 85.20% (+0.20 pp vs base), both inside the per-bench ±1.5 pp CI.
- Effective parameter savings: 0.83% in attention only (4 layers × 12.5% Q-prune / 60 layers). Embed / FFN / norms unchanged. The intent of the original v1 was a full 12.5% prune across all 60 layers (13% attention savings); that build's canary failed, and only the L0–L3 changes survived into the saved checkpoint. Treat this repo as a partial-prune research artifact, not as the full-prune compression model originally targeted.
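The 0.83% figure follows directly from the formula quoted in the last bullet. As a quick check, the snippet below reproduces it; the function name is hypothetical, and it follows the card's own simplification of weighting every layer equally.

```python
def effective_attention_savings(pruned_layers, total_layers, q_prune_frac):
    """Fraction of Q-heads removed when only some layers are pruned, all layers weighted equally."""
    return pruned_layers * q_prune_frac / total_layers

if __name__ == "__main__":
    # 4 pruned layers out of 60, 12.5% of Q-heads removed in each pruned layer.
    print(f"{effective_attention_savings(4, 60, 0.125):.2%}")
    # A full 60-layer prune at the same rate, for comparison.
    print(f"{effective_attention_savings(60, 60, 0.125):.2%}")
```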
Code Benchmarks — chat-completions @ Q4_K_M (v1 reference; v2 re-run pending)
Evaluated via llama-server (build b9095-2-g0b04728) + lm-evaluation-harness chat-completions endpoint with --reasoning-format deepseek --reasoning-budget 8192, greedy sampler (temp=0), q8_0 KV cache, --parallel 2. Mandatory --use_cache SQLite + --log_samples per project protocol.
| Model | HumanEval-chat (164) | MBPP-chat (500) |
|---|---|---|
| google/gemma-4-31B-it (base) Q4_K_M | 98.17% ±1.05 | 85.00% ±1.60 |
| gemma-4-31B-he1-it v1 (replaced by this upload) Q4_K_M | 97.56% ±1.21 | 85.60% ±1.57 |
| gemma-4-31B-he1-it v2 (this model, local rebuild) Q4_K_M | 98.78% ±0.86 | 85.20% ±1.59 |
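The ± values in the table are consistent with the standard error of a mean of 0/1 scores computed with the sample (n − 1) variance. That reading is an inference from the numbers, not something the card states, and the pass counts used below (for example 161 of 164) are likewise back-computed from the percentages. A minimal sketch with a hypothetical helper name:

```python
import math

def sample_stderr_pct(correct, total):
    """Standard error (in percentage points) of a pass rate, using the n - 1 sample variance."""
    p = correct / total
    return 100.0 * math.sqrt(p * (1.0 - p) / (total - 1))

if __name__ == "__main__":
    rows = [
        ("base HumanEval", 161, 164),
        ("v1 HumanEval", 160, 164),
        ("v2 HumanEval", 162, 164),
        ("base MBPP", 425, 500),
    ]
    for label, correct, total in rows:
        rate = 100.0 * correct / total
        print(f"{label}: {rate:.2f}% ±{sample_stderr_pct(correct, total):.2f}")
```

With standard errors of this size, a one- or two-problem difference on HumanEval (0.61 to 1.22 pp) sits inside a single standard error, which is why the card treats v1, v2, and base as indistinguishable on these two benches.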
Both v1 and v2 are L0–L3-only modifications. The published v1 was originally framed as a full-60-layer prune; per-tensor diff against the base shows that framing was incorrect for the actual saved weights. v2 ships with the same effective footprint and a corrected card.
These llama.cpp Q4_K_M chat-completions numbers are the legacy eval cycle; the full re-run under the current canonical methodology (Gemma 4 reasoning parser + thinking budget 12288) landed on 2026-05-19 — see the canonical 9-bench cohort table below. A vLLM-4bit cross-validation was attempted at v1 time but blocked on upstream vLLM gaps (Gemma 4 31B heterogeneous-head QKV split bug in vLLM ≤ 0.19; vLLM 0.20+ requires CUDA driver and glibc combinations not available on the eval hosts at build time). The v2 NVFP4A16 quant in the sister repo clears that path.
Canonical 9-bench NVFP4A16 cohort — Gemma 4 dense + MoE comparison
The current canonical Gemma 4 cohort uses vLLM 0.20.2 stock + --reasoning-parser gemma4 + thinking_token_budget=12288 with a greedy sampler (T=0, top_p=1, top_k=0, do_sample=False) — see omnimergekit/eval/EVAL_PROTOCOL.md. The 128e/v4 columns reuse the published baseline from the v5-it card.
| Bench (n) | 128e ref | 98e v4 | 98e v5 | 98e v5-coder | 31B-it (base) | 31B-he1 (this model) |
|---|---|---|---|---|---|---|
| GPQA Diamond (198) | 73.23 % | 69.19 % | 68.18 % | 68.69 % | 81.31 % | **84.34 %** |
| GSM8K-100 | 91.00 % | 86.00 % | 91.00 % | 86.00 % | **93.00 %** | 92.00 % |
| MATH-500-100 | 89.00 % | 89.00 % | 90.00 % | 92.00 % | **97.00 %** | 95.00 % |
| AIME 2024 (30) | 73.33 %† | 66.67 %† | 70.00 % | 36.67 %‡ | **76.67 %** | **76.67 %** |
| IFEval-100 (prompt_strict) | 95.00 % | 93.00 % | 89.00 % | 94.00 % | **96.00 %** | **96.00 %** |
| HumanEval-164 chat | 96.95 % | 96.95 % | 93.29 % | **98.17 %** | 97.56 % | **98.17 %** |
| HumanEval+ chat (164) | 92.07 % | 91.46 % | 87.20 % | 92.68 % | 92.07 % | **93.90 %** |
| LCB-medium-55 (v4 split) | 87.27 % | 78.18 % | 80.00 % | 85.45 % | **96.36 %** | **96.36 %** |
| ARC-Challenge chat (1172) | 95.99 % | 95.99 % | 95.82 % | 95.31 % | **98.04 %** | 97.61 % |
Bold = best in row. Sources: 128e and 98e v4 reuse the published baseline from the v5-it card; 98e v5 from solidpc 3090 9-bench (all 9 final on stack-pinned vLLM); 98e v5-coder from its own card; 31B-it (base) and 31B-he1 from L40 pod 37006213. All on vLLM 0.20.2 stock + --reasoning-parser gemma4 + thinking_token_budget=12288 + Fix-A reasoning_content fallback, greedy sampler. ARC numbers were rescored on stack-pinned hosts after the Fix-A patch to recover responses that vLLM had emitted as empty content with the answer stranded in reasoning_content.
† AIME 2024 correction (2026-05-21): the previously published 128e (36.67 %) and v4 (36.67 %) AIME values were produced against aime24 (non-chat) on an earlier vLLM dev build that under-scored AIME for the 26B-A4B class. Re-eval against the aime24_chat shadow task on stack-pinned stock vLLM 0.20.2 gives 128e = 73.33 % (22/30) and v4 = 66.67 % (20/30). These are the values reflected above.

‡ v5-coder AIME = 36.67 % is from the same earlier-stack run and is currently being re-evaluated on the canary-verified stack (T79). The corrected value will be published on the v5-coder card when the re-eval lands.

Methodology: this cohort is now anchored to EVAL_PROTOCOL v3 (stack lock + structural canary + reference anchor bench); see omnimergekit/eval/stack.lock.yaml and structural_canary.py. Any number published in this table passes the canary against stock vLLM 0.20.2 + Fix-A.
What the he1 column tells us
All 9 canonical benches landed for he1 on 2026-05-18 / -19. The picture against base 31B-it:
- GPQA Diamond +3.03 pp (84.34 % vs 81.31 %) — the surprise gain. The L0–L3 partial prune (≈0.83 % attention savings) appears to slightly favour multi-domain reasoning paths rather than degrade them.
- HumanEval+ 93.90 % (+1.83 pp vs base; new best in the cohort, +1.22 pp over v5-coder's 92.68 %) — the second surprise. The same partial-prune that didn't regress anything also lifts the harder HE+ split above every Gemma 4 variant tested so far.
- LCB-medium-55 96.36 % (tied with base for best in cohort) — apples-to-apples with base on the hardest code split; well above every MoE variant.
- AIME 2024 76.67 % (tied with base), IFEval 96.00 % (tied), HumanEval 98.17 % (+0.61 pp) — apples-to-apples on instruction-following and code-generation.
- GSM8K −1.00 pp, MATH-500 −2.00 pp, ARC −0.43 pp — within their per-bench stderr (±1.7 to ±2.7 pp); not statistically distinguishable from base.
Practical reading: the L0–L3-only modification doesn't regress on any canonical bench, lifts GPQA Diamond by +3 pp, and lifts HumanEval+ by +1.83 pp — making he1 the strongest Gemma 4 variant in this cohort on the two hardest benches. All 9/9 benches are now complete.
Pruning Method
The mechanical recipe is implemented in recipes/gemma4_31b/prune_local_heal.py (omnimergekit) and runs in five phases. The phases below describe what the recipe attempted — the saved weights ultimately only retained the L0–L3 portion (see "What's new in v2" above).
- Phase 0: Calibration capture. Teacher-force forward pass on a short instruction-following corpus; record per-head Q/K/V activations and the residual stream at each layer's attention entry point. Phase 0 runs at fp32 with no accelerate offload so the cache is bit-stable.
- Phase 1: nf4 importance. Load the model in 4-bit, run an nf4 forward over the same prompts, compute a per-head leave-one-out importance metric (NLL delta when that head's contribution is zeroed). The 12.5% lowest-importance heads per layer are tagged for removal across all layers (50 sliding-attention + 10 full-attention).
- Phase 2: lstsq O-projection heal. For every layer, the kept-heads' attention output is set and O' = (full_attn_out)·pinv(kept_attn_out) solves for the post-attention residual on the calibration set in the least-squares sense. Ridge (ridge_rel=0.01) regularizes the pseudo-inverse. The manifest's refit_stats records this for all 60 layers (mean rel-resid 1.1%, max 3.8%).
- Phase 3a/b/c: materialize and save. Full-precision materialization on CPU, mask-mode (do not reshape num_attention_heads), staged safetensors save with the prune manifest embedded.
- Phase 2.5: AR canary. Generation-mode (not teacher-force) sanity check against three known-good prompts. The original v1 full-60-layer canary failed (manifest carries canary_failed: True); the salvage path kept only L0–L3's q_proj/o_proj changes from the build, restoring L4–L59 to base. v2 re-ran the same partial pipeline locally on CPU and produced the same L0–L3-only saved footprint.
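The Phase 2 heal can be illustrated with a small NumPy sketch. This is one plausible reading of the step, not the code in prune_local_heal.py: it refits the O-projection so that the kept heads' outputs reproduce the original post-attention output on the calibration tokens, using a ridge-regularized normal-equations solve. The function name, the argument layout, and the way `ridge_rel` is scaled (relative to the mean diagonal of the Gram matrix) are all assumptions made for illustration.

```python
import numpy as np

def heal_o_proj(attn_out, w_o, keep_mask, head_dim, ridge_rel=0.01):
    """Refit an O-projection after head masking (illustrative sketch).

    attn_out : (tokens, n_heads * head_dim) concatenated per-head attention outputs
    w_o      : (hidden, n_heads * head_dim) original O-projection weight
    keep_mask: (n_heads,) bool array, False marks a pruned head
    Returns the healed weight (same shape as w_o) and the relative residual.
    """
    col_keep = np.repeat(keep_mask, head_dim)      # expand head mask to columns
    target = attn_out @ w_o.T                      # original post-attention output
    x = attn_out[:, col_keep]                      # kept heads only
    gram = x.T @ x
    # Assumption: ridge strength is relative to the mean diagonal of the Gram matrix.
    lam = ridge_rel * np.trace(gram) / gram.shape[0]
    w_kept = np.linalg.solve(gram + lam * np.eye(gram.shape[0]), x.T @ target)
    w_new = np.zeros_like(w_o)                     # mask-mode: pruned columns stay zero
    w_new[:, col_keep] = w_kept.T
    rel_resid = np.linalg.norm(x @ w_kept - target) / np.linalg.norm(target)
    return w_new, rel_resid

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    attn = rng.standard_normal((256, 32))          # toy: 8 heads of dim 4
    w_o = rng.standard_normal((16, 32))
    keep = np.ones(8, dtype=bool)
    keep[0] = False                                # prune one head
    healed, resid = heal_o_proj(attn, w_o, keep, head_dim=4)
    print(healed.shape, round(float(resid), 3))
```

On random toy data the residual stays large because independent heads cannot stand in for each other; the low residuals reported in refit_stats would correspond to real attention heads being partly redundant.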
Why mask-mode and not reshape?
The Gemma 4 31B attention block has per-layer-type structure: sliding-attention layers use head_dim=256, n_kv=16 (GQA 2:1); full-attention layers use head_dim=512, n_kv=4 with attention_k_eq_v=True and no V projection. The per-layer-type Q/K/V/O shapes are intricate. Mask-mode keeps num_attention_heads global and zeros the pruned heads' contributions inside the O-projection, which:
- avoids fighting LCM constraints on reshape (only Q=24/16/32 are LCM-legal at 12.5% target across all layer types),
- leaves the model loadable by stock transformers with no config patches,
- lets the lstsq heal target whatever surviving subspace gives the best post-attention residual.
A reshape variant (T16-Option-B at Q=24) is planned as a follow-up to compare on-disk savings.
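To make the mask-mode idea concrete, the sketch below picks the lowest-importance heads and zeroes their columns in an O-projection without changing any tensor shape. It is a simplified illustration with hypothetical helper names; the 32-head toy layer is an illustrative value, and the importance scores are random stand-ins for the leave-one-out NLL deltas.

```python
import numpy as np

def lowest_importance_mask(importance, prune_frac=0.125):
    """Return a bool keep-mask that drops the prune_frac least important heads."""
    n_heads = importance.shape[0]
    n_prune = int(round(n_heads * prune_frac))
    pruned = np.argsort(importance, kind="stable")[:n_prune]
    keep = np.ones(n_heads, dtype=bool)
    keep[pruned] = False
    return keep

def mask_o_proj(w_o, keep, head_dim):
    """Zero the O-projection columns of pruned heads; the weight shape is unchanged."""
    masked = w_o.copy()
    masked[:, ~np.repeat(keep, head_dim)] = 0.0
    return masked

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    importance = rng.random(32)                  # stand-in scores for 32 heads
    keep = lowest_importance_mask(importance)
    w_o = rng.standard_normal((64, 32 * 8))      # toy hidden=64, head_dim=8
    masked = mask_o_proj(w_o, keep, head_dim=8)
    print(int(keep.sum()), masked.shape == w_o.shape)
```

Because shapes never change, the config's num_attention_heads stays valid, which is the property the list above relies on.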
The BOS gotcha (and why earlier canary runs failed)
tokenizer.__call__(prompt) does not prepend <bos> for Gemma 4 by default; llama.cpp /completion does. The pruned + healed weights are BOS-sensitive: without a leading <bos> they collapse into token loops (額額額, France is France is) even when the same weights decode perfectly under llama.cpp. The fix is a one-line prepend in capture_canary_baseline and is documented in the recipe README. The AR canary in this build had a stale cache that masked this bug for several iterations.
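The card does not show the actual one-line fix, so the helper below is a hypothetical stand-in for it: it prepends the BOS id to a token list only when it is missing, which keeps it safe to call on inputs that already start with BOS.

```python
def ensure_bos(input_ids, bos_token_id):
    """Return a copy of input_ids that starts with bos_token_id exactly once at the front."""
    ids = list(input_ids)
    if not ids or ids[0] != bos_token_id:
        ids.insert(0, bos_token_id)
    return ids

if __name__ == "__main__":
    # Toy ids; with a Hugging Face tokenizer these would come from
    # tok(prompt)["input_ids"] and tok.bos_token_id.
    print(ensure_bos([10, 11, 12], bos_token_id=2))
    print(ensure_bos([2, 10, 11, 12], bos_token_id=2))
```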
Files
- model-00001-of-00002.safetensors, model-00002-of-00002.safetensors: bf16 weights (≈ 59 GB total). Of the 1188 tensors, 8 differ from the base model: q_proj and o_proj for layers 0, 1, 2, 3 (sliding-attention).
- model.safetensors.index.json: shard index (byte-identical to the base; same shard split).
- config.json: stock Gemma 4 31B-it config; num_attention_heads unchanged (mask-mode).
- tokenizer.json, tokenizer_config.json, chat_template.jinja: copied from the base.
- prune_manifest.json: recipe metadata. Note: pruned_layers and refit_stats list all 60 layers because the recipe computed importance + lstsq math for every layer; only L0–L3's q_proj/o_proj actually made it into the saved safetensors.
Usage
Transformers (bf16 / nf4)
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

# 4-bit nf4 quantization config (bitsandbytes)
bnb = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.bfloat16,
                         bnb_4bit_quant_type="nf4", bnb_4bit_use_double_quant=True)
model = AutoModelForCausalLM.from_pretrained(
    "ManniX-ITA/gemma-4-31b-he1-it",
    quantization_config=bnb, device_map={"": 0}, attn_implementation="eager",
)
tok = AutoTokenizer.from_pretrained("ManniX-ITA/gemma-4-31b-he1-it")
msgs = [{"role": "user", "content": "Write a Python function to check if a number is prime."}]
# return_dict=True makes the return type explicit (input_ids + attention_mask)
inputs = tok.apply_chat_template(msgs, return_tensors="pt", add_generation_prompt=True,
                                 return_dict=True).to(model.device)
out = model.generate(**inputs, max_new_tokens=400, do_sample=False, pad_token_id=tok.eos_token_id)
# Decode only the newly generated tokens
print(tok.decode(out[0, inputs["input_ids"].shape[1]:], skip_special_tokens=True))
llama.cpp (recommended for serving)
llama-server -m gemma-4-31B-he1-it-Q4_K_M.gguf \
--port 8099 -c 32768 -ngl 99 --no-warmup --parallel 2 \
--cache-type-k q8_0 --cache-type-v q8_0 \
--reasoning-format deepseek --reasoning-budget 8192 \
--temp 1.0 --top-p 0.95 --top-k 64
For greedy code-generation, replace the sampler with --temp 0 --top-p 1 --top-k 0 --reasoning off.
GGUF quants: ManniX-ITA/gemma-4-31b-he1-it-GGUF (v1 quants currently; v2 republish pending).
Recipe + Code
OmniMergeKit is the canonical home:
- recipes/gemma4_31b/prune_local_heal.py: the full prune + heal + canary pipeline
- scripts/replay_prune.sh: preflight / mlx neutralization / dependency pin wrapper for pod runs
- eval/EVAL_PROTOCOL.md: canonical eval methodology (HE / MBPP / GPQA / LCB / MPE)
Related Models
| Model | Description |
|---|---|
| gemma-4-31b-he1-it-GGUF | imatrix GGUF quants of this model (v1; v2 republish pending) |
License
This model inherits the Gemma license from the base model.
Acknowledgements
- Google for the base Gemma 4 31B-it model
- bartowski for the calibration data v5 used in imatrix GGUF quantization
- The OmniMergeKit project for the prune / heal / canary toolkit