Gemma 4 31B IT – IQ3_XS GGUF

Gemma 4 31B Instruct quantized to IQ3_XS (3.40 BPW, 13.1GB) using llama.cpp with importance matrix calibration on CLI/package management data.

Matches the f16 baseline on Char-F1 (84.76%) at roughly 1/4.7 of its size (13.1 GB vs 61.4 GB), measured on a 50-example NL2Bash benchmark.

💡 Looking for the smallest possible variant?

Try our gemma-4-31b-it-IQ2_M-GGUF: 10.17 GB with Char-F1 84.71% and BLEU-4 22.39 (beats f16 while being 6× smaller). Built with a custom CLI-tuned imatrix on IQ2_M.

Key Stats

| Metric | Value |
|---|---|
| Base model | google/gemma-4-31b-it |
| Quantization | IQ3_XS (3.40 BPW) |
| Size | 13.1 GB |
| Layers | 60 (full model, no pruning) |
| NL2Bash Char-F1 | 84.76% (matches f16 baseline) |
| CLI benchmark | 7/7 with thinking enabled |
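The file size in the table lines up with the bits-per-weight figure. A quick sanity check (pure arithmetic, assuming ~31B weights and decimal GB; the real file deviates slightly because embeddings and metadata are stored at different precisions):

```python
params = 31e9   # approximate parameter count
bpw = 3.40      # bits per weight for this IQ3_XS quant

# bits -> bytes -> decimal gigabytes
size_gb = params * bpw / 8 / 1e9

print(round(size_gb, 1))  # ~13.2, close to the reported 13.1 GB
```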

NL2Bash Benchmark โ€” Full Comparison

50 examples from the official Stanford/Tellina test split, reasoning OFF, max_tokens=200, temp=0.1, sorted by Char-F1.

| Model | Size | BPW | Char-F1 | BLEU-1 | BLEU-2 | BLEU-4 | EM |
|---|---|---|---|---|---|---|---|
| Unsloth UD-IQ3_XXS | 11.84 GB | ~2.98 | 85.06% | 46.30 | 34.64 | 20.26 | 12% |
| Base f16 (full precision) | 61.4 GB | 16.0 | 84.76% | 43.94 | 33.84 | 21.02 | 12% |
| Base IQ3_XS (this model) | 13.1 GB | 3.40 | 84.76% | 42.77 | 31.46 | 18.95 | 8% |
| ✨ Sibling IQ2_M (CLI imatrix) | 10.17 GB | 2.84 | 84.71% | 44.72 | 34.36 | 22.39 | 12% |
| Unsloth UD-IQ2_M | 10.75 GB | ~2.70 | 84.02% | 42.38 | 31.73 | 18.64 | 10% |
| Unsloth UD-IQ2_XXS | 8.53 GB | ~2.06 | 76.82% | 33.03 | 22.17 | 11.49 | 0% |
| Base Q2_K (3rd party) | 11.0 GB | 2.70 | 58.60% | 18.82 | 12.79 | 6.92 | 0% |

Key observations

  • This IQ3_XS ties f16 on Char-F1 (84.76%) and comes within 0.3 F1 of the best quantization tested (UD-IQ3_XXS, 85.06%)
  • Unsloth Dynamic 2.0 defines the Pareto frontier: UD-IQ2_M (10.75 GB) strictly beats plain Q2_K (+25.4 F1 points at the same ~2.7 BPW), thanks to per-layer adaptive bit allocation and a Gemma-4-specific imatrix
  • Q2_K at 2.7 BPW collapses on Gemma 4 (F1 58.6%, repetitive output); standard 2-bit scalar quantization is not viable for this architecture
  • f16 wins only on token-exact metrics (BLEU-4, EM); IQ3_XS produces semantically equivalent commands with minor stylistic variations (different flag order, quoting style)

Recommendation

  • Smallest working quant: Unsloth UD-IQ2_M (10.75 GB, F1 84%)
  • Best overall: Unsloth UD-IQ3_XXS (11.84 GB, F1 85%)
  • This model (IQ3_XS, 13.1 GB): simpler imatrix calibration (CLI-focused), slightly larger but within 0.3 F1 of UD-IQ3_XXS

Full per-question predictions + layer analysis study: https://huggingface.co/datasets/KikoCis/gemma4-31b-layer-study

CLI Benchmark Results (7/7 with thinking)

Tested with llama-cli -cnv --reasoning on --reasoning-budget 512:

| Test | Result |
|---|---|
| Install neofetch on Void Linux | `sudo xbps-install -S neofetch` |
| Install htop on Ubuntu | `sudo apt install htop` |
| Search ripgrep on Arch | `pacman -Ss ripgrep` |
| Search packages on Void | `xbps-query -S <package_name>` |
| Add cargo to PATH in zsh | `echo 'export PATH="$HOME/.cargo/bin:$PATH"' >> ~/.zshrc` |
| Install jq on macOS | `brew install jq` |
| Grep TODO in /var/www | `grep -r "TODO" /var/www` |
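A pass/fail decision for tests like these can be as simple as a whitespace-normalized containment check. This is a hypothetical sketch of such a grader (the actual harness lives in the dataset repo linked above; `passes` and its criterion are illustrative, not the published scoring code):

```python
import re

def passes(model_output: str, expected_cmd: str) -> bool:
    """Pass if the expected command appears verbatim in the model's
    output, after collapsing runs of whitespace on both sides."""
    norm = lambda s: re.sub(r"\s+", " ", s).strip()
    return norm(expected_cmd) in norm(model_output)

print(passes("You can run:\n  sudo apt install htop\n", "sudo apt install htop"))  # True
print(passes("apt-get install htop", "sudo apt install htop"))                    # False
```

A containment check (rather than exact match) tolerates surrounding explanation text, which chat-tuned models often emit even when asked for a bare command.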

Usage

```bash
# llama.cpp with thinking
llama-cli -m gemma4-31b-IQ3_XS.gguf -cnv -ngl 99 --ctx-size 8192 \
  --reasoning on --reasoning-budget 512
```

```bash
# Ollama
cat > Modelfile << 'EOF'
FROM ./gemma4-31b-IQ3_XS.gguf
PARAMETER temperature 0.1
PARAMETER num_ctx 8192
EOF
ollama create gemma4-31b-iq3xs -f Modelfile
```

Quantization Details

  • Tool: llama.cpp (build 0d049d6)
  • Imatrix: Computed from 200 chunks of synthetic CLI/package management data
  • Source: Converted from google/gemma-4-31b-it safetensors via convert_hf_to_gguf.py
  • Architecture: Gemma4 with sliding + full attention pattern (every 6th layer is full attention)

Hardware Requirements

  • Minimum RAM: 16GB (with partial offload)
  • Recommended: Apple Silicon with 32GB+ unified memory
  • Performance: ~21 tok/s on M4 Max 128GB with full GPU offload

Benchmark Methodology

  • NL2Bash test set: 50 examples from the official deterministic split (RANDOM_SEED=100, fold 11 from TellinaTool/nl2bash)
  • Inference: Each question runs as an isolated llama-cli subprocess with -p prompt, --reasoning off, --ctx-size 4096, --max-tokens 200, --temp 0.1
  • Scoring: NLTK BLEU corpus-level + character-level F1 + exact match
  • Reproducibility: All scripts and JSONL predictions are public at the dataset repo above
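For readers unfamiliar with the Char-F1 metric used throughout this card: it treats prediction and reference as bags of characters and computes F1 over their overlap. A minimal sketch of one common definition (the exact scoring script is in the dataset repo; details such as whitespace handling may differ there):

```python
from collections import Counter

def char_f1(pred: str, ref: str) -> float:
    """Character-level F1: harmonic mean of precision and recall
    over the multiset intersection of characters."""
    pc, rc = Counter(pred), Counter(ref)
    overlap = sum((pc & rc).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pc.values())
    recall = overlap / sum(rc.values())
    return 2 * precision * recall / (precision + recall)

print(char_f1("ls -la", "ls -la"))  # 1.0
print(char_f1("abc", "xyz"))        # 0.0
```

Because it scores characters rather than tokens, this metric rewards commands that differ only in flag order or quoting style, which is why IQ3_XS can tie f16 here while trailing on BLEU-4 and exact match.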

Real-World Agent Test Warning (April 2026)

Benchmark scores do not predict agent capability. In Docker-based autonomous testing, fine-tuned E4B models (95% BFCL) scored 0/10 while the unfine-tuned base scored 6/10. Fine-tuning for BFCL destroyed general reasoning (error recovery, strategy adaptation, anti-repetition). Fine-tuned E4B models have been withdrawn.

For autonomous agent tasks, use the base Gemma 4 model or a larger model at higher BPW. See: The Benchmark Trap (full study).
