Gemma 4 31B IT – IQ3_XS GGUF

Gemma 4 31B Instruct quantized to IQ3_XS (3.40 BPW, 13.1GB) using llama.cpp with importance matrix calibration on CLI/package management data.

Matches the f16 baseline on Char-F1 (84.76%) at roughly 1/4.7 of its size (13.1 GB vs 61.4 GB), measured on a 50-example NL2Bash benchmark.

💡 Looking for the smallest possible variant?

Try our gemma-4-31b-it-IQ2_M-GGUF: 10.17 GB with Char-F1 84.71% and BLEU-4 22.39 (beats f16 while being 6× smaller). Built with a custom CLI-tuned imatrix on IQ2_M.

Key Stats

| Metric | Value |
|---|---|
| Base model | google/gemma-4-31b-it |
| Quantization | IQ3_XS (3.40 BPW) |
| Size | 13.1 GB |
| Layers | 60 (full model, no pruning) |
| NL2Bash Char-F1 | 84.76% (matches f16 baseline) |
| CLI benchmark | 7/7 with thinking enabled |
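The file size in the table lines up with the bits-per-weight figure. A quick sanity check (pure arithmetic, assuming ~31B weights and decimal GB; the real file deviates slightly because embeddings and metadata are stored at different precisions):

```python
params = 31e9   # approximate parameter count
bpw = 3.40      # bits per weight for this IQ3_XS quant

# bits -> bytes -> decimal gigabytes
size_gb = params * bpw / 8 / 1e9

print(round(size_gb, 1))  # ~13.2, close to the reported 13.1 GB
```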

NL2Bash Benchmark โ€” Full Comparison

50 examples from the official Stanford/Tellina test split, reasoning OFF, max_tokens=200, temp=0.1, sorted by Char-F1.

| Model | Size | BPW | Char-F1 | BLEU-1 | BLEU-2 | BLEU-4 | EM |
|---|---|---|---|---|---|---|---|
| Unsloth UD-IQ3_XXS | 11.84 GB | ~2.98 | 85.06% | 46.30 | 34.64 | 20.26 | 12% |
| Base f16 (full precision) | 61.4 GB | 16.0 | 84.76% | 43.94 | 33.84 | 21.02 | 12% |
| Base IQ3_XS (this model) | 13.1 GB | 3.40 | 84.76% | 42.77 | 31.46 | 18.95 | 8% |
| ✨ Sibling IQ2_M (CLI imatrix) | 10.17 GB | 2.84 | 84.71% | 44.72 | 34.36 | 22.39 | 12% |
| Unsloth UD-IQ2_M | 10.75 GB | ~2.70 | 84.02% | 42.38 | 31.73 | 18.64 | 10% |
| Unsloth UD-IQ2_XXS | 8.53 GB | ~2.06 | 76.82% | 33.03 | 22.17 | 11.49 | 0% |
| Base Q2_K (3rd party) | 11.0 GB | 2.70 | 58.60% | 18.82 | 12.79 | 6.92 | 0% |

Key observations

  • This IQ3_XS ties f16 on Char-F1 (84.76%) and comes within 0.3 F1 of the best quantization tested (UD-IQ3_XXS, 85.06%)
  • Unsloth Dynamic 2.0 defines the Pareto frontier: UD-IQ2_M (10.75 GB) strictly beats plain Q2_K (+25.4 F1 points at the same ~2.7 BPW), thanks to per-layer adaptive bit allocation and a Gemma-4-specific imatrix
  • Q2_K at 2.7 BPW collapses on Gemma 4 (F1 58.6%, repetitive output); standard 2-bit scalar quantization is not viable for this architecture
  • f16 wins only on token-exact metrics (BLEU-4, EM); IQ3_XS produces semantically equivalent commands with minor stylistic variations (different flag order, quoting style)

Recommendation

  • Smallest working quant: Unsloth UD-IQ2_M (10.75 GB, F1 84%)
  • Best overall: Unsloth UD-IQ3_XXS (11.84 GB, F1 85%)
  • This model (IQ3_XS, 13.1 GB): simpler imatrix calibration (CLI-focused), slightly larger but within 0.3 F1 of UD-IQ3_XXS

Full per-question predictions + layer analysis study: https://huggingface.co/datasets/KikoCis/gemma4-31b-layer-study

CLI Benchmark Results (7/7 with thinking)

Tested with llama-cli -cnv --reasoning on --reasoning-budget 512:

| Test | Result |
|---|---|
| Install neofetch on Void Linux | `sudo xbps-install -S neofetch` |
| Install htop on Ubuntu | `sudo apt install htop` |
| Search ripgrep on Arch | `pacman -Ss ripgrep` |
| Search packages on Void | `xbps-query -S <package_name>` |
| Add cargo to PATH in zsh | `echo 'export PATH="$HOME/.cargo/bin:$PATH"' >> ~/.zshrc` |
| Install jq on macOS | `brew install jq` |
| Grep TODO in /var/www | `grep -r "TODO" /var/www` |
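A pass/fail decision for tests like these can be as simple as a whitespace-normalized containment check. This is a hypothetical sketch of such a grader (the actual harness lives in the dataset repo linked above; `passes` and its criterion are illustrative, not the published scoring code):

```python
import re

def passes(model_output: str, expected_cmd: str) -> bool:
    """Pass if the expected command appears verbatim in the model's
    output, after collapsing runs of whitespace on both sides."""
    norm = lambda s: re.sub(r"\s+", " ", s).strip()
    return norm(expected_cmd) in norm(model_output)

print(passes("You can run:\n  sudo apt install htop\n", "sudo apt install htop"))  # True
print(passes("apt-get install htop", "sudo apt install htop"))                    # False
```

A containment check (rather than exact match) tolerates surrounding explanation text, which chat-tuned models often emit even when asked for a bare command.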

Usage

```bash
# llama.cpp with thinking
llama-cli -m gemma4-31b-IQ3_XS.gguf -cnv -ngl 99 --ctx-size 8192 \
  --reasoning on --reasoning-budget 512
```

```bash
# Ollama
cat > Modelfile << 'EOF'
FROM ./gemma4-31b-IQ3_XS.gguf
PARAMETER temperature 0.1
PARAMETER num_ctx 8192
EOF
ollama create gemma4-31b-iq3xs -f Modelfile
```

Quantization Details

  • Tool: llama.cpp (build 0d049d6)
  • Imatrix: Computed from 200 chunks of synthetic CLI/package management data
  • Source: Converted from google/gemma-4-31b-it safetensors via convert_hf_to_gguf.py
  • Architecture: Gemma4 with sliding + full attention pattern (every 6th layer is full attention)

Hardware Requirements

  • Minimum RAM: 16GB (with partial offload)
  • Recommended: Apple Silicon with 32GB+ unified memory
  • Performance: ~21 tok/s on M4 Max 128GB with full GPU offload

Benchmark Methodology

  • NL2Bash test set: 50 examples from the official deterministic split (RANDOM_SEED=100, fold 11 from TellinaTool/nl2bash)
  • Inference: Each question runs as an isolated llama-cli subprocess with -p prompt, --reasoning off, --ctx-size 4096, --max-tokens 200, --temp 0.1
  • Scoring: NLTK BLEU corpus-level + character-level F1 + exact match
  • Reproducibility: All scripts and JSONL predictions are public at the dataset repo above
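For readers unfamiliar with the Char-F1 metric used throughout this card: it treats prediction and reference as bags of characters and computes F1 over their overlap. A minimal sketch of one common definition (the exact scoring script is in the dataset repo; details such as whitespace handling may differ there):

```python
from collections import Counter

def char_f1(pred: str, ref: str) -> float:
    """Character-level F1: harmonic mean of precision and recall
    over the multiset intersection of characters."""
    pc, rc = Counter(pred), Counter(ref)
    overlap = sum((pc & rc).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pc.values())
    recall = overlap / sum(rc.values())
    return 2 * precision * recall / (precision + recall)

print(char_f1("ls -la", "ls -la"))  # 1.0
print(char_f1("abc", "xyz"))        # 0.0
```

Because it scores characters rather than tokens, this metric rewards commands that differ only in flag order or quoting style, which is why IQ3_XS can tie f16 here while trailing on BLEU-4 and exact match.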

Real-World Agent Test Warning (April 2026)

Benchmark scores do not predict agent capability. In Docker-based autonomous testing, fine-tuned E4B models (95% BFCL) scored 0/10 while the unfine-tuned base scored 6/10. Fine-tuning for BFCL destroyed general reasoning (error recovery, strategy adaptation, anti-repetition). Fine-tuned E4B models have been withdrawn.

For autonomous agent tasks, use the base Gemma 4 model or a larger model at higher BPW. See: The Benchmark Trap (full study).
