# Gemma 4 31B IT: IQ3_XS GGUF

Gemma 4 31B Instruct quantized to IQ3_XS (3.40 BPW, 13.1 GB) using llama.cpp with importance-matrix calibration on CLI/package-management data.

Matches the f16 baseline on Char-F1 (84.76%) at 1/4.7 the size (50-example NL2Bash benchmark).
> 💡 Looking for the smallest possible variant?
> Try our gemma-4-31b-it-IQ2_M-GGUF: 10.17 GB with F1 84.71% and BLEU-4 22.39 (beats f16 while 6× smaller). Custom CLI-tuned imatrix on IQ2_M.
## Key Stats
| Metric | Value |
|---|---|
| Base model | google/gemma-4-31b-it |
| Quantization | IQ3_XS (3.40 BPW) |
| Size | 13.1 GB |
| Layers | 60 (full model, no pruning) |
| NL2Bash Char-F1 | 84.76% (= f16 baseline) |
| CLI benchmark | 7/7 with thinking enabled |
## NL2Bash Benchmark: Full Comparison

50 examples from the official Stanford/Tellina test split, reasoning OFF, max_tokens=200, temp=0.1, sorted by Char-F1.
| Model | Size | BPW | Char-F1 | BLEU-1 | BLEU-2 | BLEU-4 | EM |
|---|---|---|---|---|---|---|---|
| Unsloth UD-IQ3_XXS | 11.84 GB | ~2.98 | 85.06% | 46.30 | 34.64 | 20.26 | 12% |
| Base f16 (full precision) | 61.4 GB | 16.0 | 84.76% | 43.94 | 33.84 | 21.02 | 12% |
| Base IQ3_XS (this model) | 13.1 GB | 3.40 | 84.76% | 42.77 | 31.46 | 18.95 | 8% |
| ✨ Sibling IQ2_M (CLI imatrix) | 10.17 GB | 2.84 | 84.71% | 44.72 | 34.36 | 22.39 | 12% |
| Unsloth UD-IQ2_M | 10.75 GB | ~2.70 | 84.02% | 42.38 | 31.73 | 18.64 | 10% |
| Unsloth UD-IQ2_XXS | 8.53 GB | ~2.06 | 76.82% | 33.03 | 22.17 | 11.49 | 0% |
| Base Q2_K (3rd party) | 11.0 GB | 2.70 | 58.60% | 18.82 | 12.79 | 6.92 | 0% |
### Key observations
- This IQ3_XS ties f16 on Char-F1 (84.76% vs 84.76%) and comes within 2 F1 of the best available quantization
- Unsloth Dynamic 2.0 is the Pareto frontier: UD-IQ2_M (10.75 GB) is strictly better than plain Q2_K (+25.4 F1 points at same BPW) thanks to per-layer adaptive bit allocation and Gemma-4-specific imatrix
- Q2_K at 2.7 BPW collapses on Gemma 4 (F1 58.6%, produces repetitive output). Standard 2-bit scalar quantization is not viable for this architecture
- f16 only wins on token-exact metrics (BLEU-4, EM); IQ3_XS produces semantically equivalent commands with minor stylistic variations (different flag order, quoting style)
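The last observation can be made concrete with a toy scorer. The sketch below uses a bag-of-characters F1, which is one plausible reading of "Char-F1" (the repo's exact metric may differ): a command whose flags are merely reordered gets full character-level credit but fails exact match.

```python
from collections import Counter

def char_f1(pred: str, ref: str) -> float:
    """Bag-of-characters F1: order-insensitive overlap of characters."""
    p, r = Counter(pred), Counter(ref)
    common = sum((p & r).values())  # per-character min counts
    if common == 0:
        return 0.0
    precision = common / len(pred)
    recall = common / len(ref)
    return 2 * precision * recall / (precision + recall)

ref  = 'grep -r "TODO" /var/www'
pred = 'grep "TODO" -r /var/www'   # same flags, different order

print(char_f1(pred, ref))   # -> 1.0: identical character multiset
print(pred == ref)          # -> False: exact match penalizes the reorder
```

This is why a quantization that only perturbs surface style can tie f16 on Char-F1 while losing a few BLEU-4 and EM points.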
### Recommendation
- Smallest working quant: Unsloth UD-IQ2_M (10.75 GB, F1 84%)
- Best overall: Unsloth UD-IQ3_XXS (11.84 GB, F1 85%)
- This model (IQ3_XS, 13.1 GB): simpler imatrix calibration (CLI-focused), slightly larger but within 0.3 F1 of UD-IQ3_XXS
Full per-question predictions + layer analysis study: https://huggingface.co/datasets/KikoCis/gemma4-31b-layer-study
## CLI Benchmark Results (7/7 with thinking)
Tested with `llama-cli -cnv --reasoning on --reasoning-budget 512`:
| Test | Result |
|---|---|
| Install neofetch on Void Linux | sudo xbps-install -S neofetch |
| Install htop on Ubuntu | sudo apt install htop |
| Search ripgrep on Arch | pacman -Ss ripgrep |
| Search packages on Void | xbps-query -S <package_name> |
| Add cargo to PATH in zsh | echo 'export PATH="$HOME/.cargo/bin:$PATH"' >> ~/.zshrc |
| Install jq on macOS | brew install jq |
| Grep TODO in /var/www | grep -r "TODO" /var/www |
## Usage

```bash
# llama.cpp with thinking
llama-cli -m gemma4-31b-IQ3_XS.gguf -cnv -ngl 99 --ctx-size 8192 \
  --reasoning on --reasoning-budget 512
```

```bash
# Ollama
cat > Modelfile << 'EOF'
FROM ./gemma4-31b-IQ3_XS.gguf
PARAMETER temperature 0.1
PARAMETER num_ctx 8192
EOF
ollama create gemma4-31b-iq3xs -f Modelfile
```
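Once the model is created, it can be queried over Ollama's local REST API. A minimal sketch (the model name `gemma4-31b-iq3xs` comes from the `ollama create` step above; the helper functions are illustrative, not part of this repo):

```python
import json
import urllib.request

def build_payload(prompt: str, model: str = "gemma4-31b-iq3xs") -> dict:
    # Non-streaming generate request; temperature mirrors the Modelfile.
    return {
        "model": model,
        "prompt": prompt,
        "stream": False,
        "options": {"temperature": 0.1},
    }

def generate(prompt: str, host: str = "http://localhost:11434") -> str:
    data = json.dumps(build_payload(prompt)).encode("utf-8")
    req = urllib.request.Request(
        f"{host}/api/generate",
        data=data,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]

# generate("Install htop on Ubuntu")  # requires a running Ollama server
```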
## Quantization Details

- Tool: llama.cpp (build 0d049d6)
- Imatrix: computed from 200 chunks of synthetic CLI/package-management data
- Source: converted from google/gemma-4-31b-it safetensors via `convert_hf_to_gguf.py`
- Architecture: Gemma4 with a sliding + full attention pattern (every 6th layer is full attention)
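As a sanity check on the headline numbers, bits-per-weight is just file size over parameter count. A quick sketch, assuming ~31e9 parameters from the model name and decimal gigabytes (the table's 3.40 BPW implies the true parameter count is slightly below a flat 31B):

```python
def bits_per_weight(size_gb: float, n_params: float) -> float:
    # decimal GB -> bytes -> bits, divided by parameter count
    return size_gb * 1e9 * 8 / n_params

print(round(bits_per_weight(13.1, 31e9), 2))   # -> 3.38, close to the reported 3.40
```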
## Hardware Requirements

- Minimum RAM: 16 GB (with partial offload)
- Recommended: Apple Silicon with 32 GB+ unified memory
- Performance: ~21 tok/s on an M4 Max (128 GB) with full GPU offload
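The RAM figures above are dominated by the 13.1 GB of weights plus the KV cache. A rough estimate for an 8192-token context can be sketched as follows; note the KV head count and head dimension below are illustrative placeholders, not the real Gemma 4 config, and the sliding-window layers described earlier would shrink the real number:

```python
def kv_cache_bytes(n_layers=60, n_kv_heads=8, head_dim=128,
                   ctx=8192, bytes_per_elem=2):
    # 2x for K and V; f16 cache elements (2 bytes) by default
    return 2 * n_layers * n_kv_heads * head_dim * ctx * bytes_per_elem

print(kv_cache_bytes() / 1e9)   # ~2.0 GB on top of the weights, under these assumptions
```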
## Benchmark Methodology

- NL2Bash test set: 50 examples from the official deterministic split (RANDOM_SEED=100, fold 11 from TellinaTool/nl2bash)
- Inference: each question runs as an isolated llama-cli subprocess with `-p <prompt> --reasoning off --ctx-size 4096 --max-tokens 200 --temp 0.1`
- Scoring: corpus-level NLTK BLEU + character-level F1 + exact match
- Reproducibility: all scripts and JSONL predictions are public at the dataset repo above
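The per-question inference loop described above can be sketched as follows. Flag spellings are taken from this section; the helper names are illustrative, and `llama-cli` plus the GGUF file must be present for the commented-out call to actually run:

```python
import subprocess

MODEL = "gemma4-31b-IQ3_XS.gguf"

def build_cmd(prompt: str) -> list:
    # One isolated llama-cli invocation per benchmark question
    return [
        "llama-cli", "-m", MODEL, "-p", prompt,
        "--reasoning", "off",
        "--ctx-size", "4096",
        "--max-tokens", "200",
        "--temp", "0.1",
    ]

def run_question(prompt: str) -> str:
    out = subprocess.run(build_cmd(prompt), capture_output=True,
                         text=True, timeout=120)
    return out.stdout.strip()

# run_question("find all files larger than 100MB")  # needs llama-cli + the GGUF
```

Running each question in a fresh subprocess keeps the KV cache and sampler state from leaking between examples.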
## Real-World Agent Test Warning (April 2026)

Benchmark scores do not predict agent capability. In Docker-based autonomous testing, fine-tuned E4B models (95% BFCL) scored 0/10 while the unfine-tuned base scored 6/10. Fine-tuning for BFCL destroyed general reasoning (error recovery, strategy adaptation, anti-repetition). The fine-tuned E4B models have been withdrawn.

For autonomous agent tasks, use the base Gemma 4 model or a larger model at higher BPW. See: The Benchmark Trap (Full Study)