# Otter v3 – Gemma 4 31B Surgically Pruned (58 layers)
## Smaller & More Capable: gemma-4-31b-it-IQ2_M-GGUF
If you want the smallest Gemma 4 31B variant that preserves (or beats) f16 quality, use our custom IQ2_M at 10.17 GB – it scores F1 84.71% and BLEU-4 22.39 on NL2Bash (beats f16 on BLEU-4 while being 6× smaller). Otter v3 below is kept as the research artifact for layer-pruning without recovery fine-tuning. The IQ2_M repo is the practical deployment choice.
Otter v3 is Google's Gemma 4 31B IT with two transformer layers surgically removed (layers 27 and 29), identified through a 3-phase empirical study. The remaining 58 layers are quantized to GGUF without any fine-tuning recovery.
## Why This Model (research artifact)
A 3-phase layer study found:
- Layer 27 (sliding attention): single-layer ablation **reduces** perplexity by 74%
- Layer 29 (full attention): single-layer ablation **reduces** perplexity by 71%
- Block ablation {27, 29}: PPL drops from 496 → 117 (-76%)

These layers were adding noise. Otter v3 is a research demonstration that some layers can be surgically removed with a positive effect on perplexity, but dynamic quantization (see IQ2_M above) dominates it on downstream benchmarks.
## Files
| File | Size | BPW | CLI 7/7 (thinking) | NL2Bash F1 (no-think) |
|---|---|---|---|---|
| otter-v3-Q3_K_M.gguf | 14.1 GB | 3.98 | 7/7 ✅ | 79.89% |
| otter-v3-IQ3_XS.gguf | 12.1 GB | 3.41 | 6/7 | 78.72% |
## NL2Bash Benchmark Results – Full Comparison
50 examples from the official NL2Bash test split (Stanford/Tellina), reasoning OFF:
| Model | Size | Layers | Char-F1 | BLEU-1 | BLEU-4 | EM |
|---|---|---|---|---|---|---|
| Unsloth UD-IQ3_XXS | 11.84 GB | 60 | 85.06% | 46.30 | 20.26 | 12% |
| Base f16 (full precision) | 61.4 GB | 60 | 84.76% | 43.94 | 21.02 | 12% |
| Base IQ3_XS (sibling repo) | 13.1 GB | 60 | 84.76% | 42.77 | 18.95 | 8% |
| **IQ2_M (CLI imatrix, recommended)** | 10.17 GB | 60 | 84.71% | 44.72 | 22.39 | 12% |
| Unsloth UD-IQ2_M | 10.75 GB | 60 | 84.02% | 42.38 | 18.64 | 10% |
| Otter-v3 Q3_K_M (this repo) | 14.1 GB | 58 | 79.89% | 35.09 | 12.80 | 0% |
| Otter-v3 IQ3_XS (this repo) | 12.1 GB | 58 | 78.72% | 37.01 | 15.14 | 4% |
| Base Q2_K (3rd party) | 11.0 GB | 60 | 58.60% | 18.82 | 6.92 | 0% |
Honest conclusion: On NL2Bash, layer pruning without recovery training is dominated by dynamic quantization. Our custom IQ2_M at 10.17 GB scores 6 F1 points higher than Otter-v3 IQ3_XS at 12.1 GB. Keep Otter as a study exhibit; deploy the IQ2_M.
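The Char-F1 column can be read as a character-level F1 between the predicted and reference commands. A minimal sketch of that reading (the benchmark's exact implementation may differ, e.g. it could use character n-grams instead of a bag of characters):

```python
from collections import Counter

def char_f1(pred: str, ref: str) -> float:
    """Character-level F1: harmonic mean of precision/recall over
    character multisets. A plausible reading of the Char-F1 column;
    the benchmark's exact metric may differ."""
    p, r = Counter(pred), Counter(ref)
    overlap = sum((p & r).values())  # shared characters, with multiplicity
    if overlap == 0:
        return 0.0
    precision = overlap / sum(p.values())
    recall = overlap / sum(r.values())
    return 2 * precision * recall / (precision + recall)

print(char_f1("grep -r TODO /var/www", "grep -r TODO /var/www"))  # 1.0
```

Unlike exact match (the EM column), this metric gives partial credit when a generated command differs only slightly from the reference.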
- Full study + raw benchmark data: https://huggingface.co/datasets/KikoCis/gemma4-31b-layer-study
- Interactive site: https://kikocisbot.github.io/gemma4-31b-study (and soon https://otter.utopiaia.com)
## CLI Benchmark (7/7 with thinking)
| Test | Otter v3 Q3_K_M |
|---|---|
| Install neofetch on Void Linux | `sudo xbps-install -S neofetch` ✅ |
| Install htop on Ubuntu | `sudo apt update && sudo apt install htop` ✅ |
| Search ripgrep on Arch | `pacman -Ss ripgrep` ✅ |
| Search packages on Void | `xbps-query -Rs <package>` ✅ |
| Add cargo to PATH in zsh | (multi-step zshrc edit) ✅ |
| Install jq on macOS | `brew install jq` ✅ |
| Grep TODO in /var/www | `grep -r "TODO" /var/www` ✅ |
## Layer Analysis Methodology
### Phase A – Block Influence (BI Score)
Measured `1 - cosine_similarity(h_in, h_out)` for each layer across 300 prompts in 6 categories.
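The BI score above can be sketched with NumPy. This is a minimal sketch; the hidden-state shapes and the token-averaging details are assumptions, not the study's exact code:

```python
import numpy as np

def block_influence(h_in: np.ndarray, h_out: np.ndarray) -> float:
    """BI score: 1 - cosine similarity between a layer's input and output
    hidden states, averaged over token positions. Near 0 means the layer
    barely changes the residual stream."""
    num = (h_in * h_out).sum(axis=-1)
    denom = np.linalg.norm(h_in, axis=-1) * np.linalg.norm(h_out, axis=-1)
    cos = num / np.clip(denom, 1e-9, None)
    return float(1.0 - cos.mean())

h = np.random.randn(16, 64)   # (tokens, hidden_dim), stand-in activations
print(block_influence(h, h))  # ≈ 0.0: an identity layer has no influence
```

A low BI score flags a layer as a pruning candidate; ablation (Phase C) then checks whether removing it actually helps.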
### Phase B – Logit Lens
Projected each layer's hidden state through the final norm + `lm_head`. The decision happens in layers 56-59.
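The logit-lens projection is essentially one normalization plus one matmul. A toy sketch with random stand-ins for the final RMSNorm and `lm_head` (the sizes and norm type here are assumptions, not Gemma's actual shapes):

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, vocab = 32, 100                  # toy sizes, not the real model's
W_lm = rng.normal(size=(d_model, vocab))  # stand-in for lm_head weights

def rms_norm(h, eps=1e-6):
    return h / np.sqrt((h * h).mean(axis=-1, keepdims=True) + eps)

def logit_lens(h_layer):
    """Project an intermediate hidden state straight through the final
    norm + unembedding, skipping all remaining layers."""
    return rms_norm(h_layer) @ W_lm

h = rng.normal(size=(1, d_model))         # one token's hidden state
print(logit_lens(h).shape)                # (1, 100): logits over the vocab
```

Running this per layer shows at which depth the model's top prediction stabilizes, which is how the study locates the "decision" in layers 56-59.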
### Phase C – Single & Block Ablation
Disabled each layer individually and measured perplexity on held-out text. Identified layers 27 and 29 as net-negative.
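Mechanically, ablation just skips a block while keeping the residual stream intact. A toy residual stack illustrating the idea (not the actual Gemma forward pass):

```python
import numpy as np

rng = np.random.default_rng(0)
layers = [rng.normal(scale=0.1, size=(8, 8)) for _ in range(6)]  # toy blocks

def forward(x, ablate=frozenset()):
    """Run a toy residual stack, skipping any ablated layers."""
    for i, w in enumerate(layers):
        if i in ablate:
            continue                  # ablation: identity through the residual
        x = x + np.tanh(x @ w)
    return x

x = rng.normal(size=(4, 8))
full = forward(x)
pruned = forward(x, ablate={2, 4})    # block ablation of two layers
print(np.allclose(full, pruned))      # False: outputs diverge once layers are cut
```

In the real study the same skip is applied inside the transformer, and perplexity on held-out text is compared with and without each layer (and with the {27, 29} block).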
Full study + raw measurements + interactive visualization: https://huggingface.co/datasets/KikoCis/gemma4-31b-layer-study
## Usage

```shell
# llama.cpp with thinking (recommended for best results with this model)
llama-cli -m otter-v3-Q3_K_M.gguf -cnv -ngl 99 --ctx-size 8192 \
  --reasoning on --reasoning-budget 512
```
## Quantization Details
- Tool: llama.cpp (build 0d049d6)
- Q3_K_M: No imatrix (mixed q3_K/q4_K/q5_K)
- IQ3_XS: Custom imatrix computed from 200 chunks on the 58-layer model
- Source: Custom 58-layer safetensors (base model minus layers 27, 29)
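Producing the 58-layer source checkpoint amounts to dropping the two blocks and renumbering the survivors. A sketch of the index remapping, assuming the common `model.layers.{i}.` tensor-name convention (the real checkpoint layout and tooling may differ):

```python
DROP = {27, 29}  # layers identified as net-negative in Phase C

def remap_layers(n_layers: int, drop: set[int]) -> dict[int, int]:
    """Map each surviving layer's old index to its new contiguous index."""
    kept = [i for i in range(n_layers) if i not in drop]
    return {old: new for new, old in enumerate(kept)}

mapping = remap_layers(60, DROP)
print(len(mapping))   # 58 layers survive
print(mapping[28])    # 27: layer 28 slides down into the gap
print(mapping[30])    # 28
```

The per-layer tensors are then renamed accordingly (e.g. `model.layers.30.` becomes `model.layers.28.`) and the config's layer count is set to 58 before GGUF conversion.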
## Hardware
- Recommended RAM: 16GB+ (Q3_K_M) or 14GB+ (IQ3_XS)
- Apple Silicon: ~22 tok/s on M4 Max with full GPU offload
## Limitations
- Requires reasoning enabled for best results on complex tasks
- Without thinking, loses ~5 F1 vs base on NL2Bash (dominated by IQ2_M)
- 58-layer architecture is non-standard; some frameworks may need updates
- No fine-tuning recovery applied – pure structural pruning
## Related Resources
- Gemma 4 31B IT – IQ2_M (CLI imatrix, 10.17 GB, F1 84.71%), recommended
- Base IQ3_XS (60-layer, 13.1 GB, F1 84.76%)
- Full layer analysis study – raw data + scripts + interactive visualization
- Interactive study site
## Real-World Agent Test Warning (April 2026)
Benchmark scores do not predict agent capability. In Docker-based autonomous testing, fine-tuned E4B models (95% BFCL) scored 0/10 while the unfine-tuned base scored 6/10. Fine-tuning for BFCL destroyed general reasoning (error recovery, strategy adaptation, anti-repetition). Fine-tuned E4B models have been withdrawn.
For autonomous agent tasks, use the base Gemma 4 model or a larger model at higher BPW. See: The Benchmark Trap – Full Study