# Otter v3 – Gemma 4 31B Surgically Pruned (58 layers)
## Smaller & More Capable: gemma-4-31b-it-IQ2_M-GGUF
If you want the smallest Gemma 4 31B variant that preserves (or beats) f16 quality, use our custom IQ2_M at 10.17 GB – it scores F1 84.71% and BLEU-4 22.39 on NL2Bash (beats f16 on BLEU-4 while being 6× smaller). Otter v3 below is kept as the research artifact for layer-pruning without recovery fine-tuning. The IQ2_M repo is the practical deployment choice.
Otter v3 is Google's Gemma 4 31B IT with two transformer layers surgically removed (layers 27 and 29), identified through a 3-phase empirical study. The remaining 58 layers are quantized to GGUF without any fine-tuning recovery.
## Why This Model (research artifact)
A 3-phase layer study found:
- Layer 27 (sliding attention): single-layer ablation **reduces** perplexity by 74%
- Layer 29 (full attention): single-layer ablation **reduces** perplexity by 71%
- Block ablation {27, 29}: PPL drops from 496 → 117 (-76%)

These layers were adding noise. Otter v3 is a research demonstration that some layers can be surgically removed with a positive effect on perplexity, but dynamic quantization (see IQ2_M above) dominates it on downstream benchmarks.
## Files
| File | Size | BPW | CLI 7/7 (thinking) | NL2Bash F1 (no-think) |
|---|---|---|---|---|
| otter-v3-Q3_K_M.gguf | 14.1 GB | 3.98 | 7/7 ✅ | 79.89% |
| otter-v3-IQ3_XS.gguf | 12.1 GB | 3.41 | 6/7 | 78.72% |
## NL2Bash Benchmark Results – Full Comparison
50 examples from the official NL2Bash test split (Stanford/Tellina), reasoning OFF:
| Model | Size | Layers | Char-F1 | BLEU-1 | BLEU-4 | EM |
|---|---|---|---|---|---|---|
| Unsloth UD-IQ3_XXS | 11.84 GB | 60 | 85.06% | 46.30 | 20.26 | 12% |
| Base f16 (full precision) | 61.4 GB | 60 | 84.76% | 43.94 | 21.02 | 12% |
| Base IQ3_XS (sibling repo) | 13.1 GB | 60 | 84.76% | 42.77 | 18.95 | 8% |
| **IQ2_M (CLI imatrix, recommended)** | 10.17 GB | 60 | 84.71% | 44.72 | 22.39 | 12% |
| Unsloth UD-IQ2_M | 10.75 GB | 60 | 84.02% | 42.38 | 18.64 | 10% |
| Otter-v3 Q3_K_M (this repo) | 14.1 GB | 58 | 79.89% | 35.09 | 12.80 | 0% |
| Otter-v3 IQ3_XS (this repo) | 12.1 GB | 58 | 78.72% | 37.01 | 15.14 | 4% |
| Base Q2_K (3rd party) | 11.0 GB | 60 | 58.60% | 18.82 | 6.92 | 0% |
Honest conclusion: On NL2Bash, layer pruning without recovery training is dominated by dynamic quantization. Our custom IQ2_M at 10.17 GB scores 6 F1 points higher than Otter-v3 IQ3_XS at 12.1 GB. Keep Otter as a study exhibit; deploy the IQ2_M.
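The Char-F1 column can be read as a character-level F1 between the predicted and reference commands. A minimal sketch of that reading (the benchmark's exact implementation may differ, e.g. it could use character n-grams instead of a bag of characters):

```python
from collections import Counter

def char_f1(pred: str, ref: str) -> float:
    """Character-level F1: harmonic mean of precision/recall over
    character multisets. A plausible reading of the Char-F1 column;
    the benchmark's exact metric may differ."""
    p, r = Counter(pred), Counter(ref)
    overlap = sum((p & r).values())  # shared characters, with multiplicity
    if overlap == 0:
        return 0.0
    precision = overlap / sum(p.values())
    recall = overlap / sum(r.values())
    return 2 * precision * recall / (precision + recall)

print(char_f1("grep -r TODO /var/www", "grep -r TODO /var/www"))  # 1.0
```

Unlike exact match (the EM column), this metric gives partial credit when a generated command differs only slightly from the reference.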
- Full study + raw benchmark data: https://huggingface.co/datasets/KikoCis/gemma4-31b-layer-study
- Interactive site: https://kikocisbot.github.io/gemma4-31b-study (and soon https://otter.utopiaia.com)
## CLI Benchmark (7/7 with thinking)
| Test | Otter v3 Q3_K_M |
|---|---|
| Install neofetch on Void Linux | `sudo xbps-install -S neofetch` ✅ |
| Install htop on Ubuntu | `sudo apt update && sudo apt install htop` ✅ |
| Search ripgrep on Arch | `pacman -Ss ripgrep` ✅ |
| Search packages on Void | `xbps-query -Rs <package>` ✅ |
| Add cargo to PATH in zsh | (multi-step zshrc edit) ✅ |
| Install jq on macOS | `brew install jq` ✅ |
| Grep TODO in /var/www | `grep -r "TODO" /var/www` ✅ |
## Layer Analysis Methodology
### Phase A – Block Influence (BI Score)
Measured `1 - cosine_similarity(h_in, h_out)` for each layer across 300 prompts in 6 categories.
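The BI score above can be sketched with NumPy. This is a minimal sketch; the hidden-state shapes and the token-averaging details are assumptions, not the study's exact code:

```python
import numpy as np

def block_influence(h_in: np.ndarray, h_out: np.ndarray) -> float:
    """BI score: 1 - cosine similarity between a layer's input and output
    hidden states, averaged over token positions. Near 0 means the layer
    barely changes the residual stream."""
    num = (h_in * h_out).sum(axis=-1)
    denom = np.linalg.norm(h_in, axis=-1) * np.linalg.norm(h_out, axis=-1)
    cos = num / np.clip(denom, 1e-9, None)
    return float(1.0 - cos.mean())

h = np.random.randn(16, 64)   # (tokens, hidden_dim), stand-in activations
print(block_influence(h, h))  # ≈ 0.0: an identity layer has no influence
```

A low BI score flags a layer as a pruning candidate; ablation (Phase C) then checks whether removing it actually helps.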
### Phase B – Logit Lens
Projected each layer's hidden state through the final norm + `lm_head`. The decision happens in layers 56-59.
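The logit-lens projection is essentially one normalization plus one matmul. A toy sketch with random stand-ins for the final RMSNorm and `lm_head` (the sizes and norm type here are assumptions, not Gemma's actual shapes):

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, vocab = 32, 100                  # toy sizes, not the real model's
W_lm = rng.normal(size=(d_model, vocab))  # stand-in for lm_head weights

def rms_norm(h, eps=1e-6):
    return h / np.sqrt((h * h).mean(axis=-1, keepdims=True) + eps)

def logit_lens(h_layer):
    """Project an intermediate hidden state straight through the final
    norm + unembedding, skipping all remaining layers."""
    return rms_norm(h_layer) @ W_lm

h = rng.normal(size=(1, d_model))         # one token's hidden state
print(logit_lens(h).shape)                # (1, 100): logits over the vocab
```

Running this per layer shows at which depth the model's top prediction stabilizes, which is how the study locates the "decision" in layers 56-59.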
### Phase C – Single & Block Ablation
Disabled each layer individually and measured perplexity on held-out text. Identified layers 27 and 29 as net-negative.
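Mechanically, ablation just skips a block while keeping the residual stream intact. A toy residual stack illustrating the idea (not the actual Gemma forward pass):

```python
import numpy as np

rng = np.random.default_rng(0)
layers = [rng.normal(scale=0.1, size=(8, 8)) for _ in range(6)]  # toy blocks

def forward(x, ablate=frozenset()):
    """Run a toy residual stack, skipping any ablated layers."""
    for i, w in enumerate(layers):
        if i in ablate:
            continue                  # ablation: identity through the residual
        x = x + np.tanh(x @ w)
    return x

x = rng.normal(size=(4, 8))
full = forward(x)
pruned = forward(x, ablate={2, 4})    # block ablation of two layers
print(np.allclose(full, pruned))      # False: outputs diverge once layers are cut
```

In the real study the same skip is applied inside the transformer, and perplexity on held-out text is compared with and without each layer (and with the {27, 29} block).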
Full study + raw measurements + interactive visualization: https://huggingface.co/datasets/KikoCis/gemma4-31b-layer-study
## Usage

```shell
# llama.cpp with thinking (recommended for best results with this model)
llama-cli -m otter-v3-Q3_K_M.gguf -cnv -ngl 99 --ctx-size 8192 \
  --reasoning on --reasoning-budget 512
```
## Quantization Details
- Tool: llama.cpp (build 0d049d6)
- Q3_K_M: No imatrix (mixed q3_K/q4_K/q5_K)
- IQ3_XS: Custom imatrix computed from 200 chunks on the 58-layer model
- Source: Custom 58-layer safetensors (base model minus layers 27, 29)
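Producing the 58-layer source checkpoint amounts to dropping the two blocks and renumbering the survivors. A sketch of the index remapping, assuming the common `model.layers.{i}.` tensor-name convention (the real checkpoint layout and tooling may differ):

```python
DROP = {27, 29}  # layers identified as net-negative in Phase C

def remap_layers(n_layers: int, drop: set[int]) -> dict[int, int]:
    """Map each surviving layer's old index to its new contiguous index."""
    kept = [i for i in range(n_layers) if i not in drop]
    return {old: new for new, old in enumerate(kept)}

mapping = remap_layers(60, DROP)
print(len(mapping))   # 58 layers survive
print(mapping[28])    # 27: layer 28 slides down into the gap
print(mapping[30])    # 28
```

The per-layer tensors are then renamed accordingly (e.g. `model.layers.30.` becomes `model.layers.28.`) and the config's layer count is set to 58 before GGUF conversion.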
## Hardware
- Recommended RAM: 16GB+ (Q3_K_M) or 14GB+ (IQ3_XS)
- Apple Silicon: ~22 tok/s on M4 Max with full GPU offload
## Limitations
- Requires reasoning enabled for best results on complex tasks
- Without thinking, loses ~5 F1 vs base on NL2Bash (dominated by IQ2_M)
- 58-layer architecture is non-standard; some frameworks may need updates
- No fine-tuning recovery applied – pure structural pruning
## Related Resources
- Gemma 4 31B IT – IQ2_M (CLI imatrix, 10.17 GB, F1 84.71%), recommended
- Base IQ3_XS (60-layer, 13.1 GB, F1 84.76%)
- Full layer analysis study – raw data + scripts + interactive visualization
- Interactive study site
## Real-World Agent Test Warning (April 2026)
Benchmark scores do not predict agent capability. In Docker-based autonomous testing, fine-tuned E4B models (95% BFCL) scored 0/10 while the unfine-tuned base scored 6/10. Fine-tuning for BFCL destroyed general reasoning (error recovery, strategy adaptation, anti-repetition). Fine-tuned E4B models have been withdrawn.
For autonomous agent tasks, use the base Gemma 4 model or a larger model at higher BPW. See: The Benchmark Trap – Full Study