Otter v3: Gemma 4 31B Surgically Pruned (58 layers)

๐Ÿ† Smaller & More Capable: gemma-4-31b-it-IQ2_M-GGUF

If you want the smallest Gemma 4 31B variant that preserves (or beats) f16 quality, use our custom IQ2_M at 10.17 GB: it scores F1 84.71% and BLEU-4 22.39 on NL2Bash (beats f16 on BLEU-4 at 6× smaller size). Otter v3 below is kept as the research artifact for layer-pruning without recovery fine-tuning. The IQ2_M repo is the practical deployment choice.


Otter v3 is Google's Gemma 4 31B IT with two transformer layers (27 and 29) surgically removed, identified through a 3-phase empirical study. The remaining 58 layers are quantized to GGUF without any recovery fine-tuning.
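The removal itself is mechanically simple. Below is a minimal sketch of the operation, assuming HF-style key names (`model.layers.N.*`); the actual surgery script lives in the linked study repo. It drops a set of layer indices from a state dict and renumbers the survivors into a contiguous stack:

```python
import re

def prune_layers(state_dict, drop):
    """Drop the given layer indices and renumber the remaining layers 0..N-1.
    Assumes HF-style keys like 'model.layers.27.self_attn.q_proj.weight'."""
    kept = sorted(set(
        int(m.group(1)) for k in state_dict
        if (m := re.match(r"model\.layers\.(\d+)\.", k))
    ) - set(drop))
    remap = {old: new for new, old in enumerate(kept)}
    out = {}
    for k, v in state_dict.items():
        m = re.match(r"model\.layers\.(\d+)\.(.*)", k)
        if m is None:
            out[k] = v                      # embeddings, final norm, lm_head
        elif int(m.group(1)) in remap:
            out[f"model.layers.{remap[int(m.group(1))]}.{m.group(2)}"] = v
        # tensors belonging to dropped layers are silently discarded
    return out

# toy demo: a fake 4-layer model, dropping layer 2
toy = {f"model.layers.{i}.w": i for i in range(4)} | {"model.norm.w": "n"}
pruned = prune_layers(toy, drop={2})
```

After pruning, old layer 3 becomes the new layer 2, so the resulting checkpoint loads as an ordinary (shorter) stack.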

Why This Model (research artifact)

A 3-phase layer study found:

  • Layer 27 (sliding attention): single-layer ablation REDUCES perplexity by 74%
  • Layer 29 (full attention): single-layer ablation REDUCES perplexity by 71%
  • Block ablation {27, 29}: PPL drops from 496 → 117 (-76%)

These layers were adding noise. Otter v3 is a research demonstration that some layers can be surgically removed with positive effect on perplexity, but dynamic quantization (see IQ2_M above) dominates it on downstream benchmarks.

Files

| File | Size | BPW | CLI 7/7 (thinking) | NL2Bash F1 (no-think) |
|---|---|---|---|---|
| otter-v3-Q3_K_M.gguf | 14.1 GB | 3.98 | 7/7 ✓ | 79.89% |
| otter-v3-IQ3_XS.gguf | 12.1 GB | 3.41 | 6/7 | 78.72% |
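The BPW column can be sanity-checked from file size and parameter count. A quick sketch, assuming GiB-based sizes and roughly 30.4B parameters for the 58-layer model (both are my assumptions, not figures from the study):

```python
def bits_per_weight(size_gib: float, n_params: float) -> float:
    """Approximate GGUF bits-per-weight: total file bits / parameter count."""
    return size_gib * 2**30 * 8 / n_params

N_PARAMS = 30.4e9                           # assumed 58-layer param count
q3_k_m = bits_per_weight(14.1, N_PARAMS)    # ~3.98
iq3_xs = bits_per_weight(12.1, N_PARAMS)    # ~3.41
```

Both values land on the table's BPW figures, which suggests the sizes are reported in GiB.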

NL2Bash Benchmark Results: Full Comparison

50 examples from the official NL2Bash test split (Stanford/Tellina), reasoning OFF:

| Model | Size | Layers | Char-F1 | BLEU-1 | BLEU-4 | EM |
|---|---|---|---|---|---|---|
| Unsloth UD-IQ3_XXS | 11.84 GB | 60 | 85.06% | 46.30 | 20.26 | 12% |
| Base f16 (full precision) | 61.4 GB | 60 | 84.76% | 43.94 | 21.02 | 12% |
| Base IQ3_XS (sibling repo) | 13.1 GB | 60 | 84.76% | 42.77 | 18.95 | 8% |
| ✨ IQ2_M (CLI imatrix, recommended) | 10.17 GB | 60 | 84.71% | 44.72 | 22.39 | 12% |
| Unsloth UD-IQ2_M | 10.75 GB | 60 | 84.02% | 42.38 | 18.64 | 10% |
| Otter-v3 Q3_K_M (this repo) | 14.1 GB | 58 | 79.89% | 35.09 | 12.80 | 0% |
| Otter-v3 IQ3_XS (this repo) | 12.1 GB | 58 | 78.72% | 37.01 | 15.14 | 4% |
| Base Q2_K (3rd party) | 11.0 GB | 60 | 58.60% | 18.82 | 6.92 | 0% |

Honest conclusion: On NL2Bash, layer pruning without recovery training is dominated by dynamic quantization. Our custom IQ2_M at 10.17 GB scores 6 F1 points higher than Otter-v3 IQ3_XS at 12.1 GB. Keep Otter as a study exhibit; deploy the IQ2_M.
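For readers reproducing the numbers: one plausible reading of the Char-F1 column is character-multiset F1 between the predicted and gold commands (the study's exact scorer may differ; this is an illustrative sketch, not the official metric):

```python
from collections import Counter

def char_f1(pred: str, ref: str) -> float:
    """Character-level F1: overlap of character multisets between
    prediction and reference. One plausible reading of 'Char-F1'."""
    p, r = Counter(pred), Counter(ref)
    overlap = sum((p & r).values())      # multiset intersection size
    if overlap == 0:
        return 0.0
    precision = overlap / sum(p.values())
    recall = overlap / sum(r.values())
    return 2 * precision * recall / (precision + recall)

# partial credit for a near-miss command
score = char_f1("sudo apt install htop", "sudo apt-get install htop")
```

Character-level F1 explains why scores stay in the 80s even when exact match (EM) is near zero: a command can be character-wise close while still not matching token-for-token.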

Full study + raw benchmark data: https://huggingface.co/datasets/KikoCis/gemma4-31b-layer-study
Interactive site: https://kikocisbot.github.io/gemma4-31b-study (and soon https://otter.utopiaia.com)

CLI Benchmark (7/7 with thinking)

| Test | Otter v3 Q3_K_M |
|---|---|
| Install neofetch on Void Linux | sudo xbps-install -S neofetch ✓ |
| Install htop on Ubuntu | sudo apt update && sudo apt install htop ✓ |
| Search ripgrep on Arch | pacman -Ss ripgrep ✓ |
| Search packages on Void | xbps-query -Rs <package> ✓ |
| Add cargo to PATH in zsh | (multi-step zshrc edit) ✓ |
| Install jq on macOS | brew install jq ✓ |
| Grep TODO in /var/www | grep -r "TODO" /var/www ✓ |

Layer Analysis Methodology

Phase A: Block Influence (BI Score)

Measured 1 - cosine_similarity(h_in, h_out) for each layer across 300 prompts in 6 categories.
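A toy numpy version of the Phase A statistic (random tensors standing in for real hidden states):

```python
import numpy as np

def block_influence(h_in: np.ndarray, h_out: np.ndarray) -> float:
    """BI score: 1 - cos(h_in, h_out), averaged over token positions.
    High BI = the layer rotates the residual stream a lot;
    near-zero BI = the layer barely changes it (a pruning candidate)."""
    cos = np.sum(h_in * h_out, axis=-1) / (
        np.linalg.norm(h_in, axis=-1) * np.linalg.norm(h_out, axis=-1)
    )
    return float(np.mean(1.0 - cos))

rng = np.random.default_rng(0)
h = rng.normal(size=(16, 64))              # 16 tokens, hidden size 64
identity_bi = block_influence(h, h)        # a pass-through layer -> ~0.0
noisy_bi = block_influence(h, h + 0.1 * rng.normal(size=h.shape))
```

In the real study this was averaged over 300 prompts per layer; a layer whose BI stays near zero everywhere contributes little beyond its residual connection.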

Phase B: Logit Lens

Projected each layer's hidden state through the final norm + lm_head. Decision happens in layers 56-59.
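A toy sketch of the probe, with random matrices standing in for the real RMSNorm scale and lm_head (toy dimensions, not Gemma's):

```python
import numpy as np

def logit_lens(h, norm_weight, lm_head):
    """Project an intermediate hidden state through the final RMSNorm
    and the unembedding matrix, as in the Phase B probe."""
    rms = np.sqrt(np.mean(h**2, axis=-1, keepdims=True) + 1e-6)
    normed = h / rms * norm_weight
    return normed @ lm_head.T              # (tokens, vocab) logits

rng = np.random.default_rng(1)
hidden, vocab = 64, 100
h_mid = rng.normal(size=(1, hidden))       # hidden state after some layer
W = rng.normal(size=(vocab, hidden))       # lm_head (unembedding)
g = np.ones(hidden)                        # RMSNorm scale
logits = logit_lens(h_mid, g, W)
top_token = int(np.argmax(logits, axis=-1)[0])
```

Running this probe at every depth shows where the top-1 prediction stabilizes; in this study, that happened in layers 56-59.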

Phase C: Single & Block Ablation

Disabled each layer individually and measured perplexity on held-out text. Identified layers 27 and 29 as net-negative.
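Perplexity here is the exponential of the mean token negative log-likelihood. A toy sketch contrasting a sharp and a noisy logit stream (illustrative only; the study measured the real model on held-out text):

```python
import numpy as np

def perplexity(logits: np.ndarray, targets: np.ndarray) -> float:
    """exp(mean NLL) of target tokens under (tokens, vocab) logits."""
    logits = logits - logits.max(axis=-1, keepdims=True)   # stable softmax
    logp = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))
    nll = -logp[np.arange(len(targets)), targets]
    return float(np.exp(nll.mean()))

rng = np.random.default_rng(2)
targets = rng.integers(0, 50, size=200)
sharp = rng.normal(size=(200, 50))
sharp[np.arange(200), targets] += 5.0      # confident, mostly-correct logits
flat = rng.normal(size=(200, 50))          # near-random logits
ppl_good = perplexity(sharp, targets)
ppl_bad = perplexity(flat, targets)
```

A layer is "net-negative" in the Phase C sense when ablating it moves perplexity in the good direction, as with the {27, 29} block (496 → 117).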

Full study + raw measurements + interactive visualization: https://huggingface.co/datasets/KikoCis/gemma4-31b-layer-study

Usage

# llama.cpp with thinking (recommended for best results with this model)
llama-cli -m otter-v3-Q3_K_M.gguf -cnv -ngl 99 --ctx-size 8192 \
  --reasoning on --reasoning-budget 512

Quantization Details

  • Tool: llama.cpp (build 0d049d6)
  • Q3_K_M: No imatrix (mixed q3_K/q4_K/q5_K)
  • IQ3_XS: Custom imatrix computed from 200 chunks on the 58-layer model
  • Source: Custom 58-layer safetensors (base model minus layers 27, 29)

Hardware

  • Recommended RAM: 16GB+ (Q3_K_M) or 14GB+ (IQ3_XS)
  • Apple Silicon: ~22 tok/s on M4 Max with full GPU offload

Limitations

  • Requires reasoning enabled for best results on complex tasks
  • Without thinking, loses ~5 F1 vs base on NL2Bash (dominated by IQ2_M)
  • 58-layer architecture is non-standard; some frameworks may need updates
  • No fine-tuning recovery applied โ€” pure structural pruning

Real-World Agent Test Warning (April 2026)

Benchmark scores do not predict agent capability. In Docker-based autonomous testing, fine-tuned E4B models (95% BFCL) scored 0/10 while the unfine-tuned base scored 6/10. Fine-tuning for BFCL destroyed general reasoning (error recovery, strategy adaptation, anti-repetition). Fine-tuned E4B models have been withdrawn.

For autonomous agent tasks, use the base Gemma 4 model or a larger model at higher BPW. See: The Benchmark Trap (Full Study)
