m51Lab-SeoGemma4-v2-31B

A domain-specialized SEO audit model competitive with Claude Sonnet 4.6 on structured report generation.

Fine-tuned from google/gemma-4-31B-it by m51 Lab for professional SEO auditing. While training data is primarily Norwegian, the fine-tuning uses only LoRA (0.03% of parameters) and does not modify the base model's language capabilities. Gemma 4's full multilingual support (140+ languages) is preserved — the model works equally well for SEO auditing in English, German, Swedish, or any other language Gemma 4 supports. The SEO methodology, format structure, and prioritization framework are language-agnostic.

Key Results

| Metric | Value |
|---|---|
| Win rate vs Sonnet 4.6 (equal length, Haiku judge) | 60% |
| Win rate vs Sonnet 4.6 (equal length, Opus judge) | 40% |
| v1 baseline win rate | 13.3% |
| Improvement | ~4× (from 13.3% to ~50%) |
| Training data | 805 examples, ~2.5M tokens |
| Base model | google/gemma-4-31B-it (31.3B parameters) |

In an equal-length blind test, two independent judges evaluated v2 against Claude Sonnet 4.6:

  • v2 wins on prioritization: explicit Impact/Effort scores, NÅ/NESTE/SENERE action plans
  • Sonnet wins on strategic depth: scenario modeling, causal analysis, detailed implementation steps
  • Overall: approximately even (~50% win rate depending on judge preferences)

A deeper qualitative expert review concluded: "The ideal would be a hybrid: v2's consistent structure and compact action plan, combined with Sonnet's forecast depth, economic quantification, and detailed implementation steps."

Language & Multilingual Support

Important: This model is NOT Norwegian-only. The base model (google/gemma-4-31B-it) supports 140+ languages natively. Our fine-tuning:

  • Trains only 0.03% of parameters via LoRA (9.2M out of 31.3B)
  • Freezes all 10 global attention layers (preserving multilingual and long-range capabilities)
  • Only modifies q_proj and v_proj on 50 sliding attention layers

This means the base model's language understanding, generation quality, and multilingual capability are fully preserved. The fine-tuning teaches the model a specific output format (10-section SEO audit with Impact/Effort scoring) and domain knowledge (SEO methodology, Norwegian regulatory context) — not a language.

We validated on Norwegian because our production system (M51 AI OS) serves Norwegian businesses, but the structured audit format works identically in any language.

Model Description

m51Lab-SeoGemma4-v2-31B is designed for the M51 AI OS SEO audit workflow. Given a pre-assembled data package containing Search Console metrics, PageSpeed scores, backlink profiles, keyword rankings, AI search visibility data, and historical audit comparisons, the model produces a comprehensive Norwegian SEO audit report.

Output Structure

The model produces reports with exactly 10 sections:

  1. EXECUTIVE SUMMARY — Overall SEO health assessment [GOD/MEDIUM/SVAK]
  2. SØKEORDSSTATUS — Keyword rankings table with position, clicks, impressions, status
  3. CORE WEB VITALS & TILGJENGELIGHET — Mobile vs desktop CWV comparison table
  4. BACKLINK & AUTORITET — Domain Authority, referring domains, competitor DA gap
  5. AI-SØKESYNLIGHET — Citation score trends, AI Overview coverage
  6. PROGNOSE OG TRENDER — Statistical forecasts, Z-score anomaly detection
  7. FUNN — Findings sorted by Impact, each with 7 structured fields
  8. MULIGHETER — Opportunities table with expected effect, effort, priority
  9. HANDLINGSPLAN — Action plan tables: NÅ (<1 week) / NESTE (1-4 weeks) / SENERE (>4 weeks)
  10. USIKKERHETSVURDERING — Data gaps, assumptions, confidence levels
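
Section adherence is easy to check mechanically. A minimal sketch (heading spellings taken from the list above; real output may add markdown prefixes or numbering, so the check is substring-based and order-aware):

```python
# The 10 required section headings, in order (from the format spec above).
SECTIONS = [
    "EXECUTIVE SUMMARY", "SØKEORDSSTATUS", "CORE WEB VITALS & TILGJENGELIGHET",
    "BACKLINK & AUTORITET", "AI-SØKESYNLIGHET", "PROGNOSE OG TRENDER",
    "FUNN", "MULIGHETER", "HANDLINGSPLAN", "USIKKERHETSVURDERING",
]

def check_sections(report: str) -> list[str]:
    """Return the headings that are missing (or appear out of order)."""
    missing, pos = [], 0
    for name in SECTIONS:
        idx = report.find(name, pos)
        if idx == -1:
            missing.append(name)
        else:
            pos = idx + len(name)  # next heading must come after this one
    return missing
```

An empty return value corresponds to the 10/10 section adherence reported below.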

Each finding in section 7 follows a strict format:

### [KRITISK] 1. Action-oriented title
**Kategori:** technical | content | authority | performance
**Impact:** [85] | **Effort:** [20]
**Beskrivelse:** What the problem is.
**Konsekvens:** Quantified consequence.
**Sannsynlig årsak:** Root cause.
**Estimert effekt:** Expected gain (consistent with Konsekvens).
**Tiltak:** Concrete action starting with a verb.
**Verifikasjon:** How to verify and over what timeframe.
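
The field labels above can be validated with a short regex check. A minimal sketch (labels copied from the template; Impact and Effort share one line in the format but are checked individually here):

```python
import re

# Field labels from the finding template above.
FIELDS = ["Kategori", "Impact", "Effort", "Beskrivelse", "Konsekvens",
          "Sannsynlig årsak", "Estimert effekt", "Tiltak", "Verifikasjon"]

def missing_fields(finding: str) -> list[str]:
    """Return template fields absent from a single finding block."""
    return [f for f in FIELDS
            if not re.search(rf"\*\*{re.escape(f)}:\*\*", finding)]
```

A conforming finding returns an empty list; anything else pinpoints which labels the model dropped.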

Thinking

The model uses Gemma 4's native thinking tokens (<|channel|>thought\n...\n<|channel|>) for internal reasoning before generating the audit. Thinking is in English (base model behavior).

To enable thinking at inference, set enable_thinking=True in apply_chat_template() or use --reasoning-format deepseek-legacy --jinja with llama-server.

Training Details

Architecture

  • Base model: google/gemma-4-31B-it (Google Gemma 4, 31.3B dense)
  • Method: PiSSA-initialized LoRA (SVD-based initialization)
  • LoRA config: r=8, alpha=16, dropout=0.05, target modules: q_proj + v_proj on sliding attention layers only
  • Frozen layers: 10 global attention layers (indices 5, 11, 17, 23, 29, 35, 41, 47, 53, 59) — preserves long-range reasoning and truthfulness
  • Trainable parameters: 9,216,000 / 31,282,302,512 = 0.0295%
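
The layer split and trainable fraction follow directly from the numbers above (the frozen global layers sit at every 6th index starting at 5):

```python
# Global attention layers: every 6th index starting at 5 (from the list above).
global_layers = list(range(5, 60, 6))            # [5, 11, ..., 59]
sliding_layers = [i for i in range(60) if i not in global_layers]

trainable = 9_216_000
total = 31_282_302_512
fraction = 100 * trainable / total

print(len(global_layers), len(sliding_layers))   # 10 50
print(f"{fraction:.4f}%")                        # 0.0295%
```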

Training Data

| Metric | Value |
|---|---|
| Total examples | 805 |
| Total tokens | ~2.5 million |
| Total characters | 7,312,354 |
| Avg tokens/example | ~3,100 |

Category distribution:

| Category | Examples | Tokens | Share |
|---|---|---|---|
| audit_synthesis | 177 | 1,196,985 | 57.3% |
| structured_priority | 90 | 227,279 | 10.9% |
| escape_hatch | 88 | 180,545 | 8.6% |
| audit_synthesis_partial | 33 | 158,577 | 7.6% |
| chat_tool_use | 138 | 128,016 | 6.1% |
| rehearsal (Norwegian preservation) | 155 | 62,133 | 3.0% |
| content_strategy | 41 | 40,154 | 1.9% |
| local_seo | 33 | 38,511 | 1.8% |
| keyword_research | 25 | 28,753 | 1.4% |
| technical_seo | 25 | 28,292 | 1.4% |

Data sources:

  • 13 production-reconstructed examples from real M51 AI OS audit runs (reverse-engineered DataPackage inputs paired with Claude Opus-generated outputs)
  • 4 hand-crafted gold standards (Opus-generated, covering rich e-commerce, sparse startup, Holt-Winters anomaly, multi-competitor scenarios)
  • 788 synthetic examples generated by Claude Sonnet 4.6 across 14+ Norwegian industry verticals (e-commerce, SaaS, healthcare, fintech, legal, automotive, construction, food & beverage, manufacturing, education, nonprofit, media, travel, real estate)

Hyperparameters

learning_rate: 5.0e-6
lr_scheduler: cosine
warmup_ratio: 0.10
epochs: 1
batch_size: 1 (effective 16 with gradient accumulation)
max_length: 4096
precision: bfloat16
regularization: NEFTune alpha=5, weight_decay=0.01
gradient_checkpointing: true
seed: 42
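
For reference, the run above maps onto trl's SFTConfig roughly as follows. This is a sketch, not the lab's actual training script: parameter names follow recent trl/transformers releases and may differ slightly across versions (e.g. max_length vs max_seq_length):

```python
from trl import SFTConfig

# Sketch of the hyperparameters above in trl's SFTConfig.
config = SFTConfig(
    learning_rate=5.0e-6,
    lr_scheduler_type="cosine",
    warmup_ratio=0.10,
    num_train_epochs=1,
    per_device_train_batch_size=1,
    gradient_accumulation_steps=16,   # effective batch size 16
    max_length=4096,
    bf16=True,
    weight_decay=0.01,
    neftune_noise_alpha=5,            # NEFTune regularization
    gradient_checkpointing=True,
    seed=42,
)
```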

Hardware

  • GPU: 1× NVIDIA H100 NVL 94GB
  • Framework: PyTorch 2.8 + transformers 5.5.3 + trl 1.0 + peft 0.18

Training Dynamics

Step 10 (epoch 0.20): loss=5.723
Step 20 (epoch 0.40): loss=5.051
Step 30 (epoch 0.60): loss=4.483
Step 40 (epoch 0.80): loss=4.225
Step 51 (epoch 1.00): loss=4.076

Monotonically decreasing loss with no signs of overfitting.
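
The 51 optimizer steps in the log follow directly from the data size and effective batch size:

```python
import math

examples = 805
effective_batch = 16  # batch_size 1 × gradient accumulation 16

# One epoch over 805 examples at effective batch 16:
steps_per_epoch = math.ceil(examples / effective_batch)
print(steps_per_epoch)  # 51
```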

Available Files

Safetensors (unquantized, BF16)

Full merged model for HuggingFace transformers inference and further fine-tuning.

  • 13 safetensors shards, 59 GB total

GGUF (for llama-server / llama.cpp)

| File | Size | Use case |
|---|---|---|
| m51Lab-SeoGemma4-v2-31B-F16.gguf | 58 GB | Full precision |
| m51Lab-SeoGemma4-v2-31B-Q8_0.gguf | 31 GB | Recommended for production — minimal quality loss |
| m51Lab-SeoGemma4-v2-31B-Q4_K_M.gguf | 14 GB | Good quality/size balance — runs on 24GB+ VRAM |

LoRA Adapter

Compact adapter (68 MB) for use with PEFT. Requires google/gemma-4-31B-it as base model.

Usage

With transformers (Python)

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model = AutoModelForCausalLM.from_pretrained(
    "dervig/m51Lab-SeoGemma4-v2-31B",
    torch_dtype=torch.bfloat16,
    device_map="auto",
    attn_implementation="eager"  # Required for Gemma 4
)
tokenizer = AutoTokenizer.from_pretrained("dervig/m51Lab-SeoGemma4-v2-31B")

messages = [
    {"role": "system", "content": "Du er SeoGemma4, en norsk SEO-ekspert. Lever svar i 10 seksjoner."},
    {"role": "user", "content": "## OPPGAVE\nKomplett SEO-audit for eksempel.no\n\n## KOMPLETT DATAPAKKE\n..."}
]

encoded = tokenizer.apply_chat_template(
    messages, return_tensors="pt", add_generation_prompt=True,
    return_dict=True, enable_thinking=True
)
input_ids = encoded.input_ids.to(model.device)

output = model.generate(input_ids, max_new_tokens=16384, temperature=0.2, do_sample=True)
print(tokenizer.decode(output[0][input_ids.shape[1]:], skip_special_tokens=False))

With llama-server

llama-server \
  --model m51Lab-SeoGemma4-v2-31B-Q8_0.gguf \
  --host 0.0.0.0 --port 8000 \
  --n-gpu-layers 999 \
  --ctx-size 50000 \
  --jinja \
  --reasoning-format deepseek-legacy

Critical flags:

  • --jinja — required for Gemma 4 chat template and tool calling
  • --reasoning-format deepseek-legacy — keeps thinking in content field (not reasoning_content)
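
A minimal client sketch against the server started above. It assumes a standard llama-server build exposing the OpenAI-compatible /v1/chat/completions endpoint on the host/port from the command; the actual request is left commented out so the snippet runs without a live server:

```python
import json
from urllib import request

payload = {
    "model": "m51Lab-SeoGemma4-v2-31B-Q8_0",
    "temperature": 0.2,
    "max_tokens": 16384,
    "messages": [
        {"role": "system",
         "content": "Du er SeoGemma4, en norsk SEO-ekspert. Lever svar i 10 seksjoner."},
        {"role": "user",
         "content": "## OPPGAVE\nKomplett SEO-audit for eksempel.no\n..."},
    ],
}
req = request.Request(
    "http://localhost:8000/v1/chat/completions",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
# with request.urlopen(req) as resp:   # requires a running llama-server
#     print(json.load(resp)["choices"][0]["message"]["content"])
```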

v1 → v2 Improvements

| Aspect | v1 | v2 |
|---|---|---|
| Base model | NorskGemma4-31B | google/gemma-4-31B-it |
| Training data | 2,590 examples (narrative) | 805 examples (~2.5M tokens, structured) |
| Section adherence | Variable | 10/10 (100%) |
| Finding format | None | 7-field with Impact/Effort |
| Tables per audit | 0-3 | 33.6 average |
| Thinking | No | Yes (native Gemma 4) |
| Win rate vs Sonnet | 13.3% | ~50% (equal length) |

Where v2 wins and where Sonnet wins

v2 strengths (consistent across all judges):

  1. Explicit Impact/Effort scoring — Sonnet gives vague priorities; v2 gives numbers
  2. NÅ/NESTE/SENERE action plans — directly implementable timelines
  3. Consistent 10-section structure — immediate client overview

Sonnet strengths (consistent across all judges):

  1. Strategic depth — scenario modeling, causal analysis, "what if" reasoning
  2. Detailed implementation steps — 3-5 concrete developer-actionable steps per finding
  3. Economic quantification — "225K NOK/mnd lost potential" vs v2's "significant revenue loss"
  4. Honest negative reporting — includes declining keywords; v2 tends to cherry-pick positives

Expert verdict: "v2 looks cleaner and is easier to skim, but loses too much substance. The ideal would be a hybrid."

Limitations

  • Depth vs format trade-off: v2 excels at structure and prioritization but lacks the strategic depth of frontier models (scenario modeling, detailed implementation steps, economic quantification). An independent expert review preferred Sonnet for "actionable insight that lets a CEO send a step-by-step plan to their developer."
  • Output length: ~9-14K characters (trained distribution) — shorter than Sonnet's 40K. The model has learned a "natural audit length" from training data and stops generating around 9K regardless of token budget.
  • Thinking is in English (base Gemma 4 behavior), report follows the prompt language.
  • Optimized for M51 AI OS DataPackage format — works with arbitrary SEO prompts but best results come with the structured DataPackage input format.
  • 5-sample AB test — not statistically robust (needs 20+ for p<0.05). Results are directional.
  • Training data from single generator — 788/805 synthetic examples from Claude Sonnet 4.6, creating potential style monoculture.
  • Loss masking gap — trained without role-based loss masking (100% tokens unmasked). Fixing this in v3 is expected to improve quality.
  • Language note: Training data is primarily Norwegian bokmål, but the model's multilingual capabilities are fully preserved (see Language section above). No nynorsk-specific training.

Important Disclaimer

This model is released for research and educational purposes. m51 Lab does not use this model in its own production systems — M51 AI OS continues to use Claude Opus 4.6 for customer-facing SEO audits due to its superior strategic depth, economic quantification, and implementation detail (see evaluation results above).

This model demonstrates what is achievable with domain-specialized fine-tuning of open-source models, and serves as the foundation for continued research (v3+). It is not recommended as a drop-in replacement for frontier models in professional SEO consulting without further development.

Ethical Considerations

This model generates SEO recommendations, not medical, legal, or financial advice. All output should be reviewed by qualified SEO professionals before implementation. The model may hallucinate metrics not present in the input data — always verify numbers against the source DataPackage.
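
One way to catch hallucinated metrics is to extract every number from the report and check that it appears in the source DataPackage. A minimal sketch (exact string match only; derived values such as sums or percentages the model computes itself will be flagged and need human review):

```python
import re

# Crude numeric token matcher: digits with optional internal
# separators (spaces, dots, commas), e.g. "1 200" or "7,312,354".
NUM = re.compile(r"\d[\d\s.,]*\d|\d")

def unverified_numbers(report: str, datapackage: str) -> set[str]:
    """Numbers present in the report but never seen in the source data."""
    def nums(text: str) -> set[str]:
        return {m.group().strip() for m in NUM.finditer(text)}
    return nums(report) - nums(datapackage)
```

Any non-empty result is a candidate hallucination to verify against the DataPackage before the audit ships.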

Citation

@misc{m51lab-seogemma4-v2-2026,
  title={m51Lab-SeoGemma4-v2-31B: Domain-Specialized SEO Audit Model},
  author={m51 Lab},
  year={2026},
  url={https://huggingface.co/dervig/m51Lab-SeoGemma4-v2-31B},
  note={Fine-tuned from google/gemma-4-31B-it. Research model — not used in production.}
}

Credits & Acknowledgments

  • Google DeepMind for the Gemma 4 base model (Apache 2.0)
  • Anthropic for Claude Sonnet 4.6 (training data generation), Claude Haiku 4.5 and Claude Opus 4.6 (evaluation judges)
  • Google for Gemini 3.1 Pro (independent evaluation)
  • Georgi Gerganov and the llama.cpp project for GGUF conversion and quantization tooling
  • Hugging Face for the PEFT and TRL libraries used for LoRA training
  • m51 Lab for fine-tuning, evaluation, and the M51 AI OS production SEO workflow architecture

About M51 AI OS

M51 AI OS is an AI-powered marketing operations platform that automates reporting, campaign optimization, and SEO auditing for Norwegian businesses. The platform orchestrates multiple AI agents — including SEO, paid media, analytics, and creative specialists — to deliver professional marketing intelligence. m51 Lab is the research division of M51 AI OS, focused on developing domain-specialized open-source models.
