m51Lab-SeoGemma4-v2-31B

A domain-specialized SEO audit model competitive with Claude Sonnet 4.6 on structured report generation.

Fine-tuned from google/gemma-4-31B-it by m51 Lab for professional SEO auditing. While training data is primarily Norwegian, the fine-tuning uses only LoRA (0.03% of parameters) and does not modify the base model's language capabilities. Gemma 4's full multilingual support (140+ languages) is preserved — the model works equally well for SEO auditing in English, German, Swedish, or any other language Gemma 4 supports. The SEO methodology, format structure, and prioritization framework are language-agnostic.

Key Results

| Metric | Value |
|---|---|
| Win rate vs Sonnet 4.6 (equal length, Haiku judge) | 60% |
| Win rate vs Sonnet 4.6 (equal length, Opus judge) | 40% |
| v1 baseline win rate | 13.3% |
| Improvement | ~4× (from 13.3% to ~50%) |
| Training data | 805 examples, ~2.5M tokens |
| Base model | google/gemma-4-31B-it (31.3B parameters) |

In an equal-length blind test, two independent judges evaluated v2 against Claude Sonnet 4.6:

  • v2 wins on prioritization: explicit Impact/Effort scores, NÅ/NESTE/SENERE action plans
  • Sonnet wins on strategic depth: scenario modeling, causal analysis, detailed implementation steps
  • Overall: approximately even (~50% win rate depending on judge preferences)

A deeper qualitative expert review concluded: "The ideal would be a hybrid: v2's consistent structure and compact action plan, combined with Sonnet's forecast depth, economic quantification, and detailed implementation steps."

Language & Multilingual Support

Important: This model is NOT Norwegian-only. The base model (google/gemma-4-31B-it) supports 140+ languages natively. Our fine-tuning:

  • Trains only 0.03% of parameters via LoRA (9.2M out of 31.3B)
  • Freezes all 10 global attention layers (preserving multilingual and long-range capabilities)
  • Only modifies q_proj and v_proj on 50 sliding attention layers

This means the base model's language understanding, generation quality, and multilingual capability are fully preserved. The fine-tuning teaches the model a specific output format (10-section SEO audit with Impact/Effort scoring) and domain knowledge (SEO methodology, Norwegian regulatory context) — not a language.

We validated on Norwegian because our production system (M51 AI OS) serves Norwegian businesses, but the structured audit format works identically in any language.

Model Description

m51Lab-SeoGemma4-v2-31B is designed for the M51 AI OS SEO audit workflow. Given a pre-assembled data package containing Search Console metrics, PageSpeed scores, backlink profiles, keyword rankings, AI search visibility data, and historical audit comparisons, the model produces a comprehensive Norwegian SEO audit report.

Output Structure

The model produces reports with exactly 10 sections:

  1. EXECUTIVE SUMMARY — Overall SEO health assessment [GOD/MEDIUM/SVAK]
  2. SØKEORDSSTATUS — Keyword rankings table with position, clicks, impressions, status
  3. CORE WEB VITALS & TILGJENGELIGHET — Mobile vs desktop CWV comparison table
  4. BACKLINK & AUTORITET — Domain Authority, referring domains, competitor DA gap
  5. AI-SØKESYNLIGHET — Citation score trends, AI Overview coverage
  6. PROGNOSE OG TRENDER — Statistical forecasts, Z-score anomaly detection
  7. FUNN — Findings sorted by Impact, each with 7 structured fields
  8. MULIGHETER — Opportunities table with expected effect, effort, priority
  9. HANDLINGSPLAN — Action plan tables: NÅ (<1 week) / NESTE (1-4 weeks) / SENERE (>4 weeks)
  10. USIKKERHETSVURDERING — Data gaps, assumptions, confidence levels
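
Section adherence is easy to check mechanically. A minimal sketch (heading spellings taken from the list above; real output may add markdown prefixes or numbering, so the check is substring-based and order-aware):

```python
# The 10 required section headings, in order (from the format spec above).
SECTIONS = [
    "EXECUTIVE SUMMARY", "SØKEORDSSTATUS", "CORE WEB VITALS & TILGJENGELIGHET",
    "BACKLINK & AUTORITET", "AI-SØKESYNLIGHET", "PROGNOSE OG TRENDER",
    "FUNN", "MULIGHETER", "HANDLINGSPLAN", "USIKKERHETSVURDERING",
]

def check_sections(report: str) -> list[str]:
    """Return the headings that are missing (or appear out of order)."""
    missing, pos = [], 0
    for name in SECTIONS:
        idx = report.find(name, pos)
        if idx == -1:
            missing.append(name)
        else:
            pos = idx + len(name)  # next heading must come after this one
    return missing
```

An empty return value corresponds to the 10/10 section adherence reported below.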

Each finding in section 7 follows a strict format:

### [KRITISK] 1. Action-oriented title
**Kategori:** technical | content | authority | performance
**Impact:** [85] | **Effort:** [20]
**Beskrivelse:** What the problem is.
**Konsekvens:** Quantified consequence.
**Sannsynlig årsak:** Root cause.
**Estimert effekt:** Expected gain (consistent with Konsekvens).
**Tiltak:** Concrete action starting with a verb.
**Verifikasjon:** How to verify and over what timeframe.
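
The field labels above can be validated with a short regex check. A minimal sketch (labels copied from the template; Impact and Effort share one line in the format but are checked individually here):

```python
import re

# Field labels from the finding template above.
FIELDS = ["Kategori", "Impact", "Effort", "Beskrivelse", "Konsekvens",
          "Sannsynlig årsak", "Estimert effekt", "Tiltak", "Verifikasjon"]

def missing_fields(finding: str) -> list[str]:
    """Return template fields absent from a single finding block."""
    return [f for f in FIELDS
            if not re.search(rf"\*\*{re.escape(f)}:\*\*", finding)]
```

A conforming finding returns an empty list; anything else pinpoints which labels the model dropped.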

Thinking

The model uses Gemma 4's native thinking tokens (<|channel|>thought\n...\n<|channel|>) for internal reasoning before generating the audit. Thinking is in English (base model behavior).

To enable thinking at inference, set enable_thinking=True in apply_chat_template() or use --reasoning-format deepseek-legacy --jinja with llama-server.

Training Details

Architecture

  • Base model: google/gemma-4-31B-it (Google Gemma 4, 31.3B dense)
  • Method: PiSSA-initialized LoRA (SVD-based initialization)
  • LoRA config: r=8, alpha=16, dropout=0.05, target modules: q_proj + v_proj on sliding attention layers only
  • Frozen layers: 10 global attention layers (indices 5, 11, 17, 23, 29, 35, 41, 47, 53, 59) — preserves long-range reasoning and truthfulness
  • Trainable parameters: 9,216,000 / 31,282,302,512 = 0.0295%
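
The layer split and trainable fraction follow directly from the numbers above (the frozen global layers sit at every 6th index starting at 5):

```python
# Global attention layers: every 6th index starting at 5 (from the list above).
global_layers = list(range(5, 60, 6))            # [5, 11, ..., 59]
sliding_layers = [i for i in range(60) if i not in global_layers]

trainable = 9_216_000
total = 31_282_302_512
fraction = 100 * trainable / total

print(len(global_layers), len(sliding_layers))   # 10 50
print(f"{fraction:.4f}%")                        # 0.0295%
```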

Training Data

| Metric | Value |
|---|---|
| Total examples | 805 |
| Total tokens | ~2.5 million |
| Total characters | 7,312,354 |
| Avg tokens/example | ~3,100 |

Category distribution:

| Category | Examples | Tokens | Share |
|---|---|---|---|
| audit_synthesis | 177 | 1,196,985 | 57.3% |
| structured_priority | 90 | 227,279 | 10.9% |
| escape_hatch | 88 | 180,545 | 8.6% |
| audit_synthesis_partial | 33 | 158,577 | 7.6% |
| chat_tool_use | 138 | 128,016 | 6.1% |
| rehearsal (Norwegian preservation) | 155 | 62,133 | 3.0% |
| content_strategy | 41 | 40,154 | 1.9% |
| local_seo | 33 | 38,511 | 1.8% |
| keyword_research | 25 | 28,753 | 1.4% |
| technical_seo | 25 | 28,292 | 1.4% |

Data sources:

  • 13 production-reconstructed examples from real M51 AI OS audit runs (reverse-engineered DataPackage inputs paired with Claude Opus-generated outputs)
  • 4 hand-crafted gold standards (Opus-generated, covering rich e-commerce, sparse startup, Holt-Winters anomaly, multi-competitor scenarios)
  • 788 synthetic examples generated by Claude Sonnet 4.6 across 14+ Norwegian industry verticals (e-commerce, SaaS, healthcare, fintech, legal, automotive, construction, food & beverage, manufacturing, education, nonprofit, media, travel, real estate)

Hyperparameters

learning_rate: 5.0e-6
lr_scheduler: cosine
warmup_ratio: 0.10
epochs: 1
batch_size: 1 (effective 16 with gradient accumulation)
max_length: 4096
precision: bfloat16
regularization: NEFTune alpha=5, weight_decay=0.01
gradient_checkpointing: true
seed: 42
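
For reference, the run above maps onto trl's SFTConfig roughly as follows. This is a sketch, not the lab's actual training script: parameter names follow recent trl/transformers releases and may differ slightly across versions (e.g. max_length vs max_seq_length):

```python
from trl import SFTConfig

# Sketch of the hyperparameters above in trl's SFTConfig.
config = SFTConfig(
    learning_rate=5.0e-6,
    lr_scheduler_type="cosine",
    warmup_ratio=0.10,
    num_train_epochs=1,
    per_device_train_batch_size=1,
    gradient_accumulation_steps=16,   # effective batch size 16
    max_length=4096,
    bf16=True,
    weight_decay=0.01,
    neftune_noise_alpha=5,            # NEFTune regularization
    gradient_checkpointing=True,
    seed=42,
)
```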

Hardware

  • GPU: 1× NVIDIA H100 NVL 94GB
  • Framework: PyTorch 2.8 + transformers 5.5.3 + trl 1.0 + peft 0.18

Training Dynamics

Step 10 (epoch 0.20): loss=5.723
Step 20 (epoch 0.40): loss=5.051
Step 30 (epoch 0.60): loss=4.483
Step 40 (epoch 0.80): loss=4.225
Step 51 (epoch 1.00): loss=4.076

Monotonically decreasing loss with no signs of overfitting.
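
The 51 optimizer steps in the log follow directly from the data size and effective batch size:

```python
import math

examples = 805
effective_batch = 16  # batch_size 1 × gradient accumulation 16

# One epoch over 805 examples at effective batch 16:
steps_per_epoch = math.ceil(examples / effective_batch)
print(steps_per_epoch)  # 51
```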

Available Files

Safetensors (unquantized, BF16)

Full merged model for HuggingFace transformers inference and further fine-tuning.

  • 13 safetensors shards, 59 GB total

GGUF (for llama-server / llama.cpp)

| File | Size | Use case |
|---|---|---|
| m51Lab-SeoGemma4-v2-31B-F16.gguf | 58 GB | Full precision |
| m51Lab-SeoGemma4-v2-31B-Q8_0.gguf | 31 GB | Recommended for production — minimal quality loss |
| m51Lab-SeoGemma4-v2-31B-Q4_K_M.gguf | 14 GB | Good quality/size balance — runs on 24GB+ VRAM |

LoRA Adapter

Compact adapter (68 MB) for use with PEFT. Requires google/gemma-4-31B-it as base model.

Usage

With transformers (Python)

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model = AutoModelForCausalLM.from_pretrained(
    "dervig/m51Lab-SeoGemma4-v2-31B",
    torch_dtype=torch.bfloat16,
    device_map="auto",
    attn_implementation="eager"  # Required for Gemma 4
)
tokenizer = AutoTokenizer.from_pretrained("dervig/m51Lab-SeoGemma4-v2-31B")

messages = [
    {"role": "system", "content": "Du er SeoGemma4, en norsk SEO-ekspert. Lever svar i 10 seksjoner."},
    {"role": "user", "content": "## OPPGAVE\nKomplett SEO-audit for eksempel.no\n\n## KOMPLETT DATAPAKKE\n..."}
]

encoded = tokenizer.apply_chat_template(
    messages, return_tensors="pt", add_generation_prompt=True,
    return_dict=True, enable_thinking=True
)
input_ids = encoded.input_ids.to(model.device)

output = model.generate(input_ids, max_new_tokens=16384, temperature=0.2, do_sample=True)
print(tokenizer.decode(output[0][input_ids.shape[1]:], skip_special_tokens=False))

With llama-server

llama-server \
  --model m51Lab-SeoGemma4-v2-31B-Q8_0.gguf \
  --host 0.0.0.0 --port 8000 \
  --n-gpu-layers 999 \
  --ctx-size 50000 \
  --jinja \
  --reasoning-format deepseek-legacy

Critical flags:

  • --jinja — required for Gemma 4 chat template and tool calling
  • --reasoning-format deepseek-legacy — keeps thinking in content field (not reasoning_content)
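
A minimal client sketch against the server started above. It assumes a standard llama-server build exposing the OpenAI-compatible /v1/chat/completions endpoint on the host/port from the command; the actual request is left commented out so the snippet runs without a live server:

```python
import json
from urllib import request

payload = {
    "model": "m51Lab-SeoGemma4-v2-31B-Q8_0",
    "temperature": 0.2,
    "max_tokens": 16384,
    "messages": [
        {"role": "system",
         "content": "Du er SeoGemma4, en norsk SEO-ekspert. Lever svar i 10 seksjoner."},
        {"role": "user",
         "content": "## OPPGAVE\nKomplett SEO-audit for eksempel.no\n..."},
    ],
}
req = request.Request(
    "http://localhost:8000/v1/chat/completions",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
# with request.urlopen(req) as resp:   # requires a running llama-server
#     print(json.load(resp)["choices"][0]["message"]["content"])
```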

v1 → v2 Improvements

| Aspect | v1 | v2 |
|---|---|---|
| Base model | NorskGemma4-31B | google/gemma-4-31B-it |
| Training data | 2,590 examples (narrative) | 805 examples (~2.5M tokens, structured) |
| Section adherence | Variable | 10/10 (100%) |
| Finding format | None | 7-field with Impact/Effort |
| Tables per audit | 0-3 | 33.6 average |
| Thinking | No | Yes (native Gemma 4) |
| Win rate vs Sonnet | 13.3% | ~50% (equal length) |

Where v2 wins and where Sonnet wins

v2 strengths (consistent across all judges):

  1. Explicit Impact/Effort scoring — Sonnet gives vague priorities; v2 gives numbers
  2. NÅ/NESTE/SENERE action plans — directly implementable timelines
  3. Consistent 10-section structure — immediate client overview

Sonnet strengths (consistent across all judges):

  1. Strategic depth — scenario modeling, causal analysis, "what if" reasoning
  2. Detailed implementation steps — 3-5 concrete developer-actionable steps per finding
  3. Economic quantification — "225K NOK/mnd lost potential" vs v2's "significant revenue loss"
  4. Honest negative reporting — includes declining keywords; v2 tends to cherry-pick positives

Expert verdict: "v2 looks cleaner and is easier to skim, but loses too much substance. The ideal would be a hybrid."

Limitations

  • Depth vs format trade-off: v2 excels at structure and prioritization but lacks the strategic depth of frontier models (scenario modeling, detailed implementation steps, economic quantification). An independent expert review preferred Sonnet for "actionable insight that lets a CEO send a step-by-step plan to their developer."
  • Output length: ~9-14K characters (trained distribution) — shorter than Sonnet's 40K. The model has learned a "natural audit length" from training data and stops generating around 9K regardless of token budget.
  • Thinking is in English (base Gemma 4 behavior), report follows the prompt language.
  • Optimized for M51 AI OS DataPackage format — works with arbitrary SEO prompts but best results come with the structured DataPackage input format.
  • 5-sample AB test — not statistically robust (needs 20+ for p<0.05). Results are directional.
  • Training data from single generator — 788/805 synthetic examples from Claude Sonnet 4.6, creating potential style monoculture.
  • Loss masking gap — trained without role-based loss masking (100% tokens unmasked). Fixing this in v3 is expected to improve quality.
  • Language note: Training data is primarily Norwegian bokmål, but the model's multilingual capabilities are fully preserved (see Language section above). No nynorsk-specific training.

Important Disclaimer

This model is released for research and educational purposes. m51 Lab does not use this model in its own production systems — M51 AI OS continues to use Claude Opus 4.6 for customer-facing SEO audits due to its superior strategic depth, economic quantification, and implementation detail (see evaluation results above).

This model demonstrates what is achievable with domain-specialized fine-tuning of open-source models, and serves as the foundation for continued research (v3+). It is not recommended as a drop-in replacement for frontier models in professional SEO consulting without further development.

Ethical Considerations

This model generates SEO recommendations, not medical, legal, or financial advice. All output should be reviewed by qualified SEO professionals before implementation. The model may hallucinate metrics not present in the input data — always verify numbers against the source DataPackage.
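
One way to catch hallucinated metrics is to extract every number from the report and check that it appears in the source DataPackage. A minimal sketch (exact string match only; derived values such as sums or percentages the model computes itself will be flagged and need human review):

```python
import re

# Crude numeric token matcher: digits with optional internal
# separators (spaces, dots, commas), e.g. "1 200" or "7,312,354".
NUM = re.compile(r"\d[\d\s.,]*\d|\d")

def unverified_numbers(report: str, datapackage: str) -> set[str]:
    """Numbers present in the report but never seen in the source data."""
    def nums(text: str) -> set[str]:
        return {m.group().strip() for m in NUM.finditer(text)}
    return nums(report) - nums(datapackage)
```

Any non-empty result is a candidate hallucination to verify against the DataPackage before the audit ships.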

Citation

@misc{m51lab-seogemma4-v2-2026,
  title={m51Lab-SeoGemma4-v2-31B: Domain-Specialized SEO Audit Model},
  author={m51 Lab},
  year={2026},
  url={https://huggingface.co/dervig/m51Lab-SeoGemma4-v2-31B},
  note={Fine-tuned from google/gemma-4-31B-it. Research model — not used in production.}
}

Credits & Acknowledgments

  • Google DeepMind for the Gemma 4 base model (Apache 2.0)
  • Anthropic for Claude Sonnet 4.6 (training data generation), Claude Haiku 4.5 and Claude Opus 4.6 (evaluation judges)
  • Google for Gemini 3.1 Pro (independent evaluation)
  • Georgi Gerganov and the llama.cpp project for GGUF conversion and quantization tooling
  • Hugging Face for the PEFT and TRL libraries used for LoRA training
  • m51 Lab for fine-tuning, evaluation, and the M51 AI OS production SEO workflow architecture

About M51 AI OS

M51 AI OS is an AI-powered marketing operations platform that automates reporting, campaign optimization, and SEO auditing for Norwegian businesses. The platform orchestrates multiple AI agents — including SEO, paid media, analytics, and creative specialists — to deliver professional marketing intelligence. m51 Lab is the research division of M51 AI OS, focused on developing domain-specialized open-source models.
