---
license: apache-2.0
language:
  - 'no'
  - nb
  - nn
  - en
base_model: google/gemma-4-31B-it
library_name: transformers
tags:
  - norwegian
  - norsk
  - bokmål
  - nynorsk
  - gemma4
  - noreval
  - pissa
  - lora
  - sft
  - text-generation-inference
pipeline_tag: text-generation
datasets:
  - NbAiLab/torgersen-alpaca
  - NbAiLab/norwegian-alpaca
  - NbAiLab/nynorsk_dpo
  - NbAiLab/nb-global-mmlu
  - NbAiLab/ndla_npk_conversational_nb_to_nn_tags_balanced
model-index:
  - name: m51Lab-NorskGemma4-31B
    results:
      - task:
          type: multiple-choice
          name: NorEval (Norwegian LLM Benchmark)
        dataset:
          type: ltgoslo/noreval
          name: NorEval
        metrics:
          - type: accuracy
            value: 0.836
            name: NorEval Average (best-of-5)
          - type: accuracy
            value: 0.854
            name: NorCommonsenseQA BM
          - type: accuracy
            value: 0.737
            name: NorCommonsenseQA NN
          - type: accuracy
            value: 0.965
            name: NorOpenBookQA BM
          - type: accuracy
            value: 0.944
            name: NorOpenBookQA NN
          - type: accuracy
            value: 0.857
            name: NorTruthfulQA BM
          - type: accuracy
            value: 0.93
            name: NorTruthfulQA NN
          - type: accuracy
            value: 0.709
            name: NRK Quiz QA BM
          - type: accuracy
            value: 0.696
            name: NRK Quiz QA NN
---

# m51Lab-NorskGemma4-31B

**Norway's top-scoring open-source language model on NorEval.**

Built by m51.ai Lab through surgical fine-tuning of Google Gemma 4 31B-it for Norwegian (Bokmål and Nynorsk).

| Model | Params | NorEval Avg | License |
|---|---|---|---|
| **m51Lab-NorskGemma4-31B** | 31B | **0.836** | Apache 2.0 |
| m51Lab-NorskMistral-119B | 119B MoE | 0.764 | Apache 2.0 |
| NorMistral-11B-thinking | 11B | 0.731 | |

Quantized GGUF versions for local inference: m51Lab-NorskGemma4-31B-GGUF

## Benchmark Results

Evaluated on NorEval (ACL 2025) — the standard benchmark for Norwegian language models. Protocol: 8 tasks, 5 prompt templates per task (best-of-5), loglikelihood scoring, full test sets, apply_chat_template=True.

| Task | m51Lab-NorskGemma4-31B | m51Lab-NorskMistral-119B | NorMistral-11B |
|---|---|---|---|
| NorCommonsenseQA (BM) | 0.854 | 0.717 | ~0.707 |
| NorCommonsenseQA (NN) | 0.737 | 0.632 | ~0.642 |
| NorOpenBookQA (BM) | 0.965 | 0.957 | ~0.790 |
| NorOpenBookQA (NN) | 0.944 | 0.933 | ~0.820 |
| NorTruthfulQA (BM) | 0.857 | 0.771 | ~0.480 |
| NorTruthfulQA (NN) | 0.930 | 0.825 | ~0.740 |
| NRK Quiz QA (BM) | 0.709 | 0.643 | ~0.640 |
| NRK Quiz QA (NN) | 0.696 | 0.636 | ~0.720 |
| **Average** | **0.836** | 0.764 | ~0.731 |
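The best-of-5 aggregation in the protocol above can be sketched as follows; the per-template accuracies here are illustrative placeholders, not the actual evaluation runs.

```python
# Sketch of NorEval's best-of-5 aggregation: each task is evaluated with
# five prompt templates, the best template's accuracy is kept per task,
# and the per-task maxima are averaged. Accuracy values are illustrative.

def best_of_5(per_template_acc):
    """Best accuracy across the five prompt templates for one task."""
    assert len(per_template_acc) == 5
    return max(per_template_acc)

tasks = {  # hypothetical per-template scores for two tasks
    "NorCommonsenseQA_BM": [0.810, 0.840, 0.854, 0.830, 0.820],
    "NorOpenBookQA_BM":    [0.950, 0.965, 0.940, 0.960, 0.955],
}
best = {task: best_of_5(scores) for task, scores in tasks.items()}
average = sum(best.values()) / len(best)
```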

## How to Use

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_id = "dervig/m51Lab-NorskGemma4-31B"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    attn_implementation="eager",  # required for the head_dim=512 global attention layers
)

# "Kva er hovudstaden i Noreg?" = "What is the capital of Norway?" (Nynorsk)
messages = [
    {"role": "user", "content": "Kva er hovudstaden i Noreg?"}
]

inputs = tokenizer.apply_chat_template(
    messages, return_tensors="pt", add_generation_prompt=True
).to(model.device)

outputs = model.generate(inputs, max_new_tokens=256)
# Decode only the newly generated tokens, skipping the prompt
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```

## Requirements

- GPU memory: ~64 GB for BF16 inference (1x A100 80GB or 2x A100 40GB)
- `attn_implementation="eager"`: required because the global attention layers use head_dim=512, which is incompatible with Flash Attention 2
- transformers >= 5.5.0, torch >= 2.6.0
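The ~64 GB figure can be sanity-checked from the parameter count alone; this is a back-of-the-envelope estimate, with KV cache and activation memory on top.

```python
# Back-of-the-envelope BF16 memory estimate from the parameter count.
params = 31.27e9          # total parameters (see Architecture below)
bytes_per_param = 2       # bfloat16 = 16 bits
weights_gib = params * bytes_per_param / 1024**3
# ~58 GiB for the weights alone; KV cache and activations account for
# the rest of the ~64 GB working figure quoted above.
```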

## Training Details

This model was created through a careful, surgical fine-tuning process, informed by five prior SFT attempts on smaller Gemma 4 variants (4B dense and 26B MoE) that all degraded performance.

### What Made This Attempt Different

| Problem in prior attempts | Solution here |
|---|---|
| 96K training examples caused inter-domain conflicts | 3,230 curated examples |
| 44% translation data destroyed reasoning | 0% translation |
| Random LoRA init wasted gradient budget on knowledge directions | PiSSA (SVD-based init) |
| All layers targeted, harming truthfulness | Only 50/60 sliding layers (global layers frozen) |
| No forgetting protection | 5% rehearsal data (Wikipedia + math/code) |
| Learning rate too high (1e-4 to 2e-4) | LR = 5e-6 (20-40x lower) |
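The PiSSA idea referenced above (SVD-based adapter initialization, as described in the PiSSA paper rather than this repository's training code) can be illustrated with a minimal NumPy sketch: the top-r singular directions of a frozen weight matrix seed the adapter, so the gradient budget goes to the principal components instead of random directions.

```python
import numpy as np

# Minimal PiSSA-style initialization sketch (illustrative, not the
# actual training code): split the top-r singular directions of the
# frozen weight into the trainable adapter; the residual stays frozen.
rng = np.random.default_rng(0)
W = rng.standard_normal((64, 32))   # toy frozen weight
r = 8                               # adapter rank, as used for this model

U, S, Vt = np.linalg.svd(W, full_matrices=False)
A = U[:, :r] * np.sqrt(S[:r])             # (64, r), trainable
B = np.sqrt(S[:r])[:, None] * Vt[:r]      # (r, 32), trainable
W_res = W - A @ B                         # frozen residual

# At initialization the decomposition is exact (W == W_res + A @ B),
# unlike random LoRA init, where the adapter starts with zero effect.
```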

### Training Configuration

| Parameter | Value |
|---|---|
| Base model | google/gemma-4-31B-it (30.7B params) |
| Method | PiSSA LoRA (r=8, alpha=16) + IPO preference optimization |
| LoRA targets | Sliding-layer q_proj + v_proj only (50 of 60 layers) |
| Frozen layers | 10 global attention layers (head_dim=512) — protects truthfulness |
| Trainable params | 9,216,000 (0.03% of 31.3B) |
| SFT data | 3,230 curated examples (67% Bokmål, 31% Nynorsk, 2% English rehearsal) |
| IPO data | 1,502 preference pairs |
| Learning rate | 5e-6 (SFT), 5e-7 (IPO) |
| NEFTune | noise alpha = 5 |
| Epochs | 1 (SFT) + 1 (IPO) |
| Training time | 26 min SFT + 17 min IPO on 2x H100 |
| Total project compute | ~$155 |
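The 9,216,000 trainable-parameter figure follows directly from the architecture numbers, assuming standard LoRA parameter counting of r × (d_in + d_out) per adapted matrix (PiSSA changes only the initialization, not the count):

```python
# Reproduce the trainable-parameter count from the architecture numbers.
hidden = 5376
n_heads, head_dim = 32, 256   # sliding-attention layers
n_kv_heads = 16
r = 8

q_out = n_heads * head_dim       # q_proj output dim: 8192
v_out = n_kv_heads * head_dim    # v_proj output dim: 4096

# LoRA adds r * (d_in + d_out) parameters per adapted matrix.
per_layer = r * (hidden + q_out) + r * (hidden + v_out)
trainable = per_layer * 50       # q_proj + v_proj in 50 sliding layers
# trainable == 9_216_000, matching the table above
```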

## Architecture

```
Model class:     Gemma4ForConditionalGeneration (dense, no MoE)
Layers:          60 (50 sliding + 10 global, pattern 5:1)
Hidden size:     5376
Attention heads: 32 (16 KV-heads sliding, 4 KV-heads global)
Head dim:        256 (sliding) / 512 (global)
MLP:             21504 intermediate
Total params:    31.27B
Context:         256K tokens
```

## Training Data Sources

| Source | Examples | Purpose |
|---|---|---|
| Locally curated (commonsense, knowledge, truthfulness) | 800 | Norwegian language understanding |
| NbAiLab/torgersen-alpaca | 500 | Norwegian factual knowledge |
| NbAiLab/ndla_npk_balanced | 600 | Nynorsk vocabulary |
| NbAiLab/nb-global-mmlu | 500 | Reasoning, general knowledge |
| NbAiLab/norwegian-alpaca | 400 | Bokmål reasoning |
| NbAiLab/nynorsk_dpo | 400 | Nynorsk alignment |
| Wikipedia (nb/nn/en) + math rehearsal | 200 | Forgetting protection |

## Contamination Check

We performed a formal contamination analysis comparing all 6,445 text segments from the training data against 18,124 test texts across all 8 NorEval tasks. Three methods were used: exact normalized matching, substring matching, and character-level n-gram overlap (50-gram and 30-gram).

Result: Zero contamination detected. No exact matches, no substring matches, and no suspicious n-gram overlaps (>30%) were found across any of the 8 NorEval tasks. The benchmark scores reflect genuine model performance.
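The character n-gram method described above can be sketched as follows; this is an illustrative reimplementation with the 30-gram size and 30% threshold from the text, not the actual analysis scripts.

```python
# Illustrative character n-gram overlap check in the spirit of the
# analysis above; the actual scripts are not published here.

def char_ngrams(text, n=30):
    """Set of character n-grams over case- and whitespace-normalized text."""
    text = " ".join(text.lower().split())
    return {text[i:i + n] for i in range(len(text) - n + 1)}

def overlap_ratio(train_text, test_text, n=30):
    """Fraction of the test text's n-grams that also occur in the training text."""
    test = char_ngrams(test_text, n)
    if not test:
        return 0.0
    return len(char_ngrams(train_text, n) & test) / len(test)

train_seg = "Hovudstaden i Noreg er Oslo, som ligg ved Oslofjorden." * 2
test_item = "Kva er hovudstaden i Noreg? Svar med eitt ord."
flagged = overlap_ratio(train_seg, test_item) > 0.30  # 30% threshold
```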

## Limitations

- Inherits limitations and potential biases from the base Gemma 4 model
- Optimized for NorEval benchmark tasks; real-world Norwegian capabilities may vary
- Requires `attn_implementation="eager"` (global layers have head_dim=512, incompatible with Flash Attention 2)
- The base model is multimodal (Gemma4ForConditionalGeneration); text-only inference requires a `mm_token_type_ids` input — handled automatically by `apply_chat_template`
- Not a "thinking" model — does not use structured chain-of-thought reasoning tokens

## Acknowledgments and Credits

This model would not have been possible without the work of many teams and individuals:

## Citation

```bibtex
@misc{m51lab2026norskgemma4,
  title={m51Lab-NorskGemma4-31B: Surgical Fine-Tuning of Gemma 4 for Norwegian},
  author={m51.ai Lab},
  year={2026},
  url={https://huggingface.co/dervig/m51Lab-NorskGemma4-31B},
}
```

Built by m51.ai Lab. Read the full build log and technical analysis on our blog.