---
license: apache-2.0
language:
  - 'no'
  - nb
  - nn
  - en
base_model: google/gemma-4-31B-it
library_name: transformers
tags:
  - norwegian
  - norsk
  - bokmål
  - nynorsk
  - gemma4
  - noreval
  - pissa
  - lora
  - sft
  - text-generation-inference
pipeline_tag: text-generation
datasets:
  - NbAiLab/torgersen-alpaca
  - NbAiLab/norwegian-alpaca
  - NbAiLab/nynorsk_dpo
  - NbAiLab/nb-global-mmlu
  - NbAiLab/ndla_npk_conversational_nb_to_nn_tags_balanced
model-index:
  - name: m51Lab-NorskGemma4-31B
    results:
      - task:
          type: multiple-choice
          name: NorEval (Norwegian LLM Benchmark)
        dataset:
          type: ltgoslo/noreval
          name: NorEval
        metrics:
          - type: accuracy
            value: 0.836
            name: NorEval Average (best-of-5)
          - type: accuracy
            value: 0.854
            name: NorCommonsenseQA BM
          - type: accuracy
            value: 0.737
            name: NorCommonsenseQA NN
          - type: accuracy
            value: 0.965
            name: NorOpenBookQA BM
          - type: accuracy
            value: 0.944
            name: NorOpenBookQA NN
          - type: accuracy
            value: 0.857
            name: NorTruthfulQA BM
          - type: accuracy
            value: 0.93
            name: NorTruthfulQA NN
          - type: accuracy
            value: 0.709
            name: NRK Quiz QA BM
          - type: accuracy
            value: 0.696
            name: NRK Quiz QA NN
---

# m51Lab-NorskGemma4-31B

**Norway's top-scoring open-source language model on NorEval.**

Built by m51.ai Lab through surgical fine-tuning of Google Gemma 4 31B-it for Norwegian (Bokmål and Nynorsk).

| Model | Params | NorEval Avg | License |
|---|---|---|---|
| **m51Lab-NorskGemma4-31B** | 31B | **0.836** | Apache 2.0 |
| m51Lab-NorskMistral-119B | 119B MoE | 0.764 | Apache 2.0 |
| NorMistral-11B-thinking | 11B | 0.731 | |

Quantized GGUF versions for local inference: m51Lab-NorskGemma4-31B-GGUF

## Benchmark Results

Evaluated on NorEval (ACL 2025) — the standard benchmark for Norwegian language models. Protocol: 8 tasks, 5 prompt templates per task (best-of-5), loglikelihood scoring, full test sets, apply_chat_template=True.

| Task | m51Lab-NorskGemma4-31B | m51Lab-NorskMistral-119B | NorMistral-11B |
|---|---|---|---|
| NorCommonsenseQA (BM) | 0.854 | 0.717 | ~0.707 |
| NorCommonsenseQA (NN) | 0.737 | 0.632 | ~0.642 |
| NorOpenBookQA (BM) | 0.965 | 0.957 | ~0.790 |
| NorOpenBookQA (NN) | 0.944 | 0.933 | ~0.820 |
| NorTruthfulQA (BM) | 0.857 | 0.771 | ~0.480 |
| NorTruthfulQA (NN) | 0.930 | 0.825 | ~0.740 |
| NRK Quiz QA (BM) | 0.709 | 0.643 | ~0.640 |
| NRK Quiz QA (NN) | 0.696 | 0.636 | ~0.720 |
| **Average** | **0.836** | 0.764 | ~0.731 |
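The best-of-5 aggregation in the protocol above can be sketched as follows; the per-template accuracies here are illustrative placeholders, not the actual evaluation runs.

```python
# Sketch of NorEval's best-of-5 aggregation: each task is evaluated with
# five prompt templates, the best template's accuracy is kept per task,
# and the per-task maxima are averaged. Accuracy values are illustrative.

def best_of_5(per_template_acc):
    """Best accuracy across the five prompt templates for one task."""
    assert len(per_template_acc) == 5
    return max(per_template_acc)

tasks = {  # hypothetical per-template scores for two tasks
    "NorCommonsenseQA_BM": [0.810, 0.840, 0.854, 0.830, 0.820],
    "NorOpenBookQA_BM":    [0.950, 0.965, 0.940, 0.960, 0.955],
}
best = {task: best_of_5(scores) for task, scores in tasks.items()}
average = sum(best.values()) / len(best)
```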

## How to Use

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_id = "dervig/m51Lab-NorskGemma4-31B"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    attn_implementation="eager",  # required for the head_dim=512 global attention layers
)

# "Kva er hovudstaden i Noreg?" = "What is the capital of Norway?" (Nynorsk)
messages = [
    {"role": "user", "content": "Kva er hovudstaden i Noreg?"}
]

inputs = tokenizer.apply_chat_template(
    messages, return_tensors="pt", add_generation_prompt=True
).to(model.device)

outputs = model.generate(inputs, max_new_tokens=256)
# Decode only the newly generated tokens, skipping the prompt
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```

## Requirements

- GPU memory: ~64 GB for BF16 inference (1x A100 80GB or 2x A100 40GB)
- `attn_implementation="eager"`: required because the global attention layers use head_dim=512, which is incompatible with Flash Attention 2
- transformers >= 5.5.0, torch >= 2.6.0
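The ~64 GB figure can be sanity-checked from the parameter count alone; this is a back-of-the-envelope estimate, with KV cache and activation memory on top.

```python
# Back-of-the-envelope BF16 memory estimate from the parameter count.
params = 31.27e9          # total parameters (see Architecture below)
bytes_per_param = 2       # bfloat16 = 16 bits
weights_gib = params * bytes_per_param / 1024**3
# ~58 GiB for the weights alone; KV cache and activations account for
# the rest of the ~64 GB working figure quoted above.
```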

## Training Details

This model was created through a careful, surgical fine-tuning process, informed by five prior SFT attempts on smaller Gemma 4 variants (4B dense and 26B MoE) that all degraded performance.

### What Made This Attempt Different

| Problem in prior attempts | Solution here |
|---|---|
| 96K training examples caused inter-domain conflicts | 3,230 curated examples |
| 44% translation data destroyed reasoning | 0% translation |
| Random LoRA init wasted gradient budget on knowledge directions | PiSSA (SVD-based init) |
| All layers targeted, harming truthfulness | Only 50/60 sliding layers (global layers frozen) |
| No forgetting protection | 5% rehearsal data (Wikipedia + math/code) |
| Learning rate too high (1e-4 to 2e-4) | LR = 5e-6 (20-40x lower) |
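The PiSSA idea referenced above (SVD-based adapter initialization, as described in the PiSSA paper rather than this repository's training code) can be illustrated with a minimal NumPy sketch: the top-r singular directions of a frozen weight matrix seed the adapter, so the gradient budget goes to the principal components instead of random directions.

```python
import numpy as np

# Minimal PiSSA-style initialization sketch (illustrative, not the
# actual training code): split the top-r singular directions of the
# frozen weight into the trainable adapter; the residual stays frozen.
rng = np.random.default_rng(0)
W = rng.standard_normal((64, 32))   # toy frozen weight
r = 8                               # adapter rank, as used for this model

U, S, Vt = np.linalg.svd(W, full_matrices=False)
A = U[:, :r] * np.sqrt(S[:r])             # (64, r), trainable
B = np.sqrt(S[:r])[:, None] * Vt[:r]      # (r, 32), trainable
W_res = W - A @ B                         # frozen residual

# At initialization the decomposition is exact (W == W_res + A @ B),
# unlike random LoRA init, where the adapter starts with zero effect.
```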

### Training Configuration

| Parameter | Value |
|---|---|
| Base model | google/gemma-4-31B-it (30.7B params) |
| Method | PiSSA LoRA (r=8, alpha=16) + IPO preference optimization |
| LoRA targets | Sliding-layer q_proj + v_proj only (50 of 60 layers) |
| Frozen layers | 10 global attention layers (head_dim=512) — protects truthfulness |
| Trainable params | 9,216,000 (0.03% of 31.3B) |
| SFT data | 3,230 curated examples (67% Bokmål, 31% Nynorsk, 2% English rehearsal) |
| IPO data | 1,502 preference pairs |
| Learning rate | 5e-6 (SFT), 5e-7 (IPO) |
| NEFTune | noise alpha = 5 |
| Epochs | 1 (SFT) + 1 (IPO) |
| Training time | 26 min SFT + 17 min IPO on 2x H100 |
| Total project compute | ~$155 |
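The 9,216,000 trainable-parameter figure follows directly from the architecture numbers, assuming standard LoRA parameter counting of r × (d_in + d_out) per adapted matrix (PiSSA changes only the initialization, not the count):

```python
# Reproduce the trainable-parameter count from the architecture numbers.
hidden = 5376
n_heads, head_dim = 32, 256   # sliding-attention layers
n_kv_heads = 16
r = 8

q_out = n_heads * head_dim       # q_proj output dim: 8192
v_out = n_kv_heads * head_dim    # v_proj output dim: 4096

# LoRA adds r * (d_in + d_out) parameters per adapted matrix.
per_layer = r * (hidden + q_out) + r * (hidden + v_out)
trainable = per_layer * 50       # q_proj + v_proj in 50 sliding layers
# trainable == 9_216_000, matching the table above
```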

## Architecture

```
Model class:     Gemma4ForConditionalGeneration (dense, no MoE)
Layers:          60 (50 sliding + 10 global, pattern 5:1)
Hidden size:     5376
Attention heads: 32 (16 KV-heads sliding, 4 KV-heads global)
Head dim:        256 (sliding) / 512 (global)
MLP:             21504 intermediate
Total params:    31.27B
Context:         256K tokens
```

## Training Data Sources

| Source | Examples | Purpose |
|---|---|---|
| Locally curated (commonsense, knowledge, truthfulness) | 800 | Norwegian language understanding |
| NbAiLab/torgersen-alpaca | 500 | Norwegian factual knowledge |
| NbAiLab/ndla_npk_balanced | 600 | Nynorsk vocabulary |
| NbAiLab/nb-global-mmlu | 500 | Reasoning, general knowledge |
| NbAiLab/norwegian-alpaca | 400 | Bokmål reasoning |
| NbAiLab/nynorsk_dpo | 400 | Nynorsk alignment |
| Wikipedia (nb/nn/en) + math rehearsal | 200 | Forgetting protection |

## Contamination Check

We performed a formal contamination analysis comparing all 6,445 text segments from the training data against 18,124 test texts across all 8 NorEval tasks. Three methods were used: exact normalized matching, substring matching, and character-level n-gram overlap (50-gram and 30-gram).

Result: Zero contamination detected. No exact matches, no substring matches, and no suspicious n-gram overlaps (>30%) were found across any of the 8 NorEval tasks. The benchmark scores reflect genuine model performance.
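The character n-gram method described above can be sketched as follows; this is an illustrative reimplementation with the 30-gram size and 30% threshold from the text, not the actual analysis scripts.

```python
# Illustrative character n-gram overlap check in the spirit of the
# analysis above; the actual scripts are not published here.

def char_ngrams(text, n=30):
    """Set of character n-grams over case- and whitespace-normalized text."""
    text = " ".join(text.lower().split())
    return {text[i:i + n] for i in range(len(text) - n + 1)}

def overlap_ratio(train_text, test_text, n=30):
    """Fraction of the test text's n-grams that also occur in the training text."""
    test = char_ngrams(test_text, n)
    if not test:
        return 0.0
    return len(char_ngrams(train_text, n) & test) / len(test)

train_seg = "Hovudstaden i Noreg er Oslo, som ligg ved Oslofjorden." * 2
test_item = "Kva er hovudstaden i Noreg? Svar med eitt ord."
flagged = overlap_ratio(train_seg, test_item) > 0.30  # 30% threshold
```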

## Limitations

- Inherits limitations and potential biases from the base Gemma 4 model
- Optimized for NorEval benchmark tasks; real-world Norwegian capabilities may vary
- Requires `attn_implementation="eager"` (global layers have head_dim=512, incompatible with Flash Attention 2)
- The base model is multimodal (Gemma4ForConditionalGeneration); text-only inference requires a `mm_token_type_ids` input — handled automatically by `apply_chat_template`
- Not a "thinking" model — does not use structured chain-of-thought reasoning tokens

## Acknowledgments and Credits

This model would not have been possible without the work of many teams and individuals:

## Citation

```bibtex
@misc{m51lab2026norskgemma4,
  title={m51Lab-NorskGemma4-31B: Surgical Fine-Tuning of Gemma 4 for Norwegian},
  author={m51.ai Lab},
  year={2026},
  url={https://huggingface.co/dervig/m51Lab-NorskGemma4-31B},
}
```

Built by m51.ai Lab. Read the full build log and technical analysis on our blog.