qwen3-0.6b-german

A German instruction-following model fine-tuned from Qwen3-0.6B using QLoRA on the same four German instruct datasets used in the LLäMmlein paper (Pfister et al., ACL 2025).

Trained on a single RTX 4070 Ti (8GB VRAM) in ~40 hours - no cloud compute required.

For a merged version (base model + QLoRA adapters), see philipp-zettl/qwen3-0.6b-german-merged.

Results

Evaluated using lm-evaluation-harness on the same German tasks reported in the LLäMmlein paper. All numbers below are from our own evaluation runs - not copied from the paper - ensuring a fair, version-consistent comparison.

Main benchmarks

| Task | Qwen3-0.6B (base) | qwen3-0.6b-german-merged (ours) | LLäMmlein 1B | LLäMmlein 1B Instruct |
|---|---|---|---|---|
| HellaSwag-DE | 0.3111 | **0.3193** | 0.4366 | 0.4492 |
| ARC-DE | 0.2352 | **0.2575** | 0.2729 | 0.2635 |
| MMLU-DE (avg) | **0.3600** | 0.2475 | 0.2350 | 0.2400 |

Bold = best between the base model and our fine-tuned model. All scores are accuracy / normalized accuracy.

vs. Base model (Δ)

| Task | Base | Fine-tuned | Δ |
|---|---|---|---|
| HellaSwag-DE | 0.3111 | 0.3193 | ✅ +0.0082 |
| ARC-DE | 0.2352 | 0.2575 | ✅ +0.0222 |
| MMLU-DE (avg) | 0.3600 | 0.2475 | 🔻 -0.1125 |
| MMLU-DE Business | 0.3276 | 0.1897 | 🔻 -0.1379 |
| MMLU-DE Humanities | 0.3627 | 0.2745 | 🔻 -0.0882 |
| MMLU-DE Medical | 0.3333 | 0.3056 | 🔻 -0.0278 |
| MMLU-DE Other | 0.4643 | 0.3929 | 🔻 -0.0714 |
| MMLU-DE Social Sciences | 0.3529 | 0.1569 | 🔻 -0.1961 |
| MMLU-DE STEM | 0.3043 | 0.2391 | 🔻 -0.0652 |

Fine-tuned model wins on 2/3 main tasks. Average delta across main tasks: -0.0273.
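The win count and average delta can be reproduced directly from the reported scores. A minimal sketch (last-digit differences versus the table's deltas are expected, since those were presumably computed from unrounded scores):

```python
# Recompute base-vs-fine-tuned deltas from the main benchmark scores above.
scores = {
    "HellaSwag-DE": (0.3111, 0.3193),
    "ARC-DE": (0.2352, 0.2575),
    "MMLU-DE (avg)": (0.3600, 0.2475),
}

deltas = {task: round(ft - base, 4) for task, (base, ft) in scores.items()}
wins = sum(d > 0 for d in deltas.values())
avg_delta = round(sum(deltas.values()) / len(deltas), 4)

print(deltas)           # per-task delta
print(wins, avg_delta)  # 2 wins, average delta -0.0273
```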

MMLU-DE breakdown

| Task | Qwen3-0.6B (base) | qwen3-0.6b-german-merged (ours) | LLäMmlein 1B | LLäMmlein 1B Instruct |
|---|---|---|---|---|
| MMLU-DE Business | 0.3276 | 0.1897 | 0.2069 | 0.3103 |
| MMLU-DE Humanities | 0.3627 | 0.2745 | 0.2745 | 0.2255 |
| MMLU-DE Medical | 0.3333 | 0.3056 | 0.3611 | 0.3056 |
| MMLU-DE Other | 0.4643 | 0.3929 | 0.2857 | 0.2143 |
| MMLU-DE Social Sciences | 0.3529 | 0.1569 | 0.1863 | 0.2157 |
| MMLU-DE STEM | 0.3043 | 0.2391 | 0.1304 | 0.2174 |

Key findings

  • HellaSwag-DE and ARC-DE consistently improve over the base model across all checkpoints, confirming that instruction fine-tuning on German data improves commonsense reasoning.
  • MMLU-DE shows an alignment tax: accuracy drops at ~20k steps and does not fully recover by the final checkpoint (net -0.1125 vs. the base model) - a known cost of instruction fine-tuning on a narrow data mix.
  • Competitive with LLäMmlein 1B on knowledge tasks despite having 40% fewer parameters, demonstrating that Qwen3's pretraining provides a strong German foundation.

Training details

| Property | Value |
|---|---|
| Base model | Qwen3-0.6B |
| Method | QLoRA (4-bit, rank 64, RSLoRA) |
| Training steps | 40,000 |
| Effective batch size | 16 |
| Learning rate | 2e-4 (cosine schedule) |
| Context length | 2048 tokens |
| Hardware | RTX 4070 Ti 8GB |
| Framework | Unsloth 2026.2.1 |
| Code | https://github.com/philsupertramp/gwen |

LoRA target modules: q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj
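As a rough sketch of what this configuration amounts to (the LoRA alpha value is not stated in this card and is an assumption here; the scaling rule is the defining property of RSLoRA):

```python
import math

# Hypothetical mirror of the training configuration above.
# lora_alpha is an ASSUMPTION -- it is not reported in this card.
lora_config = {
    "r": 64,
    "lora_alpha": 64,       # assumed, not reported
    "use_rslora": True,     # rank-stabilized LoRA
    "load_in_4bit": True,   # QLoRA: 4-bit quantized base weights
    "target_modules": [
        "q_proj", "k_proj", "v_proj", "o_proj",  # attention projections
        "gate_proj", "up_proj", "down_proj",     # MLP projections
    ],
}

# RSLoRA scales adapter outputs by alpha / sqrt(r) instead of alpha / r,
# which keeps the effective update magnitude stable at higher ranks.
scaling = lora_config["lora_alpha"] / math.sqrt(lora_config["r"])
print(scaling)  # 8.0
```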

Training data

Same four datasets used in the LLäMmlein paper:

| Dataset | Type |
|---|---|
| FreedomIntelligence/alpaca-gpt4-deutsch | Alpaca-style |
| FreedomIntelligence/evol-instruct-deutsch | Evolved instructions |
| FreedomIntelligence/sharegpt-deutsch | Multi-turn conversations |
| LSX-UniWue/Guanako | German instruct |
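Before training, each instruction record has to be rendered into Qwen3's ChatML format. A minimal sketch of that conversion for an Alpaca-style pair (the field names and the default system prompt are illustrative assumptions, not verified against these exact datasets):

```python
def to_chatml(instruction: str, output: str,
              system: str = "Du bist ein hilfreicher Assistent.") -> str:
    """Render one Alpaca-style (instruction, output) pair as a ChatML example."""
    return (
        f"<|im_start|>system\n{system}<|im_end|>\n"
        f"<|im_start|>user\n{instruction}<|im_end|>\n"
        f"<|im_start|>assistant\n{output}<|im_end|>\n"
    )

example = to_chatml(
    "Was ist die Hauptstadt von Deutschland?",   # "What is the capital of Germany?"
    "Die Hauptstadt von Deutschland ist Berlin.",
)
print(example)
```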

Usage

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model = AutoModelForCausalLM.from_pretrained(
    "philipp-zettl/qwen3-0.6b-german-merged",
    # The bare load_in_4bit kwarg is deprecated; pass a quantization config
    # instead (requires the bitsandbytes package).
    quantization_config=BitsAndBytesConfig(load_in_4bit=True),
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("philipp-zettl/qwen3-0.6b-german-merged")

# Build a ChatML prompt (Qwen3's chat format) by hand; alternatively use
# tokenizer.apply_chat_template() with a list of messages.
prompt = """<|im_start|>system
Du bist ein hilfreicher, präziser Assistent. Beantworte alle Fragen auf Deutsch.<|im_end|>
<|im_start|>user
Was ist der Unterschied zwischen Aktien und Anleihen?<|im_end|>
<|im_start|>assistant
"""
# System prompt: "You are a helpful, precise assistant. Answer all questions in German."
# User question: "What is the difference between stocks and bonds?"

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=300, temperature=0.7, do_sample=True)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```
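The `temperature=0.7` argument above rescales the model's next-token logits before sampling. A minimal standalone illustration of the effect, using toy logits rather than real model outputs:

```python
import math

def softmax_with_temperature(logits, temperature):
    """Divide logits by temperature before softmax; T < 1 sharpens, T > 1 flattens."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)                              # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.1]                         # toy next-token logits
p_sharp = softmax_with_temperature(logits, 0.7)  # as in generate() above
p_flat = softmax_with_temperature(logits, 1.5)

# Lower temperature concentrates probability mass on the top token.
print(max(p_sharp) > max(p_flat))  # True
```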

Limitations

  • 0.6B parameters - will hallucinate on niche topics; not suitable for high-stakes applications
  • German only - multilingual performance is not guaranteed
  • Context length - trained at 2048 tokens; performance degrades on longer inputs
  • Base model opacity - Qwen3's pretraining data is undisclosed; benchmark contamination cannot be fully ruled out

Citation

If you use this model, please also cite the LLäMmlein paper whose datasets and evaluation setup this work builds on:

@misc{pfister2025llammleintransparentcompactcompetitive,
      title={LL\"aMmlein: Transparent, Compact and Competitive German-Only Language Models from Scratch}, 
      author={Jan Pfister and Julia Wunderle and Andreas Hotho},
      year={2025},
      eprint={2411.11171},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2411.11171}, 
}

Generated April 01, 2026 · Trained by philipp-zettl
