# qwen3-0.6b-german
A German instruction-following model fine-tuned from Qwen3-0.6B using QLoRA on the same four German instruct datasets used in the LLäMmlein paper (Pfister et al., ACL 2025).
Trained on a single RTX 4070 Ti (8GB VRAM) in roughly 40 hours; no cloud compute required.
For a merged version (base model + QLoRA adapter), see `philipp-zettl/qwen3-0.6b-german-merged`.
## Results
Evaluated using lm-evaluation-harness on the same German tasks reported in the LLäMmlein paper. All numbers below come from our own evaluation runs rather than being copied from the paper, ensuring a fair, version-consistent comparison.
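A reproduction run might look like the sketch below. The task names are illustrative assumptions, not taken from this card; German task names differ between harness versions, so confirm them with `lm_eval --tasks list` first.

```shell
pip install lm-eval

# Task names below are placeholders; verify against `lm_eval --tasks list`.
lm_eval \
  --model hf \
  --model_args pretrained=philipp-zettl/qwen3-0.6b-german-merged,load_in_4bit=True \
  --tasks hellaswag_de,arc_de \
  --batch_size 8
```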
### Main benchmarks
| Task | Qwen3-0.6B (base) | qwen3-0.6b-german-merged (ours) | LLäMmlein 1B | LLäMmlein 1B Instruct |
|---|---|---|---|---|
| HellaSwag-DE | 0.3111 | **0.3193** | 0.4366 | 0.4492 |
| ARC-DE | 0.2352 | **0.2575** | 0.2729 | 0.2635 |
| MMLU-DE (avg) | **0.3600** | 0.2475 | 0.2350 | 0.2400 |
Bold = best among our models (base vs. fine-tuned). All scores are accuracy / normalized accuracy.
### vs. Base model (Δ)
| Task | Base | Fine-tuned | Δ |
|---|---|---|---|
| HellaSwag-DE | 0.3111 | 0.3193 | ✅ +0.0082 |
| ARC-DE | 0.2352 | 0.2575 | ✅ +0.0223 |
| MMLU-DE (avg) | 0.3600 | 0.2475 | 🔻 -0.1125 |
| MMLU-DE Business | 0.3276 | 0.1897 | 🔻 -0.1379 |
| MMLU-DE Humanities | 0.3627 | 0.2745 | 🔻 -0.0882 |
| MMLU-DE Medical | 0.3333 | 0.3056 | 🔻 -0.0278 |
| MMLU-DE Other | 0.4643 | 0.3929 | 🔻 -0.0714 |
| MMLU-DE Social Sciences | 0.3529 | 0.1569 | 🔻 -0.1961 |
| MMLU-DE STEM | 0.3043 | 0.2391 | 🔻 -0.0652 |
Fine-tuned model wins on 2/3 main tasks.
Average delta across main tasks: -0.0273, driven entirely by the MMLU-DE regression.
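The per-task deltas and their average can be recomputed directly from the scores above as a quick sanity check:

```python
# Scores copied from the tables above (base vs. fine-tuned merged model).
base = {"HellaSwag-DE": 0.3111, "ARC-DE": 0.2352, "MMLU-DE (avg)": 0.3600}
tuned = {"HellaSwag-DE": 0.3193, "ARC-DE": 0.2575, "MMLU-DE (avg)": 0.2475}

# Per-task delta, rounded to four decimals as in the table.
deltas = {task: round(tuned[task] - base[task], 4) for task in base}

# Unweighted mean of the raw deltas across the three main tasks.
avg_delta = round(sum(tuned[t] - base[t] for t in base) / len(base), 4)

print(deltas)     # per-task deltas
print(avg_delta)  # -0.0273
```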
### MMLU-DE breakdown
| Task | Qwen3-0.6B (base) | qwen3-0.6b-german-merged (ours) | LLäMmlein 1B | LLäMmlein 1B Instruct |
|---|---|---|---|---|
| MMLU-DE Business | 0.3276 | 0.1897 | 0.2069 | 0.3103 |
| MMLU-DE Humanities | 0.3627 | 0.2745 | 0.2745 | 0.2255 |
| MMLU-DE Medical | 0.3333 | 0.3056 | 0.3611 | 0.3056 |
| MMLU-DE Other | 0.4643 | 0.3929 | 0.2857 | 0.2143 |
| MMLU-DE Social Sciences | 0.3529 | 0.1569 | 0.1863 | 0.2157 |
| MMLU-DE STEM | 0.3043 | 0.2391 | 0.1304 | 0.2174 |
### Key findings
- HellaSwag-DE and ARC-DE consistently improve over the base model across all checkpoints, indicating that instruction fine-tuning on German data helps commonsense reasoning.
- MMLU-DE shows a clear alignment tax: the final checkpoint loses accuracy in every category (-0.1125 on average), consistent with known instruct fine-tuning dynamics in which narrow instruction data trades broad knowledge recall for instruction-following ability.
- Competitive with LLäMmlein 1B on knowledge tasks despite having 40% fewer parameters, demonstrating that Qwen3's pretraining provides a strong German foundation.
## Training details
| Property | Value |
|---|---|
| Base model | Qwen3-0.6B |
| Method | QLoRA (4-bit, rank 64, RSLoRA) |
| Training steps | 40,000 |
| Effective batch size | 16 |
| Learning rate | 2e-4 (cosine schedule) |
| Context length | 2048 tokens |
| Hardware | RTX 4070 Ti 8GB |
| Framework | Unsloth 2026.2.1 |
| Code | https://github.com/philsupertramp/gwen |
LoRA target modules: `q_proj`, `k_proj`, `v_proj`, `o_proj`, `gate_proj`, `up_proj`, `down_proj`
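In `peft` terms, the adapter setup in the table corresponds roughly to the configuration sketch below. The `lora_alpha` value is an assumption, since it is not stated in the training details.

```python
from peft import LoraConfig

# Sketch of the adapter configuration described above.
lora_config = LoraConfig(
    r=64,             # LoRA rank from the table
    use_rslora=True,  # rank-stabilized LoRA scaling
    lora_alpha=64,    # assumption: alpha is not documented above
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
    task_type="CAUSAL_LM",
)
```

Training itself used Unsloth (see the table), which wraps an equivalent configuration via `FastLanguageModel.get_peft_model`.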
## Training data
Same four datasets used in the LLäMmlein paper:
| Dataset | Type |
|---|---|
| FreedomIntelligence/alpaca-gpt4-deutsch | Alpaca-style |
| FreedomIntelligence/evol-instruct-deutsch | Evolved instructions |
| FreedomIntelligence/sharegpt-deutsch | Multi-turn conversations |
| LSX-UniWue/Guanako | German instruct |
## Usage

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# 4-bit loading requires the bitsandbytes package; newer transformers versions
# prefer quantization_config=BitsAndBytesConfig(load_in_4bit=True).
model = AutoModelForCausalLM.from_pretrained(
    "philipp-zettl/qwen3-0.6b-german-merged",
    load_in_4bit=True,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("philipp-zettl/qwen3-0.6b-german-merged")

# ChatML-style prompt. System: "You are a helpful, precise assistant. Answer
# all questions in German." User: "What is the difference between stocks and bonds?"
prompt = """<|im_start|>system
Du bist ein hilfreicher, präziser Assistent. Beantworte alle Fragen auf Deutsch.<|im_end|>
<|im_start|>user
Was ist der Unterschied zwischen Aktien und Anleihen?<|im_end|>
<|im_start|>assistant
"""

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=300, temperature=0.7, do_sample=True)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```
## Limitations

- **0.6B parameters**: will hallucinate on niche topics; not suitable for high-stakes applications.
- **German only**: multilingual performance is not guaranteed.
- **Context length**: trained at 2048 tokens; performance degrades on longer inputs.
- **Base model opacity**: Qwen3's pretraining data is undisclosed, so benchmark contamination cannot be fully ruled out.
## Citation
If you use this model, please also cite the LLäMmlein paper whose datasets and evaluation setup this work builds on:
```bibtex
@misc{pfister2025llammleintransparentcompactcompetitive,
  title={LL\"aMmlein: Transparent, Compact and Competitive German-Only Language Models from Scratch},
  author={Jan Pfister and Julia Wunderle and Andreas Hotho},
  year={2025},
  eprint={2411.11171},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2411.11171},
}
```
Generated April 01, 2026 · Trained by philipp-zettl