# qwen3-0.6b-german
A German instruction-following model fine-tuned from Qwen3-0.6B using QLoRA on the same four German instruct datasets used in the LLäMmlein paper (Pfister et al., ACL 2025).
Trained on a single RTX 4070 Ti (8GB VRAM) in roughly 40 hours; no cloud compute required.
For a merged version (base model + QLoRA adapter), see `philipp-zettl/qwen3-0.6b-german-merged`.
## Results
Evaluated using lm-evaluation-harness on the same German tasks reported in the LLäMmlein paper. All numbers below come from our own evaluation runs rather than being copied from the paper, ensuring a fair, version-consistent comparison.
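A reproduction run might look like the sketch below. The task names are illustrative assumptions, not taken from this card; German task names differ between harness versions, so confirm them with `lm_eval --tasks list` first.

```shell
pip install lm-eval

# Task names below are placeholders; verify against `lm_eval --tasks list`.
lm_eval \
  --model hf \
  --model_args pretrained=philipp-zettl/qwen3-0.6b-german-merged,load_in_4bit=True \
  --tasks hellaswag_de,arc_de \
  --batch_size 8
```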
### Main benchmarks
| Task | Qwen3-0.6B (base) | qwen3-0.6b-german-merged (ours) | LLäMmlein 1B | LLäMmlein 1B Instruct |
|---|---|---|---|---|
| HellaSwag-DE | 0.3111 | **0.3193** | 0.4366 | 0.4492 |
| ARC-DE | 0.2352 | **0.2575** | 0.2729 | 0.2635 |
| MMLU-DE (avg) | **0.3600** | 0.2475 | 0.2350 | 0.2400 |
Bold = best among our models (base vs. fine-tuned). All scores are accuracy / normalized accuracy.
### vs. Base model (Δ)
| Task | Base | Fine-tuned | Δ |
|---|---|---|---|
| HellaSwag-DE | 0.3111 | 0.3193 | ✅ +0.0082 |
| ARC-DE | 0.2352 | 0.2575 | ✅ +0.0223 |
| MMLU-DE (avg) | 0.3600 | 0.2475 | 🔻 -0.1125 |
| MMLU-DE Business | 0.3276 | 0.1897 | 🔻 -0.1379 |
| MMLU-DE Humanities | 0.3627 | 0.2745 | 🔻 -0.0882 |
| MMLU-DE Medical | 0.3333 | 0.3056 | 🔻 -0.0278 |
| MMLU-DE Other | 0.4643 | 0.3929 | 🔻 -0.0714 |
| MMLU-DE Social Sciences | 0.3529 | 0.1569 | 🔻 -0.1961 |
| MMLU-DE STEM | 0.3043 | 0.2391 | 🔻 -0.0652 |
Fine-tuned model wins on 2/3 main tasks.
Average delta across main tasks: -0.0273, driven entirely by the MMLU-DE regression.
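The per-task deltas and their average can be recomputed directly from the scores above as a quick sanity check:

```python
# Scores copied from the tables above (base vs. fine-tuned merged model).
base = {"HellaSwag-DE": 0.3111, "ARC-DE": 0.2352, "MMLU-DE (avg)": 0.3600}
tuned = {"HellaSwag-DE": 0.3193, "ARC-DE": 0.2575, "MMLU-DE (avg)": 0.2475}

# Per-task delta, rounded to four decimals as in the table.
deltas = {task: round(tuned[task] - base[task], 4) for task in base}

# Unweighted mean of the raw deltas across the three main tasks.
avg_delta = round(sum(tuned[t] - base[t] for t in base) / len(base), 4)

print(deltas)     # per-task deltas
print(avg_delta)  # -0.0273
```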
### MMLU-DE breakdown
| Task | Qwen3-0.6B (base) | qwen3-0.6b-german-merged (ours) | LLäMmlein 1B | LLäMmlein 1B Instruct |
|---|---|---|---|---|
| MMLU-DE Business | 0.3276 | 0.1897 | 0.2069 | 0.3103 |
| MMLU-DE Humanities | 0.3627 | 0.2745 | 0.2745 | 0.2255 |
| MMLU-DE Medical | 0.3333 | 0.3056 | 0.3611 | 0.3056 |
| MMLU-DE Other | 0.4643 | 0.3929 | 0.2857 | 0.2143 |
| MMLU-DE Social Sciences | 0.3529 | 0.1569 | 0.1863 | 0.2157 |
| MMLU-DE STEM | 0.3043 | 0.2391 | 0.1304 | 0.2174 |
### Key findings
- HellaSwag-DE and ARC-DE consistently improve over the base model across all checkpoints, indicating that instruction fine-tuning on German data helps commonsense reasoning.
- MMLU-DE shows a clear alignment tax: the final checkpoint loses accuracy in every category (-0.1125 on average), consistent with known instruct fine-tuning dynamics in which narrow instruction data trades broad knowledge recall for instruction-following ability.
- Competitive with LLäMmlein 1B on knowledge tasks despite having 40% fewer parameters, demonstrating that Qwen3's pretraining provides a strong German foundation.
## Training details
| Property | Value |
|---|---|
| Base model | Qwen3-0.6B |
| Method | QLoRA (4-bit, rank 64, RSLoRA) |
| Training steps | 40,000 |
| Effective batch size | 16 |
| Learning rate | 2e-4 (cosine schedule) |
| Context length | 2048 tokens |
| Hardware | RTX 4070 Ti 8GB |
| Framework | Unsloth 2026.2.1 |
| Code | https://github.com/philsupertramp/gwen |
LoRA target modules: `q_proj`, `k_proj`, `v_proj`, `o_proj`, `gate_proj`, `up_proj`, `down_proj`
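In `peft` terms, the adapter setup in the table corresponds roughly to the configuration sketch below. The `lora_alpha` value is an assumption, since it is not stated in the training details.

```python
from peft import LoraConfig

# Sketch of the adapter configuration described above.
lora_config = LoraConfig(
    r=64,             # LoRA rank from the table
    use_rslora=True,  # rank-stabilized LoRA scaling
    lora_alpha=64,    # assumption: alpha is not documented above
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
    task_type="CAUSAL_LM",
)
```

Training itself used Unsloth (see the table), which wraps an equivalent configuration via `FastLanguageModel.get_peft_model`.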
## Training data
Same four datasets used in the LLäMmlein paper:
| Dataset | Type |
|---|---|
| FreedomIntelligence/alpaca-gpt4-deutsch | Alpaca-style |
| FreedomIntelligence/evol-instruct-deutsch | Evolved instructions |
| FreedomIntelligence/sharegpt-deutsch | Multi-turn conversations |
| LSX-UniWue/Guanako | German instruct |
## Usage

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# 4-bit loading requires the bitsandbytes package; newer transformers versions
# prefer quantization_config=BitsAndBytesConfig(load_in_4bit=True).
model = AutoModelForCausalLM.from_pretrained(
    "philipp-zettl/qwen3-0.6b-german-merged",
    load_in_4bit=True,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("philipp-zettl/qwen3-0.6b-german-merged")

# ChatML-style prompt. System: "You are a helpful, precise assistant. Answer
# all questions in German." User: "What is the difference between stocks and bonds?"
prompt = """<|im_start|>system
Du bist ein hilfreicher, präziser Assistent. Beantworte alle Fragen auf Deutsch.<|im_end|>
<|im_start|>user
Was ist der Unterschied zwischen Aktien und Anleihen?<|im_end|>
<|im_start|>assistant
"""

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=300, temperature=0.7, do_sample=True)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```
## Limitations

- **0.6B parameters**: will hallucinate on niche topics; not suitable for high-stakes applications.
- **German only**: multilingual performance is not guaranteed.
- **Context length**: trained at 2048 tokens; performance degrades on longer inputs.
- **Base model opacity**: Qwen3's pretraining data is undisclosed, so benchmark contamination cannot be fully ruled out.
## Citation
If you use this model, please also cite the LLäMmlein paper whose datasets and evaluation setup this work builds on:
```bibtex
@misc{pfister2025llammleintransparentcompactcompetitive,
  title={LL\"aMmlein: Transparent, Compact and Competitive German-Only Language Models from Scratch},
  author={Jan Pfister and Julia Wunderle and Andreas Hotho},
  year={2025},
  eprint={2411.11171},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2411.11171},
}
```
Generated April 01, 2026 · Trained by philipp-zettl