---
license: apache-2.0
language:
- de
pipeline_tag: text-generation
library_name: transformers
---
# Boldt-DC-1B

<img src="logo.png" width="500">

**Boldt** is a series of German Small Language Models (SLMs) trained from scratch. Our initial release includes the following models:

- [Boldt-DC-350M](https://huggingface.co/Boldt/Boldt-DC-350M)
- **Boldt-DC-1B** *(this model)*
- [Boldt-1B](https://huggingface.co/Boldt/Boldt-1B)
- [Boldt-1B-IT-Preview](https://huggingface.co/Boldt/Boldt-1B-IT-Preview)

Boldt models were trained on the German ***Dense-Core*** subset of [FineWeb-2](https://huggingface.co/datasets/HuggingFaceFW/fineweb-2), obtained using a combination of three hierarchical text quality filters:

- Coherence: removes structurally fragmented or incoherent documents
- Information Value: retains only content-rich and fact-bearing documents
- Educational Quality: selects for pedagogical clarity and explanatory depth

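The hierarchical filtering above can be pictured as a sequence of threshold gates: a document enters the final subset only if it passes every stage in order. The sketch below illustrates that structure only — the scoring functions and thresholds are hypothetical stand-ins, not the actual classifiers used to build the Dense-Core subset.

```python
# Illustrative sketch of hierarchical quality filtering.
# All scoring heuristics and thresholds here are hypothetical stand-ins.

def coherence_score(doc: str) -> float:
    # Stand-in heuristic: penalize very short, fragmented documents.
    sentences = [s for s in doc.split(".") if s.strip()]
    return min(1.0, len(sentences) / 5)

def information_value_score(doc: str) -> float:
    # Stand-in heuristic: reward lexical diversity.
    words = doc.split()
    return len(set(words)) / max(1, len(words))

def educational_quality_score(doc: str) -> float:
    # Stand-in heuristic: reward explanatory connectives.
    markers = ("weil", "daher", "zum beispiel", "das heißt")
    low = doc.lower()
    return min(1.0, sum(low.count(m) for m in markers) / 2)

def dense_core_filter(docs, thresholds=(0.5, 0.3, 0.5)):
    """Keep only documents that pass all three filters, applied in order."""
    stages = (coherence_score, information_value_score, educational_quality_score)
    kept = docs
    for score, threshold in zip(stages, thresholds):
        kept = [d for d in kept if score(d) >= threshold]
    return kept
```

Because the filters are applied hierarchically, each stage only ever sees documents that survived the previous one, which keeps the more expensive filters cheap to run.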
As a result, instead of single-pass pre-training on a large web corpus, Boldt models were trained for multiple epochs on a small, high-quality subset of a web corpus. We find that repeated training on high-quality subsets outperforms single-pass training on larger, less diverse corpora. For more details on the origin of this model and the research behind it, please refer to our [preprint](https://arxiv.org/abs/2604.28075)!

## Usage

**Note:** This is a base language model, not an instruction-tuned model. It is not optimized for chat or instruction following. For best results, use standard text completion rather than chat templates.

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

model_name = "Boldt/Boldt-DC-1B"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Basic text completion
text = "Berlin ist eine Stadt, wo"
inputs = tokenizer(text, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

## Evaluation

We evaluate Boldt-DC-1B on our [modernized German benchmark suite](https://huggingface.co/collections/Boldt/german-llm-benchmarks). It comprises the German subset of [Global MMLU](https://huggingface.co/datasets/CohereLabs/Global-MMLU) and updated translations of widely used English benchmarks, produced using [Tower+ 72B](https://huggingface.co/Unbabel/Tower-Plus-72B) to address issues we identified in existing German benchmark translations.

Despite being trained on substantially less data, Boldt-DC-1B outperforms other similarly sized SLMs capable of German on our evaluation suite. It also performs competitively with larger (around 2B) multilingual models.

| Category | Model | Tokens | MMLU | ARC-C | ARC-E | H-Swag | LAMBADA | OBQA | Avg. |
|----------|-------|--------|------|-------|-------|--------|---------|------|------|
| Ours | [Boldt-DC-350M](https://huggingface.co/Boldt/Boldt-DC-350M) | 200B | 29.29 | 32.24 | 52.87 | 43.21 | 37.48 | 45.86 | 40.16 |
| | **Boldt-DC-1B** | 200B | 31.06 | **35.99** | **57.30** | *48.69* | 42.80 | *48.48* | 44.05 |
| | [Boldt-1B](https://huggingface.co/Boldt/Boldt-1B) | 230B | **31.42** | *34.11* | *55.78* | **48.77** | *44.70* | **52.32** | **44.52** |
| Reference models - 1B | [LLäMmlein-1B](https://huggingface.co/LSX-UniWue/LLaMmlein_1B) | 1T | 29.26 | 30.27 | 48.19 | 44.80 | **44.89** | 47.27 | 40.78 |
| | [Gemma-3-1B](https://huggingface.co/google/gemma-3-1b-pt) | 2T* | 30.01 | 30.55 | 47.89 | 43.43 | 41.71 | 45.05 | 39.77 |
| | [Llama-3.2-1B](https://huggingface.co/meta-llama/Llama-3.2-1B) | 9T* | 28.58 | 29.90 | 40.51 | 40.07 | 44.31 | 44.04 | 37.90 |
| Reference models - >1B | [EuroLLM-1.7B](https://huggingface.co/utter-project/EuroLLM-1.7B) | 4T* | 31.04 | 31.58 | 54.68 | 45.30 | 44.52 | 50.50 | 42.94 |
| | [Qwen3-1.7B-Base](https://huggingface.co/Qwen/Qwen3-1.7B-Base) | 36T* | 34.17 | 37.49 | 57.00 | 45.20 | 49.81 | 45.66 | 44.89 |
| | [BübleLM-2B](https://huggingface.co/flair/bueble-lm-2b) | 2T* | 29.68 | 32.62 | 53.63 | 46.57 | 43.55 | 49.70 | 42.63 |
| | [Gemma-2-2B](https://huggingface.co/google/gemma-2-2b) | 2T* | 33.99 | 37.11 | 57.47 | 49.62 | 52.64 | 48.89 | 46.62 |

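The Avg. column is consistent with an unweighted mean of the six benchmark scores, as this quick check for the Boldt-DC-1B row shows:

```python
# Unweighted mean of the six benchmark scores for Boldt-DC-1B
# (MMLU, ARC-C, ARC-E, H-Swag, LAMBADA, OBQA).
scores = [31.06, 35.99, 57.30, 48.69, 42.80, 48.48]
avg = round(sum(scores) / len(scores), 2)
print(avg)  # 44.05, matching the Avg. column
```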
## Safety & Ethics

We have not conducted systematic model evaluations of toxicity, demographic biases, or harmful stereotypes. Quality filtering may reduce some risks relative to unfiltered web data, but cannot guarantee their absence, and repeated exposure during multi-epoch training could amplify rather than mitigate encoded biases. Users should exercise caution in sensitive use cases without further evaluation.

## Citation

```bibtex
@misc{boldt2026,
      title={Repetition over Diversity: High-Signal Data Filtering for Sample-Efficient German Language Modeling},
      author={Ansar Aynetdinov and Patrick Haller and Alan Akbik},
      year={2026},
      eprint={2604.28075},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2604.28075},
}
```