aynetdia committed · verified
Commit 3750367 · 1 Parent(s): 4eb00d1

Update README.md

Files changed (1):
  1. README.md +79 -3
README.md CHANGED
@@ -1,3 +1,79 @@
- ---
- license: apache-2.0
- ---
+ ---
+ license: apache-2.0
+ language:
+ - de
+ pipeline_tag: text-generation
+ library_name: transformers
+ ---
+ # Boldt-1B
+
+ <img src="logo.png" width="500">
+
+ **Boldt** is a series of German Small Language Models (SLMs) trained from scratch. Our initial release includes the following models:
+
+ - [Boldt-DC-350M](https://huggingface.co/Boldt/Boldt-DC-350M)
+ - [Boldt-DC-1B](https://huggingface.co/Boldt/Boldt-DC-1B)
+ - **Boldt-1B** *(this model)*
+ - [Boldt-1B-IT-Preview](https://huggingface.co/Boldt/Boldt-1B-IT-Preview)
+
+ Boldt models were trained on the German ***Dense-Core*** subset of [FineWeb-2](https://huggingface.co/datasets/HuggingFaceFW/fineweb-2), obtained using a combination of three hierarchical text quality filters (a sketch of the cascade follows the list):
+
+ - Coherence: removes structurally fragmented or incoherent documents
+ - Information Value: retains only content-rich and fact-bearing documents
+ - Educational Quality: selects for pedagogical clarity and explanatory depth
+
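+ The sketch below shows the shape of such a hierarchical cascade. The three `passes_*` functions are placeholder heuristics standing in for the actual learned quality classifiers, which are not released here, and the FineWeb-2 config name should be checked against the dataset card.
+
+ ```python
+ from datasets import load_dataset
+
+ # Stream the German portion of FineWeb-2 (config name assumed per the dataset card)
+ fineweb_de = load_dataset(
+     "HuggingFaceFW/fineweb-2", name="deu_Latn", split="train", streaming=True
+ )
+
+ def passes_coherence(text: str) -> bool:
+     return len(text.split()) > 50  # placeholder for a learned coherence classifier
+
+ def passes_information_value(text: str) -> bool:
+     return any(ch.isdigit() for ch in text)  # placeholder for a fact-density classifier
+
+ def passes_educational_quality(text: str) -> bool:
+     return ":" in text  # placeholder for an educational-quality classifier
+
+ def is_dense_core(doc: dict) -> bool:
+     # Hierarchical: each filter only sees documents that survived the previous one
+     text = doc["text"]
+     return (
+         passes_coherence(text)
+         and passes_information_value(text)
+         and passes_educational_quality(text)
+     )
+
+ dense_core = (doc for doc in fineweb_de if is_dense_core(doc))
+ ```
+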
+ As a result, instead of single-pass pre-training on a large web corpus, Boldt models were trained for multiple epochs on a small, high-quality subset of a web corpus. We find that repeated training on high-quality subsets outperforms single-pass training on larger, less diverse corpora. For more details regarding the origin of this model and the research behind it, please refer to our [preprint](https://arxiv.org/abs/2604.28075)!
+
+ **Boldt-1B** extends the training recipe of [Boldt-DC-1B](https://huggingface.co/Boldt/Boldt-DC-1B) by additionally including 6B tokens of high-quality German news articles, crawled continuously since 2022 using the [Fundus](https://github.com/flairnlp/fundus) library. It is also trained with a larger context window of 4096 tokens, making it better suited for longer text sequences than the DC variants.
+
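+ For illustration, crawling German news with Fundus follows the library's quickstart pattern; our exact publisher selection and post-processing are not shown here.
+
+ ```python
+ from fundus import Crawler, PublisherCollection
+
+ # Crawl articles from the German publishers supported by Fundus
+ crawler = Crawler(PublisherCollection.de)
+
+ for article in crawler.crawl(max_articles=3):
+     print(article)  # parsed article with title, body, and metadata
+ ```
+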
+ ## Usage
+
+ **Note:** This is a base language model, not an instruction-tuned model. It is not optimized for chat or instruction following. For best results, use standard text completion rather than chat templates.
+
+ ```python
+ from transformers import AutoTokenizer, AutoModelForCausalLM
+
+ model_name = "Boldt/Boldt-1B"
+ tokenizer = AutoTokenizer.from_pretrained(model_name)
+ model = AutoModelForCausalLM.from_pretrained(model_name)
+
+ # Basic text completion
+ text = "Berlin ist eine Stadt, wo"
+ inputs = tokenizer(text, return_tensors="pt")
+ outputs = model.generate(**inputs, max_new_tokens=64)
+ print(tokenizer.decode(outputs[0], skip_special_tokens=True))
+ ```
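+
+ By default `generate` decodes greedily (unless the checkpoint ships a generation config that says otherwise); for more varied completions, pass sampling arguments such as `do_sample=True` and `temperature`.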
+
+ ## Evaluation
+
+ We evaluate Boldt-1B on our [modernized German benchmark suite](https://huggingface.co/collections/Boldt/german-llm-benchmarks). It comprises the German subset of [Global MMLU](https://huggingface.co/datasets/CohereLabs/Global-MMLU) and updated translations of widely used English benchmarks, produced using [Tower+ 72B](https://huggingface.co/Unbabel/Tower-Plus-72B) to address issues we identified in existing German benchmark translations.
+ Despite being trained on a substantially smaller amount of data, Boldt-1B outperforms other similarly sized SLMs capable of German on our evaluation suite. It also performs competitively with larger (around 2B parameter) multilingual models.
+
+ | Category | Model | Training tokens | MMLU | ARC-C | ARC-E | H-Swag | LAMBADA | OBQA | Avg. |
+ |----------|-------|-----------------|------|-------|-------|--------|---------|------|------|
+ | Ours | [Boldt-DC-350M](https://huggingface.co/Boldt/Boldt-DC-350M) | 200B | 29.29 | 32.24 | 52.87 | 43.21 | 37.48 | 45.86 | 40.16 |
+ | | [Boldt-DC-1B](https://huggingface.co/Boldt/Boldt-DC-1B) | 200B | 31.06 | **35.99** | **57.30** | *48.69* | 42.80 | *48.48* | 44.05 |
+ | | **Boldt-1B** | 230B | **31.42** | *34.11* | *55.78* | **48.77** | *44.70* | **52.32** | **44.52** |
+ | Reference models - 1B | [LLäMmlein-1B](https://huggingface.co/LSX-UniWue/LLaMmlein_1B) | 1T | 29.26 | 30.27 | 48.19 | 44.80 | **44.89** | 47.27 | 40.78 |
+ | | [Gemma-3-1B](https://huggingface.co/google/gemma-3-1b-pt) | 2T* | 30.01 | 30.55 | 47.89 | 43.43 | 41.71 | 45.05 | 39.77 |
+ | | [Llama-3.2-1B](https://huggingface.co/meta-llama/Llama-3.2-1B) | 9T* | 28.58 | 29.90 | 40.51 | 40.07 | 44.31 | 44.04 | 37.90 |
+ ||
+ | Reference models - >1B | [EuroLLM-1.7B](https://huggingface.co/utter-project/EuroLLM-1.7B) | 4T* | 31.04 | 31.58 | 54.68 | 45.30 | 44.52 | 50.50 | 42.94 |
+ | | [Qwen3-1.7B-Base](https://huggingface.co/Qwen/Qwen3-1.7B-Base) | 36T* | 34.17 | 37.49 | 57.00 | 45.20 | 49.81 | 45.66 | 44.89 |
+ | | [BübleLM-2B](https://huggingface.co/flair/bueble-lm-2b) | 2T* | 29.68 | 32.62 | 53.63 | 46.57 | 43.55 | 49.70 | 42.63 |
+ | | [Gemma-2-2B](https://huggingface.co/google/gemma-2-2b) | 2T* | 33.99 | 37.11 | 57.47 | 49.62 | 52.64 | 48.89 | 46.62 |
+
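+ As a rough illustration of how such an evaluation can be run, the snippet below uses EleutherAI's lm-evaluation-harness; the task name is a placeholder and would need to be replaced with the tasks from the benchmark suite linked above.
+
+ ```python
+ import lm_eval
+
+ # Hypothetical invocation; "global_mmlu_de" is illustrative, not our exact task list
+ results = lm_eval.simple_evaluate(
+     model="hf",
+     model_args="pretrained=Boldt/Boldt-1B",
+     tasks=["global_mmlu_de"],
+     batch_size=8,
+ )
+ print(results["results"])
+ ```
+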
+ ## Safety & Ethics
+
+ We have not conducted systematic evaluations of this model for toxicity, demographic biases, or harmful stereotypes. Quality filtering may reduce some risks relative to unfiltered web data, but it cannot guarantee their absence, and repeated exposure during multi-epoch training could amplify rather than mitigate encoded biases. Users should exercise caution in sensitive use cases without further evaluation.
+
+ ## Citation
+
+ ```bibtex
+ @misc{boldt2026,
+   title={Repetition over Diversity: High-Signal Data Filtering for Sample-Efficient German Language Modeling},
+   author={Ansar Aynetdinov and Patrick Haller and Alan Akbik},
+   year={2026},
+   eprint={2604.28075},
+   archivePrefix={arXiv},
+   primaryClass={cs.CL},
+   url={https://arxiv.org/abs/2604.28075},
+ }
+ ```