Update README.md

README.md

---
language:
- de
pipeline_tag: text-generation
library_name: transformers
tags:
- text-generation
- nlp
- custom_code
- german
---

# Boldt-DC-1B

<img src="logo.png" width="500" alt="Boldt Logo">

**Boldt** is a series of German Small Language Models (SLMs) trained from scratch. Our initial release includes four models:

- [Boldt-DC-350M](https://huggingface.co/Boldt/Boldt-DC-350M)
- **Boldt-DC-1B** *(this model)*
- [Boldt-1B](https://huggingface.co/Boldt/Boldt-1B)
- [Boldt-1B-IT-Preview](https://huggingface.co/Boldt/Boldt-1B-IT-Preview)

### Repetition over Diversity

The training philosophy behind **Boldt** is centered on a key finding from our research: **repetition over diversity**.

Standard pre-training paradigms typically balance quality filtering against the need for massive token volume and broad corpus diversity. In contrast, Boldt models are trained for multiple epochs on a highly filtered dataset: the German ***Dense-Core*** subset of [FineWeb-2](https://huggingface.co/datasets/HuggingFaceFW/fineweb-2). We isolated this subset using a combination of three hierarchical filters:

- **Coherence:** Eliminates structurally fragmented or incoherent documents.
- **Information Value:** Isolates content-rich and fact-bearing texts.
- **Educational Quality:** Selects strictly for pedagogical clarity and deep explanations.

We demonstrate that repeated exposure to this strict, high-quality subset is more sample-efficient than a single pass over less filtered and more diverse corpora. For a comprehensive look at our experiments, please refer to our preprint: [*Repetition over Diversity*](https://arxiv.org/abs/2604.28075).
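
To make the filtering step concrete, here is a minimal sketch of how such a hierarchical filter cascade can be applied to a corpus. The scoring heuristics, thresholds, and example documents are illustrative placeholders only; they are not the classifiers or cutoffs used to build the Dense-Core subset.

```python
# Toy sketch of a three-stage hierarchical filter cascade in the spirit of the
# Coherence -> Information Value -> Educational Quality filters described above.
# The heuristics and thresholds are illustrative placeholders, NOT the actual
# classifiers or cutoffs used to build Dense-Core.

def coherence_score(text: str) -> float:
    # Toy proxy: fraction of non-empty lines that end with terminal punctuation.
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    if not lines:
        return 0.0
    return sum(line.endswith((".", "!", "?")) for line in lines) / len(lines)

def information_value_score(text: str) -> float:
    # Toy proxy: lexical diversity (unique words / total words).
    words = text.split()
    return len(set(words)) / len(words) if words else 0.0

def educational_quality_score(text: str) -> float:
    # Toy proxy: presence of explanatory connectives (German examples).
    markers = ("zum beispiel", "das heißt", "daher", "deshalb", "also")
    return sum(marker in text.lower() for marker in markers) / len(markers)

# Hierarchical cascade: a document must clear every stage to be kept.
FILTER_CASCADE = [
    (coherence_score, 0.8),
    (information_value_score, 0.5),
    (educational_quality_score, 0.2),
]

def keep_document(text: str) -> bool:
    return all(score(text) >= threshold for score, threshold in FILTER_CASCADE)

corpus = [
    "Die Photosynthese wandelt Lichtenergie in chemische Energie um. "
    "Das heißt, Pflanzen erzeugen daher Zucker aus Kohlendioxid und Wasser.",
    "jetzt klicken jetzt kaufen jetzt klicken jetzt kaufen",
]
dense_core = [doc for doc in corpus if keep_document(doc)]
print(f"kept {len(dense_core)} of {len(corpus)} documents")
```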

**Boldt-DC-1B** represents the highly optimized 1-billion-parameter foundation of this methodology, trained over multiple epochs on 200B tokens of our heavily filtered, high-signal dataset.

## Model Architecture

- **Parameters:** ~1 Billion
- **Context Window:** 2048 tokens
- **Training Data:** German Dense-Core subset of FineWeb-2 (200B tokens)
- **Language:** German
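
The figures above can be sanity-checked directly from the released checkpoint. The snippet below is a small sketch under two assumptions: the repo id `Boldt/Boldt-DC-1B` from the model list above, and that the `custom_code` tag in the metadata means `trust_remote_code=True` may be needed when loading.

```python
from transformers import AutoConfig, AutoModelForCausalLM

# Sketch only: repo id taken from the model list above; trust_remote_code=True is an
# assumption based on the card's `custom_code` tag, not something the card documents.
model_name = "Boldt/Boldt-DC-1B"

config = AutoConfig.from_pretrained(model_name, trust_remote_code=True)
# The attribute name for the context window depends on the architecture; many configs
# expose it as max_position_embeddings.
print(getattr(config, "max_position_embeddings", None))  # expected: 2048

model = AutoModelForCausalLM.from_pretrained(model_name, trust_remote_code=True)
n_params = sum(p.numel() for p in model.parameters())
print(f"{n_params / 1e9:.2f}B parameters")  # expected: roughly 1B
```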

## Usage

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

model_name = "Boldt/Boldt-1B"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

prompt = "Die Hauptstadt von Deutschland ist"  # illustrative prompt
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
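
For quick experiments, the same checkpoint can also be driven through the high-level `pipeline` API. This is a minimal sketch rather than an officially documented snippet; the prompt and sampling settings are illustrative.

```python
from transformers import pipeline

# Minimal sketch: prompt and generation settings are illustrative, not from the card.
generator = pipeline("text-generation", model="Boldt/Boldt-1B")
output = generator("Berlin ist bekannt für", max_new_tokens=32, do_sample=True, top_p=0.9)
print(output[0]["generated_text"])
```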

## Evaluation



We evaluate Boldt-1B on our [modernized German benchmark suite](https://huggingface.co/collections/Boldt/german-llm-benchmarks). See our paper [(Aynetdinov et al., 2026)](https://arxiv.org/abs/2604.28075) for details on the structural and translation corrections we performed.

Despite being trained on substantially fewer tokens, the Boldt-1B family outperforms other 1B-class models on German tasks and performs competitively with much larger multilingual models.
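
The benchmarks below are multiple-choice tasks, which are commonly scored zero-shot by comparing the log-likelihood the model assigns to each answer option. The snippet is a generic sketch of that scoring scheme, not the exact harness behind the reported numbers; the question, options, and repo id are illustrative.

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

# Generic zero-shot multiple-choice scoring sketch (per-option log-likelihood).
# NOT the exact evaluation harness used for the tables below.
model_name = "Boldt/Boldt-1B"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

question = "Frage: Welches Gas nehmen Pflanzen bei der Photosynthese auf?\nAntwort:"
options = [" Sauerstoff", " Kohlendioxid", " Stickstoff", " Wasserstoff"]

def option_logprob(context: str, option: str) -> float:
    ctx_len = tokenizer(context, return_tensors="pt").input_ids.shape[1]
    full_ids = tokenizer(context + option, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(full_ids).logits
    # Log-probabilities of each next token, conditioned on everything before it.
    log_probs = torch.log_softmax(logits[:, :-1], dim=-1)
    token_lp = log_probs.gather(-1, full_ids[:, 1:].unsqueeze(-1)).squeeze(-1)
    # Sum only over the tokens that belong to the answer option.
    option_len = full_ids.shape[1] - ctx_len
    return token_lp[0, -option_len:].sum().item()

scores = [option_logprob(question, option) for option in options]
print(options[scores.index(max(scores))])
```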

### 1B Weight Class (Direct Comparison)

*Note: Bold text indicates the best score in the 1B category.*

| Model | Tokens | MMLU | ARC-C | ARC-E | H-Swag | LAMBADA | OBQA | Avg. |
| :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- |
| [Boldt-DC-350M](https://huggingface.co/Boldt/Boldt-DC-350M) | 200B | 29.29 | 32.24 | 52.87 | 43.21 | 37.48 | 45.86 | 40.16 |
| [Boldt-DC-1B](https://huggingface.co/Boldt/Boldt-DC-1B) | 200B | 31.06 | **35.99** | **57.30** | 48.69 | 42.80 | 48.48 | 44.05 |
| **Boldt-1B (this model)** | 230B | **31.42** | 34.11 | 55.78 | **48.77** | 44.70 | **52.32** | **44.52** |
| [LLäMmlein-1B](https://huggingface.co/LSX-UniWue/LLaMmlein_1B) | 1T | 29.26 | 30.27 | 48.19 | 44.80 | **44.89** | 47.27 | 40.78 |
| [Gemma-3-1B](https://huggingface.co/google/gemma-3-1b-pt) | 2T* | 30.01 | 30.55 | 47.89 | 43.43 | 41.71 | 45.05 | 39.77 |
| [Llama-3.2-1B](https://huggingface.co/meta-llama/Llama-3.2-1B) | 9T* | 28.58 | 29.90 | 40.51 | 40.07 | 44.31 | 44.04 | 37.90 |

### 1.7B - 2B Weight Class (Larger Reference Models)

| Model | Tokens | MMLU | ARC-C | ARC-E | H-Swag | LAMBADA | OBQA | Avg. |
| :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- |
| [EuroLLM-1.7B](https://huggingface.co/utter-project/EuroLLM-1.7B) | 4T* | 31.04 | 31.58 | 54.68 | 45.30 | 44.52 | **50.50** | 42.94 |
| [Qwen3-1.7B-Base](https://huggingface.co/Qwen/Qwen3-1.7B-Base) | 36T* | **34.17** | **37.49** | 57.00 | 45.20 | 49.81 | 45.66 | 44.89 |
| [BübleLM-2B](https://huggingface.co/flair/bueble-lm-2b) | 2T* | 29.68 | 32.62 | 53.63 | 46.57 | 43.55 | 49.70 | 42.63 |
| [Gemma-2-2B](https://huggingface.co/google/gemma-2-2b) | 2T* | 33.99 | 37.11 | **57.47** | **49.62** | **52.64** | 48.89 | **46.62** |

## Safety & Ethics

We have not conducted systematic model evaluations of toxicity or demographic bias.
|