Update README.md
- **Boldt-1B** *(this model)*
- [Boldt-1B-IT-Preview](https://huggingface.co/Boldt/Boldt-1B-IT-Preview)

### Repetition over Diversity

The training philosophy behind **Boldt** is centered on a key finding from our research: **repetition over diversity**.

Standard pre-training paradigms typically balance quality filtering against the need for massive token volume and broad corpus diversity. In contrast, Boldt models are trained for multiple epochs on a highly filtered dataset: the German ***Dense-Core*** subset of [FineWeb-2](https://huggingface.co/datasets/HuggingFaceFW/fineweb-2). We isolated this subset using a combination of three hierarchical filters:

- **Coherence:** Eliminates structurally fragmented or incoherent documents.
- **Information Value:** Isolates content-rich and fact-bearing texts.
- **Educational Quality:** Selects strictly for pedagogical clarity and deep explanations.

We demonstrate that repeated exposure to this strict, high-quality subset is more sample-efficient than a single pass over less filtered and more diverse corpora. For a comprehensive look at our experiments, please refer to our preprint: [*Repetition over Diversity*](https://arxiv.org/abs/2604.28075).
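
To make the filtering idea concrete, a purely hypothetical sketch of such a hierarchical pass is shown below; the score fields, thresholds, and the FineWeb-2 config name are placeholders and assumptions, not the real Dense-Core pipeline (which is described in the preprint).

```python
# Hypothetical sketch of a three-stage hierarchical quality filter over FineWeb-2.
# The score column names and thresholds are placeholders, not the actual schema.
from datasets import load_dataset

# German portion of FineWeb-2 (config name assumed), streamed to avoid a full download.
fineweb_de = load_dataset(
    "HuggingFaceFW/fineweb-2", name="deu_Latn", split="train", streaming=True
)

def keep_document(doc: dict) -> bool:
    # 1) Coherence: drop structurally fragmented or incoherent documents.
    if doc.get("coherence_score", 0.0) < 0.8:           # placeholder field/threshold
        return False
    # 2) Information value: keep only content-rich, fact-bearing texts.
    if doc.get("information_value_score", 0.0) < 0.8:   # placeholder field/threshold
        return False
    # 3) Educational quality: keep pedagogically clear, explanatory texts.
    return doc.get("educational_quality_score", 0.0) >= 0.8  # placeholder

dense_core = fineweb_de.filter(keep_document)
```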
**Boldt-1B** builds upon the foundation of [Boldt-DC-1B](https://huggingface.co/Boldt/Boldt-DC-1B) by adding 6B tokens of premium German news data, collected continuously since 2022 via the [Fundus](https://github.com/flairnlp/fundus) library. To support complex downstream tasks, it also features a doubled context window of 4096 tokens.
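
For orientation, a minimal Fundus crawling sketch is shown below; the publisher selection and article limit are illustrative only and do not reproduce the exact collection setup behind Boldt's news data.

```python
# Illustrative only: crawl a few articles from German publishers with Fundus.
# This is not the exact collection configuration used for the Boldt news corpus.
from fundus import Crawler, PublisherCollection

crawler = Crawler(PublisherCollection.de)

for article in crawler.crawl(max_articles=3):
    print(article)
```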
## Model Architecture

- **Parameters:** ~1 Billion
- **Context Window:** 4096 tokens
- **Training Data:** German Dense-Core subset (FineWeb-2) + 6B tokens of high-quality Fundus news data
- **Language:** German

## Usage
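A minimal text-generation sketch with the `transformers` library is shown below; the repository ID `Boldt/Boldt-1B` and the generation settings are assumptions inferred from the model names above, not a verified official snippet.

```python
# Minimal sketch: load Boldt-1B with transformers and generate German text.
# The repo ID "Boldt/Boldt-1B" and the generation settings are assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Boldt/Boldt-1B"  # assumed repository ID

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # use torch.float32 on hardware without bfloat16 support
    device_map="auto",
)

prompt = "Die Hauptstadt von Deutschland ist"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

outputs = model.generate(
    **inputs, max_new_tokens=64, do_sample=True, temperature=0.7, top_p=0.9
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```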