alanakbik committed (verified)
Commit 603618d · 1 Parent(s): 3f0a725

Update README.md

Files changed (1):
  1. README.md +15 -6
README.md CHANGED
@@ -16,15 +16,24 @@ library_name: transformers
  - **Boldt-1B** *(this model)*
  - [Boldt-1B-IT-Preview](https://huggingface.co/Boldt/Boldt-1B-IT-Preview)
 
- Boldt models were trained on the German ***Dense-Core*** subset of [FineWeb-2](https://huggingface.co/datasets/HuggingFaceFW/fineweb-2) that was obtained using a combination of 3 hierarchical text quality filters:
- 
- - Coherence: removes structurally fragmented or incoherent documents
- - Information Value: retains only content-rich and fact-bearing documents
- - Educational Quality: selects for pedagogical clarity and explanatory depth
- 
- As a result, instead of single-pass pre-training on a large web corpus, Boldt models were trained for multiple epochs on a small, high-quality subset of a web corpus. We find that repeated training on high quality subsets outperforms single-pass training on larger, less diverse corpora. For more details regarding the origin of this model and the reasearch behind it, please refer to our [preprint](https://arxiv.org/abs/2604.28075)!
- 
- **Boldt-1B** extends the training recipe of [Boldt-DC-1B](https://huggingface.co/Boldt/Boldt-DC-1B) by additionally including 6B tokens of high-quality German news articles crawled continuously since 2022 using the [Fundus](https://github.com/flairnlp/fundus) library. It is also trained with a larger context window of 4096, making it better suited for longer text sequences than the DC variants.
+ ### Repetition over Diversity
+ The training philosophy behind **Boldt** is centered on a key finding from our research: **repetition over diversity**.
+ 
+ Standard pre-training paradigms typically balance quality filtering against the need for massive token volume and broad corpus diversity. In contrast, Boldt models are trained for multiple epochs on a highly filtered dataset: the German ***Dense-Core*** subset of [FineWeb-2](https://huggingface.co/datasets/HuggingFaceFW/fineweb-2). We isolated this subset using a combination of three hierarchical filters (illustrated schematically after the list below):
+ 
+ - **Coherence:** Eliminates structurally fragmented or incoherent documents.
+ - **Information Value:** Isolates content-rich and fact-bearing texts.
+ - **Educational Quality:** Selects strictly for pedagogical clarity and deep explanations.
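+ 
+ Conceptually, the cascade is sequential: a document must clear each stage before the next is applied. The sketch below is only a schematic illustration of that idea; the scorer functions (`score_coherence`, `score_information_value`, `score_educational_quality`), the thresholds, and the `passes_dense_core` helper are hypothetical placeholders, not the actual Dense-Core implementation:
+ 
+ ```python
+ from typing import Callable
+ 
+ # Placeholder scorers: a real pipeline would use trained quality
+ # classifiers; these trivial stand-ins only make the sketch runnable.
+ def score_coherence(doc: str) -> float:
+     return 1.0 if len(doc.split()) > 50 else 0.0
+ 
+ def score_information_value(doc: str) -> float:
+     return 1.0
+ 
+ def score_educational_quality(doc: str) -> float:
+     return 1.0
+ 
+ # Hierarchical ordering: structural coherence first, then information
+ # value, then educational quality. Thresholds are illustrative.
+ STAGES: list[tuple[Callable[[str], float], float]] = [
+     (score_coherence, 0.5),
+     (score_information_value, 0.5),
+     (score_educational_quality, 0.5),
+ ]
+ 
+ def passes_dense_core(doc: str) -> bool:
+     """A document survives only if it clears every stage, in order."""
+     return all(scorer(doc) >= threshold for scorer, threshold in STAGES)
+ 
+ corpus = ["Zu kurzes Fragment.", " ".join(["Dies ist ein Satz."] * 30)]
+ dense_core = [doc for doc in corpus if passes_dense_core(doc)]
+ ```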
 
+ We demonstrate that repeated exposure to this strict, high-quality subset is more sample-efficient than a single pass over less filtered and more diverse corpora. For a comprehensive look at our experiments, please refer to our preprint: [*Repetition over Diversity*](https://arxiv.org/abs/2604.28075).
+ 
+ **Boldt-1B** builds upon the foundation of [Boldt-DC-1B](https://huggingface.co/Boldt/Boldt-DC-1B) by adding 6B tokens of premium German news data, collected continuously since 2022 via the [Fundus](https://github.com/flairnlp/fundus) library (a basic crawling sketch follows below). To support complex downstream tasks, it also features a doubled context window of 4096 tokens.
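+ 
+ As a rough illustration of how such a news corpus can be gathered, the snippet below uses Fundus's documented basic crawling interface with its German publisher collection; the exact publisher set, scheduling, and deduplication behind the 6B-token corpus are not specified here:
+ 
+ ```python
+ from fundus import Crawler, PublisherCollection
+ 
+ # Crawl articles from the German news publishers supported by Fundus.
+ crawler = Crawler(PublisherCollection.de)
+ 
+ # Fetch a few articles and inspect their plain text.
+ for article in crawler.crawl(max_articles=5):
+     print(article.title)
+     print(article.plaintext[:200])
+ ```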
+ 
+ ## Model Architecture
+ - **Parameters:** ~1 Billion
+ - **Context Window:** 4096 tokens
+ - **Training Data:** German Dense-Core subset (FineWeb-2) + 6B tokens high-quality Fundus news data
+ - **Language:** German
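+ 
+ As a quick orientation ahead of the detailed instructions in the Usage section below, loading the model with the standard Hugging Face Transformers causal-LM API might look as follows (this sketch assumes the repository id `Boldt/Boldt-1B`):
+ 
+ ```python
+ from transformers import AutoModelForCausalLM, AutoTokenizer
+ 
+ # Assumed hub id for this model card; adjust if the repository path differs.
+ model_id = "Boldt/Boldt-1B"
+ 
+ tokenizer = AutoTokenizer.from_pretrained(model_id)
+ model = AutoModelForCausalLM.from_pretrained(model_id)
+ 
+ # German prompt; the model accepts contexts of up to 4096 tokens.
+ inputs = tokenizer("Die Hauptstadt von Deutschland ist", return_tensors="pt")
+ outputs = model.generate(**inputs, max_new_tokens=30)
+ print(tokenizer.decode(outputs[0], skip_special_tokens=True))
+ ```
+ 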
  ## Usage