alanakbik committed (verified)
Commit 603618d · 1 Parent(s): 3f0a725

Update README.md

Files changed (1):
  1. README.md +15 -6
README.md CHANGED
@@ -16,15 +16,24 @@ library_name: transformers
  - **Boldt-1B** *(this model)*
  - [Boldt-1B-IT-Preview](https://huggingface.co/Boldt/Boldt-1B-IT-Preview)
 
- Boldt models were trained on the German ***Dense-Core*** subset of [FineWeb-2](https://huggingface.co/datasets/HuggingFaceFW/fineweb-2) that was obtained using a combination of 3 hierarchical text quality filters:
- 
- - Coherence: removes structurally fragmented or incoherent documents
- - Information Value: retains only content-rich and fact-bearing documents
- - Educational Quality: selects for pedagogical clarity and explanatory depth
- 
- As a result, instead of single-pass pre-training on a large web corpus, Boldt models were trained for multiple epochs on a small, high-quality subset of a web corpus. We find that repeated training on high quality subsets outperforms single-pass training on larger, less diverse corpora. For more details regarding the origin of this model and the reasearch behind it, please refer to our [preprint](https://arxiv.org/abs/2604.28075)!
- 
- **Boldt-1B** extends the training recipe of [Boldt-DC-1B](https://huggingface.co/Boldt/Boldt-DC-1B) by additionally including 6B tokens of high-quality German news articles crawled continuously since 2022 using the [Fundus](https://github.com/flairnlp/fundus) library. It is also trained with a larger context window of 4096, making it better suited for longer text sequences than the DC variants.
+ ### Repetition over Diversity
+ The training philosophy behind **Boldt** is centered on a key finding from our research: **repetition over diversity**.
+ 
+ Standard pre-training paradigms typically balance quality filtering against the need for massive token volume and broad corpus diversity. In contrast, Boldt models are trained for multiple epochs on a highly filtered dataset: the German ***Dense-Core*** subset of [FineWeb-2](https://huggingface.co/datasets/HuggingFaceFW/fineweb-2). We isolated this subset using a combination of three hierarchical filters (illustrated schematically after the list below):
+ 
+ - **Coherence:** Eliminates structurally fragmented or incoherent documents.
+ - **Information Value:** Isolates content-rich and fact-bearing texts.
+ - **Educational Quality:** Selects strictly for pedagogical clarity and deep explanations.
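+ 
+ Conceptually, the cascade is sequential: a document must clear each stage before the next is applied. The sketch below is only a schematic illustration of that idea; the scorer functions (`score_coherence`, `score_information_value`, `score_educational_quality`), the thresholds, and the `passes_dense_core` helper are hypothetical placeholders, not the actual Dense-Core implementation:
+ 
+ ```python
+ from typing import Callable
+ 
+ # Placeholder scorers: a real pipeline would use trained quality
+ # classifiers; these trivial stand-ins only make the sketch runnable.
+ def score_coherence(doc: str) -> float:
+     return 1.0 if len(doc.split()) > 50 else 0.0
+ 
+ def score_information_value(doc: str) -> float:
+     return 1.0
+ 
+ def score_educational_quality(doc: str) -> float:
+     return 1.0
+ 
+ # Hierarchical ordering: structural coherence first, then information
+ # value, then educational quality. Thresholds are illustrative.
+ STAGES: list[tuple[Callable[[str], float], float]] = [
+     (score_coherence, 0.5),
+     (score_information_value, 0.5),
+     (score_educational_quality, 0.5),
+ ]
+ 
+ def passes_dense_core(doc: str) -> bool:
+     """A document survives only if it clears every stage, in order."""
+     return all(scorer(doc) >= threshold for scorer, threshold in STAGES)
+ 
+ corpus = ["Zu kurzes Fragment.", " ".join(["Dies ist ein Satz."] * 30)]
+ dense_core = [doc for doc in corpus if passes_dense_core(doc)]
+ ```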
 
+ We demonstrate that repeated exposure to this strict, high-quality subset is more sample-efficient than a single pass over less filtered and more diverse corpora. For a comprehensive look at our experiments, please refer to our preprint: [*Repetition over Diversity*](https://arxiv.org/abs/2604.28075).
+ 
+ **Boldt-1B** builds upon the foundation of [Boldt-DC-1B](https://huggingface.co/Boldt/Boldt-DC-1B) by adding 6B tokens of premium German news data, collected continuously since 2022 via the [Fundus](https://github.com/flairnlp/fundus) library (a basic crawling sketch follows below). To support complex downstream tasks, it also features a doubled context window of 4096 tokens.
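+ 
+ As a rough illustration of how such a news corpus can be gathered, the snippet below uses Fundus's documented basic crawling interface with its German publisher collection; the exact publisher set, scheduling, and deduplication behind the 6B-token corpus are not specified here:
+ 
+ ```python
+ from fundus import Crawler, PublisherCollection
+ 
+ # Crawl articles from the German news publishers supported by Fundus.
+ crawler = Crawler(PublisherCollection.de)
+ 
+ # Fetch a few articles and inspect their plain text.
+ for article in crawler.crawl(max_articles=5):
+     print(article.title)
+     print(article.plaintext[:200])
+ ```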
+ 
+ ## Model Architecture
+ - **Parameters:** ~1 Billion
+ - **Context Window:** 4096 tokens
+ - **Training Data:** German Dense-Core subset (FineWeb-2) + 6B tokens high-quality Fundus news data
+ - **Language:** German
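+ 
+ As a quick orientation ahead of the detailed instructions in the Usage section below, loading the model with the standard Hugging Face Transformers causal-LM API might look as follows (this sketch assumes the repository id `Boldt/Boldt-1B`):
+ 
+ ```python
+ from transformers import AutoModelForCausalLM, AutoTokenizer
+ 
+ # Assumed hub id for this model card; adjust if the repository path differs.
+ model_id = "Boldt/Boldt-1B"
+ 
+ tokenizer = AutoTokenizer.from_pretrained(model_id)
+ model = AutoModelForCausalLM.from_pretrained(model_id)
+ 
+ # German prompt; the model accepts contexts of up to 4096 tokens.
+ inputs = tokenizer("Die Hauptstadt von Deutschland ist", return_tensors="pt")
+ outputs = model.generate(**inputs, max_new_tokens=30)
+ print(tokenizer.decode(outputs[0], skip_special_tokens=True))
+ ```
+ 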
  ## Usage