alanakbik committed
Commit af54089 · verified · 1 Parent(s): 97e5d00

Update README.md

Files changed (1): README.md (+25 -18)
README.md CHANGED
@@ -16,13 +16,17 @@ library_name: transformers
  - [Boldt-1B](https://huggingface.co/Boldt/Boldt-1B)
  - [Boldt-1B-IT-Preview](https://huggingface.co/Boldt/Boldt-1B-IT-Preview)

- Boldt models were trained on the German ***Dense-Core*** subset of [FineWeb-2](https://huggingface.co/datasets/HuggingFaceFW/fineweb-2) that was obtained using a combination of 3 hierarchical text quality filters:
-
- - Coherence: removes structurally fragmented or incoherent documents
- - Information Value: retains only content-rich and fact-bearing documents
- - Educational Quality: selects for pedagogical clarity and explanatory depth
-
- As a result, instead of single-pass pre-training on a large web corpus, Boldt models were trained for multiple epochs on a small, high-quality subset of a web corpus. We find that repeated training on high-quality subsets outperforms single-pass training on larger, less diverse corpora. For more details regarding the origin of this model and the research behind it, please refer to our [preprint](https://arxiv.org/abs/2604.28075)!

  ## Usage

@@ -42,19 +46,22 @@ outputs = model.generate(**inputs, max_new_tokens=64)
  ## Evaluation

- We evaluate Boldt-350M on our [modernized German benchmark suite](https://huggingface.co/collections/Boldt/german-llm-benchmarks). It comprises the German subset of [Global MMLU](https://huggingface.co/datasets/CohereLabs/Global-MMLU) and updated translations of widely used English benchmarks, produced using [Tower+ 72B](https://huggingface.co/Unbabel/Tower-Plus-72B) to address issues we identified in existing German benchmark translations.
- Despite being trained on a substantially smaller amount of data, Boldt-350M outperforms other similarly-sized SLMs capable of German on our evaluation suite. It also performs competitively with larger-sized (around 1B) German and multilingual models.
-
- | Category | Model | Tokens | MMLU | ARC-C | ARC-E | H-Swag | LAMBADA | OBQA | Avg. |
- |----------|-------|--------|------|-------|-------|--------|---------|------|------|
- | Ours | **Boldt-DC-350M** | 200B | *29.29* | *32.24* | **52.87** | **43.21** | *37.48* | **45.86** | **40.16** |
- | Reference models | [LLäMmlein-120M](https://huggingface.co/LSX-UniWue/LLaMmlein_120M) | 1T | 26.41 | 24.74 | 37.13 | 23.02 | 26.97 | 43.03 | 33.33 |
- | | [Gemma-3-270M](https://huggingface.co/google/gemma-3-270m) | 6T* | 26.09 | 24.93 | 34.68 | 31.60 | 29.85 | 37.37 | 32.07 |
- | | [Qwen3-0.6B-Base](https://huggingface.co/Qwen/Qwen3-0.6B-Base) | 36T* | **29.87** | **32.90** | 41.90 | 38.13 | **39.57** | 41.01 | 37.23 |
- | Reference models - 1B | [LLäMmlein-1B](https://huggingface.co/LSX-UniWue/LLaMmlein_1B) | 1T | 29.26 | 30.27 | 48.19 | 44.80 | 44.89 | 47.27 | 40.78 |
- | | [Gemma-3-1B](https://huggingface.co/google/gemma-3-1b-pt) | 2T* | 30.01 | 30.55 | 47.89 | 43.43 | 41.71 | 45.05 | 39.77 |
- | | [Llama-3.2-1B](https://huggingface.co/meta-llama/Llama-3.2-1B) | 9T* | 28.58 | 29.90 | 40.51 | 40.07 | 44.31 | 44.04 | 37.90 |

  ## Safety & Ethics
 
  - [Boldt-1B](https://huggingface.co/Boldt/Boldt-1B)
  - [Boldt-1B-IT-Preview](https://huggingface.co/Boldt/Boldt-1B-IT-Preview)

+ ### Repetition over Diversity
+ The training philosophy behind **Boldt** is centered on a key finding from our research: **repetition over diversity**.
+
+ Standard pre-training paradigms typically balance quality filtering against the need for massive token volume and broad corpus diversity. In contrast, Boldt models are trained for multiple epochs on a highly filtered dataset: the German ***Dense-Core*** subset of [FineWeb-2](https://huggingface.co/datasets/HuggingFaceFW/fineweb-2). We isolated this subset using a combination of three hierarchical filters (see the sketch below):
+
+ - **Coherence:** Eliminates structurally fragmented or incoherent documents.
+ - **Information Value:** Isolates content-rich and fact-bearing texts.
+ - **Educational Quality:** Selects strictly for pedagogical clarity and deep explanations.
+
+ We demonstrate that repeated exposure to this strict, high-quality subset is more sample-efficient than a single pass over less filtered and more diverse corpora. For a comprehensive look at our experiments, please refer to our preprint: [*Repetition over Diversity*](https://arxiv.org/abs/2604.28075).
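The filter implementations themselves are not shipped with this card. Purely as an illustration of what a hierarchical (cascaded) filter pipeline looks like, here is a minimal sketch in which every scoring function and threshold is a hypothetical placeholder, not the preprint's actual classifiers:

```python
# Minimal sketch of a hierarchical quality-filter cascade.
# All scorers and thresholds below are hypothetical placeholders;
# the real Dense-Core filters are described in the Boldt preprint.

def coherence_score(doc: str) -> float:
    # Placeholder heuristic; a real filter would use a trained classifier.
    return 1.0 if len(doc.split()) >= 50 else 0.0

def information_value_score(doc: str) -> float:
    return 1.0  # placeholder

def educational_quality_score(doc: str) -> float:
    return 1.0  # placeholder

# Hierarchical: each stage only sees documents that survived the previous one.
STAGES = [
    (coherence_score, 0.5),
    (information_value_score, 0.5),
    (educational_quality_score, 0.5),
]

def dense_core_subset(corpus):
    kept = list(corpus)
    for score_fn, threshold in STAGES:
        kept = [doc for doc in kept if score_fn(doc) >= threshold]
    return kept
```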
 
 
  ## Usage
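The Usage section body is unchanged by this commit and therefore elided from the diff; only its `generate` call is visible in the second hunk's context line. For orientation, a minimal sketch of that standard transformers pattern, assuming the `Boldt/Boldt-1B` repo id linked above and a made-up prompt:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Boldt/Boldt-1B"  # repo id linked from this card
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

prompt = "Die Hauptstadt von Deutschland ist"  # hypothetical German prompt
inputs = tokenizer(prompt, return_tensors="pt")
# Matches the generate call visible in the hunk context above.
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```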
 
 
  ## Evaluation

+ We evaluate Boldt models on our [modernized German benchmark suite](https://huggingface.co/collections/Boldt/german-llm-benchmarks). See our paper [(Aynetdinov et al., 2026)](https://arxiv.org/abs/2604.28075) for details on the structural and translation corrections we performed.
+
+ Even Boldt-350M, while significantly smaller than the 1B models in this comparison, fares well, outperforming the much larger multilingual Gemma-3-1B and Llama-3.2-1B models.
+
+ ### 1B Weight Class (Direct Comparison)
+ *Note: Bold text indicates the best score in the 1B category.*
+
+ | Model | Tokens | MMLU | ARC-C | ARC-E | H-Swag | LAMBADA | OBQA | Avg. |
+ | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- |
+ | [Boldt-DC-350M](https://huggingface.co/Boldt/Boldt-DC-350M) | 200B | 29.29 | 32.24 | 52.87 | 43.21 | 37.48 | 45.86 | 40.16 |
+ | [Boldt-DC-1B](https://huggingface.co/Boldt/Boldt-DC-1B) | 200B | 31.06 | **35.99** | **57.30** | 48.69 | 42.80 | 48.48 | 44.05 |
+ | **Boldt-1B (this model)** | 230B | **31.42** | 34.11 | 55.78 | **48.77** | 44.70 | **52.32** | **44.52** |
+ | [LLäMmlein-1B](https://huggingface.co/LSX-UniWue/LLaMmlein_1B) | 1T | 29.26 | 30.27 | 48.19 | 44.80 | **44.89** | 47.27 | 40.78 |
+ | [Gemma-3-1B](https://huggingface.co/google/gemma-3-1b-pt) | 2T* | 30.01 | 30.55 | 47.89 | 43.43 | 41.71 | 45.05 | 39.77 |
+ | [Llama-3.2-1B](https://huggingface.co/meta-llama/Llama-3.2-1B) | 9T* | 28.58 | 29.90 | 40.51 | 40.07 | 44.31 | 44.04 | 37.90 |
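Assuming the Avg. column is the unweighted mean of the six task scores (consistent with the numbers shown), it can be re-derived directly; for the Boldt-1B row:

```python
# Re-derive the Avg. column for the Boldt-1B row, assuming an unweighted mean.
scores = [31.42, 34.11, 55.78, 48.77, 44.70, 52.32]  # MMLU..OBQA from the table
print(round(sum(scores) / len(scores), 2))  # 44.52, matching the Avg. column
```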
 
  ## Safety & Ethics