---
license: apache-2.0
language:
- de
pipeline_tag: text-generation
library_name: transformers
---
# Boldt-DC-1B

<img src="logo.png" width="500">

**Boldt** is a series of German Small Language Models (SLMs) trained from scratch. Our initial release includes the following models:

- [Boldt-DC-350M](https://huggingface.co/Boldt/Boldt-DC-350M)
- **Boldt-DC-1B** *(this model)*
- [Boldt-1B](https://huggingface.co/Boldt/Boldt-1B)
- [Boldt-1B-IT-Preview](https://huggingface.co/Boldt/Boldt-1B-IT-Preview)

Boldt models were trained on the German ***Dense-Core*** subset of [FineWeb-2](https://huggingface.co/datasets/HuggingFaceFW/fineweb-2), obtained using a combination of three hierarchical text quality filters:

- Coherence: removes structurally fragmented or incoherent documents
- Information Value: retains only content-rich and fact-bearing documents
- Educational Quality: selects for pedagogical clarity and explanatory depth

As a result, instead of single-pass pre-training on a large web corpus, Boldt models were trained for multiple epochs on a small, high-quality subset of a web corpus. We find that repeated training on high-quality subsets outperforms single-pass training on larger, less diverse corpora. For more details on the origin of this model and the research behind it, please refer to our [preprint](https://arxiv.org/abs/2604.28075)!

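The hierarchical nature of the filtering pipeline can be sketched as follows. Note that the scoring functions below are toy stand-ins for illustration only; the actual Dense-Core filters use trained quality classifiers described in the preprint.

```python
# Hypothetical sketch of hierarchical quality filtering.
# The scorers are toy proxies, NOT the actual Dense-Core classifiers.

def coherence_score(doc: str) -> float:
    # Toy proxy: fraction of non-empty lines ending in sentence punctuation.
    lines = [l for l in doc.splitlines() if l.strip()]
    if not lines:
        return 0.0
    return sum(l.rstrip().endswith((".", "!", "?")) for l in lines) / len(lines)

def information_value_score(doc: str) -> float:
    # Toy proxy: lexical diversity (unique words / total words).
    words = doc.split()
    return len(set(words)) / len(words) if words else 0.0

def educational_quality_score(doc: str) -> float:
    # Toy proxy: reward longer, explanation-length documents.
    return min(len(doc.split()) / 100.0, 1.0)

def dense_core_filter(docs, thresholds=(0.5, 0.5, 0.5)):
    """Apply the filters hierarchically: a document must pass each
    stage before the next, more expensive one is evaluated."""
    stages = (coherence_score, information_value_score, educational_quality_score)
    kept = docs
    for score, threshold in zip(stages, thresholds):
        kept = [d for d in kept if score(d) >= threshold]
    return kept
```

The hierarchical ordering matters for efficiency: cheap structural checks discard most low-quality documents before the costlier content-level filters run.
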
## Usage

**Note:** This is a base language model, not an instruction-tuned model. It is not optimized for chat or instruction following. For best results, use standard text completion rather than chat templates.

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

model_name = "Boldt/Boldt-DC-1B"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Basic text completion
text = "Berlin ist eine Stadt, wo"
inputs = tokenizer(text, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

## Evaluation

We evaluate Boldt-DC-1B on our [modernized German benchmark suite](https://huggingface.co/collections/Boldt/german-llm-benchmarks). It comprises the German subset of [Global MMLU](https://huggingface.co/datasets/CohereLabs/Global-MMLU) and updated translations of widely used English benchmarks, produced with [Tower+ 72B](https://huggingface.co/Unbabel/Tower-Plus-72B) to address issues we identified in existing German benchmark translations.

Despite being trained on substantially less data, Boldt-DC-1B outperforms other similarly sized SLMs capable of German on our evaluation suite. It also performs competitively with larger (around 2B-parameter) multilingual models.

| Category | Model | Tokens | MMLU | ARC-C | ARC-E | H-Swag | LAMBADA | OBQA | Avg. |
|----------|-------|--------|------|-------|-------|--------|---------|------|------|
| Ours | [Boldt-DC-350M](https://huggingface.co/Boldt/Boldt-DC-350M) | 200B | 29.29 | 32.24 | 52.87 | 43.21 | 37.48 | 45.86 | 40.16 |
| | **Boldt-DC-1B** | 200B | 31.06 | **35.99** | **57.30** | *48.69* | 42.80 | *48.48* | 44.05 |
| | [Boldt-1B](https://huggingface.co/Boldt/Boldt-1B) | 230B | **31.42** | *34.11* | *55.78* | **48.77** | *44.70* | **52.32** | **44.52** |
| Reference models - 1B | [LLäMmlein-1B](https://huggingface.co/LSX-UniWue/LLaMmlein_1B) | 1T | 29.26 | 30.27 | 48.19 | 44.80 | **44.89** | 47.27 | 40.78 |
| | [Gemma-3-1B](https://huggingface.co/google/gemma-3-1b-pt) | 2T* | 30.01 | 30.55 | 47.89 | 43.43 | 41.71 | 45.05 | 39.77 |
| | [Llama-3.2-1B](https://huggingface.co/meta-llama/Llama-3.2-1B) | 9T* | 28.58 | 29.90 | 40.51 | 40.07 | 44.31 | 44.04 | 37.90 |
| Reference models - >1B | [EuroLLM-1.7B](https://huggingface.co/utter-project/EuroLLM-1.7B) | 4T* | 31.04 | 31.58 | 54.68 | 45.30 | 44.52 | 50.50 | 42.94 |
| | [Qwen3-1.7B-Base](https://huggingface.co/Qwen/Qwen3-1.7B-Base) | 36T* | 34.17 | 37.49 | 57.00 | 45.20 | 49.81 | 45.66 | 44.89 |
| | [BübleLM-2B](https://huggingface.co/flair/bueble-lm-2b) | 2T* | 29.68 | 32.62 | 53.63 | 46.57 | 43.55 | 49.70 | 42.63 |
| | [Gemma-2-2B](https://huggingface.co/google/gemma-2-2b) | 2T* | 33.99 | 37.11 | 57.47 | 49.62 | 52.64 | 48.89 | 46.62 |

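The Avg. column appears to be the unweighted mean of the six benchmark scores (the arithmetic checks out for the Boldt rows); for example, recomputing it for Boldt-DC-1B:

```python
# Recompute the Avg. column for Boldt-DC-1B from its per-benchmark scores.
boldt_dc_1b = {
    "MMLU": 31.06, "ARC-C": 35.99, "ARC-E": 57.30,
    "H-Swag": 48.69, "LAMBADA": 42.80, "OBQA": 48.48,
}
avg = sum(boldt_dc_1b.values()) / len(boldt_dc_1b)
print(round(avg, 2))  # 44.05
```
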
## Safety & Ethics

We have not conducted systematic evaluations of the model for toxicity, demographic biases, or harmful stereotypes. Quality filtering may reduce some risks relative to unfiltered web data, but it cannot guarantee their absence, and repeated exposure during multi-epoch training could amplify rather than mitigate encoded biases. Users should exercise caution in sensitive use cases without further evaluation.

## Citation

```bibtex
@misc{boldt2026,
      title={Repetition over Diversity: High-Signal Data Filtering for Sample-Efficient German Language Modeling},
      author={Ansar Aynetdinov and Patrick Haller and Alan Akbik},
      year={2026},
      eprint={2604.28075},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2604.28075},
}
```