---
title: README
emoji: 📖
colorFrom: pink
colorTo: green
sdk: static
pinned: false
---

**Welcome to Boldt!**

**Boldt** is a family of German language models developed by the **Chair of Machine Learning @ Humboldt-Universität zu Berlin**. This organization hosts our **models, datasets, and research artifacts** related to the Boldt project.

Feel free to explore, download, and experiment with our latest releases! 🚀

## 🤖 The Boldt Model Family

Our models are trained on *Dense-Core*, our high-quality German subset of FineWeb-2, using a multi-epoch recipe that repeats this high-signal data rather than diluting it with lower-quality text.

| Model | Parameters | Context Window (tokens) | Description |
| :--- | :--- | :--- | :--- |
| [**Boldt-DC-350M**](https://huggingface.co/Boldt/Boldt-DC-350M) | 350M | 2048 | Ultra-lightweight base model for constrained environments. |
| [**Boldt-DC-1B**](https://huggingface.co/Boldt/Boldt-DC-1B) | 1B | 2048 | Highly optimized 1B base model with top-tier German performance. |
| [**Boldt-1B**](https://huggingface.co/Boldt/Boldt-1B) | 1B | 4096 | Extended context window, trained on an additional 6B tokens of high-quality German news data. |
| [**Boldt-1B-IT-Preview**](https://huggingface.co/Boldt/Boldt-1B-IT-Preview) | 1B | 4096 | Experimental instruction-tuned model. |
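
To try a model, the checkpoints should load through the standard `transformers` causal-LM API. The snippet below is a minimal sketch rather than an official quickstart: it assumes the repos are plain causal-LM checkpoints, and the model ID, dtype, prompt, and generation settings are only illustrative.

```python
# Minimal sketch: load a Boldt checkpoint with Hugging Face transformers.
# Assumes a standard causal-LM repo; see each model card for specifics.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Boldt/Boldt-1B"  # any model ID from the table above

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)

# The Boldt base models are plain language models: prompt them with text
# to continue, not with instructions.
prompt = "Die Hauptstadt von Deutschland ist"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=50, do_sample=False)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

For **Boldt-1B-IT-Preview**, format conversations with the tokenizer's chat template (`tokenizer.apply_chat_template`) if the model card provides one, rather than using raw continuation prompts.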

## 📊 Comparison

Boldt-1B compares favorably against other similarly sized models on German LLM benchmarks:

|  |
|
|
It is even competitive with many larger (2B-parameter) models. See our paper for the full evaluation.

## 📚 Research & Artifacts
* **Paper:** [Repetition over Diversity: High-Signal Data Filtering for Sample-Efficient German Language Modeling (arXiv 2026)](https://arxiv.org/abs/2604.28075)
* **Evaluation Suite:** [Modernized German Benchmarks](https://huggingface.co/collections/Boldt/german-llm-benchmarks)
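
The benchmarks in the evaluation suite can be run against the Boldt models with standard tooling. As a minimal sketch, assuming the tasks are packaged for [lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness) (the task name below is a hypothetical placeholder; substitute the names from the collection):

```python
# Hedged sketch: evaluate a Boldt checkpoint with EleutherAI's lm-eval.
# "german_benchmark_task" is a placeholder, not a real task name.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",                              # Hugging Face transformers backend
    model_args="pretrained=Boldt/Boldt-1B",  # any model from the table above
    tasks=["german_benchmark_task"],         # replace with real task names
    num_fewshot=0,
)
print(results["results"])
```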