alanakbik committed
Commit 3f0a725 · verified · Parent: bebe05d

Update README.md

Files changed (1): README.md (+27 -17)
README.md CHANGED
@@ -9,7 +9,7 @@ library_name: transformers
 
 <img src="logo.png" width="500">
 
-**Boldt** is a series of German Small Language Models (SLMs) trained from scratch. Our inital release includes three models:
+**Boldt** is a series of German Small Language Models (SLMs) trained from scratch. Our initial release includes four models:
 
 - [Boldt-DC-350M](https://huggingface.co/Boldt/Boldt-DC-350M)
 - [Boldt-DC-1B](https://huggingface.co/Boldt/Boldt-DC-1B)
@@ -44,22 +44,32 @@ outputs = model.generate(**inputs, max_new_tokens=64)
 
 ## Evaluation
 
-We evaluate Boldt-1B on our [modernized German benchmark suite](https://huggingface.co/collections/Boldt/german-llm-benchmarks). It comprises the German subset of [Global MMLU](https://huggingface.co/datasets/CohereLabs/Global-MMLU) and updated translations of widely used English benchmarks, produced using [Tower+ 72B](https://huggingface.co/Unbabel/Tower-Plus-72B) to address issues we identified in existing German benchmark translations.
-Despite being trained on a substantially smaller amount of data, Boldt-1B outperforms other similarly-sized SLMs capable of German on our evaluation suite. It also performs competitively with larger-sized(around 2B) multilingual models.
-
-| Category | Model | Tokens | MMLU | ARC-C | ARC-E | H-Swag | LAMBADA | OBQA | Avg. |
-|----------|--------|--------|------|-------|-------|--------|----------|------|------|
-| Ours | [Boldt-DC-350M](https://huggingface.co/Boldt/Boldt-DC-350M) | 200B | 29.29 | 32.24 | 52.87 | 43.21 | 37.48 | 45.86 | 40.16 |
-| | [Boldt-DC-1B](https://huggingface.co/Boldt/Boldt-DC-1B) | 200B | 31.06 | **35.99** | **57.30** | *48.69* | 42.80 | *48.48* | 44.05 |
-| | **Boldt-1B** | 230B | **31.42** | *34.11* | *55.78* | **48.77** | *44.70* | **52.32** | **44.52** |
-| Reference models - 1B | [LLäMmlein-1B](https://huggingface.co/LSX-UniWue/LLaMmlein_1B) | 1T | 29.26 | 30.27 | 48.19 | 44.80 | **44.89** | 47.27 | 40.78 |
-| | [Gemma-3-1B](https://huggingface.co/google/gemma-3-1b-pt) | 2T* | 30.01 | 30.55 | 47.89 | 43.43 | 41.71 | 45.05 | 39.77 |
-| | [Llama-3.2-1B](https://huggingface.co/meta-llama/Llama-3.2-1B) | 9T* | 28.58 | 29.90 | 40.51 | 40.07 | 44.31 | 44.04 | 37.90 |
-||
-| Reference models - >1B | [EuroLLM-1.7B](https://huggingface.co/utter-project/EuroLLM-1.7B) | 4T* | 31.04 | 31.58 | 54.68 | 45.30 | 44.52 | 50.50 | 42.94 |
-| | [Qwen3-1.7B-Base](https://huggingface.co/Qwen/Qwen3-1.7B-Base) | 36T* | 34.17 | 37.49 | 57.00 | 45.20 | 49.81 | 45.66 | 44.89 |
-| | [BübleLM-2B](https://huggingface.co/flair/bueble-lm-2b) | 2T* | 29.68 | 32.62 | 53.63 | 46.57 | 43.55 | 49.70 | 42.63 |
-| | [Gemma-2-2B](https://huggingface.co/google/gemma-2-2b) | 2T* | 33.99 | 37.11 | 57.47 | 49.62 | 52.64 | 48.89 | 46.62 |
+![Boldt-1B Performance Comparison](boldt_1b_evaluation.png)
+
+We evaluate Boldt-1B on our [modernized German benchmark suite](https://huggingface.co/collections/Boldt/german-llm-benchmarks). See our paper [(Aynetdinov et al., 2026)](https://arxiv.org/abs/2604.28075) for details on the structural and translation corrections we performed.
+
+Despite being trained on substantially fewer tokens, the Boldt-1B family outperforms other 1B-class models on German tasks and performs competitively with much larger multilingual models.
+
+### 1B Weight Class (Direct Comparison)
+*Note: Bold text indicates the best score in the 1B category.*
+
+| Model | Tokens | MMLU | ARC-C | ARC-E | H-Swag | LAMBADA | OBQA | Avg. |
+| :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- |
+| [Boldt-DC-350M](https://huggingface.co/Boldt/Boldt-DC-350M) | 200B | 29.29 | 32.24 | 52.87 | 43.21 | 37.48 | 45.86 | 40.16 |
+| [Boldt-DC-1B](https://huggingface.co/Boldt/Boldt-DC-1B) | 200B | 31.06 | **35.99** | **57.30** | 48.69 | 42.80 | 48.48 | 44.05 |
+| **Boldt-1B (Ours)** | 230B | **31.42** | 34.11 | 55.78 | **48.77** | 44.70 | **52.32** | **44.52** |
+| [LLäMmlein-1B](https://huggingface.co/LSX-UniWue/LLaMmlein_1B) | 1T | 29.26 | 30.27 | 48.19 | 44.80 | **44.89** | 47.27 | 40.78 |
+| [Gemma-3-1B](https://huggingface.co/google/gemma-3-1b-pt) | 2T* | 30.01 | 30.55 | 47.89 | 43.43 | 41.71 | 45.05 | 39.77 |
+| [Llama-3.2-1B](https://huggingface.co/meta-llama/Llama-3.2-1B) | 9T* | 28.58 | 29.90 | 40.51 | 40.07 | 44.31 | 44.04 | 37.90 |
+
+### 1.7B–2B Weight Class (Larger Reference Models)
+
+| Model | Tokens | MMLU | ARC-C | ARC-E | H-Swag | LAMBADA | OBQA | Avg. |
+| :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- |
+| [EuroLLM-1.7B](https://huggingface.co/utter-project/EuroLLM-1.7B) | 4T* | 31.04 | 31.58 | 54.68 | 45.30 | 44.52 | **50.50** | 42.94 |
+| [Qwen3-1.7B-Base](https://huggingface.co/Qwen/Qwen3-1.7B-Base) | 36T* | **34.17** | **37.49** | 57.00 | 45.20 | 49.81 | 45.66 | 44.89 |
+| [BübleLM-2B](https://huggingface.co/flair/bueble-lm-2b) | 2T* | 29.68 | 32.62 | 53.63 | 46.57 | 43.55 | 49.70 | 42.63 |
+| [Gemma-2-2B](https://huggingface.co/google/gemma-2-2b) | 2T* | 33.99 | 37.11 | **57.47** | **49.62** | **52.64** | 48.89 | **46.62** |
 
 ## Safety & Ethics
 
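The Avg. column in the updated tables is consistent with an unweighted mean of the six benchmark scores, rounded to two decimals. A minimal sketch reproducing it for two rows (the averaging rule itself is an inference from the numbers, not stated in the README):

```python
# Per-benchmark scores copied from the updated README tables, in column order:
# MMLU, ARC-C, ARC-E, H-Swag, LAMBADA, OBQA.
scores = {
    "Boldt-1B": [31.42, 34.11, 55.78, 48.77, 44.70, 52.32],
    "Gemma-2-2B": [33.99, 37.11, 57.47, 49.62, 52.64, 48.89],
}

def unweighted_avg(values):
    """Plain mean rounded to two decimals, matching the Avg. column."""
    return round(sum(values) / len(values), 2)

averages = {model: unweighted_avg(v) for model, v in scores.items()}
print(averages)  # → {'Boldt-1B': 44.52, 'Gemma-2-2B': 46.62}, as reported
```

The same one-liner can be used to spot-check any other row, e.g. when adding a new reference model to the tables.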