# Boldt-DC-1B
Boldt is a series of German Small Language Models (SLMs) trained from scratch. Our initial release includes four models:
- Boldt-DC-350M
- Boldt-DC-1B (this model)
- Boldt-1B
- Boldt-1B-IT-Preview
## Repetition over Diversity
The training philosophy behind Boldt is centered on a key finding from our research: repetition over diversity.
Standard pre-training paradigms typically balance quality filtering against the need for massive token volume and broad corpus diversity. In contrast, Boldt models are trained for multiple epochs on a highly filtered dataset: the German Dense-Core subset of FineWeb-2. We isolated this subset using a combination of three hierarchical filters:
- Coherence: Eliminates structurally fragmented or incoherent documents.
- Information Value: Isolates content-rich and fact-bearing texts.
- Educational Quality: Selects strictly for pedagogical clarity and deep explanations.
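As a minimal sketch of how such a hierarchical filter cascade can be wired together (the scoring functions below are illustrative stand-ins, not the actual classifiers used to build the Dense-Core subset):

```python
# Sketch of a three-stage hierarchical filter. The scoring functions
# are illustrative stand-ins, NOT the actual Dense-Core classifiers.

def coherence_score(doc: str) -> float:
    # Stand-in: penalize very short, fragmented documents.
    sentences = [s for s in doc.split(".") if s.strip()]
    return min(len(sentences) / 5.0, 1.0)

def information_value_score(doc: str) -> float:
    # Stand-in: lexical diversity as a crude proxy for content density.
    words = doc.split()
    return len(set(words)) / max(len(words), 1)

def educational_quality_score(doc: str) -> float:
    # Stand-in: in practice this stage would be a trained classifier.
    d = doc.lower()
    return 1.0 if ("weil" in d or "daher" in d) else 0.5

def passes_dense_core(doc: str, thresholds=(0.6, 0.5, 0.8)) -> bool:
    # Hierarchical application: a document must clear each filter
    # before the next, stricter one is evaluated (all() short-circuits).
    stages = (coherence_score, information_value_score,
              educational_quality_score)
    return all(score(doc) >= t for score, t in zip(stages, thresholds))
```

The hierarchical ordering matters for throughput: cheap structural checks reject most documents before the more expensive quality stages run.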
We demonstrate that repeated exposure to this strict, high-quality subset is more sample-efficient than a single pass over less filtered and more diverse corpora. For a comprehensive look at our experiments, please refer to our preprint: Repetition over Diversity.
Boldt-DC-1B is the highly optimized 1-billion-parameter foundation model of this methodology, trained over multiple epochs on our extreme-signal dataset for a total of 200B tokens.
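The multi-epoch regime can be sketched as cycling a small, filtered corpus until a fixed token budget is spent (the corpus and budget below are hypothetical toy values, not figures from the paper):

```python
from itertools import cycle, islice

def multi_epoch_stream(corpus, token_budget, tokens_per_doc):
    # Repeat the small, filtered corpus cyclically until the fixed
    # token budget is exhausted (multi-epoch training), instead of
    # taking a single pass over a much larger, less filtered corpus.
    n_docs = token_budget // tokens_per_doc
    return list(islice(cycle(corpus), n_docs))

# Toy example: a 3-document corpus, a 12-token budget, 2 tokens per
# document -> 6 draws, i.e. each document is seen for 2 full epochs.
stream = multi_epoch_stream(["doc_a", "doc_b", "doc_c"], 12, 2)
```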
## Model Architecture
- Parameters: ~1 Billion
- Context Window: 2048 tokens
- Training Data: German Dense-Core subset (FineWeb-2) [200B tokens]
- Language: German
## Usage
Note: This is a base language model, not an instruction-tuned model. It is not optimized for chat or instruction following. For best results, use standard text completion rather than chat templates.
```python
from transformers import AutoTokenizer, AutoModelForCausalLM

model_name = "Boldt/Boldt-DC-1B"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_name, trust_remote_code=True)

# Basic text completion
text = "Berlin ist eine Stadt, wo"
inputs = tokenizer(text, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
## Evaluation
We evaluate the Boldt models on our modernized German benchmark suite. See our paper (Aynetdinov et al., 2026) for details on the structural and translation corrections we performed.
Despite being trained on substantially fewer tokens, the Boldt-1B family outperforms other 1B-class models on German tasks and performs competitively with much larger multilingual models.
### 1B Weight Class (Direct Comparison)
Note: Bold text indicates the best score in the 1B category.
| Model | Tokens | MMLU | ARC-C | ARC-E | H-Swag | LAMBADA | OBQA | Avg. |
|---|---|---|---|---|---|---|---|---|
| Boldt-DC-350M | 200B | 29.29 | 32.24 | 52.87 | 43.21 | 37.48 | 45.86 | 40.16 |
| Boldt-DC-1B (this model) | 200B | 31.06 | **35.99** | **57.30** | 48.69 | 42.80 | 48.48 | 44.05 |
| Boldt-1B | 230B | **31.42** | 34.11 | 55.78 | **48.77** | 44.70 | **52.32** | **44.52** |
| LLäMmlein-1B | 1T | 29.26 | 30.27 | 48.19 | 44.80 | **44.89** | 47.27 | 40.78 |
| Gemma-3-1B | 2T* | 30.01 | 30.55 | 47.89 | 43.43 | 41.71 | 45.05 | 39.77 |
| Llama-3.2-1B | 9T* | 28.58 | 29.90 | 40.51 | 40.07 | 44.31 | 44.04 | 37.90 |
| Qwen3.5-0.8B-Base | >36T* | 30.79 | 32.05 | 46.20 | 38.90 | 36.02 | 43.84 | 37.97 |
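The Avg. column is the unweighted mean of the six task scores; recomputing it for Boldt-DC-1B from the row above:

```python
# Per-task scores for Boldt-DC-1B, copied from the table above.
scores = [31.06, 35.99, 57.30, 48.69, 42.80, 48.48]

avg = sum(scores) / len(scores)
print(round(avg, 2))  # 44.05, matching the Avg. column
```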
### 1.7B - 2B Weight Class (Larger Reference Models)
| Model | Tokens | MMLU | ARC-C | ARC-E | H-Swag | LAMBADA | OBQA | Avg. |
|---|---|---|---|---|---|---|---|---|
| EuroLLM-1.7B | 4T* | 31.04 | 31.58 | 54.68 | 45.30 | 44.52 | 50.50 | 42.94 |
| Qwen3-1.7B-Base | 36T* | 34.17 | 37.49 | 57.00 | 45.20 | 49.81 | 45.66 | 44.89 |
| BübleLM-2B | 2T* | 29.68 | 32.62 | 53.63 | 46.57 | 43.55 | 49.70 | 42.63 |
| Gemma-2-2B | 2T* | 33.99 | 37.11 | 57.47 | 49.62 | 52.64 | 48.89 | 46.62 |
| Gemma-4-E2B | N/A | 34.48 | 41.14 | 63.16 | 55.22 | 55.96 | 50.51 | 50.08 |
## Safety & Ethics
We have not conducted systematic evaluations of the model for toxicity, demographic biases, or harmful stereotypes. Quality filtering may reduce some risks relative to unfiltered web data, but it cannot guarantee their absence, and repeated exposure during multi-epoch training could amplify rather than mitigate encoded biases. Users should exercise caution and perform further evaluation before deploying the model in sensitive use cases.
## Citation

```bibtex
@misc{boldt2026,
  title={Repetition over Diversity: High-Signal Data Filtering for Sample-Efficient German Language Modeling},
  author={Ansar Aynetdinov and Patrick Haller and Alan Akbik},
  year={2026},
  eprint={2604.28075},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2604.28075},
}
```