Apertus EstLLM 8B 0326 Instruct

Apertus-EstLLM-8B-Instruct-0326 is obtained by applying the chat-vector merge approach to tartuNLP/Apertus-EstLLM-8B-Instruct-1125.
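
For readers unfamiliar with the term, the general idea behind a chat-vector merge can be sketched as follows: the parameter-wise difference between an instruction-tuned model and its pretrained base is added to another model with the same architecture. This is a toy illustration on plain Python lists, not the recipe used for this model. The card does not state which checkpoints supplied the difference, and the names `chat_vector_merge`, `base`, `instruct`, `target`, and `scale` are hypothetical.

```python
from typing import Dict, List

# Toy stand-in for a model state dict: parameter name -> flat list of weights.
Weights = Dict[str, List[float]]


def chat_vector_merge(
    base: Weights,
    instruct: Weights,
    target: Weights,
    scale: float = 1.0,
) -> Weights:
    """Return target + scale * (instruct - base), parameter by parameter.

    Assumption: all three models share the same architecture, so matching
    parameter names have matching shapes.
    """
    merged: Weights = {}
    for name, target_values in target.items():
        if name not in base or name not in instruct:
            # Parameters missing from either reference model are copied unchanged.
            merged[name] = list(target_values)
            continue
        merged[name] = [
            t + scale * (i - b)
            for t, i, b in zip(target_values, instruct[name], base[name])
        ]
    return merged


# Hypothetical toy weights, chosen only to make the arithmetic easy to follow.
base = {"layer.weight": [1.0, 2.0, 3.0]}
instruct = {"layer.weight": [1.5, 2.0, 2.0]}
target = {"layer.weight": [0.0, 4.0, 3.0]}

merged = chat_vector_merge(base, instruct, target)
print(merged)
```

With real checkpoints the same arithmetic would be applied to tensors rather than lists, and details such as which layers to skip are design choices the card does not describe.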

Use with transformers

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_name = "tartuNLP/Apertus-EstLLM-8B-Instruct-0326"

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    dtype="auto",
    device_map="auto"
)

# on Apple silicon, load the model as follows instead
# model = AutoModelForCausalLM.from_pretrained(
#     model_name,
#     dtype=torch.float16,
#     device_map="mps",
# )

tokenizer = AutoTokenizer.from_pretrained(model_name)

messages = [
    {"role": "user", "content": "Kas sa räägid eesti keelt?"}
]

text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)

model_inputs = tokenizer(text, return_tensors="pt").to(model.device)

generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=128,
    do_sample=True,
    temperature=0.4,
    # specify eos token to stop at the end of the assistant response
    eos_token_id=tokenizer.eos_token_id,
)

# generated_ids include the input tokens as well, so we only decode new tokens
response = tokenizer.decode(
    generated_ids[0][model_inputs["input_ids"].shape[1]:],
    skip_special_tokens=True,
)

print(response)

Evaluation

Logits-based

Scores for logits-based evaluation benchmarks are available on the EuroEval leaderboard.

Generative

Every benchmark in this category is treated as a generative problem: evaluation is performed on model responses generated at temperature 0, not on logits. Top scores are highlighted in bold, and second-best scores in bold italic. Rows are sorted in descending order of model parameter count, not score. Each dataset is evaluated on its test set unless noted otherwise.
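
To approximate this setting with the usage snippet above, sampling can be switched off. In transformers, deterministic (greedy) decoding is requested with `do_sample=False` rather than with a literal temperature of 0. A minimal sketch follows; the helper `to_greedy` is hypothetical, and the card does not say which evaluation harness or generation settings produced the reported scores.

```python
from typing import Any, Dict

# Arguments that only matter when sampling is enabled.
SAMPLING_ONLY_KEYS = ("temperature", "top_p", "top_k")


def to_greedy(generation_kwargs: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of the kwargs configured for greedy decoding."""
    greedy = {
        key: value
        for key, value in generation_kwargs.items()
        if key not in SAMPLING_ONLY_KEYS
    }
    greedy["do_sample"] = False
    return greedy


# Same settings as the sampling example in the usage section.
sampling_kwargs = {"max_new_tokens": 128, "do_sample": True, "temperature": 0.4}
greedy_kwargs = to_greedy(sampling_kwargs)
print(greedy_kwargs)
```

The resulting dictionary can be unpacked into `model.generate(**model_inputs, **greedy_kwargs)`.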

Note that all models are evaluated with the same prompt template for comparability, meaning that the scores do not necessarily represent each model's best possible performance. This is especially the case for deepseek-ai/DeepSeek-V3-0324 on some of the benchmarks.

Only models of comparable size are evaluated on benchmarks in English.

Instruction-following

Estonian

Instruction-level strict accuracy is reported for IFEval-et.

Model (# parameters ↓) IFEval-et
moonshotai/Kimi-K2-Instruct 0.7891
deepseek-ai/DeepSeek-V3.2 0.7221
deepseek-ai/DeepSeek-V3-0324 0.7171
mistralai/Mistral-Large-3-675B-Instruct-2512 0.7097
meta-llama/Llama-3.1-405B-Instruct 0.7159
meta-llama/Llama-3.3-70B-Instruct 0.7705
Qwen/Qwen2.5-72B-Instruct 0.7407
google/gemma-3-27b-it 0.7655
google/gemma-3-12b-it 0.7556
utter-project/EuroLLM-9B-Instruct-2512 0.5571
utter-project/EuroLLM-9B-Instruct 0.5397
mistralai/Ministral-3-8B-Instruct-2512 0.4888
tartuNLP/Apertus-EstLLM-8B-Instruct-0326 0.5608
tartuNLP/Apertus-EstLLM-8B-Instruct-1125 0.4665
swiss-ai/Apertus-8B-Instruct-2509 0.5484
meta-llama/Llama-3.1-8B-Instruct 0.3797
tartuNLP/Llama-3.1-EstLLM-8B-Instruct-1125 0.6141
tartuNLP/Llama-3.1-EstLLM-8B-Instruct-0825 0.5174
BSC-LT/salamandra-7b-instruct 0.5195
tartuNLP/Llammas 0.3524
Qwen/Qwen2.5-7B-Instruct 0.4988
CohereLabs/tiny-aya-global 0.6687

English

Instruction-level strict accuracy is reported for IFEval-en.

Model (# parameters ↓) IFEval-en
utter-project/EuroLLM-9B-Instruct-2512 0.7564
utter-project/EuroLLM-9B-Instruct 0.7004
mistralai/Ministral-3-8B-Instruct-2512 0.6845
tartuNLP/Apertus-EstLLM-8B-Instruct-0326 0.7089
tartuNLP/Apertus-EstLLM-8B-Instruct-1125 0.6638
swiss-ai/Apertus-8B-Instruct-2509 0.7808
meta-llama/Llama-3.1-8B-Instruct 0.8106
tartuNLP/Llama-3.1-EstLLM-8B-Instruct-1125 0.8173
tartuNLP/Llama-3.1-EstLLM-8B-Instruct-0825 0.7527
tartuNLP/Llammas 0.4373
BSC-LT/salamandra-7b-instruct 0.3289
Qwen/Qwen2.5-7B-Instruct 0.7954
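
Both IFEval tables above report instruction-level accuracy. As a rough illustration of what that means, the sketch below contrasts it with prompt-level accuracy: the instruction-level figure counts every verifiable instruction separately, while the prompt-level figure counts a prompt only if all of its instructions are followed. The per-instruction verdicts here are made-up inputs; in the strict setting they would come from checking the unmodified model response. The function names are hypothetical, and this is not the authors' evaluation code.

```python
from typing import List


def instruction_level_accuracy(results: List[List[bool]]) -> float:
    """Fraction of all instructions followed, pooled across prompts."""
    flat = [followed for prompt in results for followed in prompt]
    return sum(flat) / len(flat) if flat else 0.0


def prompt_level_accuracy(results: List[List[bool]]) -> float:
    """Fraction of prompts whose instructions were all followed."""
    if not results:
        return 0.0
    return sum(all(prompt) for prompt in results) / len(results)


# One inner list per prompt, one boolean per instruction in that prompt.
results = [
    [True, True],
    [True, False, False],
    [True],
]

instruction_acc = instruction_level_accuracy(results)
prompt_acc = prompt_level_accuracy(results)
print(f"instruction-level: {instruction_acc:.4f}, prompt-level: {prompt_acc:.4f}")
```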

Multiple Choice

All datasets except Winogrande-et are evaluated in 0-shot mode. Winogrande-et is evaluated in 3-shot mode. Exact match accuracy is reported for every dataset.
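
Exact match accuracy on generated answers can be sketched as below. The normalization step (trimming whitespace, dropping a trailing period, ignoring case) is an assumption made for illustration; the card does not describe how model responses are post-processed before comparison, and the names `normalize` and `exact_match_accuracy` are hypothetical.

```python
from typing import List


def normalize(answer: str) -> str:
    """Hypothetical normalization: trim whitespace and periods, ignore case."""
    return answer.strip().strip(".").upper()


def exact_match_accuracy(predictions: List[str], references: List[str]) -> float:
    """Fraction of predictions that equal their reference after normalization."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        return 0.0
    hits = sum(
        normalize(pred) == normalize(ref)
        for pred, ref in zip(predictions, references)
    )
    return hits / len(references)


# Made-up multiple-choice answers, for illustration only.
predictions = ["A", " b", "C.", "D"]
references = ["A", "B", "C", "A"]

accuracy = exact_match_accuracy(predictions, references)
print(accuracy)
```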

Estonian Language Competence

Model (# parameters ↓) Grammar-et Inflection-et Word-Meanings-et
moonshotai/Kimi-K2-Instruct 0.916 0.6458 0.9689
deepseek-ai/DeepSeek-V3.2 0.781 0.6891 0.8134
deepseek-ai/DeepSeek-V3-0324 0.364 0 0
mistralai/Mistral-Large-3-675B-Instruct-2512 0.796 0.8355 0.9488
meta-llama/Llama-3.1-405B-Instruct 0.818 0.9089 0.9438
meta-llama/Llama-3.3-70B-Instruct 0.797 0.6421 0.9408
Qwen/Qwen2.5-72B-Instruct 0.694 0.5208 0.9057
google/gemma-3-27b-it 0.817 0.5934 0.9529
google/gemma-3-12b-it 0.789 0.4227 0.9318
utter-project/EuroLLM-9B-Instruct-2512 0.644 0.4466 0.9288
utter-project/EuroLLM-9B-Instruct 0.764 0.367 0.9258
mistralai/Ministral-3-8B-Instruct-2512 0.562 0.4833 0.8395
tartuNLP/Apertus-EstLLM-8B-Instruct-0326 0.713 0.4326 0.9438
tartuNLP/Apertus-EstLLM-8B-Instruct-1125 0.646 0.421 0.9178
swiss-ai/Apertus-8B-Instruct-2509 0.512 0.3662 0.9027
meta-llama/Llama-3.1-8B-Instruct 0.657 0.4165 0.8335
tartuNLP/Llama-3.1-EstLLM-8B-Instruct-1125 0.8310 0.5777 0.9619
tartuNLP/Llama-3.1-EstLLM-8B-Instruct-0825 0.692 0.5188 0.9569
BSC-LT/salamandra-7b-instruct 0.594 0.2668 0.8084
Qwen/Qwen2.5-7B-Instruct 0.598 0.4136 0.7984
tartuNLP/Llammas 0.529 0.2289 0.5326
CohereLabs/tiny-aya-global 0.563 0.3221 0.8455

Knowledge and Reasoning (Estonian)

Model (# parameters ↓) Winogrande-et Trivia-et Exam-et GlobalPIQA-et TruthfulQA-et
moonshotai/Kimi-K2-Instruct 0.8138 0.4225 0.8414 0.79 0.7136
deepseek-ai/DeepSeek-V3.2 0.4805 0.38 0.614 0.7 0.5863
deepseek-ai/DeepSeek-V3-0324 0.8042 0.27 0.1221 0.04 0.2093
mistralai/Mistral-Large-3-675B-Instruct-2512 0.7487 0.4275 0.7931 0.73 0.6854
meta-llama/Llama-3.1-405B-Instruct 0.7878 0.4713 0.8309 0.58 0.7001
meta-llama/Llama-3.3-70B-Instruct 0.7397 0.3875 0.7652 0.58 0.6255
Qwen/Qwen2.5-72B-Instruct 0.7227 0.315 0.7162 0.65 0.6683
google/gemma-3-27b-it 0.7510 0.325 0.7751 0.71 0.5814
google/gemma-3-12b-it 0.6712 0.3237 0.7069 0.54 0.3158
utter-project/EuroLLM-9B-Instruct-2512 0.5195 0.375 0.6097 0.52 0.399
utter-project/EuroLLM-9B-Instruct 0.5846 0.3738 0.5589 0.55 0.2889
mistralai/Ministral-3-8B-Instruct-2512 0.5812 0.3125 0.5012 0.48 0.3525
tartuNLP/Apertus-EstLLM-8B-Instruct-0326 0.5976 0.35 0.6022 0.64 0.4296
tartuNLP/Apertus-EstLLM-8B-Instruct-1125 0.5467 0.3575 0.5651 0.63 0.3696
swiss-ai/Apertus-8B-Instruct-2509 0.5105 0.345 0.552 0.59 0.366
meta-llama/Llama-3.1-8B-Instruct 0.5399 0.2888 0.5 0.54 0.437
tartuNLP/Llama-3.1-EstLLM-8B-Instruct-1125 0.6440 0.4288 0.6332 0.68 0.3794
tartuNLP/Llama-3.1-EstLLM-8B-Instruct-0825 0.5812 0.425 0.5093 0.63 0.3525
BSC-LT/salamandra-7b-instruct 0.2878 0.2875 0.3556 0.55 0.3011
Qwen/Qwen2.5-7B-Instruct 0.5473 0.2938 0.4913 0.57 0.4113
tartuNLP/Llammas 0.5037 0.2838 0.3649 0.01 0.2032
CohereLabs/tiny-aya-global 0.5603 0.31 0.5638 0.52 0.3782

Knowledge and Reasoning (English)

Model (# parameters ↓) Winogrande GlobalPIQA-en TruthfulQA MMLU-Redux GSM8K
utter-project/EuroLLM-9B-Instruct-2512 0.5546 0.58 0.4614 0.6334 0.4139
utter-project/EuroLLM-9B-Instruct 0.5059 0.58 0.2962 0.5741 0.5944
mistralai/Ministral-3-8B-Instruct-2512 0.6503 0.77 0.519 0.7418 0.3927
tartuNLP/Apertus-EstLLM-8B-Instruct-0326 0.5699 0.69 0.4174 0.5946 0.5588
tartuNLP/Apertus-EstLLM-8B-Instruct-1125 0.5348 0.56 0.3647 0.5944 0.5277
swiss-ai/Apertus-8B-Instruct-2509 0.5133 0.73 0.3831 0.6099 0.5936
meta-llama/Llama-3.1-8B-Instruct 0.5625 0.76 0.5239 0.6959 0.7710
tartuNLP/Llama-3.1-EstLLM-8B-Instruct-1125 0.6118 0.76 0.3635 0.6606 0.7726
tartuNLP/Llama-3.1-EstLLM-8B-Instruct-0825 0.6084 0.71 0.366 0.6388 0.7202
tartuNLP/Llammas 0.498 0 0.1971 0.3417 0.1456
BSC-LT/salamandra-7b-instruct 0.4029 0.63 0.2717 0.5180 0.0076
Qwen/Qwen2.5-7B-Instruct 0.6627 0.83 0.5875 0.7555 0.7862

Translation

English to Estonian

Model wmt24pp (BLEU ↑)
BSC-LT/salamandraTA-7b-instruct 0.2713
tartuNLP/Apertus-EstLLM-8B-Instruct-0326 0.2676
tartuNLP/Llama-3.1-EstLLM-8B-Instruct-1125 0.2635
tartuNLP/Llama-3.1-EstLLM-8B-Instruct-0825 0.264
tartuNLP/Apertus-EstLLM-8B-Instruct-1125 0.2609
utter-project/EuroLLM-9B-Instruct 0.2602
utter-project/EuroLLM-9B-Instruct-2512 0.2567
swiss-ai/Apertus-8B-Instruct-2509 0.2372
tartuNLP/Llammas 0.1472
meta-llama/Llama-3.1-8B-Instruct 0.1406
BSC-LT/salamandra-7b-instruct 0.1201
Qwen/Qwen2.5-7B-Instruct 0.0476

Citation

@misc{dorkin2026estllmenhancingestoniancapabilities,
      title={{EstLLM: Enhancing Estonian Capabilities in Multilingual LLMs via Continued Pretraining and Post-Training}}, 
      author={Aleksei Dorkin and Taido Purason and Emil Kalbaliyev and Hele-Andra Kuulmets and Marii Ojastu and Mark Fišel and Tanel Alumäe and Eleri Aedmaa and Krister Kruusmaa and Kairit Sirts},
      year={2026},
      eprint={2603.02041},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2603.02041}, 
}