---
language:
- sr
- hr
- bs
- cnr
license: apache-2.0
library_name: transformers
tags:
- modernbert
- encoder
- bcms
- serbian
- croatian
- bosnian
- montenegrin
- south-slavic
- fill-mask
- long-context
pipeline_tag: fill-mask
datasets:
- HPLT/HPLT3.0
- HuggingFaceFW/fineweb-2
- HuggingFaceFW/finepdfs
- classla/xlm-r-bertic-data
metrics:
- accuracy
- f1
model-index:
- name: ModernBERTić-base
  results:
  - task:
      type: text-classification
      name: SuperGLUE-SR (BalkanBench v1.0)
    dataset:
      type: balkanbench/superglue-sr
      name: SuperGLUE-SR
    metrics:
    - type: average
      value: 69.73
      name: Average (6 tasks, 5 seeds)
    - type: accuracy
      value: 76.02
      name: BoolQ
    - type: f1_macro
      value: 76.96
      name: CB
    - type: accuracy
      value: 65.76
      name: COPA
    - type: accuracy
      value: 65.82
      name: RTE
    - type: f1_a
      value: 66.90
      name: MultiRC
    - type: accuracy
      value: 64.11
      name: WSC
---
<p align="center">
<img src="modernbertic.png" alt="ModernBERTić - the first modern long-context encoder for BCMS" width="100%"/>
</p>
# ModernBERTić-base
**A modern-architecture encoder for Bosnian, Croatian, Montenegrin, and Serbian (BCMS).** 149M parameters, native 8192-token context, FlashAttention 2.
> **For best downstream task performance, use the [large variant](https://huggingface.co/permitt/galton-modernbertic-large)** (395M, SOTA on SuperGLUE-SR). This base model is intended for fast inference, retrieval encoders where latency matters, and as a starting point for further pretraining or domain adaptation.
## TL;DR
| | |
|---|---|
| **Architecture** | ModernBERT-base (22 layers, 768 hidden, 12 heads) |
| **Parameters** | 149M |
| **Context length** | 8192 tokens (RoPE base 160K) |
| **Attention** | Sliding window 128 + global every 3rd layer, FlashAttention 2 |
| **Tokenizer** | BPE, 50,304 vocab, Latin-only, cased (shared with `galton-modernbertic-large`) |
| **Pretraining tokens** | 60B BCMS tokens, 22 sources |
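
The attention row can be read as a per-layer pattern. A minimal sketch, assuming this checkpoint follows the upstream ModernBERT convention in which layer `i` uses global attention when `i % 3 == 0` and a 128-token sliding window otherwise. The indexing convention and the helper name `attention_pattern` are assumptions for illustration, not values read from this model's config:

```python
def attention_pattern(num_layers: int = 22, global_every: int = 3) -> list[str]:
    """Label each encoder layer as 'global' or 'local' (sliding window)."""
    return [
        "global" if layer % global_every == 0 else "local"
        for layer in range(num_layers)
    ]

pattern = attention_pattern()
global_layers = [i for i, kind in enumerate(pattern) if kind == "global"]
print(global_layers)            # [0, 3, 6, 9, 12, 15, 18, 21]
print(pattern.count("local"))   # 14
```

Under that assumption, 8 of the 22 layers attend over the full sequence and the remaining 14 use the local window.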
## Honest performance note
On SuperGLUE-SR ([BalkanBench v1.0](https://balkanbench.com/leaderboard)), ModernBERTić-base scores **69.73 average**, which is **below BERTić's 71.46**, even though BERTić has the smaller parameter count. This is consistent with what the literature predicts:
> Masked Language Modeling at 30% masking gives a training signal on 30% of tokens. ELECTRA-style replaced-token detection (BERTić) gives a signal on 100% of tokens. At small capacities, that supervision deficit is not absorbed by architectural improvements. At larger capacities (see [`galton-modernbertic-large`](https://huggingface.co/permitt/galton-modernbertic-large)), it is.
We are publishing the base model anyway because:
1. **Inference is much faster.** ~2-3× the throughput of the large variant on identical hardware, useful for high-volume retrieval encoders, candidate filtering, and embedding workloads.
2. **It is the right starting point for further pretraining.** If you are domain-adapting to legal, medical, or other specialized BCMS text, the base scale is the correct continued-pretraining target.
3. **It is honest baseline material** for the encoder-comparison community working on BCMS.
If you want SOTA on a downstream classification or reasoning task, **use the [large variant](https://huggingface.co/permitt/galton-modernbertic-large)**. If you want a small, fast, modern-architecture encoder for embedding work or further pretraining, this one is for you.
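
The supervision gap described in the note above is simple arithmetic. A small illustration for one full-length sequence, assuming every position is an ordinary token (no padding or special tokens) and that replaced-token detection scores every position. The function name is illustrative only:

```python
def supervised_positions(seq_len: int, signal_fraction: float) -> int:
    """Number of positions that contribute to the loss in one sequence."""
    return round(seq_len * signal_fraction)

seq_len = 8192                                   # native context length
mlm = supervised_positions(seq_len, 0.30)        # MLM at 30% masking
rtd = supervised_positions(seq_len, 1.00)        # ELECTRA-style replaced-token detection
print(mlm, rtd)                # 2458 8192
print(f"{rtd / mlm:.2f}x")     # 3.33x
```

Per sequence, the ELECTRA-style objective sees roughly 3.3 times as many supervised positions as 30% MLM.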
## Results: SuperGLUE Serbian edition
Evaluation from [BalkanBench v1.0](https://balkanbench.com/leaderboard). Each score is the mean over 5 random seeds; standard deviations are shown in the leaderboard UI.
Live, sortable leaderboard with all 9 evaluated models, per-task standard deviations, and reproducibility info: **[balkanbench.com/leaderboard](https://balkanbench.com/leaderboard)**.
## Quickstart
```python
from transformers import AutoTokenizer, AutoModelForMaskedLM
import torch

model_id = "permitt/galton-modernbertic-base"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForMaskedLM.from_pretrained(
    model_id,
    # Requires the flash-attn package; pass "sdpa" instead if it is not installed.
    attn_implementation="flash_attention_2",
    torch_dtype=torch.bfloat16,
).to("cuda")

text = "Glavni grad Crne Gore je [MASK]."
inputs = tokenizer(text, return_tensors="pt").to("cuda")

with torch.no_grad():
    logits = model(**inputs).logits

# Column index of the mask token in the single input sequence
mask_idx = (inputs.input_ids == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]
predicted = tokenizer.decode(logits[0, mask_idx].argmax(dim=-1))
print(predicted)  # "Podgorica"
```
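
If the `flash-attn` package may be missing on some machines, one option is to pick the attention backend explicitly before loading the model. A minimal sketch using only the standard library; the helper name `pick_attn_implementation` is hypothetical, and the check only looks for the package, not for a GPU that supports it:

```python
import importlib.util

def pick_attn_implementation() -> str:
    """Return "flash_attention_2" if flash_attn is importable, else "sdpa"."""
    has_flash_attn = importlib.util.find_spec("flash_attn") is not None
    return "flash_attention_2" if has_flash_attn else "sdpa"

attn_implementation = pick_attn_implementation()
print(attn_implementation)
```

The returned string can be passed as `attn_implementation=` to `from_pretrained`.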
### Fine-tuning
```python
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(
    "permitt/galton-modernbertic-base",
    num_labels=3,
    attn_implementation="flash_attention_2",
)
# standard HF Trainer flow from here
```
**Recommended hyperparameters:**
| Task type | Learning rate | Batch size | Epochs |
|-----------|---------------|------------|--------|
| Sequence classification | 3e-5 to 7e-5 | 16-32 | 3-5 |
| Token classification (NER, POS) | 5e-5 | 32 | 5-10 |
The base model prefers larger learning rates than the large variant, plausibly because the loss landscape is less sharply curved at smaller scale. A grid tuned on this base model does not transfer to ModernBERTić-large.
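
The sequence-classification row of the table can be turned into a small search grid. A sketch, assuming three learning rates, two batch sizes, and three epoch counts; only the end points come from the table, the intermediate values and the dictionary keys are illustrative choices:

```python
from itertools import product

# End points come from the table above; intermediate values are assumptions.
learning_rates = [3e-5, 5e-5, 7e-5]
batch_sizes = [16, 32]
epoch_counts = [3, 4, 5]

grid = [
    {"learning_rate": lr, "batch_size": bs, "num_epochs": ep}
    for lr, bs, ep in product(learning_rates, batch_sizes, epoch_counts)
]
print(len(grid))   # 18
print(grid[0])     # {'learning_rate': 3e-05, 'batch_size': 16, 'num_epochs': 3}
```

Each entry can then be mapped onto whatever training arguments your fine-tuning script expects.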
### Continued pretraining
The base model is a reasonable starting point for domain adaptation. Use the same MLM objective at 30% masking, a peak LR of about 1e-4 (roughly an order of magnitude below the 8e-4 pretraining peak), and warmup over the first 10% of your continued-pretraining tokens. Expect to need roughly 1-5B in-domain tokens, depending on how far your target domain is from web, news, and encyclopedic text.
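
The warmup recommendation can be sketched as a token-based schedule. This is a minimal illustration, not the training code: the linear warmup over the first 10% of tokens and the 1e-4 peak come from the paragraph above, while the constant plateau and the linear decay over the final 10% are assumptions, as are the function name and the 2B-token budget in the example:

```python
def continued_pretraining_lr(
    tokens_seen: float,
    total_tokens: float,
    peak_lr: float = 1e-4,
    warmup_frac: float = 0.10,
    decay_frac: float = 0.10,  # assumption: not specified in this card
) -> float:
    """Linear warmup, constant plateau, then linear decay to zero."""
    progress = min(max(tokens_seen / total_tokens, 0.0), 1.0)
    if progress < warmup_frac:
        return peak_lr * progress / warmup_frac
    decay_start = 1.0 - decay_frac
    if progress <= decay_start:
        return peak_lr
    return peak_lr * (1.0 - progress) / decay_frac

total = 2e9  # hypothetical 2B-token in-domain corpus
for seen in (0.0, 1e8, 2e8, 1e9, 1.9e9, 2e9):
    lr = continued_pretraining_lr(seen, total)
    print(f"{seen:.1e} tokens -> lr {lr:.2e}")
```

Driving the schedule by tokens seen rather than optimizer steps keeps it independent of batch size and sequence length.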
## Tokenizer
Identical to the [large variant](https://huggingface.co/permitt/galton-modernbertic-large): BPE, 50,304 vocab, Latin-only, cased. **Tokens per character: 0.229** on held-out BCMS text, 31% lower than mmBERT's multilingual SentencePiece. Cyrillic input should be transliterated upstream. Cased input is preferred (uncased reduces tokenizer efficiency by ~14%).
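
To put the tokens-per-character figure in context, here is a short sketch that measures the ratio for any tokenizer and converts the reported 0.229 into an approximate character budget for the 8192-token window. The helper name is hypothetical, and the whitespace splitter is only a stand-in so the example runs without downloading the model; with the real tokenizer you would pass something like `tokenizer.tokenize`:

```python
def tokens_per_char(texts, tokenize) -> float:
    """Total tokens divided by total characters over a list of strings."""
    total_tokens = sum(len(tokenize(t)) for t in texts)
    total_chars = sum(len(t) for t in texts)
    return total_tokens / total_chars

# Stand-in tokenizer (whitespace split), for illustration only.
sample = ["Glavni grad Crne Gore je Podgorica."]
print(tokens_per_char(sample, str.split))

# Back-of-the-envelope: characters that fit in the native context window.
ratio = 0.229      # reported tokens per character
context = 8192     # native context length in tokens
print(round(context / ratio))   # 35773
```

At the reported ratio, a full context window holds roughly 36 thousand characters of Latin-script BCMS text.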
## Pretraining
Identical recipe to the large variant, scaled down to base configuration:
- **Corpus:** 60B tokens, 227M documents, 22 BCMS sources, tiered priority, MinHash LSH cross-source deduplication.
- **Objective:** Masked Language Modeling, 30% masking ratio.
- **Optimizer:** AdamW, peak LR 8e-4 (higher than for the large variant because the model is smaller), warmup-stable-decay schedule.
- **Batch:** 4096 sequences global.
- **Precision:** bfloat16.
- **Framework:** MosaicML Composer + FlexBERT.
See the [large variant card](https://huggingface.co/permitt/galton-modernbertic-large) for the detailed pipeline write-up.
## Intended uses and limitations
**Intended uses.**
- Fine-tuning starting point where inference latency matters more than peak accuracy.
- Continued pretraining for domain-adapted BCMS encoders (legal, medical, technical).
- Embedding model fine-tunes where 149M is the right size for the latency budget.
- Token classification tasks (NER, POS) where the base scale is sufficient.
**Out of scope.**
- High-stakes downstream classification or reasoning tasks where you want SOTA. Use [`galton-modernbertic-large`](https://huggingface.co/permitt/galton-modernbertic-large) instead.
- Generative tasks. This is an encoder, not a generative model. For text generation in BCMS, see the [national LLM initiative announced April 2026](https://www.srbija.gov.rs/) or general-purpose multilingual LLMs.
- Languages outside BCMS. The tokenizer is Latin-only and the corpus is BCMS-only.
**Limitations.**
- **Latin script only.** Cyrillic input should be transliterated before tokenization. Raw Cyrillic falls back to byte-level encoding and burns context for no signal.
- **Domain skew.** Training data is heavy on web text, news, encyclopedic content, and PDFs (academic + literary). Code-heavy text, conversational chat, and highly technical scientific text are underrepresented.
- **Variants.** All four BCMS varieties (Bosnian, Croatian, Montenegrin, Serbian) are represented, but Croatian and Serbian dominate the corpus volume. Montenegrin in particular is upsampled 4× during mixing to compensate.
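
Because the tokenizer is Latin-only, Cyrillic input needs a transliteration step before tokenization. A minimal sketch using the standard Serbian Cyrillic-to-Latin mapping; the function name is hypothetical, this is not necessarily the preprocessing used for the training corpus, and the Montenegrin letters written with combining accents (ś, ź) are not handled:

```python
_CYR_TO_LAT = {
    "а": "a", "б": "b", "в": "v", "г": "g", "д": "d", "ђ": "đ", "е": "e",
    "ж": "ž", "з": "z", "и": "i", "ј": "j", "к": "k", "л": "l", "љ": "lj",
    "м": "m", "н": "n", "њ": "nj", "о": "o", "п": "p", "р": "r", "с": "s",
    "т": "t", "ћ": "ć", "у": "u", "ф": "f", "х": "h", "ц": "c", "ч": "č",
    "џ": "dž", "ш": "š",
}

def cyrillic_to_latin(text: str) -> str:
    """Transliterate Serbian Cyrillic to Latin; other characters pass through."""
    out = []
    for i, ch in enumerate(text):
        latin = _CYR_TO_LAT.get(ch.lower())
        if latin is None:
            out.append(ch)  # Latin letters, digits, punctuation, etc.
        elif ch.islower():
            out.append(latin)
        else:
            # Uppercase digraphs: "LJ" inside an all-caps word, "Lj" otherwise.
            nxt = text[i + 1] if i + 1 < len(text) else ""
            prev = text[i - 1] if i > 0 else ""
            all_caps = nxt.isupper() or (not nxt.isalpha() and prev.isupper())
            out.append(latin.upper() if all_caps else latin.capitalize())
    return "".join(out)

print(cyrillic_to_latin("Главни град Црне Горе је Подгорица."))
# Glavni grad Crne Gore je Podgorica.
```

Running this before the tokenizer avoids the byte-level fallback described in the limitations above.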
## Citation
```bibtex
@misc{perovic2026modernbertic,
title = {{ModernBERTić}: A Modern Encoder for {BCMS} Languages},
author = {Perovic, Mitar},
year = {2026},
url = {https://huggingface.co/permitt/galton-modernbertic-base},
note = {Recrewty, EU-funded grant}
}
```
## Acknowledgments
This work was developed at **[Recrewty](https://recrewty.com)** as part of an EU-funded grant. Compute on Leonardo HPC was provided through the consortium grant.
Standing on the shoulders of:
- **Nikola Ljubešić** and the **CLASSLA** team for BERTić, BENCHić, and the broader BCMS NLP infrastructure that made this work possible.
- The **ModernBERT** team (Warner et al., 2024) for the architecture and the FlexBERT codebase.
- **MosaicML / Databricks** for Composer and the MDS streaming format.
- **HuggingFace** for the model hub, datasets, and `tokenizers` library.
- **JeRTeh**, **ReLDI**, and the broader Serbian NLP community for datasets and evaluation resources.
- **EuroHPC** and the **Leonardo** consortium for compute access.
## See also
- [`permitt/galton-modernbertic-large`](https://huggingface.co/permitt/galton-modernbertic-large) - 395M parameter variant, SOTA on SuperGLUE-SR
- [BalkanBench leaderboard](https://balkanbench.com/leaderboard) - live evaluation across BCMS encoders
- [Build-in-public series on LinkedIn](https://linkedin.com/in/perovicmitar) - posts #0-#9 covering training data, tokenizer, distributed training, debugging, and results
- [Medium release post](https://medium.com/@permitt/training-the-first-modern-architecture-encoder-for-south-slavic-languages-e2a11a4ead31) - long-form write-up of the model, the data pipeline, and lessons on data quality vs data quantity (link active at release)
- [All links in one place](https://permitt.io) - a single entry point for the LinkedIn material