---
language:
- sr
- hr
- bs
- cnr
license: apache-2.0
library_name: transformers
tags:
- modernbert
- encoder
- bcms
- serbian
- croatian
- bosnian
- montenegrin
- south-slavic
- fill-mask
- long-context
pipeline_tag: fill-mask
datasets:
- HPLT/HPLT3.0
- HuggingFaceFW/fineweb-2
- HuggingFaceFW/finepdfs
- classla/xlm-r-bertic-data
metrics:
- accuracy
- f1
model-index:
- name: ModernBERTić-large
  results:
  - task:
      type: text-classification
      name: SuperGLUE-SR (BalkanBench v1.0)
    dataset:
      type: balkanbench/superglue-sr
      name: SuperGLUE-SR
    metrics:
    - type: average
      value: 73.44
      name: Average (6 tasks, 5 seeds)
    - type: accuracy
      value: 80.70
      name: BoolQ
    - type: f1_macro
      value: 78.52
      name: CB
    - type: accuracy
      value: 76.84
      name: COPA
    - type: accuracy
      value: 73.13
      name: RTE
    - type: f1_a
      value: 67.90
      name: MultiRC
    - type: accuracy
      value: 63.56
      name: WSC
---
<p align="center">
<img src="modernbertic.png" alt="ModernBERTić - the first modern long-context encoder for BCMS" width="100%"/>
</p>
# ModernBERTić-large
**The first modern-architecture encoder for Bosnian, Croatian, Montenegrin, and Serbian (BCMS).** 395M parameters, native 8192-token context, FlashAttention 2.
State of the art on SuperGLUE-SR. Live leaderboard: [balkanbench.com/leaderboard](https://balkanbench.com/leaderboard).
> **Looking for the smaller variant?** See [`permitt/galton-modernbertic-base`](https://huggingface.co/permitt/galton-modernbertic-base) (149M params).
## TL;DR
| | |
|---|---|
| **Architecture** | ModernBERT-large (28 layers, 1024 hidden, 16 heads) |
| **Parameters** | 395M |
| **Context length** | 8192 tokens (RoPE base 160K) |
| **Attention** | Sliding window 256 + global every 2nd layer, FlashAttention 2 |
| **Tokenizer** | BPE, 50,304 vocab, Latin-only, cased |
| **Pretraining tokens** | 66B BCMS tokens, 22 sources |
| **Compute** | 64× A100-64GB on Leonardo HPC, ~10h wall clock |
## Why this model exists
The de facto encoder for BCMS has been [`classla/bcms-bertic`](https://huggingface.co/classla/bcms-bertic) since 2021: 110M parameters, 512-token context, ELECTRA. Excellent within its envelope. Insufficient for production tasks that require long-document understanding (CV parsing, legal documents, retrieval over knowledge bases).
ModernBERTić ports the [ModernBERT](https://huggingface.co/blog/modernbert) recipe to BCMS:
- **8K native context** instead of 512, via RoPE
- **FlashAttention 2 + unpadding** for ~3.5× faster inference on identical hardware
- **Alternating attention** (sliding window 256 + full attention every 2nd layer), so the windowed layers cost O(n) on long inputs
- **Latin-only BCMS-native tokenizer** that needs 31% fewer tokens per character than mmBERT's multilingual SentencePiece tokenizer (0.229 vs. 0.334)
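
To make the attention-cost argument concrete, the sketch below counts query-key score computations per sequence under the alternating pattern described above and compares them with full attention in every layer. It is a rough estimate, not a benchmark: it assumes the 28 layers split evenly between windowed and full-attention layers, and it ignores heads, hidden size, and constant factors. The function name `attention_pairs` is illustrative.

```python
def attention_pairs(seq_len: int, num_layers: int = 28, window: int = 256,
                    global_every: int = 2) -> tuple[int, int]:
    """Return (alternating, all_full) query-key pair counts for one sequence."""
    full_layer = seq_len * seq_len                 # every token attends to every token
    local_layer = seq_len * min(window, seq_len)   # every token attends to at most `window` tokens
    num_global = num_layers // global_every        # assumption: an even split by simple division
    num_local = num_layers - num_global
    alternating = num_global * full_layer + num_local * local_layer
    all_full = num_layers * full_layer
    return alternating, all_full


for n in (512, 2048, 8192):
    alt, full = attention_pairs(n)
    print(f"{n:>5} tokens: {alt / full:.2f} of the full-attention cost")
```

Under these assumptions the windowed layers grow linearly with sequence length, while the full-attention layers remain quadratic, so the relative saving grows with input length.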
## Results: SuperGLUE Serbian edition
Evaluation from [BalkanBench v1.0](https://balkanbench.com/leaderboard). Each cell is the mean over 5 random seeds, as reported on the website; standard deviations are shown in the leaderboard UI.
Live, sortable leaderboard with all 9 evaluated models, per-task standard deviations, and reproducibility info: **[balkanbench.com/leaderboard](https://balkanbench.com/leaderboard)**.
## Quickstart
```python
import importlib.util

import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

model_id = "permitt/galton-modernbertic-large"
device = "cuda" if torch.cuda.is_available() else "cpu"
# FlashAttention 2 needs the flash-attn package and a CUDA GPU; otherwise request PyTorch SDPA.
use_fa2 = device == "cuda" and importlib.util.find_spec("flash_attn") is not None

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForMaskedLM.from_pretrained(
    model_id,
    attn_implementation="flash_attention_2" if use_fa2 else "sdpa",
    torch_dtype=torch.bfloat16,
).to(device)

text = f"Glavni grad Crne Gore je {tokenizer.mask_token}."
inputs = tokenizer(text, return_tensors="pt").to(device)
with torch.no_grad():
    logits = model(**inputs).logits

# Position of the mask token in the single input sequence
mask_idx = (inputs.input_ids == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]
predicted = tokenizer.decode(logits[0, mask_idx].argmax(dim=-1))
print(predicted)
```
### Long context
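```python
# 8192 tokens supported natively, no positional interpolation needed
very_long_document = "..."  # placeholder: replace with your own long BCMS text
inputs = tokenizer(
    very_long_document,
    return_tensors="pt",
    truncation=True,
    max_length=8192,  # anything beyond 8192 tokens is cut off
).to(model.device)
```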
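
Documents longer than 8192 tokens are truncated by the call above. One common alternative is to split the token ids into overlapping windows and run the model on each window. The helper below is a minimal sketch of that idea; `chunk_token_ids` and its default `overlap` are hypothetical and not part of this model's tooling. It works on a plain list of ids, for example from `tokenizer(text, add_special_tokens=False)["input_ids"]`.

```python
def chunk_token_ids(token_ids: list[int], max_len: int = 8192,
                    overlap: int = 256) -> list[list[int]]:
    """Split token ids into windows of at most `max_len`, sharing `overlap` ids between neighbours."""
    if not 0 <= overlap < max_len:
        raise ValueError("overlap must be non-negative and smaller than max_len")
    step = max_len - overlap
    chunks = []
    for start in range(0, len(token_ids), step):
        chunks.append(token_ids[start:start + max_len])
        if start + max_len >= len(token_ids):
            break  # this window already reaches the end of the document
    return chunks


# Toy example with small numbers so the windows are easy to inspect
for chunk in chunk_token_ids(list(range(20)), max_len=8, overlap=2):
    print(chunk)
```

If special tokens are added to each window afterwards, `max_len` has to be reduced accordingly. How per-window predictions are combined depends on the task and is left out here.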
### Fine-tuning
```python
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(
    "permitt/galton-modernbertic-large",
    num_labels=3,  # set to the number of classes in your task
    attn_implementation="flash_attention_2",  # requires the flash-attn package; use "sdpa" otherwise
)
# standard HF Trainer flow from here
```
**Recommended hyperparameters** (from our SuperGLUE-SR sweeps):
| Task type | Learning rate | Batch size | Epochs |
|-----------|---------------|------------|--------|
| Sequence classification | 2e-5 to 5e-5 | 16-32 | 3-5 |
| Token classification (NER, POS) | 3e-5 | 32 | 5-10 |
| Long-context tasks (>512 tok) | 1e-5 to 3e-5 | 8-16 | 3-5 |
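
As an illustration of how these ranges could be turned into a small sweep, the sketch below enumerates a grid for the sequence-classification row. The intermediate values (3e-5 and 4 epochs) and the seed values are assumptions chosen for the example, not the grid used for the reported results.

```python
from itertools import product

# Endpoints copied from the sequence-classification row of the table above.
learning_rates = [2e-5, 3e-5, 5e-5]   # assumption: 3e-5 added as an intermediate point
batch_sizes = [16, 32]
epoch_counts = [3, 4, 5]              # assumption: 4 added as an intermediate point
seeds = [0, 1, 2, 3, 4]               # assumption: arbitrary seed values, five in total

grid = [
    {"learning_rate": lr, "batch_size": bs, "num_epochs": ep}
    for lr, bs, ep in product(learning_rates, batch_sizes, epoch_counts)
]
print(len(grid), "configurations,", len(grid) * len(seeds), "runs with all seeds")
```

Each dictionary can then be passed to whatever training entry point you use, for instance as keyword arguments when building the trainer configuration.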
## Tokenizer
| | Vocab | Tokens / character | OOV rate |
|---|-------|--------------------|---------|
| **ModernBERTić** | 50,304 | **0.229** | 0.000% |
| BERTić | 32,000 | 0.242 | 0.006% |
| XLM-R-BERTić | 250,002 | 0.274 | 0.008% |
| mmBERT | 256,000 | 0.334 | 0.000% |
*Measured on 55.8M characters of held-out BCMS text.*
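
The relative savings follow directly from the tokens-per-character column. The short calculation below, using only the numbers in the table, shows how many fewer tokens ModernBERTić needs than each baseline and roughly how many characters fit into one 8192-token context on average.

```python
# Tokens per character, copied from the table above.
tokens_per_char = {
    "BERTić": 0.242,
    "XLM-R-BERTić": 0.274,
    "mmBERT": 0.334,
}
ours = 0.229            # ModernBERTić
context_tokens = 8192

for name, rate in tokens_per_char.items():
    saving = (rate - ours) / rate * 100   # % fewer tokens than this baseline
    print(f"vs {name:<13} {saving:4.1f}% fewer tokens")

chars_per_context = context_tokens / ours   # average characters per full context
print(f"~{chars_per_context:,.0f} characters per 8192-token context")
```

These are corpus-level averages from the held-out sample; individual documents will vary.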
The vocabulary is Latin-only, so Cyrillic input should be transliterated upstream. Cased input is preferred (lowercasing reduces tokenizer efficiency by ~14%).
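
A minimal sketch of such an upstream transliteration step for Serbian Cyrillic is shown below. It is a plain character mapping written for this example, not a utility shipped with the model, and the name `cyrillic_to_latin` is hypothetical. It does not handle all-caps digraphs (for example, "Њ" in a fully capitalized word becomes "Nj" rather than "NJ") or the additional Montenegrin letters.

```python
# Serbian Cyrillic -> Latin. Three letters (љ, њ, џ) map to two Latin characters.
_LOWER = {
    "а": "a", "б": "b", "в": "v", "г": "g", "д": "d", "ђ": "đ", "е": "e",
    "ж": "ž", "з": "z", "и": "i", "ј": "j", "к": "k", "л": "l", "љ": "lj",
    "м": "m", "н": "n", "њ": "nj", "о": "o", "п": "p", "р": "r", "с": "s",
    "т": "t", "ћ": "ć", "у": "u", "ф": "f", "х": "h", "ц": "c", "ч": "č",
    "џ": "dž", "ш": "š",
}
# Uppercase letters become title-cased Latin ("Њ" -> "Nj"), which suits mixed-case text.
_UPPER = {cyr.upper(): lat.capitalize() for cyr, lat in _LOWER.items()}
_TABLE = str.maketrans({**_LOWER, **_UPPER})


def cyrillic_to_latin(text: str) -> str:
    """Transliterate Serbian Cyrillic to Latin; all other characters pass through unchanged."""
    return text.translate(_TABLE)


print(cyrillic_to_latin("Главни град Црне Горе је Подгорица."))
# Glavni grad Crne Gore je Podgorica.
```

Running this before tokenization keeps Cyrillic text out of the byte-level fallback path.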
## Pretraining
- **Corpus:** 66B tokens, 227M documents, assembled from 22 BCMS sources (FineWiki, BERTić-MaCoCu, FineWeb-2, HPLT 3.0, FinePDFs, CLASSLA web, books, news, plus others). Tiered priority, BCMS-specific quality filters (gambling/content-farm/stop-word heuristics), MinHash LSH cross-source deduplication at 0.8 Jaccard threshold.
- **Objective:** Masked Language Modeling, 30% masking ratio.
- **Optimizer:** AdamW, peak LR 5e-4, warmup-stable-decay schedule with ~9% decay phase.
- **Batch:** 4096 sequences global, kept constant across GPU counts (strong scaling).
- **Precision:** bfloat16.
- **Framework:** [MosaicML Composer](https://github.com/mosaicml/composer) + [FlexBERT](https://github.com/AnswerDotAI/ModernBERT). MDS streaming dataset format with deterministic resume across the 24-hour Leonardo job limit.
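
The warmup-stable-decay schedule mentioned above can be sketched as a simple function of the training step. The peak learning rate and the ~9% decay phase come from the list above; the warmup fraction and the linear decay shape are assumptions made for illustration, since the card does not state them, and `wsd_learning_rate` is a hypothetical name.

```python
def wsd_learning_rate(step: int, total_steps: int, peak_lr: float = 5e-4,
                      warmup_frac: float = 0.03, decay_frac: float = 0.09) -> float:
    """Warmup-stable-decay: linear warmup, constant plateau, then decay to zero."""
    warmup_steps = round(total_steps * warmup_frac)        # assumption: warmup length is not stated
    decay_steps = max(1, round(total_steps * decay_frac))  # ~9% decay phase
    decay_start = total_steps - decay_steps
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps         # linear ramp up to the peak
    if step < decay_start:
        return peak_lr                                     # stable phase at the peak
    remaining = max(total_steps - step, 0)
    return peak_lr * remaining / decay_steps               # assumption: linear decay shape


for s in (0, 100, 500, 955, 1000):
    print(s, f"{wsd_learning_rate(s, total_steps=1000):.2e}")
```

Most of the run is spent at the peak rate, and only the final stretch anneals the learning rate.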
## Intended uses and limitations
**Intended uses.** Sequence classification, token classification (NER, POS), masked language modeling, long-document understanding, and as a base model for fine-tuned retrievers and rerankers (see EmbedBERTić and RerankerBERTić, releasing soon).
**Out of scope.** This is an encoder, not a generative model. It does not produce open-ended text. For text generation in BCMS, see the [national LLM initiative announced April 2026](https://www.srbija.gov.rs/) or general-purpose multilingual LLMs.
**Limitations.**
- **Latin script only.** Cyrillic input should be transliterated before tokenization. Raw Cyrillic falls back to byte-level encoding and burns context for no signal.
- **Domain skew.** Training data is heavy on web text, news, encyclopedic content, and PDFs (academic + literary). Code-heavy text, conversational chat, and highly technical scientific writing are underrepresented.
- **Variants.** All four BCMS varieties (Bosnian, Croatian, Montenegrin, Serbian) are represented, but Croatian and Serbian dominate the corpus volume. Montenegrin in particular is upsampled 4× during mixing to compensate.
## Production note
ModernBERTić powers production features at [Recrewty](https://recrewty.com), an AI-assisted talent management platform for the Balkans, including long-document CV understanding, psychometric inference, and the candidate retrieve-then-rerank pipeline. The model is the same artifact in production and on this card; nothing is held back.
## Citation
```bibtex
@misc{perovic2026modernbertic,
title = {{ModernBERTić}: A Modern Encoder for {BCMS} Languages},
author = {Perovic, Mitar},
year = {2026},
url = {https://huggingface.co/permitt/galton-modernbertic-large},
note = {Recrewty, EU-funded grant}
}
```
## Acknowledgments
This work was developed at **[Recrewty](https://recrewty.com)** as part of an EU-funded grant. Compute on Leonardo HPC was provided through the consortium grant.
Standing on the shoulders of:
- **Nikola Ljubešić** and the **CLASSLA** team for BERTić, BENCHić, and the broader BCMS NLP infrastructure that made this work possible.
- The **ModernBERT** team (Warner et al., 2024) for the architecture and the FlexBERT codebase.
- **MosaicML / Databricks** for Composer and the MDS streaming format.
- **HuggingFace** for the model hub, datasets, and `tokenizers` library.
- **JeRTeh**, **ReLDI**, and the broader Serbian NLP community for datasets and evaluation resources.
- **EuroHPC** and the **Leonardo** consortium for compute access.
## See also
- [`permitt/galton-modernbertic-base`](https://huggingface.co/permitt/galton-modernbertic-base) - 149M parameter variant
- [BalkanBench leaderboard](https://balkanbench.com/leaderboard) - live evaluation across BCMS encoders
- [Build-in-public series on LinkedIn](https://linkedin.com/in/perovicmitar) - posts #0-#9 covering training data, tokenizer, distributed training, debugging, and results
- [Medium release post](https://medium.com/@permitt/training-the-first-modern-architecture-encoder-for-south-slavic-languages-e2a11a4ead31) - long-form write-up of the model, the data pipeline, and lessons on data quality vs data quantity (link active at release)
- [All links in one place](https://permitt.io) - a single entry point to the LinkedIn posts and related material