| language: | |
| - sr | |
| - hr | |
| - bs | |
| - cnr | |
| license: apache-2.0 | |
| library_name: transformers | |
| tags: | |
| - modernbert | |
| - encoder | |
| - bcms | |
| - serbian | |
| - croatian | |
| - bosnian | |
| - montenegrin | |
| - south-slavic | |
| - fill-mask | |
| - long-context | |
| pipeline_tag: fill-mask | |
| datasets: | |
| - HPLT/HPLT3.0 | |
| - HuggingFaceFW/fineweb-2 | |
| - HuggingFaceFW/finepdfs | |
| - classla/xlm-r-bertic-data | |
| metrics: | |
| - accuracy | |
| - f1 | |
| model-index: | |
| - name: ModernBERTić-large | |
| results: | |
| - task: | |
| type: text-classification | |
| name: SuperGLUE-SR (BalkanBench v1.0) | |
| dataset: | |
| type: balkanbench/superglue-sr | |
| name: SuperGLUE-SR | |
| metrics: | |
| - type: average | |
| value: 73.44 | |
| name: Average (6 tasks, 5 seeds) | |
| - type: accuracy | |
| value: 80.70 | |
| name: BoolQ | |
| - type: f1_macro | |
| value: 78.52 | |
| name: CB | |
| - type: accuracy | |
| value: 76.84 | |
| name: COPA | |
| - type: accuracy | |
| value: 73.13 | |
| name: RTE | |
| - type: f1_a | |
| value: 67.90 | |
| name: MultiRC | |
| - type: accuracy | |
| value: 63.56 | |
| name: WSC | |
| <p align="center"> | |
| <img src="modernbertic.png" alt="ModernBERTić - the first modern long-context encoder for BCMS" width="100%"/> | |
| </p> | |
| # ModernBERTić-large | |
| **The first modern-architecture encoder for Bosnian, Croatian, Montenegrin, and Serbian (BCMS).** 395M parameters, native 8192-token context, FlashAttention 2. | |
| State of the art on SuperGLUE-SR. Live leaderboard: [balkanbench.com/leaderboard](https://balkanbench.com/leaderboard). | |
| > **Looking for the smaller variant?** See [`permitt/galton-modernbertic-base`](https://huggingface.co/permitt/galton-modernbertic-base) (149M params). | |
| ## TL;DR | |
| | | | | |
| |---|---| | |
| | **Architecture** | ModernBERT-large (28 layers, 1024 hidden, 16 heads) | | |
| | **Parameters** | 395M | | |
| | **Context length** | 8192 tokens (RoPE base 160K) | | |
| | **Attention** | Sliding window 256 + global every 2nd layer, FlashAttention 2 | | |
| | **Tokenizer** | BPE, 50,304 vocab, Latin-only, cased | | |
| | **Pretraining tokens** | 66B BCMS tokens, 22 sources | | |
| | **Compute** | 64× A100-64GB on Leonardo HPC, ~10h wall clock | | |
| ## Why this model exists | |
| The de facto encoder for BCMS has been [`classla/bcms-bertic`](https://huggingface.co/classla/bcms-bertic) since 2021: 110M parameters, 512-token context, ELECTRA. Excellent within its envelope. Insufficient for production tasks that require long-document understanding (CV parsing, legal documents, retrieval over knowledge bases). | |
| ModernBERTić ports the [ModernBERT](https://huggingface.co/blog/modernbert) recipe to BCMS: | |
| - **8K native context** instead of 512, via RoPE | |
| - **FlashAttention 2 + unpadding** for ~3.5× faster inference on identical hardware | |
| - **Alternating attention** (sliding window 256 + full attention every 2nd layer), so the sliding-window layers cost O(n) in sequence length instead of O(n²) | |
| - **Latin-only BCMS-native tokenizer** that needs 31% fewer tokens per character than mmBERT's multilingual SentencePiece tokenizer | |
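The effect of alternating attention on long inputs can be estimated by counting how many query-key pairs each layer scores. The sketch below is illustrative only: it takes the 28 layers, the 256-token window, and the "every 2nd layer" pattern from this card, and it assumes (this is not confirmed by the card) that the window is centred with 128 tokens on each side and that even-indexed layers are the global ones. `attended_pairs` is a hypothetical helper, not part of any library.

```python
def attended_pairs(seq_len, n_layers=28, window=256, global_every=2):
    """Count query-key pairs scored in one forward pass under two layouts."""
    half = window // 2  # assumption: centred window, `half` tokens on each side

    # Full attention: every token attends to every token.
    full_layer = seq_len * seq_len

    # Sliding window: token i attends to [i - half, i + half], clipped to the sequence.
    local_layer = sum(
        min(seq_len - 1, i + half) - max(0, i - half) + 1
        for i in range(seq_len)
    )

    # Assumption: layers 0, 2, 4, ... use full attention, the rest are local.
    n_global = sum(1 for layer in range(n_layers) if layer % global_every == 0)
    n_local = n_layers - n_global

    return {
        "full_layer": full_layer,
        "local_layer": local_layer,
        "all_full": n_layers * full_layer,
        "alternating": n_global * full_layer + n_local * local_layer,
    }


if __name__ == "__main__":
    for name, value in attended_pairs(8192).items():
        print(f"{name}: {value:,}")
```

Under these assumptions a sliding-window layer at 8192 tokens scores roughly 32× fewer pairs than a full-attention layer, and its cost grows linearly with sequence length, while the global layers remain quadratic.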
| ## Results: SuperGLUE Serbian edition | |
| Evaluation from [BalkanBench v1.0](https://balkanbench.com/leaderboard). Each cell is the mean over 5 random seeds, as reported on the website; standard deviations are shown in the leaderboard UI. | |
| Live, sortable leaderboard with all 9 evaluated models, per-task standard deviations, and reproducibility info: **[balkanbench.com/leaderboard](https://balkanbench.com/leaderboard)**. | |
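As a quick consistency check, the headline average can be recomputed from the per-task scores listed in this card's `model-index` metadata. The sketch assumes the average is an unweighted mean over the six tasks, which the numbers support but the card does not state explicitly.

```python
from statistics import mean

# Per-task scores copied from the model-index metadata of this card.
scores = {
    "BoolQ": 80.70,    # accuracy
    "CB": 78.52,       # macro F1
    "COPA": 76.84,     # accuracy
    "RTE": 73.13,      # accuracy
    "MultiRC": 67.90,  # F1a
    "WSC": 63.56,      # accuracy
}

# Assumption: the headline number is an unweighted mean over the six tasks.
average = mean(scores.values())
print(f"SuperGLUE-SR average: {average:.2f}")
```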
| ## Quickstart | |
| ```python | |
| from transformers import AutoTokenizer, AutoModelForMaskedLM | |
| import torch | |
| model_id = "permitt/galton-modernbertic-large" | |
| tokenizer = AutoTokenizer.from_pretrained(model_id) | |
| model = AutoModelForMaskedLM.from_pretrained( | |
| model_id, | |
| attn_implementation="flash_attention_2", # needs the flash-attn package; pass "sdpa" instead if it is not installed | |
| torch_dtype=torch.bfloat16, | |
| ).to("cuda") | |
| text = "Glavni grad Crne Gore je [MASK]." | |
| inputs = tokenizer(text, return_tensors="pt").to("cuda") | |
| with torch.no_grad(): | |
| logits = model(**inputs).logits | |
| mask_idx = (inputs.input_ids == tokenizer.mask_token_id).nonzero(as_tuple=True)[1] | |
| predicted = tokenizer.decode(logits[0, mask_idx].argmax(dim=-1)) | |
| print(predicted) | |
| ``` | |
| ### Long context | |
| ```python | |
| # 8192 tokens supported natively, no positional interpolation needed | |
| # `tokenizer` is the one loaded in the Quickstart; `very_long_document` is your own input string | |
| tokenizer.model_max_length = 8192 | |
| inputs = tokenizer(very_long_document, return_tensors="pt", truncation=True).to("cuda") | |
| ``` | |
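With `truncation=True`, anything past 8192 tokens is dropped. For documents longer than the context window, one common option is to split the token ids into overlapping windows and run the model on each. The following is a minimal sketch of that idea in plain Python. `window_token_ids` and the 512-token overlap are hypothetical choices, not something this card prescribes, and `max_len` would need to leave room for special tokens if they are added afterwards.

```python
def window_token_ids(token_ids, max_len=8192, stride=512):
    """Split token ids into windows of at most `max_len`, overlapping by `stride`."""
    if stride >= max_len:
        raise ValueError("stride must be smaller than max_len")
    if len(token_ids) <= max_len:
        return [list(token_ids)]

    step = max_len - stride  # how far each window advances
    windows = []
    for start in range(0, len(token_ids), step):
        windows.append(list(token_ids[start:start + max_len]))
        if start + max_len >= len(token_ids):
            break  # this window already reaches the end of the document
    return windows


if __name__ == "__main__":
    fake_ids = list(range(20000))  # stand-in for real token ids
    for w in window_token_ids(fake_ids):
        print(len(w), w[0], w[-1])
```

Fast tokenizers in `transformers` offer comparable behaviour through `return_overflowing_tokens=True` together with `stride`, which may be preferable when offsets back into the original text are needed.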
| ### Fine-tuning | |
| ```python | |
| from transformers import AutoModelForSequenceClassification | |
| model = AutoModelForSequenceClassification.from_pretrained( | |
| "permitt/galton-modernbertic-large", | |
| num_labels=3, | |
| attn_implementation="flash_attention_2", | |
| ) | |
| # standard HF Trainer flow from here | |
| ``` | |
| **Recommended hyperparameters** (from our SuperGLUE-SR sweeps): | |
| | Task type | Learning rate | Batch size | Epochs | | |
| |-----------|---------------|------------|--------| | |
| | Sequence classification | 2e-5 to 5e-5 | 16-32 | 3-5 | | |
| | Token classification (NER, POS) | 3e-5 | 32 | 5-10 | | |
| | Long-context tasks (>512 tok) | 1e-5 to 3e-5 | 8-16 | 3-5 | | |
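The ranges in the table above can be expanded into a small search grid. This is a sketch only: it assumes that just the endpoints of each range are tried, and `SEARCH_SPACE` and `configs` are hypothetical names rather than part of the released sweeps.

```python
from itertools import product

# Range endpoints copied from the table above; the grid resolution is an assumption.
SEARCH_SPACE = {
    "sequence_classification": {
        "learning_rate": [2e-5, 5e-5], "batch_size": [16, 32], "epochs": [3, 5],
    },
    "token_classification": {
        "learning_rate": [3e-5], "batch_size": [32], "epochs": [5, 10],
    },
    "long_context": {
        "learning_rate": [1e-5, 3e-5], "batch_size": [8, 16], "epochs": [3, 5],
    },
}


def configs(task_type):
    """Yield one hyperparameter dict per combination for the given task type."""
    space = SEARCH_SPACE[task_type]
    keys = list(space)
    for values in product(*(space[k] for k in keys)):
        yield dict(zip(keys, values))


if __name__ == "__main__":
    for cfg in configs("long_context"):
        print(cfg)
```

Each yielded dict can then be mapped onto whatever training entry point is in use, for example the corresponding fields of a Hugging Face `TrainingArguments` object.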
| ## Tokenizer | |
| | | Vocab | Tokens / character | OOV rate | | |
| |---|-------|--------------------|---------| | |
| | **ModernBERTić** | 50,304 | **0.229** | 0.000% | | |
| | BERTić | 32,000 | 0.242 | 0.006% | | |
| | XLM-R-BERTić | 250,002 | 0.274 | 0.008% | | |
| | mmBERT | 256,000 | 0.334 | 0.000% | | |
| *Measured on 55.8M characters of held-out BCMS text.* | |
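The tokens-per-character column translates directly into how much text fits in a fixed token budget. The short calculation below uses only the values from the table. `relative_reduction` and `chars_in_context` are hypothetical helper names, and the character counts are averages over the measured text, so real documents will vary.

```python
# Tokens per character, copied from the table above (lower is better).
TOKENS_PER_CHAR = {
    "ModernBERTić": 0.229,
    "BERTić": 0.242,
    "XLM-R-BERTić": 0.274,
    "mmBERT": 0.334,
}


def relative_reduction(ours, other):
    """Fraction of tokens saved by `ours` relative to `other` on the same text."""
    return 1 - ours / other


def chars_in_context(tokens_per_char, context_tokens=8192):
    """Average number of characters that fit into a token budget."""
    return context_tokens / tokens_per_char


if __name__ == "__main__":
    ours = TOKENS_PER_CHAR["ModernBERTić"]
    for name, tpc in TOKENS_PER_CHAR.items():
        saved = relative_reduction(ours, tpc)
        print(f"{name}: {saved:.1%} fewer tokens with ModernBERTić, "
              f"{chars_in_context(tpc):,.0f} chars per 8192 tokens")
```

At these rates, an 8192-token budget holds roughly 35,800 characters of BCMS text with the ModernBERTić tokenizer, against roughly 24,500 with the mmBERT tokenizer.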
| The vocabulary is Latin-only, so Cyrillic input should be transliterated upstream. Cased input is preferred (uncased input reduces tokenizer efficiency by ~14%). | |
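Transliteration from Cyrillic to Latin is a deterministic character mapping, so it can be done in a few lines before tokenization. The sketch below covers the 30 letters of the Serbian Cyrillic alphabet and is not the preprocessing used to train this model. `cyrillic_to_latin` is a hypothetical helper, and the additional Montenegrin letters are not handled specifically.

```python
# Serbian Cyrillic to Latin, lowercase letters; uppercase is derived below.
CYR_TO_LAT = {
    "а": "a", "б": "b", "в": "v", "г": "g", "д": "d", "ђ": "đ", "е": "e",
    "ж": "ž", "з": "z", "и": "i", "ј": "j", "к": "k", "л": "l", "љ": "lj",
    "м": "m", "н": "n", "њ": "nj", "о": "o", "п": "p", "р": "r", "с": "s",
    "т": "t", "ћ": "ć", "у": "u", "ф": "f", "х": "h", "ц": "c", "ч": "č",
    "џ": "dž", "ш": "š",
}


def cyrillic_to_latin(text):
    """Transliterate Serbian Cyrillic to Latin; other characters pass through."""
    out = []
    for i, ch in enumerate(text):
        lower = ch.lower()
        if lower not in CYR_TO_LAT:
            out.append(ch)  # Latin letters, digits, punctuation stay unchanged
            continue
        latin = CYR_TO_LAT[lower]
        if ch != lower:  # the source letter was uppercase
            prev = text[i - 1] if i > 0 else ""
            nxt = text[i + 1] if i + 1 < len(text) else ""
            # Digraphs become "LJ" inside all-caps words and "Lj" otherwise.
            in_caps_word = prev.isupper() or nxt.isupper()
            latin = latin.upper() if in_caps_word else latin.capitalize()
        out.append(latin)
    return "".join(out)


if __name__ == "__main__":
    print(cyrillic_to_latin("Главни град Црне Горе је [MASK]."))
```

The output can be passed to the tokenizer as-is. Text that is already in Latin script passes through unchanged, so the function is safe to apply to mixed-script corpora.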
| ## Pretraining | |
| - **Corpus:** 66B tokens, 227M documents, assembled from 22 BCMS sources (FineWiki, BERTić-MaCoCu, FineWeb-2, HPLT 3.0, FinePDFs, CLASSLA web, books, news, plus others). Tiered priority, BCMS-specific quality filters (gambling/content-farm/stop-word heuristics), MinHash LSH cross-source deduplication at 0.8 Jaccard threshold. | |
| - **Objective:** Masked Language Modeling, 30% masking ratio. | |
| - **Optimizer:** AdamW, peak LR 5e-4, warmup-stable-decay schedule with ~9% decay phase. | |
| - **Batch:** 4096 sequences global, kept constant across GPU counts (strong scaling). | |
| - **Precision:** bfloat16. | |
| - **Framework:** [MosaicML Composer](https://github.com/mosaicml/composer) + [FlexBERT](https://github.com/AnswerDotAI/ModernBERT). MDS streaming dataset format with deterministic resume across the 24-hour Leonardo job limit. | |
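The warmup-stable-decay schedule named in the optimizer bullet can be written as a small function of the training step. This is a sketch under stated assumptions, not the training configuration: the peak LR (5e-4) and the ~9% decay phase come from this card, while the 3% warmup fraction, the linear warmup and decay shapes, and the step count in the example are placeholders.

```python
def wsd_lr(step, total_steps, peak_lr=5e-4, warmup_frac=0.03, decay_frac=0.09):
    """Warmup-stable-decay learning rate at a given step.

    Assumptions (not from the model card): warmup_frac and the linear
    shape of both the warmup and the decay phase.
    """
    warmup_steps = round(total_steps * warmup_frac)
    decay_steps = round(total_steps * decay_frac)
    decay_start = total_steps - decay_steps

    if step < warmup_steps:
        return peak_lr * step / warmup_steps  # linear ramp from 0 to peak
    if step < decay_start or decay_steps == 0:
        return peak_lr                         # stable phase at peak LR
    remaining = max(total_steps - step, 0)
    return peak_lr * remaining / decay_steps   # linear decay to 0


if __name__ == "__main__":
    total = 10_000  # hypothetical number of optimizer steps
    for s in (0, 150, 300, 5_000, 9_100, 9_550, 10_000):
        print(s, f"{wsd_lr(s, total):.2e}")
```

Keeping the decay phase short means most of the run happens at a constant learning rate, which makes it straightforward to resume from a stable-phase checkpoint and decay later.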
| ## Intended uses and limitations | |
| **Intended uses.** Sequence classification, token classification (NER, POS), masked language modeling, long-document understanding, and as a base model for fine-tuned retrievers and rerankers (see EmbedBERTić and RerankerBERTić, releasing soon). | |
| **Out of scope.** This is an encoder, not a generative model. It does not produce open-ended text. For text generation in BCMS, see the [national LLM initiative announced April 2026](https://www.srbija.gov.rs/) or general-purpose multilingual LLMs. | |
| **Limitations.** | |
| - **Latin script only.** Cyrillic input should be transliterated before tokenization. Raw Cyrillic falls back to byte-level encoding and burns context for no signal. | |
| - **Domain skew.** Training data is heavy on web text, news, encyclopedic content, and PDFs (academic + literary). Code, conversational chat, and highly technical scientific text are underrepresented. | |
| - **Variants.** All four BCMS varieties (Bosnian, Croatian, Montenegrin, Serbian) are represented, but Croatian and Serbian dominate the corpus volume. Montenegrin in particular is upsampled 4× during mixing to compensate. | |
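To make the effect of 4× upsampling concrete, the sketch below turns per-variety token counts into sampling proportions. The counts are hypothetical placeholders chosen only to illustrate the arithmetic, since this card does not publish the per-variety breakdown. `mixing_weights` is likewise a hypothetical helper.

```python
def mixing_weights(token_counts, upsample):
    """Sampling proportion per source after applying upsampling factors."""
    effective = {
        name: count * upsample.get(name, 1.0)
        for name, count in token_counts.items()
    }
    total = sum(effective.values())
    return {name: value / total for name, value in effective.items()}


if __name__ == "__main__":
    # Hypothetical token counts (arbitrary units), NOT the real corpus split.
    counts = {"sr": 40.0, "hr": 40.0, "bs": 15.0, "cnr": 5.0}
    weights = mixing_weights(counts, upsample={"cnr": 4.0})  # 4x from this card
    for name, weight in weights.items():
        print(f"{name}: {weight:.1%}")
```

Because the weights are renormalised, upsampling one variety slightly lowers the share of every other variety rather than adding new data.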
| ## Production note | |
| ModernBERTić powers production features at [Recrewty](https://recrewty.com), an AI-assisted talent management platform for the Balkans, including long-document CV understanding, psychometric inference, and the candidate retrieve-then-rerank pipeline. The model is the same artifact in production and on this card; nothing is held back. | |
| ## Citation | |
| ```bibtex | |
| @misc{perovic2026modernbertic, | |
| title = {{ModernBERTić}: A Modern Encoder for {BCMS} Languages}, | |
| author = {Perovic, Mitar}, | |
| year = {2026}, | |
| url = {https://huggingface.co/permitt/galton-modernbertic-large}, | |
| note = {Recrewty, EU-funded grant} | |
| } | |
| ``` | |
| ## Acknowledgments | |
| This work was developed at **[Recrewty](https://recrewty.com)** as part of an EU-funded grant. Compute on Leonardo HPC was provided through the consortium grant. | |
| Standing on the shoulders of: | |
| - **Nikola Ljubešić** and the **CLASSLA** team for BERTić, BENCHić, and the broader BCMS NLP infrastructure that made this work possible. | |
| - The **ModernBERT** team (Warner et al., 2024) for the architecture and the FlexBERT codebase. | |
| - **MosaicML / Databricks** for Composer and the MDS streaming format. | |
| - **HuggingFace** for the model hub, datasets, and `tokenizers` library. | |
| - **JeRTeh**, **ReLDI**, and the broader Serbian NLP community for datasets and evaluation resources. | |
| - **EuroHPC** and the **Leonardo** consortium for compute access. | |
| ## See also | |
| - [`permitt/galton-modernbertic-base`](https://huggingface.co/permitt/galton-modernbertic-base) - 149M parameter variant | |
| - [BalkanBench leaderboard](https://balkanbench.com/leaderboard) - live evaluation across BCMS encoders | |
| - [Build-in-public series on LinkedIn](https://linkedin.com/in/perovicmitar) - posts #0-#9 covering training data, tokenizer, distributed training, debugging, and results | |
| - [Medium release post](https://medium.com/@permitt/training-the-first-modern-architecture-encoder-for-south-slavic-languages-e2a11a4ead31) - long-form write-up of the model, the data pipeline, and lessons on data quality vs data quantity (link active at release) | |
| - [All links in one place](https://permitt.io) - the LinkedIn posts and related material, collected on a single page | |