---
language:
- sr
- hr
- bs
- cnr
license: apache-2.0
library_name: transformers
tags:
- modernbert
- encoder
- bcms
- serbian
- croatian
- bosnian
- montenegrin
- south-slavic
- fill-mask
- long-context
pipeline_tag: fill-mask
datasets:
- HPLT/HPLT3.0
- HuggingFaceFW/fineweb-2
- HuggingFaceFW/finepdfs
- classla/xlm-r-bertic-data
metrics:
- accuracy
- f1
model-index:
- name: ModernBERTić-base
  results:
  - task:
      type: text-classification
      name: SuperGLUE-SR (BalkanBench v1.0)
    dataset:
      type: balkanbench/superglue-sr
      name: SuperGLUE-SR
    metrics:
    - type: average
      value: 69.73
      name: Average (6 tasks, 5 seeds)
    - type: accuracy
      value: 76.02
      name: BoolQ
    - type: f1_macro
      value: 76.96
      name: CB
    - type: accuracy
      value: 65.76
      name: COPA
    - type: accuracy
      value: 65.82
      name: RTE
    - type: f1_a
      value: 66.90
      name: MultiRC
    - type: accuracy
      value: 64.11
      name: WSC
---

<p align="center">
<img src="modernbertic.png" alt="ModernBERTić - the first modern long-context encoder for BCMS" width="100%"/>
</p>

# ModernBERTić-base

**A modern-architecture encoder for Bosnian, Croatian, Montenegrin, and Serbian (BCMS).** 149M parameters, native 8192-token context, FlashAttention 2.

> **For best downstream task performance, use the [large variant](https://huggingface.co/permitt/galton-modernbertic-large)** (395M, SOTA on SuperGLUE-SR). This base model is intended for fast inference, retrieval encoders where latency matters, and as a starting point for further pretraining or domain adaptation.

## TL;DR

| | |
|---|---|
| **Architecture** | ModernBERT-base (22 layers, 768 hidden, 12 heads) |
| **Parameters** | 149M |
| **Context length** | 8192 tokens (RoPE base 160K) |
| **Attention** | Sliding window 128 + global every 3rd layer, FlashAttention 2 |
| **Tokenizer** | BPE, 50,304 vocab, Latin-only, cased (shared with `galton-modernbertic-large`) |
| **Pretraining tokens** | 60B BCMS tokens, 22 sources |
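
To make the attention row of the table concrete, here is a minimal sketch of the layer schedule and the local mask it implies. It assumes the convention of the upstream ModernBERT implementation, where layer `i` attends globally when `i % 3 == 0` and the 128-token window is split evenly around each token (64 on each side). Neither detail is stated in this card, and the function names are illustrative only.

```python
NUM_LAYERS = 22    # from the table above
GLOBAL_EVERY = 3   # "global every 3rd layer"
WINDOW = 128       # sliding-window size in tokens


def is_global_layer(layer_idx: int, every: int = GLOBAL_EVERY) -> bool:
    """Assumed convention: layers 0, 3, 6, ... attend globally."""
    return layer_idx % every == 0


def can_attend(query_pos: int, key_pos: int, layer_idx: int, window: int = WINDOW) -> bool:
    """True if the query token may attend to the key token in this layer."""
    if is_global_layer(layer_idx):
        return True
    # Assumed symmetric window: window // 2 tokens on each side of the query.
    return abs(query_pos - key_pos) <= window // 2


global_layers = [i for i in range(NUM_LAYERS) if is_global_layer(i)]
local_layers = [i for i in range(NUM_LAYERS) if not is_global_layer(i)]
print("global layers:", global_layers)
print("number of local layers:", len(local_layers))
```

Under these assumptions most layers only mix information within a short neighbourhood, and the periodic global layers carry long-range information across the 8192-token context.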

## Honest performance note

On SuperGLUE-SR ([BalkanBench v1.0](https://balkanbench.com/leaderboard)), ModernBERTić-base scores **69.73 average**, which sits **below BERTić's 71.46** at smaller parameter count. The story is consistent with what the literature predicts:

> Masked Language Modeling at 30% masking gives a training signal on 30% of tokens. ELECTRA-style replaced-token detection (BERTić) gives a signal on 100% of tokens. At small capacities, that supervision deficit is not absorbed by architectural improvements. At larger capacities (see [`galton-modernbertic-large`](https://huggingface.co/permitt/galton-modernbertic-large)), it is.

We are publishing the base model anyway because:

1. **Inference is much faster.** ~2-3× the throughput of the large variant on identical hardware, useful for high-volume retrieval encoders, candidate filtering, and embedding workloads.
2. **It is the right starting point for further pretraining.** If you are domain-adapting to legal, medical, or other specialized BCMS text, the base scale is the correct continued-pretraining target.
3. **It is honest baseline material** for the encoder-comparison community working on BCMS.

If you want SOTA on a downstream classification or reasoning task, **use the [large variant](https://huggingface.co/permitt/galton-modernbertic-large)**. If you want a small, fast, modern-architecture encoder for embedding work or further pretraining, this one is for you.
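
The supervision gap described in the note above can be illustrated with a toy count of loss-bearing positions per sequence. This is a sketch, not the training code: the sequence length is arbitrary, the helper names are made up for the example, and the masked positions are drawn as an exact 30% sample rather than per-token coin flips.

```python
import random


def mlm_loss_positions(num_tokens: int, mask_ratio: float, rng: random.Random) -> list[int]:
    """Pick the positions that get masked; only these contribute to the MLM loss."""
    num_masked = round(num_tokens * mask_ratio)
    return sorted(rng.sample(range(num_tokens), num_masked))


def rtd_loss_positions(num_tokens: int) -> list[int]:
    """Replaced-token detection labels every position as original or replaced."""
    return list(range(num_tokens))


rng = random.Random(0)
seq_len = 1000  # illustrative sequence length
mlm_positions = mlm_loss_positions(seq_len, mask_ratio=0.30, rng=rng)
rtd_positions = rtd_loss_positions(seq_len)

print(f"MLM supervised tokens: {len(mlm_positions)} / {seq_len}")
print(f"RTD supervised tokens: {len(rtd_positions)} / {seq_len}")
print(f"RTD / MLM signal ratio: {len(rtd_positions) / len(mlm_positions):.2f}x")
```

Per token seen, the ELECTRA-style objective yields roughly 3.3 times as many labelled positions, which is the deficit the note refers to.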

## Results: SuperGLUE Serbian edition

Evaluation from [BalkanBench v1.0](https://balkanbench.com/leaderboard). Each score is the mean over 5 random seeds; standard deviations are shown in the leaderboard UI.

Live, sortable leaderboard with all 9 evaluated models, per-task standard deviations, and reproducibility info: **[balkanbench.com/leaderboard](https://balkanbench.com/leaderboard)**.

## Quickstart

```python
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer
from transformers.utils import is_flash_attn_2_available

model_id = "permitt/galton-modernbertic-base"
device = "cuda" if torch.cuda.is_available() else "cpu"
# FlashAttention 2 needs a GPU and the flash-attn package. Requesting it when
# it is not installed raises an error, so pick SDPA explicitly as the fallback.
use_fa2 = device == "cuda" and is_flash_attn_2_available()

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForMaskedLM.from_pretrained(
    model_id,
    attn_implementation="flash_attention_2" if use_fa2 else "sdpa",
    torch_dtype=torch.bfloat16 if device == "cuda" else torch.float32,
).to(device).eval()

text = "Glavni grad Crne Gore je [MASK]."
inputs = tokenizer(text, return_tensors="pt").to(device)
with torch.no_grad():
    logits = model(**inputs).logits

# Column index of the mask token in the single input sequence
mask_idx = (inputs.input_ids == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]
predicted = tokenizer.decode(logits[0, mask_idx].argmax(dim=-1))
print(predicted)  # "Podgorica"
```

### Fine-tuning

```python
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(
    "permitt/galton-modernbertic-base",
    num_labels=3,  # set to the number of classes in your task
    attn_implementation="flash_attention_2",  # requires flash-attn; use "sdpa" otherwise
)
# standard HF Trainer flow from here
```

**Recommended hyperparameters:**

| Task type | Learning rate | Batch size | Epochs |
|-----------|---------------|------------|--------|
| Sequence classification | 3e-5 to 7e-5 | 16-32 | 3-5 |
| Token classification (NER, POS) | 5e-5 | 32 | 5-10 |

The base model wants larger learning rates than the large variant (the loss landscape at smaller scale is less curved). A grid that works for this base model does not transfer to ModernBERTić-large.
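
As a starting point for a sweep, the sequence-classification ranges above can be expanded into an explicit grid. This is a minimal sketch: the grid points inside each range and the seed values are illustrative choices, not settings reported by this card. The dictionary keys follow the field names of `transformers.TrainingArguments` so each entry can be passed along as keyword arguments.

```python
from itertools import product

# Ranges come from the sequence-classification row above; the specific
# points chosen inside those ranges are illustrative.
LEARNING_RATES = [3e-5, 5e-5, 7e-5]
BATCH_SIZES = [16, 32]
EPOCHS = [3, 4, 5]
SEEDS = [0, 1, 2, 3, 4]  # hypothetical seed values


def build_grid() -> list[dict]:
    """Return one config dict per (learning rate, batch size, epochs) combination."""
    return [
        {
            "learning_rate": lr,
            "per_device_train_batch_size": bs,
            "num_train_epochs": ep,
        }
        for lr, bs, ep in product(LEARNING_RATES, BATCH_SIZES, EPOCHS)
    ]


grid = build_grid()
print(f"{len(grid)} configurations, {len(grid) * len(SEEDS)} runs with {len(SEEDS)} seeds each")
for config in grid[:3]:
    print(config)
```

Running every configuration over several seeds keeps the comparison in line with how the benchmark scores are averaged, at the cost of a larger compute budget.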

### Continued pretraining

The base model is a reasonable starting point for domain adaptation. Use the same MLM objective at 30% masking, a peak LR of ~1e-4 (roughly an order of magnitude below the 8e-4 pretraining peak), and warmup over the first 10% of your continued-pretraining tokens. Expect to need ~1-5B in-domain tokens, depending on how distant your target domain is from web/news/encyclopedic text.
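
The schedule described above can be sketched as a function of tokens seen. The peak LR and the 10% warmup come from the text; the final decay phase (last 10% of tokens, linear to zero) and the 2B-token corpus size are assumptions added for illustration, loosely following the warmup-stable-decay shape used in pretraining.

```python
PEAK_LR = 1e-4          # from the text above
WARMUP_FRACTION = 0.10  # from the text above
DECAY_FRACTION = 0.10   # assumption: not specified in this card
FINAL_LR = 0.0          # assumption: decay linearly to zero


def lr_at(tokens_seen: float, total_tokens: float) -> float:
    """Learning rate after `tokens_seen` of `total_tokens` training tokens."""
    progress = min(max(tokens_seen / total_tokens, 0.0), 1.0)
    if progress < WARMUP_FRACTION:
        # Linear warmup from 0 to the peak
        return PEAK_LR * progress / WARMUP_FRACTION
    decay_start = 1.0 - DECAY_FRACTION
    if progress <= decay_start:
        # Stable phase at the peak
        return PEAK_LR
    # Linear decay from the peak to FINAL_LR
    remaining = (1.0 - progress) / DECAY_FRACTION
    return FINAL_LR + (PEAK_LR - FINAL_LR) * remaining


total = 2e9  # hypothetical 2B-token in-domain corpus
for fraction in (0.0, 0.05, 0.10, 0.50, 0.95, 1.0):
    print(f"{fraction:>4.0%} of tokens -> lr = {lr_at(fraction * total, total):.2e}")
```

Expressing the schedule in tokens rather than steps keeps it valid if the batch size or sequence length changes between runs.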

## Tokenizer

Identical to the [large variant](https://huggingface.co/permitt/galton-modernbertic-large): BPE, 50,304 vocab, Latin-only, cased. **Tokens per character: 0.229** on held-out BCMS text, 31% lower than mmBERT's multilingual SentencePiece. Cyrillic input should be transliterated upstream. Cased input is preferred (uncased reduces tokenizer efficiency by ~14%).
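
A quick back-of-the-envelope calculation shows what the tokens-per-character figure means for the context window. The mmBERT ratio below is not a measured value from this card: it is derived from the stated 31% reduction, on the assumption that "31% lower" means 0.229 = r × (1 − 0.31).

```python
TOKENS_PER_CHAR = 0.229    # from the text above
RELATIVE_REDUCTION = 0.31  # "31% lower" than mmBERT
CONTEXT_TOKENS = 8192      # native context length

# If 0.229 is 31% below the mmBERT ratio r, then 0.229 = r * (1 - 0.31).
implied_mmbert_ratio = TOKENS_PER_CHAR / (1 - RELATIVE_REDUCTION)

# Average number of characters that fit into one full context window
chars_per_window = CONTEXT_TOKENS / TOKENS_PER_CHAR
chars_per_window_mmbert = CONTEXT_TOKENS / implied_mmbert_ratio

print(f"implied mmBERT tokens/char: {implied_mmbert_ratio:.3f}")
print(f"chars in 8192 tokens (this tokenizer): {chars_per_window:,.0f}")
print(f"chars in 8192 tokens (implied mmBERT ratio): {chars_per_window_mmbert:,.0f}")
```

On average BCMS text, a full window therefore holds somewhere around 35,000 characters, noticeably more than the same token budget would cover at the implied multilingual ratio.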

## Pretraining

Identical recipe to the large variant, scaled down to base configuration:

- **Corpus:** 60B tokens, 227M documents, 22 BCMS sources, tiered priority, MinHash LSH cross-source deduplication.
- **Objective:** Masked Language Modeling, 30% masking ratio.
- **Optimizer:** AdamW, peak LR 8e-4 (higher than for the large variant because the model is smaller), warmup-stable-decay.
- **Batch:** 4096 sequences global.
- **Precision:** bfloat16.
- **Framework:** MosaicML Composer + FlexBERT.

See the [large variant card](https://huggingface.co/permitt/galton-modernbertic-large) for the detailed pipeline write-up.
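
For readers unfamiliar with the deduplication step named in the corpus bullet, the following is a generic, standard-library illustration of MinHash with LSH banding. It is not the pipeline used for this model: the shingle size, signature length, band layout, and hash function are all illustrative assumptions, and the example documents are made up.

```python
import hashlib

NUM_PERM = 64              # signature length (illustrative)
BANDS = 16                 # 16 bands x 4 rows = 64 signature entries
ROWS = NUM_PERM // BANDS


def shingles(text: str, k: int = 5) -> set[str]:
    """Character k-grams of the lowercased, whitespace-normalized text."""
    text = " ".join(text.lower().split())
    return {text[i:i + k] for i in range(max(len(text) - k + 1, 1))}


def _hash(seed: int, shingle: str) -> int:
    """Seeded 64-bit hash; each seed acts as one 'permutation'."""
    digest = hashlib.blake2b(
        shingle.encode("utf-8"), digest_size=8, salt=seed.to_bytes(8, "little")
    ).digest()
    return int.from_bytes(digest, "little")


def minhash(text: str) -> list[int]:
    """Signature: the minimum hash value under each seeded hash function."""
    grams = shingles(text)
    return [min(_hash(seed, g) for g in grams) for seed in range(NUM_PERM)]


def estimated_jaccard(sig_a: list[int], sig_b: list[int]) -> float:
    """Fraction of matching signature entries approximates Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


def band_keys(signature: list[int]) -> set[tuple]:
    """LSH keys: documents that share any key become duplicate candidates."""
    return {(b, tuple(signature[b * ROWS:(b + 1) * ROWS])) for b in range(BANDS)}


doc_a = "Podgorica je glavni i najveći grad Crne Gore, smješten na rijeci Morači."
doc_b = "Podgorica je glavni i najveci grad Crne Gore, smješten na rijeci Morači."
doc_c = "Zagreb leži na obroncima Medvednice uz rijeku Savu."

sig_a, sig_b, sig_c = minhash(doc_a), minhash(doc_b), minhash(doc_c)
print("a vs b:", estimated_jaccard(sig_a, sig_b), bool(band_keys(sig_a) & band_keys(sig_b)))
print("a vs c:", estimated_jaccard(sig_a, sig_c), bool(band_keys(sig_a) & band_keys(sig_c)))
```

In a cross-source setting, the band keys would typically be stored in a shared index so that a document from one source can be matched against near-duplicates that arrived from another.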

## Intended uses and limitations

**Intended uses.**

- Fine-tuning starting point where inference latency matters more than peak accuracy.
- Continued pretraining for domain-adapted BCMS encoders (legal, medical, technical).
- Embedding model fine-tunes where 149M is the right size for the latency budget.
- Token classification tasks (NER, POS) where the base scale is sufficient.

**Out of scope.**

- High-stakes downstream classification or reasoning tasks where you want SOTA. Use [`galton-modernbertic-large`](https://huggingface.co/permitt/galton-modernbertic-large) instead.
- Generative tasks. This is an encoder, not a generative model. For text generation in BCMS, see the [national LLM initiative announced April 2026](https://www.srbija.gov.rs/) or general-purpose multilingual LLMs.
- Languages outside BCMS. The tokenizer is Latin-only and the corpus is BCMS-only.

**Limitations.**

- **Latin script only.** Cyrillic input should be transliterated before tokenization. Raw Cyrillic falls back to byte-level encoding and burns context for no signal.
- **Domain skew.** Training data is heavy on web text, news, encyclopedic content, and PDFs (academic + literary). Code-heavy text, conversational chat, and highly technical scientific text are underrepresented.
- **Variants.** All four BCMS varieties (Bosnian, Croatian, Montenegrin, Serbian) are represented, but Croatian and Serbian dominate the corpus volume. Montenegrin in particular is upsampled 4× during mixing to compensate.
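
The Latin-script limitation above can be handled with a small preprocessing step. Below is a minimal sketch of Serbian Cyrillic to Latin transliteration using only the standard library. The function name is illustrative and this is not a utility shipped with the model; it covers the 30 letters of the Serbian Cyrillic alphabet and does not handle the additional Montenegrin letters.

```python
_CYR_TO_LAT = {
    "а": "a", "б": "b", "в": "v", "г": "g", "д": "d", "ђ": "đ", "е": "e",
    "ж": "ž", "з": "z", "и": "i", "ј": "j", "к": "k", "л": "l", "љ": "lj",
    "м": "m", "н": "n", "њ": "nj", "о": "o", "п": "p", "р": "r", "с": "s",
    "т": "t", "ћ": "ć", "у": "u", "ф": "f", "х": "h", "ц": "c", "ч": "č",
    "џ": "dž", "ш": "š",
}


def cyrillic_to_latin(text: str) -> str:
    """Transliterate Serbian Cyrillic to Latin, leaving other characters as-is."""
    out = []
    for i, ch in enumerate(text):
        latin = _CYR_TO_LAT.get(ch.lower())
        if latin is None:
            out.append(ch)  # Latin letters, digits, punctuation pass through
        elif ch.islower():
            out.append(latin)
        else:
            nxt = text[i + 1] if i + 1 < len(text) else ""
            # Digraphs: "Љ" becomes "LJ" inside an all-caps word, otherwise "Lj".
            out.append(latin.upper() if nxt.isupper() else latin.capitalize())
    return "".join(out)


print(cyrillic_to_latin("Главни град Црне Горе је Подгорица."))
```

Apply the function to raw text before calling the tokenizer, so that Cyrillic documents use the same subword vocabulary as Latin ones instead of falling back to byte-level pieces.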

## Citation

```bibtex
@misc{perovic2026modernbertic,
  title  = {{ModernBERTić}: A Modern Encoder for {BCMS} Languages},
  author = {Perovic, Mitar},
  year   = {2026},
  url    = {https://huggingface.co/permitt/galton-modernbertic-base},
  note   = {Recrewty, EU-funded grant}
}
```

## Acknowledgments

This work was developed at **[Recrewty](https://recrewty.com)** as part of an EU-funded grant. Compute on Leonardo HPC was provided through the consortium grant.

Standing on the shoulders of:

- **Nikola Ljubešić** and the **CLASSLA** team for BERTić, BENCHić, and the broader BCMS NLP infrastructure that made this work possible.
- The **ModernBERT** team (Warner et al., 2024) for the architecture and the FlexBERT codebase.
- **MosaicML / Databricks** for Composer and the MDS streaming format.
- **HuggingFace** for the model hub, datasets, and `tokenizers` library.
- **JeRTeh**, **ReLDI**, and the broader Serbian NLP community for datasets and evaluation resources.
- **EuroHPC** and the **Leonardo** consortium for compute access.

## See also

- [`permitt/galton-modernbertic-large`](https://huggingface.co/permitt/galton-modernbertic-large) - 395M parameter variant, SOTA on SuperGLUE-SR
- [BalkanBench leaderboard](https://balkanbench.com/leaderboard) - live evaluation across BCMS encoders
- [Build-in-public series on LinkedIn](https://linkedin.com/in/perovicmitar) - posts #0-#9 covering training data, tokenizer, distributed training, debugging, and results
- [Medium release post](https://medium.com/@permitt/training-the-first-modern-architecture-encoder-for-south-slavic-languages-e2a11a4ead31) - long-form write-up of the model, the data pipeline, and lessons on data quality vs data quantity (link active at release)
- [All links in one place](https://permitt.io) - a single entry point to the LinkedIn material and the other links above