language:
  - sr
  - hr
  - bs
  - cnr
license: apache-2.0
library_name: transformers
tags:
  - modernbert
  - encoder
  - bcms
  - serbian
  - croatian
  - bosnian
  - montenegrin
  - south-slavic
  - fill-mask
  - long-context
pipeline_tag: fill-mask
datasets:
  - HPLT/HPLT3.0
  - HuggingFaceFW/fineweb-2
  - HuggingFaceFW/finepdfs
  - classla/xlm-r-bertic-data
metrics:
  - accuracy
  - f1
model-index:
  - name: ModernBERTić-base
    results:
      - task:
          type: text-classification
          name: SuperGLUE-SR (BalkanBench v1.0)
        dataset:
          type: balkanbench/superglue-sr
          name: SuperGLUE-SR
        metrics:
          - type: average
            value: 69.73
            name: Average (6 tasks, 5 seeds)
          - type: accuracy
            value: 76.02
            name: BoolQ
          - type: f1_macro
            value: 76.96
            name: CB
          - type: accuracy
            value: 65.76
            name: COPA
          - type: accuracy
            value: 65.82
            name: RTE
          - type: f1_a
            value: 66.9
            name: MultiRC
          - type: accuracy
            value: 64.11
            name: WSC

ModernBERTić - the first modern long-context encoder for BCMS

ModernBERTić-base

A modern-architecture encoder for Bosnian, Croatian, Montenegrin, and Serbian (BCMS). 149M parameters, native 8192-token context, FlashAttention 2.

For best downstream task performance, use the large variant (395M, SOTA on SuperGLUE-SR). This base model is intended for fast inference, retrieval encoders where latency matters, and as a starting point for further pretraining or domain adaptation.

TL;DR

Architecture: ModernBERT-base (22 layers, 768 hidden, 12 heads)
Parameters: 149M
Context length: 8192 tokens (RoPE base 160K)
Attention: sliding window 128 + global every 3rd layer, FlashAttention 2
Tokenizer: BPE, 50,304 vocab, Latin-only, cased (shared with galton-modernbertic-large)
Pretraining tokens: 60B BCMS tokens, 22 sources
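
The attention layout can be enumerated in a few lines. This is a sketch that assumes the layer-index convention of the upstream ModernBERT implementation, where layer 0 is global and every third layer after it is global as well; the function name is hypothetical and not part of any library.

```python
def attention_layout(num_layers: int = 22, global_every: int = 3) -> list[str]:
    """Label each layer as 'global' or 'local' (sliding window).

    Assumption: layers whose index is a multiple of `global_every` use
    global attention, following the upstream ModernBERT convention.
    """
    return ["global" if i % global_every == 0 else "local" for i in range(num_layers)]


layout = attention_layout()
global_layers = [i for i, kind in enumerate(layout) if kind == "global"]
print(global_layers)          # [0, 3, 6, 9, 12, 15, 18, 21]
print(layout.count("local"))  # 14
```

Under that assumption, 8 of the 22 layers attend over the full 8192-token context and the remaining 14 use the 128-token sliding window.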

Honest performance note

On SuperGLUE-SR (BalkanBench v1.0), ModernBERTić-base scores a 69.73 average, below BERTić's 71.46 even though BERTić has fewer parameters. This is consistent with what the literature predicts:

Masked Language Modeling at 30% masking gives a training signal on 30% of tokens. ELECTRA-style replaced-token detection (BERTić) gives a signal on 100% of tokens. At small capacities, that supervision deficit is not absorbed by architectural improvements. At larger capacities (see galton-modernbertic-large), it is.
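
To make the supervision gap concrete, the arithmetic below counts how many positions contribute to the loss in a single sequence under each objective. It is a back-of-the-envelope sketch: the 8192-token length is taken from the model's context length and is not necessarily the sequence length used during pretraining, and the helper name is hypothetical.

```python
def supervised_positions(seq_len: int, supervised_fraction: float) -> int:
    """Number of token positions that receive a loss signal."""
    return int(seq_len * supervised_fraction)


seq_len = 8192  # assumption: one full-context sequence
mlm = supervised_positions(seq_len, 0.30)  # MLM with 30% masking
rtd = supervised_positions(seq_len, 1.00)  # replaced-token detection

print(mlm, rtd)             # 2457 8192
print(round(rtd / mlm, 2))  # 3.33
```

This counts positions only. The two objectives differ in what each position predicts (a vocabulary item versus a binary replaced/original label), so the ratio is a rough guide rather than a measure of information per step.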

We are publishing the base model anyway because:

  1. Inference is much faster: roughly 2-3× the throughput of the large variant on identical hardware, which is useful for high-volume retrieval encoders, candidate filtering, and embedding workloads.
  2. It is the right starting point for further pretraining. If you are domain-adapting to legal, medical, or other specialized BCMS text, the base scale is the correct continued-pretraining target.
  3. It serves as an honest baseline for the encoder-comparison community working on BCMS.

If you want SOTA on a downstream classification or reasoning task, use the large variant. If you want a small, fast, modern-architecture encoder for embedding work or further pretraining, this one is for you.

Results: SuperGLUE Serbian edition

Evaluation is from BalkanBench v1.0. Each score is the mean over 5 random seeds, as reported on the website; standard deviations are shown in the leaderboard UI.

A live, sortable leaderboard with all 9 evaluated models, per-task standard deviations, and reproducibility info is available at balkanbench.com/leaderboard.

Quickstart

from transformers import AutoTokenizer, AutoModelForMaskedLM
import torch

model_id = "permitt/galton-modernbertic-base"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForMaskedLM.from_pretrained(
    model_id,
    attn_implementation="flash_attention_2",   # requires the flash-attn package; use "sdpa" if it is not installed
    torch_dtype=torch.bfloat16,
).to("cuda")

text = "Glavni grad Crne Gore je [MASK]."
inputs = tokenizer(text, return_tensors="pt").to("cuda")
with torch.no_grad():
    logits = model(**inputs).logits

mask_idx = (inputs.input_ids == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]
predicted = tokenizer.decode(logits[0, mask_idx].argmax(dim=-1))
print(predicted)  # "Podgorica"

Fine-tuning

from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(
    "permitt/galton-modernbertic-base",
    num_labels=3,
    attn_implementation="flash_attention_2",
)
# standard HF Trainer flow from here

Recommended hyperparameters:

| Task type | Learning rate | Batch size | Epochs |
| --- | --- | --- | --- |
| Sequence classification | 3e-5 to 7e-5 | 16-32 | 3-5 |
| Token classification (NER, POS) | 5e-5 | 32 | 5-10 |

The base model prefers larger learning rates than the large variant (the loss landscape at smaller scale is less curved), so a grid tuned on this base model does not transfer to ModernBERTić-large.
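
One way to turn the recommended sequence-classification ranges into a search grid is sketched below. The specific grid points inside each range are illustrative assumptions, not values reported for this model, and the batch size is treated as per-device on the assumption of a single GPU. The dictionary keys match field names of `transformers.TrainingArguments`.

```python
from itertools import product

# Grid points chosen inside the recommended ranges; the exact points are assumptions.
learning_rates = [3e-5, 5e-5, 7e-5]
batch_sizes = [16, 32]
epochs = [3, 5]

grid = [
    {"learning_rate": lr, "per_device_train_batch_size": bs, "num_train_epochs": ep}
    for lr, bs, ep in product(learning_rates, batch_sizes, epochs)
]

print(len(grid))  # 12
print(grid[0])    # {'learning_rate': 3e-05, 'per_device_train_batch_size': 16, 'num_train_epochs': 3}
```

Each entry can be unpacked into `TrainingArguments(output_dir=..., **entry)` for one run, ideally repeated over several seeds.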

Continued pretraining

The base model is a reasonable starting point for domain adaptation. Use the same MLM objective at 30% masking with a peak LR of ~1e-4 (roughly an order of magnitude below the 8e-4 pretraining peak), and warm up over the first 10% of your continued-pretraining tokens. Expect to need ~1-5B in-domain tokens, depending on how distant your target domain is from web/news/encyclopedic text.
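
The warmup rule can be written as a function of tokens seen. This sketch assumes linear warmup and then holds the learning rate at its peak; the card does not say what should happen after warmup, so a decay phase may be worth adding. The function and parameter names are hypothetical.

```python
def continued_pretraining_lr(tokens_seen: float, total_tokens: float,
                             peak_lr: float = 1e-4,
                             warmup_fraction: float = 0.10) -> float:
    """Linear warmup over the first `warmup_fraction` of tokens, then constant.

    Assumption: the post-warmup shape is not given in the card; holding the
    peak is simply the most basic choice.
    """
    warmup_tokens = warmup_fraction * total_tokens
    if tokens_seen < warmup_tokens:
        return peak_lr * (tokens_seen / warmup_tokens)
    return peak_lr


total = 2e9  # assumption: a 2B-token in-domain corpus, inside the 1-5B range
print(continued_pretraining_lr(0, total))    # 0.0
print(continued_pretraining_lr(1e8, total))  # 5e-05
print(continued_pretraining_lr(1e9, total))  # 0.0001
```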

Tokenizer

Identical to the large variant: BPE, 50,304 vocab, Latin-only, cased. Tokens per character: 0.229 on held-out BCMS text, 31% lower than mmBERT's multilingual SentencePiece. Cyrillic input should be transliterated upstream. Cased input is preferred (uncased reduces tokenizer efficiency by ~14%).
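
A minimal sketch of the upstream transliteration step, assuming Serbian Cyrillic input and the standard one-to-one mapping to Gaj's Latin alphabet. The helper name is hypothetical. The sketch does not handle all-caps digraphs (for example "ЉУБАВ" becomes "LjUBAV") or the Montenegrin-specific letters, so a dedicated transliteration library may be preferable in production.

```python
# Serbian Cyrillic -> Latin. Lowercase letters are listed; uppercase is derived below.
_CYR_TO_LAT = {
    "а": "a", "б": "b", "в": "v", "г": "g", "д": "d", "ђ": "đ", "е": "e",
    "ж": "ž", "з": "z", "и": "i", "ј": "j", "к": "k", "л": "l", "љ": "lj",
    "м": "m", "н": "n", "њ": "nj", "о": "o", "п": "p", "р": "r", "с": "s",
    "т": "t", "ћ": "ć", "у": "u", "ф": "f", "х": "h", "ц": "c", "ч": "č",
    "џ": "dž", "ш": "š",
}
# Uppercase Cyrillic maps to a capitalized Latin form, so "Љ" becomes "Lj".
_CYR_TO_LAT.update({c.upper(): l.capitalize() for c, l in list(_CYR_TO_LAT.items())})


def cyrillic_to_latin(text: str) -> str:
    """Transliterate Serbian Cyrillic to Latin; other characters pass through."""
    return "".join(_CYR_TO_LAT.get(ch, ch) for ch in text)


print(cyrillic_to_latin("Главни град Црне Горе је Подгорица."))
# Glavni grad Crne Gore je Podgorica.
```

Applying this before `tokenizer(...)` keeps Cyrillic text out of the byte-level fallback path.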

Pretraining

Identical recipe to the large variant, scaled down to base configuration:

  • Corpus: 60B tokens, 227M documents, 22 BCMS sources, tiered priority, MinHash LSH cross-source deduplication.
  • Objective: Masked Language Modeling, 30% masking ratio.
  • Optimizer: AdamW, peak LR 8e-4 (higher than for the large variant because the model is smaller), warmup-stable-decay schedule.
  • Batch: 4096 sequences global.
  • Precision: bfloat16.
  • Framework: MosaicML Composer + FlexBERT.

See the large variant card for the detailed pipeline write-up.
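
For readers unfamiliar with the deduplication step, the sketch below shows the MinHash part of the idea: near-duplicate documents produce signatures that agree in most slots. It is an illustration only, not the pipeline used for this corpus. The shingle size, the number of hash functions, and all function names are assumptions, and the LSH banding that makes the comparison scale across sources is omitted.

```python
import hashlib


def shingles(text: str, n: int = 3) -> set[str]:
    """Word n-grams of the text (the whole text if it has fewer than n words)."""
    words = text.lower().split()
    if len(words) < n:
        return {" ".join(words)}
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}


def _hash64(item: str, seed: int) -> int:
    """64-bit hash of `item`, salted with `seed` to emulate independent hash functions."""
    digest = hashlib.blake2b(item.encode("utf-8"), digest_size=8,
                             salt=seed.to_bytes(4, "little")).digest()
    return int.from_bytes(digest, "little")


def minhash_signature(items: set[str], num_perm: int = 64) -> list[int]:
    """For each seeded hash function, keep the minimum hash over all shingles."""
    return [min(_hash64(item, seed) for item in items) for seed in range(num_perm)]


def estimated_jaccard(sig_a: list[int], sig_b: list[int]) -> float:
    """The fraction of matching slots approximates the Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


doc_a = "Podgorica je glavni grad Crne Gore i njen najveći grad"
doc_b = "Sava se uliva u Dunav kod Beograda na ušću"
sig_a = minhash_signature(shingles(doc_a))
sig_a_copy = minhash_signature(shingles(doc_a))
sig_b = minhash_signature(shingles(doc_b))

print(estimated_jaccard(sig_a, sig_a_copy))  # 1.0
print(estimated_jaccard(sig_a, sig_b) < 0.2)
```

In a full pipeline, signatures are split into bands and hashed into buckets so that only documents sharing a bucket are compared.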

Intended uses and limitations

Intended uses.

  • Fine-tuning starting point where inference latency matters more than peak accuracy.
  • Continued pretraining for domain-adapted BCMS encoders (legal, medical, technical).
  • Embedding model fine-tunes where 149M is the right size for the latency budget.
  • Token classification tasks (NER, POS) where the base scale is sufficient.

Out of scope.

  • High-stakes downstream classification or reasoning tasks where you want SOTA. Use galton-modernbertic-large instead.
  • Generative tasks. This is an encoder, not a generative model. For text generation in BCMS, see the national LLM initiative announced April 2026 or general-purpose multilingual LLMs.
  • Languages outside BCMS. The tokenizer is Latin-only and the corpus is BCMS-only.

Limitations.

  • Latin script only. Cyrillic input should be transliterated before tokenization. Raw Cyrillic falls back to byte-level encoding and burns context for no signal.
  • Domain skew. Training data is heavy on web text, news, encyclopedic content, and PDFs (academic + literary). Code-heavy text, conversational chat, and highly technical scientific text are underrepresented.
  • Variants. All four BCMS varieties (Bosnian, Croatian, Montenegrin, Serbian) are represented, but Croatian and Serbian dominate the corpus volume. Montenegrin in particular is upsampled 4× during mixing to compensate.
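
The effect of upsampling on the training mix can be illustrated with a short calculation. The token counts below are placeholders in arbitrary units, not the real corpus composition, and the helper name is hypothetical; only the 4× factor for Montenegrin comes from this card.

```python
def mixing_weights(token_counts: dict[str, float],
                   upsample: dict[str, float]) -> dict[str, float]:
    """Sampling probability per variety after applying upsampling factors."""
    effective = {name: count * upsample.get(name, 1.0)
                 for name, count in token_counts.items()}
    total = sum(effective.values())
    return {name: value / total for name, value in effective.items()}


# Placeholder counts (arbitrary units), NOT the real corpus statistics.
counts = {"sr": 40.0, "hr": 40.0, "bs": 15.0, "cnr": 5.0}

before = mixing_weights(counts, upsample={})
after = mixing_weights(counts, upsample={"cnr": 4.0})

print(round(before["cnr"], 4))  # 0.05
print(round(after["cnr"], 4))   # 0.1739
```

With these placeholder numbers, a variety holding 5% of the raw tokens ends up at roughly 17% of the sampled mix, while the larger varieties shrink proportionally.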

Citation

@misc{perovic2026modernbertic,
  title  = {{ModernBERTić}: A Modern Encoder for {BCMS} Languages},
  author = {Perovic, Mitar},
  year   = {2026},
  url    = {https://huggingface.co/permitt/galton-modernbertic-base},
  note   = {Recrewty, EU-funded grant}
}

Acknowledgments

This work was developed at Recrewty as part of an EU-funded grant. Compute on Leonardo HPC was provided through the consortium grant.

Standing on the shoulders of:

  • Nikola Ljubešić and the CLASSLA team for BERTić, BENCHić, and the broader BCMS NLP infrastructure that made this work possible.
  • The ModernBERT team (Warner et al., 2024) for the architecture and the FlexBERT codebase.
  • MosaicML / Databricks for Composer and the MDS streaming format.
  • HuggingFace for the model hub, datasets, and tokenizers library.
  • JeRTeh, ReLDI, and the broader Serbian NLP community for datasets and evaluation resources.
  • EuroHPC and the Leonardo consortium for compute access.

See also