language:
  - sr
  - hr
  - bs
  - cnr
license: apache-2.0
library_name: transformers
tags:
  - modernbert
  - encoder
  - bcms
  - serbian
  - croatian
  - bosnian
  - montenegrin
  - south-slavic
  - fill-mask
  - long-context
pipeline_tag: fill-mask
datasets:
  - HPLT/HPLT3.0
  - HuggingFaceFW/fineweb-2
  - HuggingFaceFW/finepdfs
  - classla/xlm-r-bertic-data
metrics:
  - accuracy
  - f1
model-index:
  - name: ModernBERTić-base
    results:
      - task:
          type: text-classification
          name: SuperGLUE-SR (BalkanBench v1.0)
        dataset:
          type: balkanbench/superglue-sr
          name: SuperGLUE-SR
        metrics:
          - type: average
            value: 69.73
            name: Average (6 tasks, 5 seeds)
          - type: accuracy
            value: 76.02
            name: BoolQ
          - type: f1_macro
            value: 76.96
            name: CB
          - type: accuracy
            value: 65.76
            name: COPA
          - type: accuracy
            value: 65.82
            name: RTE
          - type: f1_a
            value: 66.9
            name: MultiRC
          - type: accuracy
            value: 64.11
            name: WSC

ModernBERTić - the first modern long-context encoder for BCMS

ModernBERTić-base

A modern-architecture encoder for Bosnian, Croatian, Montenegrin, and Serbian (BCMS). 149M parameters, native 8192-token context, FlashAttention 2.

For best downstream task performance, use the large variant (395M, SOTA on SuperGLUE-SR). This base model is intended for fast inference, retrieval encoders where latency matters, and as a starting point for further pretraining or domain adaptation.

TL;DR


Architecture	ModernBERT-base (22 layers, 768 hidden, 12 heads)
Parameters	149M
Context length	8192 tokens (RoPE base 160K)
Attention	Sliding window 128 + global every 3rd layer, FlashAttention 2
Tokenizer	BPE, 50,304 vocab, Latin-only, cased (shared with `galton-modernbertic-large`)
Pretraining tokens	60B BCMS tokens, 22 sources
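
The attention row in the table above can be made concrete with a short sketch. It assumes that layer 0 is global and that every third layer after it is global too, which follows the `layer_id % 3 == 0` convention of the Hugging Face ModernBERT implementation; the card itself only says "global every 3rd layer", so treat the exact offsets as an assumption.

```python
# Sketch of the local/global attention layout for a 22-layer ModernBERT-base.
# Assumption: global layers are those with index % 3 == 0 (layer 0 included).

NUM_LAYERS = 22    # from the table above
GLOBAL_EVERY = 3   # "global every 3rd layer"


def attention_layout(num_layers: int = NUM_LAYERS, every: int = GLOBAL_EVERY) -> list:
    """Return 'global' or 'local' for each layer index."""
    return ["global" if i % every == 0 else "local" for i in range(num_layers)]


layout = attention_layout()
global_layers = [i for i, kind in enumerate(layout) if kind == "global"]

print(global_layers)            # [0, 3, 6, 9, 12, 15, 18, 21]
print(layout.count("global"))   # 8
print(layout.count("local"))    # 14
```

Under that assumption, 8 of the 22 layers attend over the full sequence and the remaining 14 use the 128-token sliding window.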

Honest performance note

On SuperGLUE-SR (BalkanBench v1.0), ModernBERTić-base averages 69.73, below the 71.46 that BERTić reaches with fewer parameters. This is consistent with what the literature predicts:

Masked Language Modeling at 30% masking gives a training signal on 30% of tokens. ELECTRA-style replaced-token detection (BERTić) gives a signal on 100% of tokens. At small capacities, that supervision deficit is not absorbed by architectural improvements. At larger capacities (see galton-modernbertic-large), it is.
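
The supervision gap described above is simple arithmetic. The sketch below counts supervised positions for one full-length 8192-token sequence; the sequence length is only an illustration (the ratio does not depend on it), and `supervised_positions` is a hypothetical helper, not part of any training code.

```python
# Supervised positions per sequence: MLM at 30% masking vs. replaced-token
# detection (ELECTRA-style), which gets a training signal on every token.

def supervised_positions(num_tokens: int, supervised_fraction: float) -> int:
    """Number of positions that contribute to the loss."""
    return round(num_tokens * supervised_fraction)


SEQ_LEN = 8192  # illustrative: one full-context sequence

mlm_positions = supervised_positions(SEQ_LEN, 0.30)
rtd_positions = supervised_positions(SEQ_LEN, 1.00)

print(mlm_positions)                                # 2458
print(rtd_positions)                                # 8192
print(round(rtd_positions / mlm_positions, 2))      # 3.33
```

Per token processed, replaced-token detection therefore gets roughly 3.3 times as many supervised positions as MLM at this masking ratio.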

We are publishing the base model anyway because:

Inference is much faster. ~2-3× the throughput of the large variant on identical hardware, useful for high-volume retrieval encoders, candidate filtering, and embedding workloads.
It is the right starting point for further pretraining. If you are domain-adapting to legal, medical, or other specialized BCMS text, the base scale is the correct continued-pretraining target.
It is honest baseline material for the encoder-comparison community working on BCMS.

If you want SOTA on a downstream classification or reasoning task, use the large variant. If you want a small, fast, modern-architecture encoder for embedding work or further pretraining, this one is for you.

Results: SuperGLUE Serbian edition

Evaluation from BalkanBench v1.0. Each cell is the mean over 5 random seeds, as reported on the website; standard deviations are shown in the leaderboard UI.

Live, sortable leaderboard with all 9 evaluated models, per-task standard deviations, and reproducibility info: balkanbench.com/leaderboard.

Quickstart

from transformers import AutoTokenizer, AutoModelForMaskedLM
import torch

model_id = "permitt/galton-modernbertic-base"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForMaskedLM.from_pretrained(
    model_id,
    attn_implementation="flash_attention_2",   # requires the flash-attn package; use "sdpa" if it is not installed
    torch_dtype=torch.bfloat16,
).to("cuda")

text = "Glavni grad Crne Gore je [MASK]."
inputs = tokenizer(text, return_tensors="pt").to("cuda")
with torch.no_grad():
    logits = model(**inputs).logits

mask_idx = (inputs.input_ids == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]
predicted = tokenizer.decode(logits[0, mask_idx].argmax(dim=-1))
print(predicted)  # "Podgorica"
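
Depending on the transformers version, an explicit request for `flash_attention_2` may raise an error rather than fall back when the `flash-attn` package is missing. A defensive way to choose the backend can be sketched as follows; `pick_attn_implementation` is a hypothetical helper, and checking for the `flash_attn` module is an assumption about how the package is installed.

```python
import importlib.util


def pick_attn_implementation(cuda_available: bool) -> str:
    """Return "flash_attention_2" only when it can plausibly be used, else "sdpa"."""
    # FlashAttention 2 needs a CUDA device and the flash_attn Python package.
    has_flash_attn = importlib.util.find_spec("flash_attn") is not None
    if cuda_available and has_flash_attn:
        return "flash_attention_2"
    return "sdpa"


# On a CPU-only machine the helper always selects the PyTorch SDPA backend.
print(pick_attn_implementation(cuda_available=False))  # sdpa
```

In the Quickstart, the result would be passed as `attn_implementation=pick_attn_implementation(torch.cuda.is_available())`.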

Fine-tuning

from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(
    "permitt/galton-modernbertic-base",
    num_labels=3,
    attn_implementation="flash_attention_2",
)
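# FlashAttention 2 runs only in fp16/bf16, so train with mixed precision (e.g. bf16=True in TrainingArguments)
# standard HF Trainer flow from here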

Recommended hyperparameters:

Task type	Learning rate	Batch size	Epochs
Sequence classification	3e-5 to 7e-5	16-32	3-5
Token classification (NER, POS)	5e-5	32	5-10

The base model wants larger learning rates than the large variant (the loss landscape at smaller scale is less curved), so a grid tuned on the base model does not transfer to ModernBERTić-large.
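
The recommended ranges can be expanded into a small search grid. This is a minimal sketch: the endpoints come from the table above, while the intermediate points (5e-5 for the learning rate, 4 epochs) are assumptions added only to fill out the grid.

```python
from itertools import product

# Search space per task type. Endpoints are from the table; midpoints are assumptions.
RECOMMENDED = {
    "sequence_classification": {
        "learning_rate": [3e-5, 5e-5, 7e-5],
        "batch_size": [16, 32],
        "epochs": [3, 4, 5],
    },
    "token_classification": {
        "learning_rate": [5e-5],
        "batch_size": [32],
        "epochs": [5, 10],
    },
}


def build_grid(task: str) -> list:
    """Return every hyperparameter combination for the given task type."""
    space = RECOMMENDED[task]
    keys = list(space)
    return [dict(zip(keys, values)) for values in product(*(space[k] for k in keys))]


grid = build_grid("sequence_classification")
print(len(grid))   # 18
print(grid[0])     # {'learning_rate': 3e-05, 'batch_size': 16, 'epochs': 3}
```

Each dictionary maps onto `TrainingArguments` as `learning_rate`, `per_device_train_batch_size`, and `num_train_epochs`.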

Continued pretraining

The base model is a reasonable starting point for domain adaptation. Use the same MLM objective at 30% masking with a peak LR of ~1e-4 (roughly one decade below the 8e-4 pretraining peak), and warm up over the first 10% of your continued-pretraining tokens. Expect to need ~1-5B in-domain tokens, depending on how far your target domain is from web/news/encyclopedic text.
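
The warmup recommendation can be written as a token-based schedule. The sketch below assumes a linear warmup and then holds the learning rate at its peak; the card does not say what should happen after warmup for continued pretraining, so the constant phase is an assumption and a decay phase could be added. The 2B-token budget is a hypothetical value inside the suggested 1-5B range.

```python
def continued_pretraining_lr(
    tokens_seen: float,
    total_tokens: float,
    peak_lr: float = 1e-4,
    warmup_fraction: float = 0.10,
) -> float:
    """Linear warmup over the first `warmup_fraction` of tokens, then constant."""
    warmup_tokens = warmup_fraction * total_tokens
    if tokens_seen < warmup_tokens:
        return peak_lr * tokens_seen / warmup_tokens
    # Assumption: hold at peak after warmup.
    return peak_lr


TOTAL_TOKENS = 2e9  # hypothetical in-domain budget

for fraction in (0.0, 0.05, 0.10, 0.50):
    lr = continued_pretraining_lr(fraction * TOTAL_TOKENS, TOTAL_TOKENS)
    print(f"{fraction:.2f} of budget -> lr {lr:.2e}")
```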

Tokenizer

Identical to the large variant: BPE, 50,304 vocab, Latin-only, cased. Tokens per character: 0.229 on held-out BCMS text, 31% lower than mmBERT's multilingual SentencePiece tokenizer. Cyrillic input should be transliterated upstream. Cased input is preferred (uncased reduces tokenizer efficiency by ~14%).
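
Upstream transliteration can be as simple as a character map. The sketch below covers the 30 letters of the Serbian Cyrillic alphabet and leaves everything else untouched; `cyrillic_to_latin` is a hypothetical helper, not something shipped with the model. Uppercase digraph letters are emitted in title case ("Љ" becomes "Lj"), so all-caps words would need extra handling.

```python
# Serbian Cyrillic -> Latin, one letter at a time.
CYR_TO_LAT = {
    "а": "a", "б": "b", "в": "v", "г": "g", "д": "d", "ђ": "đ", "е": "e",
    "ж": "ž", "з": "z", "и": "i", "ј": "j", "к": "k", "л": "l", "љ": "lj",
    "м": "m", "н": "n", "њ": "nj", "о": "o", "п": "p", "р": "r", "с": "s",
    "т": "t", "ћ": "ć", "у": "u", "ф": "f", "х": "h", "ц": "c", "ч": "č",
    "џ": "dž", "ш": "š",
}
# Add uppercase letters; digraphs become title case ("Љ" -> "Lj").
CYR_TO_LAT.update(
    {cyr.upper(): lat.capitalize() for cyr, lat in list(CYR_TO_LAT.items())}
)


def cyrillic_to_latin(text: str) -> str:
    """Replace Serbian Cyrillic letters with Latin; other characters pass through."""
    return "".join(CYR_TO_LAT.get(ch, ch) for ch in text)


print(cyrillic_to_latin("Главни град Црне Горе је Подгорица."))
# Glavni grad Crne Gore je Podgorica.
```

Running this before tokenization keeps Cyrillic text from being split into byte-level pieces.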

Pretraining

Identical recipe to the large variant, scaled down to base configuration:

Corpus: 60B tokens, 227M documents, 22 BCMS sources, tiered priority, MinHash LSH cross-source deduplication.
Objective: Masked Language Modeling, 30% masking ratio.
Optimizer: AdamW, peak LR 8e-4 (higher than for the large variant because the model is smaller), warmup-stable-decay schedule.
Batch: 4096 sequences global.
Precision: bfloat16.
Framework: MosaicML Composer + FlexBERT.

See the large variant card for the detailed pipeline write-up.
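
For readers unfamiliar with MinHash LSH deduplication, the idea can be sketched in a few lines of standard-library Python. This is an illustration of the general technique, not the pipeline used for this corpus: the shingle size, the number of hash functions, and the band layout below are all assumptions, since the card does not state them.

```python
import hashlib

NUM_HASHES = 64     # assumption: signature length
BANDS = 16          # assumption: 16 bands of 4 rows each
SHINGLE_WORDS = 3   # assumption: word 3-grams


def shingles(text: str, k: int = SHINGLE_WORDS) -> set:
    """Lowercased word k-grams of a document."""
    words = text.lower().split()
    if len(words) < k:
        return {" ".join(words)}
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def _hash(seed: int, shingle: str) -> int:
    """One member of a seeded hash family, built on salted BLAKE2b."""
    digest = hashlib.blake2b(
        shingle.encode("utf-8"), digest_size=8, salt=seed.to_bytes(8, "little")
    ).digest()
    return int.from_bytes(digest, "little")


def minhash_signature(text: str) -> tuple:
    """Minimum hash value of the shingle set under each hash function."""
    doc_shingles = shingles(text)
    return tuple(min(_hash(seed, s) for s in doc_shingles) for seed in range(NUM_HASHES))


def estimated_jaccard(sig_a: tuple, sig_b: tuple) -> float:
    """Fraction of matching signature slots, an estimate of Jaccard similarity."""
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)


def lsh_keys(sig: tuple) -> set:
    """Split the signature into bands; each (band index, band values) is a bucket key."""
    rows = NUM_HASHES // BANDS
    return {(band, sig[band * rows:(band + 1) * rows]) for band in range(BANDS)}


def is_candidate_pair(sig_a: tuple, sig_b: tuple) -> bool:
    """Two documents are duplicate candidates if they share at least one bucket."""
    return bool(lsh_keys(sig_a) & lsh_keys(sig_b))


doc_a = "Podgorica je glavni grad Crne Gore i njen najveći grad"
doc_b = "Podgorica je glavni grad Crne Gore i njen najveći grad"
doc_c = "Zagreb se nalazi na rijeci Savi ispod Medvednice"

sig_a, sig_b, sig_c = (minhash_signature(d) for d in (doc_a, doc_b, doc_c))
print(estimated_jaccard(sig_a, sig_b), is_candidate_pair(sig_a, sig_b))
print(estimated_jaccard(sig_a, sig_c), is_candidate_pair(sig_a, sig_c))
```

In a real pipeline the bucket keys would go into a hash table so that candidate pairs are found without comparing every document against every other.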

Intended uses and limitations

Intended uses.

Fine-tuning starting point where inference latency matters more than peak accuracy.
Continued pretraining for domain-adapted BCMS encoders (legal, medical, technical).
Embedding model fine-tunes where 149M is the right size for the latency budget.
Token classification tasks (NER, POS) where the base scale is sufficient.

Out of scope.

High-stakes downstream classification or reasoning tasks where you want SOTA. Use galton-modernbertic-large instead.
Generative tasks. This is an encoder, not a generative model. For text generation in BCMS, see the national LLM initiative announced April 2026 or general-purpose multilingual LLMs.
Languages outside BCMS. The tokenizer is Latin-only and the corpus is BCMS-only.

Limitations.

Latin script only. Cyrillic input should be transliterated before tokenization. Raw Cyrillic falls back to byte-level encoding and burns context for no signal.
Domain skew. Training data is heavy on web text, news, encyclopedic content, and PDFs (academic + literary). Code-heavy text, conversational chat, and highly technical scientific text are underrepresented.
Variants. All four BCMS varieties (Bosnian, Croatian, Montenegrin, Serbian) are represented, but Croatian and Serbian dominate the corpus volume. Montenegrin in particular is upsampled 4× during mixing to compensate.
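
To see what a 4× upsampling factor does to the training mix, the sketch below recomputes sampling proportions from per-variety token counts. The counts are toy placeholders chosen only to make the arithmetic visible; they are not the real corpus statistics, which this card does not report.

```python
def mixing_proportions(token_counts: dict, upsample: dict) -> dict:
    """Share of each source in the training mix after applying upsampling factors."""
    effective = {name: n * upsample.get(name, 1.0) for name, n in token_counts.items()}
    total = sum(effective.values())
    return {name: value / total for name, value in effective.items()}


# Hypothetical token counts in arbitrary units (NOT the real corpus composition).
toy_counts = {"hr": 40.0, "sr": 40.0, "bs": 15.0, "cnr": 5.0}

raw = mixing_proportions(toy_counts, {})
mixed = mixing_proportions(toy_counts, {"cnr": 4.0})  # Montenegrin upsampled 4x

print(round(raw["cnr"], 3))     # 0.05
print(round(mixed["cnr"], 3))   # 0.174
```

With these toy numbers, a 4× factor raises the Montenegrin share from 5% to about 17%, which is less than 4× because the total grows as well.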

Citation

@misc{perovic2026modernbertic,
  title  = {{ModernBERTić}: A Modern Encoder for {BCMS} Languages},
  author = {Perovic, Mitar},
  year   = {2026},
  url    = {https://huggingface.co/permitt/galton-modernbertic-base},
  note   = {Recrewty, EU-funded grant}
}

Acknowledgments

This work was developed at Recrewty as part of an EU-funded grant. Compute on Leonardo HPC was provided through the consortium grant.

Standing on the shoulders of:

Nikola Ljubešić and the CLASSLA team for BERTić, BENCHić, and the broader BCMS NLP infrastructure that made this work possible.
The ModernBERT team (Warner et al., 2024) for the architecture and the FlexBERT codebase.
MosaicML / Databricks for Composer and the MDS streaming format.
HuggingFace for the model hub, datasets, and tokenizers library.
JeRTeh, ReLDI, and the broader Serbian NLP community for datasets and evaluation resources.
EuroHPC and the Leonardo consortium for compute access.
