	---
	language:
	- sr
	- hr
	- bs
	- cnr
	license: apache-2.0
	library_name: transformers
	tags:
	- modernbert
	- encoder
	- bcms
	- serbian
	- croatian
	- bosnian
	- montenegrin
	- south-slavic
	- fill-mask
	- long-context
	pipeline_tag: fill-mask
	datasets:
	- HPLT/HPLT3.0
	- HuggingFaceFW/fineweb-2
	- HuggingFaceFW/finepdfs
	- classla/xlm-r-bertic-data
	metrics:
	- accuracy
	- f1
	model-index:
	- name: ModernBERTić-base
	  results:
	  - task:
	      type: text-classification
	      name: SuperGLUE-SR (BalkanBench v1.0)
	    dataset:
	      type: balkanbench/superglue-sr
	      name: SuperGLUE-SR
	    metrics:
	    - type: average
	      value: 69.73
	      name: Average (6 tasks, 5 seeds)
	    - type: accuracy
	      value: 76.02
	      name: BoolQ
	    - type: f1_macro
	      value: 76.96
	      name: CB
	    - type: accuracy
	      value: 65.76
	      name: COPA
	    - type: accuracy
	      value: 65.82
	      name: RTE
	    - type: f1_a
	      value: 66.90
	      name: MultiRC
	    - type: accuracy
	      value: 64.11
	      name: WSC
	---

	<p align="center">
	<img src="modernbertic.png" alt="ModernBERTić - the first modern long-context encoder for BCMS" width="100%"/>
	</p>

	# ModernBERTić-base

	A modern-architecture encoder for Bosnian, Croatian, Montenegrin, and Serbian (BCMS). 149M parameters, native 8192-token context, FlashAttention 2.

	> For best downstream task performance, use the [large variant](https://huggingface.co/permitt/galton-modernbertic-large) (395M, SOTA on SuperGLUE-SR). This base model is intended for fast inference, retrieval encoders where latency matters, and as a starting point for further pretraining or domain adaptation.

	## TL;DR

	| | |
	|---|---|
	| Architecture | ModernBERT-base (22 layers, 768 hidden, 12 heads) |
	| Parameters | 149M |
	| Context length | 8192 tokens (RoPE base 160K) |
	| Attention | Sliding window 128 + global every 3rd layer, FlashAttention 2 |
	| Tokenizer | BPE, 50,304 vocab, Latin-only, cased (shared with `galton-modernbertic-large`) |
	| Pretraining tokens | 60B BCMS tokens, 22 sources |
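
To make the attention row concrete, here is a minimal sketch of how a "global every 3rd layer" pattern could lay out over 22 layers. It assumes zero-based layer indices with layer 0 global, which is the convention used by the reference ModernBERT implementation; that this checkpoint follows the same convention is an assumption, and the function name is purely illustrative.

```python
def attention_layout(num_layers: int = 22, global_every: int = 3, window: int = 128):
    """Return one label per layer: "global" or "local(<window>)".

    Assumption: layer i attends globally when i % global_every == 0 and
    uses sliding-window (local) attention otherwise.
    """
    layout = []
    for i in range(num_layers):
        if i % global_every == 0:
            layout.append("global")
        else:
            layout.append(f"local({window})")
    return layout


layout = attention_layout()
num_global = layout.count("global")
print(f"{num_global} global layers, {len(layout) - num_global} sliding-window layers")
```

Under these assumptions, 8 of the 22 layers attend over the full sequence and the remaining 14 use the 128-token window.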

	## Honest performance note

	On SuperGLUE-SR ([BalkanBench v1.0](https://balkanbench.com/leaderboard)), ModernBERTić-base scores 69.73 on average, below the 71.46 that BERTić reaches with fewer parameters. This is consistent with what the literature predicts:

	> Masked Language Modeling at 30% masking gives a training signal on 30% of tokens. ELECTRA-style replaced-token detection (BERTić) gives a signal on 100% of tokens. At small capacities, that supervision deficit is not absorbed by architectural improvements. At larger capacities (see [`galton-modernbertic-large`](https://huggingface.co/permitt/galton-modernbertic-large)), it is.
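
As a rough illustration of the supervision gap described above, the sketch below counts how many positions of one sequence receive a loss under each objective. The 512-token sequence length is hypothetical, and the sketch masks a fixed number of positions, whereas real MLM collators usually sample each position independently, so actual counts vary around this value.

```python
import random


def mlm_supervised_positions(seq_len: int, mask_ratio: float, seed: int = 0) -> list:
    """Pick round(seq_len * mask_ratio) positions to mask (and therefore supervise)."""
    rng = random.Random(seed)
    n_masked = round(seq_len * mask_ratio)
    return sorted(rng.sample(range(seq_len), n_masked))


seq_len = 512  # hypothetical sequence length
mlm_positions = mlm_supervised_positions(seq_len, mask_ratio=0.30)
rtd_positions = list(range(seq_len))  # replaced-token detection labels every position

print(f"MLM at 30% masking: {len(mlm_positions)} of {seq_len} positions carry a loss")
print(f"Replaced-token detection: {len(rtd_positions)} of {seq_len} positions carry a loss")
```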

	We are publishing the base model anyway because:

	1. Inference is much faster: roughly 2-3× the throughput of the large variant on identical hardware, which helps high-volume retrieval encoders, candidate filtering, and embedding workloads.
	2. It is the right starting point for further pretraining. If you are domain-adapting to legal, medical, or other specialized BCMS text, the base scale is the correct continued-pretraining target.
	3. It is honest baseline material for the encoder-comparison community working on BCMS.

	If you want SOTA on a downstream classification or reasoning task, use the [large variant](https://huggingface.co/permitt/galton-modernbertic-large). If you want a small, fast, modern-architecture encoder for embedding work or further pretraining, this one is for you.

	## Results: SuperGLUE Serbian edition

	Evaluation from [BalkanBench v1.0](https://balkanbench.com/leaderboard). Each cell is the mean over 5 random seeds; standard deviations are shown in the leaderboard UI.

	Live, sortable leaderboard with all 9 evaluated models, per-task standard deviations, and reproducibility info: [balkanbench.com/leaderboard](https://balkanbench.com/leaderboard).

	## Quickstart

	```python
	from transformers import AutoTokenizer, AutoModelForMaskedLM
	import torch

	model_id = "permitt/galton-modernbertic-base"
	tokenizer = AutoTokenizer.from_pretrained(model_id)
	model = AutoModelForMaskedLM.from_pretrained(
	    model_id,
	    # FA2 requires the flash-attn package; pass "sdpa" instead if it is not installed
	    attn_implementation="flash_attention_2",
	    torch_dtype=torch.bfloat16,
	).to("cuda")

	text = "Glavni grad Crne Gore je [MASK]."
	inputs = tokenizer(text, return_tensors="pt").to("cuda")
	with torch.no_grad():
	    logits = model(**inputs).logits

	# Column index of each [MASK] token in the single input sequence
	mask_idx = (inputs.input_ids == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]
	predicted = tokenizer.decode(logits[0, mask_idx].argmax(dim=-1))
	print(predicted)  # "Podgorica"
	```
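
Whether an explicit `flash_attention_2` request degrades gracefully may depend on the installed `transformers` version, so it can be safer to choose the backend yourself. A minimal sketch, assuming that the presence of the `flash_attn` package is a good enough signal (it does not check for a supported GPU); the helper name is illustrative.

```python
import importlib.util


def pick_attn_implementation() -> str:
    """Return "flash_attention_2" if flash_attn is importable, else "sdpa"."""
    if importlib.util.find_spec("flash_attn") is not None:
        return "flash_attention_2"
    return "sdpa"


attn_impl = pick_attn_implementation()
print(attn_impl)
```

The result can then be passed as `attn_implementation=attn_impl` to `from_pretrained`.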

	### Fine-tuning

	```python
	from transformers import AutoModelForSequenceClassification

	model = AutoModelForSequenceClassification.from_pretrained(
	    "permitt/galton-modernbertic-base",
	    num_labels=3,
	    attn_implementation="flash_attention_2",
	)
	# standard HF Trainer flow from here
	```

	Recommended hyperparameters:

	| Task type | Learning rate | Batch size | Epochs |
	|-----------|---------------|------------|--------|
	| Sequence classification | 3e-5 to 7e-5 | 16-32 | 3-5 |
	| Token classification (NER, POS) | 5e-5 | 32 | 5-10 |

	The base model prefers larger learning rates than the large variant (the loss landscape at smaller scale is less curved), so a grid tuned on this base model does not transfer to ModernBERTić-large.
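
One way to turn the sequence-classification row into a search grid is sketched below. The specific grid points are hypothetical picks inside the recommended ranges, and the dictionary keys mirror `TrainingArguments` field names; whether the table's batch size is meant per device or global is not stated here, so treat that mapping as an assumption.

```python
from itertools import product

# Hypothetical grid points chosen inside the recommended ranges.
learning_rates = [3e-5, 5e-5, 7e-5]
batch_sizes = [16, 32]
epoch_counts = [3, 4, 5]

# One dict per configuration, ready to be unpacked into TrainingArguments.
grid = [
    {
        "learning_rate": lr,
        "per_device_train_batch_size": bs,
        "num_train_epochs": ep,
    }
    for lr, bs, ep in product(learning_rates, batch_sizes, epoch_counts)
]

print(f"{len(grid)} configurations")
print(grid[0])
```

Each entry can be merged into a `TrainingArguments(output_dir=..., **config)` call, ideally repeated over several seeds given the seed variance reported on the leaderboard.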

	### Continued pretraining

	The base model is a reasonable starting point for domain adaptation. Use the same MLM objective at 30% masking, a peak LR of ~1e-4 (roughly one decade below the 8e-4 pretraining peak), and warmup over the first 10% of your continued-pretraining tokens. Expect to need ~1-5B in-domain tokens, depending on how distant your target domain is from web/news/encyclopedic text.
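
The warmup recommendation can be sketched as a token-based schedule. This is a minimal sketch under stated assumptions: the warmup shape is taken to be linear, the rate is simply held at its peak afterwards (no decay phase is specified here), and the 2B-token budget is a hypothetical example.

```python
def continued_pretraining_lr(tokens_seen: int, total_tokens: int,
                             peak_lr: float = 1e-4, warmup_frac: float = 0.10) -> float:
    """Linear warmup to peak_lr over the first warmup_frac of tokens, then hold.

    Assumptions: linear warmup and a constant learning rate afterwards.
    """
    warmup_tokens = warmup_frac * total_tokens
    if tokens_seen < warmup_tokens:
        return peak_lr * tokens_seen / warmup_tokens
    return peak_lr


total = 2_000_000_000  # hypothetical 2B-token in-domain budget
for seen in (0, 100_000_000, 200_000_000, 1_000_000_000):
    lr = continued_pretraining_lr(seen, total)
    print(f"{seen:>13,d} tokens -> lr {lr:.2e}")
```

In practice the same function can be wrapped in a step-based scheduler by converting tokens to optimizer steps with your batch size and sequence length.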

	## Tokenizer

	Identical to the [large variant](https://huggingface.co/permitt/galton-modernbertic-large): BPE, 50,304 vocab, Latin-only, cased. Tokens per character: 0.229 on held-out BCMS text, 31% lower than mmBERT's multilingual SentencePiece. Cyrillic input should be transliterated upstream. Cased input is preferred (uncased reduces tokenizer efficiency by ~14%).
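
Since Cyrillic input should be transliterated upstream, a small pre-processing step is needed before tokenization. The following is a minimal sketch using the standard Serbian Cyrillic-to-Latin letter correspondence. It is not the preprocessing used for this model: the function name is illustrative, uppercase digraphs are title-cased (so all-caps words come out as "Lj..." rather than "LJ..."), and the additional Montenegrin letters are not handled.

```python
# Serbian Cyrillic -> Latin (lowercase). Three letters map to Latin digraphs.
_CYR_TO_LAT = {
    "а": "a", "б": "b", "в": "v", "г": "g", "д": "d", "ђ": "đ", "е": "e",
    "ж": "ž", "з": "z", "и": "i", "ј": "j", "к": "k", "л": "l", "љ": "lj",
    "м": "m", "н": "n", "њ": "nj", "о": "o", "п": "p", "р": "r", "с": "s",
    "т": "t", "ћ": "ć", "у": "u", "ф": "f", "х": "h", "ц": "c", "ч": "č",
    "џ": "dž", "ш": "š",
}


def cyrillic_to_latin(text: str) -> str:
    """Transliterate Serbian Cyrillic letters; leave every other character untouched."""
    out = []
    for ch in text:
        lower = ch.lower()
        if lower not in _CYR_TO_LAT:
            out.append(ch)  # Latin letters, digits, punctuation, whitespace
            continue
        latin = _CYR_TO_LAT[lower]
        # An uppercase source letter becomes a title-cased Latin form ("Љ" -> "Lj").
        out.append(latin.capitalize() if ch != lower else latin)
    return "".join(out)


print(cyrillic_to_latin("Главни град Црне Горе је Подгорица."))
```

Applying this before `tokenizer(...)` keeps Cyrillic-script documents on the efficient Latin vocabulary.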

	## Pretraining

	Identical recipe to the large variant, scaled down to base configuration:

	- Corpus: 60B tokens, 227M documents, 22 BCMS sources, tiered priority, MinHash LSH cross-source deduplication.
	- Objective: Masked Language Modeling, 30% masking ratio.
	- Optimizer: AdamW, peak LR 8e-4 (higher than for the large variant because the model is smaller), warmup-stable-decay schedule.
	- Batch: 4096 sequences global.
	- Precision: bfloat16.
	- Framework: MosaicML Composer + FlexBERT.
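
For readers unfamiliar with the deduplication step named in the corpus bullet, the sketch below shows the core MinHash idea: near-duplicate documents share most of their minimum hash values. It is a toy illustration, not the pipeline used for this corpus. The shingle size, the number of hash functions, and the choice of BLAKE2b are assumptions, and the LSH banding and tiered source priority are not shown.

```python
import hashlib


def shingles(text: str, n: int = 3) -> set:
    """Word n-gram shingles of a lowercased, whitespace-tokenized document."""
    words = text.lower().split()
    if len(words) < n:
        return {" ".join(words)} if words else set()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}


def minhash_signature(text: str, num_perm: int = 64) -> list:
    """One minimum 64-bit hash value per seeded hash function."""
    doc_shingles = shingles(text)
    signature = []
    for seed in range(num_perm):
        salt = seed.to_bytes(8, "little")  # a different salt per hash function
        hashes = (
            int.from_bytes(
                hashlib.blake2b(s.encode("utf-8"), digest_size=8, salt=salt).digest(),
                "little",
            )
            for s in doc_shingles
        )
        signature.append(min(hashes, default=0))
    return signature


def estimated_jaccard(sig_a: list, sig_b: list) -> float:
    """Fraction of matching signature slots, an estimate of shingle-set Jaccard similarity."""
    matches = sum(a == b for a, b in zip(sig_a, sig_b))
    return matches / len(sig_a)


doc_a = "Podgorica je glavni grad Crne Gore i njen najveći grad"
doc_b = "Danas je sunčano vrijeme na cijeloj jadranskoj obali zemlje"

sig_a = minhash_signature(doc_a)
print(estimated_jaccard(sig_a, minhash_signature(doc_a)))
print(estimated_jaccard(sig_a, minhash_signature(doc_b)))
```

A production pipeline would bucket signatures with LSH bands so that only candidate pairs are compared, then keep the copy from the higher-priority source.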

	See the [large variant card](https://huggingface.co/permitt/galton-modernbertic-large) for the detailed pipeline write-up.

	## Intended uses and limitations

	Intended uses.
	- Fine-tuning starting point where inference latency matters more than peak accuracy.
	- Continued pretraining for domain-adapted BCMS encoders (legal, medical, technical).
	- Embedding model fine-tunes where 149M is the right size for the latency budget.
	- Token classification tasks (NER, POS) where the base scale is sufficient.

	Out of scope.
	- High-stakes downstream classification or reasoning tasks where you want SOTA. Use [`galton-modernbertic-large`](https://huggingface.co/permitt/galton-modernbertic-large) instead.
	- Generative tasks. This is an encoder, not a generative model. For text generation in BCMS, see the [national LLM initiative announced April 2026](https://www.srbija.gov.rs/) or general-purpose multilingual LLMs.
	- Languages outside BCMS. The tokenizer is Latin-only and the corpus is BCMS-only.

	Limitations.
	- Latin script only. Cyrillic input should be transliterated before tokenization. Raw Cyrillic falls back to byte-level encoding and burns context for no signal.
	- Domain skew. Training data is heavy on web text, news, encyclopedic content, and PDFs (academic + literary). Code-heavy text, conversational chat, and highly technical scientific text are underrepresented.
	- Variants. All four BCMS varieties (Bosnian, Croatian, Montenegrin, Serbian) are represented, but Croatian and Serbian dominate the corpus volume. Montenegrin in particular is upsampled 4× during mixing to compensate.

	## Citation

	```bibtex
	@misc{perovic2026modernbertic,
	title = {{ModernBERTić}: A Modern Encoder for {BCMS} Languages},
	author = {Perovic, Mitar},
	year = {2026},
	url = {https://huggingface.co/permitt/galton-modernbertic-base},
	note = {Recrewty, EU-funded grant}
	}
	```

	## Acknowledgments

	This work was developed at [Recrewty](https://recrewty.com) as part of an EU-funded grant. Compute on Leonardo HPC was provided through the consortium grant.

	Standing on the shoulders of:
	- Nikola Ljubešić and the CLASSLA team for BERTić, BENCHić, and the broader BCMS NLP infrastructure that made this work possible.
	- The ModernBERT team (Warner et al., 2024) for the architecture and the FlexBERT codebase.
	- MosaicML / Databricks for Composer and the MDS streaming format.
	- HuggingFace for the model hub, datasets, and `tokenizers` library.
	- JeRTeh, ReLDI, and the broader Serbian NLP community for datasets and evaluation resources.
	- EuroHPC and the Leonardo consortium for compute access.

	## See also

	- [`permitt/galton-modernbertic-large`](https://huggingface.co/permitt/galton-modernbertic-large) - 395M parameter variant, SOTA on SuperGLUE-SR
	- [BalkanBench leaderboard](https://balkanbench.com/leaderboard) - live evaluation across BCMS encoders
	- [Build-in-public series on LinkedIn](https://linkedin.com/in/perovicmitar) - posts #0-#9 covering training data, tokenizer, distributed training, debugging, and results
	- [Medium release post](https://medium.com/@permitt/training-the-first-modern-architecture-encoder-for-south-slavic-languages-e2a11a4ead31) - long-form write-up of the model, the data pipeline, and lessons on data quality vs data quantity (link active at release)
	- [All links in one place](https://permitt.io) - a single page that also links to the LinkedIn material