---
language:
- sr
- hr
- bs
- cnr
license: apache-2.0
library_name: transformers
tags:
- modernbert
- encoder
- bcms
- serbian
- croatian
- bosnian
- montenegrin
- south-slavic
- fill-mask
- long-context
pipeline_tag: fill-mask
datasets:
- HPLT/HPLT3.0
- HuggingFaceFW/fineweb-2
- HuggingFaceFW/finepdfs
- classla/xlm-r-bertic-data
metrics:
- accuracy
- f1
model-index:
- name: ModernBERTić-large
  results:
  - task:
      type: text-classification
      name: SuperGLUE-SR (BalkanBench v1.0)
    dataset:
      type: balkanbench/superglue-sr
      name: SuperGLUE-SR
    metrics:
    - type: average
      value: 73.44
      name: Average (6 tasks, 5 seeds)
    - type: accuracy
      value: 80.70
      name: BoolQ
    - type: f1_macro
      value: 78.52
      name: CB
    - type: accuracy
      value: 76.84
      name: COPA
    - type: accuracy
      value: 73.13
      name: RTE
    - type: f1_a
      value: 67.90
      name: MultiRC
    - type: accuracy
      value: 63.56
      name: WSC
---

	<p align="center">
	<img src="modernbertic.png" alt="ModernBERTić - the first modern long-context encoder for BCMS" width="100%"/>
	</p>

	# ModernBERTić-large

	The first modern-architecture encoder for Bosnian, Croatian, Montenegrin, and Serbian (BCMS). 395M parameters, native 8192-token context, FlashAttention 2.

	State of the art on SuperGLUE-SR. Live leaderboard: [balkanbench.com/leaderboard](https://balkanbench.com/leaderboard).

	> Looking for the smaller variant? See [`permitt/galton-modernbertic-base`](https://huggingface.co/permitt/galton-modernbertic-base) (149M params).

	## TL;DR

| | |
|---|---|
| Architecture | ModernBERT-large (28 layers, 1024 hidden, 16 heads) |
| Parameters | 395M |
| Context length | 8192 tokens (RoPE base 160K) |
| Attention | Sliding window 256 + global every 2nd layer, FlashAttention 2 |
| Tokenizer | BPE, 50,304 vocab, Latin-only, cased |
| Pretraining tokens | 66B BCMS tokens, 22 sources |
| Compute | 64× A100-64GB on Leonardo HPC, ~10h wall clock |

	## Why this model exists

	The de facto encoder for BCMS has been [`classla/bcms-bertic`](https://huggingface.co/classla/bcms-bertic) since 2021: 110M parameters, 512-token context, ELECTRA. Excellent within its envelope. Insufficient for production tasks that require long-document understanding (CV parsing, legal documents, retrieval over knowledge bases).

	ModernBERTić ports the [ModernBERT](https://huggingface.co/blog/modernbert) recipe to BCMS:

	- 8K native context instead of 512, via RoPE
- FlashAttention 2 + unpadding for ~3.5× faster inference on identical hardware
- Alternating attention (sliding window 256 + full attention every 2nd layer), so local layers cost O(n·w) with window w = 256 while only the global layers stay O(n²)
- Latin-only, BCMS-native tokenizer that produces 31% fewer tokens per character than mmBERT's multilingual SentencePiece tokenizer
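
As a rough illustration of the alternating-attention bullet, the sketch below counts query-key score entries per forward pass for the configuration on this card (28 layers, window 256, global attention every 2nd layer) and compares it with 28 fully global layers. It ignores window edge effects and kernel-level details, the choice of which layer indices are global is an assumption, and `attention_pairs` is a hypothetical helper rather than part of any library.

```python
def attention_pairs(n_tokens: int, n_layers: int = 28, window: int = 256,
                    global_every: int = 2) -> tuple[int, int]:
    """Return (alternating, all_global) counts of query-key pairs."""
    full = n_tokens * n_tokens                 # global layer: every token sees every token
    local = n_tokens * min(window, n_tokens)   # local layer: each token sees about `window` tokens
    # Assumption: layers 0, global_every, 2 * global_every, ... are the global ones.
    n_global = len(range(0, n_layers, global_every))
    n_local = n_layers - n_global
    return n_global * full + n_local * local, n_layers * full


alternating, all_global = attention_pairs(8192)
ratio = alternating / all_global
print(f"alternating / all-global at 8192 tokens: {ratio:.3f}")
```

At 8192 tokens the alternating layout needs roughly half the score entries of an all-global stack, and the local share of that cost grows only linearly with input length.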

	## Results: SuperGLUE Serbian edition

Evaluation is from [BalkanBench v1.0](https://balkanbench.com/leaderboard). Each cell is the mean over 5 random seeds, as reported on the website; standard deviations are shown in the leaderboard UI.


	Live, sortable leaderboard with all 9 evaluated models, per-task standard deviations, and reproducibility info: [balkanbench.com/leaderboard](https://balkanbench.com/leaderboard).
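
For reference, the six per-task scores listed in this card's metadata reproduce the 73.44 headline average. The snippet is only a consistency check, and it assumes the average is an unweighted mean over tasks.

```python
from statistics import mean

# Per-task scores for ModernBERTić-large, copied from this card's metadata.
scores = {
    "BoolQ": 80.70,
    "CB": 78.52,
    "COPA": 76.84,
    "RTE": 73.13,
    "MultiRC": 67.90,
    "WSC": 63.56,
}

# Assumption: the headline number is the unweighted mean over the six tasks.
average = round(mean(scores.values()), 2)
print(average)
```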

	## Quickstart

```python
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer
from transformers.utils import is_flash_attn_2_available

model_id = "permitt/galton-modernbertic-large"
# Requesting flash_attention_2 without the flash-attn package typically raises
# an error instead of falling back, so select SDPA explicitly in that case.
attn = "flash_attention_2" if is_flash_attn_2_available() else "sdpa"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForMaskedLM.from_pretrained(
    model_id,
    attn_implementation=attn,
    torch_dtype=torch.bfloat16,
).to("cuda")

text = f"Glavni grad Crne Gore je {tokenizer.mask_token}."
inputs = tokenizer(text, return_tensors="pt").to("cuda")
with torch.no_grad():
    logits = model(**inputs).logits

# Position(s) of the mask token in the single input sequence.
mask_idx = (inputs.input_ids == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]
predicted = tokenizer.decode(logits[0, mask_idx].argmax(dim=-1))
print(predicted)
```

	### Long context

```python
# Reuses `tokenizer` from the Quickstart; `very_long_document` is your own text.
# 8192 tokens are supported natively, no positional interpolation needed.
inputs = tokenizer(
    very_long_document, return_tensors="pt", truncation=True, max_length=8192
).to("cuda")
```
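
With `truncation=True`, anything beyond 8192 tokens is dropped. One common workaround, which this card does not prescribe, is to split the token ids into overlapping windows and run the model on each window. The sketch below shows that idea on plain lists of ids. `chunk_token_ids` is a hypothetical helper, the default `overlap` of 512 is arbitrary, and the default `max_len` of 8190 assumes two special tokens are added to each window afterwards.

```python
def chunk_token_ids(ids: list[int], max_len: int = 8190,
                    overlap: int = 512) -> list[list[int]]:
    """Split `ids` into windows of at most `max_len` ids that share `overlap` ids."""
    if not 0 <= overlap < max_len:
        raise ValueError("overlap must satisfy 0 <= overlap < max_len")
    step = max_len - overlap
    chunks = []
    for start in range(0, len(ids), step):
        chunks.append(ids[start:start + max_len])
        if start + max_len >= len(ids):  # this window already reaches the end
            break
    return chunks


# Toy example with small numbers so the windows are easy to inspect.
demo = chunk_token_ids(list(range(10)), max_len=4, overlap=1)
print(demo)
```

How the per-window outputs are merged (max, mean, or first-window-only) depends on the task and is left open here.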

	### Fine-tuning

```python
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(
    "permitt/galton-modernbertic-large",
    num_labels=3,  # set to the number of classes in your task
    attn_implementation="flash_attention_2",  # requires the flash-attn package
)
# standard HF Trainer flow from here
```

	Recommended hyperparameters (from our SuperGLUE-SR sweeps):

| Task type | Learning rate | Batch size | Epochs |
|-----------|---------------|------------|--------|
| Sequence classification | 2e-5 to 5e-5 | 16-32 | 3-5 |
| Token classification (NER, POS) | 3e-5 | 32 | 5-10 |
| Long-context tasks (>512 tok) | 1e-5 to 3e-5 | 8-16 | 3-5 |
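
A minimal sketch of turning the table into trainer settings. The keys follow the parameter names of `transformers.TrainingArguments`, but the block builds only a plain dictionary so it runs without any dependencies. `RECOMMENDED` and `training_kwargs` are hypothetical names, the low end of each range is an arbitrary pick, and mapping the table's batch size to `per_device_train_batch_size` assumes a single GPU.

```python
# Starting points taken from the table above (low end of each range, by assumption).
RECOMMENDED = {
    "sequence_classification": {"learning_rate": 2e-5, "batch_size": 16, "epochs": 3},
    "token_classification": {"learning_rate": 3e-5, "batch_size": 32, "epochs": 5},
    "long_context": {"learning_rate": 1e-5, "batch_size": 8, "epochs": 3},
}


def training_kwargs(task_type: str, output_dir: str = "out") -> dict:
    """Build keyword arguments using TrainingArguments parameter names."""
    hp = RECOMMENDED[task_type]
    return {
        "output_dir": output_dir,
        "learning_rate": hp["learning_rate"],
        "per_device_train_batch_size": hp["batch_size"],  # single-GPU assumption
        "num_train_epochs": hp["epochs"],
        "bf16": True,  # assumes hardware with bfloat16 support
    }


kwargs = training_kwargs("long_context")
print(kwargs)
```

With `transformers` installed, the dictionary can be passed on as `TrainingArguments(**kwargs)`. Treat the values as a first guess to sweep around, not as tuned settings for your task.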



	## Tokenizer

| | Vocab | Tokens / character | OOV rate |
|---|-------|--------------------|---------|
| ModernBERTić | 50,304 | 0.229 | 0.000% |
| BERTić | 32,000 | 0.242 | 0.006% |
| XLM-R-BERTić | 250,002 | 0.274 | 0.008% |
| mmBERT | 256,000 | 0.334 | 0.000% |

	Measured on 55.8M characters of held-out BCMS text.
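
The 31% figure quoted earlier for mmBERT follows directly from the tokens-per-character column. The snippet below redoes that arithmetic for every row and estimates how many characters fit into one 8192-token window. The estimate assumes the held-out rate carries over to your text, and `relative_saving` is a hypothetical helper.

```python
# Tokens per character, copied from the table above.
tokens_per_char = {
    "ModernBERTić": 0.229,
    "BERTić": 0.242,
    "XLM-R-BERTić": 0.274,
    "mmBERT": 0.334,
}


def relative_saving(ours: float, other: float) -> float:
    """Fraction of tokens saved relative to `other` on the same text."""
    return 1.0 - ours / other


ours = tokens_per_char["ModernBERTić"]
savings = {
    name: relative_saving(ours, rate)
    for name, rate in tokens_per_char.items()
    if name != "ModernBERTić"
}
for name, saving in savings.items():
    print(f"vs {name}: {saving:.1%} fewer tokens")

# Rough number of characters that fit into an 8192-token window.
chars_in_context = round(8192 / ours)
print(chars_in_context)
```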

The vocabulary is Latin-only, so Cyrillic input should be transliterated upstream. Cased input is preferred (lowercasing reduces tokenizer efficiency by ~14%).

	## Pretraining

	- Corpus: 66B tokens, 227M documents, assembled from 22 BCMS sources (FineWiki, BERTić-MaCoCu, FineWeb-2, HPLT 3.0, FinePDFs, CLASSLA web, books, news, plus others). Tiered priority, BCMS-specific quality filters (gambling/content-farm/stop-word heuristics), MinHash LSH cross-source deduplication at 0.8 Jaccard threshold.
	- Objective: Masked Language Modeling, 30% masking ratio.
	- Optimizer: AdamW, peak LR 5e-4, warmup-stable-decay schedule with ~9% decay phase.
	- Batch: 4096 sequences global, kept constant across GPU counts (strong scaling).
	- Precision: bfloat16.
	- Framework: [MosaicML Composer](https://github.com/mosaicml/composer) + [FlexBERT](https://github.com/AnswerDotAI/ModernBERT). MDS streaming dataset format with deterministic resume across the 24-hour Leonardo job limit.
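
To make the warmup-stable-decay schedule concrete, here is a small sketch using the peak LR of 5e-4 and the ~9% decay phase from the list above. The warmup fraction of 3% and the linear shape of both the warmup and the decay are assumptions and are not taken from the training config. `wsd_lr` is a hypothetical helper.

```python
def wsd_lr(step: int, total_steps: int, peak_lr: float = 5e-4,
           warmup_frac: float = 0.03, decay_frac: float = 0.09) -> float:
    """Warmup-stable-decay learning rate at a 0-indexed `step`."""
    warmup_steps = max(1, int(total_steps * warmup_frac))
    decay_steps = max(1, int(total_steps * decay_frac))
    decay_start = total_steps - decay_steps
    if step < warmup_steps:
        # Linear warmup from near zero up to the peak (assumed shape).
        return peak_lr * (step + 1) / warmup_steps
    if step < decay_start:
        # Stable phase: hold the peak learning rate.
        return peak_lr
    # Linear decay toward zero over the final `decay_steps` (assumed shape).
    remaining = max(total_steps - step, 0)
    return peak_lr * remaining / decay_steps


for s in (0, 500, 955, 1000):
    print(s, wsd_lr(s, total_steps=1000))
```

Because the stable phase holds a constant learning rate, a run can be resumed or extended before the decay phase without recomputing the schedule.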


	## Intended uses and limitations

	Intended uses. Sequence classification, token classification (NER, POS), masked language modeling, long-document understanding, and as a base model for fine-tuned retrievers and rerankers (see EmbedBERTić and RerankerBERTić, releasing soon).

	Out of scope. This is an encoder, not a generative model. It does not produce open-ended text. For text generation in BCMS, see the [national LLM initiative announced April 2026](https://www.srbija.gov.rs/) or general-purpose multilingual LLMs.

	Limitations.
	- Latin script only. Cyrillic input should be transliterated before tokenization. Raw Cyrillic falls back to byte-level encoding and burns context for no signal.
- Domain skew. Training data is heavy on web text, news, encyclopedic content, and PDFs (academic + literary). Source code, conversational chat, and highly technical scientific text are underrepresented.
- Variants. All four BCMS varieties (Bosnian, Croatian, Montenegrin, Serbian) are represented, but Croatian and Serbian dominate the corpus volume. Montenegrin in particular is upsampled 4× during mixing to compensate.

## Production note

	ModernBERTić powers production features at [Recrewty](https://recrewty.com), an AI-assisted talent management platform for the Balkans, including long-document CV understanding, psychometric inference, and the candidate retrieve-then-rerank pipeline. The model is the same artifact in production and on this card; nothing is held back.

	## Citation

```bibtex
@misc{perovic2026modernbertic,
  title  = {{ModernBERTić}: A Modern Encoder for {BCMS} Languages},
  author = {Perovic, Mitar},
  year   = {2026},
  url    = {https://huggingface.co/permitt/galton-modernbertic-large},
  note   = {Recrewty, EU-funded grant}
}
```

	## Acknowledgments

	This work was developed at [Recrewty](https://recrewty.com) as part of an EU-funded grant. Compute on Leonardo HPC was provided through the consortium grant.

	Standing on the shoulders of:
	- Nikola Ljubešić and the CLASSLA team for BERTić, BENCHić, and the broader BCMS NLP infrastructure that made this work possible.
	- The ModernBERT team (Warner et al., 2024) for the architecture and the FlexBERT codebase.
	- MosaicML / Databricks for Composer and the MDS streaming format.
	- HuggingFace for the model hub, datasets, and `tokenizers` library.
	- JeRTeh, ReLDI, and the broader Serbian NLP community for datasets and evaluation resources.
- EuroHPC and the Leonardo consortium for compute access.

## See also

	- [`permitt/galton-modernbertic-base`](https://huggingface.co/permitt/galton-modernbertic-base) - 149M parameter variant
	- [BalkanBench leaderboard](https://balkanbench.com/leaderboard) - live evaluation across BCMS encoders
	- [Build-in-public series on LinkedIn](https://linkedin.com/in/perovicmitar) - posts #0-#9 covering training data, tokenizer, distributed training, debugging, and results
	- [Medium release post](https://medium.com/@permitt/training-the-first-modern-architecture-encoder-for-south-slavic-languages-e2a11a4ead31) - long-form write-up of the model, the data pipeline, and lessons on data quality vs data quantity (link active at release)
- [All links in one place](https://permitt.io) - a single entry point to the LinkedIn posts and other material