SciLaD-M (custom)

SciLaD-M-custom is a RoBERTa-base encoder pre-trained from scratch on the English split of the SciLaD corpus of scientific publications. Both the model weights and the tokenizer are learned from SciLaD, yielding a vocabulary tailored to scientific writing. It is intended as a drop-in domain-specific replacement for RoBERTa / SciBERT for fine-tuning on scientific NLP tasks.

A sibling model, scilons/SciLaD-M-roberta, reuses the original RoBERTa tokenizer instead.

Intended use

  • Primary: fine-tuning on scientific-text tasks — NER (biomedical, CS), PICO extraction, relation classification, citation-intent and field-of-research classification, dependency parsing.
  • Also: masked-token fill-in, contextual embeddings for retrieval or clustering of scientific documents.
  • Out of scope: text generation (this is an encoder, not a generative LM), instruction following, non-English text, high-stakes clinical/diagnostic use without further validation.
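The embedding use case above amounts to pooling per-token encoder outputs into one vector per document. Below is a minimal, framework-agnostic sketch of attention-masked mean pooling; in practice you would apply the same operation to the model's last_hidden_state with tensor ops, and the inputs here are toy values, not real encoder outputs.

```python
def mean_pool(hidden, mask):
    """Average the token vectors where mask == 1 (i.e. skip padding)."""
    dim = len(hidden[0])
    total = [0.0] * dim
    count = 0
    for vec, m in zip(hidden, mask):
        if m:
            count += 1
            for i, v in enumerate(vec):
                total[i] += v
    return [t / count for t in total]

# Toy "last_hidden_state": three tokens of dimension 2, last one padding.
hidden = [[1.0, 2.0], [3.0, 4.0], [9.0, 9.0]]
mask = [1, 1, 0]
doc_embedding = mean_pool(hidden, mask)
print(doc_embedding)  # [2.0, 3.0]
```

The resulting vectors can then be compared with cosine similarity for retrieval, or fed to any standard clustering algorithm.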

How to use

from transformers import AutoTokenizer, AutoModelForMaskedLM, pipeline

tok = AutoTokenizer.from_pretrained("scilons/SciLaD-M-custom")
model = AutoModelForMaskedLM.from_pretrained("scilons/SciLaD-M-custom")

fill = pipeline("fill-mask", model=model, tokenizer=tok)
fill(f"Transformer models achieve state-of-the-art {tok.mask_token} on many NLP benchmarks.")

For fine-tuning on downstream tasks, load the model with the appropriate task head, e.g. AutoModelForTokenClassification for NER or AutoModelForSequenceClassification for classification:

from transformers import AutoModelForTokenClassification

model = AutoModelForTokenClassification.from_pretrained(
    "scilons/SciLaD-M-custom", num_labels=NUM_LABELS
)

Training data

Pre-trained on the English clean plain-text split of SciLaD, published on the Hugging Face Hub as scilons/SciLaD-all-text-v1: 10.99M scientific publications (68.7B tokens, avg. 6k tokens/doc) harvested from open-access sources via an entirely open-source pipeline:

  • Unpaywall snapshot (14 June 2023) → PDFs → TEI XML with Grobid
  • arXiv → LaTeX → TEI XML with LaTeXML
  • PMC / PLOS → JATS XML → TEI XML with Pub2TEI
  • Language filtering (fastText language identification, weighted English score ≥ 0.8), Gopher quality heuristics, and MinHash near-duplicate removal (word 5-gram shingles, 112 hash functions, 75% Jaccard threshold), all run with Datatrove.

See the paper for full dataset statistics and the open pipeline.
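The near-duplicate removal step can be illustrated with a self-contained MinHash sketch using the same parameters as the pipeline (word 5-grams, 112 hash functions, 75% Jaccard threshold). The production pipeline uses Datatrove; this toy version only shows the estimation idea, and the seeded md5 hash is an illustrative stand-in for real MinHash permutations.

```python
import hashlib

NUM_HASHES = 112   # hash functions per signature, as in the pipeline
NGRAM = 5          # word 5-gram shingles
THRESHOLD = 0.75   # Jaccard similarity above which docs count as near-duplicates

def shingles(text, n=NGRAM):
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

def hashed(seed, shingle):
    # Deterministic "seeded" hash built from md5, one per hash function.
    return int(hashlib.md5(f"{seed}:{shingle}".encode()).hexdigest(), 16)

def minhash(text):
    sh = shingles(text)
    return [min(hashed(seed, s) for s in sh) for seed in range(NUM_HASHES)]

def estimated_jaccard(sig_a, sig_b):
    # Fraction of matching minima estimates the true Jaccard similarity.
    return sum(a == b for a, b in zip(sig_a, sig_b)) / NUM_HASHES

doc = " ".join(f"token{i}" for i in range(40))
near_dup = doc + " token40"                     # same doc plus one extra word
unrelated = " ".join(f"other{i}" for i in range(40))

print(estimated_jaccard(minhash(doc), minhash(near_dup)) >= THRESHOLD)   # True
print(estimated_jaccard(minhash(doc), minhash(unrelated)) >= THRESHOLD)  # False
```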

Training procedure

  • Architecture: RoBERTa-base — 12 Transformer layers, ~110M parameters.
  • Objective: Masked Language Modeling (MLM).
  • Tokenizer: Byte-pair encoding trained from scratch on SciLaD; vocabulary size 50,265 (matching the original RoBERTa tokenizer for comparability).
  • Batching: per-device batch size 256 × 2 gradient-accumulation steps × 2 GPUs → effective batch size 1024.
  • Hardware: 2 × NVIDIA H200 80GB.
  • Data split: 80 / 10 / 10 (train / validation / test) at document level.
  • Hyperparameters: otherwise as in the original RoBERTa configuration.
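Since the configuration otherwise follows RoBERTa, the MLM objective uses the standard corruption recipe: 15% of tokens are selected for prediction; of those, 80% are replaced by the mask token, 10% by a random vocabulary token, and 10% left unchanged. A minimal sketch with toy token ids (the mask id below is illustrative, not the real tokenizer's):

```python
import random

MASK_ID = 4          # illustrative mask-token id, not the real one
VOCAB_SIZE = 50265   # vocabulary size stated above

def mlm_corrupt(token_ids, rng, mask_prob=0.15):
    inputs, labels = list(token_ids), []
    for i, tid in enumerate(token_ids):
        if rng.random() < mask_prob:
            labels.append(tid)                       # model must recover this token
            r = rng.random()
            if r < 0.8:
                inputs[i] = MASK_ID                  # 80%: mask token
            elif r < 0.9:
                inputs[i] = rng.randrange(VOCAB_SIZE)  # 10%: random token
            # remaining 10%: token kept unchanged
        else:
            labels.append(-100)                      # -100 is ignored by the loss
    return inputs, labels

rng = random.Random(0)
inputs, labels = mlm_corrupt(list(range(100, 200)), rng)
```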

Evaluation

Fine-tuned on 11 scientific-NLP benchmarks (5 task families); F1 unless noted, averaged over 3 seeds. A compact view of SciLaD-M-custom vs. SciBERT and SciLaD-M-roberta:

| Task | Dataset | SciLaD-M-custom | SciLaD-M-roberta | SciBERT |
|------|---------|-----------------|------------------|---------|
| NER  | BC5CDR | 97.20 | 96.54 | 95.07 |
| NER  | JNLPBA | 93.89 | 93.50 | 94.19 |
| NER  | NCBI-disease | 91.48 | 91.63 | 87.09 |
| NER  | SciERC | 41.03 | 42.49 | 54.39 |
| PICO | EBM-NLP | 77.66 | 78.47 | 78.14 |
| REL  | ChemProt (micro) | 83.83 | 82.64 | 82.83 |
| REL  | SciERC | 81.23 | 72.10 | 77.10 |
| CLS  | CitationIntent | 64.46 | 59.08 | 47.33 |
| CLS  | MAG | 73.01 | 73.59 | 74.61 |
| CLS  | SciCite | 83.96 | 80.75 | 85.49 |
| DEP  | GENIA (LAS) | 53.33 | 53.23 | 36.31 |
| DEP  | GENIA (UAS) | 57.88 | 58.57 | 41.68 |
| Avg. |  | 74.91 | 73.55 | 71.19 |

This variant is the highest-scoring of the compared models on BC5CDR, ChemProt, SciERC (REL), CitationIntent, and GENIA (LAS), and has the best overall average (74.91) — ahead of SciBERT (71.19), RoBERTa (65.50), ModernBERT (65.51), and BERT (59.47). Full per-task numbers (including BERT, RoBERTa, and ModernBERT baselines with standard deviations) are in the paper.
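The reported averages are plain macro-averages over the twelve table rows, equally weighted; e.g. for the SciLaD-M-custom column:

```python
# Reproduce the reported macro-average for SciLaD-M-custom from the
# per-task scores in the table above.
scores = [97.20, 93.89, 91.48, 41.03, 77.66, 83.83,
          81.23, 64.46, 73.01, 83.96, 53.33, 57.88]
average = sum(scores) / len(scores)
print(round(average, 2))  # 74.91
```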

Limitations and biases

  • English only. The pre-training corpus was filtered to English; performance on other languages is not guaranteed.
  • Domain skew. The open-access corpus is heavily biomedical / life-sciences, with smaller but meaningful coverage of physical sciences, CS, and social sciences. Expect stronger performance on biomedical text than on, e.g., pure-CS or humanities benchmarks (visible in the lower SciERC-NER score).
  • No alignment. The model has not been instruction- or safety-tuned. It reflects the distribution of the scientific literature it was trained on, including any biases therein.
  • Encoder only. Not suitable for open-ended generation.

License

Released under the Apache License 2.0.

Citation

If you use this model, please cite the accompanying paper:

@misc{foppiano2025scilad,
  title        = {SciLaD: A Large-Scale, Transparent, Reproducible Dataset for Natural Scientific Language Processing},
  author       = {Foppiano, Luca and Takeshita, Sotaro and Ortiz Suarez, Pedro and
                  Borisova, Ekaterina and Abu Ahmad, Raia and Ostendorff, Malte and
                  Barth, Fabio and Moreno-Schneider, Julian and Rehm, Georg},
  year         = {2025},
  eprint       = {2512.11192},
  archivePrefix= {arXiv},
  url          = {https://arxiv.org/abs/2512.11192}
}

Acknowledgements

Developed by researchers affiliated with DFKI, Humboldt-Universität zu Berlin, ScienciaLAB, Common Crawl Foundation, and the University of Mannheim. Supported by NFDI4DS (DFG project 460234259) and the EU Horizon Europe project HIVEMIND (No. 101189745). Compute provided by ACCESS / JetStream 2 (allocation CIS220172), TACC, and the University of Mannheim / DFKI.
