SciLaD-M (custom)

SciLaD-M-custom is a RoBERTa-base encoder pre-trained from scratch on the English split of the SciLaD corpus of scientific publications. Both the model weights and the tokenizer are learned from SciLaD, yielding a vocabulary tailored to scientific writing. It is intended as a drop-in domain-specific replacement for RoBERTa / SciBERT for fine-tuning on scientific NLP tasks.

A sibling model, scilons/SciLaD-M-roberta, reuses the original RoBERTa tokenizer instead.

Intended use

  • Primary: fine-tuning on scientific-text tasks — NER (biomedical, CS), PICO extraction, relation classification, citation-intent and field-of-research classification, dependency parsing.
  • Also: masked-token fill-in, contextual embeddings for retrieval or clustering of scientific documents.
  • Out of scope: text generation (this is an encoder, not a generative LM), instruction following, non-English text, high-stakes clinical/diagnostic use without further validation.
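The embedding use case above amounts to pooling per-token encoder outputs into one vector per document. Below is a minimal, framework-agnostic sketch of attention-masked mean pooling; in practice you would apply the same operation to the model's last_hidden_state with tensor ops, and the inputs here are toy values, not real encoder outputs.

```python
def mean_pool(hidden, mask):
    """Average the token vectors where mask == 1 (i.e. skip padding)."""
    dim = len(hidden[0])
    total = [0.0] * dim
    count = 0
    for vec, m in zip(hidden, mask):
        if m:
            count += 1
            for i, v in enumerate(vec):
                total[i] += v
    return [t / count for t in total]

# Toy "last_hidden_state": three tokens of dimension 2, last one padding.
hidden = [[1.0, 2.0], [3.0, 4.0], [9.0, 9.0]]
mask = [1, 1, 0]
doc_embedding = mean_pool(hidden, mask)
print(doc_embedding)  # [2.0, 3.0]
```

The resulting vectors can then be compared with cosine similarity for retrieval, or fed to any standard clustering algorithm.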

How to use

from transformers import AutoTokenizer, AutoModelForMaskedLM, pipeline

tok = AutoTokenizer.from_pretrained("scilons/SciLaD-M-custom")
model = AutoModelForMaskedLM.from_pretrained("scilons/SciLaD-M-custom")

fill = pipeline("fill-mask", model=model, tokenizer=tok)
fill(f"Transformer models achieve state-of-the-art {tok.mask_token} on many NLP benchmarks.")

For fine-tuning on downstream tasks, load the model with the appropriate task head, e.g. AutoModelForTokenClassification for NER or AutoModelForSequenceClassification for classification:

from transformers import AutoModelForTokenClassification

model = AutoModelForTokenClassification.from_pretrained(
    "scilons/SciLaD-M-custom", num_labels=NUM_LABELS
)

Training data

Pre-trained on the English clean plain-text split of SciLaD, published on the Hugging Face Hub as scilons/SciLaD-all-text-v1: 10.99M scientific publications (68.7B tokens, avg. 6k tokens/doc) harvested from open-access sources via an entirely open-source pipeline:

  • Unpaywall snapshot (14 June 2023) → PDFs → TEI XML with Grobid
  • arXiv → LaTeX → TEI XML with LaTeXML
  • PMC / PLOS → JATS XML → TEI XML with Pub2TEI
  • Language filtering (fastText language identification, weighted English score ≥ 0.8), Gopher quality heuristics, and MinHash near-duplicate removal (word 5-gram shingles, 112 hash functions, 75% Jaccard threshold), all run with Datatrove.

See the paper for full dataset statistics and the open pipeline.
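The near-duplicate removal step can be illustrated with a self-contained MinHash sketch using the same parameters as the pipeline (word 5-grams, 112 hash functions, 75% Jaccard threshold). The production pipeline uses Datatrove; this toy version only shows the estimation idea, and the seeded md5 hash is an illustrative stand-in for real MinHash permutations.

```python
import hashlib

NUM_HASHES = 112   # hash functions per signature, as in the pipeline
NGRAM = 5          # word 5-gram shingles
THRESHOLD = 0.75   # Jaccard similarity above which docs count as near-duplicates

def shingles(text, n=NGRAM):
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

def hashed(seed, shingle):
    # Deterministic "seeded" hash built from md5, one per hash function.
    return int(hashlib.md5(f"{seed}:{shingle}".encode()).hexdigest(), 16)

def minhash(text):
    sh = shingles(text)
    return [min(hashed(seed, s) for s in sh) for seed in range(NUM_HASHES)]

def estimated_jaccard(sig_a, sig_b):
    # Fraction of matching minima estimates the true Jaccard similarity.
    return sum(a == b for a, b in zip(sig_a, sig_b)) / NUM_HASHES

doc = " ".join(f"token{i}" for i in range(40))
near_dup = doc + " token40"                     # same doc plus one extra word
unrelated = " ".join(f"other{i}" for i in range(40))

print(estimated_jaccard(minhash(doc), minhash(near_dup)) >= THRESHOLD)   # True
print(estimated_jaccard(minhash(doc), minhash(unrelated)) >= THRESHOLD)  # False
```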

Training procedure

  • Architecture: RoBERTa-base — 12 Transformer layers, ~110M parameters.
  • Objective: Masked Language Modeling (MLM).
  • Tokenizer: Byte-pair encoding trained from scratch on SciLaD; vocabulary size 50,265 (matching the original RoBERTa tokenizer for comparability).
  • Batching: per-device batch size 256 × 2 gradient-accumulation steps × 2 GPUs → effective batch size 1024.
  • Hardware: 2 × NVIDIA H200 80GB.
  • Data split: 80 / 10 / 10 (train / validation / test) at document level.
  • Hyperparameters: otherwise as in the original RoBERTa configuration.
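Since the configuration otherwise follows RoBERTa, the MLM objective uses the standard corruption recipe: 15% of tokens are selected for prediction; of those, 80% are replaced by the mask token, 10% by a random vocabulary token, and 10% left unchanged. A minimal sketch with toy token ids (the mask id below is illustrative, not the real tokenizer's):

```python
import random

MASK_ID = 4          # illustrative mask-token id, not the real one
VOCAB_SIZE = 50265   # vocabulary size stated above

def mlm_corrupt(token_ids, rng, mask_prob=0.15):
    inputs, labels = list(token_ids), []
    for i, tid in enumerate(token_ids):
        if rng.random() < mask_prob:
            labels.append(tid)                       # model must recover this token
            r = rng.random()
            if r < 0.8:
                inputs[i] = MASK_ID                  # 80%: mask token
            elif r < 0.9:
                inputs[i] = rng.randrange(VOCAB_SIZE)  # 10%: random token
            # remaining 10%: token kept unchanged
        else:
            labels.append(-100)                      # -100 is ignored by the loss
    return inputs, labels

rng = random.Random(0)
inputs, labels = mlm_corrupt(list(range(100, 200)), rng)
```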

Evaluation

Fine-tuned on 11 scientific-NLP benchmarks (5 task families); F1 unless noted, averaged over 3 seeds. A compact view of SciLaD-M-custom vs. SciBERT and SciLaD-M-roberta:

| Task | Dataset | SciLaD-M-custom | SciLaD-M-roberta | SciBERT |
|------|---------|-----------------|------------------|---------|
| NER  | BC5CDR | 97.20 | 96.54 | 95.07 |
| NER  | JNLPBA | 93.89 | 93.50 | 94.19 |
| NER  | NCBI-disease | 91.48 | 91.63 | 87.09 |
| NER  | SciERC | 41.03 | 42.49 | 54.39 |
| PICO | EBM-NLP | 77.66 | 78.47 | 78.14 |
| REL  | ChemProt (micro) | 83.83 | 82.64 | 82.83 |
| REL  | SciERC | 81.23 | 72.10 | 77.10 |
| CLS  | CitationIntent | 64.46 | 59.08 | 47.33 |
| CLS  | MAG | 73.01 | 73.59 | 74.61 |
| CLS  | SciCite | 83.96 | 80.75 | 85.49 |
| DEP  | GENIA (LAS) | 53.33 | 53.23 | 36.31 |
| DEP  | GENIA (UAS) | 57.88 | 58.57 | 41.68 |
| Avg. |  | 74.91 | 73.55 | 71.19 |

This variant is the highest-scoring of the compared models on BC5CDR, ChemProt, SciERC (REL), CitationIntent, and GENIA (LAS), and has the best overall average (74.91) — ahead of SciBERT (71.19), RoBERTa (65.50), ModernBERT (65.51), and BERT (59.47). Full per-task numbers (including BERT, RoBERTa, and ModernBERT baselines with standard deviations) are in the paper.
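The reported averages are plain macro-averages over the twelve table rows, equally weighted; e.g. for the SciLaD-M-custom column:

```python
# Reproduce the reported macro-average for SciLaD-M-custom from the
# per-task scores in the table above.
scores = [97.20, 93.89, 91.48, 41.03, 77.66, 83.83,
          81.23, 64.46, 73.01, 83.96, 53.33, 57.88]
average = sum(scores) / len(scores)
print(round(average, 2))  # 74.91
```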

Limitations and biases

  • English only. The pre-training corpus was filtered to English; performance on other languages is not guaranteed.
  • Domain skew. The open-access corpus is heavily biomedical / life-sciences, with smaller but meaningful coverage of physical sciences, CS, and social sciences. Expect stronger performance on biomedical text than on, e.g., pure-CS or humanities benchmarks (visible in the lower SciERC-NER score).
  • No alignment. The model has not been instruction- or safety-tuned. It reflects the distribution of the scientific literature it was trained on, including any biases therein.
  • Encoder only. Not suitable for open-ended generation.

License

Released under the Apache License 2.0.

Citation

If you use this model, please cite the accompanying paper:

@misc{foppiano2025scilad,
  title        = {SciLaD: A Large-Scale, Transparent, Reproducible Dataset for Natural Scientific Language Processing},
  author       = {Foppiano, Luca and Takeshita, Sotaro and Ortiz Suarez, Pedro and
                  Borisova, Ekaterina and Abu Ahmad, Raia and Ostendorff, Malte and
                  Barth, Fabio and Moreno-Schneider, Julian and Rehm, Georg},
  year         = {2025},
  eprint       = {2512.11192},
  archivePrefix= {arXiv},
  url          = {https://arxiv.org/abs/2512.11192}
}

Acknowledgements

Developed by researchers affiliated with DFKI, Humboldt-Universität zu Berlin, ScienciaLAB, Common Crawl Foundation, and the University of Mannheim. Supported by NFDI4DS (DFG project 460234259) and the EU Horizon Europe project HIVEMIND (No. 101189745). Compute provided by ACCESS / JetStream 2 (allocation CIS220172), TACC, and the University of Mannheim / DFKI.
