# SciLaD-M (custom)
SciLaD-M-custom is a RoBERTa-base encoder pre-trained from scratch on the English split of the SciLaD corpus of scientific publications. Both the model weights and the tokenizer are learned from SciLaD, yielding a vocabulary tailored to scientific writing. It is intended as a drop-in domain-specific replacement for RoBERTa / SciBERT for fine-tuning on scientific NLP tasks.
A sibling model, scilons/SciLaD-M-roberta, reuses the original RoBERTa tokenizer instead.
## Intended use
- Primary: fine-tuning on scientific-text tasks — NER (biomedical, CS), PICO extraction, relation classification, citation-intent and field-of-research classification, dependency parsing.
- Also: masked-token fill-in, contextual embeddings for retrieval or clustering of scientific documents.
- Out of scope: text generation (this is an encoder, not a generative LM), instruction following, non-English text, high-stakes clinical/diagnostic use without further validation.
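For the retrieval/clustering use case above, the usual recipe is to mean-pool the encoder's token-level hidden states over non-padding positions into one vector per document. A minimal pure-Python sketch of that pooling step (the `mean_pool` helper and the toy values are illustrative, not part of the model's API):

```python
def mean_pool(hidden_states, attention_mask):
    """Average token vectors, skipping padding positions (mask == 0)."""
    dim = len(hidden_states[0])
    total, count = [0.0] * dim, 0
    for vec, keep in zip(hidden_states, attention_mask):
        if keep:
            total = [t + v for t, v in zip(total, vec)]
            count += 1
    return [t / count for t in total]

# Toy example: three 2-dimensional token vectors, last position is padding.
states = [[1.0, 2.0], [3.0, 4.0], [9.0, 9.0]]
mask = [1, 1, 0]
print(mean_pool(states, mask))  # [2.0, 3.0]
```

In practice the `hidden_states` would come from `AutoModel.from_pretrained("scilons/SciLaD-M-custom")` and the mask from the tokenizer; the pooled vectors can then be compared with cosine similarity.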
## How to use

```python
from transformers import AutoTokenizer, AutoModelForMaskedLM, pipeline

tok = AutoTokenizer.from_pretrained("scilons/SciLaD-M-custom")
model = AutoModelForMaskedLM.from_pretrained("scilons/SciLaD-M-custom")

fill = pipeline("fill-mask", model=model, tokenizer=tok)
fill(f"Transformer models achieve state-of-the-art {tok.mask_token} on many NLP benchmarks.")
```
For fine-tuning on downstream tasks, load the appropriate task head, e.g. `AutoModelForTokenClassification` for NER or `AutoModelForSequenceClassification` for classification:

```python
from transformers import AutoModelForTokenClassification

model = AutoModelForTokenClassification.from_pretrained(
    "scilons/SciLaD-M-custom", num_labels=NUM_LABELS  # NUM_LABELS: label count for your task
)
```
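For token-level tasks such as NER, word-level labels must also be aligned to the subword tokens the tokenizer produces: by convention only the first subword of each word keeps the label, and continuations and special tokens get `-100` so the loss ignores them. A minimal sketch of that alignment, assuming a `word_ids` sequence like the one `tokenizer(..., is_split_into_words=True).word_ids()` returns (the helper name and toy values are illustrative):

```python
IGNORE = -100  # index ignored by the cross-entropy loss in transformers

def align_labels(word_ids, word_labels):
    """Map word-level labels to token level: first subword keeps the
    word's label; continuations and special tokens get IGNORE."""
    aligned, prev = [], None
    for wid in word_ids:
        if wid is None:            # special token (<s>/</s>-style)
            aligned.append(IGNORE)
        elif wid != prev:          # first subword of a new word
            aligned.append(word_labels[wid])
        else:                      # continuation subword
            aligned.append(IGNORE)
        prev = wid
    return aligned

# Toy 3-word sentence with labels [1, 0, 1]; the first word splits in two.
print(align_labels([None, 0, 0, 1, 2, None], [1, 0, 1]))
# [-100, 1, -100, 0, 1, -100]
```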
## Training data
Pre-trained on the English clean plain-text split of SciLaD, published on the Hugging Face Hub as scilons/SciLaD-all-text-v1: 10.99M scientific publications (68.7B tokens, avg. 6k tokens/doc) harvested from open-access sources via an entirely open-source pipeline:
- Unpaywall snapshot (14 June 2023) → PDFs → TEI XML with Grobid
- arXiv → LaTeX → TEI XML with LaTeXML
- PMC / PLOS → JATS XML → TEI XML with Pub2TEI
- Language filtering (weighted fastText ≥ 0.8 English), Gopher quality heuristics, and MinHash near-duplicate removal (5-grams, 112 hash functions, 75% Jaccard threshold) via Datatrove.
See the paper for full dataset statistics and the open pipeline.
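The MinHash deduplication step can be illustrated with a small pure-Python sketch: each document is reduced to its set of word 5-grams, a fixed bank of seeded hash functions keeps the minimum hash per function, and the fraction of matching minima between two signatures estimates their Jaccard similarity. The parameter values below mirror the pipeline description (5-grams, 112 hashes, 0.75 threshold), but the implementation itself is illustrative, not Datatrove's:

```python
import hashlib

N_HASHES, NGRAM, THRESHOLD = 112, 5, 0.75

def shingles(text, n=NGRAM):
    """Set of word n-grams for one document."""
    words = text.split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

def h(seed, s):
    """Deterministic 64-bit hash of shingle s under a given seed."""
    return int.from_bytes(hashlib.sha1(f"{seed}:{s}".encode()).digest()[:8], "big")

def minhash(text):
    """Signature: minimum hash of the shingle set per seeded function."""
    sh = shingles(text)
    return [min(h(seed, s) for s in sh) for seed in range(N_HASHES)]

def est_jaccard(sig_a, sig_b):
    """Fraction of agreeing minima approximates Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / N_HASHES

doc = "the transformer architecture relies entirely on attention mechanisms for sequence modeling tasks"
near_dup = doc + " today"

print(est_jaccard(minhash(doc), minhash(doc)))      # 1.0 for identical docs
print(est_jaccard(minhash(doc), minhash(near_dup)))  # high for near-duplicates
```

Documents whose estimated similarity exceeds `THRESHOLD` would be treated as near-duplicates and one copy dropped.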
## Training procedure
- Architecture: RoBERTa-base — 12 Transformer layers, ~110M parameters.
- Objective: Masked Language Modeling (MLM).
- Tokenizer: Byte-pair encoding trained from scratch on SciLaD; vocabulary size 50,265 (matching the original RoBERTa tokenizer for comparability).
- Batching: per-device batch size 256 × 2 GPUs × 2 gradient-accumulation steps → effective batch size 1024.
- Hardware: 2 × NVIDIA H200 80GB.
- Data split: 80 / 10 / 10 (train / validation / test) at document level.
- Hyperparameters: otherwise as in the original RoBERTa configuration.
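The MLM objective selects roughly 15% of the tokens in each sequence and, following standard BERT/RoBERTa practice, replaces 80% of the selected tokens with the mask token, 10% with a random vocabulary token, and leaves 10% unchanged; the labels hold the original ids at selected positions and `-100` elsewhere. A minimal pure-Python sketch of this dynamic masking (the token ids and mask id are toy values; vocabulary size matches the tokenizer's 50,265):

```python
import random

MASK_ID, VOCAB_SIZE, IGNORE = 4, 50265, -100

def mask_tokens(ids, rng, p=0.15):
    """Return (corrupted_inputs, labels) for one sequence of token ids."""
    inputs, labels = list(ids), [IGNORE] * len(ids)
    for i, tok in enumerate(ids):
        if rng.random() >= p:
            continue                 # position not selected for prediction
        labels[i] = tok              # model must predict the original token
        roll = rng.random()
        if roll < 0.8:
            inputs[i] = MASK_ID      # 80%: replace with <mask>
        elif roll < 0.9:
            inputs[i] = rng.randrange(VOCAB_SIZE)  # 10%: random token
        # else: 10% keep the original token unchanged
    return inputs, labels

rng = random.Random(0)
ids = list(range(100, 132))          # a toy 32-token sequence
inputs, labels = mask_tokens(ids, rng)
print(sum(l != IGNORE for l in labels))  # positions contributing to the loss
```

Because masking is applied on the fly, each epoch sees a different corruption of the same sequence (RoBERTa-style dynamic masking).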
## Evaluation
Fine-tuned on 11 scientific-NLP benchmarks (5 task families); F1 unless noted, averaged over 3 seeds. A compact view of SciLaD-M-custom vs. SciBERT and SciLaD-M-roberta:
| Task | Dataset | SciLaD-M-custom | SciLaD-M-roberta | SciBERT |
|---|---|---|---|---|
| NER | BC5CDR | 97.20 | 96.54 | 95.07 |
| NER | JNLPBA | 93.89 | 93.50 | 94.19 |
| NER | NCBI-disease | 91.48 | 91.63 | 87.09 |
| NER | SciERC | 41.03 | 42.49 | 54.39 |
| PICO | EBM-NLP | 77.66 | 78.47 | 78.14 |
| REL | ChemProt (micro) | 83.83 | 82.64 | 82.83 |
| REL | SciERC | 81.23 | 72.10 | 77.10 |
| CLS | CitationIntent | 64.46 | 59.08 | 47.33 |
| CLS | MAG | 73.01 | 73.59 | 74.61 |
| CLS | SciCite | 83.96 | 80.75 | 85.49 |
| DEP | GENIA (LAS) | 53.33 | 53.23 | 36.31 |
| DEP | GENIA (UAS) | 57.88 | 58.57 | 41.68 |
| Avg. | | 74.91 | 73.55 | 71.19 |
This variant is the highest-scoring of the compared models on BC5CDR, ChemProt, SciERC (REL), CitationIntent, and GENIA (LAS), and has the best overall average (74.91) — ahead of SciBERT (71.19), RoBERTa (65.50), ModernBERT (65.51), and BERT (59.47). Full per-task numbers (including BERT, RoBERTa, and ModernBERT baselines with standard deviations) are in the paper.
## Limitations and biases
- English only. The pre-training corpus was filtered to English; performance on other languages is not guaranteed.
- Domain skew. The open-access corpus is heavily biomedical / life-sciences, with smaller but meaningful coverage of physical sciences, CS, and social sciences. Expect stronger performance on biomedical text than on, e.g., pure-CS or humanities benchmarks (visible in the lower SciERC-NER score).
- No alignment. The model has not been instruction- or safety-tuned. It reflects the distribution of the scientific literature it was trained on, including any biases therein.
- Encoder only. Not suitable for open-ended generation.
## License
Released under the Apache License 2.0.
## Citation
If you use this model, please cite the accompanying paper:
```bibtex
@misc{foppiano2025scilad,
  title         = {SciLaD: A Large-Scale, Transparent, Reproducible Dataset for Natural Scientific Language Processing},
  author        = {Foppiano, Luca and Takeshita, Sotaro and Ortiz Suarez, Pedro and
                   Borisova, Ekaterina and Abu Ahmad, Raia and Ostendorff, Malte and
                   Barth, Fabio and Moreno-Schneider, Julian and Rehm, Georg},
  year          = {2025},
  eprint        = {2512.11192},
  archivePrefix = {arXiv},
  url           = {https://arxiv.org/abs/2512.11192}
}
```
## Acknowledgements
Developed by researchers affiliated with DFKI, Humboldt-Universität zu Berlin, ScienciaLAB, Common Crawl Foundation, and the University of Mannheim. Supported by NFDI4DS (DFG project 460234259) and the EU Horizon Europe project HIVEMIND (No. 101189745). Compute provided by ACCESS / JetStream 2 (allocation CIS220172), TACC, and the University of Mannheim / DFKI.