# ChEMU NER (BioBERT)

A `BertForTokenClassification` model fine-tuned on the ChEMU 2020 Task 1 chemical-patent NER corpus. Given a reaction description, it identifies 10 types of reaction-step entities (reactants, catalysts, solvents, products, conditions, yields, and labels).

Base encoder: `dmis-lab/biobert-base-cased-v1.2`
## Results

Held-out evaluation on the official ChEMU 2020 NER dev split (225 documents, 3,843 entities), exact-match micro-F1:
| Entity type | P | R | F1 | N |
|---|---|---|---|---|
| STARTING_MATERIAL | .8647 | .9128 | .8881 | 413 |
| REAGENT_CATALYST | .9085 | .8927 | .9005 | 289 |
| REACTION_PRODUCT | .9406 | .9704 | .9553 | 506 |
| SOLVENT | .9451 | .9640 | .9545 | 250 |
| OTHER_COMPOUND | .9703 | .9676 | .9689 | 1080 |
| TEMPERATURE | .9744 | .9884 | .9813 | 346 |
| TIME | .9804 | .9921 | .9862 | 252 |
| YIELD_PERCENT | 1.000 | 1.000 | 1.000 | 228 |
| YIELD_OTHER | .9811 | .9923 | .9867 | 261 |
| EXAMPLE_LABEL | .9862 | .9862 | .9862 | 218 |
| MICRO | .9527 | .9644 | .9585 | 3843 |
For reference, the official BANNER baseline scores 0.8893 exact-match F1 on the same task, so this model improves on the baseline by 6.9 points.
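As a quick consistency check, the F1 in the MICRO row is the harmonic mean of the micro precision and recall reported next to it:

```python
p, r = 0.9527, 0.9644     # micro precision / recall from the table
f1 = 2 * p * r / (p + r)  # F1 is the harmonic mean of P and R
print(round(f1, 4))       # 0.9585, matching the MICRO row
```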
## Entity types
| Label | Role | Examples |
|---|---|---|
| STARTING_MATERIAL | reactant providing the core skeleton | aniline, benzyl bromide |
| REAGENT_CATALYST | reagent, catalyst, base, oxidant, reductant | sodium hydride, DIPEA |
| REACTION_PRODUCT | target product of the reaction | tert-butyl 2-(4-pyridyl)pyrrolidine-1-carboxylate |
| SOLVENT | reaction / extraction medium | THF, dioxane, acetonitrile |
| OTHER_COMPOUND | auxiliary: brines, drying agents, washes, by-products | brine, celite, ethyl acetate |
| TEMPERATURE | reaction temperature or range | 50 °C, room temperature |
| TIME | elapsed reaction time | 2 h, overnight, 30 min |
| YIELD_PERCENT | yield expressed as a percentage | 56%, quantitative |
| YIELD_OTHER | yield expressed as mass or moles | 1.30 g, 2.5 mmol |
| EXAMPLE_LABEL | compound / example identifiers | Example 5, (1), 14 |
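With 10 entity types, the classifier head predicts 21 classes under the conventional BIO scheme (a `B-` and `I-` tag per type, plus `O`). Assuming that convention holds here (the authoritative id-to-label mapping lives in the model's `config.json`), the label set can be generated as:

```python
# The 10 ChEMU entity types from the table above.
ENTITY_TYPES = [
    "STARTING_MATERIAL", "REAGENT_CATALYST", "REACTION_PRODUCT",
    "SOLVENT", "OTHER_COMPOUND", "TEMPERATURE", "TIME",
    "YIELD_PERCENT", "YIELD_OTHER", "EXAMPLE_LABEL",
]

# Standard BIO tagging: one B-/I- pair per type, plus the outside tag.
LABELS = ["O"] + [f"{p}-{t}" for t in ENTITY_TYPES for p in ("B", "I")]
print(len(LABELS))  # 21
```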
## Usage

### High-level: HuggingFace pipeline

```python
from transformers import pipeline

ner = pipeline(
    "token-classification",
    model="mpkato/chemu-biobert-ner",
    aggregation_strategy="simple",
)

text = (
    "Under blue LED light, N-Boc-pyrrolidine was coupled with "
    "4-cyanopyridine in acetonitrile using [Ru(bpy)3]Cl2 as the "
    "photocatalyst and DIPEA as the reductant to afford tert-butyl "
    "2-(4-pyridyl)pyrrolidine-1-carboxylate."
)

for ent in ner(text):
    print(f"{ent['entity_group']:20s} {ent['start']:4d}-{ent['end']:4d} {ent['word']}")
```
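The flat entity list is often more useful regrouped into a per-reaction record. A minimal sketch, operating on the `entity_group`/`start`/`end` dicts the pipeline returns (mocked here, since running the model requires downloading it):

```python
from collections import defaultdict

def to_record(text, entities):
    """Group pipeline entities into {label: [surface strings]}."""
    record = defaultdict(list)
    for ent in entities:
        # Slice the original text so spacing is exact, rather than
        # relying on the wordpiece-joined `word` field.
        record[ent["entity_group"]].append(text[ent["start"]:ent["end"]])
    return dict(record)

# Mocked pipeline output for a fragment of the example sentence.
text = "... DIPEA as the reductant ..."
entities = [{"entity_group": "REAGENT_CATALYST", "start": 4, "end": 9}]
print(to_record(text, entities))  # {'REAGENT_CATALYST': ['DIPEA']}
```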
### Handling long patents

The model has a 512-token positional limit (inherited from BERT). For patent paragraphs longer than that, enable the pipeline's built-in chunking or split the text yourself. A typical pattern:

```python
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

tok = AutoTokenizer.from_pretrained("mpkato/chemu-biobert-ner")
model = AutoModelForTokenClassification.from_pretrained(
    "mpkato/chemu-biobert-ner"
).eval()

enc = tok(
    long_text,
    return_offsets_mapping=True,
    return_overflowing_tokens=True,
    max_length=512,
    stride=64,
    truncation=True,
    return_tensors="pt",
)

with torch.no_grad():
    logits = model(
        input_ids=enc["input_ids"],
        attention_mask=enc["attention_mask"],
    ).logits
tags = logits.argmax(dim=-1)  # (num_windows, seq_len)

# Then walk enc["offset_mapping"] to recover entity spans in the
# original text; take care to dedupe entities that appear in the
# overlapping regions of two windows.
```
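The span-recovery step at the end can be sketched in plain Python. Assuming BIO labels per token and the `(start, end)` character offsets that `return_offsets_mapping` yields per window, spans seen twice in overlapping windows collapse to one because they share the same character range:

```python
def decode_spans(windows, text):
    """windows: list of (offsets, labels) per chunk, where offsets are
    (start, end) character pairs and labels are BIO strings such as
    "B-SOLVENT". Returns deduplicated (start, end, type, surface) tuples."""
    spans = set()
    for offsets, labels in windows:
        cur = None  # (start, end, type) of the span being built
        for (s, e), lab in zip(offsets, labels):
            if s == e:  # special tokens ([CLS]/[SEP]) have empty offsets
                continue
            if lab.startswith("B-"):
                if cur:
                    spans.add(cur)
                cur = (s, e, lab[2:])
            elif lab.startswith("I-") and cur and lab[2:] == cur[2]:
                cur = (cur[0], e, cur[2])  # extend the open span
            else:
                if cur:
                    spans.add(cur)
                cur = None
        if cur:
            spans.add(cur)
    # Overlapping windows yield identical (start, end, type) triples,
    # so the set removes the duplicates.
    return sorted((s, e, t, text[s:e]) for s, e, t in spans)

# Two overlapping windows over one sentence; "THF" appears in both.
text = "dissolved in THF at 50 C"
w1 = ([(0, 9), (10, 12), (13, 16)], ["O", "O", "B-SOLVENT"])
w2 = ([(13, 16), (17, 19), (20, 22), (23, 24)],
      ["B-SOLVENT", "O", "B-TEMPERATURE", "I-TEMPERATURE"])
print(decode_spans([w1, w2], text))
# → [(13, 16, 'SOLVENT', 'THF'), (20, 24, 'TEMPERATURE', '50 C')]
```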
## Training
- Data: ChEMU 2020 Task 1 public release (train 900 docs, dev 225 docs; CC BY-NC 3.0).
- Internal split: the 900 train docs are split 90 / 10 with a fixed seed into 810 training docs and 90 internal validation docs; the dev set is kept as a clean held-out evaluation set (never seen during training or model selection).
- Optimizer: AdamW, learning rate 5e-5 (BERT body) / 5e-4 (classifier head), weight decay 0.01, linear warm-up 10%, gradient clipping 1.0.
- Schedule: batch size 16, max sequence length 512, stride 64, dropout 0.2, up to 15 epochs with early stopping (patience = 3) on the internal validation F1. Best epoch: 8.
- Hardware: single NVIDIA RTX A6000 (48 GB). Training runs in under 10 minutes.
The tokenizer and training pipeline use HF's default BERT pre-tokenization (runs of word characters plus single punctuation tokens), so no custom preprocessing is required at inference time.
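That pre-tokenization behaves roughly like the following regex (a simplification of BERT's `BasicTokenizer`, which additionally handles whitespace cleanup and, for uncased models, lowercasing):

```python
import re

# Rough approximation of BERT's basic pre-tokenization: runs of
# word characters, or single punctuation characters.
PRETOKEN = re.compile(r"\w+|[^\w\s]")

print(PRETOKEN.findall("56% yield in THF (2 h)."))
# → ['56', '%', 'yield', 'in', 'THF', '(', '2', 'h', ')', '.']
```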
## Limitations

- Trained on the ChEMU 2020 corpus, which is biased toward organic-synthesis patents. Performance on other chemical sub-domains (materials, catalysis, inorganic chemistry) is unverified.
- Compound-name types (STARTING_MATERIAL, REAGENT_CATALYST) sit around 0.88-0.90 F1, about 8 points below the high-coverage types. The most common error modes are (i) splitting multi-word names such as "Intermediate 6" into two spans and (ii) confusing high-frequency words (aqueous, ammonium chloride, water, methanol) between SOLVENT / REAGENT_CATALYST / OTHER_COMPOUND. See the accompanying technical note for a full failure analysis.
- The training data is licensed CC BY-NC 3.0, so this model is released for non-commercial research use only.
## Citation

If you use this model, please cite the ChEMU 2020 overview paper and the BioBERT paper:
```bibtex
@incollection{he2020chemu,
  author    = {He, Jiayuan and Nguyen, Dat Quoc and Akhondi, Saber A. and
               Druckenbrodt, Christian and Thorne, Camilo and Hoessel, Ralph and
               Afzal, Zubair and Zhai, Zenan and Fang, Biaoyan and
               Yoshikawa, Hiyori and Albahem, Ameer and Cavedon, Lawrence and
               Cohn, Trevor and Baldwin, Timothy and Verspoor, Karin},
  title     = {Overview of {ChEMU} 2020: Named Entity Recognition and Event
               Extraction of Chemical Reactions from Patents},
  booktitle = {Experimental IR Meets Multilinguality, Multimodality, and
               Interaction. Proceedings of the Eleventh International
               Conference of the CLEF Association (CLEF 2020)},
  year      = {2020},
}
```
```bibtex
@article{lee2020biobert,
  title   = {{BioBERT}: a pre-trained biomedical language representation model
             for biomedical text mining},
  author  = {Lee, Jinhyuk and Yoon, Wonjin and Kim, Sungdong and
             Kim, Donghyeon and Kim, Sunkyu and So, Chan Ho and Kang, Jaewoo},
  journal = {Bioinformatics},
  volume  = {36},
  number  = {4},
  pages   = {1234--1240},
  year    = {2020},
}
```