ChEMU NER (BioBERT)

A BertForTokenClassification model fine-tuned on the ChEMU 2020 Task 1 chemical patent NER corpus. Given a reaction description, it identifies 10 types of reaction-step entities: starting materials, reagents/catalysts, solvents, products, other compounds, temperatures, times, yields (percent and mass/moles), and example labels.

Base encoder: dmis-lab/biobert-base-cased-v1.2

Results

Held-out evaluation on the official ChEMU 2020 NER dev split (225 documents, 3,843 entities), exact-match micro-F1:

| Entity type | P | R | F1 | N |
|---|---|---|---|---|
| STARTING_MATERIAL | .8647 | .9128 | .8881 | 413 |
| REAGENT_CATALYST | .9085 | .8927 | .9005 | 289 |
| REACTION_PRODUCT | .9406 | .9704 | .9553 | 506 |
| SOLVENT | .9451 | .9640 | .9545 | 250 |
| OTHER_COMPOUND | .9703 | .9676 | .9689 | 1080 |
| TEMPERATURE | .9744 | .9884 | .9813 | 346 |
| TIME | .9804 | .9921 | .9862 | 252 |
| YIELD_PERCENT | 1.000 | 1.000 | 1.000 | 228 |
| YIELD_OTHER | .9811 | .9923 | .9867 | 261 |
| EXAMPLE_LABEL | .9862 | .9862 | .9862 | 218 |
| MICRO | .9527 | .9644 | .9585 | 3843 |

For reference, the official BANNER baseline on the same task scores 0.8893 exact-match F1; this model is +6.9 pt above BANNER.
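As a consistency check, the micro row can be re-derived from the per-type rows by reconstructing approximate true-positive and prediction counts from P, R, and N (rounding in the table introduces small drift, so the recomputed values agree only to three or four decimals):

```python
# Per-type (precision, recall, gold count) from the table above.
per_type = {
    "STARTING_MATERIAL": (0.8647, 0.9128, 413),
    "REAGENT_CATALYST":  (0.9085, 0.8927, 289),
    "REACTION_PRODUCT":  (0.9406, 0.9704, 506),
    "SOLVENT":           (0.9451, 0.9640, 250),
    "OTHER_COMPOUND":    (0.9703, 0.9676, 1080),
    "TEMPERATURE":       (0.9744, 0.9884, 346),
    "TIME":              (0.9804, 0.9921, 252),
    "YIELD_PERCENT":     (1.000,  1.000,  228),
    "YIELD_OTHER":       (0.9811, 0.9923, 261),
    "EXAMPLE_LABEL":     (0.9862, 0.9862, 218),
}
tp = sum(r * n for p, r, n in per_type.values())        # true positives: R * gold
pred = sum(r * n / p for p, r, n in per_type.values())  # predictions: TP / P
gold = sum(n for _, _, n in per_type.values())
micro_p, micro_r = tp / pred, tp / gold
micro_f1 = 2 * micro_p * micro_r / (micro_p + micro_r)
```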

Entity types

| Label | Role | Examples |
|---|---|---|
| STARTING_MATERIAL | reactant providing the core skeleton | aniline, benzyl bromide |
| REAGENT_CATALYST | reagent, catalyst, base, oxidant, reductant | sodium hydride, DIPEA |
| REACTION_PRODUCT | target product of the reaction | tert-butyl 2-(4-pyridyl)pyrrolidine-1-carboxylate |
| SOLVENT | reaction / extraction medium | THF, dioxane, acetonitrile |
| OTHER_COMPOUND | auxiliary: brine, drying agents, washes, by-products | brine, celite, ethyl acetate |
| TEMPERATURE | reaction temperature or range | 50 °C, room temperature |
| TIME | elapsed reaction time | 2 h, overnight, 30 min |
| YIELD_PERCENT | yield expressed as a percentage | 56%, quantitative |
| YIELD_OTHER | yield expressed as mass or moles | 1.30 g, 2.5 mmol |
| EXAMPLE_LABEL | compound / example identifiers | Example 5, (1), 14 |
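Assuming the usual BIO tagging scheme for token classification, the 10 types above imply a 21-way tag set. The ordering below is illustrative only; the authoritative mapping ships with the model as `model.config.id2label`:

```python
# The 10 ChEMU entity types (order here is an assumption, not the shipped one).
ENTITY_TYPES = [
    "STARTING_MATERIAL", "REAGENT_CATALYST", "REACTION_PRODUCT", "SOLVENT",
    "OTHER_COMPOUND", "TEMPERATURE", "TIME", "YIELD_PERCENT", "YIELD_OTHER",
    "EXAMPLE_LABEL",
]
# "O" plus a B-/I- pair per type under the BIO scheme: 1 + 2 * 10 = 21 tags.
LABELS = ["O"] + [f"{prefix}-{t}" for t in ENTITY_TYPES for prefix in ("B", "I")]
id2label = dict(enumerate(LABELS))
```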

Usage

High-level: Hugging Face pipeline

```python
from transformers import pipeline

ner = pipeline(
    "token-classification",
    model="mpkato/chemu-biobert-ner",
    aggregation_strategy="simple",
)

text = (
    "Under blue LED light, N-Boc-pyrrolidine was coupled with "
    "4-cyanopyridine in acetonitrile using [Ru(bpy)3]Cl2 as the "
    "photocatalyst and DIPEA as the reductant to afford tert-butyl "
    "2-(4-pyridyl)pyrrolidine-1-carboxylate."
)
for ent in ner(text):
    print(f"{ent['entity_group']:20s} {ent['start']:4d}-{ent['end']:4d}  {ent['word']}")
```

Handling long patents

The model has a 512-token positional limit (inherited from BERT). For patent paragraphs longer than that, enable the pipeline's built-in chunking or split the text yourself. A typical pattern:

```python
from transformers import AutoTokenizer, AutoModelForTokenClassification
import torch

tok = AutoTokenizer.from_pretrained("mpkato/chemu-biobert-ner")
model = AutoModelForTokenClassification.from_pretrained(
    "mpkato/chemu-biobert-ner"
).eval()

# long_text: a full patent paragraph (str), possibly longer than 512 tokens.
enc = tok(
    long_text,
    return_offsets_mapping=True,
    return_overflowing_tokens=True,
    max_length=512,
    stride=64,
    truncation=True,
    return_tensors="pt",
)
with torch.no_grad():
    logits = model(
        input_ids=enc["input_ids"],
        attention_mask=enc["attention_mask"],
    ).logits
tags = logits.argmax(dim=-1)  # (num_windows, seq_len)
# then walk `enc["offset_mapping"]` to recover entity spans in the
# original text; take care to dedupe entities that appear in the
# overlapping regions of two windows.
```
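The span-recovery step described in the comments can be sketched as a standalone BIO decoder. `decode_entities` is a hypothetical helper (not part of the model release); in practice `id2label` would come from `model.config.id2label` and the tag/offset arrays from the code above:

```python
def decode_entities(tag_ids, offsets, id2label, text):
    """Merge BIO tags from (possibly overlapping) windows into character
    spans, deduplicating entities seen in more than one window."""
    spans = set()
    for window_tags, window_offsets in zip(tag_ids, offsets):
        start = end = label = None
        for tag_id, (s, e) in zip(window_tags, window_offsets):
            if s == e:                 # special token ([CLS], [SEP], padding)
                continue
            tag = id2label[int(tag_id)]
            if tag.startswith("I-") and label == tag[2:]:
                end = e                # extend the currently open entity
                continue
            if label is not None:      # close the open entity
                spans.add((start, end, label))
                start = end = label = None
            if tag.startswith("B-"):   # open a new entity
                start, end, label = s, e, tag[2:]
        if label is not None:          # entity open at window boundary
            spans.add((start, end, label))
    # the set dedupes spans found in two overlapping windows
    return sorted((s, e, lab, text[s:e]) for s, e, lab in spans)
```

Because identical (start, end, label) triples from two windows collapse in the set, no extra bookkeeping is needed for the overlap regions; only entities that straddle a window boundary (and are thus truncated in one window) need more careful merging.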

Training

  • Data: ChEMU 2020 Task 1 public release (train 900 docs, dev 225 docs; CC BY-NC 3.0).
  • Internal split: the 900 train docs are split 90 / 10 with a fixed seed into 810 training docs and 90 internal validation docs; the dev set is kept as a clean held-out evaluation set (never seen during training or model selection).
  • Optimizer: AdamW, learning rate 5e-5 (BERT body) / 5e-4 (classifier head), weight decay 0.01, linear warm-up 10%, gradient clipping 1.0.
  • Schedule: batch size 16, max sequence length 512, stride 64, dropout 0.2, up to 15 epochs with early stopping (patience = 3) on the internal validation F1. Best epoch: 8.
  • Hardware: single NVIDIA RTX A6000 (48 GB). Training runs in under 10 minutes.
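The split learning rates and warm-up schedule above can be wired up in plain PyTorch. This is a minimal sketch under the assumption that the head parameters are the ones prefixed `classifier` (as in `BertForTokenClassification`); the helper names are illustrative:

```python
import torch
from torch.optim import AdamW
from torch.optim.lr_scheduler import LambdaLR

def build_optimizer(model, body_lr=5e-5, head_lr=5e-4, weight_decay=0.01):
    # Two parameter groups: BERT body at 5e-5, classifier head at 5e-4.
    head = [p for n, p in model.named_parameters() if n.startswith("classifier")]
    body = [p for n, p in model.named_parameters() if not n.startswith("classifier")]
    return AdamW(
        [{"params": body, "lr": body_lr}, {"params": head, "lr": head_lr}],
        weight_decay=weight_decay,
    )

def build_scheduler(optimizer, total_steps, warmup_frac=0.10):
    # Linear warm-up over the first 10% of steps, then linear decay to 0.
    warmup = int(total_steps * warmup_frac)
    def lr_lambda(step):
        if step < warmup:
            return step / max(1, warmup)
        return max(0.0, (total_steps - step) / max(1, total_steps - warmup))
    return LambdaLR(optimizer, lr_lambda)
```

Gradient clipping is then applied once per step with `torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)` before `optimizer.step()`.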

The tokenizer and training pipeline use HF's default BERT pre-tokenization (runs of word characters plus single punctuation tokens), so no custom preprocessing is required at inference time.

Limitations

  • Trained only on the ChEMU 2020 corpus, which is biased toward organic-synthesis patents. Performance on other chemical sub-domains (materials science, catalysis datasets, inorganic chemistry) is unverified.
  • Compound-name types (STARTING_MATERIAL, REAGENT_CATALYST) sit around 0.88-0.90 F1, about 8 points below the high-coverage types. The most common error modes are (i) splitting multi-word names such as "Intermediate 6" into two spans and (ii) confusing high-frequency words (aqueous, ammonium chloride, water, methanol) between SOLVENT / REAGENT_CATALYST / OTHER_COMPOUND. See the accompanying technical note for a full failure analysis.
  • The training data is licensed CC BY-NC 3.0, so this model is released for non-commercial research use only.

Citation

If you use this model, please cite the ChEMU 2020 overview paper and the BioBERT paper:

@incollection{he2020chemu,
  author = {He, Jiayuan and Nguyen, Dat Quoc and Akhondi, Saber A. and
            Druckenbrodt, Christian and Thorne, Camilo and Hoessel, Ralph
            and Afzal, Zubair and Zhai, Zenan and Fang, Biaoyan and
            Yoshikawa, Hiyori and Albahem, Ameer and Cavedon, Lawrence
            and Cohn, Trevor and Baldwin, Timothy and Verspoor, Karin},
  title = {Overview of ChEMU 2020: Named Entity Recognition and Event
           Extraction of Chemical Reactions from Patents},
  booktitle = {Experimental IR Meets Multilinguality, Multimodality, and
               Interaction. Proceedings of the Eleventh International
               Conference of the CLEF Association (CLEF 2020)},
  year = 2020,
}

@article{lee2020biobert,
  author = {Lee, Jinhyuk and Yoon, Wonjin and Kim, Sungdong and
            Kim, Donghyeon and Kim, Sunkyu and So, Chan Ho and
            Kang, Jaewoo},
  title = {BioBERT: a pre-trained biomedical language representation model
           for biomedical text mining},
  journal = {Bioinformatics},
  volume = {36},
  number = {4},
  pages = {1234--1240},
  year = {2020},
}