ChEMU NER (BioBERT)

A BertForTokenClassification model fine-tuned on the ChEMU 2020 Task 1 chemical patent NER corpus. Given a reaction description, it identifies 10 types of reaction-step entities: starting materials, reagents/catalysts, solvents, products, other compounds, temperatures, times, yields (percent and mass/moles), and example labels.

Base encoder: dmis-lab/biobert-base-cased-v1.2

Results

Held-out evaluation on the official ChEMU 2020 NER dev split (225 documents, 3,843 entities), exact-match micro-F1:

| Entity type | P | R | F1 | N |
|---|---|---|---|---|
| STARTING_MATERIAL | .8647 | .9128 | .8881 | 413 |
| REAGENT_CATALYST | .9085 | .8927 | .9005 | 289 |
| REACTION_PRODUCT | .9406 | .9704 | .9553 | 506 |
| SOLVENT | .9451 | .9640 | .9545 | 250 |
| OTHER_COMPOUND | .9703 | .9676 | .9689 | 1080 |
| TEMPERATURE | .9744 | .9884 | .9813 | 346 |
| TIME | .9804 | .9921 | .9862 | 252 |
| YIELD_PERCENT | 1.000 | 1.000 | 1.000 | 228 |
| YIELD_OTHER | .9811 | .9923 | .9867 | 261 |
| EXAMPLE_LABEL | .9862 | .9862 | .9862 | 218 |
| MICRO | .9527 | .9644 | .9585 | 3843 |

For reference, the official BANNER baseline on the same task scores 0.8893 exact-match F1; this model is +6.9 pt above BANNER.
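As a consistency check, the micro row can be re-derived from the per-type rows by reconstructing approximate true-positive and prediction counts from P, R, and N (rounding in the table introduces small drift, so the recomputed values agree only to three or four decimals):

```python
# Per-type (precision, recall, gold count) from the table above.
per_type = {
    "STARTING_MATERIAL": (0.8647, 0.9128, 413),
    "REAGENT_CATALYST":  (0.9085, 0.8927, 289),
    "REACTION_PRODUCT":  (0.9406, 0.9704, 506),
    "SOLVENT":           (0.9451, 0.9640, 250),
    "OTHER_COMPOUND":    (0.9703, 0.9676, 1080),
    "TEMPERATURE":       (0.9744, 0.9884, 346),
    "TIME":              (0.9804, 0.9921, 252),
    "YIELD_PERCENT":     (1.000,  1.000,  228),
    "YIELD_OTHER":       (0.9811, 0.9923, 261),
    "EXAMPLE_LABEL":     (0.9862, 0.9862, 218),
}
tp = sum(r * n for p, r, n in per_type.values())        # true positives: R * gold
pred = sum(r * n / p for p, r, n in per_type.values())  # predictions: TP / P
gold = sum(n for _, _, n in per_type.values())
micro_p, micro_r = tp / pred, tp / gold
micro_f1 = 2 * micro_p * micro_r / (micro_p + micro_r)
```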

Entity types

| Label | Role | Examples |
|---|---|---|
| STARTING_MATERIAL | reactant providing the core skeleton | aniline, benzyl bromide |
| REAGENT_CATALYST | reagent, catalyst, base, oxidant, reductant | sodium hydride, DIPEA |
| REACTION_PRODUCT | target product of the reaction | tert-butyl 2-(4-pyridyl)pyrrolidine-1-carboxylate |
| SOLVENT | reaction / extraction medium | THF, dioxane, acetonitrile |
| OTHER_COMPOUND | auxiliary: brine, drying agents, washes, by-products | brine, celite, ethyl acetate |
| TEMPERATURE | reaction temperature or range | 50 °C, room temperature |
| TIME | elapsed reaction time | 2 h, overnight, 30 min |
| YIELD_PERCENT | yield expressed as a percentage | 56%, quantitative |
| YIELD_OTHER | yield expressed as mass or moles | 1.30 g, 2.5 mmol |
| EXAMPLE_LABEL | compound / example identifiers | Example 5, (1), 14 |
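Assuming the usual BIO tagging scheme for token classification, the 10 types above imply a 21-way tag set. The ordering below is illustrative only; the authoritative mapping ships with the model as `model.config.id2label`:

```python
# The 10 ChEMU entity types (order here is an assumption, not the shipped one).
ENTITY_TYPES = [
    "STARTING_MATERIAL", "REAGENT_CATALYST", "REACTION_PRODUCT", "SOLVENT",
    "OTHER_COMPOUND", "TEMPERATURE", "TIME", "YIELD_PERCENT", "YIELD_OTHER",
    "EXAMPLE_LABEL",
]
# "O" plus a B-/I- pair per type under the BIO scheme: 1 + 2 * 10 = 21 tags.
LABELS = ["O"] + [f"{prefix}-{t}" for t in ENTITY_TYPES for prefix in ("B", "I")]
id2label = dict(enumerate(LABELS))
```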

Usage

High-level: Hugging Face pipeline

```python
from transformers import pipeline

ner = pipeline(
    "token-classification",
    model="mpkato/chemu-biobert-ner",
    aggregation_strategy="simple",
)

text = (
    "Under blue LED light, N-Boc-pyrrolidine was coupled with "
    "4-cyanopyridine in acetonitrile using [Ru(bpy)3]Cl2 as the "
    "photocatalyst and DIPEA as the reductant to afford tert-butyl "
    "2-(4-pyridyl)pyrrolidine-1-carboxylate."
)
for ent in ner(text):
    print(f"{ent['entity_group']:20s} {ent['start']:4d}-{ent['end']:4d}  {ent['word']}")
```

Handling long patents

The model has a 512-token positional limit (inherited from BERT). For patent paragraphs longer than that, enable the pipeline's built-in chunking or split the text yourself. A typical pattern:

```python
from transformers import AutoTokenizer, AutoModelForTokenClassification
import torch

tok = AutoTokenizer.from_pretrained("mpkato/chemu-biobert-ner")
model = AutoModelForTokenClassification.from_pretrained(
    "mpkato/chemu-biobert-ner"
).eval()

# long_text: a full patent paragraph (str), possibly longer than 512 tokens.
enc = tok(
    long_text,
    return_offsets_mapping=True,
    return_overflowing_tokens=True,
    max_length=512,
    stride=64,
    truncation=True,
    return_tensors="pt",
)
with torch.no_grad():
    logits = model(
        input_ids=enc["input_ids"],
        attention_mask=enc["attention_mask"],
    ).logits
tags = logits.argmax(dim=-1)  # (num_windows, seq_len)
# then walk `enc["offset_mapping"]` to recover entity spans in the
# original text; take care to dedupe entities that appear in the
# overlapping regions of two windows.
```
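The span-recovery step described in the comments can be sketched as a standalone BIO decoder. `decode_entities` is a hypothetical helper (not part of the model release); in practice `id2label` would come from `model.config.id2label` and the tag/offset arrays from the code above:

```python
def decode_entities(tag_ids, offsets, id2label, text):
    """Merge BIO tags from (possibly overlapping) windows into character
    spans, deduplicating entities seen in more than one window."""
    spans = set()
    for window_tags, window_offsets in zip(tag_ids, offsets):
        start = end = label = None
        for tag_id, (s, e) in zip(window_tags, window_offsets):
            if s == e:                 # special token ([CLS], [SEP], padding)
                continue
            tag = id2label[int(tag_id)]
            if tag.startswith("I-") and label == tag[2:]:
                end = e                # extend the currently open entity
                continue
            if label is not None:      # close the open entity
                spans.add((start, end, label))
                start = end = label = None
            if tag.startswith("B-"):   # open a new entity
                start, end, label = s, e, tag[2:]
        if label is not None:          # entity open at window boundary
            spans.add((start, end, label))
    # the set dedupes spans found in two overlapping windows
    return sorted((s, e, lab, text[s:e]) for s, e, lab in spans)
```

Because identical (start, end, label) triples from two windows collapse in the set, no extra bookkeeping is needed for the overlap regions; only entities that straddle a window boundary (and are thus truncated in one window) need more careful merging.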

Training

  • Data: ChEMU 2020 Task 1 public release (train 900 docs, dev 225 docs; CC BY-NC 3.0).
  • Internal split: the 900 train docs are split 90 / 10 with a fixed seed into 810 training docs and 90 internal validation docs; the dev set is kept as a clean held-out evaluation set (never seen during training or model selection).
  • Optimizer: AdamW, learning rate 5e-5 (BERT body) / 5e-4 (classifier head), weight decay 0.01, linear warm-up 10%, gradient clipping 1.0.
  • Schedule: batch size 16, max sequence length 512, stride 64, dropout 0.2, up to 15 epochs with early stopping (patience = 3) on the internal validation F1. Best epoch: 8.
  • Hardware: single NVIDIA RTX A6000 (48 GB). Training runs in under 10 minutes.
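The split learning rates and warm-up schedule above can be wired up in plain PyTorch. This is a minimal sketch under the assumption that the head parameters are the ones prefixed `classifier` (as in `BertForTokenClassification`); the helper names are illustrative:

```python
import torch
from torch.optim import AdamW
from torch.optim.lr_scheduler import LambdaLR

def build_optimizer(model, body_lr=5e-5, head_lr=5e-4, weight_decay=0.01):
    # Two parameter groups: BERT body at 5e-5, classifier head at 5e-4.
    head = [p for n, p in model.named_parameters() if n.startswith("classifier")]
    body = [p for n, p in model.named_parameters() if not n.startswith("classifier")]
    return AdamW(
        [{"params": body, "lr": body_lr}, {"params": head, "lr": head_lr}],
        weight_decay=weight_decay,
    )

def build_scheduler(optimizer, total_steps, warmup_frac=0.10):
    # Linear warm-up over the first 10% of steps, then linear decay to 0.
    warmup = int(total_steps * warmup_frac)
    def lr_lambda(step):
        if step < warmup:
            return step / max(1, warmup)
        return max(0.0, (total_steps - step) / max(1, total_steps - warmup))
    return LambdaLR(optimizer, lr_lambda)
```

Gradient clipping is then applied once per step with `torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)` before `optimizer.step()`.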

The tokenizer and training pipeline use HF's default BERT pre-tokenization (runs of word characters plus single punctuation tokens), so no custom preprocessing is required at inference time.

Limitations

  • Trained only on the ChEMU 2020 corpus, which is biased toward organic-synthesis patents. Performance on other chemical sub-domains (materials science, catalysis datasets, inorganic chemistry) is unverified.
  • Compound-name types (STARTING_MATERIAL, REAGENT_CATALYST) sit around 0.88-0.90 F1, about 8 points below the high-coverage types. The most common error modes are (i) splitting multi-word names such as "Intermediate 6" into two spans and (ii) confusing high-frequency words (aqueous, ammonium chloride, water, methanol) between SOLVENT / REAGENT_CATALYST / OTHER_COMPOUND. See the accompanying technical note for a full failure analysis.
  • The training data is licensed CC BY-NC 3.0, so this model is released for non-commercial research use only.

Citation

If you use this model, please cite the ChEMU 2020 overview paper and the BioBERT paper:

@incollection{he2020chemu,
  author = {He, Jiayuan and Nguyen, Dat Quoc and Akhondi, Saber A. and
            Druckenbrodt, Christian and Thorne, Camilo and Hoessel, Ralph
            and Afzal, Zubair and Zhai, Zenan and Fang, Biaoyan and
            Yoshikawa, Hiyori and Albahem, Ameer and Cavedon, Lawrence
            and Cohn, Trevor and Baldwin, Timothy and Verspoor, Karin},
  title = {Overview of ChEMU 2020: Named Entity Recognition and Event
           Extraction of Chemical Reactions from Patents},
  booktitle = {Experimental IR Meets Multilinguality, Multimodality, and
               Interaction. Proceedings of the Eleventh International
               Conference of the CLEF Association (CLEF 2020)},
  year = 2020,
}

@article{lee2020biobert,
  author = {Lee, Jinhyuk and Yoon, Wonjin and Kim, Sungdong and
            Kim, Donghyeon and Kim, Sunkyu and So, Chan Ho and
            Kang, Jaewoo},
  title = {BioBERT: a pre-trained biomedical language representation model
           for biomedical text mining},
  journal = {Bioinformatics},
  volume = {36},
  number = {4},
  pages = {1234--1240},
  year = {2020},
}