Use from the Transformers library
# Load model directly
from transformers import AutoModel
model = AutoModel.from_pretrained("sagawa/chemlm-86.24m", dtype="auto")

ChemLM 86.24M

ChemLM 86.24M is a small BERT-style chemical language model pre-trained with masked language modeling on molecular SMILES strings.

This checkpoint is intended as an encoder initialization for downstream molecular property prediction, via the fine-tuning and linear-probe code in the original repository.

Model Details

Hyperparameter Value
Hidden size 768
Number of hidden layers 12
Number of attention heads 12
Intermediate size 3072
Vocabulary size 2362
Maximum sequence length during pre-training 512

The model was pre-trained with the Academic-Budget-BERT-based implementation in the repository, and the resulting DeepSpeed checkpoint was converted to the Hugging Face format.
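
The parameter count implied by the checkpoint name can be verified after loading, for example (this loads the encoder through the generic AutoModel class, as in the snippet above):

from transformers import AutoModel

model = AutoModel.from_pretrained("sagawa/chemlm-86.24m", dtype="auto")
n_params = sum(p.numel() for p in model.parameters())
print(f"{n_params / 1e6:.2f}M parameters")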

Intended Use

This checkpoint is intended for downstream molecular property prediction by loading it with the custom model class defined in the repository:

from pretraining.configs import PretrainedBertConfig
from pretraining.modeling import BertForSequenceClassificationMolecule

The downstream model adds a task-specific classification or regression head on top of the pre-trained BERT encoder.

The supported training modes are:

  • full fine-tuning
  • linear probe, where the BERT encoder is frozen and only the prediction head is trained

Model Architecture for Downstream Tasks

For downstream molecular property prediction, the repository uses:

BertForSequenceClassificationMolecule

This model consists of:

  1. a pre-trained BertModel encoder,
  2. BertPoolerC, which mean-pools non-special tokens using the attention mask,
  3. dropout,
  4. a linear prediction head:
self.classifier = nn.Linear(config.hidden_size, self.num_labels)

The output is a Hugging Face SequenceClassifierOutput containing loss and logits.
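
As an illustration of the pooling step (item 2 above), masked mean pooling can be sketched as follows. This is a minimal sketch of the idea, not the repository's exact BertPoolerC implementation, and it assumes a mask that is already zero for the special tokens to be excluded:

import torch

def masked_mean_pool(hidden_states: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
    # hidden_states: (batch, seq_len, hidden); mask: (batch, seq_len),
    # 1 for tokens to pool, 0 for padding/special tokens (assumed encoding)
    mask = mask.unsqueeze(-1).to(hidden_states.dtype)
    summed = (hidden_states * mask).sum(dim=1)
    count = mask.sum(dim=1).clamp(min=1.0)
    return summed / count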

The loss function depends on the task type:

Task type Number of labels Loss
Regression 1 MSELoss
Binary classification 2 CrossEntropyLoss
Multitask classification Number of target columns Masked BCEWithLogitsLoss
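
For the multitask case, "masked" means that missing labels are excluded from the loss. A minimal sketch, assuming missing targets are encoded as NaN (the repository's actual encoding may differ):

import torch
import torch.nn as nn

def masked_bce_with_logits(logits: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    # logits, labels: (batch, num_tasks); NaN in labels marks a missing target
    valid = ~torch.isnan(labels)
    return nn.BCEWithLogitsLoss()(logits[valid], labels[valid].float())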

Usage

The intended usage is to load the pre-trained checkpoint inside the repository's downstream training script.

First, clone the original repository and install the fine-tuning environment:

git clone https://github.com/sagawatatsuya/chemlm_pretraining.git
cd chemlm_pretraining

conda create -n chemlm_finetuning python=3.11
conda activate chemlm_finetuning

pip install -r requirements.txt
pip install torch transformers==4.57.3
pip install -U accelerate deepspeed

Then run fine-tuning, for example on BBBP:

python chemlm_pretraining/run_ft_molecule.py \
  --model_name_or_path "sagawatatsuya/chemlm-86.24m" \
  --tokenizer_name "ibm-research/MoLFormer-XL-both-10pct" \
  --train_file "./chemlm_pretraining/dataset/finetune_datasets/data/bbbp/train.csv" \
  --validation_file "./chemlm_pretraining/dataset/finetune_datasets/data/bbbp/valid.csv" \
  --test_file "./chemlm_pretraining/dataset/finetune_datasets/data/bbbp/test.csv" \
  --task_name "bbbp" \
  --task_config "chemlm_pretraining/task_config.json" \
  --do_train \
  --do_eval \
  --do_predict \
  --per_device_train_batch_size 256 \
  --per_device_eval_batch_size 512 \
  --learning_rate 3e-5 \
  --num_train_epochs 500 \
  --save_strategy epoch \
  --eval_strategy epoch

For linear probe evaluation, set:

--training_type "linear_probe"

In this mode, the script freezes the parameters under model.bert and trains only the task-specific head.
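
Conceptually, given a model loaded as in the next section, this freezing step corresponds to:

# Freeze the encoder; only the task-specific head stays trainable.
for param in model.bert.parameters():
    param.requires_grad = False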

Loading the Model in Python

The downstream script loads local converted checkpoints with the custom config and model class:

import json
from argparse import Namespace

from transformers import AutoTokenizer
from pretraining.configs import PretrainedBertConfig
from pretraining.modeling import BertForSequenceClassificationMolecule

# args.json is read from this path below, so model_name_or_path must point
# to a local checkpoint directory (e.g., a downloaded snapshot of the Hub
# repository) that contains the saved pre-training arguments.
model_name_or_path = "sagawatatsuya/chemlm-86.24m"
task_name = "bbbp"
task_config_path = "chemlm_pretraining/task_config.json"

# Task metadata: category, target columns, maximum sequence length, metric.
with open(task_config_path) as f:
    task_to_keys = json.load(f)
task_info = task_to_keys[task_name]

# The number of output labels depends on the task category.
if task_info["task_category"] == "regression":
    num_labels = 1
elif task_info["task_category"] == "classification":
    num_labels = 2
else:  # multitask classification: one label per target column
    num_labels = len(task_info["target_columns"])

# Pre-training arguments saved alongside the converted checkpoint.
with open(f"{model_name_or_path}/args.json") as f:
    pretrain_run_args = json.load(f)
ds_args = Namespace(**pretrain_run_args)

config = PretrainedBertConfig.from_pretrained(
    model_name_or_path,
    num_labels=num_labels,
    finetuning_task=task_name,
    layer_norm_type="pytorch",
    task_category=task_info["task_category"],
    fused_linear_layer=True,
    max_seq_length=task_info["max_seq_length"],
)

# The same MoLFormer tokenizer that was used for pre-training.
tokenizer = AutoTokenizer.from_pretrained(
    "ibm-research/MoLFormer-XL-both-10pct",
    trust_remote_code=True,
)

model = BertForSequenceClassificationMolecule.from_pretrained(
    model_name_or_path,
    config=config,
    args=ds_args,
)

For normal use, run_ft_molecule.py is recommended instead of manually writing this loading code.

Data

The pre-training data preparation follows the repository pipeline:

  1. download ZINC-15 and PubChem SMILES used in MoLFormer-style pre-training,
  2. preprocess, shard, and split the dataset,
  3. create masked language modeling samples using ibm-research/MoLFormer-XL-both-10pct.
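
For reference, the MoLFormer tokenizer splits a SMILES string into chemistry-aware tokens (roughly, atoms, bonds, and ring indices):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    "ibm-research/MoLFormer-XL-both-10pct",
    trust_remote_code=True,
)
# Aspirin as a SMILES string
print(tokenizer.tokenize("CC(=O)Oc1ccccc1C(=O)O"))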

Pre-training Objective

The model was pre-trained with masked language modeling.

The sample generation configuration uses:

Setting Value
Masked LM probability 0.15
Maximum sequence length 512
Maximum predictions per sequence 77
Tokenizer ibm-research/MoLFormer-XL-both-10pct
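
Note that 77 ≈ 0.15 × 512, i.e., the cap on masked positions matches the masking probability at the maximum sequence length. The repository generates masked samples with its own scripts; a rough equivalent using standard Transformers utilities (which do not enforce the per-sequence cap) would be:

from transformers import AutoTokenizer, DataCollatorForLanguageModeling

tokenizer = AutoTokenizer.from_pretrained(
    "ibm-research/MoLFormer-XL-both-10pct",
    trust_remote_code=True,
)
collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer,
    mlm=True,
    mlm_probability=0.15,
)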

Downstream Tasks

The repository defines task metadata in chemlm_pretraining/task_config.json.

Supported task categories include:

  • binary classification, e.g. BBBP, BACE, HIV
  • multitask classification, e.g. Tox21, ClinTox, SIDER
  • regression, e.g. QM9 targets, ESOL, FreeSolv, Lipophilicity

Metrics are selected from the task config:

Task category Metric
Classification ROC-AUC
Multitask classification Mean ROC-AUC over tasks
Regression MAE or RMSE, depending on the task config
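
As a sketch of the multitask metric, assuming NaN marks missing labels and skipping tasks where only one class is present in the evaluation split:

import numpy as np
from sklearn.metrics import roc_auc_score

def mean_roc_auc(y_true: np.ndarray, y_score: np.ndarray) -> float:
    # y_true, y_score: (n_samples, n_tasks); NaN in y_true marks missing labels
    aucs = []
    for t in range(y_true.shape[1]):
        valid = ~np.isnan(y_true[:, t])
        if np.unique(y_true[valid, t]).size == 2:
            aucs.append(roc_auc_score(y_true[valid, t], y_score[valid, t]))
    return float(np.mean(aucs))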

Citation

If you use this model, please cite:

@misc{sagawa2026largescalechemicallanguagemodels,
      title={How Well Do Large-Scale Chemical Language Models Transfer to Downstream Tasks?},
      author={Tatsuya Sagawa and Ryosuke Kojima},
      year={2026},
      eprint={2602.11618},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2602.11618},
}