---
library_name: transformers
tags:
- chemistry
- smiles
- molecular-property-prediction
- masked-language-modeling
- bert
license: apache-2.0
---
# ChemLM 0.83M
ChemLM 0.83M is a small BERT-style chemical language model pre-trained with masked language modeling on molecular SMILES strings.
This checkpoint is intended as an encoder initialization for downstream molecular property prediction tasks, via the fine-tuning and linear-probe code in the original repository.
- Repository: https://github.com/sagawatatsuya/chemlm_pretraining
- Paper: *How Well Do Large-Scale Chemical Language Models Transfer to Downstream Tasks?*
- Tokenizer: `ibm-research/MoLFormer-XL-both-10pct`
- Model size: 0.83M parameters
## Model Details
| Hyperparameter | Value |
|---|---:|
| Hidden size | 128 |
| Number of hidden layers | 4 |
| Number of attention heads | 2 |
| Intermediate size | 512 |
| Vocabulary size | 2362 |
| Maximum sequence length during pre-training | 512 |
The model was pre-trained using the Academic Budget BERT-based implementation in the repository and converted from a DeepSpeed checkpoint to Hugging Face format.
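For reference, the hyperparameters above map onto a standard Hugging Face `BertConfig` as in the minimal sketch below; this is illustrative only, since the repository itself uses its custom `PretrainedBertConfig` and modeling classes.
```python
# Illustration only: the table above expressed as a plain Hugging Face BertConfig.
from transformers import BertConfig

config = BertConfig(
    vocab_size=2362,
    hidden_size=128,
    num_hidden_layers=4,
    num_attention_heads=2,
    intermediate_size=512,
    max_position_embeddings=512,
)
print(config)
```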
## Intended Use
This checkpoint is intended for downstream molecular property prediction, where it is loaded with the custom config and model classes defined in the repository:
```python
from pretraining.configs import PretrainedBertConfig
from pretraining.modeling import BertForSequenceClassificationMolecule
```
The downstream model adds a task-specific classification or regression head on top of the pre-trained BERT encoder.
The supported training modes are:
* full fine-tuning
* linear probe, where the BERT encoder is frozen and only the prediction head is trained
## Model Architecture for Downstream Tasks
For downstream molecular property prediction, the repository uses:
```python
BertForSequenceClassificationMolecule
```
This model consists of:
1. a pre-trained `BertModel` encoder,
2. `BertPoolerC`, which mean-pools non-special tokens using the attention mask (see the pooling sketch below),
3. dropout,
4. a linear prediction head:
```python
self.classifier = nn.Linear(config.hidden_size, self.num_labels)
```
The output is a Hugging Face `SequenceClassifierOutput` containing `loss` and `logits`.
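As a rough illustration of step 2 above, masked mean pooling over the encoder's token embeddings can be written as follows. This is a sketch only; the actual logic lives in `BertPoolerC`, which additionally excludes special tokens.
```python
import torch

def masked_mean_pool(hidden_states: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
    """Mean-pool token embeddings over the positions selected by the attention mask."""
    mask = attention_mask.unsqueeze(-1).to(hidden_states.dtype)  # (batch, seq_len, 1)
    summed = (hidden_states * mask).sum(dim=1)                   # (batch, hidden)
    counts = mask.sum(dim=1).clamp(min=1e-9)                     # avoid division by zero
    return summed / counts
```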
The loss function depends on the task type:
| Task type | Number of labels | Loss |
| ------------------------ | -----------------------: | ------------------------ |
| Regression | 1 | MSELoss |
| Binary classification | 2 | CrossEntropyLoss |
| Multitask classification | Number of target columns | Masked BCEWithLogitsLoss |
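The loss selection can be sketched roughly as below; the function name and the missing-label convention (NaN) are assumptions for illustration, not the exact code in `BertForSequenceClassificationMolecule`.
```python
import torch
import torch.nn as nn

def compute_loss(logits: torch.Tensor, labels: torch.Tensor, task_category: str) -> torch.Tensor:
    """Pick the loss according to the task category (illustrative only)."""
    if task_category == "regression":
        # num_labels == 1: mean-squared error against the single target value
        return nn.MSELoss()(logits.squeeze(-1), labels.float())
    if task_category == "classification":
        # num_labels == 2: cross-entropy over the two classes
        return nn.CrossEntropyLoss()(logits, labels.long())
    # Multitask classification: BCE with logits per target column,
    # masking out missing labels (assumed here to be encoded as NaN).
    mask = ~torch.isnan(labels)
    per_element = nn.BCEWithLogitsLoss(reduction="none")(logits, torch.nan_to_num(labels))
    return (per_element * mask).sum() / mask.sum().clamp(min=1)
```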
## Usage
The intended usage is to load the pre-trained checkpoint inside the repository's downstream training script.
First, clone the original repository and install the fine-tuning environment:
```bash
git clone https://github.com/sagawatatsuya/chemlm_pretraining.git
cd chemlm_pretraining
conda create -n chemlm_finetuning python=3.11
conda activate chemlm_finetuning
pip install -r requirements.txt
pip install torch transformers==4.57.3
pip install -U accelerate deepspeed
```
Then run fine-tuning, for example on BBBP:
```bash
python chemlm_pretraining/run_ft_molecule.py \
--model_name_or_path "sagawatatsuya/chemlm-0.83m" \
--tokenizer_name "ibm-research/MoLFormer-XL-both-10pct" \
--train_file "./chemlm_pretraining/dataset/finetune_datasets/data/bbbp/train.csv" \
--validation_file "./chemlm_pretraining/dataset/finetune_datasets/data/bbbp/valid.csv" \
--test_file "./chemlm_pretraining/dataset/finetune_datasets/data/bbbp/test.csv" \
--task_name "bbbp" \
--task_config "chemlm_pretraining/task_config.json" \
--do_train \
--do_eval \
--do_predict \
--per_device_train_batch_size 256 \
--per_device_eval_batch_size 512 \
--learning_rate 3e-5 \
--num_train_epochs 500 \
--save_strategy epoch \
--eval_strategy epoch
```
For linear probe evaluation, set:
```bash
--training_type "linear_probe"
```
In this mode, the script freezes the parameters under `model.bert` and trains only the task-specific head.
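Conceptually, linear probing amounts to freezing the encoder before training, as in this sketch (assuming `model` is a loaded `BertForSequenceClassificationMolecule` with a `bert` encoder attribute):
```python
# Sketch of linear-probe mode: freeze the pre-trained encoder, keep the head trainable.
for param in model.bert.parameters():
    param.requires_grad = False

# Only the head (pooler / dropout / classifier) remains trainable.
trainable = [name for name, p in model.named_parameters() if p.requires_grad]
print(trainable)
```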
## Loading the Model in Python
The downstream script loads local converted checkpoints with the custom config and model class:
```python
import json
from argparse import Namespace

from transformers import AutoTokenizer

from pretraining.configs import PretrainedBertConfig
from pretraining.modeling import BertForSequenceClassificationMolecule

model_name_or_path = "sagawatatsuya/chemlm-0.83m"
task_name = "bbbp"
task_config_path = "chemlm_pretraining/task_config.json"

# Resolve the number of labels from the task metadata.
with open(task_config_path) as f:
    task_to_keys = json.load(f)
task_info = task_to_keys[task_name]

if task_info["task_category"] == "regression":
    num_labels = 1
elif task_info["task_category"] == "classification":
    num_labels = 2
else:
    num_labels = len(task_info["target_columns"])

# The pre-training arguments (args.json) are read from the checkpoint directory,
# so the checkpoint files must be available locally (e.g. a local clone or a
# downloaded snapshot of the Hub repository).
with open(f"{model_name_or_path}/args.json") as f:
    pretrain_run_args = json.load(f)
ds_args = Namespace(**pretrain_run_args)

config = PretrainedBertConfig.from_pretrained(
    model_name_or_path,
    num_labels=num_labels,
    finetuning_task=task_name,
    layer_norm_type="pytorch",
    task_category=task_info["task_category"],
    fused_linear_layer=True,
    max_seq_length=task_info["max_seq_length"],
)

tokenizer = AutoTokenizer.from_pretrained(
    "ibm-research/MoLFormer-XL-both-10pct",
    trust_remote_code=True,
)

model = BertForSequenceClassificationMolecule.from_pretrained(
    model_name_or_path,
    config=config,
    args=ds_args,
)
```
For normal use, `run_ft_molecule.py` is recommended instead of manually writing this loading code.
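If you do load the model manually, a forward pass on a single SMILES string looks roughly like the sketch below; the keyword arguments are assumed to follow the standard Hugging Face BERT interface and may differ in the repository's model class.
```python
import torch

smiles = "CCO"  # ethanol, used here only as an example input
inputs = tokenizer(smiles, return_tensors="pt", truncation=True)

model.eval()
with torch.no_grad():
    outputs = model(
        input_ids=inputs["input_ids"],
        attention_mask=inputs["attention_mask"],
    )
print(outputs.logits)  # shape (1, num_labels)
```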
## Data
The pre-training data preparation follows the repository pipeline:
1. download ZINC-15 and PubChem SMILES used in MoLFormer-style pre-training,
2. preprocess, shard, and split the dataset,
3. create masked language modeling samples using `ibm-research/MoLFormer-XL-both-10pct`.
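Step 3 can be illustrated by tokenizing a single SMILES string with the same tokenizer; this is a minimal example, not the repository's sample-generation script.
```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    "ibm-research/MoLFormer-XL-both-10pct",
    trust_remote_code=True,
)

encoded = tokenizer("CC(=O)Oc1ccccc1C(=O)O")  # aspirin, as an example SMILES
print(encoded["input_ids"])
print(tokenizer.convert_ids_to_tokens(encoded["input_ids"]))
```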
## Pre-training Objective
The model was pre-trained with masked language modeling.
The sample generation configuration uses:
| Setting | Value |
| -------------------------------- | -------------------------------------: |
| Masked LM probability | 0.15 |
| Maximum sequence length | 512 |
| Maximum predictions per sequence | 77 |
| Tokenizer | `ibm-research/MoLFormer-XL-both-10pct` |
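The repository generates static masked samples offline with the Academic Budget BERT tooling; purely as an illustration of the 15% masking setting, dynamic masking with Hugging Face's `DataCollatorForLanguageModeling` (a stand-in, not the repository's pipeline) looks like this:
```python
from transformers import AutoTokenizer, DataCollatorForLanguageModeling

tokenizer = AutoTokenizer.from_pretrained(
    "ibm-research/MoLFormer-XL-both-10pct",
    trust_remote_code=True,
)
collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer,
    mlm=True,
    mlm_probability=0.15,  # matches the table above
)

batch = collator([tokenizer("CCO", truncation=True, max_length=512)])
print(batch["input_ids"])  # some token ids replaced by the mask token
print(batch["labels"])     # -100 everywhere except at masked positions
```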
## Downstream Tasks
The repository defines task metadata in `chemlm_pretraining/task_config.json`.
Supported task categories include:
* binary classification, e.g. BBBP, BACE, HIV
* multitask classification, e.g. Tox21, ClinTox, SIDER
* regression, e.g. QM9 targets, ESOL, FreeSolv, Lipophilicity
Metrics are selected from the task config:
| Task category | Metric |
| ------------------------ | ----------------------------------------- |
| Classification | ROC-AUC |
| Multitask classification | Mean ROC-AUC over tasks |
| Regression | MAE or RMSE, depending on the task config |
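A hedged sketch of how these metrics are typically computed with scikit-learn follows; the repository's evaluation code may differ in details such as missing-label handling.
```python
import numpy as np
from sklearn.metrics import mean_absolute_error, roc_auc_score

# Binary classification: ROC-AUC on positive-class scores.
y_true = np.array([0, 1, 1, 0])
y_score = np.array([0.1, 0.8, 0.6, 0.3])
print("ROC-AUC:", roc_auc_score(y_true, y_score))

# Multitask classification: mean ROC-AUC over target columns,
# skipping positions where the label is missing (NaN).
Y_true = np.array([[0.0, 1.0], [1.0, np.nan], [1.0, 0.0], [0.0, 1.0]])
Y_score = np.array([[0.2, 0.7], [0.9, 0.4], [0.8, 0.3], [0.1, 0.6]])
aucs = []
for col in range(Y_true.shape[1]):
    mask = ~np.isnan(Y_true[:, col])
    aucs.append(roc_auc_score(Y_true[mask, col], Y_score[mask, col]))
print("Mean ROC-AUC:", float(np.mean(aucs)))

# Regression: MAE or RMSE, depending on the task config.
y_ref = np.array([1.2, 0.5, -0.3])
y_pred = np.array([1.0, 0.7, -0.1])
print("MAE:", mean_absolute_error(y_ref, y_pred))
print("RMSE:", float(np.sqrt(np.mean((y_ref - y_pred) ** 2))))
```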
## Citation
If you use this model, please cite:
```bibtex
@misc{sagawa2026largescalechemicallanguagemodels,
title={How Well Do Large-Scale Chemical Language Models Transfer to Downstream Tasks?},
author={Tatsuya Sagawa and Ryosuke Kojima},
year={2026},
eprint={2602.11618},
archivePrefix={arXiv},
primaryClass={cs.LG},
url={https://arxiv.org/abs/2602.11618},
}
```