---
library_name: transformers
tags:
- chemistry
- smiles
- molecular-property-prediction
- masked-language-modeling
- bert
license: apache-2.0
---

# ChemLM 2.30M

ChemLM 2.30M is a small BERT-style chemical language model pre-trained with masked language modeling on molecular SMILES strings.

This checkpoint is intended to be used as an encoder initialization for downstream molecular property prediction tasks through the fine-tuning and linear probe code in the original repository.

- Repository: https://github.com/sagawatatsuya/chemlm_pretraining
- Paper: *How Well Do Large-Scale Chemical Language Models Transfer to Downstream Tasks?*
- Tokenizer: `ibm-research/MoLFormer-XL-both-10pct`
- Model size: 2.30M parameters

## Model Details

| Hyperparameter | Value |
|---|---:|
| Hidden size | 192 |
| Number of hidden layers | 5 |
| Number of attention heads | 3 |
| Intermediate size | 768 |
| Vocabulary size | 2362 |
| Maximum sequence length during pre-training | 512 |

The model was pre-trained using the Academic Budget BERT-based implementation in the repository and converted from a DeepSpeed checkpoint to Hugging Face format.
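
For orientation, the table maps onto the standard `transformers` `BertConfig` fields roughly as shown below. This is only an illustrative sketch of the sizes involved, not the repository's own loading code; the checkpoint itself is loaded through the custom `PretrainedBertConfig` shown later in this card.

```python
# Illustrative only: the checkpoint ships the repository's custom
# PretrainedBertConfig; this sketch just restates the hyperparameter
# table with standard transformers BertConfig fields.
from transformers import BertConfig

config = BertConfig(
    hidden_size=192,
    num_hidden_layers=5,
    num_attention_heads=3,
    intermediate_size=768,
    vocab_size=2362,
    max_position_embeddings=512,
)
print(config.hidden_size // config.num_attention_heads)  # 64-dimensional attention heads
```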

## Intended Use

This checkpoint is intended for downstream molecular property prediction by loading it with the custom model class defined in the repository:

```python
from pretraining.configs import PretrainedBertConfig
from pretraining.modeling import BertForSequenceClassificationMolecule
```

The downstream model adds a task-specific classification or regression head on top of the pre-trained BERT encoder.

The supported training modes are:

* full fine-tuning
* linear probe, where the BERT encoder is frozen and only the prediction head is trained

## Model Architecture for Downstream Tasks

For downstream molecular property prediction, the repository uses:

```python
BertForSequenceClassificationMolecule
```

This model consists of:

1. a pre-trained `BertModel` encoder,
2. `BertPoolerC`, which mean-pools non-special tokens using the attention mask,
3. dropout,
4. a linear prediction head:

```python
self.classifier = nn.Linear(config.hidden_size, self.num_labels)
```

The output is a Hugging Face `SequenceClassifierOutput` containing `loss` and `logits`.
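
The exact `BertPoolerC` implementation lives in the repository; the standalone sketch below only illustrates the idea of attention-mask-weighted mean pooling. It pools over all attention-mask positions, whereas the repository's pooler additionally excludes special tokens.

```python
import torch


def masked_mean_pool(hidden_states: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
    """Illustrative mean pooling over non-padding tokens (not the repository's code).

    hidden_states: (batch, seq_len, hidden_size) encoder outputs
    attention_mask: (batch, seq_len), 1 for real tokens and 0 for padding
    """
    mask = attention_mask.unsqueeze(-1).to(hidden_states.dtype)  # (batch, seq_len, 1)
    summed = (hidden_states * mask).sum(dim=1)                   # sum over kept tokens
    counts = mask.sum(dim=1).clamp(min=1e-9)                     # avoid division by zero
    return summed / counts                                        # (batch, hidden_size)
```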

The loss function depends on the task type:

| Task type | Number of labels | Loss |
| --- | ---: | --- |
| Regression | 1 | MSELoss |
| Binary classification | 2 | CrossEntropyLoss |
| Multitask classification | Number of target columns | Masked BCEWithLogitsLoss |
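
For the multitask case, datasets such as Tox21 contain missing labels, so the BCE loss is only averaged over observed entries. The sketch below illustrates that idea; it is not the repository's exact loss code, and the `-1` missing-label sentinel is an assumption made here for the example.

```python
import torch
import torch.nn as nn


def masked_bce_loss(logits: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    """Illustrative masked BCEWithLogitsLoss for multitask labels.

    logits, labels: (batch, num_tasks); missing labels are marked with -1
    (sentinel assumed for this sketch, not taken from the repository).
    """
    valid = (labels >= 0).float()                       # 1 where a label is observed
    targets = labels.clamp(min=0).float()               # neutralize the -1 sentinel
    per_element = nn.functional.binary_cross_entropy_with_logits(
        logits, targets, reduction="none"
    )
    return (per_element * valid).sum() / valid.sum().clamp(min=1.0)
```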

## Usage

The intended usage is to load the pre-trained checkpoint inside the repository's downstream training script.

First, clone the original repository and install the fine-tuning environment:

```bash
git clone https://github.com/sagawatatsuya/chemlm_pretraining.git
cd chemlm_pretraining

conda create -n chemlm_finetuning python=3.11
conda activate chemlm_finetuning

pip install -r requirements.txt
pip install torch transformers==4.57.3
pip install -U accelerate deepspeed
```

Then run fine-tuning, for example on BBBP:

```bash
python chemlm_pretraining/run_ft_molecule.py \
    --model_name_or_path "sagawatatsuya/chemlm-2.30m" \
    --tokenizer_name "ibm-research/MoLFormer-XL-both-10pct" \
    --train_file "./chemlm_pretraining/dataset/finetune_datasets/data/bbbp/train.csv" \
    --validation_file "./chemlm_pretraining/dataset/finetune_datasets/data/bbbp/valid.csv" \
    --test_file "./chemlm_pretraining/dataset/finetune_datasets/data/bbbp/test.csv" \
    --task_name "bbbp" \
    --task_config "chemlm_pretraining/task_config.json" \
    --do_train \
    --do_eval \
    --do_predict \
    --per_device_train_batch_size 256 \
    --per_device_eval_batch_size 512 \
    --learning_rate 3e-5 \
    --num_train_epochs 500 \
    --save_strategy epoch \
    --eval_strategy epoch
```

For linear probe evaluation, set:

```bash
--training_type "linear_probe"
```

In this mode, the script freezes the parameters under `model.bert` and trains only the task-specific head.
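
Conceptually, the linear probe mode amounts to the following sketch, assuming the downstream model exposes its encoder as `model.bert` as described above and with `model` loaded as in the next section. The training script does this for you; the snippet is only meant to show what "frozen encoder" means here.

```python
# Sketch of linear probing: freeze the encoder, keep only the head trainable.
# Assumes `model` is a BertForSequenceClassificationMolecule instance
# (see "Loading the Model in Python" below).
for param in model.bert.parameters():
    param.requires_grad = False

trainable = [name for name, p in model.named_parameters() if p.requires_grad]
print(trainable)  # expected to list only the head parameters (e.g. the classifier)
```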

## Loading the Model in Python

The downstream script loads local converted checkpoints with the custom config and model class:

```python
import json
from argparse import Namespace

from transformers import AutoTokenizer
from pretraining.configs import PretrainedBertConfig
from pretraining.modeling import BertForSequenceClassificationMolecule

model_name_or_path = "sagawatatsuya/chemlm-2.30m"
task_name = "bbbp"
task_config_path = "chemlm_pretraining/task_config.json"

task_to_keys = json.load(open(task_config_path))
task_info = task_to_keys[task_name]

if task_info["task_category"] == "regression":
    num_labels = 1
elif task_info["task_category"] == "classification":
    num_labels = 2
else:
    num_labels = len(task_info["target_columns"])

pretrain_run_args = json.load(open(f"{model_name_or_path}/args.json"))
ds_args = Namespace(**pretrain_run_args)

config = PretrainedBertConfig.from_pretrained(
    model_name_or_path,
    num_labels=num_labels,
    finetuning_task=task_name,
    layer_norm_type="pytorch",
    task_category=task_info["task_category"],
    fused_linear_layer=True,
    max_seq_length=task_info["max_seq_length"],
)

tokenizer = AutoTokenizer.from_pretrained(
    "ibm-research/MoLFormer-XL-both-10pct",
    trust_remote_code=True,
)

model = BertForSequenceClassificationMolecule.from_pretrained(
    model_name_or_path,
    config=config,
    args=ds_args,
)
```
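
Continuing from the loading code above, a quick prediction sketch could look like the following. It assumes the custom model accepts the standard Hugging Face `input_ids` and `attention_mask` arguments, which should be verified against the repository's `pretraining/modeling.py`.

```python
import torch

# Assumption: the custom model follows the usual Hugging Face forward
# signature (input_ids, attention_mask); check pretraining/modeling.py.
model.eval()
inputs = tokenizer("CC(=O)Oc1ccccc1C(=O)O", return_tensors="pt")  # aspirin SMILES
with torch.no_grad():
    outputs = model(input_ids=inputs["input_ids"], attention_mask=inputs["attention_mask"])
print(outputs.logits)
```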

For normal use, `run_ft_molecule.py` is recommended instead of manually writing this loading code.

## Data

The pre-training data preparation follows the repository pipeline:

1. download ZINC-15 and PubChem SMILES used in MoLFormer-style pre-training,
2. preprocess, shard, and split the dataset,
3. create masked language modeling samples using `ibm-research/MoLFormer-XL-both-10pct`.
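
For reference, the tokenizer used in step 3 can be applied to a single SMILES string as in the minimal sketch below; the actual sample generation is handled by the repository's scripts.

```python
from transformers import AutoTokenizer

# Minimal sketch: tokenize one SMILES string with the pre-training tokenizer.
tokenizer = AutoTokenizer.from_pretrained(
    "ibm-research/MoLFormer-XL-both-10pct",
    trust_remote_code=True,
)

encoded = tokenizer("CC(=O)Oc1ccccc1C(=O)O")  # aspirin
print(encoded["input_ids"])
print(tokenizer.convert_ids_to_tokens(encoded["input_ids"]))
```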

## Pre-training Objective

The model was pre-trained with masked language modeling.

The sample generation configuration uses:

| Setting | Value |
| --- | ---: |
| Masked LM probability | 0.15 |
| Maximum sequence length | 512 |
| Maximum predictions per sequence | 77 |
| Tokenizer | `ibm-research/MoLFormer-XL-both-10pct` |
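
The repository pre-generates masked samples offline. As a rough illustration of the 15% masking objective only, a dynamic-masking equivalent in plain `transformers` would look something like this:

```python
from transformers import AutoTokenizer, DataCollatorForLanguageModeling

# Illustration only: the repository masks offline during sample generation,
# while this sketch masks on the fly with the same 15% probability.
tokenizer = AutoTokenizer.from_pretrained(
    "ibm-research/MoLFormer-XL-both-10pct",
    trust_remote_code=True,
)
collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer,
    mlm=True,
    mlm_probability=0.15,
)

encoded = tokenizer("CCO", truncation=True, max_length=512)
batch = collator([encoded])
print(batch["input_ids"], batch["labels"])  # labels are -100 except at masked positions
```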

## Downstream Tasks

The repository defines task metadata in `chemlm_pretraining/task_config.json`.

Supported task categories include:

* binary classification, e.g. BBBP, BACE, HIV
* multitask classification, e.g. Tox21, ClinTox, SIDER
* regression, e.g. QM9 targets, ESOL, FreeSolv, Lipophilicity

Metrics are selected from the task config:

| Task category | Metric |
| --- | --- |
| Classification | ROC-AUC |
| Multitask classification | Mean ROC-AUC over tasks |
| Regression | MAE or RMSE, depending on the task config |
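
As an illustration of the multitask metric (not the repository's evaluation code), mean ROC-AUC over tasks can be computed with scikit-learn, skipping tasks whose labels are missing or single-class; representing missing labels as NaN is an assumption made for this sketch.

```python
import numpy as np
from sklearn.metrics import roc_auc_score


def mean_roc_auc(y_true: np.ndarray, y_score: np.ndarray) -> float:
    """Illustrative mean ROC-AUC over tasks; missing labels are NaN (assumed)."""
    aucs = []
    for t in range(y_true.shape[1]):
        valid = ~np.isnan(y_true[:, t])
        labels = y_true[valid, t]
        if valid.sum() == 0 or len(np.unique(labels)) < 2:
            continue  # skip tasks without both classes observed
        aucs.append(roc_auc_score(labels, y_score[valid, t]))
    return float(np.mean(aucs))
```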

## Citation

If you use this model, please cite:

```bibtex
@misc{sagawa2026largescalechemicallanguagemodels,
      title={How Well Do Large-Scale Chemical Language Models Transfer to Downstream Tasks?},
      author={Tatsuya Sagawa and Ryosuke Kojima},
      year={2026},
      eprint={2602.11618},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2602.11618},
}
```