---
library_name: transformers
tags:
- chemistry
- smiles
- molecular-property-prediction
- masked-language-modeling
- bert
license: apache-2.0
---

# ChemLM 2.30M

ChemLM 2.30M is a small BERT-style chemical language model pre-trained with masked language modeling on molecular SMILES strings.

This checkpoint is intended as an encoder initialization for downstream molecular property prediction tasks, through the fine-tuning and linear probe code in the original repository.

- Repository: https://github.com/sagawatatsuya/chemlm_pretraining
- Paper: *How Well Do Large-Scale Chemical Language Models Transfer to Downstream Tasks?*
- Tokenizer: `ibm-research/MoLFormer-XL-both-10pct`
- Model size: 2.30M parameters

## Model Details

| Hyperparameter | Value |
|---|---:|
| Hidden size | 192 |
| Number of hidden layers | 5 |
| Number of attention heads | 3 |
| Intermediate size | 768 |
| Vocabulary size | 2362 |
| Maximum sequence length during pre-training | 512 |

The model was pre-trained using the Academic Budget BERT-based implementation in the repository and converted from a DeepSpeed checkpoint to Hugging Face format.

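For orientation, these hyperparameters correspond roughly to the following standard `transformers.BertConfig` (a sketch only; the checkpoint actually ships the repository's custom `PretrainedBertConfig`):

```python
from transformers import BertConfig

# Approximate stand-in for the architecture in the table above (illustrative only;
# the real checkpoint uses the repository's custom PretrainedBertConfig).
config = BertConfig(
    hidden_size=192,
    num_hidden_layers=5,
    num_attention_heads=3,
    intermediate_size=768,
    vocab_size=2362,
    max_position_embeddings=512,
)
print(config.hidden_size // config.num_attention_heads)  # 64-dimensional attention heads
```
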
## Intended Use

For downstream molecular property prediction, load this checkpoint with the custom model class defined in the repository:

```python
from pretraining.configs import PretrainedBertConfig
from pretraining.modeling import BertForSequenceClassificationMolecule
```

The downstream model adds a task-specific classification or regression head on top of the pre-trained BERT encoder.

The supported training modes are:

* full fine-tuning
* linear probe, where the BERT encoder is frozen and only the prediction head is trained

## Model Architecture for Downstream Tasks

For downstream molecular property prediction, the repository uses:

```python
BertForSequenceClassificationMolecule
```

This model consists of:

1. a pre-trained `BertModel` encoder,
2. `BertPoolerC`, which mean-pools non-special tokens using the attention mask,
3. dropout,
4. a linear prediction head:

```python
self.classifier = nn.Linear(config.hidden_size, self.num_labels)
```

The output is a Hugging Face `SequenceClassifierOutput` containing `loss` and `logits`.

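As a rough illustration of this head (not the repository's exact `BertPoolerC` implementation; the pooling below masks padding via the attention mask, and the dropout probability is an assumption), the forward pass can be sketched as:

```python
import torch
import torch.nn as nn

def mean_pool(hidden_states: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
    # hidden_states: (batch, seq_len, hidden); attention_mask: (batch, seq_len)
    mask = attention_mask.unsqueeze(-1).float()   # 1 for kept tokens, 0 for masked-out ones
    summed = (hidden_states * mask).sum(dim=1)    # sum of kept token embeddings
    counts = mask.sum(dim=1).clamp(min=1e-9)      # number of kept tokens per sample
    return summed / counts                        # (batch, hidden)

class MoleculeHead(nn.Module):
    """Pooling + dropout + linear head, mirroring the description above (illustrative)."""
    def __init__(self, hidden_size: int, num_labels: int, dropout: float = 0.1):
        super().__init__()
        self.dropout = nn.Dropout(dropout)        # dropout value is an assumption
        self.classifier = nn.Linear(hidden_size, num_labels)

    def forward(self, hidden_states, attention_mask):
        pooled = mean_pool(hidden_states, attention_mask)
        return self.classifier(self.dropout(pooled))  # logits: (batch, num_labels)
```
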
The loss function depends on the task type:

| Task type | Number of labels | Loss |
| ------------------------ | -----------------------: | ------------------------ |
| Regression | 1 | MSELoss |
| Binary classification | 2 | CrossEntropyLoss |
| Multitask classification | Number of target columns | Masked BCEWithLogitsLoss |

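For the multitask case, "masked" BCE means that missing labels are excluded from the loss. A minimal sketch of the idea (the repository's exact masking convention, e.g. how missing labels are encoded, is an assumption here):

```python
import torch
import torch.nn as nn

def masked_bce_with_logits(logits: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    # labels: (batch, num_tasks) with NaN where a molecule has no measurement for a task
    # (the NaN convention is an assumption for this illustration).
    mask = ~torch.isnan(labels)                                   # True where a label exists
    safe_labels = torch.where(mask, labels, torch.zeros_like(labels))
    per_element = nn.functional.binary_cross_entropy_with_logits(
        logits, safe_labels, reduction="none"
    )
    return (per_element * mask).sum() / mask.sum().clamp(min=1)   # average over observed labels
```
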
## Usage

The intended usage is to load the pre-trained checkpoint through the repository's downstream training script.

First, clone the original repository and install the fine-tuning environment:

```bash
git clone https://github.com/sagawatatsuya/chemlm_pretraining.git
cd chemlm_pretraining

conda create -n chemlm_finetuning python=3.11
conda activate chemlm_finetuning

pip install -r requirements.txt
pip install torch transformers==4.57.3
pip install -U accelerate deepspeed
```

Then run fine-tuning, for example on BBBP:

```bash
python chemlm_pretraining/run_ft_molecule.py \
    --model_name_or_path "sagawatatsuya/chemlm-2.30m" \
    --tokenizer_name "ibm-research/MoLFormer-XL-both-10pct" \
    --train_file "./chemlm_pretraining/dataset/finetune_datasets/data/bbbp/train.csv" \
    --validation_file "./chemlm_pretraining/dataset/finetune_datasets/data/bbbp/valid.csv" \
    --test_file "./chemlm_pretraining/dataset/finetune_datasets/data/bbbp/test.csv" \
    --task_name "bbbp" \
    --task_config "chemlm_pretraining/task_config.json" \
    --do_train \
    --do_eval \
    --do_predict \
    --per_device_train_batch_size 256 \
    --per_device_eval_batch_size 512 \
    --learning_rate 3e-5 \
    --num_train_epochs 500 \
    --save_strategy epoch \
    --eval_strategy epoch
```

For linear probe evaluation, set:

```bash
--training_type "linear_probe"
```

In this mode, the script freezes the parameters under `model.bert` and trains only the task-specific head.

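Conceptually, the linear probe amounts to something like the following (an illustrative sketch, not the script's exact code):

```python
# model: a BertForSequenceClassificationMolecule instance, loaded as shown in the next section.
# Freeze the pre-trained encoder; only the task-specific head stays trainable.
for param in model.bert.parameters():
    param.requires_grad = False

trainable = [name for name, p in model.named_parameters() if p.requires_grad]
print(trainable)  # e.g. the classifier head parameters
```
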
## Loading the Model in Python

The downstream script loads local converted checkpoints with the custom config and model class:

```python
import json
from argparse import Namespace

from transformers import AutoTokenizer
from pretraining.configs import PretrainedBertConfig
from pretraining.modeling import BertForSequenceClassificationMolecule

# Point this at a local copy of the checkpoint; args.json is read from this directory.
model_name_or_path = "sagawatatsuya/chemlm-2.30m"
task_name = "bbbp"
task_config_path = "chemlm_pretraining/task_config.json"

# Task metadata: category, target columns, and maximum sequence length.
task_to_keys = json.load(open(task_config_path))
task_info = task_to_keys[task_name]

if task_info["task_category"] == "regression":
    num_labels = 1
elif task_info["task_category"] == "classification":
    num_labels = 2
else:
    num_labels = len(task_info["target_columns"])

# Pre-training arguments saved alongside the checkpoint weights.
pretrain_run_args = json.load(open(f"{model_name_or_path}/args.json"))
ds_args = Namespace(**pretrain_run_args)

config = PretrainedBertConfig.from_pretrained(
    model_name_or_path,
    num_labels=num_labels,
    finetuning_task=task_name,
    layer_norm_type="pytorch",
    task_category=task_info["task_category"],
    fused_linear_layer=True,
    max_seq_length=task_info["max_seq_length"],
)

tokenizer = AutoTokenizer.from_pretrained(
    "ibm-research/MoLFormer-XL-both-10pct",
    trust_remote_code=True,
)

model = BertForSequenceClassificationMolecule.from_pretrained(
    model_name_or_path,
    config=config,
    args=ds_args,
)
```

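Once loaded, inference might look like the sketch below, assuming the model accepts standard `input_ids`/`attention_mask` keyword arguments like a regular Hugging Face BERT classifier (the exact forward signature of `BertForSequenceClassificationMolecule` is an assumption here):

```python
import torch

smiles = "CCO"  # an illustrative SMILES string (ethanol)
inputs = tokenizer(smiles, return_tensors="pt")

model.eval()
with torch.no_grad():
    outputs = model(
        input_ids=inputs["input_ids"],
        attention_mask=inputs["attention_mask"],
    )

print(outputs.logits)  # shape (1, num_labels): classification scores or a regression value
```
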
For normal use, `run_ft_molecule.py` is recommended instead of manually writing this loading code.

## Data

The pre-training data preparation follows the repository pipeline:

1. download ZINC-15 and PubChem SMILES used in MoLFormer-style pre-training,
2. preprocess, shard, and split the dataset,
3. create masked language modeling samples using `ibm-research/MoLFormer-XL-both-10pct`.

## Pre-training Objective

The model was pre-trained with masked language modeling.

The sample generation uses the following configuration:

| Setting | Value |
| -------------------------------- | -------------------------------------: |
| Masked LM probability | 0.15 |
| Maximum sequence length | 512 |
| Maximum predictions per sequence | 77 |
| Tokenizer | `ibm-research/MoLFormer-XL-both-10pct` |

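The repository generates masked samples offline; as a rough on-the-fly equivalent of the masking settings above (an illustration, not the repository's pipeline), Hugging Face's MLM data collator could be used like this:

```python
from transformers import AutoTokenizer, DataCollatorForLanguageModeling

tokenizer = AutoTokenizer.from_pretrained(
    "ibm-research/MoLFormer-XL-both-10pct", trust_remote_code=True
)

# Mask 15% of tokens, mirroring the masked-LM probability in the table above.
# (Assumes the tokenizer exposes a mask token; unlike the repository's offline
# sample generation, this collator does not cap predictions per sequence.)
collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15
)

encoded = tokenizer(["CCO", "c1ccccc1"], truncation=True, max_length=512)
batch = collator([{"input_ids": ids} for ids in encoded["input_ids"]])
print(batch["input_ids"].shape, batch["labels"].shape)  # masked inputs and MLM labels
```
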
## Downstream Tasks

The repository defines task metadata in `chemlm_pretraining/task_config.json`.

Supported task categories include:

* binary classification, e.g. BBBP, BACE, HIV
* multitask classification, e.g. Tox21, ClinTox, SIDER
* regression, e.g. QM9 targets, ESOL, FreeSolv, Lipophilicity

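For orientation, a hypothetical entry in that file might look like the following (the keys are those read by the loading code above; the names and values here are placeholders, not the repository's actual configuration):

```python
# Hypothetical task_config.json entry, shown as a Python literal (illustrative only).
example_task_config = {
    "bbbp": {
        "task_category": "classification",  # or a regression / multitask category
        "target_columns": ["label"],        # placeholder target column name
        "max_seq_length": 128,              # placeholder sequence length
    }
}
```
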
Metrics are selected from the task config:

| Task category | Metric |
| ------------------------ | ----------------------------------------- |
| Classification | ROC-AUC |
| Multitask classification | Mean ROC-AUC over tasks |
| Regression | MAE or RMSE, depending on the task config |

## Citation

If you use this model, please cite:

```bibtex
@misc{sagawa2026largescalechemicallanguagemodels,
  title={How Well Do Large-Scale Chemical Language Models Transfer to Downstream Tasks?},
  author={Tatsuya Sagawa and Ryosuke Kojima},
  year={2026},
  eprint={2602.11618},
  archivePrefix={arXiv},
  primaryClass={cs.LG},
  url={https://arxiv.org/abs/2602.11618},
}
```