---
library_name: transformers
tags:
- chemistry
- smiles
- molecular-property-prediction
- masked-language-modeling
- bert
license: apache-2.0
---

# ChemLM 2.30M

ChemLM 2.30M is a small BERT-style chemical language model pre-trained with masked language modeling on molecular SMILES strings.

This checkpoint is intended to be used as an encoder initialization for downstream molecular property prediction tasks through the fine-tuning and linear probe code in the original repository.

- Repository: https://github.com/sagawatatsuya/chemlm_pretraining
- Paper: *How Well Do Large-Scale Chemical Language Models Transfer to Downstream Tasks?*
- Tokenizer: `ibm-research/MoLFormer-XL-both-10pct`
- Model size: 2.30M parameters

## Model Details

| Hyperparameter | Value |
|---|---:|
| Hidden size | 192 |
| Number of hidden layers | 5 |
| Number of attention heads | 3 |
| Intermediate size | 768 |
| Vocabulary size | 2362 |
| Maximum sequence length during pre-training | 512 |

The model was pre-trained using the Academic Budget BERT-based implementation in the repository and converted from a DeepSpeed checkpoint to Hugging Face format.
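
For orientation, the table maps onto the standard `transformers` `BertConfig` fields roughly as shown below. This is only an illustrative sketch of the sizes involved, not the repository's own loading code; the checkpoint itself is loaded through the custom `PretrainedBertConfig` shown later in this card.

```python
# Illustrative only: the checkpoint ships the repository's custom
# PretrainedBertConfig; this sketch just restates the hyperparameter
# table with standard transformers BertConfig fields.
from transformers import BertConfig

config = BertConfig(
    hidden_size=192,
    num_hidden_layers=5,
    num_attention_heads=3,
    intermediate_size=768,
    vocab_size=2362,
    max_position_embeddings=512,
)
print(config.hidden_size // config.num_attention_heads)  # 64-dimensional attention heads
```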

## Intended Use

This checkpoint is intended for downstream molecular property prediction by loading it with the custom model class defined in the repository:

```python
from pretraining.configs import PretrainedBertConfig
from pretraining.modeling import BertForSequenceClassificationMolecule
```

The downstream model adds a task-specific classification or regression head on top of the pre-trained BERT encoder.

The supported training modes are:

* full fine-tuning
* linear probe, where the BERT encoder is frozen and only the prediction head is trained

## Model Architecture for Downstream Tasks

For downstream molecular property prediction, the repository uses:

```python
BertForSequenceClassificationMolecule
```

This model consists of:

1. a pre-trained `BertModel` encoder,
2. `BertPoolerC`, which mean-pools non-special tokens using the attention mask,
3. dropout,
4. a linear prediction head:

```python
self.classifier = nn.Linear(config.hidden_size, self.num_labels)
```

The output is a Hugging Face `SequenceClassifierOutput` containing `loss` and `logits`.
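
The exact `BertPoolerC` implementation lives in the repository; the standalone sketch below only illustrates the idea of attention-mask-weighted mean pooling. It pools over all attention-mask positions, whereas the repository's pooler additionally excludes special tokens.

```python
import torch


def masked_mean_pool(hidden_states: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
    """Illustrative mean pooling over non-padding tokens (not the repository's code).

    hidden_states: (batch, seq_len, hidden_size) encoder outputs
    attention_mask: (batch, seq_len), 1 for real tokens and 0 for padding
    """
    mask = attention_mask.unsqueeze(-1).to(hidden_states.dtype)  # (batch, seq_len, 1)
    summed = (hidden_states * mask).sum(dim=1)                   # sum over kept tokens
    counts = mask.sum(dim=1).clamp(min=1e-9)                     # avoid division by zero
    return summed / counts                                        # (batch, hidden_size)
```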

The loss function depends on the task type:

| Task type | Number of labels | Loss |
| --- | ---: | --- |
| Regression | 1 | MSELoss |
| Binary classification | 2 | CrossEntropyLoss |
| Multitask classification | Number of target columns | Masked BCEWithLogitsLoss |
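
For the multitask case, datasets such as Tox21 contain missing labels, so the BCE loss is only averaged over observed entries. The sketch below illustrates that idea; it is not the repository's exact loss code, and the `-1` missing-label sentinel is an assumption made here for the example.

```python
import torch
import torch.nn as nn


def masked_bce_loss(logits: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    """Illustrative masked BCEWithLogitsLoss for multitask labels.

    logits, labels: (batch, num_tasks); missing labels are marked with -1
    (sentinel assumed for this sketch, not taken from the repository).
    """
    valid = (labels >= 0).float()                       # 1 where a label is observed
    targets = labels.clamp(min=0).float()               # neutralize the -1 sentinel
    per_element = nn.functional.binary_cross_entropy_with_logits(
        logits, targets, reduction="none"
    )
    return (per_element * valid).sum() / valid.sum().clamp(min=1.0)
```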

## Usage

The intended usage is to load the pre-trained checkpoint inside the repository's downstream training script.

First, clone the original repository and install the fine-tuning environment:

```bash
git clone https://github.com/sagawatatsuya/chemlm_pretraining.git
cd chemlm_pretraining

conda create -n chemlm_finetuning python=3.11
conda activate chemlm_finetuning

pip install -r requirements.txt
pip install torch transformers==4.57.3
pip install -U accelerate deepspeed
```

Then run fine-tuning, for example on BBBP:

```bash
python chemlm_pretraining/run_ft_molecule.py \
    --model_name_or_path "sagawatatsuya/chemlm-2.30m" \
    --tokenizer_name "ibm-research/MoLFormer-XL-both-10pct" \
    --train_file "./chemlm_pretraining/dataset/finetune_datasets/data/bbbp/train.csv" \
    --validation_file "./chemlm_pretraining/dataset/finetune_datasets/data/bbbp/valid.csv" \
    --test_file "./chemlm_pretraining/dataset/finetune_datasets/data/bbbp/test.csv" \
    --task_name "bbbp" \
    --task_config "chemlm_pretraining/task_config.json" \
    --do_train \
    --do_eval \
    --do_predict \
    --per_device_train_batch_size 256 \
    --per_device_eval_batch_size 512 \
    --learning_rate 3e-5 \
    --num_train_epochs 500 \
    --save_strategy epoch \
    --eval_strategy epoch
```

For linear probe evaluation, set:

```bash
--training_type "linear_probe"
```

In this mode, the script freezes the parameters under `model.bert` and trains only the task-specific head.
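
Conceptually, the linear probe mode amounts to the following sketch, assuming the downstream model exposes its encoder as `model.bert` as described above and with `model` loaded as in the next section. The training script does this for you; the snippet is only meant to show what "frozen encoder" means here.

```python
# Sketch of linear probing: freeze the encoder, keep only the head trainable.
# Assumes `model` is a BertForSequenceClassificationMolecule instance
# (see "Loading the Model in Python" below).
for param in model.bert.parameters():
    param.requires_grad = False

trainable = [name for name, p in model.named_parameters() if p.requires_grad]
print(trainable)  # expected to list only the head parameters (e.g. the classifier)
```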

## Loading the Model in Python

The downstream script loads local converted checkpoints with the custom config and model class:

```python
import json
from argparse import Namespace

from transformers import AutoTokenizer
from pretraining.configs import PretrainedBertConfig
from pretraining.modeling import BertForSequenceClassificationMolecule

model_name_or_path = "sagawatatsuya/chemlm-2.30m"
task_name = "bbbp"
task_config_path = "chemlm_pretraining/task_config.json"

task_to_keys = json.load(open(task_config_path))
task_info = task_to_keys[task_name]

if task_info["task_category"] == "regression":
    num_labels = 1
elif task_info["task_category"] == "classification":
    num_labels = 2
else:
    num_labels = len(task_info["target_columns"])

pretrain_run_args = json.load(open(f"{model_name_or_path}/args.json"))
ds_args = Namespace(**pretrain_run_args)

config = PretrainedBertConfig.from_pretrained(
    model_name_or_path,
    num_labels=num_labels,
    finetuning_task=task_name,
    layer_norm_type="pytorch",
    task_category=task_info["task_category"],
    fused_linear_layer=True,
    max_seq_length=task_info["max_seq_length"],
)

tokenizer = AutoTokenizer.from_pretrained(
    "ibm-research/MoLFormer-XL-both-10pct",
    trust_remote_code=True,
)

model = BertForSequenceClassificationMolecule.from_pretrained(
    model_name_or_path,
    config=config,
    args=ds_args,
)
```
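
Continuing from the loading code above, a quick prediction sketch could look like the following. It assumes the custom model accepts the standard Hugging Face `input_ids` and `attention_mask` arguments, which should be verified against the repository's `pretraining/modeling.py`.

```python
import torch

# Assumption: the custom model follows the usual Hugging Face forward
# signature (input_ids, attention_mask); check pretraining/modeling.py.
model.eval()
inputs = tokenizer("CC(=O)Oc1ccccc1C(=O)O", return_tensors="pt")  # aspirin SMILES
with torch.no_grad():
    outputs = model(input_ids=inputs["input_ids"], attention_mask=inputs["attention_mask"])
print(outputs.logits)
```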

For normal use, `run_ft_molecule.py` is recommended instead of manually writing this loading code.

## Data

The pre-training data preparation follows the repository pipeline:

1. download ZINC-15 and PubChem SMILES used in MoLFormer-style pre-training,
2. preprocess, shard, and split the dataset,
3. create masked language modeling samples using `ibm-research/MoLFormer-XL-both-10pct`.
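
For reference, the tokenizer used in step 3 can be applied to a single SMILES string as in the minimal sketch below; the actual sample generation is handled by the repository's scripts.

```python
from transformers import AutoTokenizer

# Minimal sketch: tokenize one SMILES string with the pre-training tokenizer.
tokenizer = AutoTokenizer.from_pretrained(
    "ibm-research/MoLFormer-XL-both-10pct",
    trust_remote_code=True,
)

encoded = tokenizer("CC(=O)Oc1ccccc1C(=O)O")  # aspirin
print(encoded["input_ids"])
print(tokenizer.convert_ids_to_tokens(encoded["input_ids"]))
```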

## Pre-training Objective

The model was pre-trained with masked language modeling.

The sample generation configuration uses:

| Setting | Value |
| --- | ---: |
| Masked LM probability | 0.15 |
| Maximum sequence length | 512 |
| Maximum predictions per sequence | 77 |
| Tokenizer | `ibm-research/MoLFormer-XL-both-10pct` |
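
The repository pre-generates masked samples offline. As a rough illustration of the 15% masking objective only, a dynamic-masking equivalent in plain `transformers` would look something like this:

```python
from transformers import AutoTokenizer, DataCollatorForLanguageModeling

# Illustration only: the repository masks offline during sample generation,
# while this sketch masks on the fly with the same 15% probability.
tokenizer = AutoTokenizer.from_pretrained(
    "ibm-research/MoLFormer-XL-both-10pct",
    trust_remote_code=True,
)
collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer,
    mlm=True,
    mlm_probability=0.15,
)

encoded = tokenizer("CCO", truncation=True, max_length=512)
batch = collator([encoded])
print(batch["input_ids"], batch["labels"])  # labels are -100 except at masked positions
```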

## Downstream Tasks

The repository defines task metadata in `chemlm_pretraining/task_config.json`.

Supported task categories include:

* binary classification, e.g. BBBP, BACE, HIV
* multitask classification, e.g. Tox21, ClinTox, SIDER
* regression, e.g. QM9 targets, ESOL, FreeSolv, Lipophilicity

Metrics are selected from the task config:

| Task category | Metric |
| --- | --- |
| Classification | ROC-AUC |
| Multitask classification | Mean ROC-AUC over tasks |
| Regression | MAE or RMSE, depending on the task config |
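
As an illustration of the multitask metric (not the repository's evaluation code), mean ROC-AUC over tasks can be computed with scikit-learn, skipping tasks whose labels are missing or single-class; representing missing labels as NaN is an assumption made for this sketch.

```python
import numpy as np
from sklearn.metrics import roc_auc_score


def mean_roc_auc(y_true: np.ndarray, y_score: np.ndarray) -> float:
    """Illustrative mean ROC-AUC over tasks; missing labels are NaN (assumed)."""
    aucs = []
    for t in range(y_true.shape[1]):
        valid = ~np.isnan(y_true[:, t])
        labels = y_true[valid, t]
        if valid.sum() == 0 or len(np.unique(labels)) < 2:
            continue  # skip tasks without both classes observed
        aucs.append(roc_auc_score(labels, y_score[valid, t]))
    return float(np.mean(aucs))
```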

## Citation

If you use this model, please cite:

```bibtex
@misc{sagawa2026largescalechemicallanguagemodels,
      title={How Well Do Large-Scale Chemical Language Models Transfer to Downstream Tasks?},
      author={Tatsuya Sagawa and Ryosuke Kojima},
      year={2026},
      eprint={2602.11618},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2602.11618},
}
```