---
library_name: transformers
tags:
- chemistry
- smiles
- molecular-property-prediction
- masked-language-modeling
- bert
license: apache-2.0
---

# ChemLM 2.30M

ChemLM 2.30M is a small BERT-style chemical language model pre-trained with masked language modeling on molecular SMILES strings.

This checkpoint is intended as an encoder initialization for downstream molecular property prediction, via the fine-tuning and linear-probe code in the original repository.

- Repository: https://github.com/sagawatatsuya/chemlm_pretraining
- Paper: *How Well Do Large-Scale Chemical Language Models Transfer to Downstream Tasks?*
- Tokenizer: `ibm-research/MoLFormer-XL-both-10pct`
- Model size: 2.30M parameters

## Model Details

| Hyperparameter | Value |
|---|---:|
| Hidden size | 192 |
| Number of hidden layers | 5 |
| Number of attention heads | 3 |
| Intermediate size | 768 |
| Vocabulary size | 2362 |
| Maximum sequence length during pre-training | 512 |

The model was pre-trained using the Academic Budget BERT-based implementation in the repository and converted from a DeepSpeed checkpoint to Hugging Face format.

## Intended Use

This checkpoint is intended for downstream molecular property prediction by loading it with the custom model class defined in the repository:

```python
from pretraining.configs import PretrainedBertConfig
from pretraining.modeling import BertForSequenceClassificationMolecule
```

The downstream model adds a task-specific classification or regression head on top of the pre-trained BERT encoder.

The supported training modes are:

* full fine-tuning
* linear probe, where the BERT encoder is frozen and only the prediction head is trained

## Model Architecture for Downstream Tasks

For downstream molecular property prediction, the repository uses:

```python
BertForSequenceClassificationMolecule
```

This model consists of:

1. a pre-trained `BertModel` encoder,
2. `BertPoolerC`, which mean-pools non-special tokens using the attention mask (sketched below),
3. dropout,
4. a linear prediction head:

```python
self.classifier = nn.Linear(config.hidden_size, self.num_labels)
```

The output is a Hugging Face `SequenceClassifierOutput` containing `loss` and `logits`.
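
For illustration, step 2 can be approximated by an attention-mask-weighted mean over the encoder's token embeddings. This is only a sketch under that assumption, not the repository's `BertPoolerC` implementation, which also excludes special tokens:

```python
import torch

def masked_mean_pool(hidden_states: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
    """Mean-pool token embeddings, ignoring padding positions."""
    # hidden_states: (batch, seq_len, hidden); attention_mask: (batch, seq_len), 1 for real tokens
    mask = attention_mask.unsqueeze(-1).to(hidden_states.dtype)
    summed = (hidden_states * mask).sum(dim=1)
    counts = mask.sum(dim=1).clamp(min=1e-9)  # avoid division by zero for empty rows
    return summed / counts  # (batch, hidden)
```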

The loss function depends on the task type:

| Task type                |         Number of labels | Loss                     |
| ------------------------ | -----------------------: | ------------------------ |
| Regression               |                        1 | MSELoss                  |
| Binary classification    |                        2 | CrossEntropyLoss         |
| Multitask classification | Number of target columns | Masked BCEWithLogitsLoss |
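
A minimal sketch of this loss selection is shown below; the multitask branch assumes missing labels are encoded as -1, which is an illustrative convention and may differ from the repository's exact masking logic:

```python
import torch
import torch.nn as nn

def compute_loss(logits: torch.Tensor, labels: torch.Tensor, task_category: str) -> torch.Tensor:
    if task_category == "regression":
        # Single target: mean squared error on the raw prediction.
        return nn.MSELoss()(logits.view(-1), labels.view(-1).float())
    if task_category == "classification":
        # Binary classification with two logits per example.
        return nn.CrossEntropyLoss()(logits.view(-1, 2), labels.view(-1).long())
    # Multitask classification: binary cross-entropy, ignoring missing labels.
    labels = labels.float()
    valid = labels >= 0  # assumption: missing labels are encoded as -1
    per_element = nn.BCEWithLogitsLoss(reduction="none")(logits, labels.clamp(min=0.0))
    return (per_element * valid).sum() / valid.sum().clamp(min=1)
```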

## Usage

The intended usage is to load the pre-trained checkpoint inside the repository's downstream training script.

First, clone the original repository and install the fine-tuning environment:

```bash
git clone https://github.com/sagawatatsuya/chemlm_pretraining.git
cd chemlm_pretraining

conda create -n chemlm_finetuning python=3.11
conda activate chemlm_finetuning

pip install -r requirements.txt
pip install torch transformers==4.57.3
pip install -U accelerate deepspeed
```

Then run fine-tuning, for example on BBBP:

```bash
python chemlm_pretraining/run_ft_molecule.py \
  --model_name_or_path "sagawatatsuya/chemlm-2.30m" \
  --tokenizer_name "ibm-research/MoLFormer-XL-both-10pct" \
  --train_file "./chemlm_pretraining/dataset/finetune_datasets/data/bbbp/train.csv" \
  --validation_file "./chemlm_pretraining/dataset/finetune_datasets/data/bbbp/valid.csv" \
  --test_file "./chemlm_pretraining/dataset/finetune_datasets/data/bbbp/test.csv" \
  --task_name "bbbp" \
  --task_config "chemlm_pretraining/task_config.json" \
  --do_train \
  --do_eval \
  --do_predict \
  --per_device_train_batch_size 256 \
  --per_device_eval_batch_size 512 \
  --learning_rate 3e-5 \
  --num_train_epochs 500 \
  --save_strategy epoch \
  --eval_strategy epoch
```

For linear probe evaluation, set:

```bash
--training_type "linear_probe"
```

In this mode, the script freezes the parameters under `model.bert` and trains only the task-specific head.
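
Conceptually, the freezing step amounts to the following sketch (assuming `model` is a loaded `BertForSequenceClassificationMolecule`):

```python
# Freeze the pre-trained encoder; only the task-specific head remains trainable.
for param in model.bert.parameters():
    param.requires_grad = False

# Sanity check: only head parameters should be listed here.
print([name for name, p in model.named_parameters() if p.requires_grad])
```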

## Loading the Model in Python

The downstream script loads local converted checkpoints with the custom config and model class:

```python
import json
from argparse import Namespace

from transformers import AutoTokenizer
from pretraining.configs import PretrainedBertConfig
from pretraining.modeling import BertForSequenceClassificationMolecule

model_name_or_path = "sagawatatsuya/chemlm-2.30m"
task_name = "bbbp"
task_config_path = "chemlm_pretraining/task_config.json"

task_to_keys = json.load(open(task_config_path))
task_info = task_to_keys[task_name]

if task_info["task_category"] == "regression":
    num_labels = 1
elif task_info["task_category"] == "classification":
    num_labels = 2
else:
    num_labels = len(task_info["target_columns"])

pretrain_run_args = json.load(open(f"{model_name_or_path}/args.json"))
ds_args = Namespace(**pretrain_run_args)

config = PretrainedBertConfig.from_pretrained(
    model_name_or_path,
    num_labels=num_labels,
    finetuning_task=task_name,
    layer_norm_type="pytorch",
    task_category=task_info["task_category"],
    fused_linear_layer=True,
    max_seq_length=task_info["max_seq_length"],
)

tokenizer = AutoTokenizer.from_pretrained(
    "ibm-research/MoLFormer-XL-both-10pct",
    trust_remote_code=True,
)

model = BertForSequenceClassificationMolecule.from_pretrained(
    model_name_or_path,
    config=config,
    args=ds_args,
)
```

For normal use, `run_ft_molecule.py` is recommended instead of manually writing this loading code.
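
After loading the model manually as above, a quick forward pass can be run on a single SMILES string. This is an illustrative sketch; the example molecule is arbitrary, and the keyword arguments may need adjusting to the custom model's forward signature:

```python
import torch

smiles = "CC(=O)Oc1ccccc1C(=O)O"  # aspirin, as an example input

inputs = tokenizer(
    smiles,
    truncation=True,
    max_length=task_info["max_seq_length"],
    return_tensors="pt",
)

model.eval()
with torch.no_grad():
    outputs = model(
        input_ids=inputs["input_ids"],
        attention_mask=inputs["attention_mask"],
    )

print(outputs.logits)  # shape: (1, num_labels)
```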

## Data

The pre-training data preparation follows the repository pipeline:

1. download ZINC-15 and PubChem SMILES used in MoLFormer-style pre-training,
2. preprocess, shard, and split the dataset,
3. create masked language modeling samples using `ibm-research/MoLFormer-XL-both-10pct`.
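
As a small illustration of step 3, a SMILES string can be tokenized with the same tokenizer used during pre-training (this sketch covers only the tokenization, not the repository's sharding and sample-generation pipeline):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    "ibm-research/MoLFormer-XL-both-10pct",
    trust_remote_code=True,
)

smiles = "CCO"  # ethanol, as a toy example
encoding = tokenizer(smiles, truncation=True, max_length=512)
print(encoding["input_ids"])
print(tokenizer.convert_ids_to_tokens(encoding["input_ids"]))
```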

## Pre-training Objective

The model was pre-trained with masked language modeling.

The sample generation configuration uses:

| Setting                          |                                  Value |
| -------------------------------- | -------------------------------------: |
| Masked LM probability            |                                   0.15 |
| Maximum sequence length          |                                    512 |
| Maximum predictions per sequence |                                     77 |
| Tokenizer                        | `ibm-research/MoLFormer-XL-both-10pct` |
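
A rough sketch of how such masked samples can be produced with the standard Hugging Face data collator is shown below. It assumes the tokenizer exposes mask and padding tokens, and it approximates rather than reproduces the repository's sample-generation script:

```python
from transformers import AutoTokenizer, DataCollatorForLanguageModeling

tokenizer = AutoTokenizer.from_pretrained(
    "ibm-research/MoLFormer-XL-both-10pct",
    trust_remote_code=True,
)

# 15% of tokens are selected for masking, matching the table above.
collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer,
    mlm=True,
    mlm_probability=0.15,
)

encodings = tokenizer(["CCO", "c1ccccc1"], truncation=True, max_length=512)
batch = collator([{"input_ids": ids} for ids in encodings["input_ids"]])

print(batch["input_ids"])  # some positions replaced by the mask token id
print(batch["labels"])     # original ids at masked positions, -100 elsewhere
```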

## Downstream Tasks

The repository defines task metadata in `chemlm_pretraining/task_config.json`.

Supported task categories include:

* binary classification, e.g. BBBP, BACE, HIV
* multitask classification, e.g. Tox21, ClinTox, SIDER
* regression, e.g. QM9 targets, ESOL, FreeSolv, Lipophilicity

Metrics are selected from the task config:

| Task category            | Metric                                    |
| ------------------------ | ----------------------------------------- |
| Classification           | ROC-AUC                                   |
| Multitask classification | Mean ROC-AUC over tasks                   |
| Regression               | MAE or RMSE, depending on the task config |
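
For illustration, these metrics can be computed with scikit-learn roughly as follows; the toy arrays and the convention of encoding missing multitask labels as -1 are assumptions made for this sketch:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, roc_auc_score

# Binary classification: ROC-AUC on the positive-class score.
y_true = np.array([0, 1, 1, 0])
y_score = np.array([0.2, 0.8, 0.6, 0.4])
print("ROC-AUC:", roc_auc_score(y_true, y_score))

# Multitask classification: mean ROC-AUC over target columns.
# Missing labels are assumed to be encoded as -1 and skipped per task.
y_true_multi = np.array([[1, 0], [0, -1], [1, 1], [0, 0]])
y_score_multi = np.array([[0.9, 0.3], [0.2, 0.7], [0.8, 0.6], [0.1, 0.4]])
aucs = []
for t in range(y_true_multi.shape[1]):
    mask = y_true_multi[:, t] >= 0
    if len(np.unique(y_true_multi[mask, t])) == 2:
        aucs.append(roc_auc_score(y_true_multi[mask, t], y_score_multi[mask, t]))
print("Mean ROC-AUC:", float(np.mean(aucs)))

# Regression: MAE or RMSE, depending on the task config.
y_true_reg = np.array([1.2, -0.5, 3.3])
y_pred_reg = np.array([1.0, -0.2, 3.0])
print("MAE:", mean_absolute_error(y_true_reg, y_pred_reg))
print("RMSE:", mean_squared_error(y_true_reg, y_pred_reg) ** 0.5)
```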

## Citation

If you use this model, please cite:

```bibtex
@misc{sagawa2026largescalechemicallanguagemodels,
      title={How Well Do Large-Scale Chemical Language Models Transfer to Downstream Tasks?},
      author={Tatsuya Sagawa and Ryosuke Kojima},
      year={2026},
      eprint={2602.11618},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2602.11618},
}
```