MolScaleTransfer ChemLM 2.30M

MolScaleTransfer ChemLM 2.30M is a small BERT-style chemical language model pre-trained with masked language modeling on molecular SMILES strings.

This checkpoint is intended to be used as an encoder initialization for downstream molecular property prediction tasks, and as one point in a model-scaling study of chemical language model transfer.

Model Details

  • Hidden size: 192
  • Number of hidden layers: 5
  • Number of attention heads: 3
  • Intermediate size: 768
  • Vocabulary size: 2362
  • Maximum sequence length during pre-training: 512

The model was pre-trained with the repository's Academic Budget BERT-based implementation and then converted from a DeepSpeed checkpoint to Hugging Face format.
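
For reference, these dimensions correspond to a standard BERT encoder configuration along the following lines. This is an illustration only, using the stock transformers BertConfig; the repository itself uses its own PretrainedBertConfig class.

from transformers import BertConfig

# Illustrative stand-in for the repository's PretrainedBertConfig,
# shown only to make the encoder dimensions concrete.
config = BertConfig(
    vocab_size=2362,
    hidden_size=192,
    num_hidden_layers=5,
    num_attention_heads=3,
    intermediate_size=768,
    max_position_embeddings=512,
)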

Intended Use

This checkpoint is intended for:

  • downstream molecular property prediction by fine-tuning or linear probing,
  • checkpoint-to-checkpoint comparison in scaling experiments,
  • relating pre-training loss to downstream transfer performance.

For downstream molecular property prediction, load it with the custom model class defined in the repository:

from molscaletransfer.pretraining.configs import PretrainedBertConfig
from molscaletransfer.pretraining.modeling import BertForSequenceClassificationMolecule

The downstream model adds a task-specific classification or regression head on top of the pre-trained BERT encoder.

The supported training modes are:

  • full fine-tuning
  • linear probe, where the BERT encoder is frozen and only the prediction head is trained

Model Architecture for Downstream Tasks

For downstream molecular property prediction, the repository uses:

BertForSequenceClassificationMolecule

This model consists of:

  1. a pre-trained BertModel encoder,
  2. BertPoolerC, which mean-pools non-special tokens using the attention mask,
  3. dropout,
  4. a linear prediction head:
     self.classifier = nn.Linear(config.hidden_size, self.num_labels)

The output is a Hugging Face SequenceClassifierOutput containing loss and logits.
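
As a rough sketch, the pooling step can be thought of as a masked mean over token embeddings. The version below is simplified: it averages over all attention-masked positions, whereas the repository's BertPoolerC also excludes special tokens.

import torch

def mean_pool(last_hidden_state, attention_mask):
    # last_hidden_state: (batch, seq_len, hidden); attention_mask: (batch, seq_len)
    mask = attention_mask.unsqueeze(-1).to(last_hidden_state.dtype)
    summed = (last_hidden_state * mask).sum(dim=1)
    counts = mask.sum(dim=1).clamp(min=1e-9)
    return summed / counts  # (batch, hidden)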

The loss function depends on the task type:

  • Regression (1 label): MSELoss
  • Binary classification (2 labels): CrossEntropyLoss
  • Multitask classification (one label per target column): masked BCEWithLogitsLoss
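
Because multitask datasets can have missing targets for some molecules, the multitask loss only counts labelled entries. A minimal sketch, assuming missing labels are marked with NaN (the actual masking convention follows the repository code):

import torch
import torch.nn as nn

def masked_bce_with_logits(logits, labels):
    # logits, labels: (batch, num_tasks); NaN in labels marks a missing target (assumed convention).
    mask = ~torch.isnan(labels)
    per_element = nn.BCEWithLogitsLoss(reduction="none")(logits, torch.nan_to_num(labels))
    # Average only over positions that actually have a label.
    return (per_element * mask).sum() / mask.sum().clamp(min=1)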

Usage

The intended usage is to load the pre-trained checkpoint inside the repository's downstream training script.

First, clone the repository and install the transfer-evaluation environment:

git clone https://github.com/sagawatatsuya/MolScaleTransfer.git
cd MolScaleTransfer

conda create -n molscaletransfer_transfer python=3.11
conda activate molscaletransfer_transfer

pip install -r requirements.txt
pip install torch transformers==4.57.3
pip install -U accelerate deepspeed

Then run fine-tuning, for example on BBBP:

python -m molscaletransfer.transfer.run_ft_molecule \
  --model_name_or_path "sagawatatsuya/molscaletransfer-chemlm-2.30m" \
  --tokenizer_name "ibm-research/MoLFormer-XL-both-10pct" \
  --train_file "./molscaletransfer/dataset/finetune_datasets/data/bbbp/train.csv" \
  --validation_file "./molscaletransfer/dataset/finetune_datasets/data/bbbp/valid.csv" \
  --test_file "./molscaletransfer/dataset/finetune_datasets/data/bbbp/test.csv" \
  --task_name "bbbp" \
  --task_config "molscaletransfer/task_config.json" \
  --do_train \
  --do_eval \
  --do_predict \
  --per_device_train_batch_size 256 \
  --per_device_eval_batch_size 512 \
  --learning_rate 3e-5 \
  --num_train_epochs 500 \
  --save_strategy epoch \
  --eval_strategy epoch

For linear probe evaluation, set:

--training_type "linear_probe"

In this mode, the script freezes the parameters under model.bert and trains only the task-specific head.
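
Conceptually, the freeze amounts to the following (a sketch; the script applies it automatically when --training_type is "linear_probe"):

# Freeze the encoder so that only the task-specific head is updated.
for param in model.bert.parameters():
    param.requires_grad = False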

Loading the Model in Python

The downstream script loads local converted checkpoints with the custom config and model class:

import json
from argparse import Namespace

from transformers import AutoTokenizer
from molscaletransfer.pretraining.configs import PretrainedBertConfig
from molscaletransfer.pretraining.modeling import BertForSequenceClassificationMolecule

# If loading from the Hub, download the checkpoint first (e.g. with
# huggingface_hub.snapshot_download) so that args.json is available on disk.
model_name_or_path = "sagawatatsuya/molscaletransfer-chemlm-2.30m"
task_name = "bbbp"
task_config_path = "molscaletransfer/task_config.json"

task_to_keys = json.load(open(task_config_path))
task_info = task_to_keys[task_name]

if task_info["task_category"] == "regression":
    num_labels = 1
elif task_info["task_category"] == "classification":
    num_labels = 2
else:
    num_labels = len(task_info["target_columns"])

# The pre-training arguments saved alongside the checkpoint are reused to
# rebuild the encoder exactly as it was pre-trained.
pretrain_run_args = json.load(open(f"{model_name_or_path}/args.json"))
ds_args = Namespace(**pretrain_run_args)

config = PretrainedBertConfig.from_pretrained(
    model_name_or_path,
    num_labels=num_labels,
    finetuning_task=task_name,
    layer_norm_type="pytorch",
    task_category=task_info["task_category"],
    fused_linear_layer=True,
    max_seq_length=task_info["max_seq_length"],
)

tokenizer = AutoTokenizer.from_pretrained(
    "ibm-research/MoLFormer-XL-both-10pct",
    trust_remote_code=True,
)

model = BertForSequenceClassificationMolecule.from_pretrained(
    model_name_or_path,
    config=config,
    args=ds_args,
)
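
After loading, a quick forward pass can be run on a SMILES string to check the output shapes, assuming the model accepts the standard input_ids and attention_mask keyword arguments. Note that the classifier head is randomly initialized until the model is fine-tuned, so the logits themselves are not yet meaningful.

import torch

smiles = "CCO"  # example molecule (ethanol)
inputs = tokenizer(smiles, return_tensors="pt")

model.eval()
with torch.no_grad():
    outputs = model(
        input_ids=inputs["input_ids"],
        attention_mask=inputs["attention_mask"],
    )
print(outputs.logits.shape)  # (1, num_labels)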

For normal use, run_ft_molecule.py is recommended instead of manually writing this loading code.

Data

The pre-training data preparation follows the repository pipeline:

  1. download ZINC-15 and PubChem SMILES used in MoLFormer-style pre-training,
  2. preprocess, shard, and split the dataset,
  3. create masked language modeling samples with the ibm-research/MoLFormer-XL-both-10pct tokenizer.

Pre-training Objective

The model was pre-trained with masked language modeling.

The sample generation configuration uses:

  • Masked LM probability: 0.15
  • Maximum sequence length: 512
  • Maximum predictions per sequence: 77
  • Tokenizer: ibm-research/MoLFormer-XL-both-10pct
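
The 77 maximum predictions correspond to roughly 15% of the 512-token maximum length. For illustration, on-the-fly masking with the same probability can be reproduced with the standard Hugging Face collator; this is only a sketch, since the repository's pipeline precomputes masked samples with its own scripts.

from transformers import AutoTokenizer, DataCollatorForLanguageModeling

tokenizer = AutoTokenizer.from_pretrained(
    "ibm-research/MoLFormer-XL-both-10pct", trust_remote_code=True
)
collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15
)

encoding = tokenizer("CC(=O)Oc1ccccc1C(=O)O", truncation=True, max_length=512)
batch = collator([encoding])
# batch["labels"] is -100 everywhere except the positions selected for masking.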

Downstream Tasks

The repository defines task metadata in molscaletransfer/task_config.json.

Supported task categories include:

  • binary classification, e.g. BBBP, BACE, HIV
  • multitask classification, e.g. Tox21, ClinTox, SIDER
  • regression, e.g. QM9 targets, ESOL, FreeSolv, Lipophilicity

Metrics are selected from the task config:

  • Classification: ROC-AUC
  • Multitask classification: mean ROC-AUC over tasks
  • Regression: MAE or RMSE, depending on the task config
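
A sketch of the multitask metric, assuming missing labels are marked with NaN as in the loss sketch above (the exact handling of tasks whose labelled subset contains only one class follows the repository code):

import numpy as np
from sklearn.metrics import roc_auc_score

def mean_roc_auc(y_true, y_score):
    # y_true, y_score: (n_molecules, n_tasks); NaN in y_true marks a missing label.
    aucs = []
    for task in range(y_true.shape[1]):
        mask = ~np.isnan(y_true[:, task])
        if len(np.unique(y_true[mask, task])) < 2:
            continue  # skip tasks whose labelled subset has only one class
        aucs.append(roc_auc_score(y_true[mask, task], y_score[mask, task]))
    return float(np.mean(aucs))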

Relation to MolScaleTransfer

This checkpoint is one example of the model family used in MolScaleTransfer, a toolkit for evaluating the scaling behavior and transfer performance of chemical language models.

Within that workflow, checkpoints like this can be used for:

  • pre-training loss evaluation,
  • Hessian trace or PGM analysis after conversion to Hugging Face format,
  • downstream fine-tuning and linear probe experiments,
  • comparison against larger or smaller checkpoints in the same scaling series.

Citation

If you use this model, please cite:

@misc{sagawa2026largescalechemicallanguagemodels,
      title={How Well Do Large-Scale Chemical Language Models Transfer to Downstream Tasks?},
      author={Tatsuya Sagawa and Ryosuke Kojima},
      year={2026},
      eprint={2602.11618},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2602.11618},
}