# MolScaleTransfer ChemLM 25.75M
MolScaleTransfer ChemLM 25.75M is a small BERT-style chemical language model pre-trained with masked language modeling on molecular SMILES strings.
This checkpoint is intended to be used as an encoder initialization for downstream molecular property prediction tasks, and as one point in a model-scaling study of chemical language model transfer.
- Repository: https://github.com/sagawatatsuya/MolScaleTransfer
- Paper: How Well Do Large-Scale Chemical Language Models Transfer to Downstream Tasks?
- Tokenizer: ibm-research/MoLFormer-XL-both-10pct
- Model size: 25.75M parameters
## Model Details
| Hyperparameter | Value |
|---|---|
| Hidden size | 512 |
| Number of hidden layers | 8 |
| Number of attention heads | 8 |
| Intermediate size | 2048 |
| Vocabulary size | 2362 |
| Maximum sequence length during pre-training | 512 |
The model was pre-trained using the Academic Budget BERT-based implementation in the repository and converted from a DeepSpeed checkpoint to Hugging Face format.
## Intended Use
This checkpoint is intended for:
- downstream molecular property prediction by fine-tuning or linear probing,
- checkpoint-to-checkpoint comparison in scaling experiments,
- relating pre-training loss to downstream transfer performance.
For downstream molecular property prediction, load it with the custom model class defined in the repository:
```python
from molscaletransfer.pretraining.configs import PretrainedBertConfig
from molscaletransfer.pretraining.modeling import BertForSequenceClassificationMolecule
```
The downstream model adds a task-specific classification or regression head on top of the pre-trained BERT encoder.
The supported training modes are:
- full fine-tuning
- linear probe, where the BERT encoder is frozen and only the prediction head is trained
## Model Architecture for Downstream Tasks
For downstream molecular property prediction, the repository uses `BertForSequenceClassificationMolecule`.
This model consists of:
- a pre-trained `BertModel` encoder,
- `BertPoolerC`, which mean-pools non-special tokens using the attention mask,
- dropout,
- a linear prediction head:
```python
self.classifier = nn.Linear(config.hidden_size, self.num_labels)
```
The output is a Hugging Face SequenceClassifierOutput containing loss and logits.
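For illustration, the mean-pooling step can be sketched as follows. This is a minimal re-implementation assuming standard BERT-style encoder outputs, not the repository's exact pooler (which may additionally exclude special tokens such as [CLS] and [SEP]):

```python
import torch

def masked_mean_pool(hidden_states: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
    """Mean-pool token embeddings over non-padding positions.

    hidden_states: (batch, seq_len, hidden_size) encoder outputs
    attention_mask: (batch, seq_len), 1 for real tokens and 0 for padding
    """
    mask = attention_mask.unsqueeze(-1).float()   # (batch, seq_len, 1)
    summed = (hidden_states * mask).sum(dim=1)    # sum over kept positions
    counts = mask.sum(dim=1).clamp(min=1e-9)      # avoid division by zero
    return summed / counts                        # (batch, hidden_size)
```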
The loss function depends on the task type:
| Task type | Number of labels | Loss |
|---|---|---|
| Regression | 1 | MSELoss |
| Binary classification | 2 | CrossEntropyLoss |
| Multitask classification | Number of target columns | Masked BCEWithLogitsLoss |
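The dispatch in the table above can be sketched as follows. This is an illustrative snippet, not the repository's code; in particular, marking missing multitask labels with NaN is an assumption of this sketch:

```python
import torch
import torch.nn as nn

def compute_loss(logits: torch.Tensor, labels: torch.Tensor, task_category: str) -> torch.Tensor:
    if task_category == "regression":
        # Single continuous target per molecule.
        return nn.MSELoss()(logits.view(-1), labels.view(-1).float())
    if task_category == "classification":
        # Binary classification with two logits per molecule.
        return nn.CrossEntropyLoss()(logits.view(-1, 2), labels.view(-1).long())
    # Multitask classification: one logit per target column; labels that are
    # missing (NaN here) are masked out of the BCE loss.
    mask = (~torch.isnan(labels)).float()
    per_element = nn.BCEWithLogitsLoss(reduction="none")(logits, torch.nan_to_num(labels).float())
    return (per_element * mask).sum() / mask.sum().clamp(min=1.0)
```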
## Usage
The intended usage is to load the pre-trained checkpoint inside the repository's downstream training script.
First, clone the repository and install the transfer-evaluation environment:
```bash
git clone https://github.com/sagawatatsuya/MolScaleTransfer.git
cd MolScaleTransfer
conda create -n molscaletransfer_transfer python=3.11
conda activate molscaletransfer_transfer
pip install -r requirements.txt
pip install torch transformers==4.57.3
pip install -U accelerate deepspeed
```
Then run fine-tuning, for example on BBBP:
```bash
python -m molscaletransfer.transfer.run_ft_molecule \
    --model_name_or_path "sagawatatsuya/molscaletransfer-chemlm-25.75m" \
    --tokenizer_name "ibm-research/MoLFormer-XL-both-10pct" \
    --train_file "./molscaletransfer/dataset/finetune_datasets/data/bbbp/train.csv" \
    --validation_file "./molscaletransfer/dataset/finetune_datasets/data/bbbp/valid.csv" \
    --test_file "./molscaletransfer/dataset/finetune_datasets/data/bbbp/test.csv" \
    --task_name "bbbp" \
    --task_config "molscaletransfer/task_config.json" \
    --do_train \
    --do_eval \
    --do_predict \
    --per_device_train_batch_size 256 \
    --per_device_eval_batch_size 512 \
    --learning_rate 3e-5 \
    --num_train_epochs 500 \
    --save_strategy epoch \
    --eval_strategy epoch
```
For linear probe evaluation, set:
--training_type "linear_probe"
In this mode, the script freezes the parameters under model.bert and trains only the task-specific head.
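A minimal sketch of what that freezing step amounts to, assuming the downstream model exposes its encoder as `model.bert` as described above:

```python
# Freeze the pre-trained encoder; only the prediction head remains trainable.
for param in model.bert.parameters():
    param.requires_grad = False

trainable = [name for name, p in model.named_parameters() if p.requires_grad]
print("trainable parameters:", trainable)
```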
## Loading the Model in Python
The downstream script loads local converted checkpoints with the custom config and model class:
```python
import json
from argparse import Namespace

from transformers import AutoTokenizer

from molscaletransfer.pretraining.configs import PretrainedBertConfig
from molscaletransfer.pretraining.modeling import BertForSequenceClassificationMolecule

model_name_or_path = "sagawatatsuya/molscaletransfer-chemlm-25.75m"
task_name = "bbbp"
task_config_path = "molscaletransfer/task_config.json"

# Look up the task metadata to determine the number of output labels.
task_to_keys = json.load(open(task_config_path))
task_info = task_to_keys[task_name]
if task_info["task_category"] == "regression":
    num_labels = 1
elif task_info["task_category"] == "classification":
    num_labels = 2
else:  # multitask classification
    num_labels = len(task_info["target_columns"])

# Pre-training run arguments stored alongside the converted checkpoint.
pretrain_run_args = json.load(open(f"{model_name_or_path}/args.json"))
ds_args = Namespace(**pretrain_run_args)

config = PretrainedBertConfig.from_pretrained(
    model_name_or_path,
    num_labels=num_labels,
    finetuning_task=task_name,
    layer_norm_type="pytorch",
    task_category=task_info["task_category"],
    fused_linear_layer=True,
    max_seq_length=task_info["max_seq_length"],
)

tokenizer = AutoTokenizer.from_pretrained(
    "ibm-research/MoLFormer-XL-both-10pct",
    trust_remote_code=True,
)

model = BertForSequenceClassificationMolecule.from_pretrained(
    model_name_or_path,
    config=config,
    args=ds_args,
)
```
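If you do load the model manually, scoring a single SMILES string could look like the sketch below; the forward signature (`input_ids`, `attention_mask`) is assumed to follow the usual Hugging Face convention:

```python
import torch

smiles = "CC(=O)Oc1ccccc1C(=O)O"  # aspirin, used here only as an example input
inputs = tokenizer(
    smiles,
    padding="max_length",
    truncation=True,
    max_length=task_info["max_seq_length"],
    return_tensors="pt",
)

model.eval()
with torch.no_grad():
    outputs = model(input_ids=inputs["input_ids"], attention_mask=inputs["attention_mask"])

# For a binary classification task such as BBBP, logits has shape (1, 2).
probs = torch.softmax(outputs.logits, dim=-1)
print(probs)
```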
For normal use, run_ft_molecule.py is recommended instead of manually writing this loading code.
## Data
The pre-training data preparation follows the repository pipeline:
- download ZINC-15 and PubChem SMILES used in MoLFormer-style pre-training,
- preprocess, shard, and split the dataset,
- create masked language modeling samples using the ibm-research/MoLFormer-XL-both-10pct tokenizer.
## Pre-training Objective
The model was pre-trained with masked language modeling.
The sample generation configuration uses:
| Setting | Value |
|---|---|
| Masked LM probability | 0.15 |
| Maximum sequence length | 512 |
| Maximum predictions per sequence | 77 |
| Tokenizer | ibm-research/MoLFormer-XL-both-10pct |
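For illustration, masking with these settings can be reproduced with the standard Hugging Face data collator; this is a sketch of the objective rather than the repository's sample-generation code:

```python
from transformers import AutoTokenizer, DataCollatorForLanguageModeling

tokenizer = AutoTokenizer.from_pretrained(
    "ibm-research/MoLFormer-XL-both-10pct",
    trust_remote_code=True,
)
collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer,
    mlm=True,
    mlm_probability=0.15,  # matches the masked LM probability above
)

encoded = tokenizer(["CCO", "c1ccccc1"], truncation=True, max_length=512)
batch = collator([{"input_ids": ids} for ids in encoded["input_ids"]])
# batch["labels"] is -100 everywhere except at the masked positions.
print(batch["input_ids"])
print(batch["labels"])
```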
## Downstream Tasks
The repository defines task metadata in molscaletransfer/task_config.json.
Supported task categories include:
- binary classification, e.g. BBBP, BACE, HIV
- multitask classification, e.g. Tox21, ClinTox, SIDER
- regression, e.g. QM9 targets, ESOL, FreeSolv, Lipophilicity
Metrics are selected from the task config:
| Task category | Metric |
|---|---|
| Classification | ROC-AUC |
| Multitask classification | Mean ROC-AUC over tasks |
| Regression | MAE or RMSE, depending on the task config |
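As a concrete example, the mean ROC-AUC used for the multitask datasets can be computed as below; skipping columns whose labels are missing (NaN) or single-class is an assumption of this sketch:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def mean_roc_auc(y_true: np.ndarray, y_score: np.ndarray) -> float:
    """Average ROC-AUC over task columns, ignoring missing (NaN) labels."""
    aucs = []
    for task in range(y_true.shape[1]):
        valid = ~np.isnan(y_true[:, task])
        # ROC-AUC is only defined when both classes are present.
        if valid.any() and np.unique(y_true[valid, task]).size == 2:
            aucs.append(roc_auc_score(y_true[valid, task], y_score[valid, task]))
    return float(np.mean(aucs))
```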
## Relation to MolScaleTransfer
This checkpoint is one example of the model family used in MolScaleTransfer, a toolkit for evaluating the scaling behavior and transfer performance of chemical language models.
Within that workflow, checkpoints like this can be used for:
- pre-training loss evaluation,
- Hessian trace or PGM analysis after conversion to Hugging Face format,
- downstream fine-tuning and linear probe experiments,
- comparison against larger or smaller checkpoints in the same scaling series.
## Citation
If you use this model, please cite:
```bibtex
@misc{sagawa2026largescalechemicallanguagemodels,
  title={How Well Do Large-Scale Chemical Language Models Transfer to Downstream Tasks?},
  author={Tatsuya Sagawa and Ryosuke Kojima},
  year={2026},
  eprint={2602.11618},
  archivePrefix={arXiv},
  primaryClass={cs.LG},
  url={https://arxiv.org/abs/2602.11618},
}
```