# MolScaleTransfer ChemLM 25.75M
MolScaleTransfer ChemLM 25.75M is a small BERT-style chemical language model pre-trained with masked language modeling on molecular SMILES strings.
This checkpoint is intended to be used as an encoder initialization for downstream molecular property prediction tasks, and as one point in a model-scaling study of chemical language model transfer.
- Repository: https://github.com/sagawatatsuya/MolScaleTransfer
- Paper: How Well Do Large-Scale Chemical Language Models Transfer to Downstream Tasks?
- Tokenizer: ibm-research/MoLFormer-XL-both-10pct
- Model size: 25.75M parameters
## Model Details
| Hyperparameter | Value |
|---|---|
| Hidden size | 512 |
| Number of hidden layers | 8 |
| Number of attention heads | 8 |
| Intermediate size | 2048 |
| Vocabulary size | 2362 |
| Maximum sequence length during pre-training | 512 |
The model was pre-trained using the Academic Budget BERT-based implementation in the repository and converted from a DeepSpeed checkpoint to Hugging Face format.
## Intended Use
This checkpoint is intended for:
- downstream molecular property prediction by fine-tuning or linear probing,
- checkpoint-to-checkpoint comparison in scaling experiments,
- relating pre-training loss to downstream transfer performance.
For downstream molecular property prediction, load it with the custom model class defined in the repository:
```python
from molscaletransfer.pretraining.configs import PretrainedBertConfig
from molscaletransfer.pretraining.modeling import BertForSequenceClassificationMolecule
```
The downstream model adds a task-specific classification or regression head on top of the pre-trained BERT encoder.
The supported training modes are:
- full fine-tuning
- linear probe, where the BERT encoder is frozen and only the prediction head is trained
## Model Architecture for Downstream Tasks
For downstream molecular property prediction, the repository uses `BertForSequenceClassificationMolecule`.
This model consists of:
- a pre-trained `BertModel` encoder,
- `BertPoolerC`, which mean-pools non-special tokens using the attention mask,
- dropout,
- a linear prediction head:
```python
self.classifier = nn.Linear(config.hidden_size, self.num_labels)
```
The output is a Hugging Face SequenceClassifierOutput containing loss and logits.
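For illustration, the mean-pooling step can be sketched as follows. This is a minimal re-implementation assuming standard BERT-style encoder outputs, not the repository's exact pooler (which may additionally exclude special tokens such as [CLS] and [SEP]):

```python
import torch

def masked_mean_pool(hidden_states: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
    """Mean-pool token embeddings over non-padding positions.

    hidden_states: (batch, seq_len, hidden_size) encoder outputs
    attention_mask: (batch, seq_len), 1 for real tokens and 0 for padding
    """
    mask = attention_mask.unsqueeze(-1).float()   # (batch, seq_len, 1)
    summed = (hidden_states * mask).sum(dim=1)    # sum over kept positions
    counts = mask.sum(dim=1).clamp(min=1e-9)      # avoid division by zero
    return summed / counts                        # (batch, hidden_size)
```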
The loss function depends on the task type:
| Task type | Number of labels | Loss |
|---|---|---|
| Regression | 1 | MSELoss |
| Binary classification | 2 | CrossEntropyLoss |
| Multitask classification | Number of target columns | Masked BCEWithLogitsLoss |
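The dispatch in the table above can be sketched as follows. This is an illustrative snippet, not the repository's code; in particular, marking missing multitask labels with NaN is an assumption of this sketch:

```python
import torch
import torch.nn as nn

def compute_loss(logits: torch.Tensor, labels: torch.Tensor, task_category: str) -> torch.Tensor:
    if task_category == "regression":
        # Single continuous target per molecule.
        return nn.MSELoss()(logits.view(-1), labels.view(-1).float())
    if task_category == "classification":
        # Binary classification with two logits per molecule.
        return nn.CrossEntropyLoss()(logits.view(-1, 2), labels.view(-1).long())
    # Multitask classification: one logit per target column; labels that are
    # missing (NaN here) are masked out of the BCE loss.
    mask = (~torch.isnan(labels)).float()
    per_element = nn.BCEWithLogitsLoss(reduction="none")(logits, torch.nan_to_num(labels).float())
    return (per_element * mask).sum() / mask.sum().clamp(min=1.0)
```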
## Usage
The intended usage is to load the pre-trained checkpoint inside the repository's downstream training script.
First, clone the repository and install the transfer-evaluation environment:
```bash
git clone https://github.com/sagawatatsuya/MolScaleTransfer.git
cd MolScaleTransfer
conda create -n molscaletransfer_transfer python=3.11
conda activate molscaletransfer_transfer
pip install -r requirements.txt
pip install torch transformers==4.57.3
pip install -U accelerate deepspeed
```
Then run fine-tuning, for example on BBBP:
```bash
python -m molscaletransfer.transfer.run_ft_molecule \
    --model_name_or_path "sagawatatsuya/molscaletransfer-chemlm-25.75m" \
    --tokenizer_name "ibm-research/MoLFormer-XL-both-10pct" \
    --train_file "./molscaletransfer/dataset/finetune_datasets/data/bbbp/train.csv" \
    --validation_file "./molscaletransfer/dataset/finetune_datasets/data/bbbp/valid.csv" \
    --test_file "./molscaletransfer/dataset/finetune_datasets/data/bbbp/test.csv" \
    --task_name "bbbp" \
    --task_config "molscaletransfer/task_config.json" \
    --do_train \
    --do_eval \
    --do_predict \
    --per_device_train_batch_size 256 \
    --per_device_eval_batch_size 512 \
    --learning_rate 3e-5 \
    --num_train_epochs 500 \
    --save_strategy epoch \
    --eval_strategy epoch
```
For linear probe evaluation, set:
--training_type "linear_probe"
In this mode, the script freezes the parameters under model.bert and trains only the task-specific head.
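A minimal sketch of what that freezing step amounts to, assuming the downstream model exposes its encoder as `model.bert` as described above:

```python
# Freeze the pre-trained encoder; only the prediction head remains trainable.
for param in model.bert.parameters():
    param.requires_grad = False

trainable = [name for name, p in model.named_parameters() if p.requires_grad]
print("trainable parameters:", trainable)
```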
## Loading the Model in Python
The downstream script loads local converted checkpoints with the custom config and model class:
```python
import json
from argparse import Namespace

from transformers import AutoTokenizer

from molscaletransfer.pretraining.configs import PretrainedBertConfig
from molscaletransfer.pretraining.modeling import BertForSequenceClassificationMolecule

model_name_or_path = "sagawatatsuya/molscaletransfer-chemlm-25.75m"
task_name = "bbbp"
task_config_path = "molscaletransfer/task_config.json"

# Look up the task metadata to determine the number of output labels.
task_to_keys = json.load(open(task_config_path))
task_info = task_to_keys[task_name]
if task_info["task_category"] == "regression":
    num_labels = 1
elif task_info["task_category"] == "classification":
    num_labels = 2
else:  # multitask classification
    num_labels = len(task_info["target_columns"])

# Pre-training run arguments stored alongside the converted checkpoint.
pretrain_run_args = json.load(open(f"{model_name_or_path}/args.json"))
ds_args = Namespace(**pretrain_run_args)

config = PretrainedBertConfig.from_pretrained(
    model_name_or_path,
    num_labels=num_labels,
    finetuning_task=task_name,
    layer_norm_type="pytorch",
    task_category=task_info["task_category"],
    fused_linear_layer=True,
    max_seq_length=task_info["max_seq_length"],
)

tokenizer = AutoTokenizer.from_pretrained(
    "ibm-research/MoLFormer-XL-both-10pct",
    trust_remote_code=True,
)

model = BertForSequenceClassificationMolecule.from_pretrained(
    model_name_or_path,
    config=config,
    args=ds_args,
)
```
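If you do load the model manually, scoring a single SMILES string could look like the sketch below; the forward signature (`input_ids`, `attention_mask`) is assumed to follow the usual Hugging Face convention:

```python
import torch

smiles = "CC(=O)Oc1ccccc1C(=O)O"  # aspirin, used here only as an example input
inputs = tokenizer(
    smiles,
    padding="max_length",
    truncation=True,
    max_length=task_info["max_seq_length"],
    return_tensors="pt",
)

model.eval()
with torch.no_grad():
    outputs = model(input_ids=inputs["input_ids"], attention_mask=inputs["attention_mask"])

# For a binary classification task such as BBBP, logits has shape (1, 2).
probs = torch.softmax(outputs.logits, dim=-1)
print(probs)
```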
For normal use, run_ft_molecule.py is recommended instead of manually writing this loading code.
## Data
The pre-training data preparation follows the repository pipeline:
- download ZINC-15 and PubChem SMILES used in MoLFormer-style pre-training,
- preprocess, shard, and split the dataset,
- create masked language modeling samples using the ibm-research/MoLFormer-XL-both-10pct tokenizer.
## Pre-training Objective
The model was pre-trained with masked language modeling.
The sample generation configuration uses:
| Setting | Value |
|---|---|
| Masked LM probability | 0.15 |
| Maximum sequence length | 512 |
| Maximum predictions per sequence | 77 |
| Tokenizer | ibm-research/MoLFormer-XL-both-10pct |
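For illustration, masking with these settings can be reproduced with the standard Hugging Face data collator; this is a sketch of the objective rather than the repository's sample-generation code:

```python
from transformers import AutoTokenizer, DataCollatorForLanguageModeling

tokenizer = AutoTokenizer.from_pretrained(
    "ibm-research/MoLFormer-XL-both-10pct",
    trust_remote_code=True,
)
collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer,
    mlm=True,
    mlm_probability=0.15,  # matches the masked LM probability above
)

encoded = tokenizer(["CCO", "c1ccccc1"], truncation=True, max_length=512)
batch = collator([{"input_ids": ids} for ids in encoded["input_ids"]])
# batch["labels"] is -100 everywhere except at the masked positions.
print(batch["input_ids"])
print(batch["labels"])
```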
## Downstream Tasks
The repository defines task metadata in molscaletransfer/task_config.json.
Supported task categories include:
- binary classification, e.g. BBBP, BACE, HIV
- multitask classification, e.g. Tox21, ClinTox, SIDER
- regression, e.g. QM9 targets, ESOL, FreeSolv, Lipophilicity
Metrics are selected from the task config:
| Task category | Metric |
|---|---|
| Classification | ROC-AUC |
| Multitask classification | Mean ROC-AUC over tasks |
| Regression | MAE or RMSE, depending on the task config |
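As a concrete example, the mean ROC-AUC used for the multitask datasets can be computed as below; skipping columns whose labels are missing (NaN) or single-class is an assumption of this sketch:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def mean_roc_auc(y_true: np.ndarray, y_score: np.ndarray) -> float:
    """Average ROC-AUC over task columns, ignoring missing (NaN) labels."""
    aucs = []
    for task in range(y_true.shape[1]):
        valid = ~np.isnan(y_true[:, task])
        # ROC-AUC is only defined when both classes are present.
        if valid.any() and np.unique(y_true[valid, task]).size == 2:
            aucs.append(roc_auc_score(y_true[valid, task], y_score[valid, task]))
    return float(np.mean(aucs))
```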
## Relation to MolScaleTransfer
This checkpoint is one example of the model family used in MolScaleTransfer, a toolkit for evaluating the scaling behavior and transfer performance of chemical language models.
Within that workflow, checkpoints like this can be used for:
- pre-training loss evaluation,
- Hessian trace or PGM analysis after conversion to Hugging Face format,
- downstream fine-tuning and linear probe experiments,
- comparison against larger or smaller checkpoints in the same scaling series.
## Citation
If you use this model, please cite:
```bibtex
@misc{sagawa2026largescalechemicallanguagemodels,
  title={How Well Do Large-Scale Chemical Language Models Transfer to Downstream Tasks?},
  author={Tatsuya Sagawa and Ryosuke Kojima},
  year={2026},
  eprint={2602.11618},
  archivePrefix={arXiv},
  primaryClass={cs.LG},
  url={https://arxiv.org/abs/2602.11618},
}
```