---
library_name: transformers
tags:
- chemistry
- smiles
- molecular-property-prediction
- masked-language-modeling
- bert
license: apache-2.0
---

# ChemLM 25.75M

ChemLM 25.75M is a small BERT-style chemical language model pre-trained with masked language modeling on molecular SMILES strings.

This checkpoint is intended as an encoder initialization for downstream molecular property prediction tasks, using the fine-tuning and linear-probe code in the original repository.

- Repository: https://github.com/sagawatatsuya/chemlm_pretraining
- Paper: *How Well Do Large-Scale Chemical Language Models Transfer to Downstream Tasks?*
- Tokenizer: `ibm-research/MoLFormer-XL-both-10pct`
- Model size: 25.75M parameters

## Model Details

| Hyperparameter | Value |
|---|---:|
| Hidden size | 512 |
| Number of hidden layers | 8 |
| Number of attention heads | 8 |
| Intermediate size | 2048 |
| Vocabulary size | 2362 |
| Maximum sequence length during pre-training | 512 |

The model was pre-trained using the Academic Budget BERT-based implementation in the repository and converted from a DeepSpeed checkpoint to Hugging Face format.
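
For orientation, the hyperparameters above correspond to a standard `transformers` `BertConfig` along the following lines. This is an illustrative sketch only; the repository actually uses its own `PretrainedBertConfig` class, which carries additional fields.

```python
from transformers import BertConfig

# Illustrative only: the repository's PretrainedBertConfig adds extra
# fields, but the core architecture matches these standard values.
config = BertConfig(
    vocab_size=2362,
    hidden_size=512,
    num_hidden_layers=8,
    num_attention_heads=8,
    intermediate_size=2048,
    max_position_embeddings=512,
)
```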

## Intended Use

This checkpoint is intended for downstream molecular property prediction by loading it with the custom model class defined in the repository:

```python
from pretraining.configs import PretrainedBertConfig
from pretraining.modeling import BertForSequenceClassificationMolecule
```

The downstream model adds a task-specific classification or regression head on top of the pre-trained BERT encoder.

The supported training modes are:

* full fine-tuning
* linear probe, where the BERT encoder is frozen and only the prediction head is trained

## Model Architecture for Downstream Tasks

For downstream molecular property prediction, the repository uses:

```python
BertForSequenceClassificationMolecule
```

This model consists of:

1. a pre-trained `BertModel` encoder,
2. `BertPoolerC`, which mean-pools non-special tokens using the attention mask (see the sketch below),
3. dropout,
4. a linear prediction head:

```python
self.classifier = nn.Linear(config.hidden_size, self.num_labels)
```

The output is a Hugging Face `SequenceClassifierOutput` containing `loss` and `logits`.
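
A minimal sketch of the pooling step, assuming the mask marks the tokens to average over with 1. This illustrates the idea and is not the repository's exact `BertPoolerC` implementation:

```python
import torch

def masked_mean_pool(hidden_states: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
    """Mean-pool token embeddings over positions where mask == 1.

    hidden_states: (batch, seq_len, hidden_size) encoder outputs
    mask: (batch, seq_len), 1 for the non-special tokens to pool over
    """
    mask = mask.unsqueeze(-1).to(hidden_states.dtype)  # (batch, seq_len, 1)
    summed = (hidden_states * mask).sum(dim=1)         # sum of kept token vectors
    count = mask.sum(dim=1).clamp(min=1.0)             # number of kept tokens
    return summed / count                              # (batch, hidden_size)
```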

The loss function depends on the task type:

| Task type | Number of labels | Loss |
| --- | ---: | --- |
| Regression | 1 | `MSELoss` |
| Binary classification | 2 | `CrossEntropyLoss` |
| Multitask classification | Number of target columns | Masked `BCEWithLogitsLoss` |
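
A masked `BCEWithLogitsLoss` is typically computed by excluding missing labels from the average. A minimal sketch, assuming missing labels are encoded as -1 (the repository's exact convention may differ):

```python
import torch
import torch.nn as nn

def masked_bce_with_logits(logits: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    """BCEWithLogitsLoss over (batch, num_tasks) targets, skipping missing labels.

    Assumes missing labels are encoded as -1, so only observed entries
    contribute to the loss.
    """
    valid = (labels >= 0).float()          # 1 where a label is observed
    targets = labels.clamp(min=0).float()  # dummy 0 target at missing entries
    per_entry = nn.functional.binary_cross_entropy_with_logits(
        logits, targets, reduction="none"
    )
    return (per_entry * valid).sum() / valid.sum().clamp(min=1.0)
```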

## Usage

The intended usage is to load the pre-trained checkpoint inside the repository's downstream training script.

First, clone the original repository and install the fine-tuning environment:

```bash
git clone https://github.com/sagawatatsuya/chemlm_pretraining.git
cd chemlm_pretraining

conda create -n chemlm_finetuning python=3.11
conda activate chemlm_finetuning

pip install -r requirements.txt
pip install torch transformers==4.57.3
pip install -U accelerate deepspeed
```

Then run fine-tuning, for example on BBBP:

```bash
python chemlm_pretraining/run_ft_molecule.py \
    --model_name_or_path "sagawatatsuya/chemlm-25.75m" \
    --tokenizer_name "ibm-research/MoLFormer-XL-both-10pct" \
    --train_file "./chemlm_pretraining/dataset/finetune_datasets/data/bbbp/train.csv" \
    --validation_file "./chemlm_pretraining/dataset/finetune_datasets/data/bbbp/valid.csv" \
    --test_file "./chemlm_pretraining/dataset/finetune_datasets/data/bbbp/test.csv" \
    --task_name "bbbp" \
    --task_config "chemlm_pretraining/task_config.json" \
    --do_train \
    --do_eval \
    --do_predict \
    --per_device_train_batch_size 256 \
    --per_device_eval_batch_size 512 \
    --learning_rate 3e-5 \
    --num_train_epochs 500 \
    --save_strategy epoch \
    --eval_strategy epoch
```

For linear probe evaluation, set:

```bash
--training_type "linear_probe"
```

In this mode, the script freezes the parameters under `model.bert` and trains only the task-specific head.
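
A minimal sketch of that freezing logic, assuming the encoder's parameters live under the `bert.` prefix as in standard BERT models (illustrative, not the script's exact code):

```python
# Freeze the encoder; only the pooler and classification head stay trainable.
for name, param in model.named_parameters():
    if name.startswith("bert."):
        param.requires_grad = False

# Sanity check: count trainable parameters after freezing.
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"trainable parameters: {trainable}")
```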

## Loading the Model in Python

The downstream script loads local converted checkpoints with the custom config and model class:

```python
import json
from argparse import Namespace

from transformers import AutoTokenizer
from pretraining.configs import PretrainedBertConfig
from pretraining.modeling import BertForSequenceClassificationMolecule

model_name_or_path = "sagawatatsuya/chemlm-25.75m"
task_name = "bbbp"
task_config_path = "chemlm_pretraining/task_config.json"

# Resolve the number of labels from the task metadata.
with open(task_config_path) as f:
    task_to_keys = json.load(f)
task_info = task_to_keys[task_name]

if task_info["task_category"] == "regression":
    num_labels = 1
elif task_info["task_category"] == "classification":
    num_labels = 2
else:  # multitask classification
    num_labels = len(task_info["target_columns"])

# args.json is written during checkpoint conversion; this step expects
# model_name_or_path to resolve to a local checkpoint directory.
with open(f"{model_name_or_path}/args.json") as f:
    pretrain_run_args = json.load(f)
ds_args = Namespace(**pretrain_run_args)

config = PretrainedBertConfig.from_pretrained(
    model_name_or_path,
    num_labels=num_labels,
    finetuning_task=task_name,
    layer_norm_type="pytorch",
    task_category=task_info["task_category"],
    fused_linear_layer=True,
    max_seq_length=task_info["max_seq_length"],
)

tokenizer = AutoTokenizer.from_pretrained(
    "ibm-research/MoLFormer-XL-both-10pct",
    trust_remote_code=True,
)

model = BertForSequenceClassificationMolecule.from_pretrained(
    model_name_or_path,
    config=config,
    args=ds_args,
)
```
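
Continuing from the snippet above, the loaded model can be run on a tokenized SMILES string. A minimal sketch, assuming the custom class follows the standard `input_ids`/`attention_mask` forward signature:

```python
import torch

# Tokenize one SMILES string and run a forward pass without gradients.
inputs = tokenizer("CCO", return_tensors="pt")
model.eval()
with torch.no_grad():
    outputs = model(**inputs)
print(outputs.logits.shape)  # (1, num_labels)
```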

For normal use, `run_ft_molecule.py` is recommended over writing this loading code by hand.

## Data

The pre-training data preparation follows the repository pipeline:

1. download the ZINC-15 and PubChem SMILES used in MoLFormer-style pre-training,
2. preprocess, shard, and split the dataset,
3. create masked language modeling samples with the `ibm-research/MoLFormer-XL-both-10pct` tokenizer.

## Pre-training Objective

The model was pre-trained with masked language modeling.

The sample generation configuration uses:

| Setting | Value |
| --- | ---: |
| Masked LM probability | 0.15 |
| Maximum sequence length | 512 |
| Maximum predictions per sequence | 77 |
| Tokenizer | `ibm-research/MoLFormer-XL-both-10pct` |
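
Note that 77 is 0.15 × 512 rounded up. The repository generates masked samples with its own Academic Budget BERT pipeline; for illustration only, an equivalent masking setup with the Hugging Face data collator would look like this, assuming the tokenizer provides a pad token:

```python
from transformers import AutoTokenizer, DataCollatorForLanguageModeling

tokenizer = AutoTokenizer.from_pretrained(
    "ibm-research/MoLFormer-XL-both-10pct", trust_remote_code=True
)

# Mask 15% of tokens, matching the pre-training configuration above.
collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15
)

encoded = tokenizer(["CCO", "c1ccccc1"], truncation=True, max_length=512)
batch = collator([{"input_ids": ids} for ids in encoded["input_ids"]])
print(batch["input_ids"].shape, batch["labels"].shape)
```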

## Downstream Tasks

The repository defines task metadata in `chemlm_pretraining/task_config.json`.

Supported task categories include:

* binary classification, e.g. BBBP, BACE, HIV
* multitask classification, e.g. Tox21, ClinTox, SIDER
* regression, e.g. QM9 targets, ESOL, FreeSolv, Lipophilicity

Metrics are selected from the task config:

| Task category | Metric |
| --- | --- |
| Classification | ROC-AUC |
| Multitask classification | Mean ROC-AUC over tasks |
| Regression | MAE or RMSE, depending on the task config |
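
For multitask datasets, the mean ROC-AUC is commonly computed per task and then averaged, skipping missing labels. A minimal sketch, assuming missing labels are encoded as -1 (the repository's convention may differ):

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def mean_roc_auc(labels: np.ndarray, scores: np.ndarray) -> float:
    """Mean ROC-AUC over tasks for (num_samples, num_tasks) arrays.

    Missing labels are assumed to be encoded as -1 and are skipped;
    tasks with only one observed class are excluded, since ROC-AUC is
    undefined for them.
    """
    aucs = []
    for t in range(labels.shape[1]):
        observed = labels[:, t] >= 0
        if observed.any() and len(np.unique(labels[observed, t])) == 2:
            aucs.append(roc_auc_score(labels[observed, t], scores[observed, t]))
    return float(np.mean(aucs))
```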

## Citation

If you use this model, please cite:

```bibtex
@misc{sagawa2026largescalechemicallanguagemodels,
  title={How Well Do Large-Scale Chemical Language Models Transfer to Downstream Tasks?},
  author={Tatsuya Sagawa and Ryosuke Kojima},
  year={2026},
  eprint={2602.11618},
  archivePrefix={arXiv},
  primaryClass={cs.LG},
  url={https://arxiv.org/abs/2602.11618},
}
```