---
library_name: transformers
tags:
- chemistry
- smiles
- molecular-property-prediction
- masked-language-modeling
- bert
license: apache-2.0
---

# ChemLM 0.83M

ChemLM 0.83M is a small BERT-style chemical language model pre-trained with masked language modeling on molecular SMILES strings. This checkpoint is intended as an encoder initialization for downstream molecular property prediction, via the fine-tuning and linear-probe code in the original repository.

- Repository: https://github.com/sagawatatsuya/chemlm_pretraining
- Paper: *How Well Do Large-Scale Chemical Language Models Transfer to Downstream Tasks?*
- Tokenizer: `ibm-research/MoLFormer-XL-both-10pct`
- Model size: 0.83M parameters

## Model Details

| Hyperparameter | Value |
|---|---:|
| Hidden size | 128 |
| Number of hidden layers | 4 |
| Number of attention heads | 2 |
| Intermediate size | 512 |
| Vocabulary size | 2362 |
| Maximum sequence length during pre-training | 512 |

The model was pre-trained with the Academic Budget BERT-based implementation in the repository and converted from a DeepSpeed checkpoint to Hugging Face format.

## Intended Use

This checkpoint is intended for downstream molecular property prediction by loading it with the custom model class defined in the repository:

```python
from pretraining.configs import PretrainedBertConfig
from pretraining.modeling import BertForSequenceClassificationMolecule
```

The downstream model adds a task-specific classification or regression head on top of the pre-trained BERT encoder. The supported training modes are:

* full fine-tuning
* linear probe, where the BERT encoder is frozen and only the prediction head is trained

## Model Architecture for Downstream Tasks

For downstream molecular property prediction, the repository uses:

```python
BertForSequenceClassificationMolecule
```

This model consists of:

1. a pre-trained `BertModel` encoder,
2. `BertPoolerC`, which mean-pools non-special tokens using the attention mask,
3. dropout,
4. a linear prediction head:

```python
self.classifier = nn.Linear(config.hidden_size, self.num_labels)
```

The output is a Hugging Face `SequenceClassifierOutput` containing `loss` and `logits`. The loss function depends on the task type:

| Task type | Number of labels | Loss |
| ------------------------ | -----------------------: | ------------------------ |
| Regression | 1 | MSELoss |
| Binary classification | 2 | CrossEntropyLoss |
| Multitask classification | Number of target columns | Masked BCEWithLogitsLoss |
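To make the pooling and head logic concrete, below is a minimal sketch in plain PyTorch. The class name `MeanPoolHead` and all internals are illustrative assumptions, not the repository's `BertPoolerC` implementation; in particular, special-token handling is simplified to masking by the attention mask.

```python
import torch
import torch.nn as nn

class MeanPoolHead(nn.Module):
    """Illustrative sketch (hypothetical, not the repository's code):
    mean-pool encoder outputs over attended tokens, then apply
    dropout and a linear prediction head."""

    def __init__(self, hidden_size: int, num_labels: int, dropout: float = 0.1):
        super().__init__()
        self.dropout = nn.Dropout(dropout)
        self.classifier = nn.Linear(hidden_size, num_labels)

    def forward(self, hidden_states: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
        # hidden_states: (batch, seq, hidden); attention_mask: (batch, seq)
        mask = attention_mask.unsqueeze(-1).to(hidden_states.dtype)
        # Sum over attended positions and divide by their count (avoiding div-by-zero).
        pooled = (hidden_states * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1e-9)
        return self.classifier(self.dropout(pooled))  # (batch, num_labels)
```

The loss applied to these logits then follows the table above.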
## Usage

The intended usage is to load the pre-trained checkpoint inside the repository's downstream training script.

First, clone the original repository and install the fine-tuning environment:

```bash
git clone https://github.com/sagawatatsuya/chemlm_pretraining.git
cd chemlm_pretraining
conda create -n chemlm_finetuning python=3.11
conda activate chemlm_finetuning
pip install -r requirements.txt
pip install torch transformers==4.57.3
pip install -U accelerate deepspeed
```

Then run fine-tuning, for example on BBBP:

```bash
python chemlm_pretraining/run_ft_molecule.py \
    --model_name_or_path "sagawatatsuya/chemlm-0.83m" \
    --tokenizer_name "ibm-research/MoLFormer-XL-both-10pct" \
    --train_file "./chemlm_pretraining/dataset/finetune_datasets/data/bbbp/train.csv" \
    --validation_file "./chemlm_pretraining/dataset/finetune_datasets/data/bbbp/valid.csv" \
    --test_file "./chemlm_pretraining/dataset/finetune_datasets/data/bbbp/test.csv" \
    --task_name "bbbp" \
    --task_config "chemlm_pretraining/task_config.json" \
    --do_train \
    --do_eval \
    --do_predict \
    --per_device_train_batch_size 256 \
    --per_device_eval_batch_size 512 \
    --learning_rate 3e-5 \
    --num_train_epochs 500 \
    --save_strategy epoch \
    --eval_strategy epoch
```

For linear probe evaluation, set:

```bash
--training_type "linear_probe"
```

In this mode, the script freezes the parameters under `model.bert` and trains only the task-specific head.

## Loading the Model in Python

The downstream script loads local converted checkpoints with the custom config and model class. Note that `args.json` is read directly from `model_name_or_path`, so the path should resolve to a local copy of the checkpoint:

```python
import json
from argparse import Namespace

from transformers import AutoTokenizer

from pretraining.configs import PretrainedBertConfig
from pretraining.modeling import BertForSequenceClassificationMolecule

model_name_or_path = "sagawatatsuya/chemlm-0.83m"
task_name = "bbbp"
task_config_path = "chemlm_pretraining/task_config.json"

# Determine the number of labels from the task metadata.
task_to_keys = json.load(open(task_config_path))
task_info = task_to_keys[task_name]
if task_info["task_category"] == "regression":
    num_labels = 1
elif task_info["task_category"] == "classification":
    num_labels = 2
else:
    num_labels = len(task_info["target_columns"])

# Re-create the pre-training arguments saved alongside the checkpoint.
pretrain_run_args = json.load(open(f"{model_name_or_path}/args.json"))
ds_args = Namespace(**pretrain_run_args)

config = PretrainedBertConfig.from_pretrained(
    model_name_or_path,
    num_labels=num_labels,
    finetuning_task=task_name,
    layer_norm_type="pytorch",
    task_category=task_info["task_category"],
    fused_linear_layer=True,
    max_seq_length=task_info["max_seq_length"],
)

tokenizer = AutoTokenizer.from_pretrained(
    "ibm-research/MoLFormer-XL-both-10pct",
    trust_remote_code=True,
)

model = BertForSequenceClassificationMolecule.from_pretrained(
    model_name_or_path,
    config=config,
    args=ds_args,
)
```

For typical use, `run_ft_molecule.py` is recommended over writing this loading code by hand.

## Data

The pre-training data preparation follows the repository pipeline:

1. download the ZINC-15 and PubChem SMILES used in MoLFormer-style pre-training,
2. preprocess, shard, and split the dataset,
3. create masked language modeling samples with the `ibm-research/MoLFormer-XL-both-10pct` tokenizer.

## Pre-training Objective

The model was pre-trained with masked language modeling. The sample generation configuration uses:

| Setting | Value |
| -------------------------------- | -------------------------------------: |
| Masked LM probability | 0.15 |
| Maximum sequence length | 512 |
| Maximum predictions per sequence | 77 |
| Tokenizer | `ibm-research/MoLFormer-XL-both-10pct` |
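For orientation, this objective corresponds to standard BERT-style masking. Below is a minimal sketch using Hugging Face's `DataCollatorForLanguageModeling`; this is an illustration only, since the repository pre-generates masked samples with its own Academic Budget BERT data pipeline rather than masking dynamically at training time.

```python
from transformers import AutoTokenizer, DataCollatorForLanguageModeling

# Tokenizer used for pre-training (from this model card).
tokenizer = AutoTokenizer.from_pretrained(
    "ibm-research/MoLFormer-XL-both-10pct", trust_remote_code=True
)

# Sketch: mask 15% of tokens, matching the masked LM probability above.
collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15
)

# Tokenize one SMILES string and build a masked batch from it.
batch = collator([tokenizer("CCO", truncation=True, max_length=512)])
print(batch["input_ids"].shape, batch["labels"].shape)
```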
## Downstream Tasks

The repository defines task metadata in `chemlm_pretraining/task_config.json`. Supported task categories include:

* binary classification, e.g. BBBP, BACE, HIV
* multitask classification, e.g. Tox21, ClinTox, SIDER
* regression, e.g. QM9 targets, ESOL, FreeSolv, Lipophilicity

Metrics are selected from the task config (a sketch of the multitask metric is given at the end of this card):

| Task category | Metric |
| ------------------------ | ----------------------------------------- |
| Classification | ROC-AUC |
| Multitask classification | Mean ROC-AUC over tasks |
| Regression | MAE or RMSE, depending on the task config |

## Citation

If you use this model, please cite:

```bibtex
@misc{sagawa2026largescalechemicallanguagemodels,
      title={How Well Do Large-Scale Chemical Language Models Transfer to Downstream Tasks?},
      author={Tatsuya Sagawa and Ryosuke Kojima},
      year={2026},
      eprint={2602.11618},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2602.11618},
}
```
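As referenced in the metrics table above, here is a minimal sketch of mean ROC-AUC over tasks with missing labels. The NaN-for-missing convention and the function name `mean_roc_auc` are illustrative assumptions, not the repository's implementation.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def mean_roc_auc(logits: np.ndarray, targets: np.ndarray) -> float:
    """Sketch: mean ROC-AUC over tasks, skipping missing labels.

    logits:  (n_samples, n_tasks) raw scores
    targets: (n_samples, n_tasks) 0/1 labels, NaN where missing (assumed convention)
    """
    aucs = []
    for t in range(targets.shape[1]):
        mask = ~np.isnan(targets[:, t])
        y = targets[mask, t]
        # ROC-AUC is undefined when only one class is present for a task.
        if mask.any() and len(np.unique(y)) == 2:
            aucs.append(roc_auc_score(y, logits[mask, t]))
    return float(np.mean(aucs))
```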