---
library_name: transformers
tags:
- chemistry
- smiles
- molecular-property-prediction
- masked-language-modeling
- bert
license: apache-2.0
---

# ChemLM 25.75M

ChemLM 25.75M is a small BERT-style chemical language model pre-trained with masked language modeling on molecular SMILES strings.

This checkpoint is intended as an encoder initialization for downstream molecular property prediction tasks, using the fine-tuning and linear-probe code in the original repository.

- Repository: https://github.com/sagawatatsuya/chemlm_pretraining
- Paper: *How Well Do Large-Scale Chemical Language Models Transfer to Downstream Tasks?*
- Tokenizer: `ibm-research/MoLFormer-XL-both-10pct`
- Model size: 25.75M parameters

## Model Details

| Hyperparameter | Value |
|---|---:|
| Hidden size | 512 |
| Number of hidden layers | 8 |
| Number of attention heads | 8 |
| Intermediate size | 2048 |
| Vocabulary size | 2362 |
| Maximum sequence length during pre-training | 512 |

The model was pre-trained using the Academic Budget BERT-based implementation in the repository and converted from a DeepSpeed checkpoint to Hugging Face format.
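
For orientation, the hyperparameters above correspond to a standard `transformers` `BertConfig` along the following lines. This is an illustrative sketch only; the repository actually uses its own `PretrainedBertConfig` class, which carries additional fields.

```python
from transformers import BertConfig

# Illustrative only: the repository's PretrainedBertConfig adds extra
# fields, but the core architecture matches these standard values.
config = BertConfig(
    vocab_size=2362,
    hidden_size=512,
    num_hidden_layers=8,
    num_attention_heads=8,
    intermediate_size=2048,
    max_position_embeddings=512,
)
```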

## Intended Use

This checkpoint is intended for downstream molecular property prediction by loading it with the custom model class defined in the repository:

```python
from pretraining.configs import PretrainedBertConfig
from pretraining.modeling import BertForSequenceClassificationMolecule
```

The downstream model adds a task-specific classification or regression head on top of the pre-trained BERT encoder.

The supported training modes are:

* full fine-tuning
* linear probe, where the BERT encoder is frozen and only the prediction head is trained

## Model Architecture for Downstream Tasks

For downstream molecular property prediction, the repository uses:

```python
BertForSequenceClassificationMolecule
```

This model consists of:

1. a pre-trained `BertModel` encoder,
2. `BertPoolerC`, which mean-pools non-special tokens using the attention mask (see the sketch below),
3. dropout,
4. a linear prediction head:

```python
self.classifier = nn.Linear(config.hidden_size, self.num_labels)
```

The output is a Hugging Face `SequenceClassifierOutput` containing `loss` and `logits`.
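
A minimal sketch of the pooling step, assuming the mask marks the tokens to average over with 1. This illustrates the idea and is not the repository's exact `BertPoolerC` implementation:

```python
import torch

def masked_mean_pool(hidden_states: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
    """Mean-pool token embeddings over positions where mask == 1.

    hidden_states: (batch, seq_len, hidden_size) encoder outputs
    mask: (batch, seq_len), 1 for the non-special tokens to pool over
    """
    mask = mask.unsqueeze(-1).to(hidden_states.dtype)  # (batch, seq_len, 1)
    summed = (hidden_states * mask).sum(dim=1)         # sum of kept token vectors
    count = mask.sum(dim=1).clamp(min=1.0)             # number of kept tokens
    return summed / count                              # (batch, hidden_size)
```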

The loss function depends on the task type:

| Task type | Number of labels | Loss |
| --- | ---: | --- |
| Regression | 1 | `MSELoss` |
| Binary classification | 2 | `CrossEntropyLoss` |
| Multitask classification | Number of target columns | Masked `BCEWithLogitsLoss` |
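
A masked `BCEWithLogitsLoss` is typically computed by excluding missing labels from the average. A minimal sketch, assuming missing labels are encoded as -1 (the repository's exact convention may differ):

```python
import torch
import torch.nn as nn

def masked_bce_with_logits(logits: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    """BCEWithLogitsLoss over (batch, num_tasks) targets, skipping missing labels.

    Assumes missing labels are encoded as -1, so only observed entries
    contribute to the loss.
    """
    valid = (labels >= 0).float()          # 1 where a label is observed
    targets = labels.clamp(min=0).float()  # dummy 0 target at missing entries
    per_entry = nn.functional.binary_cross_entropy_with_logits(
        logits, targets, reduction="none"
    )
    return (per_entry * valid).sum() / valid.sum().clamp(min=1.0)
```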

## Usage

The intended usage is to load the pre-trained checkpoint inside the repository's downstream training script.

First, clone the original repository and install the fine-tuning environment:

```bash
git clone https://github.com/sagawatatsuya/chemlm_pretraining.git
cd chemlm_pretraining

conda create -n chemlm_finetuning python=3.11
conda activate chemlm_finetuning

pip install -r requirements.txt
pip install torch transformers==4.57.3
pip install -U accelerate deepspeed
```

Then run fine-tuning, for example on BBBP:

```bash
python chemlm_pretraining/run_ft_molecule.py \
    --model_name_or_path "sagawatatsuya/chemlm-25.75m" \
    --tokenizer_name "ibm-research/MoLFormer-XL-both-10pct" \
    --train_file "./chemlm_pretraining/dataset/finetune_datasets/data/bbbp/train.csv" \
    --validation_file "./chemlm_pretraining/dataset/finetune_datasets/data/bbbp/valid.csv" \
    --test_file "./chemlm_pretraining/dataset/finetune_datasets/data/bbbp/test.csv" \
    --task_name "bbbp" \
    --task_config "chemlm_pretraining/task_config.json" \
    --do_train \
    --do_eval \
    --do_predict \
    --per_device_train_batch_size 256 \
    --per_device_eval_batch_size 512 \
    --learning_rate 3e-5 \
    --num_train_epochs 500 \
    --save_strategy epoch \
    --eval_strategy epoch
```

For linear probe evaluation, set:

```bash
--training_type "linear_probe"
```

In this mode, the script freezes the parameters under `model.bert` and trains only the task-specific head.
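
A minimal sketch of that freezing logic, assuming the encoder's parameters live under the `bert.` prefix as in standard BERT models (illustrative, not the script's exact code):

```python
# Freeze the encoder; only the pooler and classification head stay trainable.
for name, param in model.named_parameters():
    if name.startswith("bert."):
        param.requires_grad = False

# Sanity check: count trainable parameters after freezing.
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"trainable parameters: {trainable}")
```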

## Loading the Model in Python

The downstream script loads local converted checkpoints with the custom config and model class:

```python
import json
from argparse import Namespace

from transformers import AutoTokenizer
from pretraining.configs import PretrainedBertConfig
from pretraining.modeling import BertForSequenceClassificationMolecule

model_name_or_path = "sagawatatsuya/chemlm-25.75m"
task_name = "bbbp"
task_config_path = "chemlm_pretraining/task_config.json"

# Resolve the number of labels from the task metadata.
with open(task_config_path) as f:
    task_to_keys = json.load(f)
task_info = task_to_keys[task_name]

if task_info["task_category"] == "regression":
    num_labels = 1
elif task_info["task_category"] == "classification":
    num_labels = 2
else:  # multitask classification
    num_labels = len(task_info["target_columns"])

# args.json is written during checkpoint conversion; this step expects
# model_name_or_path to resolve to a local checkpoint directory.
with open(f"{model_name_or_path}/args.json") as f:
    pretrain_run_args = json.load(f)
ds_args = Namespace(**pretrain_run_args)

config = PretrainedBertConfig.from_pretrained(
    model_name_or_path,
    num_labels=num_labels,
    finetuning_task=task_name,
    layer_norm_type="pytorch",
    task_category=task_info["task_category"],
    fused_linear_layer=True,
    max_seq_length=task_info["max_seq_length"],
)

tokenizer = AutoTokenizer.from_pretrained(
    "ibm-research/MoLFormer-XL-both-10pct",
    trust_remote_code=True,
)

model = BertForSequenceClassificationMolecule.from_pretrained(
    model_name_or_path,
    config=config,
    args=ds_args,
)
```
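
Continuing from the snippet above, the loaded model can be run on a tokenized SMILES string. A minimal sketch, assuming the custom class follows the standard `input_ids`/`attention_mask` forward signature:

```python
import torch

# Tokenize one SMILES string and run a forward pass without gradients.
inputs = tokenizer("CCO", return_tensors="pt")
model.eval()
with torch.no_grad():
    outputs = model(**inputs)
print(outputs.logits.shape)  # (1, num_labels)
```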

For normal use, `run_ft_molecule.py` is recommended over writing this loading code by hand.

## Data

The pre-training data preparation follows the repository pipeline:

1. download the ZINC-15 and PubChem SMILES used in MoLFormer-style pre-training,
2. preprocess, shard, and split the dataset,
3. create masked language modeling samples with the `ibm-research/MoLFormer-XL-both-10pct` tokenizer.

## Pre-training Objective

The model was pre-trained with masked language modeling.

The sample generation configuration uses:

| Setting | Value |
| --- | ---: |
| Masked LM probability | 0.15 |
| Maximum sequence length | 512 |
| Maximum predictions per sequence | 77 |
| Tokenizer | `ibm-research/MoLFormer-XL-both-10pct` |
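
Note that 77 is 0.15 × 512 rounded up. The repository generates masked samples with its own Academic Budget BERT pipeline; for illustration only, an equivalent masking setup with the Hugging Face data collator would look like this, assuming the tokenizer provides a pad token:

```python
from transformers import AutoTokenizer, DataCollatorForLanguageModeling

tokenizer = AutoTokenizer.from_pretrained(
    "ibm-research/MoLFormer-XL-both-10pct", trust_remote_code=True
)

# Mask 15% of tokens, matching the pre-training configuration above.
collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15
)

encoded = tokenizer(["CCO", "c1ccccc1"], truncation=True, max_length=512)
batch = collator([{"input_ids": ids} for ids in encoded["input_ids"]])
print(batch["input_ids"].shape, batch["labels"].shape)
```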

## Downstream Tasks

The repository defines task metadata in `chemlm_pretraining/task_config.json`.

Supported task categories include:

* binary classification, e.g. BBBP, BACE, HIV
* multitask classification, e.g. Tox21, ClinTox, SIDER
* regression, e.g. QM9 targets, ESOL, FreeSolv, Lipophilicity

Metrics are selected from the task config:

| Task category | Metric |
| --- | --- |
| Classification | ROC-AUC |
| Multitask classification | Mean ROC-AUC over tasks |
| Regression | MAE or RMSE, depending on the task config |
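
For multitask datasets, the mean ROC-AUC is commonly computed per task and then averaged, skipping missing labels. A minimal sketch, assuming missing labels are encoded as -1 (the repository's convention may differ):

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def mean_roc_auc(labels: np.ndarray, scores: np.ndarray) -> float:
    """Mean ROC-AUC over tasks for (num_samples, num_tasks) arrays.

    Missing labels are assumed to be encoded as -1 and are skipped;
    tasks with only one observed class are excluded, since ROC-AUC is
    undefined for them.
    """
    aucs = []
    for t in range(labels.shape[1]):
        observed = labels[:, t] >= 0
        if observed.any() and len(np.unique(labels[observed, t])) == 2:
            aucs.append(roc_auc_score(labels[observed, t], scores[observed, t]))
    return float(np.mean(aucs))
```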

## Citation

If you use this model, please cite:

```bibtex
@misc{sagawa2026largescalechemicallanguagemodels,
  title={How Well Do Large-Scale Chemical Language Models Transfer to Downstream Tasks?},
  author={Tatsuya Sagawa and Ryosuke Kojima},
  year={2026},
  eprint={2602.11618},
  archivePrefix={arXiv},
  primaryClass={cs.LG},
  url={https://arxiv.org/abs/2602.11618},
}
```