---
library_name: transformers
tags:
- smiles
- chemistry
- BERT
- molecules
license: mit
datasets:
- fabikru/half-of-chembl-2025-randomized-smiles-cleaned
---

# MolEncoder

MolEncoder is a BERT-based chemical language model pretrained on SMILES strings using masked language modeling (MLM). It was designed to investigate optimal pretraining strategies for molecular representation learning, with a particular focus on masking ratio, dataset size, and model size. The model is described in detail in the paper "MolEncoder: Towards Optimal Masked Language Modeling for Molecules".

## Model Description

- **Architecture:** Encoder-only transformer based on ModernBERT
- **Parameters:** ~15M
- **Tokenizer:** Character-level tokenizer covering the full SMILES vocabulary
- **Pretraining Objective:** Masked language modeling with an optimized masking ratio (30% was found to work best for molecules)
- **Pretraining Data:** ~1M molecules (half of ChEMBL)

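This card does not ship a loading snippet, so the following is a minimal masked-token prediction sketch. It assumes the checkpoint loads through the standard `transformers` Auto classes (consistent with the `library_name: transformers` metadata above), and `fabikru/MolEncoder` is a stand-in for this card's actual repository id:

```python
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

# Hypothetical Hub id -- replace with this card's actual repository id.
model_id = "fabikru/MolEncoder"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForMaskedLM.from_pretrained(model_id)
model.eval()

# Mask one position of a SMILES string (aspirin) and let the model fill it in.
smiles = "CC(=O)Oc1ccccc1C(=O)O"
inputs = tokenizer(smiles, return_tensors="pt")
masked_position = 5  # arbitrary position, chosen for illustration
inputs["input_ids"][0, masked_position] = tokenizer.mask_token_id

with torch.no_grad():
    logits = model(**inputs).logits

# Take the highest-scoring token at the masked position.
predicted_id = logits[0, masked_position].argmax(dim=-1).item()
print(tokenizer.convert_ids_to_tokens(predicted_id))
```

Because the tokenizer is character-level, masking a single position roughly corresponds to hiding one SMILES character, so the printed token should be a plausible atom or bond symbol.
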
## Key Findings

- Higher masking ratios (20–60%) outperform the standard 15% used in prior molecular BERT models.
- Increasing model size or dataset size beyond moderate scales yields no consistent performance benefits and can degrade efficiency.
- This 15M-parameter model pretrained on ~1M molecules outperforms much larger models pretrained on more SMILES strings.

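To reuse the masking-ratio finding in your own MLM pretraining, the ratio can be set directly on the standard `transformers` data collator. The sketch below is illustrative only (the actual pretraining code lives in the GitHub repository linked below), and the Hub id is again a stand-in:

```python
from transformers import AutoTokenizer, DataCollatorForLanguageModeling

# Hypothetical Hub id -- use this card's actual repository id.
tokenizer = AutoTokenizer.from_pretrained("fabikru/MolEncoder")

# Mask 30% of tokens per sequence instead of the 15% BERT default.
collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer,
    mlm=True,
    mlm_probability=0.30,
)
```
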
## Intended Uses

- **Primary use:** Molecular property prediction through fine-tuning on downstream datasets

## How to Use

Please refer to the [MolEncoder GitHub repository](https://github.com/FabianKruger/MolEncoder) for detailed instructions and ready-to-use examples for fine-tuning the model on custom data and running predictions.

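For a rough sense of what fine-tuning looks like before opening the repository, here is a minimal sketch using the standard `transformers` sequence-classification head. The three-molecule dataset, column names, and hyperparameters are toy placeholders, `fabikru/MolEncoder` is again a stand-in for this card's repository id, and the repository's own scripts remain the canonical reference:

```python
from datasets import Dataset
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)

# Hypothetical Hub id -- use this card's actual repository id.
model_id = "fabikru/MolEncoder"
tokenizer = AutoTokenizer.from_pretrained(model_id)
# num_labels=1 gives a single-output head suited to a continuous property.
model = AutoModelForSequenceClassification.from_pretrained(model_id, num_labels=1)

# Toy stand-in for a property-prediction dataset: SMILES plus a float label.
data = Dataset.from_dict({
    "smiles": ["CCO", "c1ccccc1", "CC(=O)Oc1ccccc1C(=O)O"],
    "label": [0.32, 1.69, 1.31],
})

def tokenize(batch):
    return tokenizer(batch["smiles"], truncation=True)

data = data.map(tokenize, batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="molencoder-finetuned", num_train_epochs=3),
    train_dataset=data,
    tokenizer=tokenizer,  # enables dynamic padding via the default collator
)
trainer.train()
```

With `num_labels=1` and float labels, the classification head falls back to a mean-squared-error regression objective, which matches the molecular property prediction use case above.
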
## Citation

If you use this model, please cite:

```bibtex
@Article{D5DD00369E,
  author    = "Krüger, Fabian P. and Österbacka, Nicklas and Kabeshov, Mikhail and Engkvist, Ola and Tetko, Igor",
  title     = "MolEncoder: towards optimal masked language modeling for molecules",
  journal   = "Digital Discovery",
  year      = "2025",
  pages     = "-",
  publisher = "RSC",
  doi       = "10.1039/D5DD00369E",
  url       = "http://dx.doi.org/10.1039/D5DD00369E"
}
```