---
library_name: transformers
tags:
- smiles
- chemistry
- BERT
- molecules
license: mit
datasets:
- fabikru/half-of-chembl-2025-randomized-smiles-cleaned
---

# MolEncoder

MolEncoder is a BERT-based chemical language model pretrained on SMILES strings using masked language modeling (MLM). It was designed to investigate optimal pretraining strategies for molecular representation learning, with a particular focus on masking ratio, dataset size, and model size. The model is described in detail in the paper "MolEncoder: Towards Optimal Masked Language Modeling for Molecules".

## Model Description

- **Architecture:** Encoder-only transformer based on ModernBERT
- **Parameters:** ~15M
- **Tokenizer:** Character-level tokenizer covering the full SMILES vocabulary
- **Pretraining Objective:** Masked language modeling with an optimized masking ratio (30% was found to work best for molecules)
- **Pretraining Data:** ~1M molecules (half of ChEMBL)

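This card does not ship a loading snippet, so the following is a minimal masked-token prediction sketch. It assumes the checkpoint loads through the standard `transformers` Auto classes (consistent with the `library_name: transformers` metadata above), and `fabikru/MolEncoder` is a stand-in for this card's actual repository id:

```python
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

# Hypothetical Hub id -- replace with this card's actual repository id.
model_id = "fabikru/MolEncoder"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForMaskedLM.from_pretrained(model_id)
model.eval()

# Mask one position of a SMILES string (aspirin) and let the model fill it in.
smiles = "CC(=O)Oc1ccccc1C(=O)O"
inputs = tokenizer(smiles, return_tensors="pt")
masked_position = 5  # arbitrary position, chosen for illustration
inputs["input_ids"][0, masked_position] = tokenizer.mask_token_id

with torch.no_grad():
    logits = model(**inputs).logits

# Take the highest-scoring token at the masked position.
predicted_id = logits[0, masked_position].argmax(dim=-1).item()
print(tokenizer.convert_ids_to_tokens(predicted_id))
```

Because the tokenizer is character-level, masking a single position roughly corresponds to hiding one SMILES character, so the printed token should be a plausible atom or bond symbol.
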
## Key Findings

- Higher masking ratios (20–60%) outperform the standard 15% used in prior molecular BERT models.
- Increasing model size or dataset size beyond moderate scales yields no consistent performance benefits and can degrade efficiency.
- This 15M-parameter model pretrained on ~1M molecules outperforms much larger models pretrained on more SMILES strings.

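To reuse the masking-ratio finding in your own MLM pretraining, the ratio can be set directly on the standard `transformers` data collator. The sketch below is illustrative only (the actual pretraining code lives in the GitHub repository linked below), and the Hub id is again a stand-in:

```python
from transformers import AutoTokenizer, DataCollatorForLanguageModeling

# Hypothetical Hub id -- use this card's actual repository id.
tokenizer = AutoTokenizer.from_pretrained("fabikru/MolEncoder")

# Mask 30% of tokens per sequence instead of the 15% BERT default.
collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer,
    mlm=True,
    mlm_probability=0.30,
)
```
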
## Intended Uses

- **Primary use:** Molecular property prediction through fine-tuning on downstream datasets

## How to Use

Please refer to the [MolEncoder GitHub repository](https://github.com/FabianKruger/MolEncoder) for detailed instructions and ready-to-use examples for fine-tuning the model on custom data and running predictions.

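For a rough sense of what fine-tuning looks like before opening the repository, here is a minimal sketch using the standard `transformers` sequence-classification head. The three-molecule dataset, column names, and hyperparameters are toy placeholders, `fabikru/MolEncoder` is again a stand-in for this card's repository id, and the repository's own scripts remain the canonical reference:

```python
from datasets import Dataset
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)

# Hypothetical Hub id -- use this card's actual repository id.
model_id = "fabikru/MolEncoder"
tokenizer = AutoTokenizer.from_pretrained(model_id)
# num_labels=1 gives a single-output head suited to a continuous property.
model = AutoModelForSequenceClassification.from_pretrained(model_id, num_labels=1)

# Toy stand-in for a property-prediction dataset: SMILES plus a float label.
data = Dataset.from_dict({
    "smiles": ["CCO", "c1ccccc1", "CC(=O)Oc1ccccc1C(=O)O"],
    "label": [0.32, 1.69, 1.31],
})

def tokenize(batch):
    return tokenizer(batch["smiles"], truncation=True)

data = data.map(tokenize, batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="molencoder-finetuned", num_train_epochs=3),
    train_dataset=data,
    tokenizer=tokenizer,  # enables dynamic padding via the default collator
)
trainer.train()
```

With `num_labels=1` and float labels, the classification head falls back to a mean-squared-error regression objective, which matches the molecular property prediction use case above.
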
## Citation

If you use this model, please cite:

```bibtex
@Article{D5DD00369E,
  author    = "Krüger, Fabian P. and Österbacka, Nicklas and Kabeshov, Mikhail and Engkvist, Ola and Tetko, Igor",
  title     = "MolEncoder: towards optimal masked language modeling for molecules",
  journal   = "Digital Discovery",
  year      = "2025",
  pages     = "-",
  publisher = "RSC",
  doi       = "10.1039/D5DD00369E",
  url       = "http://dx.doi.org/10.1039/D5DD00369E"
}
```