---
license: mit
---
## 1. Model Overview
- **Model Name:** MMPT-FM (Matched Molecular Pair Transformation Foundation Model)
- **Summary:** MMPT-FM is a transformation-centric generative foundation model designed to support medicinal chemistry analog design. The model learns from matched molecular pair transformations (MMPTs), i.e., context-independent variable-to-variable chemical modifications derived from large-scale matched molecular pair data. This formulation enables scalable, interpretable, and generalizable encoding of medicinal chemistry intuition across diverse chemical series.
- **Model Specification:** Encoder–decoder Transformer. 220M parameters.
- **Developed by:** Merck & Co., Inc. (Rahway, NJ, USA) and Emory University.
- **License:** MIT license.
- **Base Model:** ChemT5 (chemistry-domain pretrained T5).
- **Model Type:** Transformer
- **Languages:** SMARTS (chemical substructure representation)
- **Pipeline Tag:** text2text-generation for MMP transformation
- **Library:** Transformers, PyTorch
## 2. Intended Use
- **Direct Use:**
- Generation of chemically valid **matched molecular pair transformations (MMPTs)**.
- Analog design at a **user-specified edit site** (R-group substitution or core hopping)
- **Downstream Use:**
- Integration into analog enumeration pipelines
- Retrieval-augmented generation (MMPT-RAG) to bias suggestions toward project- or series-specific chemistry
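The variable-to-variable transformations described above can be sketched with a small, dependency-free helper. This is only an illustration: the exact serialization MMPT-FM uses is defined in the repository, so the `lhs>>rhs` separator and the `[*:n]` numbered-attachment-point convention below are assumptions based on standard MMP/SMIRKS-style notation. The helper checks that both sides of a transformation carry the same attachment points, which is what keeps the edit context-independent:

```python
import re

ATTACHMENT = re.compile(r"\[\*:(\d+)\]")  # numbered attachment points, e.g. [*:1]

def attachment_points(fragment: str) -> set:
    """Return the set of attachment-point labels in a SMARTS fragment."""
    return set(ATTACHMENT.findall(fragment))

def is_consistent_transformation(mmpt: str) -> bool:
    """Check that an 'lhs>>rhs' transformation keeps its attachment points.

    A variable-to-variable MMPT must map each attachment point on the
    left-hand fragment to the same-numbered point on the right-hand one,
    so the modification can be applied independently of the constant part.
    """
    try:
        lhs, rhs = mmpt.split(">>")
    except ValueError:
        return False  # not a single two-part transformation
    points = attachment_points(lhs)
    return bool(points) and points == attachment_points(rhs)

# Phenyl -> pyridyl swap at a single attachment point:
print(is_consistent_transformation("[*:1]c1ccccc1>>[*:1]c1ccncc1"))  # True
```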
## 3. Bias, Risks, and Limitations
- **Known Limitations:** The model relies on the availability and coverage of large historical transformation datasets, and its performance may vary in underrepresented chemical domains.
- **Biases:** Inherits biases from ChEMBL-derived medicinal chemistry literature.
- **Risk Areas:** The framework is intended for research use and does not introduce specific ethical concerns.
## 4. Training Details
- **Training Data:** Raw data is downloaded from the ChEMBL database, available at https://chembl.gitbook.io/chembl-interface-documentation/downloads.
- **Training Data Preprocessing:**
- Drug-likeness filtering using *rule_of_druglike_soft*
- Molecular weight ≥ 200 Da
- Removal of structural alerts using the curated Walters alert list
- Data is processed with mmpdb, available at https://github.com/rdkit/mmpdb.
- **Pre-Training:** The base model, ChemT5, is available at https://github.com/GT4SD/multitask_text_and_chemistry_t5.
- **Training Procedure:**
- Supervised sequence-to-sequence learning with teacher forcing
- Cross-entropy loss
- Batch size: 64
- Learning rate: 5 × 10⁻⁴
- Hardware:
- MMPT-FM: 4 × NVIDIA A6000 GPUs
- MMP-based baselines: 4 × NVIDIA H100 GPUs
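The objective above — teacher forcing with a cross-entropy loss — can be illustrated with a dependency-free sketch. This is not the training code (which uses PyTorch and the listed hyperparameters); it only shows, for hypothetical per-step output distributions, how the loss reduces to the mean negative log-probability of the reference tokens when the decoder is conditioned on the ground-truth prefix:

```python
import math

def teacher_forced_cross_entropy(step_probs, target_ids):
    """Mean negative log-likelihood of the reference tokens.

    step_probs: one probability distribution over the vocabulary per
        decoding step, as produced when the decoder is fed the ground-truth
        prefix (teacher forcing) rather than its own previous outputs.
    target_ids: the reference token id at each step.
    """
    nll = [-math.log(probs[t]) for probs, t in zip(step_probs, target_ids)]
    return sum(nll) / len(nll)

# Toy 3-token vocabulary, 2 decoding steps:
loss = teacher_forced_cross_entropy(
    step_probs=[[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]],
    target_ids=[0, 1],
)
print(round(loss, 4))  # 0.2899
```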
## 5. Evaluation
- **Metrics:**
- Validity
- Novelty (Novel/valid, Novel/all)
- Recall (overall, in-training, out-of-training)
- **Benchmarks:**
- Held-out ChEMBL MMPT test set (in-distribution)
- Within-patent analog generation (PMV17)
- Cross-patent analog generation (PMV17 → PMV21)
- **Testing Data:** Patent-derived datasets from PMV Pharmaceuticals (2017, 2021)
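The metrics above reduce to set arithmetic once validity is decided (in practice via a cheminformatics sanity check; the `is_valid` predicate below is a hypothetical stand-in). A minimal sketch, assuming `generated` are the model's suggestions, `training` the transformations seen during training, and `reference` the held-out ground truth:

```python
def evaluate(generated, training, reference, is_valid):
    """Validity, novelty, and recall for a set of generated transformations."""
    gen = set(generated)
    valid = {g for g in gen if is_valid(g)}
    novel = valid - training  # valid and unseen during training

    in_train_ref = reference & training   # ground truth also seen in training
    out_train_ref = reference - training  # ground truth unseen in training

    def recall(ref):
        return len(ref & gen) / len(ref) if ref else 0.0

    return {
        "validity": len(valid) / len(gen),
        "novel/valid": len(novel) / len(valid) if valid else 0.0,
        "novel/all": len(novel) / len(gen),
        "recall": recall(reference),
        "recall_in_training": recall(in_train_ref),
        "recall_out_of_training": recall(out_train_ref),
    }

# Toy example with four suggestions, one of them invalid ("X"):
scores = evaluate(
    generated=["A", "B", "C", "X"],
    training={"A"},
    reference={"A", "B", "D"},
    is_valid=lambda s: s != "X",  # placeholder for a real validity check
)
print(scores["validity"], scores["recall"])  # 0.75 0.6666666666666666
```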
## 6. Usage
- **Sample Inference Code:** Inference is described conceptually in the publications; code for variable-to-variable generation with beam search is available in our GitHub repository: https://github.com/MSDLLCpapers/MMPTTransformer.
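As an illustration of what beam-search generation looks like with the Transformers library — a hedged sketch, not the repository's actual code: the checkpoint path, prompt format, and generation hyperparameters below are assumptions, and the library import is deferred so the definition can be read without `transformers` installed:

```python
def generate_transformations(checkpoint, variable_fragment, num_beams=10,
                             num_return_sequences=10, max_length=128):
    """Beam-search variable-to-variable generation with a seq2seq checkpoint.

    checkpoint: path or Hub id of the fine-tuned MMPT-FM weights (assumed).
    variable_fragment: input variable SMARTS, e.g. "[*:1]c1ccccc1".
    """
    # Deferred import: requires `pip install transformers torch`.
    from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(checkpoint)
    model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint)

    inputs = tokenizer(variable_fragment, return_tensors="pt")
    outputs = model.generate(
        **inputs,
        num_beams=num_beams,
        num_return_sequences=num_return_sequences,
        max_length=max_length,
    )
    # Each decoded string is one candidate replacement fragment.
    return tokenizer.batch_decode(outputs, skip_special_tokens=True)
```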
## 7. Citation
```bibtex
@misc{pan2026retrievalaugmentedfoundationmodelsmatched,
  title = {Retrieval-Augmented Foundation Models for Matched Molecular Pair Transformations to Recapitulate Medicinal Chemistry Intuition},
  author = {Bo Pan and Peter Zhiping Zhang and Hao-Wei Pang and Alex Zhu and Xiang Yu and Liying Zhang and Liang Zhao},
  year = {2026},
  eprint = {2602.16684},
  archivePrefix = {arXiv},
  primaryClass = {cs.LG},
  url = {https://arxiv.org/abs/2602.16684},
}

@article{doi:10.26434/chemrxiv.15001722/v1,
  author = {Hao-Wei Pang and Peter Zhiping Zhang and Bo Pan and Liang Zhao and Xiang Yu and Liying Zhang},
  title = {Scalable and Generalizable Analog Design via Learning Medicinal Chemistry Intuition from Matched Molecular Pair Transformations},
  journal = {ChemRxiv},
  volume = {2026},
  number = {0407},
  year = {2026},
  doi = {10.26434/chemrxiv.15001722/v1},
  url = {https://chemrxiv.org/doi/abs/10.26434/chemrxiv.15001722/v1},
  eprint = {https://chemrxiv.org/doi/pdf/10.26434/chemrxiv.15001722/v1},
}
```