---
license: mit
---
# 1. Model Overview


- **Model Name:** MMPT-FM & its MMP variants
- **Summary:** MMPT-FM (Matched Molecular Pair Transformation Foundation Model) and its MMP (Matched Molecular Pair) variants – MMP-M2M (molecule-to-molecule), MMP-M2T (molecule-to-transformation), and MMP-C2V (constant-to-variable) – are generative foundation models designed to support medicinal chemistry analog design. The models learn from matched molecular pair transformations (MMPTs), i.e., context-independent variable-to-variable chemical modifications, or from matched molecular pairs (MMPs), both derived from large-scale matched molecular pair data (an illustrative example follows this list). This formulation enables scalable, interpretable, and generalizable encoding of medicinal chemistry intuition across diverse chemical series.
- **Model Specification:** Encoder–decoder Transformer. 220M parameters for each model.
- **Developed by:** Merck & Co., Inc. (Rahway, NJ, USA) and Emory University.
- **License:** MIT license.
- **Base Model:** ChemT5 (chemistry-domain pretrained T5).
- **Model Type:** Transformer
- **Languages:** SMILES & SMARTS (chemical structure and substructure representations)
- **Pipeline Tag:** text2text-generation for MMP transformation
- **Library:** Transformers, PyTorch
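
As an illustration of the MMP/MMPT formulation above, the snippet below shows a toy matched molecular pair and the corresponding variable-to-variable transformation. The strings are illustrative only; the exact serialization the models consume is defined in the publications and the GitHub repository.

```python
# Illustrative matched molecular pair (MMP) and transformation (MMPT).
# Toy example only; not the models' exact input/output serialization.
pair = ("Oc1ccccc1", "Fc1ccccc1")   # phenol -> fluorobenzene: an MMP differing by one small fragment
constant = "[*:1]c1ccccc1"          # shared (constant) part, attachment point marked as [*:1]
transformation = "[*:1]O>>[*:1]F"   # context-independent variable-to-variable edit (the MMPT)
```
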
# 2. Intended Use


- **Direct Use:**
  - **MMPT-FM:**
    - Generation of chemically valid matched molecular pair transformations (MMPTs)
    - Analog design at a user-specified edit site
  - **MMP-M2M:**
    - Generation of chemically valid matched molecular pairs (MMPs)
  - **MMP-M2T:**
    - Generation of chemically valid matched molecular pair transformations
    - Analog design at a user-specified edit site
  - **MMP-C2V:**
    - Analog design at a user-specified edit site
- **Downstream Use:**
  - **MMPT-FM:**
    - Integration into analog enumeration pipelines
    - Integration into high-throughput virtual screening pipelines
    - Serves as the base model for retrieval-augmented generation (MMPT-RAG)
  - **MMP-M2M:**
    - Integration into analog enumeration pipelines
    - Integration into high-throughput virtual screening pipelines
  - **MMP-M2T:**
    - Integration into analog enumeration pipelines
    - Integration into high-throughput virtual screening pipelines
  - **MMP-C2V:**
    - Integration into analog enumeration pipelines
    - Integration into high-throughput virtual screening pipelines
# 3. Bias, Risks, and Limitations


- **Known Limitations:** The models rely on the availability and coverage of large historical transformation datasets, and their performance may vary in underrepresented chemical domains.
- **Biases:** Inherits biases from ChEMBL-derived medicinal chemistry literature.
- **Risk Areas:** Our framework is intended for research use and does not introduce specific ethical concerns.
- **Recommendations:** None
# 4. Training Details


- **Training Data:** Raw data is downloaded from the ChEMBL database and is available at [https://chembl.gitbook.io/chembl-interface-documentation/downloads](https://chembl.gitbook.io/chembl-interface-documentation/downloads).
- **Training Data Preprocessing:**
  - Drug-likeness filtering using `rule_of_druglike_soft`
  - Molecular weight ≥ 200 Da
  - Removal of structural alerts using the curated Walters alert list
  - Data is processed with MMPDB, available at [https://github.com/rdkit/mmpdb](https://github.com/rdkit/mmpdb); a preprocessing sketch appears at the end of this section
- **Pre-Training:** Base model ChemT5 is available at [https://github.com/GT4SD/multitask_text_and_chemistry_t5](https://github.com/GT4SD/multitask_text_and_chemistry_t5).
- **Training Procedure** (a configuration sketch appears at the end of this section):
  - Supervised sequence-to-sequence learning
  - Cross-entropy loss
  - Batch size: 64
  - Learning rate: `5 × 10⁻⁴`
  - Hardware:
    - MMPT-FM: 4 × NVIDIA A6000 GPUs
    - MMP variants: 4 × NVIDIA H100 GPUs
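
A minimal sketch of the preprocessing listed above, assuming RDKit for the molecular-weight cut-off; the `rule_of_druglike_soft` and Walters structural-alert filters from the authors' pipeline are only indicated as placeholders, and the file names are hypothetical. The commented mmpdb commands follow that tool's fragment/index workflow (see its documentation for the exact options).

```python
# Hedged preprocessing sketch: only the MW >= 200 Da cut-off is implemented here;
# drug-likeness (rule_of_druglike_soft) and Walters alert filters are placeholders.
from rdkit import Chem
from rdkit.Chem import Descriptors

def passes_basic_filters(smiles: str) -> bool:
    """Keep molecules that parse and have molecular weight >= 200 Da."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        return False
    return Descriptors.MolWt(mol) >= 200.0

with open("chembl.smi") as src, open("filtered.smi", "w") as dst:  # hypothetical file names
    for line in src:
        if not line.strip():
            continue
        smiles = line.split()[0]
        if passes_basic_filters(smiles):
            dst.write(line)

# Matched molecular pairs are then extracted with mmpdb (https://github.com/rdkit/mmpdb),
# e.g., from the shell:
#   mmpdb fragment filtered.smi -o filtered.fragments
#   mmpdb index filtered.fragments -o chembl.mmpdb
```
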
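A hedged sketch of the fine-tuning configuration implied by the hyperparameters above, using the Transformers `Seq2SeqTrainingArguments`. The output path, the epoch count, and the per-device split of the reported batch size of 64 across four GPUs are assumptions; the cross-entropy sequence-to-sequence loss is the default when seq2seq models are trained with labeled targets.

```python
# Hedged fine-tuning configuration (reported: batch size 64, learning rate 5e-4).
# Output path, epoch count, and per-device batch split are assumptions.
from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="mmpt-fm-finetune",   # hypothetical output directory
    per_device_train_batch_size=16,  # 16 per GPU x 4 GPUs = reported batch size of 64
    learning_rate=5e-4,              # reported learning rate
    num_train_epochs=10,             # assumption; not reported above
    predict_with_generate=True,      # evaluate by generation, the usual seq2seq setup
)
```
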
# 5. Evaluation


- **Metrics** (a metric sketch appears at the end of this section):
  - Validity
  - Novelty (Novel/valid, Novel/all)
  - Recall (overall, in-training, out-of-training)
- **Benchmarks:**
  - Held-out ChEMBL MMPT test set (in-distribution)
  - Within-patent analog generation (PMV17)
  - Cross-patent analog generation (PMV17 → PMV21)
- **Testing Data:** Patent-derived datasets from PMV Pharmaceuticals (2017, 2021)
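
A minimal sketch of how the validity and novelty metrics above can be computed with RDKit: validity is the fraction of generated outputs that parse, and novelty is reported both over valid outputs and over all outputs. The exact metric definitions, and the recall splits, follow the publications.

```python
# Hedged metric sketch: validity = parseable fraction of generations;
# novelty = canonical SMILES not present in the training set.
from rdkit import Chem

def canonical(smiles: str):
    """Return the canonical SMILES, or None if the string does not parse."""
    mol = Chem.MolFromSmiles(smiles)
    return Chem.MolToSmiles(mol) if mol is not None else None

def validity_and_novelty(generated, training_smiles):
    training_canon = {canonical(s) for s in training_smiles}
    canon = [canonical(s) for s in generated]
    valid = [c for c in canon if c is not None]
    novel = [c for c in valid if c not in training_canon]
    n = len(generated)
    return {
        "validity": len(valid) / n if n else 0.0,
        "novel_over_valid": len(novel) / len(valid) if valid else 0.0,
        "novel_over_all": len(novel) / n if n else 0.0,
    }
```
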
# 6. Usage


- **Sample Inference Code:** Described conceptually in the publications; code for variable-to-variable generation with beam search can be found in our GitHub repository: [https://github.com/MSDLLCpapers/MMPTTransformer](https://github.com/MSDLLCpapers/MMPTTransformer). A hedged inference sketch appears at the end of this section.
- **GitHub Links:** [https://github.com/MSDLLCpapers/MMPTTransformer](https://github.com/MSDLLCpapers/MMPTTransformer)
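
A hedged inference sketch for variable-to-variable generation with beam search using the Transformers API. The checkpoint identifier and the input-fragment formatting are assumptions; the repository above documents the exact prompt format and generation settings.

```python
# Hedged inference sketch; checkpoint path and input formatting are assumptions.
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_name = "MSDLLCpapers/MMPT-FM"  # hypothetical identifier; substitute the released checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)  # encoder-decoder Transformer, ~220M parameters

source_fragment = "[*:1]c1ccccc1O"  # illustrative variable fragment with one attachment point
inputs = tokenizer(source_fragment, return_tensors="pt")
outputs = model.generate(
    **inputs,
    num_beams=10,             # beam search, as referenced in the Usage section
    num_return_sequences=10,  # return the top beams as candidate replacement fragments
    max_length=128,
)
candidates = tokenizer.batch_decode(outputs, skip_special_tokens=True)
print(candidates)
```
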
# 7. Citation


**BibTeX:**


```bibtex
@article{pang2026scalable,
  title={Scalable and Generalizable Analog Design via Learning Medicinal Chemistry Intuition from Matched Molecular Pair Transformations},
  author={Pang, Hao-Wei and Zhang, Peter Zhiping and Pan, Bo and Zhao, Liang and Yu, Xiang and Zhang, Liying},
  journal={ChemRxiv},
  doi={10.26434/chemrxiv.15001722},
  year={2026}
}

@article{pan2026retrieval,
  title={Retrieval-Augmented Foundation Models for Matched Molecular Pair Transformations to Recapitulate Medicinal Chemistry Intuition},
  author={Pan, Bo and Zhang, Peter Zhiping and Pang, Hao-Wei and Zhu, Alex and Yu, Xiang and Zhang, Liying and Zhao, Liang},
  journal={arXiv preprint arXiv:2602.16684},
  year={2026}
}

@article{pan2026transformer,
  title={Transformer-Based Approach for Automated Functional Group Replacement in Chemical Compounds},
  author={Pan, Bo and Zhang, Zhiping and Spiekermann, Kevin and Chen, Tianchi and Yu, Xiang and Zhang, Liying and Zhao, Liang},
  journal={arXiv preprint arXiv:2601.07930},
  year={2026}
}
```