---
license: mit
---

# 1. Model Overview

- **Model Name:** MMPT-FM & its MMP variants
- **Summary:** MMPT-FM (Matched Molecular Pair Transformation Foundation Model) and its MMP (Matched Molecular Pair) variants – MMP-M2M (molecule-to-molecule), MMP-M2T (molecule-to-transformation), and MMP-C2V (constant-to-variable) – are generative foundation models designed to support medicinal chemistry analog design. The models learn matched molecular pair transformations (MMPTs), i.e., context-independent variable-to-variable chemical modifications, or matched molecular pairs (MMPs), derived from large-scale matched molecular pair data. This formulation enables scalable, interpretable, and generalizable encoding of medicinal chemistry intuition across diverse chemical series.
- **Model Specification:** Encoder–decoder Transformer; 220M parameters per model.
- **Developed by:** Merck & Co., Inc. (Rahway, NJ, USA) and Emory University.
- **License:** MIT
- **Base Model:** ChemT5 (chemistry-domain pretrained T5)
- **Model Type:** Transformer
- **Languages:** SMARTS & SMILES (chemical substructure representations)
- **Pipeline Tag:** text2text-generation for MMP transformation
- **Library:** Transformers, PyTorch

# 2. Intended Use

- **Direct Use:**
  - **MMPT-FM:**
    - Generation of chemically valid matched molecular pair transformations (MMPTs)
    - Analog design at a user-specified edit site
  - **MMP-M2M:**
    - Generation of chemically valid matched molecular pairs (MMPs)
  - **MMP-M2T:**
    - Generation of chemically valid matched molecular pair transformations
    - Analog design at a user-specified edit site
  - **MMP-C2V:**
    - Analog design at a user-specified edit site
- **Downstream Use:**
  - **MMPT-FM:**
    - Integration into analog enumeration pipelines
    - Integration into high-throughput virtual screening pipelines
    - Serves as the base model for retrieval-augmented generation (MMPT-RAG)
  - **MMP-M2M:**
    - Integration into analog enumeration pipelines
    - Integration into high-throughput virtual screening pipelines
  - **MMP-M2T:**
    - Integration into analog enumeration pipelines
    - Integration into high-throughput virtual screening pipelines
  - **MMP-C2V:**
    - Integration into analog enumeration pipelines
    - Integration into high-throughput virtual screening pipelines

# 3. Bias, Risks, and Limitations

- **Known Limitations:** The models rely on the availability and coverage of large historical transformation datasets, and their performance may vary in underrepresented chemical domains.
- **Biases:** The models inherit biases from ChEMBL-derived medicinal chemistry literature.
- **Risk Areas:** Our framework is intended for research use and does not introduce specific ethical concerns.
- **Recommendations:** None

# 4. Training Details

- **Training Data:** Raw data is downloaded from the ChEMBL database, available at [https://chembl.gitbook.io/chembl-interface-documentation/downloads](https://chembl.gitbook.io/chembl-interface-documentation/downloads).
- **Training Data Preprocessing:**
  - Drug-likeness filtering using `rule_of_druglike_soft`
  - Molecular weight ≥ 200 Da
  - Removal of structural alerts using the curated Walters alert list
  - Data is processed with mmpdb, available at [https://github.com/rdkit/mmpdb](https://github.com/rdkit/mmpdb)
- **Pre-Training:** The base model ChemT5 is available at [https://github.com/GT4SD/multitask_text_and_chemistry_t5](https://github.com/GT4SD/multitask_text_and_chemistry_t5).
- **Training Procedure:**
  - Supervised sequence-to-sequence learning
  - Cross-entropy loss
  - Batch size: 64
  - Learning rate: `5 × 10⁻⁴`
  - Hardware:
    - MMPT-FM: 4 × NVIDIA A6000 GPUs
    - MMP variants: 4 × NVIDIA H100 GPUs

# 5. Evaluation

- **Metrics:**
  - Validity
  - Novelty (Novel/valid, Novel/all)
  - Recall (overall, in-training, out-of-training)
- **Benchmarks:**
  - Held-out ChEMBL MMPT test set (in-distribution)
  - Within-patent analog generation (PMV17)
  - Cross-patent analog generation (PMV17 → PMV21)
- **Testing Data:** Patent-derived datasets from PMV Pharmaceuticals (2017, 2021)

# 6. Usage

- **Sample Inference Code:** Described conceptually in the publications; code for variable-to-variable generation with beam search can be found in our GitHub repository: [https://github.com/MSDLLCpapers/MMPTTransformer](https://github.com/MSDLLCpapers/MMPTTransformer).
- **GitHub Links:** [https://github.com/MSDLLCpapers/MMPTTransformer](https://github.com/MSDLLCpapers/MMPTTransformer)

# 7. Citation

**BibTeX:**

```bibtex
@article{pang2026scalable,
  title={Scalable and Generalizable Analog Design via Learning Medicinal Chemistry Intuition from Matched Molecular Pair Transformations},
  author={Pang, Hao-Wei and Zhang, Peter Zhiping and Pan, Bo and Zhao, Liang and Yu, Xiang and Zhang, Liying},
  journal={ChemRxiv},
  doi={10.26434/chemrxiv.15001722},
  year={2026}
}

@article{pan2026retrieval,
  title={Retrieval-Augmented Foundation Models for Matched Molecular Pair Transformations to Recapitulate Medicinal Chemistry Intuition},
  author={Pan, Bo and Zhang, Peter Zhiping and Pang, Hao-Wei and Zhu, Alex and Yu, Xiang and Zhang, Liying and Zhao, Liang},
  journal={arXiv preprint arXiv:2602.16684},
  year={2026}
}

@article{pan2026transformer,
  title={Transformer-Based Approach for Automated Functional Group Replacement in Chemical Compounds},
  author={Pan, Bo and Zhang, Zhiping and Spiekermann, Kevin and Chen, Tianchi and Yu, Xiang and Zhang, Liying and Zhao, Liang},
  journal={arXiv preprint arXiv:2601.07930},
  year={2026}
}
```
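
# Appendix: Beam-Search Decoding Sketch

The usage section above mentions variable-to-variable generation with beam search. As a conceptual, model-agnostic illustration of that decoding strategy, here is a minimal pure-Python sketch. Everything in it is hypothetical: `score_next`, the toy token vocabulary, and the interface are stand-ins for illustration only, not the actual MMPTTransformer API — see the GitHub repository for the real inference code.

```python
import math

def beam_search(score_next, start_token, end_token, beam_width=3, max_len=10):
    """Generic beam-search decoder.

    score_next(seq) returns a list of (token, log_prob) candidates for
    extending `seq`. In a real setting this would wrap the decoder of an
    encoder-decoder Transformer; here it is a hypothetical stand-in.
    """
    beams = [([start_token], 0.0)]  # (sequence, cumulative log-probability)
    finished = []
    for _ in range(max_len):
        candidates = []
        for seq, score in beams:
            for token, logp in score_next(seq):
                new_seq = seq + [token]
                if token == end_token:
                    finished.append((new_seq, score + logp))  # completed hypothesis
                else:
                    candidates.append((new_seq, score + logp))
        if not candidates:
            break
        # Keep only the top `beam_width` partial hypotheses.
        beams = sorted(candidates, key=lambda b: b[1], reverse=True)[:beam_width]
    finished.sort(key=lambda b: b[1], reverse=True)
    return finished if finished else beams

# Toy "model": prefers token 'C' over 'N', then emits end-of-sequence.
def toy_score_next(seq):
    if len(seq) >= 3:
        return [("</s>", 0.0)]
    return [("C", math.log(0.7)), ("N", math.log(0.3))]

best = beam_search(toy_score_next, "<s>", "</s>", beam_width=2)
print(best[0][0])  # → ['<s>', 'C', 'C', '</s>']
```

In practice, beam search trades diversity for likelihood: a larger `beam_width` surfaces more candidate transformations per edit site at higher compute cost, which is why it suits analog enumeration where several plausible modifications are wanted rather than a single greedy completion.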