---
license: mit
---

## 1. Model Overview

- **Model Name:** MMPT-FM (Matched Molecular Pair Transformation Foundation Model)
- **Summary:** MMPT-FM is a transformation-centric generative foundation model designed to support medicinal chemistry analog design. The model learns from matched molecular pair transformations (MMPTs), i.e., context-independent variable-to-variable chemical modifications derived from large-scale matched molecular pair data. This formulation enables scalable, interpretable, and generalizable encoding of medicinal chemistry intuition across diverse chemical series.
- **Model Specification:** Encoder–decoder Transformer, 220M parameters.
- **Developed by:** Merck & Co., Inc. (Rahway, NJ, USA) and Emory University.
- **License:** MIT license.
- **Base Model:** ChemT5 (chemistry-domain pretrained T5).
- **Model Type:** Transformer
- **Languages:** SMARTS (chemical substructure representation)
- **Pipeline Tag:** text2text-generation for MMP transformation
- **Library:** Transformers, PyTorch

## 2. Intended Use

- **Direct Use:**
  - Generation of chemically valid **matched molecular pair transformations (MMPTs)**
  - Analog design at a **user-specified edit site** (R-group substitution or core hopping)
- **Downstream Use:**
  - Integration into analog enumeration pipelines
  - Retrieval-augmented generation (MMPT-RAG) to bias suggestions toward project- or series-specific chemistry

## 3. Bias, Risks, and Limitations

- **Known Limitations:** The model relies on the availability and coverage of large historical transformation datasets, and its performance may vary in underrepresented chemical domains.
- **Biases:** Inherits biases from ChEMBL-derived medicinal chemistry literature.
- **Risk Areas:** Our framework is intended for research use and does not introduce specific ethical concerns.

## 4. Training Details

- **Training Data:** Raw data is downloaded from the ChEMBL database, available at https://chembl.gitbook.io/chembl-interface-documentation/downloads.
- **Training Data Preprocessing** (a minimal filtering sketch appears after the Usage section below):
  - Drug-likeness filtering using *rule_of_druglike_soft*
  - Molecular weight ≥ 200 Da
  - Removal of structural alerts using the curated Walters alert list
  - Data is processed with mmpdb, available at https://github.com/rdkit/mmpdb.
- **Pre-Training:** The base model, ChemT5, is available at https://github.com/GT4SD/multitask_text_and_chemistry_t5.
- **Training Procedure** (a minimal fine-tuning sketch appears after the Usage section below):
  - Supervised sequence-to-sequence learning with teacher forcing
  - Cross-entropy loss
  - Batch size: 64
  - Learning rate: 5 × 10⁻⁴
  - Hardware:
    - MMPT-FM: 4 × NVIDIA A6000 GPUs
    - MMP-based baselines: 4 × NVIDIA H100 GPUs

## 5. Evaluation

- **Metrics:**
  - Validity
  - Novelty (Novel/valid, Novel/all)
  - Recall (overall, in-training, out-of-training)
- **Benchmarks:**
  - Held-out ChEMBL MMPT test set (in-distribution)
  - Within-patent analog generation (PMV17)
  - Cross-patent analog generation (PMV17 → PMV21)
- **Testing Data:** Patent-derived datasets from PMV Pharmaceuticals (2017, 2021)

## 6. Usage

- **Sample Inference Code:** Described conceptually in the publications; code for variable-to-variable generation with beam search can be found in our GitHub repository: https://github.com/MSDLLCpapers/MMPTTransformer. A minimal sketch follows below.
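The snippet below is a minimal, illustrative inference sketch using the Hugging Face `transformers` API. The checkpoint path, the input fragment encoding, and the generation settings are assumptions for illustration only; the released code in the GitHub repository above is authoritative.

```python
# A minimal sketch of variable-to-variable generation with beam search using the
# Hugging Face transformers API. The checkpoint path, input encoding, and
# generation settings below are illustrative assumptions, not the released code.
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model_path = "path/to/mmpt-fm-checkpoint"  # placeholder, not an official model ID
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForSeq2SeqLM.from_pretrained(model_path)

# Hypothetical input: the variable fragment at the user-specified edit site,
# written with an attachment point; the model decodes candidate replacements.
source_fragment = "[*:1]c1ccccc1"  # illustrative phenyl fragment

inputs = tokenizer(source_fragment, return_tensors="pt")
outputs = model.generate(
    **inputs,
    num_beams=10,             # beam search, as described above
    num_return_sequences=10,  # return multiple candidate transformations
    max_new_tokens=128,
)
for candidate in tokenizer.batch_decode(outputs, skip_special_tokens=True):
    print(candidate)
```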
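The filtering sketch referenced in Section 4 follows. It implements only the molecular-weight cutoff with RDKit; the *rule_of_druglike_soft* drug-likeness filter, the Walters structural-alert filter, and the mmpdb fragmentation/indexing steps are noted in comments rather than reproduced, since their exact invocations are not part of this card.

```python
# A minimal sketch of the molecular-weight filter from the preprocessing step,
# using RDKit. The drug-likeness and structural-alert filters and the mmpdb
# steps are only noted as comments; this is not the exact preprocessing script.
from rdkit import Chem
from rdkit.Chem import Descriptors


def passes_basic_filters(smiles: str, min_mw: float = 200.0) -> bool:
    """Keep molecules that parse and have molecular weight >= min_mw Da."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        return False  # discard unparsable SMILES
    if Descriptors.MolWt(mol) < min_mw:
        return False  # below the 200 Da cutoff listed above
    # Further steps (not shown): rule_of_druglike_soft drug-likeness filtering
    # and removal of structures matching the Walters alert list.
    return True


# Illustrative inputs: ethanol (MW ~46) fails, ibuprofen (MW ~206) passes.
smiles_list = ["CCO", "CC(C)Cc1ccc(cc1)C(C)C(=O)O"]
kept = [s for s in smiles_list if passes_basic_filters(s)]
print(kept)

# The filtered SMILES are then turned into matched molecular pairs with mmpdb,
# e.g. `mmpdb fragment` followed by `mmpdb index` (see the mmpdb documentation).
```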
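The fine-tuning sketch referenced in Section 4 follows, assuming a Hugging Face `Seq2SeqTrainer` setup. Only the sequence-to-sequence objective, batch size, and learning rate come from this card; the base-model path, epoch count, per-device batch split, and the toy MMPT pairs are placeholders, and the released training code may differ.

```python
# A minimal sketch of the supervised seq2seq fine-tuning described above,
# assuming a Hugging Face Seq2SeqTrainer setup. Paths and the toy dataset are
# placeholders; only batch size 64, lr 5e-4, and the cross-entropy seq2seq
# objective are taken from the card.
from datasets import Dataset
from transformers import (
    AutoModelForSeq2SeqLM,
    AutoTokenizer,
    DataCollatorForSeq2Seq,
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
)

base_model_path = "path/to/chemt5-base"  # placeholder; see the ChemT5 repository linked above
tokenizer = AutoTokenizer.from_pretrained(base_model_path)
model = AutoModelForSeq2SeqLM.from_pretrained(base_model_path)

# Two illustrative MMPT pairs (placeholder strings, not real training data):
# source = variable fragment before the edit, target = variable fragment after.
raw = {
    "source": ["[*:1]c1ccccc1", "[*:1]C(F)(F)F"],
    "target": ["[*:1]c1ccncc1", "[*:1]C(F)F"],
}

def tokenize(batch):
    model_inputs = tokenizer(batch["source"], truncation=True, max_length=128)
    labels = tokenizer(text_target=batch["target"], truncation=True, max_length=128)
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

train_dataset = Dataset.from_dict(raw).map(
    tokenize, batched=True, remove_columns=["source", "target"]
)

training_args = Seq2SeqTrainingArguments(
    output_dir="mmpt-fm",
    per_device_train_batch_size=16,  # assumes 4 GPUs for the reported effective batch size of 64
    learning_rate=5e-4,              # as reported in the model card
    num_train_epochs=1,              # placeholder; not reported in the card
    save_strategy="epoch",
)

trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
)
# Teacher forcing with cross-entropy loss is applied internally by the model
# whenever `labels` are provided.
trainer.train()
```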
## 7. Citation

```bibtex
@misc{pan2026retrievalaugmentedfoundationmodelsmatched,
  title={Retrieval-Augmented Foundation Models for Matched Molecular Pair Transformations to Recapitulate Medicinal Chemistry Intuition},
  author={Bo Pan and Peter Zhiping Zhang and Hao-Wei Pang and Alex Zhu and Xiang Yu and Liying Zhang and Liang Zhao},
  year={2026},
  eprint={2602.16684},
  archivePrefix={arXiv},
  primaryClass={cs.LG},
  url={https://arxiv.org/abs/2602.16684},
}

@article{doi:10.26434/chemrxiv.15001722/v1,
  author  = {Hao-Wei Pang and Peter Zhiping Zhang and Bo Pan and Liang Zhao and Xiang Yu and Liying Zhang},
  title   = {Scalable and Generalizable Analog Design via Learning Medicinal Chemistry Intuition from Matched Molecular Pair Transformations},
  journal = {ChemRxiv},
  volume  = {2026},
  number  = {0407},
  pages   = {},
  year    = {2026},
  doi     = {10.26434/chemrxiv.15001722/v1},
  url     = {https://chemrxiv.org/doi/abs/10.26434/chemrxiv.15001722/v1},
  eprint  = {https://chemrxiv.org/doi/pdf/10.26434/chemrxiv.15001722/v1},
}
```