---
license: mit
---

# 1. Model Overview

- **Model Name:** MMPT-FM & its MMP variants
- **Summary:** MMPT-FM (Matched Molecular Pair Transformation Foundation Model) and its MMP (Matched Molecular Pair) variants – MMP-M2M (molecule-to-molecule), MMP-M2T (molecule-to-transformation), and MMP-C2V (constant-to-variable) – are generative foundation models designed to support medicinal chemistry analog design. The models learn matched molecular pair transformations (MMPTs), i.e., context-independent variable-to-variable chemical modifications, or matched molecular pairs (MMPs), derived from large-scale matched molecular pair data. This formulation enables scalable, interpretable, and generalizable encoding of medicinal chemistry intuition across diverse chemical series.
- **Model Specification:** Encoder–decoder Transformer; 220M parameters per model.
- **Developed by:** Merck & Co., Inc. (Rahway, NJ, USA) and Emory University.
- **License:** MIT
- **Base Model:** ChemT5 (chemistry-domain pretrained T5)
- **Model Type:** Transformer
- **Languages:** SMARTS & SMILES (chemical substructure representations)
- **Pipeline Tag:** text2text-generation for MMP transformation
- **Library:** Transformers, PyTorch

# 2. Intended Use

- **Direct Use:**
  - **MMPT-FM:**
    - Generation of chemically valid matched molecular pair transformations (MMPTs)
    - Analog design at a user-specified edit site
  - **MMP-M2M:**
    - Generation of chemically valid matched molecular pairs (MMPs)
  - **MMP-M2T:**
    - Generation of chemically valid matched molecular pair transformations
    - Analog design at a user-specified edit site
  - **MMP-C2V:**
    - Analog design at a user-specified edit site
- **Downstream Use:**
  - **MMPT-FM:**
    - Integration into analog enumeration pipelines
    - Integration into high-throughput virtual screening pipelines
    - Serves as the base model for retrieval-augmented generation (MMPT-RAG)
  - **MMP-M2M:**
    - Integration into analog enumeration pipelines
    - Integration into high-throughput virtual screening pipelines
  - **MMP-M2T:**
    - Integration into analog enumeration pipelines
    - Integration into high-throughput virtual screening pipelines
  - **MMP-C2V:**
    - Integration into analog enumeration pipelines
    - Integration into high-throughput virtual screening pipelines

# 3. Bias, Risks, and Limitations

- **Known Limitations:** The models rely on the availability and coverage of large historical transformation datasets, and their performance may vary in underrepresented chemical domains.
- **Biases:** The models inherit biases from ChEMBL-derived medicinal chemistry literature.
- **Risk Areas:** Our framework is intended for research use and does not introduce specific ethical concerns.
- **Recommendations:** None

# 4. Training Details

- **Training Data:** Raw data is downloaded from the ChEMBL database, available at [https://chembl.gitbook.io/chembl-interface-documentation/downloads](https://chembl.gitbook.io/chembl-interface-documentation/downloads).
- **Training Data Preprocessing:**
  - Drug-likeness filtering using `rule_of_druglike_soft`
  - Molecular weight ≥ 200 Da
  - Removal of structural alerts using the curated Walters alert list
  - Data is processed with mmpdb, available at [https://github.com/rdkit/mmpdb](https://github.com/rdkit/mmpdb)
- **Pre-Training:** The base model ChemT5 is available at [https://github.com/GT4SD/multitask_text_and_chemistry_t5](https://github.com/GT4SD/multitask_text_and_chemistry_t5).
- **Training Procedure:**
  - Supervised sequence-to-sequence learning
  - Cross-entropy loss
  - Batch size: 64
  - Learning rate: `5 × 10⁻⁴`
  - Hardware:
    - MMPT-FM: 4 × NVIDIA A6000 GPUs
    - MMP variants: 4 × NVIDIA H100 GPUs

# 5. Evaluation

- **Metrics:**
  - Validity
  - Novelty (Novel/valid, Novel/all)
  - Recall (overall, in-training, out-of-training)
- **Benchmarks:**
  - Held-out ChEMBL MMPT test set (in-distribution)
  - Within-patent analog generation (PMV17)
  - Cross-patent analog generation (PMV17 → PMV21)
- **Testing Data:** Patent-derived datasets from PMV Pharmaceuticals (2017, 2021)

# 6. Usage

- **Sample Inference Code:** Described conceptually in the publications; code for variable-to-variable generation with beam search can be found in our GitHub repository: [https://github.com/MSDLLCpapers/MMPTTransformer](https://github.com/MSDLLCpapers/MMPTTransformer).
- **GitHub Links:** [https://github.com/MSDLLCpapers/MMPTTransformer](https://github.com/MSDLLCpapers/MMPTTransformer)

# 7. Citation

**BibTeX:**

```bibtex
@article{pang2026scalable,
  title={Scalable and Generalizable Analog Design via Learning Medicinal Chemistry Intuition from Matched Molecular Pair Transformations},
  author={Pang, Hao-Wei and Zhang, Peter Zhiping and Pan, Bo and Zhao, Liang and Yu, Xiang and Zhang, Liying},
  journal={ChemRxiv},
  doi={10.26434/chemrxiv.15001722},
  year={2026}
}

@article{pan2026retrieval,
  title={Retrieval-Augmented Foundation Models for Matched Molecular Pair Transformations to Recapitulate Medicinal Chemistry Intuition},
  author={Pan, Bo and Zhang, Peter Zhiping and Pang, Hao-Wei and Zhu, Alex and Yu, Xiang and Zhang, Liying and Zhao, Liang},
  journal={arXiv preprint arXiv:2602.16684},
  year={2026}
}

@article{pan2026transformer,
  title={Transformer-Based Approach for Automated Functional Group Replacement in Chemical Compounds},
  author={Pan, Bo and Zhang, Zhiping and Spiekermann, Kevin and Chen, Tianchi and Yu, Xiang and Zhang, Liying and Zhao, Liang},
  journal={arXiv preprint arXiv:2601.07930},
  year={2026}
}
```
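
# Appendix: Beam-Search Decoding Sketch

The usage section above mentions variable-to-variable generation with beam search. As a conceptual, model-agnostic illustration of that decoding strategy, here is a minimal pure-Python sketch. Everything in it is hypothetical: `score_next`, the toy token vocabulary, and the interface are stand-ins for illustration only, not the actual MMPTTransformer API — see the GitHub repository for the real inference code.

```python
import math

def beam_search(score_next, start_token, end_token, beam_width=3, max_len=10):
    """Generic beam-search decoder.

    score_next(seq) returns a list of (token, log_prob) candidates for
    extending `seq`. In a real setting this would wrap the decoder of an
    encoder-decoder Transformer; here it is a hypothetical stand-in.
    """
    beams = [([start_token], 0.0)]  # (sequence, cumulative log-probability)
    finished = []
    for _ in range(max_len):
        candidates = []
        for seq, score in beams:
            for token, logp in score_next(seq):
                new_seq = seq + [token]
                if token == end_token:
                    finished.append((new_seq, score + logp))  # completed hypothesis
                else:
                    candidates.append((new_seq, score + logp))
        if not candidates:
            break
        # Keep only the top `beam_width` partial hypotheses.
        beams = sorted(candidates, key=lambda b: b[1], reverse=True)[:beam_width]
    finished.sort(key=lambda b: b[1], reverse=True)
    return finished if finished else beams

# Toy "model": prefers token 'C' over 'N', then emits end-of-sequence.
def toy_score_next(seq):
    if len(seq) >= 3:
        return [("</s>", 0.0)]
    return [("C", math.log(0.7)), ("N", math.log(0.3))]

best = beam_search(toy_score_next, "<s>", "</s>", beam_width=2)
print(best[0][0])  # → ['<s>', 'C', 'C', '</s>']
```

In practice, beam search trades diversity for likelihood: a larger `beam_width` surfaces more candidate transformations per edit site at higher compute cost, which is why it suits analog enumeration where several plausible modifications are wanted rather than a single greedy completion.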