---
license: mit
---
# 1. Model Overview

- **Model Name:** MMPT-FM & its MMP variants
- **Summary:** MMPT-FM (Matched Molecular Pair Transformation Foundation Model) and its MMP (Matched Molecular Pair) variants – MMP-M2M (molecule-to-molecule), MMP-M2T (molecule-to-transformation), and MMP-C2V (constant-to-variable) – are generative foundation models designed to support medicinal chemistry analog design. The models learn from matched molecular pair transformations (MMPTs), i.e., context-independent variable-to-variable chemical modifications, or from matched molecular pairs (MMPs) derived from large-scale matched molecular pair data. This formulation enables scalable, interpretable, and generalizable encoding of medicinal chemistry intuition across diverse chemical series.
- **Model Specification:** Encoder–decoder Transformer. 220M parameters for each model.
- **Developed by:** Merck & Co., Inc. (Rahway, NJ, USA) and Emory University.
- **License:** MIT license.
- **Base Model:** ChemT5 (chemistry-domain pretrained T5).
- **Model Type:** Transformer
- **Languages:** SMARTS & SMILES (chemical structure and substructure representations)
- **Pipeline Tag:** text2text-generation for MMP transformation
- **Library:** Transformers, PyTorch
# 2. Intended Use

- **Direct Use:**
  - **MMPT-FM:**
    - Generation of chemically valid matched molecular pair transformations (MMPTs)
    - Analog design at a user-specified edit site.
  - **MMP-M2M:**
    - Generation of chemically valid matched molecular pairs (MMPs)
  - **MMP-M2T:**
    - Generation of chemically valid matched molecular pair transformations
    - Analog design at a user-specified edit site
  - **MMP-C2V:**
    - Analog design at a user-specified edit site
- **Downstream Use:**
  - **MMPT-FM:**
    - Integration into analog enumeration pipelines
    - Integration into high-throughput virtual screening pipelines
    - Serving as the base model for retrieval-augmented generation (MMPT-RAG)
  - **MMP-M2M:**
    - Integration into analog enumeration pipelines
    - Integration into high-throughput virtual screening pipelines
  - **MMP-M2T:**
    - Integration into analog enumeration pipelines
    - Integration into high-throughput virtual screening pipelines
  - **MMP-C2V:**
    - Integration into analog enumeration pipelines
    - Integration into high-throughput virtual screening pipelines
# 3. Bias, Risks, and Limitations

- **Known Limitations:** The models rely on the availability and coverage of large historical transformation datasets, and their performance may vary in underrepresented chemical domains.
- **Biases:** Inherits biases from ChEMBL-derived medicinal chemistry literature.
- **Risk Areas:** Our framework is intended for research use and does not introduce specific ethical concerns.
- **Recommendations:** None
# 4. Training Details

- **Training Data:** Raw data is downloaded from the ChEMBL database and is available at [https://chembl.gitbook.io/chembl-interface-documentation/downloads](https://chembl.gitbook.io/chembl-interface-documentation/downloads).
- **Training Data Preprocessing:**
  - Drug-likeness filtering using `rule_of_druglike_soft`
  - Molecular weight ≥ 200 Da
  - Removal of structural alerts using the curated Walters alert list
  - Data is processed with mmpdb, available at [https://github.com/rdkit/mmpdb](https://github.com/rdkit/mmpdb).
- **Pre-Training:** Base model ChemT5 is available at [https://github.com/GT4SD/multitask_text_and_chemistry_t5](https://github.com/GT4SD/multitask_text_and_chemistry_t5).
- **Training Procedure:**
  - Supervised sequence-to-sequence learning
  - Cross-entropy loss
  - Batch size: 64
  - Learning rate: `5 × 10⁻⁴`
  - Hardware:
    - MMPT-FM: 4 × NVIDIA A6000 GPUs
    - MMP variants: 4 × NVIDIA H100 GPUs
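
The molecular-weight filter in the preprocessing steps above can be sketched with RDKit, the toolkit that mmpdb itself builds on. This is a minimal illustration, not the exact pipeline code: the function name and example SMILES are placeholders, and the `rule_of_druglike_soft` and structural-alert filters are applied separately in the actual pipeline.

```python
from rdkit import Chem
from rdkit.Chem import Descriptors

def passes_mw_filter(smiles: str, min_mw: float = 200.0) -> bool:
    """Keep only molecules with molecular weight >= min_mw (200 Da in the paper)."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:  # unparseable SMILES are dropped
        return False
    return Descriptors.MolWt(mol) >= min_mw

# Aspirin (~180 Da) falls below the 200 Da cutoff and is filtered out.
print(passes_mw_filter("CC(=O)Oc1ccccc1C(=O)O"))  # False
```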
# 5. Evaluation

- **Metrics:**
  - Validity
  - Novelty (Novel/valid, Novel/all)
  - Recall (overall, in-training, out-of-training)
- **Benchmarks:**
  - Held-out ChEMBL MMPT test set (in-distribution)
  - Within-patent analog generation (PMV17)
  - Cross-patent analog generation (PMV17 → PMV21)
- **Testing Data:** Patent-derived datasets from PMV Pharmaceuticals (2017, 2021)
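
Given canonicalized sets of generated, training, and reference transformations, the novelty and recall metrics above reduce to set arithmetic. The sketch below is illustrative (the function and key names are not from the released code), and it assumes validity filtering, done with RDKit parsing in practice, has already been applied to the generated set.

```python
def novelty_and_recall(generated, training, reference):
    """Compute novelty and recall over sets of canonical transformation strings.

    generated : model outputs (assumed already validity-filtered)
    training  : transformations seen during training
    reference : held-out ground-truth transformations to recover
    """
    generated, training, reference = set(generated), set(training), set(reference)
    novelty = len(generated - training) / len(generated) if generated else 0.0
    overall_recall = len(generated & reference) / len(reference) if reference else 0.0
    # Split recall by whether the reference transformation appeared in training.
    in_train = reference & training
    out_train = reference - training
    in_recall = len(generated & in_train) / len(in_train) if in_train else 0.0
    out_recall = len(generated & out_train) / len(out_train) if out_train else 0.0
    return {"novelty": novelty, "recall": overall_recall,
            "recall_in_training": in_recall, "recall_out_of_training": out_recall}
```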
# 6. Usage

- **Sample Inference Code:** Described conceptually in the publications; code corresponding to variable-to-variable generation with beam search is available in our GitHub repository: [https://github.com/MSDLLCpapers/MMPTTransformer](https://github.com/MSDLLCpapers/MMPTTransformer).
- **GitHub Links:** [https://github.com/MSDLLCpapers/MMPTTransformer](https://github.com/MSDLLCpapers/MMPTTransformer)
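
A minimal inference sketch using the Hugging Face Transformers API is shown below. It assumes the checkpoint loads as a standard T5-style seq2seq model (the base model is ChemT5); the checkpoint path, input string, and decoding parameters are placeholders, and the linked GitHub repository is the authoritative reference for the exact input formatting.

```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

def generate_transformations(checkpoint, source, num_beams=10, num_return_sequences=10):
    """Beam-search decoding of candidate outputs for one input sequence.

    checkpoint : local path or Hub ID of an MMPT-FM / MMP-variant checkpoint (placeholder)
    source     : input string (e.g. a SMILES/SMARTS query, formatted as in the repo)
    """
    tokenizer = AutoTokenizer.from_pretrained(checkpoint)
    model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint)
    inputs = tokenizer(source, return_tensors="pt")
    outputs = model.generate(
        **inputs,
        num_beams=num_beams,
        num_return_sequences=num_return_sequences,
        max_new_tokens=128,
    )
    return [tokenizer.decode(o, skip_special_tokens=True) for o in outputs]

# Example (placeholder checkpoint path):
# candidates = generate_transformations("path/to/mmpt-fm", "CC(=O)Oc1ccccc1C(=O)O")
```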
# 7. Citation

**BibTeX:**

```bibtex
@article{pang2026scalable,
  title={Scalable and Generalizable Analog Design via Learning Medicinal Chemistry Intuition from Matched Molecular Pair Transformations},
  author={Pang, Hao-Wei and Zhang, Peter Zhiping and Pan, Bo and Zhao, Liang and Yu, Xiang and Zhang, Liying},
  journal={ChemRxiv},
  doi={10.26434/chemrxiv.15001722},
  year={2026}
}

@article{pan2026retrieval,
  title={Retrieval-Augmented Foundation Models for Matched Molecular Pair Transformations to Recapitulate Medicinal Chemistry Intuition},
  author={Pan, Bo and Zhang, Peter Zhiping and Pang, Hao-Wei and Zhu, Alex and Yu, Xiang and Zhang, Liying and Zhao, Liang},
  journal={arXiv preprint arXiv:2602.16684},
  year={2026}
}

@article{pan2026transformer,
  title={Transformer-Based Approach for Automated Functional Group Replacement in Chemical Compounds},
  author={Pan, Bo and Zhang, Zhiping and Spiekermann, Kevin and Chen, Tianchi and Yu, Xiang and Zhang, Liying and Zhao, Liang},
  journal={arXiv preprint arXiv:2601.07930},
  year={2026}
}
```