model-ingest committed on
Commit f87bed3 · verified · 1 Parent(s): 43cdbee

Update README.md

Files changed (1)
  1. README.md +84 -45
README.md CHANGED
@@ -1,50 +1,82 @@
1
  ---
2
  license: mit
3
  ---
 
4
 
5
- ## 1. Model Overview
6
- - **Model Name:** MMPT-FM (Matched Molecular Pair Transformation Foundation Model)
7
- - **Summary:** MMPT-FM is a transformation-centric generative foundation model designed to support medicinal chemistry analog design. The model learns from matched molecular pair transformations (MMPTs), i.e., context-independent variable-to-variable chemical modifications derived from large-scale matched molecular pair data. This formulation enables scalable, interpretable, and generalizable encoding of medicinal chemistry intuition across diverse chemical series.
8
- - **Model Specification:** Encoder–decoder Transformer. 220M parameters.
9
  - **Developed by:** Merck & Co., Inc. (Rahway, NJ, USA) and Emory University.
10
  - **License:** MIT license.
11
  - **Base Model:** ChemT5 (chemistry-domain pretrained T5).
12
  - **Model Type:** Transformer
13
- - **Languages:** SMARTS (chemical substructure representation)
14
  - **Pipeline Tag:** text2text-generation for MMP transformation
15
  - **Library:** Transformers, PyTorch
16
 
17
- ## 2. Intended Use
18
  - **Direct Use:**
19
- - Generation of chemically valid **matched molecular pair transformations (MMPTs)**.
20
- - Analog design at a **user-specified edit site** (R-group substitution or core hopping)
21
  - **Downstream Use:**
22
- - Integration into analog enumeration pipelines
23
- - Retrieval-augmented generation (MMPT-RAG) to bias suggestions toward project- or series-specific chemistry
24
 
25
- ## 3. Bias, Risks, and Limitations
26
- - **Known Limitations:** The model relies on the availability and coverage of large historical transformation datasets, and its performance may vary in underrepresented chemical domains.
27
  - **Biases:** Inherits biases from ChEMBL-derived medicinal chemistry literature.
28
- - **Risk Areas:** Our framework is intended for research use, and does not introduce specific ethical concerns.
 
29
 
30
- ## 4. Training Details
31
- - **Training Data:** Raw data is downloaded from ChEMBL database and available at https://chembl.gitbook.io/chembl-interface-documentation/downloads.
32
  - **Training Data Preprocessing:**
33
- - Drug-likeness filtering using *rule_of_druglike_soft*
34
  - Molecular weight ≥ 200 Da
35
  - Removal of structural alerts using the curated Walters alert list
36
- - Data is processed with MMPDB that is available at https://github.com/rdkit/mmpdb.
37
- - **Pre-Training:** Base model ChemT5 is available at https://github.com/GT4SD/multitask_text_and_chemistry_t5.
38
  - **Training Procedure:**
39
- - Supervised sequence-to-sequence learning with teacher forcing
40
  - Cross-entropy loss
41
  - Batch size: 64
42
- - Learning rate: 5 × 10⁻⁴
43
  - Hardware:
44
  - MMPT-FM: 4 × NVIDIA A6000 GPUs
45
- - MMP-based baselines: 4 × NVIDIA H100 GPUs
46
 
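The training procedure above names a cross-entropy loss for supervised sequence-to-sequence learning. As an illustration only, the sketch below computes that loss for one target sequence in plain Python. It is not the authors' training code: the function names `softmax` and `teacher_forced_cross_entropy` are hypothetical, and the logits are hand-written stand-ins for decoder outputs obtained with the reference prefix as decoder input (teacher forcing).

```python
import math


def softmax(logits):
    """Convert raw scores to probabilities (shifted by the max for stability)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def teacher_forced_cross_entropy(step_logits, target_ids):
    """Mean negative log-likelihood of the reference tokens.

    step_logits[t] is the score vector over the vocabulary at decoding step t,
    assumed to be computed with the reference prefix target_ids[:t] as input.
    target_ids[t] is the index of the reference token at step t.
    """
    if len(step_logits) != len(target_ids):
        raise ValueError("need exactly one logit vector per target token")
    if not target_ids:
        raise ValueError("target sequence must not be empty")
    losses = []
    for logits, gold in zip(step_logits, target_ids):
        probs = softmax(logits)
        losses.append(-math.log(probs[gold]))
    return sum(losses) / len(losses)


# Toy example: a 4-token vocabulary and a 2-token target sequence.
# Uniform scores give each token probability 1/4 at every step.
uniform_loss = teacher_forced_cross_entropy(
    [[0.0, 0.0, 0.0, 0.0], [1.0, 1.0, 1.0, 1.0]], [2, 0]
)
```

With uniform scores the loss equals the natural log of the vocabulary size, which is a convenient sanity check for an untrained model.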
47
- ## 5. Evaluation
48
  - **Metrics:**
49
  - Validity
50
  - Novelty (Novel/valid, Novel/all)
@@ -55,30 +87,37 @@ license: mit
55
  - Cross-patent analog generation (PMV17 → PMV21)
56
  - **Testing Data:** Patent-derived datasets from PMV Pharmaceuticals (2017, 2021)
57
 
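The README lists Validity and Novelty (Novel/valid, Novel/all) without defining them. One plausible reading, stated here as an assumption rather than the authors' definition, is that validity is the fraction of generated strings that pass a chemical validity check, and that a valid output is novel when it does not appear in a reference set such as the training data. The helper `generation_metrics` and its `is_valid` callback are hypothetical; in practice the callback would wrap a cheminformatics parser.

```python
def generation_metrics(generated, reference_set, is_valid):
    """Compute validity and novelty ratios for a list of generated strings.

    generated:     list of generated strings (duplicates are counted as-is)
    reference_set: set of strings considered already known (assumption)
    is_valid:      caller-supplied predicate returning True for valid strings
    """
    total = len(generated)
    valid = [g for g in generated if is_valid(g)]
    novel = [g for g in valid if g not in reference_set]
    return {
        "validity": len(valid) / total if total else 0.0,
        "novel_over_valid": len(novel) / len(valid) if valid else 0.0,
        "novel_over_all": len(novel) / total if total else 0.0,
    }


# Toy data: balanced parentheses stand in for a real validity check.
toy_outputs = ["CCO", "CCN", "bad(", "CCO"]
toy_known = {"CCO"}
toy_metrics = generation_metrics(
    toy_outputs, toy_known, lambda s: s.count("(") == s.count(")")
)
```

Here three of four outputs pass the toy check and one of them is absent from the known set, so Novel/valid and Novel/all differ only in their denominators.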
58
- ## 6. Usage
59
- - **Sample Inference Code:** Described conceptually in the publications ; code corresponds to variable-to-variable generation with beam search can be found at our GitHub repository: https://github.com/MSDLLCpapers/MMPTTransformer.
60
 
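The Usage entry mentions variable-to-variable generation with beam search. The sketch below shows the general beam search idea on a hand-written toy model. It is not the code from the linked repository: `beam_search`, `toy_model`, and the token names are hypothetical, and the variant shown uses no length normalization. With Hugging Face Transformers, beam search is normally requested through the `num_beams` argument of `generate` instead of being written by hand.

```python
import math


def beam_search(step_log_probs, start_token, end_token, beam_width=3, max_len=10):
    """Return up to beam_width (score, sequence) pairs, best first.

    step_log_probs(prefix) must return a dict mapping each candidate next
    token to its log-probability given the prefix (a list of tokens).
    """
    beams = [(0.0, [start_token])]
    finished = []
    for _ in range(max_len):
        # Expand every live hypothesis by one token.
        candidates = []
        for score, seq in beams:
            for token, logp in step_log_probs(seq).items():
                candidates.append((score + logp, seq + [token]))
        if not candidates:
            break
        # Keep only the beam_width highest-scoring expansions.
        candidates.sort(key=lambda c: c[0], reverse=True)
        beams = []
        for score, seq in candidates[:beam_width]:
            if seq[-1] == end_token:
                finished.append((score, seq))
            else:
                beams.append((score, seq))
        if not beams:
            break
    finished.extend(beams)  # keep unfinished hypotheses if max_len was reached
    finished.sort(key=lambda c: c[0], reverse=True)
    return finished[:beam_width]


def toy_model(prefix):
    """Hand-written next-token distribution that depends only on the last token."""
    table = {
        "<s>": {"A": math.log(0.6), "B": math.log(0.4)},
        "A": {"</s>": math.log(0.7), "B": math.log(0.3)},
        "B": {"</s>": math.log(0.9), "A": math.log(0.1)},
    }
    return table[prefix[-1]]


results = beam_search(toy_model, "<s>", "</s>", beam_width=2, max_len=4)
best_score, best_sequence = results[0]
```

Keeping several hypotheses alive is what lets the search return multiple ranked candidates per input, which suits analog design where a list of suggestions is wanted.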
61
- ## 7. Citation
62
  ```bibtex
63
- @misc{pan2026retrievalaugmentedfoundationmodelsmatched,
64
- title={Retrieval-Augmented Foundation Models for Matched Molecular Pair Transformations to Recapitulate Medicinal Chemistry Intuition},
65
- author={Bo Pan and Peter Zhiping Zhang and Hao-Wei Pang and Alex Zhu and Xiang Yu and Liying Zhang and Liang Zhao},
66
- year={2026},
67
- eprint={2602.16684},
68
- archivePrefix={arXiv},
69
- primaryClass={cs.LG},
70
- url={https://arxiv.org/abs/2602.16684},
71
  }
72
- @article{
73
- doi:10.26434/chemrxiv.15001722/v1,
74
- author = {Hao-Wei Pang and Peter Zhiping Zhang and Bo Pan and Liang Zhao and Xiang Yu and Liying Zhang },
75
- title = {Scalable and Generalizable Analog Design via Learning Medicinal Chemistry Intuition from Matched Molecular Pair Transformations},
76
- journal = {ChemRxiv},
77
- volume = {2026},
78
- number = {0407},
79
- pages = {},
80
- year = {2026},
81
- doi = {10.26434/chemrxiv.15001722/v1},
82
- URL = {https://chemrxiv.org/doi/abs/10.26434/chemrxiv.15001722/v1},
83
- eprint = {https://chemrxiv.org/doi/pdf/10.26434/chemrxiv.15001722/v1},
 
84
  }
 
 
1
  ---
2
  license: mit
3
  ---
4
+ # 1. Model Overview
5
 
6
+ - **Model Name:** MMPT-FM & its MMP variants
7
+ - **Summary:** MMPT-FM (Matched Molecular Pair Transformation Foundation Model) and its MMP (Matched Molecular Pair) variants, MMP-M2M (molecule-to-molecule), MMP-M2T (molecule-to-transformation), and MMP-C2V (constant-to-variable), are generative foundation models designed to support medicinal chemistry analog design. The models learn from matched molecular pair transformations (MMPTs), i.e., context-independent variable-to-variable chemical modifications, or from matched molecular pairs (MMPs), both derived from large-scale matched molecular pair data. This formulation enables scalable, interpretable, and generalizable encoding of medicinal chemistry intuition across diverse chemical series.
8
+ - **Model Specification:** Encoder–decoder Transformer. 220M parameters for each model.
 
9
  - **Developed by:** Merck & Co., Inc. (Rahway, NJ, USA) and Emory University.
10
  - **License:** MIT license.
11
  - **Base Model:** ChemT5 (chemistry-domain pretrained T5).
12
  - **Model Type:** Transformer
13
+ - **Languages:** SMARTS & SMILES (chemical substructure and structure notations)
14
  - **Pipeline Tag:** text2text-generation for MMP transformation
15
  - **Library:** Transformers, PyTorch
16
 
17
+ ---
18
+
19
+ # 2. Intended Use
20
+
21
  - **Direct Use:**
22
+ - **MMPT-FM:**
23
+ - Generation of chemically valid matched molecular pair transformations (MMPTs)
24
+ - Analog design at a user-specified edit site.
25
+ - **MMP-M2M:**
26
+ - Generation of chemically valid matched molecular pairs (MMPs)
27
+ - **MMP-M2T:**
28
+ - Generation of chemically valid matched molecular pair transformations
29
+ - Analog design at a user-specified edit site
30
+ - **MMP-C2V:**
31
+ - Analog design at a user-specified edit site
32
  - **Downstream Use:**
33
+ - **MMPT-FM:**
34
+ - Integration into analog enumeration pipelines
35
+ - Integration into high-throughput virtual screening pipelines
36
+ - Serving as the base model for retrieval-augmented generation (MMPT-RAG)
37
+ - **MMP-M2M:**
38
+ - Integration into analog enumeration pipelines
39
+ - Integration into high-throughput virtual screening pipelines
40
+ - **MMP-M2T:**
41
+ - Integration into analog enumeration pipelines
42
+ - Integration into high-throughput virtual screening pipelines
43
+ - **MMP-C2V:**
44
+ - Integration into analog enumeration pipelines
45
+ - Integration into high-throughput virtual screening pipelines
46
+
47
+ ---
48
+
49
+ # 3. Bias, Risks, and Limitations
50
 
51
+ - **Known Limitations:** The models rely on the availability and coverage of large historical transformation datasets, and their performance may vary in underrepresented chemical domains.
 
52
  - **Biases:** Inherits biases from ChEMBL-derived medicinal chemistry literature.
53
+ - **Risk Areas:** Our framework is intended for research use and does not introduce specific ethical concerns.
54
+ - **Recommendations:** None
55
 
56
+ ---
57
+
58
+ # 4. Training Details
59
+
60
+ - **Training Data:** Raw data is downloaded from the ChEMBL database, which is available at [https://chembl.gitbook.io/chembl-interface-documentation/downloads](https://chembl.gitbook.io/chembl-interface-documentation/downloads).
61
  - **Training Data Preprocessing:**
62
+ - Drug-likeness filtering using `rule_of_druglike_soft`
63
  - Molecular weight ≥ 200 Da
64
  - Removal of structural alerts using the curated Walters alert list
65
+ - Data is processed with MMPDB, which is available at [https://github.com/rdkit/mmpdb](https://github.com/rdkit/mmpdb).
66
+ - **Pre-Training:** Base model ChemT5 is available at [https://github.com/GT4SD/multitask_text_and_chemistry_t5](https://github.com/GT4SD/multitask_text_and_chemistry_t5).
67
  - **Training Procedure:**
68
+ - Supervised sequence-to-sequence learning
69
  - Cross-entropy loss
70
  - Batch size: 64
71
+ - Learning rate: `5 × 10⁻⁴`
72
  - Hardware:
73
  - MMPT-FM: 4 × NVIDIA A6000 GPUs
74
+ - MMP variants: 4 × NVIDIA H100 GPUs
75
+
76
+ ---
77
+
78
+ # 5. Evaluation
79
 
 
80
  - **Metrics:**
81
  - Validity
82
  - Novelty (Novel/valid, Novel/all)
 
87
  - Cross-patent analog generation (PMV17 → PMV21)
88
  - **Testing Data:** Patent-derived datasets from PMV Pharmaceuticals (2017, 2021)
89
 
90
+ ---
91
+
92
+ # 6. Usage
93
+
94
+ - **Sample Inference Code:** Described conceptually in the publications; code for variable-to-variable generation with beam search can be found in our GitHub repository: [https://github.com/MSDLLCpapers/MMPTTransformer](https://github.com/MSDLLCpapers/MMPTTransformer).
95
+ - **GitHub Links:** [https://github.com/MSDLLCpapers/MMPTTransformer](https://github.com/MSDLLCpapers/MMPTTransformer)
96
+
97
+ ---
98
+
99
+ # 7. Citation
100
+
101
+ **BibTeX:**
102
 
 
103
  ```bibtex
104
+ @article{pang2026scalable,
105
+ title={Scalable and Generalizable Analog Design via Learning Medicinal Chemistry Intuition from Matched Molecular Pair Transformations},
106
+ author={Pang, Hao-Wei and Zhang, Peter Zhiping and Pan, Bo and Zhao, Liang and Yu, Xiang and Zhang, Liying},
107
+ year={2026}
108
  }
109
+
110
+ @article{pan2026retrieval,
111
+ title={Retrieval-Augmented Foundation Models for Matched Molecular Pair Transformations to Recapitulate Medicinal Chemistry Intuition},
112
+ author={Pan, Bo and Zhang, Peter Zhiping and Pang, Hao-Wei and Zhu, Alex and Yu, Xiang and Zhang, Liying and Zhao, Liang},
113
+ journal={arXiv preprint arXiv:2602.16684},
114
+ year={2026}
115
+ }
116
+
117
+ @article{pan2026transformer,
118
+ title={Transformer-Based Approach for Automated Functional Group Replacement in Chemical Compounds},
119
+ author={Pan, Bo and Zhang, Zhiping and Spiekermann, Kevin and Chen, Tianchi and Yu, Xiang and Zhang, Liying and Zhao, Liang},
120
+ journal={arXiv preprint arXiv:2601.07930},
121
+ year={2026}
122
  }
123
+ ```