Commit 2b3129b (verified) by model-ingest · 1 parent: 2cc99b8

Upload README.md

Files changed (1)
  1. README.md +84 -0
README.md ADDED
@@ -0,0 +1,84 @@
+ ---
+ license: mit
+ ---
+
+ ## 1. Model Overview
+ - **Model Name:** MMPT-FM (Matched Molecular Pair Transformation Foundation Model)
+ - **Summary:** MMPT-FM is a transformation-centric generative foundation model designed to support medicinal chemistry analog design. The model learns from matched molecular pair transformations (MMPTs), i.e., context-independent variable-to-variable chemical modifications derived from large-scale matched molecular pair data. This formulation enables scalable, interpretable, and generalizable encoding of medicinal chemistry intuition across diverse chemical series.
+ - **Model Specification:** Encoder–decoder Transformer. 220M parameters.
+ - **Developed by:** Merck & Co., Inc. (Rahway, NJ, USA) and Emory University.
+ - **License:** MIT license.
+ - **Base Model:** ChemT5 (chemistry-domain pretrained T5).
+ - **Model Type:** Transformer
+ - **Languages:** SMARTS (chemical substructure representation)
+ - **Pipeline Tag:** text2text-generation for MMP transformation
+ - **Library:** Transformers, PyTorch
+
+ ## 2. Intended Use
+ - **Direct Use:**
+   - Generation of chemically valid **matched molecular pair transformations (MMPTs)**
+   - Analog design at a **user-specified edit site** (R-group substitution or core hopping)
+ - **Downstream Use:**
+   - Integration into analog enumeration pipelines
+   - Retrieval-augmented generation (MMPT-RAG) to bias suggestions toward project- or series-specific chemistry
+
+ ## 3. Bias, Risks, and Limitations
+ - **Known Limitations:** The model relies on the availability and coverage of large historical transformation datasets, and its performance may vary in underrepresented chemical domains.
+ - **Biases:** The model inherits biases from the ChEMBL-derived medicinal chemistry literature.
+ - **Risk Areas:** Our framework is intended for research use and does not introduce specific ethical concerns.
+
+ ## 4. Training Details
+ - **Training Data:** Raw data were downloaded from the ChEMBL database, available at https://chembl.gitbook.io/chembl-interface-documentation/downloads.
+ - **Training Data Preprocessing:**
+   - Drug-likeness filtering using *rule_of_druglike_soft*
+   - Molecular weight ≥ 200 Da
+   - Removal of structural alerts using the curated Walters alert list
+   - Processing with MMPDB, available at https://github.com/rdkit/mmpdb
+ - **Pre-Training:** The base model, ChemT5, is available at https://github.com/GT4SD/multitask_text_and_chemistry_t5.
+ - **Training Procedure:**
+   - Supervised sequence-to-sequence learning with teacher forcing
+   - Cross-entropy loss
+   - Batch size: 64
+   - Learning rate: 5 × 10⁻⁴
+   - Hardware:
+     - MMPT-FM: 4 × NVIDIA A6000 GPUs
+     - MMP-based baselines: 4 × NVIDIA H100 GPUs
+
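
The training objective named above (teacher forcing with a cross-entropy loss) can be illustrated with a minimal, framework-free sketch. This is not the authors' training code: the function names `shift_right`, `log_softmax`, and `teacher_forced_cross_entropy`, the `pad_id` convention, and the use of plain Python lists in place of tensors are all assumptions made for illustration.

```python
import math


def shift_right(target_ids, start_id):
    """Build decoder inputs for teacher forcing (hypothetical helper).

    The decoder sees the reference tokens shifted right by one position,
    so step t is conditioned on the true prefix, not on model predictions.
    """
    return [start_id] + list(target_ids[:-1])


def log_softmax(logits):
    """Numerically stable log-softmax over one vocabulary-sized score list."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def teacher_forced_cross_entropy(step_logits, target_ids, pad_id=0):
    """Mean negative log-likelihood of the reference tokens.

    step_logits[t] is assumed to be the decoder's scores at step t when fed
    shift_right(target_ids, ...). Positions whose target equals pad_id
    (an assumed padding convention) are excluded from the average.
    """
    total, count = 0.0, 0
    for logits, target in zip(step_logits, target_ids):
        if target == pad_id:
            continue
        total -= log_softmax(logits)[target]
        count += 1
    return total / count if count else 0.0
```

With uniform scores over a vocabulary of size V, this loss equals log V, which is a convenient sanity check for an untrained model.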
+ ## 5. Evaluation
+ - **Metrics:**
+   - Validity
+   - Novelty (Novel/valid, Novel/all)
+   - Recall (overall, in-training, out-of-training)
+ - **Benchmarks:**
+   - Held-out ChEMBL MMPT test set (in-distribution)
+   - Within-patent analog generation (PMV17)
+   - Cross-patent analog generation (PMV17 → PMV21)
+ - **Testing Data:** Patent-derived datasets from PMV Pharmaceuticals (2017, 2021)
+
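
The metrics listed above are not defined in this card, so the sketch below fixes one plausible reading and should be treated as an assumption rather than the authors' evaluation code. It assumes validity is the fraction of generated strings accepted by a caller-supplied `is_valid` check, novelty counts valid outputs absent from the training set (reported relative to valid outputs and to all outputs), and recall is the fraction of reference transformations recovered, split by whether each reference also appears in the training set. The function name `generation_metrics` and the dictionary keys are hypothetical.

```python
def _ratio(numerator, denominator):
    """Return numerator / denominator, or 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0


def generation_metrics(generated, references, training_set, is_valid):
    """Compute validity, novelty, and recall under the assumed definitions.

    generated:    list of generated transformation strings (duplicates kept)
    references:   ground-truth transformations for the test case
    training_set: transformations seen during training
    is_valid:     caller-supplied predicate, e.g. a cheminformatics parser
    """
    generated = list(generated)
    references = set(references)
    training_set = set(training_set)

    valid = [g for g in generated if is_valid(g)]
    novel = [g for g in valid if g not in training_set]
    produced = set(valid)

    def recall(subset):
        # Fraction of the given reference subset that was generated.
        return _ratio(len(subset & produced), len(subset))

    return {
        "validity": _ratio(len(valid), len(generated)),
        "novel/valid": _ratio(len(novel), len(valid)),
        "novel/all": _ratio(len(novel), len(generated)),
        "recall": recall(references),
        "recall_in_training": recall(references & training_set),
        "recall_out_of_training": recall(references - training_set),
    }
```

Passing the validity check as a parameter keeps the sketch independent of any particular chemistry toolkit; in practice it would wrap whatever parser the pipeline already uses.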
+ ## 6. Usage
+ - **Sample Inference Code:** Inference is described conceptually in the publications listed under Citation. Code for variable-to-variable generation with beam search is available in our GitHub repository: https://github.com/MSDLLCpapers/MMPTTransformer.
+
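
For readers unfamiliar with beam search, the decoding strategy mentioned above, here is a minimal generic sketch. It is not taken from the MMPTTransformer repository: `beam_search`, `toy_next_log_probs`, and the `</s>` end token are hypothetical, and the toy scoring function stands in for a real decoder. With Hugging Face Transformers, beam search is normally requested through `generate` arguments such as `num_beams` and `num_return_sequences` rather than written by hand.

```python
import math


def beam_search(next_log_probs, eos, beam_width=3, max_len=10):
    """Return up to beam_width (log_prob, tokens) pairs, best first.

    next_log_probs(prefix) must return a dict mapping each candidate next
    token to its log-probability given the tuple of tokens so far.
    """
    beams = [(0.0, ())]  # (cumulative log-probability, token tuple)
    finished = []
    for _ in range(max_len):
        candidates = []
        for score, seq in beams:
            for token, logp in next_log_probs(seq).items():
                candidates.append((score + logp, seq + (token,)))
        candidates.sort(key=lambda c: c[0], reverse=True)
        beams = []
        # Keep only the best beam_width extensions; set aside completed ones.
        for score, seq in candidates[:beam_width]:
            (finished if seq[-1] == eos else beams).append((score, seq))
        if not beams:
            break
    finished.extend(beams)  # hypotheses cut off at max_len without eos
    return sorted(finished, key=lambda c: c[0], reverse=True)[:beam_width]


def toy_next_log_probs(prefix):
    """Stand-in for a decoder: a tiny hand-written next-token distribution."""
    if not prefix:
        return {"A": math.log(0.6), "B": math.log(0.4)}
    if prefix[-1] == "A":
        return {"</s>": math.log(0.7), "B": math.log(0.3)}
    return {"</s>": math.log(0.9), "A": math.log(0.1)}
```

Calling `beam_search(toy_next_log_probs, eos="</s>", beam_width=2)` returns the two highest-scoring completed sequences; returning several ranked hypotheses instead of a single greedy output is what lets a generator propose multiple candidate transformations per query.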
+ ## 7. Citation
+ ```bibtex
+ @misc{pan2026retrievalaugmentedfoundationmodelsmatched,
+   title={Retrieval-Augmented Foundation Models for Matched Molecular Pair Transformations to Recapitulate Medicinal Chemistry Intuition},
+   author={Bo Pan and Peter Zhiping Zhang and Hao-Wei Pang and Alex Zhu and Xiang Yu and Liying Zhang and Liang Zhao},
+   year={2026},
+   eprint={2602.16684},
+   archivePrefix={arXiv},
+   primaryClass={cs.LG},
+   url={https://arxiv.org/abs/2602.16684},
+ }
+ @article{doi:10.26434/chemrxiv.15001722/v1,
+   author = {Hao-Wei Pang and Peter Zhiping Zhang and Bo Pan and Liang Zhao and Xiang Yu and Liying Zhang},
+   title = {Scalable and Generalizable Analog Design via Learning Medicinal Chemistry Intuition from Matched Molecular Pair Transformations},
+   journal = {ChemRxiv},
+   volume = {2026},
+   number = {0407},
+   pages = {},
+   year = {2026},
+   doi = {10.26434/chemrxiv.15001722/v1},
+   URL = {https://chemrxiv.org/doi/abs/10.26434/chemrxiv.15001722/v1},
+   eprint = {https://chemrxiv.org/doi/pdf/10.26434/chemrxiv.15001722/v1},
+ }
+ ```