---
license: mit
---
# 1. Model Overview


- **Model Name:** MMPT-FM & its MMP variants
- **Summary:** MMPT-FM (Matched Molecular Pair Transformation Foundation Model) and its MMP (Matched Molecular Pair) variants – MMP-M2M (molecule-to-molecule), MMP-M2T (molecule-to-transformation), and MMP-C2V (constant-to-variable) – are generative foundation models designed to support medicinal chemistry analog design. The models learn from matched molecular pair transformations (MMPTs), i.e., context-independent variable-to-variable chemical modifications, or from matched molecular pairs (MMPs), both derived from large-scale matched molecular pair data (an illustrative example follows this list). This formulation enables scalable, interpretable, and generalizable encoding of medicinal chemistry intuition across diverse chemical series.
- **Model Specification:** Encoder–decoder Transformer. 220M parameters for each model.
- **Developed by:** Merck & Co., Inc. (Rahway, NJ, USA) and Emory University.
- **License:** MIT license.
- **Base Model:** ChemT5 (chemistry-domain pretrained T5).
- **Model Type:** Transformer
- **Languages:** SMILES & SMARTS (chemical structure and substructure representations)
- **Pipeline Tag:** text2text-generation for MMP transformation
- **Library:** Transformers, PyTorch
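
As an illustration of the MMP/MMPT formulation above, the snippet below shows a toy matched molecular pair and the corresponding variable-to-variable transformation. The strings are illustrative only; the exact serialization the models consume is defined in the publications and the GitHub repository.

```python
# Illustrative matched molecular pair (MMP) and transformation (MMPT).
# Toy example only; not the models' exact input/output serialization.
pair = ("Oc1ccccc1", "Fc1ccccc1")   # phenol -> fluorobenzene: an MMP differing by one small fragment
constant = "[*:1]c1ccccc1"          # shared (constant) part, attachment point marked as [*:1]
transformation = "[*:1]O>>[*:1]F"   # context-independent variable-to-variable edit (the MMPT)
```
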
# 2. Intended Use


- **Direct Use:**
  - **MMPT-FM:**
    - Generation of chemically valid matched molecular pair transformations (MMPTs)
    - Analog design at a user-specified edit site
  - **MMP-M2M:**
    - Generation of chemically valid matched molecular pairs (MMPs)
  - **MMP-M2T:**
    - Generation of chemically valid matched molecular pair transformations
    - Analog design at a user-specified edit site
  - **MMP-C2V:**
    - Analog design at a user-specified edit site
- **Downstream Use:**
  - **MMPT-FM:**
    - Integration into analog enumeration pipelines
    - Integration into high-throughput virtual screening pipelines
    - Serves as the base model for retrieval-augmented generation (MMPT-RAG)
  - **MMP-M2M:**
    - Integration into analog enumeration pipelines
    - Integration into high-throughput virtual screening pipelines
  - **MMP-M2T:**
    - Integration into analog enumeration pipelines
    - Integration into high-throughput virtual screening pipelines
  - **MMP-C2V:**
    - Integration into analog enumeration pipelines
    - Integration into high-throughput virtual screening pipelines
# 3. Bias, Risks, and Limitations


- **Known Limitations:** The models rely on the availability and coverage of large historical transformation datasets, and their performance may vary in underrepresented chemical domains.
- **Biases:** Inherits biases from ChEMBL-derived medicinal chemistry literature.
- **Risk Areas:** Our framework is intended for research use and does not introduce specific ethical concerns.
- **Recommendations:** None
# 4. Training Details


- **Training Data:** Raw data is downloaded from the ChEMBL database and is available at [https://chembl.gitbook.io/chembl-interface-documentation/downloads](https://chembl.gitbook.io/chembl-interface-documentation/downloads).
- **Training Data Preprocessing:**
  - Drug-likeness filtering using `rule_of_druglike_soft`
  - Molecular weight ≥ 200 Da
  - Removal of structural alerts using the curated Walters alert list
  - Data is processed with MMPDB, available at [https://github.com/rdkit/mmpdb](https://github.com/rdkit/mmpdb); a preprocessing sketch appears at the end of this section
- **Pre-Training:** Base model ChemT5 is available at [https://github.com/GT4SD/multitask_text_and_chemistry_t5](https://github.com/GT4SD/multitask_text_and_chemistry_t5).
- **Training Procedure** (a configuration sketch appears at the end of this section):
  - Supervised sequence-to-sequence learning
  - Cross-entropy loss
  - Batch size: 64
  - Learning rate: `5 × 10⁻⁴`
  - Hardware:
    - MMPT-FM: 4 × NVIDIA A6000 GPUs
    - MMP variants: 4 × NVIDIA H100 GPUs
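
A minimal sketch of the preprocessing listed above, assuming RDKit for the molecular-weight cut-off; the `rule_of_druglike_soft` and Walters structural-alert filters from the authors' pipeline are only indicated as placeholders, and the file names are hypothetical. The commented mmpdb commands follow that tool's fragment/index workflow (see its documentation for the exact options).

```python
# Hedged preprocessing sketch: only the MW >= 200 Da cut-off is implemented here;
# drug-likeness (rule_of_druglike_soft) and Walters alert filters are placeholders.
from rdkit import Chem
from rdkit.Chem import Descriptors

def passes_basic_filters(smiles: str) -> bool:
    """Keep molecules that parse and have molecular weight >= 200 Da."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        return False
    return Descriptors.MolWt(mol) >= 200.0

with open("chembl.smi") as src, open("filtered.smi", "w") as dst:  # hypothetical file names
    for line in src:
        if not line.strip():
            continue
        smiles = line.split()[0]
        if passes_basic_filters(smiles):
            dst.write(line)

# Matched molecular pairs are then extracted with mmpdb (https://github.com/rdkit/mmpdb),
# e.g., from the shell:
#   mmpdb fragment filtered.smi -o filtered.fragments
#   mmpdb index filtered.fragments -o chembl.mmpdb
```
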
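A hedged sketch of the fine-tuning configuration implied by the hyperparameters above, using the Transformers `Seq2SeqTrainingArguments`. The output path, the epoch count, and the per-device split of the reported batch size of 64 across four GPUs are assumptions; the cross-entropy sequence-to-sequence loss is the default when seq2seq models are trained with labeled targets.

```python
# Hedged fine-tuning configuration (reported: batch size 64, learning rate 5e-4).
# Output path, epoch count, and per-device batch split are assumptions.
from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="mmpt-fm-finetune",   # hypothetical output directory
    per_device_train_batch_size=16,  # 16 per GPU x 4 GPUs = reported batch size of 64
    learning_rate=5e-4,              # reported learning rate
    num_train_epochs=10,             # assumption; not reported above
    predict_with_generate=True,      # evaluate by generation, the usual seq2seq setup
)
```
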
# 5. Evaluation


- **Metrics** (a metric sketch appears at the end of this section):
  - Validity
  - Novelty (Novel/valid, Novel/all)
  - Recall (overall, in-training, out-of-training)
- **Benchmarks:**
  - Held-out ChEMBL MMPT test set (in-distribution)
  - Within-patent analog generation (PMV17)
  - Cross-patent analog generation (PMV17 → PMV21)
- **Testing Data:** Patent-derived datasets from PMV Pharmaceuticals (2017, 2021)
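
A minimal sketch of how the validity and novelty metrics above can be computed with RDKit: validity is the fraction of generated outputs that parse, and novelty is reported both over valid outputs and over all outputs. The exact metric definitions, and the recall splits, follow the publications.

```python
# Hedged metric sketch: validity = parseable fraction of generations;
# novelty = canonical SMILES not present in the training set.
from rdkit import Chem

def canonical(smiles: str):
    """Return the canonical SMILES, or None if the string does not parse."""
    mol = Chem.MolFromSmiles(smiles)
    return Chem.MolToSmiles(mol) if mol is not None else None

def validity_and_novelty(generated, training_smiles):
    training_canon = {canonical(s) for s in training_smiles}
    canon = [canonical(s) for s in generated]
    valid = [c for c in canon if c is not None]
    novel = [c for c in valid if c not in training_canon]
    n = len(generated)
    return {
        "validity": len(valid) / n if n else 0.0,
        "novel_over_valid": len(novel) / len(valid) if valid else 0.0,
        "novel_over_all": len(novel) / n if n else 0.0,
    }
```
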
# 6. Usage


- **Sample Inference Code:** Described conceptually in the publications; code for variable-to-variable generation with beam search can be found in our GitHub repository: [https://github.com/MSDLLCpapers/MMPTTransformer](https://github.com/MSDLLCpapers/MMPTTransformer). A hedged inference sketch appears at the end of this section.
- **GitHub Links:** [https://github.com/MSDLLCpapers/MMPTTransformer](https://github.com/MSDLLCpapers/MMPTTransformer)
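
A hedged inference sketch for variable-to-variable generation with beam search using the Transformers API. The checkpoint identifier and the input-fragment formatting are assumptions; the repository above documents the exact prompt format and generation settings.

```python
# Hedged inference sketch; checkpoint path and input formatting are assumptions.
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_name = "MSDLLCpapers/MMPT-FM"  # hypothetical identifier; substitute the released checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)  # encoder-decoder Transformer, ~220M parameters

source_fragment = "[*:1]c1ccccc1O"  # illustrative variable fragment with one attachment point
inputs = tokenizer(source_fragment, return_tensors="pt")
outputs = model.generate(
    **inputs,
    num_beams=10,             # beam search, as referenced in the Usage section
    num_return_sequences=10,  # return the top beams as candidate replacement fragments
    max_length=128,
)
candidates = tokenizer.batch_decode(outputs, skip_special_tokens=True)
print(candidates)
```
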
# 7. Citation


**BibTeX:**


```bibtex
@article{pang2026scalable,
  title={Scalable and Generalizable Analog Design via Learning Medicinal Chemistry Intuition from Matched Molecular Pair Transformations},
  author={Pang, Hao-Wei and Zhang, Peter Zhiping and Pan, Bo and Zhao, Liang and Yu, Xiang and Zhang, Liying},
  journal={ChemRxiv},
  doi={10.26434/chemrxiv.15001722},
  year={2026}
}

@article{pan2026retrieval,
  title={Retrieval-Augmented Foundation Models for Matched Molecular Pair Transformations to Recapitulate Medicinal Chemistry Intuition},
  author={Pan, Bo and Zhang, Peter Zhiping and Pang, Hao-Wei and Zhu, Alex and Yu, Xiang and Zhang, Liying and Zhao, Liang},
  journal={arXiv preprint arXiv:2602.16684},
  year={2026}
}

@article{pan2026transformer,
  title={Transformer-Based Approach for Automated Functional Group Replacement in Chemical Compounds},
  author={Pan, Bo and Zhang, Zhiping and Spiekermann, Kevin and Chen, Tianchi and Yu, Xiang and Zhang, Liying and Zhao, Liang},
  journal={arXiv preprint arXiv:2601.07930},
  year={2026}
}
```