Update to CREMP pretrained model with new performance results

Files changed:
- .gitattributes (+1, -0)
- README.md (+26, -8)
- assets/tsne_permeability_splits.png (+3, -0)
- config.json (+1, -1)
- model.safetensors (+2, -2)
.gitattributes CHANGED

@@ -35,3 +35,4 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
 *tfevents* filter=lfs diff=lfs merge=lfs -text
 assets/HELM-BERT.png filter=lfs diff=lfs merge=lfs -text
 assets/tsne_ppi_splits.png filter=lfs diff=lfs merge=lfs -text
+assets/tsne_permeability_splits.png filter=lfs diff=lfs merge=lfs -text
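The added `.gitattributes` line routes the new PNG through Git LFS (LFS filter/diff/merge drivers, with the `text` attribute unset so Git never treats it as text). As a minimal illustration of how such a line decomposes, here is a small parser; the helper name `parse_gitattributes_line` is mine, not part of Git:

```python
# Hypothetical helper: split a .gitattributes line into its path pattern and
# attribute settings. Note: a leading "-" (as in "-text") unsets an attribute
# in Git's semantics; this sketch just records the token verbatim.
def parse_gitattributes_line(line):
    pattern, *attrs = line.split()
    parsed = {}
    for a in attrs:
        if "=" in a:
            key, value = a.split("=", 1)
            parsed[key] = value
        else:
            parsed[a] = True
    return pattern, parsed

pattern, attrs = parse_gitattributes_line(
    "assets/tsne_permeability_splits.png filter=lfs diff=lfs merge=lfs -text"
)
```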
README.md CHANGED

@@ -22,7 +22,7 @@ A peptide language model using **HELM (Hierarchical Editing Language for Macromo
 
 ## Model Description
 
-HELM-BERT is built upon the DeBERTa architecture,
+HELM-BERT is built upon the DeBERTa architecture, pre-trained on ~75k peptides from four databases (ChEMBL, CREMP, CycPeptMPDB, Propedia) using **Masked Language Modeling (MLM)** with a **Warmup-Stable-Decay (WSD)** learning rate schedule.
 
 - **Disentangled Attention**: Decomposes attention into content-content and content-position terms
 - **Enhanced Mask Decoder (EMD)**: Injects absolute position embeddings at the decoder stage
@@ -41,6 +41,9 @@ HELM-BERT is built upon the DeBERTa architecture, designed for peptide sequences
 | Attention heads | 12 |
 | Vocab size | 78 |
 | Max token length | 512 |
+| Pre-training data | ~75k peptides (ChEMBL, CREMP, CycPeptMPDB, Propedia) |
+| Pre-training objective | MLM (span masking, p=0.15) |
+| LR schedule | Warmup-Stable-Decay (WSD) |
 
 ## How to Use
 
@@ -58,8 +61,9 @@ embeddings = outputs.last_hidden_state
 
 ## Training Data
 
-Pretrained on deduplicated peptide sequences from:
+Pre-trained on deduplicated peptide sequences from:
 - **ChEMBL**: Bioactive molecules database
+- **CREMP**: Cyclic peptide conformational ensemble database
 - **CycPeptMPDB**: Cyclic peptide membrane permeability database
 - **Propedia**: Protein-peptide interaction database
 
@@ -67,18 +71,32 @@ Pretrained on deduplicated peptide sequences from:
 
 ### Permeability Regression (CycPeptMPDB)
 
-| R² | Pearson | RMSE | MAE |
-|:--:|:-------:|:----:|:---:|
-| 0.759 | 0.872 | 0.383 | 0.277 |
+**Single-Assay** (mixed PAMPA/Caco-2 target):
+
+| Split | R² | Pearson | RMSE | MAE |
+|:-----:|:--:|:-------:|:----:|:---:|
+| Random | 0.751 | 0.867 | 0.398 | 0.263 |
+| Scaffold | 0.655 | 0.821 | 0.398 | 0.305 |
+
+**Multi-Assay** (separate PAMPA and Caco-2 heads):
+
+| Split | Assay | R² | Pearson | RMSE | MAE |
+|:-----:|:-----:|:--:|:-------:|:----:|:---:|
+| Random | PAMPA | 0.740 | 0.862 | 0.399 | 0.281 |
+| Random | Caco-2 | 0.694 | 0.833 | 0.412 | 0.274 |
+| Scaffold | PAMPA | 0.629 | 0.815 | 0.406 | 0.317 |
+| Scaffold | Caco-2 | 0.625 | 0.822 | 0.426 | 0.316 |
 
-
+Train/test 9:1, val 10% from train. Scaffold split by Murcko scaffolds.
+
+<p align="center"><img src="assets/tsne_permeability_splits.png" width="800"></p>
 
 ### PPI Classification (Propedia v2)
 
 | Split | ROC-AUC | PR-AUC | F1 | MCC | Balanced Acc |
 |:-----:|:-------:|:------:|:--:|:---:|:------------:|
-| Random | 0.
-| aCSM | 0.
+| Random | 0.972 | 0.913 | 0.855 | 0.819 | 0.909 |
+| aCSM | 0.870 | 0.701 | 0.604 | 0.547 | 0.731 |
 
 Train/test 8:2, val 10% from train, 1:4 positive:negative ratio.
 - **Random**: random split
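The README changes record a Warmup-Stable-Decay (WSD) learning-rate schedule for pre-training. A minimal sketch of the three-phase shape (linear warmup, constant plateau, linear decay to zero) follows; the phase fractions and peak LR here are illustrative assumptions, not the values used to train HELM-BERT:

```python
# Warmup-Stable-Decay (WSD) schedule sketch. The warmup/decay fractions and
# peak LR are placeholders, not HELM-BERT's actual hyperparameters.
def wsd_lr(step, total_steps, peak_lr, warmup_frac=0.1, decay_frac=0.1):
    warmup_steps = int(total_steps * warmup_frac)
    decay_steps = int(total_steps * decay_frac)
    stable_end = total_steps - decay_steps
    if step < warmup_steps:
        # Phase 1: linear warmup from 0 to peak_lr
        return peak_lr * step / max(1, warmup_steps)
    if step < stable_end:
        # Phase 2: hold at peak_lr
        return peak_lr
    # Phase 3: linear decay from peak_lr to 0
    return peak_lr * max(0, total_steps - step) / max(1, decay_steps)
```

Unlike cosine schedules, the stable phase means training can be extended or checkpoint-branched before the decay begins, which is the usual motivation for WSD.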
assets/tsne_permeability_splits.png ADDED

(Binary file, stored via Git LFS)
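The permeability tables in the README diff report R², Pearson r, RMSE, and MAE. For reference, here is a pure-Python sketch of those four metrics (the function name is mine; in practice scikit-learn or scipy would be used):

```python
import math

# Reference implementations of the regression metrics shown in the README
# tables: R^2 (coefficient of determination), Pearson r, RMSE, and MAE.
def regression_metrics(y_true, y_pred):
    n = len(y_true)
    mean_t = sum(y_true) / n
    mean_p = sum(y_pred) / n
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_t) ** 2 for t in y_true)
    cov = sum((t - mean_t) * (p - mean_p) for t, p in zip(y_true, y_pred))
    var_p = sum((p - mean_p) ** 2 for p in y_pred)
    return {
        "r2": 1.0 - ss_res / ss_tot,
        "pearson": cov / math.sqrt(ss_tot * var_p),
        "rmse": math.sqrt(ss_res / n),
        "mae": sum(abs(t - p) for t, p in zip(y_true, y_pred)) / n,
    }
```

Note that R² and Pearson r answer different questions (calibrated error vs. linear correlation), which is why both appear in the tables.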
config.json CHANGED

@@ -31,6 +31,6 @@
   "position_buckets": 256,
   "sep_token_id": 2,
   "share_att_key": false,
-  "transformers_version": "
+  "transformers_version": "4.57.6",
   "vocab_size": 78
 }
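`config.json` now records the `transformers` version the checkpoint was saved with (4.57.6). A minimal pure-stdlib sketch for comparing such dotted version strings follows; the helper names are mine, and real code would use `packaging.version.Version` instead, which also handles pre-release tags:

```python
# Hypothetical compatibility check against the version pinned in config.json.
# Only plain numeric "X.Y.Z" strings are handled in this sketch.
def version_tuple(v):
    return tuple(int(part) for part in v.split("."))

def is_at_least(current, required):
    # Tuple comparison gives correct numeric ordering, e.g. 4.57 > 4.9
    return version_tuple(current) >= version_tuple(required)

SAVED_WITH = "4.57.6"  # transformers_version from this commit's config.json
```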
model.safetensors CHANGED

@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:
-size
+oid sha256:83c82e0a023d6191e722294d211983e61fcd345004b54a099c89823e706c3cae
+size 219166144
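The `model.safetensors` file in the repository is a Git LFS pointer: it stores only the blob's SHA-256 digest and byte size, as in the diff above. A stdlib sketch for verifying a downloaded blob against such a pointer (the helper names are mine):

```python
import hashlib

# Parse a Git LFS pointer file ("oid sha256:<hex>" / "size <bytes>") and
# check a downloaded blob against it.
def parse_lfs_pointer(text):
    fields = dict(line.split(" ", 1) for line in text.strip().splitlines())
    algo, digest = fields["oid"].split(":", 1)
    return algo, digest, int(fields["size"])

def matches_pointer(data, pointer_text):
    algo, digest, size = parse_lfs_pointer(pointer_text)
    # Cheap size check first, then the content hash
    return len(data) == size and hashlib.new(algo, data).hexdigest() == digest
```

This is the same check `git lfs` performs on checkout, so it can be used to confirm a manually downloaded weights file is intact.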