Update to CREMP pretrained model with new performance results

Files changed:
- .gitattributes (+1, -0)
- README.md (+26, -8)
- assets/tsne_permeability_splits.png (+3, -0)
- config.json (+1, -1)
- model.safetensors (+2, -2)
.gitattributes CHANGED

@@ -35,3 +35,4 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
 *tfevents* filter=lfs diff=lfs merge=lfs -text
 assets/HELM-BERT.png filter=lfs diff=lfs merge=lfs -text
 assets/tsne_ppi_splits.png filter=lfs diff=lfs merge=lfs -text
+assets/tsne_permeability_splits.png filter=lfs diff=lfs merge=lfs -text
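The added `.gitattributes` line routes the new PNG through Git LFS (LFS filter/diff/merge drivers, with the `text` attribute unset so Git never treats it as text). As a minimal illustration of how such a line decomposes, here is a small parser; the helper name `parse_gitattributes_line` is mine, not part of Git:

```python
# Hypothetical helper: split a .gitattributes line into its path pattern and
# attribute settings. Note: a leading "-" (as in "-text") unsets an attribute
# in Git's semantics; this sketch just records the token verbatim.
def parse_gitattributes_line(line):
    pattern, *attrs = line.split()
    parsed = {}
    for a in attrs:
        if "=" in a:
            key, value = a.split("=", 1)
            parsed[key] = value
        else:
            parsed[a] = True
    return pattern, parsed

pattern, attrs = parse_gitattributes_line(
    "assets/tsne_permeability_splits.png filter=lfs diff=lfs merge=lfs -text"
)
```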
README.md CHANGED

@@ -22,7 +22,7 @@ A peptide language model using **HELM (Hierarchical Editing Language for Macromo
 
 ## Model Description
 
-HELM-BERT is built upon the DeBERTa architecture,
+HELM-BERT is built upon the DeBERTa architecture, pre-trained on ~75k peptides from four databases (ChEMBL, CREMP, CycPeptMPDB, Propedia) using **Masked Language Modeling (MLM)** with a **Warmup-Stable-Decay (WSD)** learning rate schedule.
 
 - **Disentangled Attention**: Decomposes attention into content-content and content-position terms
 - **Enhanced Mask Decoder (EMD)**: Injects absolute position embeddings at the decoder stage
@@ -41,6 +41,9 @@ HELM-BERT is built upon the DeBERTa architecture, designed for peptide sequences
 | Attention heads | 12 |
 | Vocab size | 78 |
 | Max token length | 512 |
+| Pre-training data | ~75k peptides (ChEMBL, CREMP, CycPeptMPDB, Propedia) |
+| Pre-training objective | MLM (span masking, p=0.15) |
+| LR schedule | Warmup-Stable-Decay (WSD) |
 
 ## How to Use
 
@@ -58,8 +61,9 @@ embeddings = outputs.last_hidden_state
 
 ## Training Data
 
-Pretrained on deduplicated peptide sequences from:
+Pre-trained on deduplicated peptide sequences from:
 - **ChEMBL**: Bioactive molecules database
+- **CREMP**: Cyclic peptide conformational ensemble database
 - **CycPeptMPDB**: Cyclic peptide membrane permeability database
 - **Propedia**: Protein-peptide interaction database
 
@@ -67,18 +71,32 @@ Pretrained on deduplicated peptide sequences from:
 
 ### Permeability Regression (CycPeptMPDB)
 
-| R² | Pearson | RMSE | MAE |
-|:--:|:-------:|:----:|:---:|
-| 0.759 | 0.872 | 0.383 | 0.277 |
+**Single-Assay** (mixed PAMPA/Caco-2 target):
+
+| Split | R² | Pearson | RMSE | MAE |
+|:-----:|:--:|:-------:|:----:|:---:|
+| Random | 0.751 | 0.867 | 0.398 | 0.263 |
+| Scaffold | 0.655 | 0.821 | 0.398 | 0.305 |
+
+**Multi-Assay** (separate PAMPA and Caco-2 heads):
+
+| Split | Assay | R² | Pearson | RMSE | MAE |
+|:-----:|:-----:|:--:|:-------:|:----:|:---:|
+| Random | PAMPA | 0.740 | 0.862 | 0.399 | 0.281 |
+| Random | Caco-2 | 0.694 | 0.833 | 0.412 | 0.274 |
+| Scaffold | PAMPA | 0.629 | 0.815 | 0.406 | 0.317 |
+| Scaffold | Caco-2 | 0.625 | 0.822 | 0.426 | 0.316 |
 
-
+Train/test 9:1, val 10% from train. Scaffold split by Murcko scaffolds.
+
+<p align="center"><img src="assets/tsne_permeability_splits.png" width="800"></p>
 
 ### PPI Classification (Propedia v2)
 
 | Split | ROC-AUC | PR-AUC | F1 | MCC | Balanced Acc |
 |:-----:|:-------:|:------:|:--:|:---:|:------------:|
-| Random | 0.
-| aCSM | 0.
+| Random | 0.972 | 0.913 | 0.855 | 0.819 | 0.909 |
+| aCSM | 0.870 | 0.701 | 0.604 | 0.547 | 0.731 |
 
 Train/test 8:2, val 10% from train, 1:4 positive:negative ratio.
 - **Random**: random split
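The README changes record a Warmup-Stable-Decay (WSD) learning-rate schedule for pre-training. A minimal sketch of the three-phase shape (linear warmup, constant plateau, linear decay to zero) follows; the phase fractions and peak LR here are illustrative assumptions, not the values used to train HELM-BERT:

```python
# Warmup-Stable-Decay (WSD) schedule sketch. The warmup/decay fractions and
# peak LR are placeholders, not HELM-BERT's actual hyperparameters.
def wsd_lr(step, total_steps, peak_lr, warmup_frac=0.1, decay_frac=0.1):
    warmup_steps = int(total_steps * warmup_frac)
    decay_steps = int(total_steps * decay_frac)
    stable_end = total_steps - decay_steps
    if step < warmup_steps:
        # Phase 1: linear warmup from 0 to peak_lr
        return peak_lr * step / max(1, warmup_steps)
    if step < stable_end:
        # Phase 2: hold at peak_lr
        return peak_lr
    # Phase 3: linear decay from peak_lr to 0
    return peak_lr * max(0, total_steps - step) / max(1, decay_steps)
```

Unlike cosine schedules, the stable phase means training can be extended or checkpoint-branched before the decay begins, which is the usual motivation for WSD.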
assets/tsne_permeability_splits.png ADDED

(Binary file, stored via Git LFS)
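The permeability tables in the README diff report R², Pearson r, RMSE, and MAE. For reference, here is a pure-Python sketch of those four metrics (the function name is mine; in practice scikit-learn or scipy would be used):

```python
import math

# Reference implementations of the regression metrics shown in the README
# tables: R^2 (coefficient of determination), Pearson r, RMSE, and MAE.
def regression_metrics(y_true, y_pred):
    n = len(y_true)
    mean_t = sum(y_true) / n
    mean_p = sum(y_pred) / n
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_t) ** 2 for t in y_true)
    cov = sum((t - mean_t) * (p - mean_p) for t, p in zip(y_true, y_pred))
    var_p = sum((p - mean_p) ** 2 for p in y_pred)
    return {
        "r2": 1.0 - ss_res / ss_tot,
        "pearson": cov / math.sqrt(ss_tot * var_p),
        "rmse": math.sqrt(ss_res / n),
        "mae": sum(abs(t - p) for t, p in zip(y_true, y_pred)) / n,
    }
```

Note that R² and Pearson r answer different questions (calibrated error vs. linear correlation), which is why both appear in the tables.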
config.json CHANGED

@@ -31,6 +31,6 @@
   "position_buckets": 256,
   "sep_token_id": 2,
   "share_att_key": false,
-  "transformers_version": "
+  "transformers_version": "4.57.6",
   "vocab_size": 78
 }
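`config.json` now records the `transformers` version the checkpoint was saved with (4.57.6). A minimal pure-stdlib sketch for comparing such dotted version strings follows; the helper names are mine, and real code would use `packaging.version.Version` instead, which also handles pre-release tags:

```python
# Hypothetical compatibility check against the version pinned in config.json.
# Only plain numeric "X.Y.Z" strings are handled in this sketch.
def version_tuple(v):
    return tuple(int(part) for part in v.split("."))

def is_at_least(current, required):
    # Tuple comparison gives correct numeric ordering, e.g. 4.57 > 4.9
    return version_tuple(current) >= version_tuple(required)

SAVED_WITH = "4.57.6"  # transformers_version from this commit's config.json
```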
model.safetensors CHANGED

@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:
-size
+oid sha256:83c82e0a023d6191e722294d211983e61fcd345004b54a099c89823e706c3cae
+size 219166144
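The `model.safetensors` file in the repository is a Git LFS pointer: it stores only the blob's SHA-256 digest and byte size, as in the diff above. A stdlib sketch for verifying a downloaded blob against such a pointer (the helper names are mine):

```python
import hashlib

# Parse a Git LFS pointer file ("oid sha256:<hex>" / "size <bytes>") and
# check a downloaded blob against it.
def parse_lfs_pointer(text):
    fields = dict(line.split(" ", 1) for line in text.strip().splitlines())
    algo, digest = fields["oid"].split(":", 1)
    return algo, digest, int(fields["size"])

def matches_pointer(data, pointer_text):
    algo, digest, size = parse_lfs_pointer(pointer_text)
    # Cheap size check first, then the content hash
    return len(data) == size and hashlib.new(algo, data).hexdigest() == digest
```

This is the same check `git lfs` performs on checkout, so it can be used to confirm a manually downloaded weights file is intact.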