Flansma committed on
Commit c124254 · verified · 1 Parent(s): bcce6d4

Update to CREMP pretrained model with new performance results

.gitattributes CHANGED
@@ -35,3 +35,4 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
  *tfevents* filter=lfs diff=lfs merge=lfs -text
  assets/HELM-BERT.png filter=lfs diff=lfs merge=lfs -text
  assets/tsne_ppi_splits.png filter=lfs diff=lfs merge=lfs -text
+ assets/tsne_permeability_splits.png filter=lfs diff=lfs merge=lfs -text
README.md CHANGED
@@ -22,7 +22,7 @@ A peptide language model using **HELM (Hierarchical Editing Language for Macromo
 
  ## Model Description
 
- HELM-BERT is built upon the DeBERTa architecture, designed for peptide sequences in HELM notation:
+ HELM-BERT is built upon the DeBERTa architecture, pre-trained on ~75k peptides from four databases (ChEMBL, CREMP, CycPeptMPDB, Propedia) using **Masked Language Modeling (MLM)** with a **Warmup-Stable-Decay (WSD)** learning rate schedule.
 
  - **Disentangled Attention**: Decomposes attention into content-content and content-position terms
  - **Enhanced Mask Decoder (EMD)**: Injects absolute position embeddings at the decoder stage
@@ -41,6 +41,9 @@ HELM-BERT is built upon the DeBERTa architecture, designed for peptide sequences
  | Attention heads | 12 |
  | Vocab size | 78 |
  | Max token length | 512 |
+ | Pre-training data | ~75k peptides (ChEMBL, CREMP, CycPeptMPDB, Propedia) |
+ | Pre-training objective | MLM (span masking, p=0.15) |
+ | LR schedule | Warmup-Stable-Decay (WSD) |
 
  ## How to Use
 
@@ -58,8 +61,9 @@ embeddings = outputs.last_hidden_state
 
  ## Training Data
 
- Pretrained on deduplicated peptide sequences from:
+ Pre-trained on deduplicated peptide sequences from:
  - **ChEMBL**: Bioactive molecules database
+ - **CREMP**: Cyclic peptide conformational ensemble database
  - **CycPeptMPDB**: Cyclic peptide membrane permeability database
  - **Propedia**: Protein-peptide interaction database
 
@@ -67,18 +71,32 @@ Pretrained on deduplicated peptide sequences from:
 
  ### Permeability Regression (CycPeptMPDB)
 
- | R² | Pearson | RMSE | MAE |
- |:--:|:-------:|:----:|:---:|
- | 0.759 | 0.872 | 0.383 | 0.277 |
+ **Single-Assay** (mixed PAMPA/Caco-2 target):
 
- Train/test 9:1, val 10% from train.
+ | Split | R² | Pearson | RMSE | MAE |
+ |:-----:|:--:|:-------:|:----:|:---:|
+ | Random | 0.751 | 0.867 | 0.398 | 0.263 |
+ | Scaffold | 0.655 | 0.821 | 0.398 | 0.305 |
+
+ **Multi-Assay** (separate PAMPA and Caco-2 heads):
+
+ | Split | Assay | R² | Pearson | RMSE | MAE |
+ |:-----:|:-----:|:--:|:-------:|:----:|:---:|
+ | Random | PAMPA | 0.740 | 0.862 | 0.399 | 0.281 |
+ | Random | Caco-2 | 0.694 | 0.833 | 0.412 | 0.274 |
+ | Scaffold | PAMPA | 0.629 | 0.815 | 0.406 | 0.317 |
+ | Scaffold | Caco-2 | 0.625 | 0.822 | 0.426 | 0.316 |
+
+ Train/test 9:1, val 10% from train. Scaffold split by Murcko scaffolds.
+
+ <p align="center"><img src="assets/tsne_permeability_splits.png" width="800"></p>
 
  ### PPI Classification (Propedia v2)
 
  | Split | ROC-AUC | PR-AUC | F1 | MCC | Balanced Acc |
  |:-----:|:-------:|:------:|:--:|:---:|:------------:|
- | Random | 0.971 | 0.914 | 0.853 | 0.816 | 0.912 |
- | aCSM | 0.879 | 0.714 | 0.591 | 0.539 | 0.722 |
+ | Random | 0.972 | 0.913 | 0.855 | 0.819 | 0.909 |
+ | aCSM | 0.870 | 0.701 | 0.604 | 0.547 | 0.731 |
 
  Train/test 8:2, val 10% from train, 1:4 positive:negative ratio.
  - **Random**: random split
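
Reviewer note: the README change names a Warmup-Stable-Decay (WSD) learning rate schedule but does not give its parameters. The sketch below shows one common shape of such a schedule, assuming linear warmup and linear decay; the function name `wsd_lr`, its arguments, and all step counts and rates are hypothetical and are not taken from this commit.

```python
def wsd_lr(step, total_steps, warmup_steps, decay_steps, peak_lr, min_lr=0.0):
    """Learning rate at `step` for a Warmup-Stable-Decay schedule.

    Assumed shape (not confirmed by the commit): linear warmup to
    `peak_lr`, a constant plateau, then linear decay to `min_lr`
    over the final `decay_steps` steps.
    """
    if warmup_steps + decay_steps > total_steps:
        raise ValueError("warmup_steps + decay_steps must not exceed total_steps")

    decay_start = total_steps - decay_steps

    # Warmup: ramp linearly so that the peak is reached at step warmup_steps - 1.
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps

    # Stable: hold the peak learning rate.
    if step < decay_start:
        return peak_lr

    # Decay: interpolate linearly from peak_lr down to min_lr.
    progress = min(1.0, (step - decay_start) / max(1, decay_steps))
    return peak_lr + (min_lr - peak_lr) * progress


if __name__ == "__main__":
    # Illustrative values only: 100 steps, 10 warmup, 20 decay, peak 1.0.
    for s in (0, 9, 50, 90, 100):
        print(s, wsd_lr(s, total_steps=100, warmup_steps=10, decay_steps=20, peak_lr=1.0))
```

The long constant plateau is what distinguishes this shape from a cosine schedule: the decay phase can be started from any plateau checkpoint without fixing the total step count in advance.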
assets/tsne_permeability_splits.png ADDED

Git LFS Details

  • SHA256: b4e99eeb37e1ba494576eff75b6305656cd87731e78806e6e2bbcf7570329744
  • Pointer size: 132 Bytes
  • Size of remote file: 1.6 MB
config.json CHANGED
@@ -31,6 +31,6 @@
  "position_buckets": 256,
  "sep_token_id": 2,
  "share_att_key": false,
- "transformers_version": "5.3.0",
+ "transformers_version": "4.57.6",
  "vocab_size": 78
  }
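
Reviewer note: the only config change is the recorded `transformers_version`, which moves from 5.3.0 to 4.57.6, i.e. to a lower version number. A minimal sketch for comparing the two recorded values is shown below; the helper `version_tuple` is hypothetical, assumes purely numeric dotted versions, and the JSON strings are reduced to the two fields visible in this hunk.

```python
import json


def version_tuple(version):
    """Turn a dotted numeric version such as '4.57.6' into (4, 57, 6).

    Assumption: no pre-release or local suffixes (e.g. '.dev0').
    """
    return tuple(int(part) for part in version.split("."))


# Fragments of config.json before and after the commit (other keys omitted).
old_config = json.loads('{"transformers_version": "5.3.0", "vocab_size": 78}')
new_config = json.loads('{"transformers_version": "4.57.6", "vocab_size": 78}')

old_version = version_tuple(old_config["transformers_version"])
new_version = version_tuple(new_config["transformers_version"])

if new_version < old_version:
    print("recorded transformers_version decreased:", old_version, "->", new_version)
```

Tuple comparison is used instead of string comparison because `"4.57.6" < "5.3.0"` happens to hold lexically here, but string ordering breaks for cases such as `"4.9.0"` versus `"4.57.6"`.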
model.safetensors CHANGED
@@ -1,3 +1,3 @@
  version https://git-lfs.github.com/spec/v1
- oid sha256:af4f60f03ddf756a4e03cb0b776762cdbcd6d4d770c5ade44aa29db767d35371
- size 219405856
+ oid sha256:83c82e0a023d6191e722294d211983e61fcd345004b54a099c89823e706c3cae
+ size 219166144
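
Reviewer note: `model.safetensors` is stored as a Git LFS pointer, so this hunk only shows the object ID and byte size changing. The sketch below parses the two pointer versions from the hunk and reports the size difference; `parse_lfs_pointer` is a hypothetical helper written for this note, relying only on the `key value` line layout visible above.

```python
def parse_lfs_pointer(text):
    """Parse a Git LFS pointer file (one 'key value' pair per line)."""
    fields = {}
    for line in text.strip().splitlines():
        key, _, value = line.strip().partition(" ")
        fields[key] = value

    # The oid value has the form '<hash algorithm>:<hex digest>'.
    algo, _, digest = fields["oid"].partition(":")
    return {
        "version": fields["version"],
        "hash_algo": algo,
        "digest": digest,
        "size": int(fields["size"]),
    }


OLD_POINTER = """version https://git-lfs.github.com/spec/v1
oid sha256:af4f60f03ddf756a4e03cb0b776762cdbcd6d4d770c5ade44aa29db767d35371
size 219405856
"""

NEW_POINTER = """version https://git-lfs.github.com/spec/v1
oid sha256:83c82e0a023d6191e722294d211983e61fcd345004b54a099c89823e706c3cae
size 219166144
"""

old = parse_lfs_pointer(OLD_POINTER)
new = parse_lfs_pointer(NEW_POINTER)
delta = new["size"] - old["size"]
print(f"size change: {delta} bytes")  # prints "size change: -239712 bytes"
```

The new weights file is 239,712 bytes smaller than the previous one; the pointer alone does not say why.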