ysy20020107 committed on
Commit 0248453 · verified · 1 Parent(s): 42dc529

Update README.md

Files changed (1)
  1. README.md +32 -42
README.md CHANGED
@@ -7,34 +7,23 @@ metrics:
  accuracy: 0.68
  ---

- # Model Card for Model-Demo-35M

- ## Description
- This is a protein **EC number classification model** based on SaProt_35M_AF2, fine-tuned with LoRA. The model can classify proteins into **6 major EC classes (EC1-EC6)**. Since there are only 31 samples for EC7 in the raw dataset, this class is excluded from training and prediction.
-
- Label mapping:
- - **Label 0**: Oxidoreductase (EC1)
- - **Label 1**: Transferase (EC2)
- - **Label 2**: Hydrolase (EC3)
- - **Label 3**: Lyase (EC4)
- - **Label 4**: Isomerase (EC5)
- - **Label 5**: Ligase (EC6)

  Training data is obtained from: https://academic.oup.com/nar/article/54/D1/D643/8313833

- To address the **class imbalance problem** in the training set, we performed data augmentation:
- - Label 4 (EC5) samples were duplicated **2 times**
- - Label 5 (EC6) samples were duplicated **1 time**
-
- ## Task type
- Protein-level Classification
-
- ## Model input type
- Amino acid sequence (AA Sequence)
-
- ## Dataset Distribution

- ### Training set
  - Label 0: 1497 (28.5%)
  - Label 2: 1217 (23.2%)
  - Label 1: 1050 (19.9%)
@@ -43,7 +32,7 @@ Amino acid sequence (AA Sequence)
  - Label 5: 483 (9.2%)
  Total: 5255 samples

- ### Validation set
  - Label 0: 187 (32.0%)
  - Label 2: 152 (26.0%)
  - Label 1: 131 (22.4%)
@@ -52,7 +41,7 @@ Total: 5255 samples
  - Label 5: 20 (3.4%)
  Total: 585 samples

- ### Test set
  - Label 0: 188 (31.8%)
  - Label 2: 153 (25.9%)
  - Label 1: 132 (22.3%)
@@ -61,22 +50,23 @@ Total: 585 samples
  - Label 5: 21 (3.5%)
  Total: 591 samples

- ## Performance (on test set)
- - **Accuracy: 0.68**

- ## LoRA config
- - **r:** 8
- - **lora_dropout:** 0.1
- - **lora_alpha:** 16
- - **target_modules:** ['key', 'value', 'output.dense', 'intermediate.dense', 'query']
- - **modules_to_save:** ['classifier']

- ## Training config
- - **optimizer:**
- - **class:** AdamW
- - **betas:** (0.9, 0.98)
- - **weight_decay:** 0.01
- - **learning rate:** 0.0005
- - **epoch:** 25
- - **batch size:** 64
- - **precision:** 16-mixed

@@ -7,34 +7,23 @@ metrics:
  accuracy: 0.68
  ---

+ Base model: westlake-repl/SaProt_35M_AF2

+ Task type: protein-level classification

+ Dataset: This model classifies proteins into 6 major EC classes (EC1-EC6). EC7 was excluded due to only 31 samples available.
+ To address class imbalance, Label 4 (EC5) was duplicated 2 times and Label 5 (EC6) was duplicated 1 time in the training set.
  Training data is obtained from: https://academic.oup.com/nar/article/54/D1/D643/8313833
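
The duplication step in the README can be sketched as plain oversampling. This is a minimal illustration, not the author's script: the `oversample` helper and the toy sequences are hypothetical, and reading "duplicated N times" as "N extra copies of each sample are added" is an assumption.

```python
from collections import Counter

# Extra copies per label, taken from the README text
# (assumed meaning: "duplicated N times" = N additional copies).
EXTRA_COPIES = {4: 2, 5: 1}


def oversample(samples, extra_copies):
    """Return the samples with `extra_copies[label]` additional copies of each item."""
    augmented = []
    for sequence, label in samples:
        repeats = 1 + extra_copies.get(label, 0)
        augmented.extend([(sequence, label)] * repeats)
    return augmented


# Toy (sequence, label) pairs; placeholders, not real proteins.
toy_samples = [("SEQ_A", 0), ("SEQ_B", 4), ("SEQ_C", 5)]

augmented = oversample(toy_samples, EXTRA_COPIES)
counts = Counter(label for _, label in augmented)
print(dict(counts))
```

Only the training split would be passed through such a step; the validation and test splits keep their original class balance.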

+ Label mapping:
+ Label 0: Oxidoreductase (EC1)
+ Label 1: Transferase (EC2)
+ Label 2: Hydrolase (EC3)
+ Label 3: Lyase (EC4)
+ Label 4: Isomerase (EC5)
+ Label 5: Ligase (EC6)
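
To show how the label mapping would be used at prediction time, here is a small sketch that turns a vector of six class scores into an EC class name. The `predicted_class` helper and the example scores are hypothetical; only the mapping itself comes from the README.

```python
# Label ids and class names as listed in the README.
ID2LABEL = {
    0: "Oxidoreductase (EC1)",
    1: "Transferase (EC2)",
    2: "Hydrolase (EC3)",
    3: "Lyase (EC4)",
    4: "Isomerase (EC5)",
    5: "Ligase (EC6)",
}


def predicted_class(logits):
    """Return the class name of the highest-scoring label (argmax)."""
    if len(logits) != len(ID2LABEL):
        raise ValueError(f"expected {len(ID2LABEL)} scores, got {len(logits)}")
    best = max(range(len(logits)), key=lambda i: logits[i])
    return ID2LABEL[best]


# Made-up scores for illustration; index 2 is the largest.
example_logits = [0.1, 0.3, 2.4, -1.0, 0.0, 0.5]
print(predicted_class(example_logits))  # Hydrolase (EC3)
```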
 
 
 
 
+ Training set distribution:
  - Label 0: 1497 (28.5%)
  - Label 2: 1217 (23.2%)
  - Label 1: 1050 (19.9%)
@@ -43,7 +32,7 @@ Amino acid sequence (AA Sequence)
  - Label 5: 483 (9.2%)
  Total: 5255 samples

+ Validation set distribution:
  - Label 0: 187 (32.0%)
  - Label 2: 152 (26.0%)
  - Label 1: 131 (22.4%)
@@ -52,7 +41,7 @@ Total: 5255 samples
  - Label 5: 20 (3.4%)
  Total: 585 samples

+ Test set distribution:
  - Label 0: 188 (31.8%)
  - Label 2: 153 (25.9%)
  - Label 1: 132 (22.3%)
@@ -61,22 +50,23 @@ Total: 585 samples
  - Label 5: 21 (3.5%)
  Total: 591 samples

+ Model input type: Amino acid sequence
+
+ Performance (on test set): 0.68 accuracy

+ LoRA config:
+ r: 8
+ lora_dropout: 0.1
+ lora_alpha: 16
+ target_modules: ["key", "value", "output.dense", "intermediate.dense", "query"]
+ modules_to_save: ["classifier"]
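
The LoRA settings can be written down as a plain dictionary and used to work out the adapter scaling factor. This is a sketch: the README does not say which library produced the adapter, so treating the keys as keyword arguments for `peft.LoraConfig` (which accepts these names) is an assumption. The scaling factor `lora_alpha / r` is the one used in the standard LoRA formulation.

```python
# Values copied from the README's LoRA config section.
LORA_CONFIG = {
    "r": 8,
    "lora_dropout": 0.1,
    "lora_alpha": 16,
    "target_modules": ["key", "value", "output.dense", "intermediate.dense", "query"],
    "modules_to_save": ["classifier"],
}

# In standard LoRA the low-rank update is scaled by lora_alpha / r
# before it is added to the frozen weight.
scaling = LORA_CONFIG["lora_alpha"] / LORA_CONFIG["r"]
print(f"LoRA scaling factor: {scaling}")  # LoRA scaling factor: 2.0
```

Listing `classifier` under `modules_to_save` means the classification head is trained in full and stored alongside the low-rank adapters, rather than being adapted through LoRA.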

+ Training config:
+ optimizer:
+ class: AdamW
+ betas: (0.9, 0.98)
+ weight_decay: 0.01
+ learning rate: 0.0005
+ epoch: 25
+ batch size: 64
+ precision: 16-mixed
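
As a rough worked example, the training settings imply the following number of optimizer steps. This is arithmetic under stated assumptions rather than a record of the actual run: it takes the listed training-set total of 5255 as the number of samples seen per epoch and assumes a single device, no gradient accumulation, and that the last partial batch is kept. The `OPTIMIZER_KWARGS` name is hypothetical; its keys match the keyword arguments of `torch.optim.AdamW`.

```python
import math

# Values copied from the README.
TRAIN_SAMPLES = 5255
BATCH_SIZE = 64
EPOCHS = 25

# Optimizer settings keyed by torch.optim.AdamW keyword names.
OPTIMIZER_KWARGS = {"lr": 0.0005, "betas": (0.9, 0.98), "weight_decay": 0.01}

# Assumes the final, smaller batch of each epoch is not dropped.
steps_per_epoch = math.ceil(TRAIN_SAMPLES / BATCH_SIZE)
total_steps = steps_per_epoch * EPOCHS

print(f"{steps_per_epoch} steps per epoch, {total_steps} steps in total")
# 83 steps per epoch, 2075 steps in total
```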