nl45 committed 9a21501 (verified) · 1 parent: 1cedab2

Update README.md

Files changed (1): README.md (+200 -3)
---
license: mit
language: en
tags:
- protein-function-prediction
- bioinformatics
- gene-ontology
- multi-label-classification
- esm-2
- CAFA-6
datasets:
- CAFA-6
metrics:
- f1
- precision
- recall
---

# 🧬 CAFA 6 Protein Function Prediction

> *"Once I was zero epochs old, my model said to me... Go make yourself some predictions, don't wait for labeled data."*

**BioBERT, I'm coming for you!** 🔥

## Model Description

State-of-the-art multi-label protein function prediction using ESM-2 embeddings. Predicts Gene Ontology (GO) terms across three ontologies from protein amino acid sequences.

### What This Model Does

Given a protein sequence like:
```
MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQAPILSRVGDGTQDNLSGAEKAVQVKVKALPDAQFEVVH...
```

It predicts:
- **Molecular Function (MFO)**: What the protein DOES (e.g., "protein binding", "kinase activity")
- **Biological Process (BPO)**: What pathways it's involved in (e.g., "signal transduction")
- **Cellular Component (CCO)**: WHERE it's located (e.g., "nucleus", "membrane")

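At inference time, the per-term sigmoid scores can be thresholded into CAFA-style (protein, GO term, score) rows. A minimal sketch, assuming a hypothetical dict of per-term scores for one protein (the IDs and values below are illustrative, not real model output):

```python
def scores_to_predictions(protein_id, term_scores, threshold=0.5):
    """Keep GO terms whose predicted score clears the threshold,
    sorted by descending confidence."""
    kept = [(protein_id, term, round(score, 3))
            for term, score in term_scores.items() if score >= threshold]
    return sorted(kept, key=lambda row: -row[2])

# Illustrative scores for one protein (not real model output)
example = {"GO:0005515": 0.91, "GO:0016301": 0.34, "GO:0005634": 0.77}
rows = scores_to_predictions("P12345", example)
print(rows)  # [('P12345', 'GO:0005515', 0.91), ('P12345', 'GO:0005634', 0.77)]
```

The threshold is a tunable choice; per-ontology thresholds selected on a validation split are common in CAFA-style evaluation.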
## Files in This Repository

- `train_esm2_embeddings.pkl` (427 MB) - Pre-computed ESM-2 embeddings for 82,404 training proteins
- `test_esm2_embeddings.pkl` (1.16 GB) - Pre-computed ESM-2 embeddings for test proteins
- `go_parser.pkl` (25.7 MB) - Gene Ontology hierarchy parser with 40,122 GO terms
- `.gitattributes` - Git LFS configuration for large files

## Dataset Statistics

### Training Data
- **Total proteins**: 82,404
- **Total annotations**: 537,027
- **Unique GO terms**: 26,125

### Selected Terms for Prediction
- **MFO**: 500 most frequent terms
- **BPO**: 800 most frequent terms
- **CCO**: 400 most frequent terms

### Label Distribution
| Ontology | Proteins with Labels | Avg Labels/Protein | Sparsity |
|----------|---------------------|-------------------|----------|
| MFO | 49,751 (60.4%) | 54.2 | 89.2% |
| BPO | 44,382 (53.9%) | 6.6 | 99.2% |
| CCO | 58,505 (71.0%) | 36.5 | 90.9% |

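The Sparsity column is consistent with the other two: it is one minus the average labels per protein divided by the number of selected terms for that ontology. A quick check:

```python
# Sparsity = 1 - (avg labels per protein / number of selected terms)
table = [("MFO", 54.2, 500, 0.892), ("BPO", 6.6, 800, 0.992), ("CCO", 36.5, 400, 0.909)]
for name, avg_labels, n_terms, reported in table:
    sparsity = 1 - avg_labels / n_terms
    assert abs(sparsity - reported) < 1e-3, name
```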
## Usage

### Requirements

```bash
pip install torch biopython transformers huggingface_hub numpy
```

### Quick Start - Load Embeddings

```python
from huggingface_hub import hf_hub_download
import pickle

# Download embeddings
embeddings_path = hf_hub_download(
    repo_id="nl45/Protein1",
    filename="train_esm2_embeddings.pkl"
)

# Load embeddings
with open(embeddings_path, 'rb') as f:
    embeddings = pickle.load(f)

# embeddings is a dict: {protein_id: embedding_vector}
print(f"Loaded embeddings for {len(embeddings)} proteins")
print(f"Embedding dimension: {next(iter(embeddings.values())).shape}")
```
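For training, the loaded dict can be stacked into a design matrix with a stable protein-ID row order. A sketch, using a tiny stand-in dict in place of the real embeddings:

```python
import numpy as np

# Stand-in for the loaded {protein_id: embedding_vector} dict
embeddings = {"P1": np.zeros(1280), "P2": np.ones(1280)}

protein_ids = sorted(embeddings)  # fixed row order, reused for labels
X = np.stack([embeddings[pid] for pid in protein_ids])
print(X.shape)  # (num_proteins, 1280)
```

Keeping `protein_ids` around matters: the label matrices must use the same row order as `X`.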

### Generate New Embeddings for Your Protein

```python
from transformers import AutoTokenizer, EsmModel
import torch

# Load ESM-2 model
tokenizer = AutoTokenizer.from_pretrained("facebook/esm2_t33_650M_UR50D")
model = EsmModel.from_pretrained("facebook/esm2_t33_650M_UR50D")
model.eval()  # disable dropout for deterministic embeddings

# Your protein sequence
sequence = "MKTAYIAKQRQISFVKSHFSRQLE..."

# Generate embedding
inputs = tokenizer(sequence, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)
    embedding = outputs.last_hidden_state.mean(dim=1)  # Shape: [1, 1280]

print(f"Generated embedding shape: {embedding.shape}")
```
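Note that `.mean(dim=1)` averages over every token position, including special tokens and any padding when sequences are batched. A masked mean that pools only real tokens is a common alternative; a sketch in plain PyTorch, checked here on dummy tensors of the same shapes:

```python
import torch

def masked_mean_pool(last_hidden_state, attention_mask):
    """Average token embeddings, ignoring padded positions."""
    mask = attention_mask.unsqueeze(-1).to(last_hidden_state.dtype)  # [B, L, 1]
    summed = (last_hidden_state * mask).sum(dim=1)                   # [B, D]
    counts = mask.sum(dim=1).clamp(min=1e-9)                         # [B, 1]
    return summed / counts

# Tiny shape check with dummy tensors (5 positions, 2 of them padding in row 0)
hidden = torch.randn(2, 5, 1280)
mask = torch.tensor([[1, 1, 1, 0, 0], [1, 1, 1, 1, 1]])
pooled = masked_mean_pool(hidden, mask)
print(pooled.shape)  # torch.Size([2, 1280])
```

With real model outputs, pass `inputs["attention_mask"]` as the second argument.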

### Load GO Parser

```python
from huggingface_hub import hf_hub_download
import pickle

# Download GO parser
parser_path = hf_hub_download(
    repo_id="nl45/Protein1",
    filename="go_parser.pkl"
)

# Load parser
with open(parser_path, 'rb') as f:
    go_parser = pickle.load(f)

# Example: Get GO term information
term_info = go_parser.get_term_info("GO:0003674")
print(f"Term: {term_info['name']}")
print(f"Namespace: {term_info['namespace']}")
```

## Model Architecture

The prediction model uses a Multi-Layer Perceptron (MLP):

```
Input: ESM-2 Embeddings (1280-dim)
        ↓
[Dense 2048] → BatchNorm → ReLU → Dropout(0.3)
        ↓
[Dense 1024] → BatchNorm → ReLU → Dropout(0.3)
        ↓
[Dense 512] → BatchNorm → ReLU → Dropout(0.3)
        ↓
[Dense Output] → Sigmoid
        ↓
Multi-label Predictions
```
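The diagram above maps directly onto a PyTorch module. A sketch under the stated layer sizes (`GOHead` is a name chosen here for illustration); the head emits raw logits so it pairs with `BCEWithLogitsLoss`, with the sigmoid applied at inference:

```python
import torch
import torch.nn as nn

class GOHead(nn.Module):
    """MLP from 1280-dim ESM-2 embeddings to per-GO-term logits."""
    def __init__(self, num_terms, in_dim=1280, dropout=0.3):
        super().__init__()
        layers, dims = [], [in_dim, 2048, 1024, 512]
        for d_in, d_out in zip(dims, dims[1:]):
            layers += [nn.Linear(d_in, d_out), nn.BatchNorm1d(d_out),
                       nn.ReLU(), nn.Dropout(dropout)]
        layers.append(nn.Linear(512, num_terms))  # raw logits; sigmoid at inference
        self.net = nn.Sequential(*layers)

    def forward(self, x):
        return self.net(x)

model = GOHead(num_terms=500)  # e.g. the 500 selected MFO terms
logits = model(torch.randn(4, 1280))
print(logits.shape)  # torch.Size([4, 500])
```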

**Training Details:**
- Loss: Binary Cross-Entropy with Logits (the loss applies the sigmoid internally, so the model is trained on raw logits; the sigmoid in the diagram is applied at inference)
- Optimizer: Adam
- Learning Rate: 0.001 with ReduceLROnPlateau
- Early Stopping: Patience of 10 epochs

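One training step under those settings might look like the following sketch, with a stand-in model and random data (batch size, hidden size, and term count here are illustrative):

```python
import torch
import torch.nn as nn

# Stand-in model and random batch; real training uses the ESM-2 embeddings
model = nn.Sequential(nn.Linear(1280, 512), nn.ReLU(), nn.Linear(512, 500))
criterion = nn.BCEWithLogitsLoss()          # sigmoid folded into the loss
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, patience=3)

x = torch.randn(8, 1280)                    # batch of embeddings
y = (torch.rand(8, 500) < 0.1).float()      # sparse multi-label targets

optimizer.zero_grad()
loss = criterion(model(x), y)               # model emits raw logits
loss.backward()
optimizer.step()
scheduler.step(loss.item())                 # in practice, step on validation loss
```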
## Data Processing Pipeline

1. **Raw Sequences** (FASTA format) → Parse protein IDs and sequences
2. **ESM-2 Encoding** → Generate 1280-dim embeddings using `facebook/esm2_t33_650M_UR50D`
3. **GO Annotations** → Load and normalize GO terms
4. **Label Preparation** → Create multi-label binary matrices with term propagation
5. **Model Training** → Train separate models for MFO, BPO, CCO

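The term propagation in step 4 means each annotated GO term also switches on all of its ancestors, since an annotation implies every parent term. A minimal sketch over a toy `is_a` graph (kinase activity → catalytic activity → molecular_function):

```python
def propagate(terms, parents):
    """Expand a set of GO terms with all ancestors via is_a links."""
    expanded = set()
    stack = list(terms)
    while stack:
        term = stack.pop()
        if term not in expanded:
            expanded.add(term)
            stack.extend(parents.get(term, []))
    return expanded

# Toy hierarchy: GO:0016301 (kinase activity) is_a GO:0003824 (catalytic
# activity) is_a GO:0003674 (molecular_function)
parents = {"GO:0016301": ["GO:0003824"], "GO:0003824": ["GO:0003674"]}
print(sorted(propagate({"GO:0016301"}, parents)))
```

The full ontology also has `part_of` and other relations; which ones propagate is an evaluation-protocol choice.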
## Citation

```bibtex
@misc{nl45_cafa6_2026,
  title={CAFA 6 Protein Function Prediction with ESM-2 Embeddings},
  author={nl45},
  year={2026},
  publisher={Hugging Face},
  howpublished={\url{https://huggingface.co/nl45/Protein1}}
}
```

## Acknowledgments

- **CAFA Challenge**: Critical Assessment of Functional Annotation
- **ESM-2**: Evolutionary Scale Modeling from Meta AI
- **Gene Ontology Consortium**: For GO term annotations

## License

MIT License

## Contact

For questions or collaboration: [Create an issue](https://huggingface.co/nl45/Protein1/discussions)

---

**"BioBERT, I'm coming for you!"** 🔥🧬