nl45
/

Protein1

nl45 committed on Jan 3

Commit

9a21501

verified ·

1 Parent(s): 1cedab2

Update README.md

Files changed (1)

README.md +200 -3

@@ -1,3 +1,200 @@
----
-license: mit
----

+---
+license: mit
+---
+---
+language: en
+tags:
+- protein-function-prediction
+- bioinformatics
+- gene-ontology
+- multi-label-classification
+- esm-2
+- CAFA-6
+license: mit
+datasets:
+- CAFA-6
+metrics:
+- f1
+- precision
+- recall
+---
+# 🧬 CAFA 6 Protein Function Prediction
+> *"Once I was zero epochs old, my model said to me... Go make yourself some predictions, don't wait for labeled data."*
+**BioBERT, I'm coming for you!** 🔥
+## Model Description
+Multi-label protein function prediction using ESM-2 embeddings. Predicts Gene Ontology (GO) terms across the three GO ontologies (MFO, BPO, CCO) from protein amino acid sequences.
+### What This Model Does
+Given a protein sequence like:
+```
+MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQAPILSRVGDGTQDNLSGAEKAVQVKVKALPDAQFEVVH...
+```
+It predicts:
+- **Molecular Function (MFO)**: What the protein DOES (e.g., "protein binding", "kinase activity")
+- **Biological Process (BPO)**: What pathways it's involved in (e.g., "signal transduction")
+- **Cellular Component (CCO)**: WHERE it's located (e.g., "nucleus", "membrane")
+## Files in This Repository
+- `train_esm2_embeddings.pkl` (427 MB) - Pre-computed ESM-2 embeddings for 82,404 training proteins
+- `test_esm2_embeddings.pkl` (1.16 GB) - Pre-computed ESM-2 embeddings for test proteins
+- `go_parser.pkl` (25.7 MB) - Gene Ontology hierarchy parser with 40,122 GO terms
+- `.gitattributes` - Git LFS configuration for large files
+## Dataset Statistics
+### Training Data
+- **Total proteins**: 82,404
+- **Total annotations**: 537,027
+- **Unique GO terms**: 26,125
+### Selected Terms for Prediction
+- **MFO**: 500 most frequent terms
+- **BPO**: 800 most frequent terms
+- **CCO**: 400 most frequent terms
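The selection above can be sketched as a frequency count per ontology. This is a minimal illustration, not the repository's code: the `(protein_id, go_term, ontology)` tuple layout, the function name `select_frequent_terms`, and the toy protein IDs are all assumptions made for the example.

```python
from collections import Counter

# Number of terms kept per ontology, taken from the list above.
TOP_K = {"MFO": 500, "BPO": 800, "CCO": 400}


def select_frequent_terms(annotations, top_k=TOP_K):
    """Return, per ontology, the most frequently annotated GO terms.

    `annotations` is an iterable of (protein_id, go_term, ontology) tuples;
    this layout is an assumption made for the sketch.
    """
    counts = {ontology: Counter() for ontology in top_k}
    for _protein_id, go_term, ontology in annotations:
        if ontology in counts:
            counts[ontology][go_term] += 1
    return {
        ontology: [term for term, _ in counter.most_common(top_k[ontology])]
        for ontology, counter in counts.items()
    }


# Toy input with made-up protein IDs (not taken from the training set).
toy = [
    ("P1", "GO:0005515", "MFO"),
    ("P2", "GO:0005515", "MFO"),
    ("P2", "GO:0016301", "MFO"),
    ("P1", "GO:0005634", "CCO"),
]
selected = select_frequent_terms(toy, top_k={"MFO": 1, "BPO": 1, "CCO": 1})
print(selected)
```

The order of each returned list can double as the column order of the label matrix, so it is worth saving alongside any trained model.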
+### Label Distribution
+| Ontology | Proteins with Labels | Avg Labels/Protein | Sparsity |
+|----------|---------------------|-------------------|----------|
+| MFO      | 49,751 (60.4%)      | 54.2              | 89.2%    |
+| BPO      | 44,382 (53.9%)      | 6.6               | 99.2%    |
+| CCO      | 58,505 (71.0%)      | 36.5              | 90.9%    |
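The sparsity column is consistent with one minus the average label count divided by the number of selected terms. The check below only reproduces that arithmetic from the table values; it assumes both figures are computed over the proteins that have at least one label in the ontology, which the source does not state explicitly.

```python
# Values copied from the table and the "Selected Terms" list above.
label_stats = {
    "MFO": {"selected_terms": 500, "avg_labels": 54.2},
    "BPO": {"selected_terms": 800, "avg_labels": 6.6},
    "CCO": {"selected_terms": 400, "avg_labels": 36.5},
}


def sparsity_percent(avg_labels, selected_terms):
    """Share of zero entries in a binary label matrix, in percent."""
    return 100.0 * (1.0 - avg_labels / selected_terms)


sparsity = {
    name: sparsity_percent(stats["avg_labels"], stats["selected_terms"])
    for name, stats in label_stats.items()
}
for name, value in sparsity.items():
    print(f"{name}: {value:.3f}% zeros")
```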
+## Usage
+### Requirements
+```bash
+pip install torch biopython transformers huggingface_hub numpy
+```
+### Quick Start - Load Embeddings
+```python
+from huggingface_hub import hf_hub_download
+import pickle
+# Download embeddings
+embeddings_path = hf_hub_download(
+    repo_id="nl45/Protein1",
+    filename="train_esm2_embeddings.pkl"
+)
+# Load embeddings
+with open(embeddings_path, 'rb') as f:
+    embeddings = pickle.load(f)
+# embeddings is a dict: {protein_id: embedding_vector}
+print(f"Loaded embeddings for {len(embeddings)} proteins")
+print(f"Embedding dimension: {list(embeddings.values())[0].shape}")
+```
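For training, the `{protein_id: embedding_vector}` dict usually has to become a single matrix with a fixed row order. A minimal sketch, assuming every value is a 1-D vector of the same length; the helper name `stack_embeddings` and the toy IDs are hypothetical.

```python
import numpy as np


def stack_embeddings(embeddings, protein_ids=None):
    """Stack a {protein_id: vector} dict into a [num_proteins, dim] matrix.

    Rows follow `protein_ids`; if omitted, IDs are sorted so the order is
    reproducible across runs.
    """
    if protein_ids is None:
        protein_ids = sorted(embeddings)
    matrix = np.stack(
        [np.asarray(embeddings[pid], dtype=np.float32) for pid in protein_ids]
    )
    return list(protein_ids), matrix


# Toy dict standing in for the downloaded embeddings (4-dim instead of 1280).
toy = {"P2": np.ones(4), "P1": np.zeros(4)}
ids, X = stack_embeddings(toy)
print(ids, X.shape)
```

Keeping `ids` next to `X` makes it possible to align rows with the label matrix later.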
+### Generate New Embeddings for Your Protein
+```python
+from transformers import AutoTokenizer, EsmModel
+import torch
+# Load ESM-2 model
+tokenizer = AutoTokenizer.from_pretrained("facebook/esm2_t33_650M_UR50D")
+model = EsmModel.from_pretrained("facebook/esm2_t33_650M_UR50D")
+# Your protein sequence (truncated placeholder; substitute the full sequence)
+sequence = "MKTAYIAKQRQISFVKSHFSRQLE..."
+# Generate embedding
+inputs = tokenizer(sequence, return_tensors="pt", padding=True)
+with torch.no_grad():
+    outputs = model(**inputs)
+    # Mean over all token positions, including the special <cls>/<eos> tokens
+    embedding = outputs.last_hidden_state.mean(dim=1)  # Shape: [1, 1280]
+print(f"Generated embedding shape: {embedding.shape}")
+```
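The snippet above embeds one sequence at a time, so there are no padding tokens. When several sequences of different lengths are batched, a plain `mean(dim=1)` would also average the padded positions. One common way around this is a mask-aware mean; the sketch below shows that technique on toy tensors and is not necessarily how the embeddings in this repository were computed. The function name `masked_mean_pool` is hypothetical.

```python
import torch


def masked_mean_pool(last_hidden_state, attention_mask):
    """Average token embeddings over positions where attention_mask == 1.

    last_hidden_state: [batch, length, dim]
    attention_mask:    [batch, length] with 1 for real tokens, 0 for padding
    """
    mask = attention_mask.unsqueeze(-1).to(last_hidden_state.dtype)  # [B, L, 1]
    summed = (last_hidden_state * mask).sum(dim=1)                   # [B, dim]
    counts = mask.sum(dim=1).clamp(min=1.0)                          # avoid divide-by-zero
    return summed / counts


# Toy batch: the first row has one padded position holding a large value.
hidden = torch.tensor([
    [[1.0, 1.0], [3.0, 3.0], [100.0, 100.0]],
    [[2.0, 2.0], [4.0, 4.0], [6.0, 6.0]],
])
mask = torch.tensor([[1, 1, 0], [1, 1, 1]])
pooled = masked_mean_pool(hidden, mask)
print(pooled)
```

With a real model, `hidden` would be `outputs.last_hidden_state` and `mask` would be `inputs["attention_mask"]`.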
+### Load GO Parser
+```python
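+from huggingface_hub import hf_hub_download
+import pickle
+# Download GO parser
+parser_path = hf_hub_download(
+    repo_id="nl45/Protein1",
+    filename="go_parser.pkl"
+)
+# Load parser (unpickling needs the parser's class definition to be importable)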
+with open(parser_path, 'rb') as f:
+    go_parser = pickle.load(f)
+# Example: Get GO term information
+term_info = go_parser.get_term_info("GO:0003674")
+print(f"Term: {term_info['name']}")
+print(f"Namespace: {term_info['namespace']}")
+```
+## Model Architecture
+The prediction model uses a Multi-Layer Perceptron (MLP):
+```
+Input: ESM-2 Embeddings (1280-dim)
+    ↓
+[Dense 2048] → BatchNorm → ReLU → Dropout(0.3)
+    ↓
+[Dense 1024] → BatchNorm → ReLU → Dropout(0.3)
+    ↓
+[Dense 512] → BatchNorm → ReLU → Dropout(0.3)
+    ↓
+[Dense Output] → Sigmoid
+    ↓
+Multi-label Predictions
+```
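The diagram above can be written as a small PyTorch module. This is a sketch under assumptions, not the author's implementation: the class name `ProteinMLP` and the method `predict_proba` are hypothetical, and the choice to return raw logits from `forward` (applying the sigmoid only at prediction time) is made here so the module pairs with a logits-based loss.

```python
import torch
from torch import nn


class ProteinMLP(nn.Module):
    """MLP with the layer widths shown in the diagram above."""

    def __init__(self, num_terms, input_dim=1280,
                 hidden_dims=(2048, 1024, 512), dropout=0.3):
        super().__init__()
        layers = []
        prev = input_dim
        for width in hidden_dims:
            layers += [
                nn.Linear(prev, width),
                nn.BatchNorm1d(width),
                nn.ReLU(),
                nn.Dropout(dropout),
            ]
            prev = width
        layers.append(nn.Linear(prev, num_terms))  # one output per GO term
        self.net = nn.Sequential(*layers)

    def forward(self, x):
        # Raw logits; no sigmoid here (assumption, see lead-in).
        return self.net(x)

    @torch.no_grad()
    def predict_proba(self, x):
        # Switch to eval mode so BatchNorm and Dropout behave deterministically.
        self.eval()
        return torch.sigmoid(self.forward(x))


# 500 outputs matches the number of selected MFO terms; the input is random.
model = ProteinMLP(num_terms=500)
probs = model.predict_proba(torch.randn(4, 1280))
print(probs.shape)
```

One instance per ontology (500, 800, and 400 outputs) would match the "separate models" setup described in the pipeline.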
+**Training Details:**
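+- Loss: Binary Cross-Entropy with Logits (this loss applies the sigmoid internally, so the Sigmoid in the diagram above presumably applies only at inference)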
+- Optimizer: Adam
+- Learning Rate: 0.001 initially, reduced with ReduceLROnPlateau
+- Early Stopping: Patience of 10 epochs
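The settings above can be combined into a training loop roughly as follows. This is a hedged sketch on synthetic data: the tiny placeholder network, full-batch updates, the scheduler's `factor=0.5` and `patience=3`, and `max_epochs=50` are assumptions chosen to keep the example fast. Only the loss, the optimizer, the initial learning rate, and the early-stopping patience come from the list.

```python
import torch
from torch import nn

torch.manual_seed(0)

# Synthetic stand-ins for embeddings and binary label matrices (not real data).
X_train, Y_train = torch.randn(64, 32), (torch.rand(64, 5) > 0.8).float()
X_val, Y_val = torch.randn(16, 32), (torch.rand(16, 5) > 0.8).float()

# Small placeholder network that outputs logits.
model = nn.Sequential(nn.Linear(32, 16), nn.ReLU(), nn.Linear(16, 5))
criterion = nn.BCEWithLogitsLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode="min", factor=0.5, patience=3  # assumed scheduler settings
)

patience, max_epochs = 10, 50
best_val, bad_epochs, best_state = float("inf"), 0, None

for epoch in range(max_epochs):
    # One full-batch training step.
    model.train()
    optimizer.zero_grad()
    loss = criterion(model(X_train), Y_train)
    loss.backward()
    optimizer.step()

    # Validation loss drives both the scheduler and early stopping.
    model.eval()
    with torch.no_grad():
        val_loss = criterion(model(X_val), Y_val).item()
    scheduler.step(val_loss)

    if val_loss < best_val:
        best_val, bad_epochs = val_loss, 0
        best_state = {k: v.clone() for k, v in model.state_dict().items()}
    else:
        bad_epochs += 1
        if bad_epochs >= patience:
            break  # no improvement for `patience` consecutive epochs

model.load_state_dict(best_state)  # restore the best checkpoint
print(f"stopped after epoch {epoch + 1}, best validation loss {best_val:.4f}")
```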
+## Data Processing Pipeline
+1. **Raw Sequences** (FASTA format) → Parse protein IDs and sequences
+2. **ESM-2 Encoding** → Generate 1280-dim embeddings using `facebook/esm2_t33_650M_UR50D`
+3. **GO Annotations** → Load and normalize GO terms
+4. **Label Preparation** → Create multi-label binary matrices with term propagation
+5. **Model Training** → Train separate models for MFO, BPO, CCO
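Step 4 is the least obvious part of the pipeline. The sketch below assumes "term propagation" means the usual GO convention that a protein annotated with a term is also annotated with all of that term's ancestors; the source does not say which relations are followed. The parent map, the `T:` term IDs, and the protein IDs are made-up placeholders, and the function names are hypothetical.

```python
def ancestors(term, parents):
    """All ancestors of `term`, following parent links upward."""
    seen = set()
    stack = [term]
    while stack:
        for parent in parents.get(stack.pop(), ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def build_label_matrix(annotations, parents, selected_terms):
    """Binary matrix: rows are proteins, columns follow `selected_terms`."""
    column = {term: j for j, term in enumerate(selected_terms)}
    protein_ids = sorted(annotations)
    matrix = [[0] * len(selected_terms) for _ in protein_ids]
    for i, pid in enumerate(protein_ids):
        propagated = set()
        for term in annotations[pid]:
            propagated.add(term)
            propagated |= ancestors(term, parents)
        for term in propagated:
            if term in column:  # terms outside the selection are dropped
                matrix[i][column[term]] = 1
    return protein_ids, matrix


# Toy hierarchy with placeholder IDs: T:child -> T:parent -> T:root
parents = {"T:child": ["T:parent"], "T:parent": ["T:root"]}
annotations = {"P1": ["T:child"], "P2": ["T:parent"]}
ids, Y = build_label_matrix(annotations, parents, ["T:root", "T:parent", "T:child"])
print(ids, Y)
```

Propagation is one plausible reason the average label counts in the table are much higher than the raw annotations per protein.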
+## Citation
+```bibtex
+@misc{nl45_cafa6_2026,
+  title={CAFA 6 Protein Function Prediction with ESM-2 Embeddings},
+  author={nl45},
+  year={2026},
+  publisher={Hugging Face},
+  howpublished={\url{https://huggingface.co/nl45/Protein1}}
+}
+```
+## Acknowledgments
+- **CAFA Challenge**: Critical Assessment of Functional Annotation
+- **ESM-2**: Evolutionary Scale Modeling from Meta AI
+- **Gene Ontology Consortium**: For GO term annotations
+## License
+MIT License
+## Contact
+For questions or collaboration: [Open a discussion](https://huggingface.co/nl45/Protein1/discussions)
+---
+**"BioBERT, I'm coming for you!"** 🔥🧬