GrimSqueaker
/

ModernProteinLM

ml-intern

Model card Files Files and versions

xet

Community

GrimSqueaker commited on 1 day ago

Commit

36bbb76

verified ·

1 Parent(s): d76120a

Upload CLUSTER_INSTRUCTIONS.md with huggingface_hub

Browse files

Files changed (1) hide show

CLUSTER_INSTRUCTIONS.md +315 -0

CLUSTER_INSTRUCTIONS.md ADDED Viewed

	@@ -0,0 +1,315 @@

+# ModernProteinLM — Private GPU Cluster Instructions
+## Overview
+ModernProteinLM is a next-generation protein encoder (<200M params) that combines:
+1. **ModernBERT architecture** (RoPE, Pre-LN, GeGLU, deep & narrow)
+2. **ELECTRA discriminative pre-training** (replaced token detection)
+3. **Span masking curriculum** (30% → 5% over training)
+This is the **first protein encoder** to combine all three proven techniques, targeting predictive downstream tasks (fluorescence, stability, solubility, structure, etc.).
+---
+## Quick Start
+```bash
+# 1. Clone / copy the codebase to your cluster
+# 2. Install dependencies
+pip install -r requirements.txt
+# 3. (Optional) Install FlashAttention for speedup
+pip install flash-attn --no-build-isolation
+# 4. Run pre-training
+bash run_pretrain.sh
+# 5. Run downstream fine-tuning + evaluation
+bash run_finetune.sh
+```
+---
+## Architecture Summary
+| Component | Value | Why |
+|-----------|-------|-----|
+| **Params** | ~150M | Competitive with ESM-2 150M |
+| **Layers** | 28 | Deep & narrow (NeoBERT/ModernBERT best practice) |
+| **Hidden** | 576 | Head dim = 64 (tensor core optimal) |
+| **Heads** | 9 | 576/9 = 64 |
+| **FFN** | 2304 | GeGLU (4× hidden) |
+| **Pos Emb** | RoPE (θ=10k) | Extrapolates to longer proteins |
+| **Norm** | Pre-LN | Stable at 28 layers |
+| **Dropout** | 0.0 | Following ESM-2 (data is noise enough) |
+| **Vocab** | 33 | ESM-2 compatible |
+| **Generator** | 320 hidden, 8L | 25% of discriminator (ELECTRA recipe) |
+**Discriminator params: ~150M | Generator params: ~25M**
+---
+## Stage 1: Pre-Training (ELECTRA)
+### Single GPU
+```bash
+CUDA_VISIBLE_DEVICES=0 bash run_pretrain.sh
+```
+### Multi-GPU (DDP)
+```bash
+# 4 GPUs
+torchrun --standalone --nnodes=1 --nproc_per_node=4 run_pretrain.sh
+```
+### SLURM
+```bash
+#SBATCH --gres=gpu:4
+#SBATCH --cpus-per-task=16
+#SBATCH --mem=128G
+module load cuda/12.1
+source ~/venv/bin/activate
+export NUM_GPUS=4
+export BATCH_SIZE=32        # Per-device
+export MAX_STEPS=500000
+export USE_AMP=1
+export USE_FLASH_ATTN=1
+bash run_pretrain.sh
+```
+### Key Environment Variables
+| Variable | Default | Description |
+|----------|---------|-------------|
+| `NUM_GPUS` | 1 | Number of GPUs |
+| `BATCH_SIZE` | 64 | Per-device batch size |
+| `MAX_STEPS` | 100000 | Total training steps |
+| `LR` | 5e-4 | Peak learning rate |
+| `MASK_START` | 0.30 | Initial mask ratio |
+| `MASK_END` | 0.05 | Final mask ratio |
+| `USE_AMP` | 1 | bf16 mixed precision |
+| `USE_FLASH_ATTN` | 1 | FlashAttention (requires install) |
+| `GRADIENT_CHECKPOINTING` | 0 | Trade compute for memory |
+| `USE_TRACKIO` | 0 | Enable experiment tracking |
+### Data Sources
+Pre-training pulls from HuggingFace datasets by default:
+- `lamm-mit/protein_secondary_structure_from_PDB` (~126k sequences)
+- `adamstogsdill/pdb_protein_dataset_100_4000_1024`
+**For full pre-training**, set `USE_STREAMING=1` and add UniRef50/UniRef90:
+```bash
+export USE_STREAMING=1
+# Or provide local UniRef FASTA:
+export UNIREF_PATH=/path/to/uniref50.fasta
+```
+To add UniRef support, modify `load_sequences()` in `train_pretrain.py`:
+```python
+from Bio import SeqIO
+def load_uniref_fasta(path, max_seqs=5000000):
+    sequences = []
+    for record in SeqIO.parse(path, "fasta"):
+        seq = str(record.seq)
+        if len(seq) >= 20 and len(seq) <= 1024:
+            sequences.append(seq)
+        if len(sequences) >= max_seqs:
+            break
+    return sequences
+```
+### Expected Pre-Training Time
+| Hardware | Batch Size | Steps/Day | 100K Steps | 500K Steps |
+|----------|-----------|-----------|------------|------------|
+| 1× A100 80GB | 128 | ~50K | 2 days | 10 days |
+| 4× A100 80GB | 128×4 | ~200K | 12 hours | 2.5 days |
+| 8× A100 80GB | 128×8 | ~400K | 6 hours | ~30 hours |
+*With bf16 AMP and FlashAttention*
+---
+## Stage 2: Downstream Fine-Tuning
+After pre-training completes, fine-tune on specific tasks:
+```bash
+# Fine-tune on all available tasks
+bash run_finetune.sh
+# Or specific tasks
+PRETRAIN_DIR=./outputs/pretrain/final bash run_finetune.sh
+```
+### Supported Benchmark Tasks
+| Task | Type | Metric | Baseline (ESM-2 150M) | Target |
+|------|------|--------|----------------------|--------|
+| **Fluorescence** | Regression | Spearman ρ | 0.68 | ≥ 0.75 |
+| **Stability** | Regression | Spearman ρ | 0.79 | ≥ 0.85 |
+| **Solubility** | Classification | Accuracy | ~74% | ≥ 80% |
+| **Remote Homology** | Classification | Accuracy | ~20% | ≥ 25% |
+### Fine-Tuning Strategy
+The script uses **layer-wise learning rate decay**:
+- Task head: `lr`
+- Last 4 transformer layers: `lr × 0.5`
+- Earlier layers + embeddings: `lr × 0.1`
+This is critical for small downstream datasets (fluorescence has ~21k samples).
+For even smaller datasets, add LoRA:
+```bash
+# Install PEFT
+pip install peft
+# In train_finetune.py, replace full fine-tuning with:
+from peft import LoraConfig, get_peft_model
+lora_config = LoraConfig(
+    r=8, lora_alpha=16,
+    target_modules=["qkv_proj", "out_proj", "gate_proj", "up_proj", "down_proj"],
+    lora_dropout=0.0,
+    bias="none",
+)
+model = get_peft_model(model, lora_config)
+```
+---
+## Stage 3: Pushing to HuggingFace Hub
+After fine-tuning, push the pretrained encoder for community use:
+```python
+from modeling_modern_protein import ModernProteinLM
+from transformers import PreTrainedTokenizerFast
+# Load your trained model
+model = ModernProteinLM.from_pretrained("./outputs/pretrain/final")
+# Push to Hub
+model.push_to_hub("your-username/ModernProteinLM-150M")
+# With a task-specific head
+from modeling_modern_protein import ModernProteinLMForSequenceClassification
+cls_model = ModernProteinLMForSequenceClassification.from_pretrained(
+    "./outputs/finetune/fluorescence/best"
+)
+cls_model.push_to_hub("your-username/ModernProteinLM-fluorescence")
+```
+---
+## Expected Improvements Over ESM-2 150M
+| Technique | Source | Expected Gain |
+|-----------|--------|--------------|
+| ELECTRA vs MLM | ELECTRA paper | +3-5% on discriminative tasks |
+| GeGLU vs GELU | ModernBERT | +1-2% |
+| Deep & narrow (28L) | NeoBERT | +1-3% on embeddings |
+| Span masking | SpanBERT analogy | +1-2% on structure tasks |
+| Curriculum 30%→5% | mmBERT | Faster convergence |
+| **Combined (conservative)** | — | **+7-14% on predictive benchmarks** |
+---
+## Troubleshooting
+### OOM during pre-training
+```bash
+# Reduce per-device batch size
+export BATCH_SIZE=32
+# Enable gradient checkpointing
+export GRADIENT_CHECKPOINTING=1
+# Reduce sequence length
+export MAX_SEQ_LENGTH=512
+```
+### FlashAttention install fails
+```bash
+# Skip FlashAttention (slower but works)
+export USE_FLASH_ATTN=0
+# Or install from prebuilt wheel
+pip install flash-attn --find-links https://github.com/Dao-AILab/flash-attention/releases
+```
+### Slow data loading
+```bash
+# Increase workers
+export NUM_WORKERS=16
+# Pre-tokenize and cache
+python -c "
+from train_pretrain import load_sequences, ProteinTokenizer
+import pickle
+tokenizer = ProteinTokenizer()
+seqs = load_sequences(None)
+tokenized = [tokenizer.encode(s) for s in seqs]
+pickle.dump(tokenized, open('tokenized_cache.pkl', 'wb'))
+"
+```
+---
+## File Reference
+```
+modern_protein_lm/
+├── modeling_modern_protein.py    # Core architecture (ModernBERT-style + ELECTRA)
+├── train_pretrain.py             # ELECTRA pre-training (supports DDP, AMP)
+├── train_finetune.py             # Downstream fine-tuning (layer-wise LR)
+├── run_pretrain.sh               # Launch script for pre-training
+├── run_finetune.sh               # Launch script for fine-tuning
+├── requirements.txt              # Dependencies
+├── README.md                     # Architecture docs
+└── CLUSTER_INSTRUCTIONS.md       # This file
+```
+---
+## Citation
+If you use this architecture or achieve SOTA results, please cite:
+```bibtex
+@article{lin2023evolutionary,
+  title={Language models of protein sequences at the scale of evolution enable accurate structure prediction},
+  author={Lin, Zeming and Akin, Halil and Rao, Roshan and Hie, Brian and Zhu, Zhongkai and Lu, Wenting and Smetanin, Nikita and Verkuil, Robert and Kabeli, Ori and Shmueli, Yaniv and others},
+  journal={Science},
+  year={2023}
+}
+@article{warner2024modernbert,
+  title={Smarter, Better, Faster, Longer: A Modern Bidirectional Encoder for Fast, Memory Efficient and Long Context Finetuning and Inference},
+  author={Warner, Benjamin and Chalkidis, Ilias and Dadic, Jon Ander and others},
+  journal={arXiv preprint arXiv:2412.13663},
+  year={2024}
+}
+@inproceedings{clark2020electra,
+  title={ELECTRA: Pre-training text encoders as discriminators rather than generators},
+  author={Clark, Kevin and Luong, Minh-Thang and Le, Quoc V and Manning, Christopher D},
+  booktitle={ICLR},
+  year={2020}
+}
+```