# ModernProteinLM — Private GPU Cluster Instructions

## Overview

ModernProteinLM is a next-generation protein encoder (<200M params) that combines:

1. **ModernBERT architecture** (RoPE, Pre-LN, GeGLU, deep & narrow)
2. **ELECTRA discriminative pre-training** (replaced token detection)
3. **Span masking curriculum** (30% → 5% over training)

This is the **first protein encoder** to combine all three proven techniques, targeting predictive downstream tasks (fluorescence, stability, solubility, structure, etc.).

---

## Quick Start

```bash
# 1. Clone / copy the codebase to your cluster

# 2. Install dependencies
pip install -r requirements.txt

# 3. (Optional) Install FlashAttention for speedup
pip install flash-attn --no-build-isolation

# 4. Run pre-training
bash run_pretrain.sh

# 5. Run downstream fine-tuning + evaluation
bash run_finetune.sh
```

---

## Architecture Summary

| Component | Value | Why |
|-----------|-------|-----|
| **Params** | ~150M | Competitive with ESM-2 150M |
| **Layers** | 28 | Deep & narrow (NeoBERT/ModernBERT best practice) |
| **Hidden** | 576 | Head dim = 64 (tensor core optimal) |
| **Heads** | 9 | 576/9 = 64 |
| **FFN** | 2304 | GeGLU (4× hidden) |
| **Pos Emb** | RoPE (θ=10k) | Extrapolates to longer proteins |
| **Norm** | Pre-LN | Stable at 28 layers |
| **Dropout** | 0.0 | Following ESM-2 (the data is noisy enough) |
| **Vocab** | 33 | ESM-2 compatible |
| **Generator** | 320 hidden, 8L | 25% of discriminator (ELECTRA recipe) |

**Discriminator params: ~150M | Generator params: ~25M**

---

## Stage 1: Pre-Training (ELECTRA)

### Single GPU

```bash
CUDA_VISIBLE_DEVICES=0 bash run_pretrain.sh
```

### Multi-GPU (DDP)

```bash
# 4 GPUs (run_pretrain.sh picks up NUM_GPUS, as in the SLURM example below)
export NUM_GPUS=4
bash run_pretrain.sh

# Or launch the training script directly:
torchrun --standalone --nnodes=1 --nproc_per_node=4 train_pretrain.py
```

### SLURM

```bash
#!/bin/bash
#SBATCH --gres=gpu:4
#SBATCH --cpus-per-task=16
#SBATCH --mem=128G

module load cuda/12.1
source ~/venv/bin/activate

export NUM_GPUS=4
export BATCH_SIZE=32   # Per-device
export MAX_STEPS=500000
export USE_AMP=1
export USE_FLASH_ATTN=1

bash run_pretrain.sh
```

### Key Environment Variables

| Variable | Default | Description |
|----------|---------|-------------|
| `NUM_GPUS` | 1 | Number of GPUs |
| `BATCH_SIZE` | 64 | Per-device batch size |
| `MAX_STEPS` | 100000 | Total training steps |
| `LR` | 5e-4 | Peak learning rate |
| `MASK_START` | 0.30 | Initial mask ratio |
| `MASK_END` | 0.05 | Final mask ratio |
| `USE_AMP` | 1 | bf16 mixed precision |
| `USE_FLASH_ATTN` | 1 | FlashAttention (requires install) |
| `GRADIENT_CHECKPOINTING` | 0 | Trade compute for memory |
| `USE_TRACKIO` | 0 | Enable experiment tracking |
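The `MASK_START` / `MASK_END` pair drives the span-masking curriculum from the overview: the mask ratio anneals from 30% at the start of training down to 5% at the end. Below is a minimal sketch of one way such a schedule can work, assuming a linear decay and geometric span lengths in the spirit of SpanBERT; the function names, the mean span length of 3, and the length distribution are illustrative assumptions, not the exact implementation in `train_pretrain.py`.

```python
import numpy as np


def mask_ratio_at(step, max_steps, start=0.30, end=0.05):
    """Linearly anneal the mask ratio from `start` to `end` over training."""
    frac = min(step / max(max_steps, 1), 1.0)
    return start + (end - start) * frac


def sample_span_mask(seq_len, mask_ratio, mean_span=3, rng=None):
    """Mask contiguous spans (geometric lengths, illustrative choice)
    until ~mask_ratio of positions are covered."""
    rng = rng or np.random.default_rng()
    mask = np.zeros(seq_len, dtype=bool)
    budget = int(round(seq_len * mask_ratio))
    while mask.sum() < budget:
        # Cap each span so the total never overshoots the budget
        span = min(int(rng.geometric(1.0 / mean_span)), budget - int(mask.sum()))
        start_pos = int(rng.integers(0, max(seq_len - span, 1)))
        mask[start_pos:start_pos + span] = True
    return mask


# At step 50k of a 100k-step run the ratio is halfway between 0.30 and 0.05:
print(mask_ratio_at(50_000, 100_000))                              # 0.175
print(sample_span_mask(64, mask_ratio_at(50_000, 100_000)).sum())  # 11 masked positions
```

In the ELECTRA setup, the masked positions are the ones the generator fills in and the discriminator must then flag as replaced.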
### Data Sources

Pre-training pulls from HuggingFace datasets by default:

- `lamm-mit/protein_secondary_structure_from_PDB` (~126k sequences)
- `adamstogsdill/pdb_protein_dataset_100_4000_1024`

**For full pre-training**, set `USE_STREAMING=1` and add UniRef50/UniRef90:

```bash
export USE_STREAMING=1

# Or provide local UniRef FASTA:
export UNIREF_PATH=/path/to/uniref50.fasta
```

To add UniRef support, modify `load_sequences()` in `train_pretrain.py`:

```python
from Bio import SeqIO


def load_uniref_fasta(path, max_seqs=5000000):
    """Read sequences from a UniRef FASTA, keeping lengths in [20, 1024]."""
    sequences = []
    for record in SeqIO.parse(path, "fasta"):
        seq = str(record.seq)
        if 20 <= len(seq) <= 1024:
            sequences.append(seq)
        if len(sequences) >= max_seqs:
            break
    return sequences
```

### Expected Pre-Training Time

| Hardware | Batch Size | Steps/Day | 100K Steps | 500K Steps |
|----------|-----------|-----------|------------|------------|
| 1× A100 80GB | 128 | ~50K | 2 days | 10 days |
| 4× A100 80GB | 128×4 | ~200K | 12 hours | 2.5 days |
| 8× A100 80GB | 128×8 | ~400K | 6 hours | ~30 hours |

*With bf16 AMP and FlashAttention*

---

## Stage 2: Downstream Fine-Tuning

After pre-training completes, fine-tune on specific tasks:

```bash
# Fine-tune on all available tasks
bash run_finetune.sh

# Or start from a specific pre-trained checkpoint
PRETRAIN_DIR=./outputs/pretrain/final bash run_finetune.sh
```

### Supported Benchmark Tasks

| Task | Type | Metric | Baseline (ESM-2 150M) | Target |
|------|------|--------|----------------------|--------|
| **Fluorescence** | Regression | Spearman ρ | 0.68 | ≥ 0.75 |
| **Stability** | Regression | Spearman ρ | 0.79 | ≥ 0.85 |
| **Solubility** | Classification | Accuracy | ~74% | ≥ 80% |
| **Remote Homology** | Classification | Accuracy | ~20% | ≥ 25% |

### Fine-Tuning Strategy

The script uses **layer-wise learning rate decay** (see the sketch below):

- Task head: `lr`
- Last 4 transformer layers: `lr × 0.5`
- Earlier layers + embeddings: `lr × 0.1`

This is critical for small downstream datasets (fluorescence has ~21k samples).
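The three tiers above map directly onto optimizer parameter groups. The sketch below shows one way to build them; the attribute names `model.head`, `model.encoder.layers`, and `model.embeddings`, as well as the `base_lr=1e-4` in the usage comment, are illustrative assumptions and should be matched to the actual modules in `modeling_modern_protein.py` / `train_finetune.py`.

```python
import torch


def layerwise_param_groups(model, base_lr):
    """Build parameter groups: task head at base_lr, last 4 encoder layers
    at 0.5x, earlier layers plus embeddings at 0.1x.

    Assumes `model.head`, `model.encoder.layers`, and `model.embeddings`
    exist; adapt the attribute names to the real model.
    """
    layers = list(model.encoder.layers)
    early_params = [p for layer in layers[:-4] for p in layer.parameters()]
    early_params += list(model.embeddings.parameters())
    return [
        {"params": list(model.head.parameters()), "lr": base_lr},
        {"params": [p for layer in layers[-4:] for p in layer.parameters()],
         "lr": base_lr * 0.5},
        {"params": early_params, "lr": base_lr * 0.1},
    ]


# Illustrative usage:
# optimizer = torch.optim.AdamW(layerwise_param_groups(model, base_lr=1e-4),
#                               weight_decay=0.01)
```

Keeping the head at the full rate while shrinking the rate toward the embeddings limits how far the pre-trained representations drift on a ~21k-sample task.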
For even smaller datasets, add LoRA:

```bash
# Install PEFT
pip install peft
```

```python
# In train_finetune.py, replace full fine-tuning with:
from peft import LoraConfig, get_peft_model

lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    target_modules=["qkv_proj", "out_proj", "gate_proj", "up_proj", "down_proj"],
    lora_dropout=0.0,
    bias="none",
)
model = get_peft_model(model, lora_config)
```

---

## Stage 3: Pushing to HuggingFace Hub

After fine-tuning, push the pretrained encoder for community use:

```python
from modeling_modern_protein import ModernProteinLM
from transformers import PreTrainedTokenizerFast

# Load your trained model
model = ModernProteinLM.from_pretrained("./outputs/pretrain/final")

# Push to Hub
model.push_to_hub("your-username/ModernProteinLM-150M")

# With a task-specific head
from modeling_modern_protein import ModernProteinLMForSequenceClassification

cls_model = ModernProteinLMForSequenceClassification.from_pretrained(
    "./outputs/finetune/fluorescence/best"
)
cls_model.push_to_hub("your-username/ModernProteinLM-fluorescence")
```

---

## Expected Improvements Over ESM-2 150M

| Technique | Source | Expected Gain |
|-----------|--------|--------------|
| ELECTRA vs MLM | ELECTRA paper | +3-5% on discriminative tasks |
| GeGLU vs GELU | ModernBERT | +1-2% |
| Deep & narrow (28L) | NeoBERT | +1-3% on embeddings |
| Span masking | SpanBERT analogy | +1-2% on structure tasks |
| Curriculum 30%→5% | mmBERT | Faster convergence |
| **Combined (conservative)** | — | **+7-14% on predictive benchmarks** |

---

## Troubleshooting

### OOM during pre-training

```bash
# Reduce per-device batch size
export BATCH_SIZE=32

# Enable gradient checkpointing
export GRADIENT_CHECKPOINTING=1

# Reduce sequence length
export MAX_SEQ_LENGTH=512
```

### FlashAttention install fails

```bash
# Skip FlashAttention (slower but works)
export USE_FLASH_ATTN=0

# Or install from a prebuilt wheel
pip install flash-attn --find-links https://github.com/Dao-AILab/flash-attention/releases
```

### Slow data loading

```bash
# Increase workers
export NUM_WORKERS=16

# Pre-tokenize and cache
python -c "
from train_pretrain import load_sequences, ProteinTokenizer
import pickle

tokenizer = ProteinTokenizer()
seqs = load_sequences(None)
tokenized = [tokenizer.encode(s) for s in seqs]
pickle.dump(tokenized, open('tokenized_cache.pkl', 'wb'))
"
```

---

## File Reference

```
modern_protein_lm/
├── modeling_modern_protein.py   # Core architecture (ModernBERT-style + ELECTRA)
├── train_pretrain.py            # ELECTRA pre-training (supports DDP, AMP)
├── train_finetune.py            # Downstream fine-tuning (layer-wise LR)
├── run_pretrain.sh              # Launch script for pre-training
├── run_finetune.sh              # Launch script for fine-tuning
├── requirements.txt             # Dependencies
├── README.md                    # Architecture docs
└── CLUSTER_INSTRUCTIONS.md      # This file
```

---

## Citation

If you use this architecture or achieve SOTA results, please cite:

```bibtex
@article{lin2023evolutionary,
  title={Evolutionary-scale prediction of atomic-level protein structure with a language model},
  author={Lin, Zeming and Akin, Halil and Rao, Roshan and Hie, Brian and Zhu, Zhongkai and Lu, Wenting and Smetanin, Nikita and Verkuil, Robert and Kabeli, Ori and Shmueli, Yaniv and others},
  journal={Science},
  year={2023}
}

@article{warner2024modernbert,
  title={Smarter, Better, Faster, Longer: A Modern Bidirectional Encoder for Fast, Memory Efficient, and Long Context Finetuning and Inference},
  author={Warner, Benjamin and Chaffin, Antoine and Clavi{\'e}, Benjamin and others},
  journal={arXiv preprint arXiv:2412.13663},
  year={2024}
}

@inproceedings{clark2020electra,
  title={ELECTRA: Pre-training text encoders as discriminators rather than generators},
  author={Clark, Kevin and Luong, Minh-Thang and Le, Quoc V and Manning, Christopher D},
  booktitle={ICLR},
  year={2020}
}
```