# ModernProteinLM – Private GPU Cluster Instructions
|
|
| ## Overview |
|
|
| ModernProteinLM is a next-generation protein encoder (<200M params) that combines: |
| 1. **ModernBERT architecture** (RoPE, Pre-LN, GeGLU, deep & narrow) |
| 2. **ELECTRA discriminative pre-training** (replaced token detection) |
3. **Span masking curriculum** (30% → 5% over training; sketched below)
|
|
| This is the **first protein encoder** to combine all three proven techniques, targeting predictive downstream tasks (fluorescence, stability, solubility, structure, etc.). |
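The masking curriculum (item 3) anneals the mask ratio as training progresses. As a minimal sketch, assuming a linear schedule (the helper name is hypothetical; the real schedule lives in `train_pretrain.py`):

```python
def mask_ratio_at(step: int, max_steps: int,
                  start: float = 0.30, end: float = 0.05) -> float:
    """Anneal the span-mask ratio linearly from 30% down to 5%."""
    frac = min(step / max_steps, 1.0)
    return start + (end - start) * frac
```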
|
|
| --- |
|
|
| ## Quick Start |
|
|
| ```bash |
| # 1. Clone / copy the codebase to your cluster |
| # 2. Install dependencies |
| pip install -r requirements.txt |
| |
| # 3. (Optional) Install FlashAttention for speedup |
| pip install flash-attn --no-build-isolation |
| |
| # 4. Run pre-training |
| bash run_pretrain.sh |
| |
| # 5. Run downstream fine-tuning + evaluation |
| bash run_finetune.sh |
| ``` |
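Before launching, it can help to confirm that PyTorch actually sees the cluster's GPUs and supports bf16 (used when `USE_AMP=1`):

```python
import torch

print("CUDA available:", torch.cuda.is_available())
print("GPU count:    ", torch.cuda.device_count())
print("bf16 support: ", torch.cuda.is_bf16_supported())
```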
|
|
| --- |
|
|
| ## Architecture Summary |
|
|
| | Component | Value | Why | |
| |-----------|-------|-----| |
| | **Params** | ~150M | Competitive with ESM-2 150M | |
| | **Layers** | 28 | Deep & narrow (NeoBERT/ModernBERT best practice) | |
| | **Hidden** | 576 | Head dim = 64 (tensor core optimal) | |
| | **Heads** | 9 | 576/9 = 64 | |
| **FFN** | 2304 | GeGLU (4× hidden) |
| **Pos Emb** | RoPE (θ=10k) | Extrapolates to longer proteins |
| | **Norm** | Pre-LN | Stable at 28 layers | |
| **Dropout** | 0.0 | Following ESM-2 (the data supplies enough regularization) |
| | **Vocab** | 33 | ESM-2 compatible | |
| | **Generator** | 320 hidden, 8L | 25% of discriminator (ELECTRA recipe) | |
|
|
| **Discriminator params: ~150M | Generator params: ~25M** |
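For reference, the table maps onto a configuration object roughly like the one below. Field names are illustrative, not the exact attributes in `modeling_modern_protein.py`:

```python
from dataclasses import dataclass

@dataclass
class ModernProteinConfig:
    # Discriminator (~150M params)
    num_layers: int = 28            # deep & narrow
    hidden_size: int = 576          # 9 heads x head dim 64
    num_heads: int = 9
    ffn_size: int = 2304            # GeGLU, 4x hidden
    rope_theta: float = 10_000.0
    dropout: float = 0.0
    vocab_size: int = 33            # ESM-2 compatible
    # Generator (~25M params, ELECTRA recipe)
    generator_hidden_size: int = 320
    generator_num_layers: int = 8
```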
|
|
| --- |
|
|
| ## Stage 1: Pre-Training (ELECTRA) |
|
|
| ### Single GPU |
|
|
| ```bash |
| CUDA_VISIBLE_DEVICES=0 bash run_pretrain.sh |
| ``` |
|
|
| ### Multi-GPU (DDP) |
|
|
| ```bash |
# 4 GPUs: torchrun must launch the Python entry point, not the shell wrapper
torchrun --standalone --nnodes=1 --nproc_per_node=4 train_pretrain.py
# Equivalently, let run_pretrain.sh handle the launch:
# NUM_GPUS=4 bash run_pretrain.sh
| ``` |
|
|
| ### SLURM |
|
|
| ```bash |
| #SBATCH --gres=gpu:4 |
| #SBATCH --cpus-per-task=16 |
| #SBATCH --mem=128G |
| |
| module load cuda/12.1 |
| source ~/venv/bin/activate |
| |
| export NUM_GPUS=4 |
| export BATCH_SIZE=32 # Per-device |
| export MAX_STEPS=500000 |
| export USE_AMP=1 |
| export USE_FLASH_ATTN=1 |
| |
| bash run_pretrain.sh |
| ``` |
|
|
| ### Key Environment Variables |
|
|
| | Variable | Default | Description | |
| |----------|---------|-------------| |
| | `NUM_GPUS` | 1 | Number of GPUs | |
| | `BATCH_SIZE` | 64 | Per-device batch size | |
| | `MAX_STEPS` | 100000 | Total training steps | |
| | `LR` | 5e-4 | Peak learning rate | |
| | `MASK_START` | 0.30 | Initial mask ratio | |
| | `MASK_END` | 0.05 | Final mask ratio | |
| | `USE_AMP` | 1 | bf16 mixed precision | |
| | `USE_FLASH_ATTN` | 1 | FlashAttention (requires install) | |
| | `GRADIENT_CHECKPOINTING` | 0 | Trade compute for memory | |
| | `USE_TRACKIO` | 0 | Enable experiment tracking | |
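These are plain environment overrides; a sketch of how the trainer might consume them (the repo's exact parsing may differ):

```python
import os

batch_size = int(os.environ.get("BATCH_SIZE", 64))
max_steps = int(os.environ.get("MAX_STEPS", 100_000))
lr = float(os.environ.get("LR", 5e-4))
mask_start = float(os.environ.get("MASK_START", 0.30))
mask_end = float(os.environ.get("MASK_END", 0.05))
use_amp = os.environ.get("USE_AMP", "1") == "1"
use_flash_attn = os.environ.get("USE_FLASH_ATTN", "1") == "1"
```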
|
|
| ### Data Sources |
|
|
| Pre-training pulls from HuggingFace datasets by default: |
| - `lamm-mit/protein_secondary_structure_from_PDB` (~126k sequences) |
| - `adamstogsdill/pdb_protein_dataset_100_4000_1024` |
|
|
| **For full pre-training**, set `USE_STREAMING=1` and add UniRef50/UniRef90: |
|
|
| ```bash |
| export USE_STREAMING=1 |
| # Or provide local UniRef FASTA: |
| export UNIREF_PATH=/path/to/uniref50.fasta |
| ``` |
|
|
| To add UniRef support, modify `load_sequences()` in `train_pretrain.py`: |
|
|
| ```python |
from Bio import SeqIO  # biopython

def load_uniref_fasta(path, max_seqs=5_000_000):
    """Collect UniRef sequences with lengths in [20, 1024]."""
    sequences = []
    for record in SeqIO.parse(path, "fasta"):
        seq = str(record.seq)
        if 20 <= len(seq) <= 1024:
            sequences.append(seq)
            if len(sequences) >= max_seqs:
                break
    return sequences
| ``` |
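One way to wire this in, assuming `load_sequences()` accumulates a `sequences` list before returning:

```python
import os

# Inside load_sequences(), after the HuggingFace datasets are loaded:
uniref_path = os.environ.get("UNIREF_PATH")
if uniref_path:
    sequences.extend(load_uniref_fasta(uniref_path))
```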
|
|
| ### Expected Pre-Training Time |
|
|
| | Hardware | Batch Size | Steps/Day | 100K Steps | 500K Steps | |
| |----------|-----------|-----------|------------|------------| |
| 1× A100 80GB | 128 | ~50K | 2 days | 10 days |
| 4× A100 80GB | 128×4 | ~200K | 12 hours | 2.5 days |
| 8× A100 80GB | 128×8 | ~400K | 6 hours | ~30 hours |
|
|
| *With bf16 AMP and FlashAttention* |
|
|
| --- |
|
|
| ## Stage 2: Downstream Fine-Tuning |
|
|
| After pre-training completes, fine-tune on specific tasks: |
|
|
| ```bash |
| # Fine-tune on all available tasks |
| bash run_finetune.sh |
| |
# Or point at a specific pre-training checkpoint
PRETRAIN_DIR=./outputs/pretrain/final bash run_finetune.sh
| ``` |
|
|
| ### Supported Benchmark Tasks |
|
|
| | Task | Type | Metric | Baseline (ESM-2 150M) | Target | |
| |------|------|--------|----------------------|--------| |
| **Fluorescence** | Regression | Spearman ρ | 0.68 | ≥ 0.75 |
| **Stability** | Regression | Spearman ρ | 0.79 | ≥ 0.85 |
| **Solubility** | Classification | Accuracy | ~74% | ≥ 80% |
| **Remote Homology** | Classification | Accuracy | ~20% | ≥ 25% |
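The regression tasks are scored with Spearman ρ, which can be computed with SciPy; a minimal example with dummy scores:

```python
from scipy.stats import spearmanr

predictions = [0.7, 0.2, 0.9, 0.4]  # model outputs, one per sequence
targets = [0.6, 0.1, 0.8, 0.5]      # ground-truth labels
rho, _ = spearmanr(predictions, targets)
print(f"Spearman rho: {rho:.3f}")
```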
|
|
| ### Fine-Tuning Strategy |
|
|
| The script uses **layer-wise learning rate decay**: |
| - Task head: `lr` |
| - Last 4 transformer layers: `lr Γ 0.5` |
| - Earlier layers + embeddings: `lr Γ 0.1` |
|
|
| This is critical for small downstream datasets (fluorescence has ~21k samples). |
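A sketch of how those three tiers could be built for `torch.optim.AdamW`. The attribute names (`model.encoder.layers`, `model.head`) are assumptions about the codebase, not its actual API:

```python
def layerwise_param_groups(model, lr):
    """Split parameters into the three LR tiers described above."""
    # Assumed attributes: model.encoder.layers (transformer stack), model.head
    last4_ids = {id(p) for layer in model.encoder.layers[-4:]
                 for p in layer.parameters()}
    head_ids = {id(p) for p in model.head.parameters()}

    head, last4, rest = [], [], []
    for p in model.parameters():
        if not p.requires_grad:
            continue
        if id(p) in head_ids:
            head.append(p)
        elif id(p) in last4_ids:
            last4.append(p)
        else:
            rest.append(p)  # earlier layers + embeddings
    return [
        {"params": head, "lr": lr},         # task head: full LR
        {"params": last4, "lr": lr * 0.5},  # last 4 transformer layers
        {"params": rest, "lr": lr * 0.1},   # everything else
    ]

# usage: optimizer = torch.optim.AdamW(layerwise_param_groups(model, lr))
```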
|
|
| For even smaller datasets, add LoRA: |
|
|
```bash
# Install PEFT
pip install peft
```

Then, in `train_finetune.py`, replace full fine-tuning with:

```python
from peft import LoraConfig, get_peft_model

lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    target_modules=["qkv_proj", "out_proj", "gate_proj", "up_proj", "down_proj"],
    lora_dropout=0.0,
    bias="none",
)
model = get_peft_model(model, lora_config)
```
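After wrapping, `model.print_trainable_parameters()` (a PEFT utility) confirms that only the low-rank adapters, a small fraction of the ~150M encoder weights, are being updated.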
|
|
| --- |
|
|
| ## Stage 3: Pushing to HuggingFace Hub |
|
|
| After fine-tuning, push the pretrained encoder for community use: |
|
|
| ```python |
| from modeling_modern_protein import ModernProteinLM |
| |
| # Load your trained model |
| model = ModernProteinLM.from_pretrained("./outputs/pretrain/final") |
| |
| # Push to Hub |
| model.push_to_hub("your-username/ModernProteinLM-150M") |
| |
| # With a task-specific head |
| from modeling_modern_protein import ModernProteinLMForSequenceClassification |
| cls_model = ModernProteinLMForSequenceClassification.from_pretrained( |
| "./outputs/finetune/fluorescence/best" |
| ) |
| cls_model.push_to_hub("your-username/ModernProteinLM-fluorescence") |
| ``` |
|
|
| --- |
|
|
| ## Expected Improvements Over ESM-2 150M |
|
|
| | Technique | Source | Expected Gain | |
| |-----------|--------|--------------| |
| | ELECTRA vs MLM | ELECTRA paper | +3-5% on discriminative tasks | |
| | GeGLU vs GELU | ModernBERT | +1-2% | |
| | Deep & narrow (28L) | NeoBERT | +1-3% on embeddings | |
| | Span masking | SpanBERT analogy | +1-2% on structure tasks | |
| Curriculum 30% → 5% | mmBERT | Faster convergence |
| | **Combined (conservative)** | β | **+7-14% on predictive benchmarks** | |
|
|
| --- |
|
|
| ## Troubleshooting |
|
|
| ### OOM during pre-training |
|
|
| ```bash |
| # Reduce per-device batch size |
| export BATCH_SIZE=32 |
| |
| # Enable gradient checkpointing |
| export GRADIENT_CHECKPOINTING=1 |
| |
| # Reduce sequence length |
| export MAX_SEQ_LENGTH=512 |
| ``` |
|
|
| ### FlashAttention install fails |
|
|
| ```bash |
| # Skip FlashAttention (slower but works) |
| export USE_FLASH_ATTN=0 |
| |
# Or install a prebuilt wheel matching your Python/torch/CUDA versions,
# downloaded from https://github.com/Dao-AILab/flash-attention/releases
pip install ./flash_attn-*.whl
| ``` |
|
|
| ### Slow data loading |
|
|
| ```bash |
| # Increase workers |
| export NUM_WORKERS=16 |
| |
| # Pre-tokenize and cache |
python -c "
from train_pretrain import load_sequences, ProteinTokenizer
import pickle

tokenizer = ProteinTokenizer()
seqs = load_sequences(None)
tokenized = [tokenizer.encode(s) for s in seqs]
with open('tokenized_cache.pkl', 'wb') as f:
    pickle.dump(tokenized, f)
"
| ``` |
|
|
| --- |
|
|
| ## File Reference |
|
|
| ``` |
modern_protein_lm/
├── modeling_modern_protein.py   # Core architecture (ModernBERT-style + ELECTRA)
├── train_pretrain.py            # ELECTRA pre-training (supports DDP, AMP)
├── train_finetune.py            # Downstream fine-tuning (layer-wise LR)
├── run_pretrain.sh              # Launch script for pre-training
├── run_finetune.sh              # Launch script for fine-tuning
├── requirements.txt             # Dependencies
├── README.md                    # Architecture docs
└── CLUSTER_INSTRUCTIONS.md      # This file
| ``` |
|
|
| --- |
|
|
| ## Citation |
|
|
| If you use this architecture or achieve SOTA results, please cite: |
|
|
| ```bibtex |
@article{lin2023evolutionary,
  title={Evolutionary-scale prediction of atomic-level protein structure with a language model},
  author={Lin, Zeming and Akin, Halil and Rao, Roshan and Hie, Brian and Zhu, Zhongkai and Lu, Wenting and Smetanin, Nikita and Verkuil, Robert and Kabeli, Ori and Shmueli, Yaniv and others},
  journal={Science},
  volume={379},
  number={6637},
  pages={1123--1130},
  year={2023}
}

@article{warner2024modernbert,
  title={Smarter, Better, Faster, Longer: A Modern Bidirectional Encoder for Fast, Memory Efficient and Long Context Finetuning and Inference},
  author={Warner, Benjamin and Chaffin, Antoine and Clavi{\'e}, Benjamin and others},
  journal={arXiv preprint arXiv:2412.13663},
  year={2024}
}
| |
| @inproceedings{clark2020electra, |
| title={ELECTRA: Pre-training text encoders as discriminators rather than generators}, |
| author={Clark, Kevin and Luong, Minh-Thang and Le, Quoc V and Manning, Christopher D}, |
| booktitle={ICLR}, |
| year={2020} |
| } |
| ``` |
|
|