
ModernProteinLM — Private GPU Cluster Instructions

Overview

ModernProteinLM is a next-generation protein encoder (<200M params) that combines:

  1. ModernBERT architecture (RoPE, Pre-LN, GeGLU, deep & narrow)
  2. ELECTRA discriminative pre-training (replaced token detection)
  3. Span masking curriculum (30% → 5% over training)

To our knowledge, this is the first protein encoder to combine all three of these proven techniques, targeting predictive downstream tasks (fluorescence, stability, solubility, structure, etc.).
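For intuition, here is a minimal sketch of the mask-ratio curriculum, assuming a linear schedule (the function name and schedule shape are illustrative; the real implementation lives in train_pretrain.py):

def mask_ratio(step, max_steps, start=0.30, end=0.05):
    # Linearly anneal the span-mask ratio from `start` down to `end`.
    frac = min(step / max_steps, 1.0)
    return start + (end - start) * frac

print(mask_ratio(0, 100_000))        # 0.30 at step 0
print(mask_ratio(50_000, 100_000))   # 0.175 halfway through
print(mask_ratio(100_000, 100_000))  # 0.05 at the end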


Quick Start

# 1. Clone / copy the codebase to your cluster
# 2. Install dependencies
pip install -r requirements.txt

# 3. (Optional) Install FlashAttention for speedup
pip install flash-attn --no-build-isolation

# 4. Run pre-training
bash run_pretrain.sh

# 5. Run downstream fine-tuning + evaluation
bash run_finetune.sh

Architecture Summary

| Component | Value | Why |
|-----------|-------|-----|
| Params | ~150M | Competitive with ESM-2 150M |
| Layers | 28 | Deep & narrow (NeoBERT/ModernBERT best practice) |
| Hidden | 576 | Head dim = 64 (tensor-core optimal) |
| Heads | 9 | 576 / 9 = 64 |
| FFN | 2304 | GeGLU (4× hidden) |
| Pos Emb | RoPE (θ = 10k) | Extrapolates to longer proteins |
| Norm | Pre-LN | Stable at 28 layers |
| Dropout | 0.0 | Following ESM-2 (the data is noisy enough) |
| Vocab | 33 | ESM-2 compatible |
| Generator | 320 hidden, 8 layers | 25% of discriminator size (ELECTRA recipe) |

Discriminator params: ~150M | Generator params: ~25M
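As a cross-check of the numbers above, here is a hypothetical config object mirroring the table; the field names are assumptions, and the authoritative definition lives in modeling_modern_protein.py:

from dataclasses import dataclass

@dataclass
class ModernProteinConfig:  # hypothetical mirror of the table above
    vocab_size: int = 33            # ESM-2 compatible
    hidden_size: int = 576
    num_hidden_layers: int = 28     # deep & narrow
    num_attention_heads: int = 9    # 576 / 9 = 64-dim heads
    intermediate_size: int = 2304   # GeGLU, 4x hidden
    rope_theta: float = 10_000.0
    dropout: float = 0.0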


Stage 1: Pre-Training (ELECTRA)

Single GPU

CUDA_VISIBLE_DEVICES=0 bash run_pretrain.sh

Multi-GPU (DDP)

# 4 GPUs (torchrun launches the Python entry point directly, not the shell wrapper)
torchrun --standalone --nnodes=1 --nproc_per_node=4 train_pretrain.py

SLURM

#SBATCH --gres=gpu:4
#SBATCH --cpus-per-task=16
#SBATCH --mem=128G

module load cuda/12.1
source ~/venv/bin/activate

export NUM_GPUS=4
export BATCH_SIZE=32        # Per-device
export MAX_STEPS=500000
export USE_AMP=1
export USE_FLASH_ATTN=1

bash run_pretrain.sh

Key Environment Variables

| Variable | Default | Description |
|----------|---------|-------------|
| NUM_GPUS | 1 | Number of GPUs |
| BATCH_SIZE | 64 | Per-device batch size |
| MAX_STEPS | 100000 | Total training steps |
| LR | 5e-4 | Peak learning rate |
| MASK_START | 0.30 | Initial mask ratio |
| MASK_END | 0.05 | Final mask ratio |
| USE_AMP | 1 | bf16 mixed precision |
| USE_FLASH_ATTN | 1 | FlashAttention (requires install) |
| GRADIENT_CHECKPOINTING | 0 | Trade compute for memory |
| USE_TRACKIO | 0 | Enable experiment tracking |
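A sketch of how train_pretrain.py presumably consumes these variables; the actual parsing code may differ:

import os

# Illustrative only: read launcher env vars with the documented defaults.
batch_size = int(os.environ.get("BATCH_SIZE", "64"))
max_steps = int(os.environ.get("MAX_STEPS", "100000"))
lr = float(os.environ.get("LR", "5e-4"))
mask_start = float(os.environ.get("MASK_START", "0.30"))
mask_end = float(os.environ.get("MASK_END", "0.05"))
use_amp = os.environ.get("USE_AMP", "1") == "1"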

Data Sources

Pre-training pulls from HuggingFace datasets by default:

  • lamm-mit/protein_secondary_structure_from_PDB (~126k sequences)
  • adamstogsdill/pdb_protein_dataset_100_4000_1024

For full pre-training, set USE_STREAMING=1 and add UniRef50/UniRef90:

export USE_STREAMING=1
# Or provide local UniRef FASTA:
export UNIREF_PATH=/path/to/uniref50.fasta

To add UniRef support, modify load_sequences() in train_pretrain.py:

from Bio import SeqIO

def load_uniref_fasta(path, max_seqs=5_000_000):
    """Load UniRef sequences, keeping lengths in the model's [20, 1024] range."""
    sequences = []
    for record in SeqIO.parse(path, "fasta"):
        seq = str(record.seq)
        if 20 <= len(seq) <= 1024:
            sequences.append(seq)
        if len(sequences) >= max_seqs:
            break
    return sequences
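One way to wire this in, assuming load_sequences() builds a plain Python list named sequences (the exact integration point in train_pretrain.py may differ):

import os

# Inside load_sequences(): append UniRef sequences when UNIREF_PATH is set.
uniref_path = os.environ.get("UNIREF_PATH")
if uniref_path:
    sequences.extend(load_uniref_fasta(uniref_path))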

Expected Pre-Training Time

| Hardware | Batch Size | Steps/Day | 100K Steps | 500K Steps |
|----------|------------|-----------|------------|------------|
| 1× A100 80GB | 128 | ~50K | 2 days | 10 days |
| 4× A100 80GB | 128 × 4 | ~200K | 12 hours | 2.5 days |
| 8× A100 80GB | 128 × 8 | ~400K | 6 hours | ~30 hours |

All estimates assume bf16 AMP and FlashAttention enabled.
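The wall-clock columns follow directly from the step budget divided by throughput, assuming near-linear DDP scaling:

# Reproduce the table's wall-clock estimates (assumes linear multi-GPU scaling).
steps_per_day_single_gpu = 50_000
for n_gpus in (1, 4, 8):
    steps_per_day = steps_per_day_single_gpu * n_gpus
    for budget in (100_000, 500_000):
        days = budget / steps_per_day
        print(f"{n_gpus} GPU(s), {budget:,} steps: ~{days:g} days")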


Stage 2: Downstream Fine-Tuning

After pre-training completes, fine-tune on specific tasks:

# Fine-tune on all available tasks
bash run_finetune.sh

# Or point at a specific pre-trained checkpoint
PRETRAIN_DIR=./outputs/pretrain/final bash run_finetune.sh

Supported Benchmark Tasks

| Task | Type | Metric | Baseline (ESM-2 150M) | Target |
|------|------|--------|-----------------------|--------|
| Fluorescence | Regression | Spearman ρ | 0.68 | ≥ 0.75 |
| Stability | Regression | Spearman ρ | 0.79 | ≥ 0.85 |
| Solubility | Classification | Accuracy | ~74% | ≥ 80% |
| Remote Homology | Classification | Accuracy | ~20% | ≥ 25% |
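For reference, the regression metric is plain Spearman correlation; a toy check with SciPy (the arrays are stand-ins for the eval loop's outputs):

import numpy as np
from scipy.stats import spearmanr

# Toy stand-ins; in practice these come from the fine-tuned model's eval loop.
predictions = np.array([0.1, 0.4, 0.35, 0.8])
labels = np.array([0.0, 0.5, 0.30, 0.9])
rho, _ = spearmanr(predictions, labels)
print(f"Spearman rho: {rho:.3f}")  # 1.000: the toy rankings agree exactly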

Fine-Tuning Strategy

The script uses layer-wise learning rate decay:

  • Task head: lr
  • Last 4 transformer layers: lr Γ— 0.5
  • Earlier layers + embeddings: lr Γ— 0.1

This is critical for small downstream datasets (fluorescence has ~21k samples).
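A sketch of the three-tier scheme as PyTorch optimizer parameter groups; the module names ("head", "layers.N") are assumptions about the checkpoint's naming, not train_finetune.py's actual code:

import torch

def layerwise_param_groups(model, lr, num_layers=28, top_k=4):
    # Three tiers: task head at lr, last `top_k` layers at lr * 0.5, rest at lr * 0.1.
    top_tags = tuple(f"layers.{i}." for i in range(num_layers - top_k, num_layers))
    head, top, rest = [], [], []
    for name, param in model.named_parameters():
        if "head" in name:
            head.append(param)
        elif any(tag in name for tag in top_tags):
            top.append(param)
        else:
            rest.append(param)
    return [
        {"params": head, "lr": lr},
        {"params": top, "lr": lr * 0.5},
        {"params": rest, "lr": lr * 0.1},
    ]

# usage: optimizer = torch.optim.AdamW(layerwise_param_groups(model, lr=1e-4))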

For even smaller datasets, add LoRA:

# Install PEFT
pip install peft

# In train_finetune.py, replace full fine-tuning with:
from peft import LoraConfig, get_peft_model

lora_config = LoraConfig(
    r=8, lora_alpha=16,
    target_modules=["qkv_proj", "out_proj", "gate_proj", "up_proj", "down_proj"],
    lora_dropout=0.0,
    bias="none",
)
model = get_peft_model(model, lora_config)
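After wrapping, confirm that only the adapters train; print_trainable_parameters() is part of the PEFT model API:

model.print_trainable_parameters()
# Expect well under 1% of the ~150M parameters to be trainable with r=8.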

Stage 3: Pushing to HuggingFace Hub

After fine-tuning, push the pretrained encoder for community use:

from modeling_modern_protein import ModernProteinLM

# Load your trained model
model = ModernProteinLM.from_pretrained("./outputs/pretrain/final")

# Push to Hub
model.push_to_hub("your-username/ModernProteinLM-150M")

# With a task-specific head
from modeling_modern_protein import ModernProteinLMForSequenceClassification
cls_model = ModernProteinLMForSequenceClassification.from_pretrained(
    "./outputs/finetune/fluorescence/best"
)
cls_model.push_to_hub("your-username/ModernProteinLM-fluorescence")

Expected Improvements Over ESM-2 150M

| Technique | Source | Expected Gain |
|-----------|--------|---------------|
| ELECTRA vs MLM | ELECTRA paper | +3-5% on discriminative tasks |
| GeGLU vs GELU | ModernBERT | +1-2% |
| Deep & narrow (28L) | NeoBERT | +1-3% on embeddings |
| Span masking | SpanBERT analogy | +1-2% on structure tasks |
| Curriculum 30% → 5% | mmBERT | Faster convergence |
| Combined (conservative) | — | +7-14% on predictive benchmarks |

Troubleshooting

OOM during pre-training

# Reduce per-device batch size
export BATCH_SIZE=32

# Enable gradient checkpointing
export GRADIENT_CHECKPOINTING=1

# Reduce sequence length
export MAX_SEQ_LENGTH=512

FlashAttention install fails

# Skip FlashAttention (slower but works)
export USE_FLASH_ATTN=0

# Or download a prebuilt wheel matching your Python/torch/CUDA versions from
# https://github.com/Dao-AILab/flash-attention/releases and install it directly:
pip install /path/to/downloaded/flash_attn-*.whl

Slow data loading

# Increase workers
export NUM_WORKERS=16

# Pre-tokenize and cache
python -c "
from train_pretrain import load_sequences, ProteinTokenizer
import pickle
tokenizer = ProteinTokenizer()
seqs = load_sequences(None)
tokenized = [tokenizer.encode(s) for s in seqs]
with open('tokenized_cache.pkl', 'wb') as f:
    pickle.dump(tokenized, f)
"

File Reference

modern_protein_lm/
├── modeling_modern_protein.py    # Core architecture (ModernBERT-style + ELECTRA)
├── train_pretrain.py             # ELECTRA pre-training (supports DDP, AMP)
├── train_finetune.py             # Downstream fine-tuning (layer-wise LR)
├── run_pretrain.sh               # Launch script for pre-training
├── run_finetune.sh               # Launch script for fine-tuning
├── requirements.txt              # Dependencies
├── README.md                     # Architecture docs
└── CLUSTER_INSTRUCTIONS.md       # This file

Citation

If you use this architecture or achieve SOTA results, please cite:

@article{lin2023evolutionary,
  title={Evolutionary-scale prediction of atomic-level protein structure with a language model},
  author={Lin, Zeming and Akin, Halil and Rao, Roshan and Hie, Brian and Zhu, Zhongkai and Lu, Wenting and Smetanin, Nikita and Verkuil, Robert and Kabeli, Ori and Shmueli, Yaniv and others},
  journal={Science},
  year={2023}
}

@article{warner2024modernbert,
  title={Smarter, Better, Faster, Longer: A Modern Bidirectional Encoder for Fast, Memory Efficient, and Long Context Finetuning and Inference},
  author={Warner, Benjamin and Chaffin, Antoine and Clavi{\'e}, Benjamin and others},
  journal={arXiv preprint arXiv:2412.13663},
  year={2024}
}

@inproceedings{clark2020electra,
  title={ELECTRA: Pre-training text encoders as discriminators rather than generators},
  author={Clark, Kevin and Luong, Minh-Thang and Le, Quoc V and Manning, Christopher D},
  booktitle={ICLR},
  year={2020}
}