
ModernProteinLM — Private GPU Cluster Instructions

Overview

ModernProteinLM is a next-generation protein encoder (<200M params) that combines:

  1. ModernBERT architecture (RoPE, Pre-LN, GeGLU, deep & narrow)
  2. ELECTRA discriminative pre-training (replaced token detection)
  3. Span masking curriculum (30% → 5% over training)

To our knowledge, this is the first protein encoder to combine all three of these proven techniques, targeting predictive downstream tasks (fluorescence, stability, solubility, structure, etc.).
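For intuition, here is a minimal sketch of the mask-ratio curriculum, assuming a linear schedule (the function name and schedule shape are illustrative; the real implementation lives in train_pretrain.py):

def mask_ratio(step, max_steps, start=0.30, end=0.05):
    # Linearly anneal the span-mask ratio from `start` down to `end`.
    frac = min(step / max_steps, 1.0)
    return start + (end - start) * frac

print(mask_ratio(0, 100_000))        # 0.30 at step 0
print(mask_ratio(50_000, 100_000))   # 0.175 halfway through
print(mask_ratio(100_000, 100_000))  # 0.05 at the end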


Quick Start

# 1. Clone / copy the codebase to your cluster
# 2. Install dependencies
pip install -r requirements.txt

# 3. (Optional) Install FlashAttention for speedup
pip install flash-attn --no-build-isolation

# 4. Run pre-training
bash run_pretrain.sh

# 5. Run downstream fine-tuning + evaluation
bash run_finetune.sh

Architecture Summary

| Component | Value | Why |
|-----------|-------|-----|
| Params | ~150M | Competitive with ESM-2 150M |
| Layers | 28 | Deep & narrow (NeoBERT/ModernBERT best practice) |
| Hidden | 576 | Head dim = 64 (tensor-core optimal) |
| Heads | 9 | 576 / 9 = 64 |
| FFN | 2304 | GeGLU (4× hidden) |
| Pos Emb | RoPE (θ = 10k) | Extrapolates to longer proteins |
| Norm | Pre-LN | Stable at 28 layers |
| Dropout | 0.0 | Following ESM-2 (the data is noisy enough) |
| Vocab | 33 | ESM-2 compatible |
| Generator | 320 hidden, 8 layers | 25% of discriminator size (ELECTRA recipe) |

Discriminator params: ~150M | Generator params: ~25M
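As a cross-check of the numbers above, here is a hypothetical config object mirroring the table; the field names are assumptions, and the authoritative definition lives in modeling_modern_protein.py:

from dataclasses import dataclass

@dataclass
class ModernProteinConfig:  # hypothetical mirror of the table above
    vocab_size: int = 33            # ESM-2 compatible
    hidden_size: int = 576
    num_hidden_layers: int = 28     # deep & narrow
    num_attention_heads: int = 9    # 576 / 9 = 64-dim heads
    intermediate_size: int = 2304   # GeGLU, 4x hidden
    rope_theta: float = 10_000.0
    dropout: float = 0.0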


Stage 1: Pre-Training (ELECTRA)

Single GPU

CUDA_VISIBLE_DEVICES=0 bash run_pretrain.sh

Multi-GPU (DDP)

# 4 GPUs (torchrun launches the Python entry point directly, not the shell wrapper)
torchrun --standalone --nnodes=1 --nproc_per_node=4 train_pretrain.py

SLURM

#SBATCH --gres=gpu:4
#SBATCH --cpus-per-task=16
#SBATCH --mem=128G

module load cuda/12.1
source ~/venv/bin/activate

export NUM_GPUS=4
export BATCH_SIZE=32        # Per-device
export MAX_STEPS=500000
export USE_AMP=1
export USE_FLASH_ATTN=1

bash run_pretrain.sh

Key Environment Variables

| Variable | Default | Description |
|----------|---------|-------------|
| NUM_GPUS | 1 | Number of GPUs |
| BATCH_SIZE | 64 | Per-device batch size |
| MAX_STEPS | 100000 | Total training steps |
| LR | 5e-4 | Peak learning rate |
| MASK_START | 0.30 | Initial mask ratio |
| MASK_END | 0.05 | Final mask ratio |
| USE_AMP | 1 | bf16 mixed precision |
| USE_FLASH_ATTN | 1 | FlashAttention (requires install) |
| GRADIENT_CHECKPOINTING | 0 | Trade compute for memory |
| USE_TRACKIO | 0 | Enable experiment tracking |
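A sketch of how train_pretrain.py presumably consumes these variables; the actual parsing code may differ:

import os

# Illustrative only: read launcher env vars with the documented defaults.
batch_size = int(os.environ.get("BATCH_SIZE", "64"))
max_steps = int(os.environ.get("MAX_STEPS", "100000"))
lr = float(os.environ.get("LR", "5e-4"))
mask_start = float(os.environ.get("MASK_START", "0.30"))
mask_end = float(os.environ.get("MASK_END", "0.05"))
use_amp = os.environ.get("USE_AMP", "1") == "1"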

Data Sources

Pre-training pulls from HuggingFace datasets by default:

  • lamm-mit/protein_secondary_structure_from_PDB (~126k sequences)
  • adamstogsdill/pdb_protein_dataset_100_4000_1024

For full pre-training, set USE_STREAMING=1 and add UniRef50/UniRef90:

export USE_STREAMING=1
# Or provide local UniRef FASTA:
export UNIREF_PATH=/path/to/uniref50.fasta

To add UniRef support, modify load_sequences() in train_pretrain.py:

from Bio import SeqIO

def load_uniref_fasta(path, max_seqs=5_000_000):
    """Load UniRef sequences, keeping lengths in the model's [20, 1024] range."""
    sequences = []
    for record in SeqIO.parse(path, "fasta"):
        seq = str(record.seq)
        if 20 <= len(seq) <= 1024:
            sequences.append(seq)
        if len(sequences) >= max_seqs:
            break
    return sequences
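One way to wire this in, assuming load_sequences() builds a plain Python list named sequences (the exact integration point in train_pretrain.py may differ):

import os

# Inside load_sequences(): append UniRef sequences when UNIREF_PATH is set.
uniref_path = os.environ.get("UNIREF_PATH")
if uniref_path:
    sequences.extend(load_uniref_fasta(uniref_path))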

Expected Pre-Training Time

| Hardware | Batch Size | Steps/Day | 100K Steps | 500K Steps |
|----------|------------|-----------|------------|------------|
| 1× A100 80GB | 128 | ~50K | 2 days | 10 days |
| 4× A100 80GB | 128 × 4 | ~200K | 12 hours | 2.5 days |
| 8× A100 80GB | 128 × 8 | ~400K | 6 hours | ~30 hours |

All estimates assume bf16 AMP and FlashAttention enabled.
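The wall-clock columns follow directly from the step budget divided by throughput, assuming near-linear DDP scaling:

# Reproduce the table's wall-clock estimates (assumes linear multi-GPU scaling).
steps_per_day_single_gpu = 50_000
for n_gpus in (1, 4, 8):
    steps_per_day = steps_per_day_single_gpu * n_gpus
    for budget in (100_000, 500_000):
        days = budget / steps_per_day
        print(f"{n_gpus} GPU(s), {budget:,} steps: ~{days:g} days")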


Stage 2: Downstream Fine-Tuning

After pre-training completes, fine-tune on specific tasks:

# Fine-tune on all available tasks
bash run_finetune.sh

# Or point at a specific pre-trained checkpoint
PRETRAIN_DIR=./outputs/pretrain/final bash run_finetune.sh

Supported Benchmark Tasks

| Task | Type | Metric | Baseline (ESM-2 150M) | Target |
|------|------|--------|-----------------------|--------|
| Fluorescence | Regression | Spearman ρ | 0.68 | ≥ 0.75 |
| Stability | Regression | Spearman ρ | 0.79 | ≥ 0.85 |
| Solubility | Classification | Accuracy | ~74% | ≥ 80% |
| Remote Homology | Classification | Accuracy | ~20% | ≥ 25% |
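For reference, the regression metric is plain Spearman correlation; a toy check with SciPy (the arrays are stand-ins for the eval loop's outputs):

import numpy as np
from scipy.stats import spearmanr

# Toy stand-ins; in practice these come from the fine-tuned model's eval loop.
predictions = np.array([0.1, 0.4, 0.35, 0.8])
labels = np.array([0.0, 0.5, 0.30, 0.9])
rho, _ = spearmanr(predictions, labels)
print(f"Spearman rho: {rho:.3f}")  # 1.000: the toy rankings agree exactly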

Fine-Tuning Strategy

The script uses layer-wise learning rate decay:

  • Task head: lr
  • Last 4 transformer layers: lr Γ— 0.5
  • Earlier layers + embeddings: lr Γ— 0.1

This is critical for small downstream datasets (fluorescence has ~21k samples).
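A sketch of the three-tier scheme as PyTorch optimizer parameter groups; the module names ("head", "layers.N") are assumptions about the checkpoint's naming, not train_finetune.py's actual code:

import torch

def layerwise_param_groups(model, lr, num_layers=28, top_k=4):
    # Three tiers: task head at lr, last `top_k` layers at lr * 0.5, rest at lr * 0.1.
    top_tags = tuple(f"layers.{i}." for i in range(num_layers - top_k, num_layers))
    head, top, rest = [], [], []
    for name, param in model.named_parameters():
        if "head" in name:
            head.append(param)
        elif any(tag in name for tag in top_tags):
            top.append(param)
        else:
            rest.append(param)
    return [
        {"params": head, "lr": lr},
        {"params": top, "lr": lr * 0.5},
        {"params": rest, "lr": lr * 0.1},
    ]

# usage: optimizer = torch.optim.AdamW(layerwise_param_groups(model, lr=1e-4))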

For even smaller datasets, add LoRA:

# Install PEFT
pip install peft

# In train_finetune.py, replace full fine-tuning with:
from peft import LoraConfig, get_peft_model

lora_config = LoraConfig(
    r=8, lora_alpha=16,
    target_modules=["qkv_proj", "out_proj", "gate_proj", "up_proj", "down_proj"],
    lora_dropout=0.0,
    bias="none",
)
model = get_peft_model(model, lora_config)
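After wrapping, confirm that only the adapters train; print_trainable_parameters() is part of the PEFT model API:

model.print_trainable_parameters()
# Expect well under 1% of the ~150M parameters to be trainable with r=8.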

Stage 3: Pushing to HuggingFace Hub

After fine-tuning, push the pretrained encoder for community use:

from modeling_modern_protein import ModernProteinLM

# Load your trained model
model = ModernProteinLM.from_pretrained("./outputs/pretrain/final")

# Push to Hub
model.push_to_hub("your-username/ModernProteinLM-150M")

# With a task-specific head
from modeling_modern_protein import ModernProteinLMForSequenceClassification
cls_model = ModernProteinLMForSequenceClassification.from_pretrained(
    "./outputs/finetune/fluorescence/best"
)
cls_model.push_to_hub("your-username/ModernProteinLM-fluorescence")

Expected Improvements Over ESM-2 150M

| Technique | Source | Expected Gain |
|-----------|--------|---------------|
| ELECTRA vs MLM | ELECTRA paper | +3-5% on discriminative tasks |
| GeGLU vs GELU | ModernBERT | +1-2% |
| Deep & narrow (28L) | NeoBERT | +1-3% on embeddings |
| Span masking | SpanBERT analogy | +1-2% on structure tasks |
| Curriculum 30% → 5% | mmBERT | Faster convergence |
| Combined (conservative) | — | +7-14% on predictive benchmarks |

Troubleshooting

OOM during pre-training

# Reduce per-device batch size
export BATCH_SIZE=32

# Enable gradient checkpointing
export GRADIENT_CHECKPOINTING=1

# Reduce sequence length
export MAX_SEQ_LENGTH=512

FlashAttention install fails

# Skip FlashAttention (slower but works)
export USE_FLASH_ATTN=0

# Or download a prebuilt wheel matching your Python/torch/CUDA versions from
# https://github.com/Dao-AILab/flash-attention/releases and install it directly:
pip install /path/to/downloaded/flash_attn-*.whl

Slow data loading

# Increase workers
export NUM_WORKERS=16

# Pre-tokenize and cache
python -c "
from train_pretrain import load_sequences, ProteinTokenizer
import pickle
tokenizer = ProteinTokenizer()
seqs = load_sequences(None)
tokenized = [tokenizer.encode(s) for s in seqs]
with open('tokenized_cache.pkl', 'wb') as f:
    pickle.dump(tokenized, f)
"

File Reference

modern_protein_lm/
├── modeling_modern_protein.py    # Core architecture (ModernBERT-style + ELECTRA)
├── train_pretrain.py             # ELECTRA pre-training (supports DDP, AMP)
├── train_finetune.py             # Downstream fine-tuning (layer-wise LR)
├── run_pretrain.sh               # Launch script for pre-training
├── run_finetune.sh               # Launch script for fine-tuning
├── requirements.txt              # Dependencies
├── README.md                     # Architecture docs
└── CLUSTER_INSTRUCTIONS.md       # This file

Citation

If you use this architecture or achieve SOTA results, please cite:

@article{lin2023evolutionary,
  title={Evolutionary-scale prediction of atomic-level protein structure with a language model},
  author={Lin, Zeming and Akin, Halil and Rao, Roshan and Hie, Brian and Zhu, Zhongkai and Lu, Wenting and Smetanin, Nikita and Verkuil, Robert and Kabeli, Ori and Shmueli, Yaniv and others},
  journal={Science},
  year={2023}
}

@article{warner2024modernbert,
  title={Smarter, Better, Faster, Longer: A Modern Bidirectional Encoder for Fast, Memory Efficient, and Long Context Finetuning and Inference},
  author={Warner, Benjamin and Chaffin, Antoine and Clavi{\'e}, Benjamin and others},
  journal={arXiv preprint arXiv:2412.13663},
  year={2024}
}

@inproceedings{clark2020electra,
  title={ELECTRA: Pre-training text encoders as discriminators rather than generators},
  author={Clark, Kevin and Luong, Minh-Thang and Le, Quoc V and Manning, Christopher D},
  booktitle={ICLR},
  year={2020}
}