# ModernProteinLM – Private GPU Cluster Instructions
## Overview
ModernProteinLM is a next-generation protein encoder (<200M params) that combines:
1. **ModernBERT architecture** (RoPE, Pre-LN, GeGLU, deep & narrow)
2. **ELECTRA discriminative pre-training** (replaced token detection)
3. **Span masking curriculum** (30% → 5% over training)
This is the **first protein encoder** to combine all three proven techniques, targeting predictive downstream tasks (fluorescence, stability, solubility, structure, etc.).
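As a rough illustration of the masking curriculum, here is a minimal sketch of a linear schedule from 30% down to 5%; the function name and the exact interpolation (train_pretrain.py may use a different schedule shape) are assumptions for illustration only.

```python
def mask_ratio_at(step: int, max_steps: int, start: float = 0.30, end: float = 0.05) -> float:
    """Linearly anneal the span-masking ratio from `start` to `end` over training."""
    progress = min(max(step / max_steps, 0.0), 1.0)
    return start + (end - start) * progress

# e.g. halfway through a 100k-step run: mask_ratio_at(50_000, 100_000) == 0.175
```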
---
## Quick Start
```bash
# 1. Clone / copy the codebase to your cluster
# 2. Install dependencies
pip install -r requirements.txt
# 3. (Optional) Install FlashAttention for speedup
pip install flash-attn --no-build-isolation
# 4. Run pre-training
bash run_pretrain.sh
# 5. Run downstream fine-tuning + evaluation
bash run_finetune.sh
```
---
## Architecture Summary
| Component | Value | Why |
|-----------|-------|-----|
| **Params** | ~150M | Competitive with ESM-2 150M |
| **Layers** | 28 | Deep & narrow (NeoBERT/ModernBERT best practice) |
| **Hidden** | 576 | Head dim = 64 (tensor core optimal) |
| **Heads** | 9 | 576/9 = 64 |
| **FFN** | 2304 | GeGLU (4× hidden) |
| **Pos Emb** | RoPE (θ=10k) | Extrapolates to longer proteins |
| **Norm** | Pre-LN | Stable at 28 layers |
| **Dropout** | 0.0 | Following ESM-2 (the data is noisy enough on its own) |
| **Vocab** | 33 | ESM-2 compatible |
| **Generator** | 320 hidden, 8L | 25% of discriminator (ELECTRA recipe) |
**Discriminator params: ~150M | Generator params: ~25M**
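For reference, the table above maps onto a configuration roughly like the sketch below; the class and field names are illustrative and may differ from the actual config object in `modeling_modern_protein.py`.

```python
from dataclasses import dataclass

@dataclass
class ProteinEncoderConfig:          # illustrative; actual class/field names may differ
    vocab_size: int = 33             # ESM-2 compatible vocabulary
    hidden_size: int = 576           # 9 heads x 64-dim head (tensor core friendly)
    num_hidden_layers: int = 28      # deep & narrow
    num_attention_heads: int = 9
    intermediate_size: int = 2304    # GeGLU FFN (4x hidden)
    rope_theta: float = 10_000.0     # RoPE base
    dropout: float = 0.0
    # ELECTRA generator (~25% of the discriminator width)
    generator_hidden_size: int = 320
    generator_layers: int = 8
```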
---
## Stage 1: Pre-Training (ELECTRA)
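For orientation, the objective looks roughly like the sketch below: the generator fills masked positions, the discriminator classifies every token as original vs. replaced, and the two losses are summed with a weight on the discriminator term (50.0 is the ELECTRA paper's default). The module call signatures are assumptions, not the exact code in `train_pretrain.py`.

```python
import torch
import torch.nn.functional as F

def electra_loss(generator, discriminator, masked_ids, original_ids, mask, disc_weight=50.0):
    # mask: bool tensor (B, L), True where a token was replaced by [MASK]
    gen_logits = generator(masked_ids)                               # (B, L, vocab)
    mlm_loss = F.cross_entropy(gen_logits[mask], original_ids[mask])

    # Sample replacement tokens from the generator (no gradient through sampling)
    with torch.no_grad():
        sampled = torch.distributions.Categorical(logits=gen_logits).sample()
    corrupted = torch.where(mask, sampled, original_ids)

    # Discriminator: per-token binary prediction "was this token replaced?"
    disc_logits = discriminator(corrupted)                           # (B, L)
    is_replaced = (corrupted != original_ids).float()
    rtd_loss = F.binary_cross_entropy_with_logits(disc_logits, is_replaced)

    return mlm_loss + disc_weight * rtd_loss
```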
### Single GPU
```bash
CUDA_VISIBLE_DEVICES=0 bash run_pretrain.sh
```
### Multi-GPU (DDP)
```bash
# 4 GPUs via the NUM_GPUS variable (see Key Environment Variables below)
NUM_GPUS=4 bash run_pretrain.sh
# Or launch the training script with torchrun directly:
torchrun --standalone --nnodes=1 --nproc_per_node=4 train_pretrain.py
```
### SLURM
```bash
#!/bin/bash
#SBATCH --gres=gpu:4
#SBATCH --cpus-per-task=16
#SBATCH --mem=128G
module load cuda/12.1
source ~/venv/bin/activate
export NUM_GPUS=4
export BATCH_SIZE=32 # Per-device
export MAX_STEPS=500000
export USE_AMP=1
export USE_FLASH_ATTN=1
bash run_pretrain.sh
```
### Key Environment Variables
| Variable | Default | Description |
|----------|---------|-------------|
| `NUM_GPUS` | 1 | Number of GPUs |
| `BATCH_SIZE` | 64 | Per-device batch size |
| `MAX_STEPS` | 100000 | Total training steps |
| `LR` | 5e-4 | Peak learning rate |
| `MASK_START` | 0.30 | Initial mask ratio |
| `MASK_END` | 0.05 | Final mask ratio |
| `USE_AMP` | 1 | bf16 mixed precision |
| `USE_FLASH_ATTN` | 1 | FlashAttention (requires install) |
| `GRADIENT_CHECKPOINTING` | 0 | Trade compute for memory |
| `USE_TRACKIO` | 0 | Enable experiment tracking |
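These variables are exported in the shell and picked up inside the training code; a minimal sketch of that pattern follows (the helper names are illustrative, not the exact ones in `train_pretrain.py`).

```python
import os

def env_int(name: str, default: int) -> int:
    return int(os.environ.get(name, default))

def env_flag(name: str, default: bool = False) -> bool:
    return os.environ.get(name, str(int(default))) in ("1", "true", "True")

batch_size = env_int("BATCH_SIZE", 64)
max_steps = env_int("MAX_STEPS", 100_000)
lr = float(os.environ.get("LR", 5e-4))
use_amp = env_flag("USE_AMP", True)
use_flash_attn = env_flag("USE_FLASH_ATTN", True)
```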
### Data Sources
Pre-training pulls from HuggingFace datasets by default:
- `lamm-mit/protein_secondary_structure_from_PDB` (~126k sequences)
- `adamstogsdill/pdb_protein_dataset_100_4000_1024`
**For full pre-training**, set `USE_STREAMING=1` and add UniRef50/UniRef90:
```bash
export USE_STREAMING=1
# Or provide local UniRef FASTA:
export UNIREF_PATH=/path/to/uniref50.fasta
```
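With `USE_STREAMING=1` the intent is to iterate over a large corpus without downloading it first; a minimal sketch using the `datasets` library is shown below. The dataset id and column name are placeholders, not something this repo pins.

```python
from datasets import load_dataset

# Placeholder dataset id; substitute whichever UniRef50 copy you use.
stream = load_dataset("your-org/uniref50", split="train", streaming=True)
for example in stream:
    seq = example["sequence"]      # column name depends on the dataset
    if 20 <= len(seq) <= 1024:
        ...  # feed into the tokenizer / training buffer
```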
To add UniRef support, modify `load_sequences()` in `train_pretrain.py`:
```python
from Bio import SeqIO
def load_uniref_fasta(path, max_seqs=5_000_000):
    """Collect UniRef sequences of 20-1024 residues, up to max_seqs."""
    sequences = []
    for record in SeqIO.parse(path, "fasta"):
        seq = str(record.seq)
        if 20 <= len(seq) <= 1024:
            sequences.append(seq)
        if len(sequences) >= max_seqs:
            break
    return sequences
```
### Expected Pre-Training Time
| Hardware | Batch Size | Steps/Day | 100K Steps | 500K Steps |
|----------|-----------|-----------|------------|------------|
| 1× A100 80GB | 128 | ~50K | 2 days | 10 days |
| 4× A100 80GB | 128×4 | ~200K | 12 hours | 2.5 days |
| 8× A100 80GB | 128×8 | ~400K | 6 hours | ~30 hours |
*With bf16 AMP and FlashAttention*
---
## Stage 2: Downstream Fine-Tuning
After pre-training completes, fine-tune on specific tasks:
```bash
# Fine-tune on all available tasks
bash run_finetune.sh
# Or point at a specific pre-trained checkpoint
PRETRAIN_DIR=./outputs/pretrain/final bash run_finetune.sh
```
### Supported Benchmark Tasks
| Task | Type | Metric | Baseline (ESM-2 150M) | Target |
|------|------|--------|----------------------|--------|
| **Fluorescence** | Regression | Spearman ρ | 0.68 | ≥ 0.75 |
| **Stability** | Regression | Spearman ρ | 0.79 | ≥ 0.85 |
| **Solubility** | Classification | Accuracy | ~74% | ≥ 80% |
| **Remote Homology** | Classification | Accuracy | ~20% | ≥ 25% |
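The regression tasks are scored with Spearman rank correlation; a minimal example with `scipy.stats.spearmanr` (whether the repo's evaluation uses scipy or an in-house implementation is not specified):

```python
from scipy.stats import spearmanr

preds = [0.12, 0.55, 0.31, 0.90]     # model outputs on the test split (illustrative)
targets = [0.10, 0.60, 0.25, 0.95]   # ground-truth fluorescence / stability values
rho, _ = spearmanr(preds, targets)
print(f"Spearman rho: {rho:.3f}")
```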
### Fine-Tuning Strategy
The script uses **layer-wise learning rate decay**:
- Task head: `lr`
- Last 4 transformer layers: `lr × 0.5`
- Earlier layers + embeddings: `lr × 0.1`
This is critical for small downstream datasets (fluorescence has ~21k samples); a sketch of the corresponding parameter groups is shown below.
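A minimal sketch of building such parameter groups with AdamW; the attribute names (`model.encoder.layers`, `model.head`) are assumptions about the model layout rather than the exact code in `train_finetune.py`.

```python
import torch

def build_param_groups(model, lr=1e-4):
    # Assumed layout: model.head (task head), model.encoder.layers (28 blocks);
    # everything else (embeddings, final norm) falls into the low-LR group.
    layers = list(model.encoder.layers)
    head_params = list(model.head.parameters())
    last4_params = [p for layer in layers[-4:] for p in layer.parameters()]
    fast_ids = {id(p) for p in head_params + last4_params}
    rest_params = [p for p in model.parameters() if id(p) not in fast_ids]

    return torch.optim.AdamW([
        {"params": head_params, "lr": lr},          # task head: full LR
        {"params": last4_params, "lr": lr * 0.5},   # last 4 transformer layers
        {"params": rest_params, "lr": lr * 0.1},    # earlier layers + embeddings
    ])
```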
For even smaller datasets, add LoRA:
```bash
# Install PEFT
pip install peft
# In train_finetune.py, replace full fine-tuning with:
from peft import LoraConfig, get_peft_model
lora_config = LoraConfig(
    r=8, lora_alpha=16,
    target_modules=["qkv_proj", "out_proj", "gate_proj", "up_proj", "down_proj"],
    lora_dropout=0.0,
    bias="none",
)
model = get_peft_model(model, lora_config)
```
---
## Stage 3: Pushing to HuggingFace Hub
After fine-tuning, push the pretrained encoder for community use:
```python
from modeling_modern_protein import ModernProteinLM
# Load your trained model
model = ModernProteinLM.from_pretrained("./outputs/pretrain/final")
# Push to Hub
model.push_to_hub("your-username/ModernProteinLM-150M")
# With a task-specific head
from modeling_modern_protein import ModernProteinLMForSequenceClassification
cls_model = ModernProteinLMForSequenceClassification.from_pretrained(
"./outputs/finetune/fluorescence/best"
)
cls_model.push_to_hub("your-username/ModernProteinLM-fluorescence")
```
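Once pushed, the encoder can be pulled back on any machine with the same class; a brief usage example (repo name as above):

```python
from modeling_modern_protein import ModernProteinLM

# Load the pushed checkpoint directly from the Hub
model = ModernProteinLM.from_pretrained("your-username/ModernProteinLM-150M")
model.eval()
```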
---
## Expected Improvements Over ESM-2 150M
| Technique | Source | Expected Gain |
|-----------|--------|--------------|
| ELECTRA vs MLM | ELECTRA paper | +3-5% on discriminative tasks |
| GeGLU vs GELU | ModernBERT | +1-2% |
| Deep & narrow (28L) | NeoBERT | +1-3% on embeddings |
| Span masking | SpanBERT analogy | +1-2% on structure tasks |
| Curriculum 30%→5% | mmBERT | Faster convergence |
| **Combined (conservative)** | – | **+7-14% on predictive benchmarks** |
---
## Troubleshooting
### OOM during pre-training
```bash
# Reduce per-device batch size
export BATCH_SIZE=32
# Enable gradient checkpointing
export GRADIENT_CHECKPOINTING=1
# Reduce sequence length
export MAX_SEQ_LENGTH=512
```
### FlashAttention install fails
```bash
# Skip FlashAttention (slower but works)
export USE_FLASH_ATTN=0
# Or download a prebuilt wheel matching your CUDA / PyTorch version from
# https://github.com/Dao-AILab/flash-attention/releases and install it:
pip install flash_attn-*.whl
```
### Slow data loading
```bash
# Increase workers
export NUM_WORKERS=16
# Pre-tokenize and cache
python -c "
from train_pretrain import load_sequences, ProteinTokenizer
import pickle
tokenizer = ProteinTokenizer()
seqs = load_sequences(None)
tokenized = [tokenizer.encode(s) for s in seqs]
pickle.dump(tokenized, open('tokenized_cache.pkl', 'wb'))
"
```
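The cached pickle can then be consumed instead of re-tokenizing on every run; a minimal sketch of a Dataset wrapper (class name and padding details are illustrative, not the repo's own loader):

```python
import pickle
import torch
from torch.utils.data import Dataset

class CachedTokenDataset(Dataset):
    """Serves pre-tokenized sequences from tokenized_cache.pkl."""
    def __init__(self, path="tokenized_cache.pkl"):
        with open(path, "rb") as f:
            self.examples = pickle.load(f)

    def __len__(self):
        return len(self.examples)

    def __getitem__(self, idx):
        return torch.tensor(self.examples[idx], dtype=torch.long)
```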
---
## File Reference
```
modern_protein_lm/
├── modeling_modern_protein.py   # Core architecture (ModernBERT-style + ELECTRA)
├── train_pretrain.py            # ELECTRA pre-training (supports DDP, AMP)
├── train_finetune.py            # Downstream fine-tuning (layer-wise LR)
├── run_pretrain.sh              # Launch script for pre-training
├── run_finetune.sh              # Launch script for fine-tuning
├── requirements.txt             # Dependencies
├── README.md                    # Architecture docs
└── CLUSTER_INSTRUCTIONS.md      # This file
```
---
## Citation
If you use this architecture or achieve SOTA results, please cite:
```bibtex
@article{lin2023evolutionary,
title={Evolutionary-scale prediction of atomic-level protein structure with a language model},
author={Lin, Zeming and Akin, Halil and Rao, Roshan and Hie, Brian and Zhu, Zhongkai and Lu, Wenting and Smetanin, Nikita and Verkuil, Robert and Kabeli, Ori and Shmueli, Yaniv and others},
journal={Science},
year={2023}
}
@article{warner2024modernbert,
title={Smarter, Better, Faster, Longer: A Modern Bidirectional Encoder for Fast, Memory Efficient and Long Context Finetuning and Inference},
author={Warner, Benjamin and Chaffin, Antoine and Clavi{\'e}, Benjamin and others},
journal={arXiv preprint arXiv:2412.13663},
year={2024}
}
@inproceedings{clark2020electra,
title={ELECTRA: Pre-training text encoders as discriminators rather than generators},
author={Clark, Kevin and Luong, Minh-Thang and Le, Quoc V and Manning, Christopher D},
booktitle={ICLR},
year={2020}
}
```