GrimSqueaker commited on
Commit
36bbb76
Β·
verified Β·
1 Parent(s): d76120a

Upload CLUSTER_INSTRUCTIONS.md with huggingface_hub

Browse files
Files changed (1) hide show
  1. CLUSTER_INSTRUCTIONS.md +315 -0
CLUSTER_INSTRUCTIONS.md ADDED
@@ -0,0 +1,315 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ # ModernProteinLM β€” Private GPU Cluster Instructions
2
+
3
+ ## Overview
4
+
5
+ ModernProteinLM is a next-generation protein encoder (<200M params) that combines:
6
+ 1. **ModernBERT architecture** (RoPE, Pre-LN, GeGLU, deep & narrow)
7
+ 2. **ELECTRA discriminative pre-training** (replaced token detection)
8
+ 3. **Span masking curriculum** (30% β†’ 5% over training)
9
+
10
+ This is the **first protein encoder** to combine all three proven techniques, targeting predictive downstream tasks (fluorescence, stability, solubility, structure, etc.).
11
+
12
+ ---
13
+
14
+ ## Quick Start
15
+
16
+ ```bash
17
+ # 1. Clone / copy the codebase to your cluster
18
+ # 2. Install dependencies
19
+ pip install -r requirements.txt
20
+
21
+ # 3. (Optional) Install FlashAttention for speedup
22
+ pip install flash-attn --no-build-isolation
23
+
24
+ # 4. Run pre-training
25
+ bash run_pretrain.sh
26
+
27
+ # 5. Run downstream fine-tuning + evaluation
28
+ bash run_finetune.sh
29
+ ```
30
+
31
+ ---
32
+
33
+ ## Architecture Summary
34
+
35
+ | Component | Value | Why |
36
+ |-----------|-------|-----|
37
+ | **Params** | ~150M | Competitive with ESM-2 150M |
38
+ | **Layers** | 28 | Deep & narrow (NeoBERT/ModernBERT best practice) |
39
+ | **Hidden** | 576 | Head dim = 64 (tensor core optimal) |
40
+ | **Heads** | 9 | 576/9 = 64 |
41
+ | **FFN** | 2304 | GeGLU (4Γ— hidden) |
42
+ | **Pos Emb** | RoPE (ΞΈ=10k) | Extrapolates to longer proteins |
43
+ | **Norm** | Pre-LN | Stable at 28 layers |
44
+ | **Dropout** | 0.0 | Following ESM-2 (data is noise enough) |
45
+ | **Vocab** | 33 | ESM-2 compatible |
46
+ | **Generator** | 320 hidden, 8L | 25% of discriminator (ELECTRA recipe) |
47
+
48
+ **Discriminator params: ~150M | Generator params: ~25M**
49
+
50
+ ---
51
+
52
+ ## Stage 1: Pre-Training (ELECTRA)
53
+
54
+ ### Single GPU
55
+
56
+ ```bash
57
+ CUDA_VISIBLE_DEVICES=0 bash run_pretrain.sh
58
+ ```
59
+
60
+ ### Multi-GPU (DDP)
61
+
62
+ ```bash
63
+ # 4 GPUs
64
+ torchrun --standalone --nnodes=1 --nproc_per_node=4 run_pretrain.sh
65
+ ```
66
+
67
+ ### SLURM
68
+
69
+ ```bash
70
+ #SBATCH --gres=gpu:4
71
+ #SBATCH --cpus-per-task=16
72
+ #SBATCH --mem=128G
73
+
74
+ module load cuda/12.1
75
+ source ~/venv/bin/activate
76
+
77
+ export NUM_GPUS=4
78
+ export BATCH_SIZE=32 # Per-device
79
+ export MAX_STEPS=500000
80
+ export USE_AMP=1
81
+ export USE_FLASH_ATTN=1
82
+
83
+ bash run_pretrain.sh
84
+ ```
85
+
86
+ ### Key Environment Variables
87
+
88
+ | Variable | Default | Description |
89
+ |----------|---------|-------------|
90
+ | `NUM_GPUS` | 1 | Number of GPUs |
91
+ | `BATCH_SIZE` | 64 | Per-device batch size |
92
+ | `MAX_STEPS` | 100000 | Total training steps |
93
+ | `LR` | 5e-4 | Peak learning rate |
94
+ | `MASK_START` | 0.30 | Initial mask ratio |
95
+ | `MASK_END` | 0.05 | Final mask ratio |
96
+ | `USE_AMP` | 1 | bf16 mixed precision |
97
+ | `USE_FLASH_ATTN` | 1 | FlashAttention (requires install) |
98
+ | `GRADIENT_CHECKPOINTING` | 0 | Trade compute for memory |
99
+ | `USE_TRACKIO` | 0 | Enable experiment tracking |
100
+
101
+ ### Data Sources
102
+
103
+ Pre-training pulls from HuggingFace datasets by default:
104
+ - `lamm-mit/protein_secondary_structure_from_PDB` (~126k sequences)
105
+ - `adamstogsdill/pdb_protein_dataset_100_4000_1024`
106
+
107
+ **For full pre-training**, set `USE_STREAMING=1` and add UniRef50/UniRef90:
108
+
109
+ ```bash
110
+ export USE_STREAMING=1
111
+ # Or provide local UniRef FASTA:
112
+ export UNIREF_PATH=/path/to/uniref50.fasta
113
+ ```
114
+
115
+ To add UniRef support, modify `load_sequences()` in `train_pretrain.py`:
116
+
117
+ ```python
118
+ from Bio import SeqIO
119
+
120
+ def load_uniref_fasta(path, max_seqs=5000000):
121
+ sequences = []
122
+ for record in SeqIO.parse(path, "fasta"):
123
+ seq = str(record.seq)
124
+ if len(seq) >= 20 and len(seq) <= 1024:
125
+ sequences.append(seq)
126
+ if len(sequences) >= max_seqs:
127
+ break
128
+ return sequences
129
+ ```
130
+
131
+ ### Expected Pre-Training Time
132
+
133
+ | Hardware | Batch Size | Steps/Day | 100K Steps | 500K Steps |
134
+ |----------|-----------|-----------|------------|------------|
135
+ | 1Γ— A100 80GB | 128 | ~50K | 2 days | 10 days |
136
+ | 4Γ— A100 80GB | 128Γ—4 | ~200K | 12 hours | 2.5 days |
137
+ | 8Γ— A100 80GB | 128Γ—8 | ~400K | 6 hours | ~30 hours |
138
+
139
+ *With bf16 AMP and FlashAttention*
140
+
141
+ ---
142
+
143
+ ## Stage 2: Downstream Fine-Tuning
144
+
145
+ After pre-training completes, fine-tune on specific tasks:
146
+
147
+ ```bash
148
+ # Fine-tune on all available tasks
149
+ bash run_finetune.sh
150
+
151
+ # Or specific tasks
152
+ PRETRAIN_DIR=./outputs/pretrain/final bash run_finetune.sh
153
+ ```
154
+
155
+ ### Supported Benchmark Tasks
156
+
157
+ | Task | Type | Metric | Baseline (ESM-2 150M) | Target |
158
+ |------|------|--------|----------------------|--------|
159
+ | **Fluorescence** | Regression | Spearman ρ | 0.68 | β‰₯ 0.75 |
160
+ | **Stability** | Regression | Spearman ρ | 0.79 | β‰₯ 0.85 |
161
+ | **Solubility** | Classification | Accuracy | ~74% | β‰₯ 80% |
162
+ | **Remote Homology** | Classification | Accuracy | ~20% | β‰₯ 25% |
163
+
164
+ ### Fine-Tuning Strategy
165
+
166
+ The script uses **layer-wise learning rate decay**:
167
+ - Task head: `lr`
168
+ - Last 4 transformer layers: `lr Γ— 0.5`
169
+ - Earlier layers + embeddings: `lr Γ— 0.1`
170
+
171
+ This is critical for small downstream datasets (fluorescence has ~21k samples).
172
+
173
+ For even smaller datasets, add LoRA:
174
+
175
+ ```bash
176
+ # Install PEFT
177
+ pip install peft
178
+
179
+ # In train_finetune.py, replace full fine-tuning with:
180
+ from peft import LoraConfig, get_peft_model
181
+
182
+ lora_config = LoraConfig(
183
+ r=8, lora_alpha=16,
184
+ target_modules=["qkv_proj", "out_proj", "gate_proj", "up_proj", "down_proj"],
185
+ lora_dropout=0.0,
186
+ bias="none",
187
+ )
188
+ model = get_peft_model(model, lora_config)
189
+ ```
190
+
191
+ ---
192
+
193
+ ## Stage 3: Pushing to HuggingFace Hub
194
+
195
+ After fine-tuning, push the pretrained encoder for community use:
196
+
197
+ ```python
198
+ from modeling_modern_protein import ModernProteinLM
199
+ from transformers import PreTrainedTokenizerFast
200
+
201
+ # Load your trained model
202
+ model = ModernProteinLM.from_pretrained("./outputs/pretrain/final")
203
+
204
+ # Push to Hub
205
+ model.push_to_hub("your-username/ModernProteinLM-150M")
206
+
207
+ # With a task-specific head
208
+ from modeling_modern_protein import ModernProteinLMForSequenceClassification
209
+ cls_model = ModernProteinLMForSequenceClassification.from_pretrained(
210
+ "./outputs/finetune/fluorescence/best"
211
+ )
212
+ cls_model.push_to_hub("your-username/ModernProteinLM-fluorescence")
213
+ ```
214
+
215
+ ---
216
+
217
+ ## Expected Improvements Over ESM-2 150M
218
+
219
+ | Technique | Source | Expected Gain |
220
+ |-----------|--------|--------------|
221
+ | ELECTRA vs MLM | ELECTRA paper | +3-5% on discriminative tasks |
222
+ | GeGLU vs GELU | ModernBERT | +1-2% |
223
+ | Deep & narrow (28L) | NeoBERT | +1-3% on embeddings |
224
+ | Span masking | SpanBERT analogy | +1-2% on structure tasks |
225
+ | Curriculum 30%β†’5% | mmBERT | Faster convergence |
226
+ | **Combined (conservative)** | β€” | **+7-14% on predictive benchmarks** |
227
+
228
+ ---
229
+
230
+ ## Troubleshooting
231
+
232
+ ### OOM during pre-training
233
+
234
+ ```bash
235
+ # Reduce per-device batch size
236
+ export BATCH_SIZE=32
237
+
238
+ # Enable gradient checkpointing
239
+ export GRADIENT_CHECKPOINTING=1
240
+
241
+ # Reduce sequence length
242
+ export MAX_SEQ_LENGTH=512
243
+ ```
244
+
245
+ ### FlashAttention install fails
246
+
247
+ ```bash
248
+ # Skip FlashAttention (slower but works)
249
+ export USE_FLASH_ATTN=0
250
+
251
+ # Or install from prebuilt wheel
252
+ pip install flash-attn --find-links https://github.com/Dao-AILab/flash-attention/releases
253
+ ```
254
+
255
+ ### Slow data loading
256
+
257
+ ```bash
258
+ # Increase workers
259
+ export NUM_WORKERS=16
260
+
261
+ # Pre-tokenize and cache
262
+ python -c "
263
+ from train_pretrain import load_sequences, ProteinTokenizer
264
+ import pickle
265
+ tokenizer = ProteinTokenizer()
266
+ seqs = load_sequences(None)
267
+ tokenized = [tokenizer.encode(s) for s in seqs]
268
+ pickle.dump(tokenized, open('tokenized_cache.pkl', 'wb'))
269
+ "
270
+ ```
271
+
272
+ ---
273
+
274
+ ## File Reference
275
+
276
+ ```
277
+ modern_protein_lm/
278
+ β”œβ”€β”€ modeling_modern_protein.py # Core architecture (ModernBERT-style + ELECTRA)
279
+ β”œβ”€β”€ train_pretrain.py # ELECTRA pre-training (supports DDP, AMP)
280
+ β”œβ”€β”€ train_finetune.py # Downstream fine-tuning (layer-wise LR)
281
+ β”œβ”€β”€ run_pretrain.sh # Launch script for pre-training
282
+ β”œβ”€β”€ run_finetune.sh # Launch script for fine-tuning
283
+ β”œβ”€β”€ requirements.txt # Dependencies
284
+ β”œβ”€β”€ README.md # Architecture docs
285
+ └── CLUSTER_INSTRUCTIONS.md # This file
286
+ ```
287
+
288
+ ---
289
+
290
+ ## Citation
291
+
292
+ If you use this architecture or achieve SOTA results, please cite:
293
+
294
+ ```bibtex
295
+ @article{lin2023evolutionary,
296
+ title={Language models of protein sequences at the scale of evolution enable accurate structure prediction},
297
+ author={Lin, Zeming and Akin, Halil and Rao, Roshan and Hie, Brian and Zhu, Zhongkai and Lu, Wenting and Smetanin, Nikita and Verkuil, Robert and Kabeli, Ori and Shmueli, Yaniv and others},
298
+ journal={Science},
299
+ year={2023}
300
+ }
301
+
302
+ @article{warner2024modernbert,
303
+ title={Smarter, Better, Faster, Longer: A Modern Bidirectional Encoder for Fast, Memory Efficient and Long Context Finetuning and Inference},
304
+ author={Warner, Benjamin and Chalkidis, Ilias and Dadic, Jon Ander and others},
305
+ journal={arXiv preprint arXiv:2412.13663},
306
+ year={2024}
307
+ }
308
+
309
+ @inproceedings{clark2020electra,
310
+ title={ELECTRA: Pre-training text encoders as discriminators rather than generators},
311
+ author={Clark, Kevin and Luong, Minh-Thang and Le, Quoc V and Manning, Christopher D},
312
+ booktitle={ICLR},
313
+ year={2020}
314
+ }
315
+ ```