# Longevity Protein Classifier v6

Fine-tuned ESM-2 150M for binary classification of protein sequences as longevity-associated or not, trained on multi-species GenAge data with LoRA adapters.

Built as part of a personal ML learning arc (Week 3 of 8) connecting protein language models to longevity biology.
## Model Description
- Model type: ESM-2 150M + LoRA (r=16) sequence classifier
- Base model: facebook/esm2_t30_150M_UR50D
- Task: Binary classification (longevity-associated vs. non-longevity)
- Developed by: Mo Elzek
- License: Apache 2.0
## Performance
| Metric | Value |
|---|---|
| Test AUPRC | 0.335 |
| Test AUC-ROC | 0.696 |
| Random AUPRC baseline | 0.061 |
| Improvement over random | 5.5x |
| Training epochs | 10 (early stopping) |
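For context on the baseline row: a classifier that assigns random scores has an expected AUPRC equal to the positive-class prevalence. A minimal, self-contained sketch with toy data (the counts here are illustrative, not the actual test set):

```python
import random

def average_precision(labels, scores):
    """AUPRC via the average-precision formulation:
    mean of precision@k taken at the rank of each positive hit."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    hits, precisions = 0, []
    for rank, i in enumerate(order, start=1):
        if labels[i] == 1:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / max(hits, 1)

random.seed(0)
n, prevalence = 20000, 0.061
labels = [1 if random.random() < prevalence else 0 for _ in range(n)]
scores = [random.random() for _ in range(n)]  # a purely random classifier

ap = average_precision(labels, scores)
print(f"random-classifier AUPRC ~ {ap:.3f} (prevalence {prevalence})")
```

With enough samples the random-classifier AUPRC concentrates tightly around the prevalence, which is why 0.061 is the right baseline to compare 0.335 against.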
## Benchmark Results
| Protein | Score | Expected | Notes |
|---|---|---|---|
| SIRT1 | 0.996 | HIGH | NAD+ deacetylase, caloric restriction mediator |
| SIRT3 | 0.998 | HIGH | Mitochondrial sirtuin |
| TP53 | 0.974 | HIGH | Tumour suppressor, aging roles |
| MYH9 | 0.000 | LOW | Structural myosin; negative control |
| ACTB | 0.000 | LOW | Beta actin; negative control |
| ALB | 0.000 | LOW | Serum albumin; negative control |
| FOXO3 | 0.000 | HIGH | Fails (see limitations) |
| MTOR | 0.000 | HIGH | Fails (see limitations) |
| TERT | 0.000 | HIGH | Fails (see limitations) |
## Novel Predictions Not in GenAge

Proteins scoring above 0.50 that are not present in the GenAge human database. These are the model's predictions of longevity-relevant proteins not yet catalogued there; they are not validated findings.
| Protein | Score | Biological relevance |
|---|---|---|
| TFEB | 0.502 | Master regulator of autophagy and lysosomal biogenesis. Overexpression extends lifespan in C. elegans. Regulated by mTOR. Strongest novel prediction by prior literature support. |
| NEIL1 | 0.951 | DNA glycosylase, base excision repair of oxidative damage. DNA repair capacity correlates with species lifespan. |
| GSTA1 | 0.871 | Glutathione S-transferase. Antioxidant defence. GST family implicated in longevity across multiple species. |
| GRHL1 | 0.880 | Grainyhead-like transcription factor. Epithelial barrier maintenance β tissue integrity declines with age. |
| EXO1 | 0.550 | Exonuclease involved in DNA mismatch repair and double-strand break repair. |
| MSH4 | 0.546 | DNA mismatch repair. Related family members (MSH2, MSH6) are established longevity-associated genes. |
## Recommended Thresholds
| Use case | Threshold | Precision | Recall |
|---|---|---|---|
| Screening (cast a wide net) | 0.05 | ~0.20 | ~0.29 |
| Balanced | 0.06 | ~0.41 | ~0.29 |
| High-confidence hits only | 0.50 | ~0.61 | ~0.24 |
Optimised threshold from the validation set: 0.06 (F1 = 0.358).

The model produces a bimodal score distribution: proteins it recognises score very high (above 0.50), and proteins it does not recognise score near zero. The flat recall curve from 0.05 to 0.70 reflects this; most longevity proteins are either clearly found or clearly missed.
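The 0.06 operating point comes from a sweep of this kind. A minimal sketch of F1-optimal threshold selection (the scores below are illustrative of the bimodal pattern, not the actual validation set):

```python
def best_f1_threshold(labels, probs, thresholds):
    """Sweep candidate thresholds and return the (threshold, F1) pair maximising F1."""
    best = (None, -1.0)
    for t in thresholds:
        preds = [1 if p >= t else 0 for p in probs]
        tp = sum(1 for y, yh in zip(labels, preds) if y == 1 and yh == 1)
        fp = sum(1 for y, yh in zip(labels, preds) if y == 0 and yh == 1)
        fn = sum(1 for y, yh in zip(labels, preds) if y == 1 and yh == 0)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        if f1 > best[1]:
            best = (t, f1)
    return best

# Toy validation scores: most positives either very high or near zero
labels = [1, 1, 1, 0, 0, 0, 0, 0, 1]
probs  = [0.99, 0.97, 0.07, 0.01, 0.00, 0.02, 0.55, 0.00, 0.00]
t, f1 = best_f1_threshold(labels, probs, [0.01, 0.05, 0.06, 0.5, 0.7])
print(t, round(f1, 3))
```

Because the distribution is bimodal, many thresholds in the flat region achieve the same F1, and the sweep just picks the first maximiser.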
## Known Limitations: Read Before Use

### 1. Protein length truncation
Sequences longer than 512 amino acids are truncated from the C-terminus. This causes systematic failures on long proteins where the functional domain sits in the C-terminal half:
- MTOR (2,549 aa): kinase domain at residues 2181-2431, truncated away
- TERT (1,132 aa): reverse transcriptase domain at roughly 600-900, truncated away
Do not use this model to score proteins above 800 amino acids without validating on known examples from that protein family first.
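A simple pre-flight check along these lines can flag affected proteins before scoring (a hypothetical helper, not part of the released code; the domain coordinates are the examples quoted above):

```python
MAX_LEN = 512  # model context; residues beyond this are dropped from the C-terminus

def truncation_report(seq_len, domain_start, domain_end, max_len=MAX_LEN):
    """Report whether a functional domain survives truncation to max_len residues."""
    if seq_len <= max_len:
        return "full sequence seen"
    if domain_start > max_len:
        return "domain fully truncated away"
    if domain_end > max_len:
        return "domain partially truncated"
    return "domain retained, but C-terminal context lost"

# MTOR: 2,549 aa, kinase domain at residues 2181-2431
print(truncation_report(2549, 2181, 2431))
# TERT: 1,132 aa, reverse transcriptase domain at roughly 600-900
print(truncation_report(1132, 600, 900))
```

For both failure cases above the functional domain lies entirely past residue 512, which is why the model scores them near zero.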
### 2. Family-specific blind spots
The model learned sirtuin and tumour suppressor sequence features well but has insufficient training examples to generalise to:
- Forkhead transcription factors (FOXO3 scores 0.000 despite being a canonical longevity gene and fitting within the 512 aa window)
- Large kinases (truncation compounds this)
- Telomerase complex proteins
### 3. Direction of effect not captured
The model cannot distinguish between:
- Pro-longevity proteins (overexpression extends lifespan)
- Anti-aging-disease proteins (loss of function accelerates aging)
Both may score high. A high score means "associated with longevity biology", not "activating this protein extends lifespan".
### 4. Not validated experimentally

Novel predictions are model outputs only. No wet-lab validation has been performed. TFEB is the strongest prediction based on prior literature, but this model did not discover TFEB; it independently ranked it highly, consistent with existing biology.
### 5. Not for clinical use
This is a research screening tool. Do not use for any clinical, diagnostic, or therapeutic decision-making.
## Training Data
Positive set: GenAge database (genomics.senescence.info)
- Human GenAge: 306 human longevity-associated genes
- Model organism GenAge: Pro-Longevity genes only, from 4 species:
  - C. elegans: 283 genes
  - D. melanogaster: 125 genes
  - M. musculus: 85 genes
- Total positives: ~574
Negative set: Swiss-Prot reviewed proteins from same species
- Sampled proportionally per species (NEG_RATIO=10)
- Species weights applied: human 2.0x, mouse 1.5x, worm/fly 1.0x
- "Necessary for fitness" genes excluded from universe entirely
- Anti-Longevity genes excluded from positives
Filtering:
- Sequence length: 50-1500 amino acids
- Swiss-Prot reviewed only (manually curated)
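One plausible reading of the sampling scheme above, sketched as per-species negative quotas. Note the loud assumption: the card does not specify whether the species weights scale the sampling quota or the loss, so the formula below is illustrative only; positive counts are taken from the lists above.

```python
NEG_RATIO = 10
SPECIES_WEIGHT = {"human": 2.0, "mouse": 1.5, "worm": 1.0, "fly": 1.0}
POSITIVES = {"human": 306, "worm": 283, "fly": 125, "mouse": 85}

def negative_quota(species):
    """ASSUMED scheme: negatives per species = NEG_RATIO x positives x species weight."""
    return round(NEG_RATIO * POSITIVES[species] * SPECIES_WEIGHT[species])

quotas = {sp: negative_quota(sp) for sp in POSITIVES}
print(quotas)
```

Under this reading, negatives would then be drawn from the Swiss-Prot reviewed pool for each species until its quota is filled.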
## Training Procedure
Architecture: ESM-2 150M + LoRA adapters
- LoRA rank: r=16, alpha=32, dropout=0.15
- Target modules: query, value attention projections
- Trainable parameters: ~4.7M of 150M total (3.1%)
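The adapter setup corresponds to a PEFT configuration along these lines (a sketch, not the exact training script; the module names `query` and `value` assume the Hugging Face ESM implementation):

```python
from peft import LoraConfig, TaskType, get_peft_model
from transformers import EsmForSequenceClassification

base = EsmForSequenceClassification.from_pretrained(
    "facebook/esm2_t30_150M_UR50D", num_labels=2
)
lora_config = LoraConfig(
    task_type=TaskType.SEQ_CLS,
    r=16,
    lora_alpha=32,
    lora_dropout=0.15,
    target_modules=["query", "value"],  # attention Q/V projections only
)
model = get_peft_model(base, lora_config)
model.print_trainable_parameters()  # ~4.7M trainable of ~150M total (per the card)
```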
Loss function: Focal loss with contrastive margin penalty
- gamma=1.0 (softer than standard gamma=2.0)
- Label smoothing=0.1
- Contrastive margin=0.30 (explicit separation penalty)
- Class weights: balanced
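Written out for the binary case, the focal term and label smoothing look like this (a minimal sketch of the standard formulation, not the training code; the contrastive margin penalty is a separate additive term and is not reproduced here):

```python
import math

def focal_bce(p, y, gamma=1.0, smoothing=0.1):
    """Binary focal loss with label smoothing for a single example.

    p: predicted probability of the positive class
    y: hard label (0 or 1)
    gamma: focusing parameter; gamma=0 recovers plain cross-entropy
    """
    y_smooth = y * (1.0 - smoothing) + 0.5 * smoothing  # soft target
    ce = -(y_smooth * math.log(p) + (1.0 - y_smooth) * math.log(1.0 - p))
    pt = p if y == 1 else 1.0 - p  # probability assigned to the true (hard) class
    return (1.0 - pt) ** gamma * ce

# gamma down-weights easy examples: a confident correct prediction
# contributes far less loss than a confident wrong one
print(focal_bce(0.95, 1), focal_bce(0.05, 1))
```

With gamma=1.0 the down-weighting of easy examples is softer than the usual gamma=2.0, which matches the card's choice for this imbalanced but small positive set.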
- Optimiser: AdamW, lr=2e-4, weight_decay=0.01
- Schedule: cosine with warmup (10% warmup steps)
- Early stopping: patience=4 on validation AUPRC
- Best epoch: 10 of 20
- Hardware: NVIDIA T4 16GB (Kaggle)
- Training time: ~2 hours
## How to Use
```python
from transformers import AutoTokenizer, EsmForSequenceClassification
from peft import PeftModel
import torch

# Load base model and LoRA adapter
base = EsmForSequenceClassification.from_pretrained(
    "facebook/esm2_t30_150M_UR50D",
    num_labels=2,
    ignore_mismatched_sizes=True,
)
model = PeftModel.from_pretrained(base, "YOUR_USERNAME/longevity-esm2-v6")
tokenizer = AutoTokenizer.from_pretrained("YOUR_USERNAME/longevity-esm2-v6")

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = model.to(device)
model.eval()

def score_sequence(sequence, threshold=0.06):
    inputs = tokenizer(
        sequence,
        max_length=512,
        padding="max_length",
        truncation=True,
        return_tensors="pt",
    )
    with torch.no_grad():
        outputs = model(
            input_ids=inputs["input_ids"].to(device),
            attention_mask=inputs["attention_mask"].to(device),
        )
    prob = torch.softmax(outputs.logits, dim=1)[:, 1].item()
    return {
        "probability": round(prob, 4),
        "prediction": "Longevity" if prob >= threshold else "Non-longevity",
        "threshold": threshold,
        "warning": "Truncated to 512 aa" if len(sequence) > 512 else None,
    }

# Example
result = score_sequence("MKTAYIAKQRQISFVK...")
print(result)
```
Recommended thresholds:
- 0.05-0.06 for screening (maximise recall)
- 0.50 for high-confidence hits only
## Experiment History
This model is v6 in a series of iterative experiments:
| Version | Key change | Test AUPRC |
|---|---|---|
| v1 | Frozen encoder, 186 positives | Collapsed |
| v2 | LoRA r=8, 277 positives | 0.027 |
| v3 | ESM-2 150M, multi-species, ~2000 positives | 0.302 |
| v4 | Pro-Longevity filter, focal loss gamma=2 | 0.250 |
| v5 | Cleaned species, gamma=1, label smoothing | 0.323 |
| v6 (this) | Pathway-stratified split, contrastive margin | 0.335 |
## Citation
If you use this model in research, please cite:

```bibtex
@misc{elzek2026longevity,
  author    = {Elzek, Mo},
  title     = {Longevity Protein Classifier: Multi-species ESM-2 Fine-tuning},
  year      = {2026},
  publisher = {HuggingFace},
  url       = {https://huggingface.co/YOUR_USERNAME/longevity-esm2-v6}
}
```
## Contact
Built by Mo Elzek as part of the London Longevity Network ML project arc.
Feedback and collaboration welcome.