Longevity Protein Classifier

Fine-tuned ESM-2 150M for binary classification of protein sequences as longevity-associated or not. Trained on multi-species GenAge data with LoRA adapters. Built to connect protein language models to longevity biology.

Performance

| Metric | Value |
| --- | --- |
| Test AUPRC | 0.335 |
| Test AUC-ROC | 0.696 |
| Random baseline AUPRC | 0.061 |
| Improvement over random | 5.5x |
| Best epoch | 10 of 20 |
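The improvement figure is the test AUPRC divided by the random baseline. For AUPRC, a random classifier is expected to score roughly the fraction of positives in the test set, which is why the baseline sits far below 0.5. A minimal check of the arithmetic, using the reported values (variable names are illustrative):

```python
# Values copied from the reported performance metrics.
test_auprc = 0.335
# For AUPRC, a random classifier scores about the positive-class prevalence.
random_baseline_auprc = 0.061

fold_improvement = test_auprc / random_baseline_auprc
print(f"{fold_improvement:.1f}x over random")  # prints "5.5x over random"
```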

Benchmark Results

| Protein | Score | Pass/Fail | Notes |
| --- | --- | --- | --- |
| SIRT1 | 0.996 | PASS | NAD+ deacetylase, caloric restriction |
| SIRT3 | 0.998 | PASS | Mitochondrial sirtuin |
| TP53 | 0.974 | PASS | Tumour suppressor, aging roles |
| MYH9 | 0.000 | PASS | Negative control: structural myosin |
| ACTB | 0.000 | PASS | Negative control: beta actin |
| ALB | 0.000 | PASS | Negative control: serum albumin |
| FOXO3 | 0.000 | FAIL | Known limitation: see below |
| MTOR | 0.000 | FAIL | Known limitation: truncated at 512 aa |
| TERT | 0.000 | FAIL | Known limitation: truncated at 512 aa |

Novel Predictions Not in GenAge

Proteins scoring above 0.50 that are not in the GenAge human database. These are model predictions only, not experimentally validated.

| Protein | Score | Biological relevance |
| --- | --- | --- |
| NEIL1 | 0.951 | DNA repair of oxidative damage. DNA repair capacity correlates with species lifespan |
| GRHL1 | 0.880 | Epithelial barrier maintenance. Tissue integrity declines with age |
| GSTA1 | 0.871 | Glutathione S-transferase antioxidant. GST family implicated in longevity across species |
| EXO1 | 0.550 | DNA mismatch repair exonuclease |
| MSH4 | 0.546 | DNA mismatch repair. Related family members MSH2 and MSH6 are established longevity genes |
| TFEB | 0.502 | Master regulator of autophagy and lysosomal biogenesis. Overexpression extends lifespan in C. elegans. Regulated by mTOR |

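TFEB is arguably the most biologically compelling novel prediction, even though its score (0.502) is the lowest in the table and only just clears the 0.50 cut-off. It is mechanistically connected to mTOR (already in GenAge), and the model flagged it without being given any pathway information during training.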

Recommended Thresholds

| Use case | Threshold |
| --- | --- |
| Screening (maximise recall) | 0.05 |
| Balanced (default) | 0.06 |
| High confidence hits only | 0.50 |
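The thresholds can be applied to a model probability as in the sketch below. The dictionary keys and the `classify` helper are illustrative names, not part of the released code; only the three threshold values come from this card.

```python
# Threshold values copied from this card; the keys are illustrative names.
THRESHOLDS = {
    "screening": 0.05,        # maximise recall
    "balanced": 0.06,         # default
    "high_confidence": 0.50,  # high confidence hits only
}

def classify(probability: float, use_case: str = "balanced") -> str:
    """Label a model probability using the threshold for the given use case."""
    threshold = THRESHOLDS[use_case]
    return "Longevity" if probability >= threshold else "Non-longevity"

# GRHL1 scores 0.880, so it passes even the strictest threshold.
print(classify(0.880, "high_confidence"))
# A borderline score of 0.055 passes the screening threshold but not the default.
print(classify(0.055, "screening"), classify(0.055, "balanced"))
```

Because the screening and balanced thresholds differ by only 0.01, most of the practical difference is between those two and the 0.50 cut-off.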

Known Limitations

1. Protein length. Sequences longer than 512 amino acids are truncated at the C-terminal end, so only the N-terminal portion is scored. This causes failures on long proteins where the functional domain sits in the C-terminal half. MTOR (2,549 aa) and TERT (1,132 aa) both fail for this reason. Do not use this model on proteins above 800 amino acids without validating first.

2. Family-specific blind spots. The model learned sirtuin and tumour suppressor sequence features well but has insufficient training examples to generalise to forkhead transcription factors. FOXO3 (402 aa, fits within 512 window) scores 0.000 despite being a canonical longevity gene. This is a training data coverage problem, not a truncation problem.

3. Direction of effect not captured. The model cannot distinguish pro-longevity proteins (overexpression extends lifespan) from anti-aging-disease proteins (loss of function accelerates aging). A high score means "associated with longevity biology", not "activating this protein extends lifespan".

4. Not for clinical use. Research screening tool only. Do not use for clinical, diagnostic, or therapeutic decisions.
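The length caveat in item 1 can be turned into a pre-check that runs before scoring. This is a minimal sketch: the function name and the returned labels are hypothetical, and the two cut-offs (512 and 800) are taken from the text above.

```python
MAX_WINDOW = 512      # sequences longer than this are truncated before scoring
VALIDATE_ABOVE = 800  # above this length the card advises validating first

def length_check(sequence: str) -> str:
    """Classify a sequence by how much the length limitation is likely to matter."""
    n = len(sequence)
    if n > VALIDATE_ABOVE:
        return "validate_first"
    if n > MAX_WINDOW:
        return "truncated"
    return "ok"

# Dummy sequences with the lengths quoted in this card.
print(length_check("M" * 402))   # FOXO3 length as quoted: fits the window
print(length_check("M" * 1132))  # TERT length: validate first
print(length_check("M" * 2549))  # MTOR length: validate first
```

A passing length check does not guarantee a reliable score, as the FOXO3 case in item 2 shows.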

How to Use

from transformers import AutoTokenizer, EsmForSequenceClassification
from peft import PeftModel
import torch

# Load the ESM-2 150M base model with a two-class head, then attach the LoRA adapter.
base = EsmForSequenceClassification.from_pretrained(
    "facebook/esm2_t30_150M_UR50D",
    num_labels=2,
    ignore_mismatched_sizes=True
)
model = PeftModel.from_pretrained(base, "mawe/longevity-esm2-v6")
tokenizer = AutoTokenizer.from_pretrained("mawe/longevity-esm2-v6")

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = model.to(device)
model.eval()

# The 512-token limit includes the two special tokens the ESM tokenizer adds
# (start and end of sequence), so at most 510 residues are actually scored.
MAX_TOKENS = 512
MAX_RESIDUES = MAX_TOKENS - 2

def score_sequence(sequence, threshold=0.06):
    """Return the longevity probability and thresholded label for one protein sequence."""
    inputs = tokenizer(
        sequence,
        max_length=MAX_TOKENS,
        padding="max_length",
        truncation=True,
        return_tensors="pt"
    )
    with torch.no_grad():
        outputs = model(
            input_ids=inputs["input_ids"].to(device),
            attention_mask=inputs["attention_mask"].to(device)
        )
        # Class index 1 is treated as the longevity-associated class.
        prob = torch.softmax(outputs.logits, dim=1)[:, 1].item()
    truncated = len(sequence) > MAX_RESIDUES
    return {
        "probability": round(prob, 4),
        "prediction": "Longevity" if prob >= threshold else "Non-longevity",
        "warning": f"Sequence truncated to {MAX_RESIDUES} residues" if truncated else None
    }
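For screening several proteins at once, the results of `score_sequence` can be filtered and ranked. The helper below is a hypothetical sketch, not part of the released code: it takes the scoring function as an argument, so it can be tried with a stand-in scorer before the model is downloaded.

```python
from typing import Callable, Dict, List, Tuple

def rank_candidates(
    sequences: Dict[str, str],
    score_fn: Callable[[str], dict],
    min_probability: float = 0.50,
) -> List[Tuple[str, float]]:
    """Score each named sequence and return hits at or above min_probability, best first."""
    hits = []
    for name, seq in sequences.items():
        result = score_fn(seq)
        if result["probability"] >= min_probability:
            hits.append((name, result["probability"]))
    return sorted(hits, key=lambda item: item[1], reverse=True)

# Stand-in scorer so the sketch runs without the model;
# in practice, pass score_sequence from the block above instead.
def fake_score(seq: str) -> dict:
    return {"probability": 0.9 if seq.startswith("MA") else 0.1}

print(rank_candidates({"prot_a": "MADE", "prot_b": "MKLV"}, fake_score))
# prints [('prot_a', 0.9)]
```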

Training Details

Positive set: GenAge database

  • Human GenAge: 306 genes (all entries)
  • C. elegans Pro-Longevity: 283 genes
  • D. melanogaster Pro-Longevity: 125 genes
  • M. musculus Pro-Longevity: 85 genes
  • Total positives: 574

Negative set: Swiss-Prot reviewed proteins

  • NEG_RATIO: 10 negatives per positive
  • Species weights: human 2.0x, mouse 1.5x, worm and fly 1.0x
  • Necessary-for-fitness genes excluded from universe
  • Anti-Longevity genes excluded from positives
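The negative ratio and species weights above can be written down as constants. The sketch below assumes the weights are applied per training example; this card does not say whether they enter the sampler or the loss, so treat that as an assumption, and the helper names as hypothetical.

```python
NEG_RATIO = 10  # negatives sampled per positive
SPECIES_WEIGHT = {"human": 2.0, "mouse": 1.5, "worm": 1.0, "fly": 1.0}

def n_negatives(n_positives: int, neg_ratio: int = NEG_RATIO) -> int:
    """Number of negatives implied by the negative-to-positive ratio."""
    return n_positives * neg_ratio

def example_weights(species_labels):
    """Assumed usage: one weight per training example, looked up by species."""
    return [SPECIES_WEIGHT[s] for s in species_labels]

print(n_negatives(574))                             # 5740 negatives for 574 positives
print(example_weights(["human", "worm", "mouse"]))  # [2.0, 1.0, 1.5]
```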

Architecture: ESM-2 150M + LoRA r=16, alpha=32, dropout=0.15
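With the `peft` library, these adapter settings might be expressed roughly as follows. The rank, alpha, and dropout come from the line above; `target_modules` is an assumption (attention query and value projections are a common choice for ESM-2), since this card does not state which layers were adapted.

```python
from peft import LoraConfig, TaskType

lora_config = LoraConfig(
    task_type=TaskType.SEQ_CLS,         # sequence classification head on top of the encoder
    r=16,                               # LoRA rank
    lora_alpha=32,                      # scaling factor; the update is scaled by alpha / r = 2
    lora_dropout=0.15,
    target_modules=["query", "value"],  # assumption: not stated in this card
)
```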

Loss: Focal loss gamma=1.0, label smoothing=0.1, contrastive margin=0.30
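One plausible way to combine focal loss with label smoothing for a single two-class example is sketched below. How the two were actually combined in training is not stated here, and the contrastive term is omitted, so this is an illustration of the general technique under those assumptions rather than the training loss itself.

```python
import math

def focal_loss(p_positive: float, label: int, gamma: float = 1.0,
               smoothing: float = 0.1, eps: float = 1e-7) -> float:
    """Two-class focal loss with label smoothing for one example.

    p_positive is the predicted probability of the longevity class (label 1).
    """
    p_positive = min(max(p_positive, eps), 1.0 - eps)  # avoid log(0)
    probs = [1.0 - p_positive, p_positive]
    # Smoothed target: spread `smoothing` evenly, put the rest on the true class.
    targets = [smoothing / 2.0, smoothing / 2.0]
    targets[label] += 1.0 - smoothing
    # Each class term is scaled by (1 - p)^gamma, which shrinks the
    # contribution of classes that are already predicted confidently.
    return -sum(t * (1.0 - p) ** gamma * math.log(p) for t, p in zip(targets, probs))

# A confidently correct positive contributes much less than a misclassified one.
print(focal_loss(0.9, 1), focal_loss(0.3, 1))
```

With `gamma=0` and `smoothing=0` the function reduces to ordinary cross-entropy, which is a convenient sanity check.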

Optimiser: AdamW lr=2e-4, cosine schedule, 10% warmup
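The learning-rate schedule can be sketched as a plain function of the training step. This assumes linear warmup over the first 10% of steps followed by cosine decay to zero, which is the usual form of this schedule; `total_steps` is a hypothetical parameter, as the number of training steps is not given here.

```python
import math

def lr_at_step(step: int, total_steps: int, base_lr: float = 2e-4,
               warmup_frac: float = 0.10) -> float:
    """Learning rate under linear warmup followed by cosine decay to zero."""
    warmup_steps = int(total_steps * warmup_frac)
    if step < warmup_steps:
        # Linear ramp from 0 up to base_lr.
        return base_lr * step / max(1, warmup_steps)
    # Cosine decay from base_lr down to 0 over the remaining steps.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))

# Start, end of warmup, midpoint of decay, and final step of a 1000-step run.
for step in (0, 100, 550, 1000):
    print(step, lr_at_step(step, total_steps=1000))
```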

Hardware: NVIDIA T4 16GB on Kaggle

Experiment History

| Version | Key change | Test AUPRC |
| --- | --- | --- |
| v1 | Frozen encoder, 186 positives | Collapsed |
| v2 | LoRA r=8, 277 positives | 0.027 |
| v3 | ESM-2 150M, multi-species, 2000 positives | 0.302 |
| v4 | Pro-Longevity filter, focal loss gamma=2 | 0.250 |
| v5 | Cleaned species, gamma=1, label smoothing | 0.323 |
| v6 | Pathway-stratified split, contrastive margin | 0.335 |

Contact

Feedback and collaboration welcome.
