metadata
language: en
license: mit
library_name: transformers
datasets:
  - lamm-mit/protein_secondary_structure_from_PDB
models:
  - AmelieSchreiber/esm2_t6_8M_UR50D-finetuned-secondary-structure
tags:
  - conformal-prediction
  - protein-language-models
  - uncertainty-quantification
  - esm-2
  - temperature-scaling
  - cpu
  - protein-structure
  - protein-engineering

ConformalESM: Distribution-Free Uncertainty Quantification for Protein Language Models

Cites: Lin et al. 2023, "Evolutionary-scale prediction of atomic-level protein structure with a language model", Science.

Novelty

To our knowledge, this is the first work to apply conformal prediction and temperature scaling to protein language models (PLMs). More than ten papers from 2023-2024 apply these techniques to general-purpose LLMs, but we found none that carry them over to the protein domain.

Key Results

| Method | Accuracy | ECE | Avg Set Size (α=0.10) | Coverage |
|---|---|---|---|---|
| Baseline ESM-2 | 61.3% | 0.147 | - | - |
| + Temperature Scaling | 61.3% | 0.058 (-61%) | - | - |
| + Conformal Prediction | - | - | 1.79 | 89.9% |
| + Class-Conditional Conformal | - | - | 1.63 | 89.9% |
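The first two rows can be read as follows: temperature scaling divides the logits by a single scalar T fitted on the calibration split, which leaves the argmax (and so the accuracy) unchanged while lowering the expected calibration error (ECE). The sketch below illustrates that idea in plain Python. It is not taken from conformalesm_full.py: the grid-search fit, the 15-bin ECE, and the synthetic three-class logits (`make_split`, `overconfidence`) are all assumptions made for illustration.

```python
import math
import random

def softmax(logits, temperature=1.0):
    """Softmax of one logit vector after dividing by a temperature."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def nll(logits_list, labels, temperature):
    """Mean negative log-likelihood of the true labels at a given temperature."""
    total = 0.0
    for logits, y in zip(logits_list, labels):
        scaled = [z / temperature for z in logits]
        m = max(scaled)
        log_norm = m + math.log(sum(math.exp(z - m) for z in scaled))
        total += log_norm - scaled[y]
    return total / len(labels)

def fit_temperature(logits_list, labels, lo=0.05, hi=10.0, steps=200):
    """Pick T minimizing NLL on a log-spaced grid (stand-in for a gradient-based fit)."""
    grid = [lo * (hi / lo) ** (i / (steps - 1)) for i in range(steps)]
    return min(grid, key=lambda t: nll(logits_list, labels, t))

def expected_calibration_error(probs_list, labels, n_bins=15):
    """Size-weighted mean of |accuracy - confidence| over equal-width confidence bins."""
    bins = [[0, 0.0, 0] for _ in range(n_bins)]  # [count, sum of confidence, num correct]
    for probs, y in zip(probs_list, labels):
        conf = max(probs)
        pred = probs.index(conf)
        b = min(int(conf * n_bins), n_bins - 1)
        bins[b][0] += 1
        bins[b][1] += conf
        bins[b][2] += int(pred == y)
    n = len(labels)
    return sum(abs(c / k - s / k) * k / n for k, s, c in bins if k)

# Synthetic demo (hypothetical data, not the results reported in the table).
rng = random.Random(0)

def make_split(n, overconfidence=3.0):
    """Draw labels from softmax(z), then return logits inflated by `overconfidence`."""
    logits_list, labels = [], []
    for _ in range(n):
        z = [rng.gauss(0.0, 1.0) for _ in range(3)]
        labels.append(rng.choices(range(3), weights=softmax(z))[0])
        logits_list.append([overconfidence * v for v in z])
    return logits_list, labels

cal_logits, cal_labels = make_split(1000)
test_logits, test_labels = make_split(2000)

T = fit_temperature(cal_logits, cal_labels)
ece_before = expected_calibration_error([softmax(z) for z in test_logits], test_labels)
ece_after = expected_calibration_error([softmax(z, T) for z in test_logits], test_labels)
print(f"T={T:.2f}  ECE before={ece_before:.3f}  ECE after={ece_after:.3f}")
```

Because the synthetic logits are inflated by a known factor, the fitted T should land near that factor. On real model outputs there is no such ground truth, and T is judged only by the held-out ECE or NLL.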

Per-class (α=0.10, class-conditional):

  • Coil (C): coverage=90.8%, avg set=1.15 (most confident)
  • Helix (H): coverage=88.6%, avg set=1.99
  • Sheet (E): coverage=90.1%, avg set=1.94
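Coverage and average set size of this kind are typically obtained with split conformal prediction: a score threshold is computed on the calibration residues, and every class whose score falls within the threshold is included in the prediction set. The sketch below is a hedged illustration, not the code in conformalesm_full.py. It assumes the score s = 1 - p(class), a finite-sample-corrected quantile, and synthetic three-class probabilities from a hypothetical `synthetic` helper. The class-conditional variant differs only in computing one threshold per true class.

```python
import math
import random

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def conformal_quantile(scores, alpha):
    """Finite-sample-corrected (1 - alpha) quantile of calibration scores."""
    n = len(scores)
    k = math.ceil((n + 1) * (1 - alpha))  # rank of the order statistic to use
    if k > n:
        return float("inf")  # too few points: every class gets included
    return sorted(scores)[k - 1]

def marginal_threshold(cal_probs, cal_labels, alpha):
    """Single threshold from scores s = 1 - p(true class)."""
    scores = [1.0 - p[y] for p, y in zip(cal_probs, cal_labels)]
    return conformal_quantile(scores, alpha)

def class_conditional_thresholds(cal_probs, cal_labels, alpha, n_classes):
    """One threshold per class, using only calibration points of that true class."""
    thresholds = []
    for c in range(n_classes):
        scores = [1.0 - p[c] for p, y in zip(cal_probs, cal_labels) if y == c]
        thresholds.append(conformal_quantile(scores, alpha))
    return thresholds

def prediction_set(probs, thresholds):
    """Classes whose score 1 - p(class) is within that class's threshold."""
    return [c for c, p in enumerate(probs) if 1.0 - p <= thresholds[c]]

def evaluate(test_probs, test_labels, thresholds):
    """Return (empirical coverage, average set size) on the test split."""
    sets = [prediction_set(p, thresholds) for p in test_probs]
    coverage = sum(y in s for s, y in zip(sets, test_labels)) / len(sets)
    avg_size = sum(len(s) for s in sets) / len(sets)
    return coverage, avg_size

# Synthetic demo (hypothetical data standing in for per-residue H/E/C probabilities).
rng = random.Random(0)

def synthetic(n, n_classes=3):
    probs_list, labels = [], []
    for _ in range(n):
        p = softmax([rng.gauss(0.0, 1.5) for _ in range(n_classes)])
        probs_list.append(p)
        labels.append(rng.choices(range(n_classes), weights=p)[0])
    return probs_list, labels

alpha = 0.10
cal_probs, cal_labels = synthetic(2000)
test_probs, test_labels = synthetic(2000)

q = marginal_threshold(cal_probs, cal_labels, alpha)
cc = class_conditional_thresholds(cal_probs, cal_labels, alpha, n_classes=3)

marg_cov, marg_size = evaluate(test_probs, test_labels, [q] * 3)
cc_cov, cc_size = evaluate(test_probs, test_labels, cc)
print(f"marginal:          coverage={marg_cov:.3f}  avg set={marg_size:.2f}")
print(f"class-conditional: coverage={cc_cov:.3f}  avg set={cc_size:.2f}")
```

The guarantee is on coverage averaged over random calibration/test draws, so per-class figures on a finite test set can land slightly above or below the 1 - α target.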

How to Run

pip install transformers datasets torch scikit-learn
python conformalesm_full.py

No GPU required. Runs in ~5 minutes on CPU.

Dataset

  • lamm-mit/protein_secondary_structure_from_PDB (125K sequences)
  • 500 calibration / 500 test proteins
  • 382,675 total residues evaluated