---
language: en
license: mit
library_name: transformers
datasets:
- lamm-mit/protein_secondary_structure_from_PDB
models:
- AmelieSchreiber/esm2_t6_8M_UR50D-finetuned-secondary-structure
tags:
- conformal-prediction
- protein-language-models
- uncertainty-quantification
- esm-2
- temperature-scaling
- cpu
- protein-structure
- protein-engineering
---
# ConformalESM: Distribution-Free Uncertainty Quantification for Protein Language Models
**Cites:** Lin et al. 2023, "Evolutionary-scale prediction of atomic-level protein structure with a language model", Science.
## Novelty
To our knowledge, this is the **first work** to apply conformal prediction and temperature scaling to protein language models (PLMs). More than ten papers applied these techniques to general-purpose LLMs in 2023-2024, but we found none that carry them over to the protein domain.
## Key Results
| Method | Accuracy | ECE | Avg Set Size (α=0.10) | Coverage |
|--------|----------|-----|----------------------|----------|
| Baseline ESM-2 | 61.3% | 0.147 | n/a | n/a |
| + Temperature Scaling | 61.3% | **0.058** (-61%) | n/a | n/a |
| + Conformal Prediction | n/a | n/a | 1.79 | 89.9% |
| + Class-Conditional Conformal | n/a | n/a | **1.63** | 89.9% |

**Per-class** (α=0.10, class-conditional):
- **Coil (C)**: coverage=90.8%, avg set=1.15 (most confident)
- **Helix (H)**: coverage=88.6%, avg set=1.99
- **Sheet (E)**: coverage=90.1%, avg set=1.94
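
The coverage and set-size figures above come from split conformal prediction. The sketch below illustrates the general technique only and is not taken from `conformalesm_full.py`. It assumes the common nonconformity score of one minus the true-class probability, which this README does not confirm. The helper names (`conformal_quantile`, `calibrate`, `prediction_sets`) are hypothetical, and the data is synthetic rather than ESM-2 output.

```python
import math
import numpy as np

def conformal_quantile(scores, alpha):
    """Finite-sample-corrected (1 - alpha) quantile of calibration scores."""
    n = len(scores)
    k = math.ceil((n + 1) * (1 - alpha))  # rank of the order statistic to use
    if k > n:
        return float("inf")  # too few calibration points: keep every label
    return float(np.sort(scores)[k - 1])

def calibrate(probs, labels, alpha, class_conditional=False):
    """Return one threshold per class.

    probs: (n, K) softmax outputs; labels: (n,) integer true classes.
    Assumed score: 1 - probability assigned to the true class.
    """
    n_classes = probs.shape[1]
    scores = 1.0 - probs[np.arange(len(labels)), labels]
    if not class_conditional:
        # Marginal variant: a single threshold shared by all classes.
        return np.full(n_classes, conformal_quantile(scores, alpha))
    # Class-conditional variant: a separate threshold per true class.
    return np.array(
        [conformal_quantile(scores[labels == k], alpha) for k in range(n_classes)]
    )

def prediction_sets(probs, thresholds):
    """Boolean (n, K) mask: label k is in the set when 1 - p_k <= threshold_k."""
    return (1.0 - probs) <= thresholds[None, :]

# Synthetic 3-class stand-in for per-residue H/E/C probabilities.
rng = np.random.default_rng(0)

def synthetic_probs(n, n_classes=3):
    labels = rng.integers(0, n_classes, size=n)
    logits = rng.normal(size=(n, n_classes))
    logits[np.arange(n), labels] += 1.5  # make the true class more likely
    exp = np.exp(logits - logits.max(axis=1, keepdims=True))
    return exp / exp.sum(axis=1, keepdims=True), labels

cal_probs, cal_labels = synthetic_probs(5000)
test_probs, test_labels = synthetic_probs(5000)

for class_conditional in (False, True):
    thresholds = calibrate(cal_probs, cal_labels, alpha=0.10,
                           class_conditional=class_conditional)
    sets = prediction_sets(test_probs, thresholds)
    coverage = sets[np.arange(len(test_labels)), test_labels].mean()
    avg_size = sets.sum(axis=1).mean()
    print(f"class_conditional={class_conditional}: "
          f"coverage={coverage:.3f}, avg set size={avg_size:.2f}")
```

With exchangeable calibration and test data, the marginal variant targets coverage of at least 1 - α on average over all residues, while the class-conditional variant targets it within each true class. This is why per-class coverage can be reported separately.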
## How to Run
```bash
pip install transformers datasets torch scikit-learn
python conformalesm_full.py
```
No GPU required. Runs in ~5 minutes on CPU.
## Dataset
- `lamm-mit/protein_secondary_structure_from_PDB` (125K sequences)
- 500 calibration / 500 test proteins
- 382,675 total residues evaluated
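
The calibration split is also where a temperature would be fit before ECE is measured on held-out residues. The sketch below shows temperature scaling and a binned ECE estimate under stated assumptions; it is not the script's actual code. It assumes a grid search over a single scalar temperature minimising negative log-likelihood, and 15 equal-width confidence bins. The function names are hypothetical and the logits are synthetic, deliberately made overconfident.

```python
import numpy as np

def softmax(logits, temperature=1.0):
    z = logits / temperature
    z = z - z.max(axis=1, keepdims=True)  # subtract the row max for stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def nll(logits, labels, temperature):
    """Mean negative log-likelihood of the true labels at a given temperature."""
    p = softmax(logits, temperature)
    return float(-np.mean(np.log(p[np.arange(len(labels)), labels] + 1e-12)))

def fit_temperature(logits, labels, grid=None):
    """Pick the temperature on a grid that minimises calibration-set NLL."""
    if grid is None:
        grid = np.linspace(0.25, 5.0, 96)  # step of 0.05 (assumed search range)
    losses = [nll(logits, labels, t) for t in grid]
    return float(grid[int(np.argmin(losses))])

def expected_calibration_error(probs, labels, n_bins=15):
    """Bin-weighted mean of |accuracy - confidence| over equal-width bins."""
    conf = probs.max(axis=1)
    correct = (probs.argmax(axis=1) == labels).astype(float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (conf > lo) & (conf <= hi)
        if in_bin.any():
            ece += in_bin.mean() * abs(correct[in_bin].mean() - conf[in_bin].mean())
    return float(ece)

# Synthetic logits: labels are drawn from softmax(true_logits), then the logits
# are scaled up so the "model" is overconfident.
rng = np.random.default_rng(0)

def synthetic_logits(n, n_classes=3, overconfidence=3.0):
    true_logits = rng.normal(size=(n, n_classes))
    p = softmax(true_logits)
    u = rng.random(n)
    labels = np.minimum((p.cumsum(axis=1) < u[:, None]).sum(axis=1), n_classes - 1)
    return overconfidence * true_logits, labels

cal_logits, cal_labels = synthetic_logits(5000)
test_logits, test_labels = synthetic_logits(5000)

T = fit_temperature(cal_logits, cal_labels)
ece_before = expected_calibration_error(softmax(test_logits), test_labels)
ece_after = expected_calibration_error(softmax(test_logits, T), test_labels)
print(f"T={T:.2f}, ECE before={ece_before:.3f}, ECE after={ece_after:.3f}")
```

Dividing logits by a positive scalar never changes the argmax, which is consistent with the accuracy column staying at 61.3% while ECE drops.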