--- language: en license: mit library_name: transformers datasets: - lamm-mit/protein_secondary_structure_from_PDB models: - AmelieSchreiber/esm2_t6_8M_UR50D-finetuned-secondary-structure tags: - conformal-prediction - protein-language-models - uncertainty-quantification - esm-2 - temperature-scaling - cpu - protein-structure - protein-engineering --- # ConformalESM: Distribution-Free Uncertainty Quantification for Protein Language Models **Cites:** Lin et al. 2022, "Evolutionary Scale Prediction of Atomic Level Protein Structure with a Language Model", Science. ## Novelty This is the **first work** to apply conformal prediction and temperature scaling to protein language models (PLMs). While 10+ papers apply these techniques to general LLMs (2023-2024), zero have ported them to the protein domain. ## Key Results | Method | Accuracy | ECE | Avg Set Size (α=0.10) | Coverage | |--------|----------|-----|----------------------|----------| | Baseline ESM-2 | 61.3% | 0.147 | — | — | | + Temperature Scaling | 61.3% | **0.058** (-61%) | — | — | | + Conformal Prediction | — | — | 1.79 | 89.9% | | + Class-Conditional Conformal | — | — | **1.63** | 89.9% | **Per-class** (α=0.10, class-conditional): - **Coil (C)**: coverage=90.8%, avg set=1.15 (most confident) - **Helix (H)**: coverage=88.6%, avg set=1.99 - **Sheet (E)**: coverage=90.1%, avg set=1.94 ## How to Run ```bash pip install transformers datasets torch scikit-learn python conformalesm_full.py ``` No GPU required. Runs in ~5 minutes on CPU. ## Dataset - `lamm-mit/protein_secondary_structure_from_PDB` (125K sequences) - 500 calibration / 500 test proteins - 382,675 total residues evaluated