update readme

Files changed (4)

README.md +272 -97
download_light.py +2 -1
fit_mapie_adaptive.py +333 -0
inference.py +5 -0

README.md CHANGED

@@ -8,7 +8,7 @@ license: apache-2.0
 This is the repository for [PeptiVerse: A Unified Platform for Therapeutic Peptide Property Prediction](https://www.biorxiv.org/content/10.64898/2025.12.31.697180), a collection of machine learning predictors for canonical and non-canonical peptide property prediction using sequence and SMILES representations. 🧬 PeptiVerse 🌌 enables evaluation of key biophysical and therapeutic properties of peptides for property-optimized generation.
-## Table of Contents
 - [Quick start](#quick-start)
 - [Installation](#installation)
@@ -20,13 +20,14 @@ This is the repository for [PeptiVerse: A Unified Platform for Therapeutic Pepti
 - [Usage](#usage)
    - [Local Application Hosting](#local-application-hosting)
    - [Dataset integration](#dataset-integration)
    - [Quick inference by property per model](#quick-inference-by-property-per-model)
 - [Property Interpretations](#property-interpretations)
 - [Model Architecture](#model-architecture)
 - [Troubleshooting](#troubleshooting)
 - [Citation](#citation)
-## Quick Start
 - Light-weighted start (basic models, no cuML, read below for details)
 ```bash
 # Ignore all LFS files, you will see an empty folder first
@@ -69,7 +70,7 @@ pip install -r requirements.txt
 # Run inference
 python inference.py
 ```
-## Installation
 ### Minimal Setup
 - Easy start-up environment (using transformers, xgboost models)
 ```bash
@@ -85,7 +86,7 @@ pip install -r requirements.txt
     # run inference (see below)
     apptainer exec peptiverse.sif python inference.py
     ```
-## Repository Structure
 This repo contains important large files for [PeptiVerse](https://huggingface.co/spaces/ChatterjeeLab/PeptiVerse), an interactive app for peptide property prediction. [Paper link.](https://www.biorxiv.org/content/10.64898/2025.12.31.697180v1)
 ```
@@ -105,8 +106,9 @@ PeptiVerse/
 ├── best_models.txt            # Model selection manifest
 └── requirements.txt           # Python dependencies
 ```
-## Training Data Collection
 <table>
   <caption><strong>Data distribution.</strong> Classification tasks report counts for class 0/1; regression tasks report total sample size (N).</caption>
@@ -145,15 +147,15 @@ PeptiVerse/
       <td>Solubility</td>
       <td>9668</td>
       <td>8785</td>
-      <td>-</td>
-      <td>-</td>
     </tr>
     <tr>
       <td>Permeability (Penetrance)</td>
       <td>1162</td>
       <td>1162</td>
-      <td>-</td>
-      <td>-</td>
     </tr>
     <tr>
       <td>Toxicity</td>
@@ -189,39 +191,39 @@ PeptiVerse/
 </table>
-## Best Model List
 ### Full model set (cuML-enabled)
-| Property                    | Best Model (Sequence) | Best Model (SMILES) | Task Type   | Threshold (Sequence) | Threshold (SMILES) |
-|----------------------------|-----------------|---------------------|-------------|----------------|--------------------|
-| Hemolysis                  | SVM             | Transformer         | Classifier  | 0.2521         | 0.4343             |
-| Non-Fouling                | MLP             | ENET                | Classifier  | 0.57           | 0.6969             |
-| Solubility                 | CNN             | –                   | Classifier  | 0.377          | –                  |
-| Permeability (Penetrance)  | SVM             | –                   | Classifier  | 0.5493         | –                  |
-| Toxicity                   | –               | Transformer         | Classifier  | –              | 0.3401             |
-| Binding Affinity           | unpooled        | unpooled            | Regression  | –              | –                  |
-| Permeability (PAMPA)       | –               | CNN                 | Regression  | –              | –                  |
-| Permeability (Caco-2)      | –               | SVR                 | Regression  | –              | –                  |
-| Half-life                  | Transformer     | XGB                 | Regression  | –              | –                  |
 >Note: *unpooled* indicates models operating on token-level embeddings with cross-attention, rather than mean-pooled representations.
 ### Minimal deployable model set (no cuML)
-| Property                    | Best Model (WT) | Best Model (SMILES) | Task Type   | Threshold (WT) | Threshold (SMILES) |
-|----------------------------|-----------------|---------------------|-------------|----------------|--------------------|
-| Hemolysis                  | XGB             | Transformer         | Classifier  | 0.2801         | 0.4343             |
-| Non-Fouling                | MLP             | XGB                 | Classifier  | 0.57           | 0.3982             |
-| Solubility                 | CNN             | –                   | Classifier  | 0.377          | –                  |
-| Permeability (Penetrance)  | XGB             | –                   | Classifier  | 0.4301         | –                  |
-| Toxicity                   | –               | Transformer         | Classifier  | –              | 0.3401             |
-| Binding Affinity           | unpooled        | unpooled            | Regression  | –              | –                  |
-| Permeability (PAMPA)       | –               | CNN                 | Regression  | –              | –                  |
-| Permeability (Caco-2)      | –               | SVR                 | Regression  | –              | –                  |
-| Half-life                  | xgb_wt_log      | xgb_smiles          | Regression  | –              | –                  |
->Note: Models marked as SVM or ENET are replaced with XGB as these models are not currently supported in the deployment environment without cuML setups. *xgb_wt_log* indicated log-scaled transformation of time during training.
-## Usage
 ### Local Application Hosting
 - Host the [PeptiVerse UI](https://huggingface.co/spaces/ChatterjeeLab/PeptiVerse) locally with your own resources.
@@ -231,6 +233,9 @@ PeptiVerse/
 git clone https://huggingface.co/spaces/ChatterjeeLab/PeptiVerse
 python app.py
 ```
 ### Dataset integration
 - All properties are provided with raw_data/split_ready_csvs/[huggingface_datasets](https://huggingface.co/docs/datasets/en/index).
 - Selectively download the data you need with `huggingface-cli`
@@ -266,41 +271,138 @@ print("Downloaded to:", local_dir)
     - Pooled (fixed-length vector per sequence)
         - Generated by mean-pooling token embeddings excluding special tokens (CLS/EOS) and padding.
         - Each item:
-            sequence: `str`;
-            label: `int` (classification) or `float` (regression);
-            embedding: `float32[H]` (H=1280 for ESM-2 650M);
     - Unpooled (variable-length token matrix)
         - Generated by keeping all valid token embeddings (excluding special tokens + padding) as a per-sequence matrix.
         - Each item:
-            sequence: `str`;
-            label: `int` (classification) or `float` (regression);
-            embedding: `float16[L, H]` (nested lists);
-            attention_mask: `int8[L]`;
-            length: `int` (=L);
 - B) SMILES-based ([PeptideCLM](https://github.com/AaronFeller/PeptideCLM) embeddings)
     - Pooled (fixed-length vector per sequence)
         - Generated by mean-pooling token embeddings excluding special tokens (CLS/EOS) and padding.
         - Each item:
-            sequence: `str` (SMILES);
-            label: `int` (classification) or `float` (regression);
-            embedding: `float32[H]`;
     - Unpooled (variable-length token matrix)
         - Generated by keeping all valid token embeddings (excluding special tokens + padding) as a per-sequence matrix.
         - Each item:
-            sequence: `str` (SMILES);
-            label: `int` (classification) or `float` (regression);
-            embedding: `float16[L, H]` (nested lists);
-            attention_mask: `int8[L]`;
-            length: `int` (=L);
-### Quick Inference By Property Per Model
 ```python
 from inference import PeptiVersePredictor
-pred = PeptiVersePredictor(
-    manifest_path="best_models.txt",          # best model list
-    classifier_weight_root=".",               # repo root (where training_classifiers/ lives)
     device="cuda",                            # or "cpu"
 )
@@ -383,78 +485,150 @@ print(out)
 ```
-## Interpretation
-You can also find the same description in the paper or in the PeptiVerse app `Documentation` tab.
 ---
-#### 🩸 Hemolysis Prediction
-50% of read blood cells being lysed at x ug/ml concetration (HC50). If HC50 < 100uM, considered as hemolytic, otherwise non-hemolytic, resulting in a binary 0/1 dataset. The predicted probability should therefore be interpreted as a risk indicator, not an exact concentration estimate. <br>
-**Output interpretation:**<br>
-- Score close to 1.0 = high probability of red blood cell membrane disruption<br>
 - Score close to 0.0 = non-hemolytic
 ---
-#### 💧 Solubility Prediction
-Outputs a probability (0–1) that a peptide remains soluble in aqueous conditions.<br>
-**Output interpretation:**<br>
-- Score close to 1.0 = highly soluble<br>
-- Score close to 0.0 = poorly soluble<br>
 ---
-#### 👯 Non-Fouling Prediction
-Higher scores indicate stronger non-fouling behavior, desirable for circulation and surface-exposed applications.<br>
-**Output interpretation:**<br>
-- Score close to 1.0 = non-fouling<br>
-- Score close to 0.0 = fouling<br>
 ---
-#### 🪣 Permeability Prediction
-Predicts membrane permeability on a log P scale.<br>
-**Output interpretation:**<br>
-- Higher values = more permeable (>-6.0)<br>
-- For penetrance predictions, it is a classification prediction, so within the [0, 1] range, closer to 1 indicates more permeable.<br>
 ---
-#### ⏱️ Half-Life Prediction
 **Interpretation:** Predicted values reflect relative peptide stability for the unit in hours. Higher scores indicate longer persistence in serum, while lower scores suggest faster degradation.
 ---
-#### ☠️ Toxicity Prediction
 **Interpretation:** Outputs a probability (0–1) that a peptide exhibits toxic effects. Higher scores indicate increased toxicity risk.
 ---
-#### 🔗 Binding Affinity Prediction
-Predicts peptide-protein binding affinity. Requires both peptide and target protein sequence.<br>
 **Interpretation:**<br>
     - Scores ≥ 9 correspond to tight binders (K ≤ 10⁻⁹ M, nanomolar to picomolar range)<br>
     - Scores between 7 and 9 correspond to medium binders (10⁻⁷–10⁻⁹ M, nanomolar to micromolar range)<br>
     - Scores < 7 correspond to weak binders (K ≥ 10⁻⁶ M, micromolar and weaker)<br>
     - A difference of 1 unit in score corresponds to an approximately tenfold change in binding affinity.<br>
-## Model Architecture
-- **Sequence Embeddings:** [ESM-2 650M model](https://huggingface.co/facebook/esm2_t33_650M_UR50D) / [PeptideCLM model](https://huggingface.co/aaronfeller/PeptideCLM-23M-all). Foundational embeddings are frozen.
 - **XGBoost Model:** Gradient boosting on pooled embedding features for efficient, high-performance prediction.
 - **CNN/Transformer Model:** One-dimensional convolutional/self-attention transformer networks operating on unpooled embeddings to capture local sequence patterns.
 - **Binding Model:** Transformer-based architecture with cross-attention between protein and peptide representations.
 - **SVR Model:** Support Vector Regression applied to pooled embeddings, providing a kernel-based, nonparametric regression baseline that is robust on smaller or noisy datasets.
 - **Others:** SVM and Elastic Nets were trained with [RAPIDS cuML](https://github.com/rapidsai/cuml), which requires a CUDA environment and is therefore not supported in the web app. Model checkpoints remain available in the Hugging Face repository.
-## Troubleshooting
 ### LFS Download Issues
@@ -466,21 +640,22 @@ huggingface-cli download ChatterjeeLab/PeptiVerse \
     --local-dir . \
     --local-dir-use-symlinks False
 ```
-### Trouble installing cuML
-For error related to cuda library, reinstall the `torch` after installing `cuML`.
-## Citation
 If you find this repository helpful for your publications, please consider citing our paper:
 ```
-@article {zhang2025peptiverse,
 	author = {Zhang, Yinuo and Tang, Sophia and Chen, Tong and Mahood, Elizabeth and Vincoff, Sophia and Chatterjee, Pranam},
 	title = {PeptiVerse: A Unified Platform for Therapeutic Peptide Property Prediction},
 	year = {2026},
 	doi = {10.64898/2025.12.31.697180},
-	URL = {https://doi.org/10.64898/2025.12.31.697180},
 	journal = {bioRxiv}
 }
 ```
-To use this repository, you agree to abide by the Apache 2.0 license.

 This is the repository for [PeptiVerse: A Unified Platform for Therapeutic Peptide Property Prediction](https://www.biorxiv.org/content/10.64898/2025.12.31.697180), a collection of machine learning predictors for canonical and non-canonical peptide property prediction using sequence and SMILES representations. 🧬 PeptiVerse 🌌 enables evaluation of key biophysical and therapeutic properties of peptides for property-optimized generation.
+## Table of Contents 🌟
 - [Quick start](#quick-start)
 - [Installation](#installation)
 - [Usage](#usage)
    - [Local Application Hosting](#local-application-hosting)
    - [Dataset integration](#dataset-integration)
+   - [Training](#training)
    - [Quick inference by property per model](#quick-inference-by-property-per-model)
 - [Property Interpretations](#property-interpretations)
 - [Model Architecture](#model-architecture)
 - [Troubleshooting](#troubleshooting)
 - [Citation](#citation)
+## Quick Start 🌟
 - Light-weighted start (basic models, no cuML, read below for details)
 ```bash
 # Ignore all LFS files, you will see an empty folder first
 # Run inference
 python inference.py
 ```
+## Installation 🌟
 ### Minimal Setup
 - Easy start-up environment (using transformers, xgboost models)
 ```bash
     # run inference (see below)
     apptainer exec peptiverse.sif python inference.py
     ```
+## Repository Structure 🌟
 This repo contains important large files for [PeptiVerse](https://huggingface.co/spaces/ChatterjeeLab/PeptiVerse), an interactive app for peptide property prediction. [Paper link.](https://www.biorxiv.org/content/10.64898/2025.12.31.697180v1)
 ```
 ├── best_models.txt            # Model selection manifest
 └── requirements.txt           # Python dependencies
 ```
+For full data access, please download the corresponding `training_data_cleaned` and `training_classifiers` folders from Zenodo. The current Hugging Face repo hosts only the best model weights and metadata with split labels.
+## Training Data Collection 🌟
 <table>
   <caption><strong>Data distribution.</strong> Classification tasks report counts for class 0/1; regression tasks report total sample size (N).</caption>
       <td>Solubility</td>
       <td>9668</td>
       <td>8785</td>
+      <td>9668</td>
+      <td>8785</td>
     </tr>
     <tr>
       <td>Permeability (Penetrance)</td>
       <td>1162</td>
       <td>1162</td>
+      <td>1162</td>
+      <td>1162</td>
     </tr>
     <tr>
       <td>Toxicity</td>
 </table>
+## Best Model List 🌟
 ### Full model set (cuML-enabled)
+| Property | Best Model (Sequence) | Best Model (SMILES) | Task Type | Threshold (Sequence) | Threshold (SMILES) |
+|---|---|---|---|---|---|
+| Hemolysis | SVM | CNN (chemberta) | Classifier | 0.2521 | 0.564 |
+| Non-Fouling | Transformer | ENET (peptideclm) | Classifier | 0.57 | 0.6969 |
+| Solubility | CNN | – | Classifier | 0.377 | – |
+| Permeability (Penetrance) | SVM | SVM (chemberta) | Classifier | 0.5493 | 0.573 |
+| Toxicity | – | CNN (chemberta) | Classifier | – | 0.49 |
+| Binding Affinity | unpooled | unpooled | Regression | – | – |
+| Permeability (PAMPA) | – | CNN (chemberta) | Regression | – | – |
+| Permeability (Caco-2) | – | SVR (chemberta) | Regression | – | – |
+| Half-life | Transformer | XGB (peptideclm) | Regression | – | – |
 >Note: *unpooled* indicates models operating on token-level embeddings with cross-attention, rather than mean-pooled representations.
 ### Minimal deployable model set (no cuML)
+| Property | Best Model (WT) | Best Model (SMILES) | Task Type | Threshold (WT) | Threshold (SMILES) |
+|---|---|---|---|---|---|
+| Hemolysis | XGB | CNN (chemberta) | Classifier | 0.2801 | 0.564 |
+| Non-Fouling | Transformer | XGB (peptideclm) | Classifier | 0.57 | 0.3892 |
+| Solubility | CNN | – | Classifier | 0.377 | – |
+| Permeability (Penetrance) | XGB | XGB (chemberta) | Classifier | 0.4301 | 0.5028 |
+| Toxicity | – | CNN (chemberta) | Classifier | – | 0.49 |
+| Binding Affinity | wt_wt_pooled | chemberta_smiles_pooled | Regression | – | – |
+| Permeability (PAMPA) | – | CNN (chemberta) | Regression | – | – |
+| Permeability (Caco-2) | – | SVR (chemberta) | Regression | – | – |
+| Half-life | Transformer | XGB (peptideclm) | Regression | – | – |
+>Note: Models marked as SVM or ENET are replaced with XGB as these models are not currently supported in the deployment environment without cuML setups.
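As a minimal sketch of how the classifier thresholds in the tables above might be applied, the snippet below turns a score into a 0/1 label. The `THRESHOLDS` dictionary copies three values from the full model set table. The helper name `to_label` and the `>=` comparison are assumptions for illustration, not confirmed behavior of `inference.py`.

```python
# Thresholds copied from the full model set table (illustrative subset).
THRESHOLDS = {
    ("hemolysis", "wt"): 0.2521,
    ("nf", "wt"): 0.57,
    ("nf", "smiles"): 0.6969,
}


def to_label(prop: str, col: str, score: float) -> int:
    """Return 1 if the score reaches the property threshold, else 0 (assumed rule)."""
    return int(score >= THRESHOLDS[(prop, col)])


# A non-fouling SMILES score far below its 0.6969 threshold maps to label 0.
print(to_label("nf", "smiles", 0.0255))
```

Because each property and representation has its own threshold, the same raw score can map to different labels across models, so scores are best compared within a single model rather than across models.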
+## Usage 🌟
 ### Local Application Hosting
 - Host the [PeptiVerse UI](https://huggingface.co/spaces/ChatterjeeLab/PeptiVerse) locally with your own resources.
 git clone https://huggingface.co/spaces/ChatterjeeLab/PeptiVerse
 python app.py
 ```
+### Data pre-processing
+Under the `training_data_cleaned` folder, we provide the generated embeddings in Hugging Face dataset format. The following scripts are the steps used to generate the data.
 ### Dataset integration
 - All properties are provided with raw_data/split_ready_csvs/[huggingface_datasets](https://huggingface.co/docs/datasets/en/index).
 - Selectively download the data you need with `huggingface-cli`
     - Pooled (fixed-length vector per sequence)
         - Generated by mean-pooling token embeddings excluding special tokens (CLS/EOS) and padding.
         - Each item:
+            sequence: `str`
+            label: `int` (classification) or `float` (regression)
+            embedding: `float32[H]` (H=1280 for ESM-2 650M)
     - Unpooled (variable-length token matrix)
         - Generated by keeping all valid token embeddings (excluding special tokens + padding) as a per-sequence matrix.
         - Each item:
+            sequence: `str`
+            label: `int` (classification) or `float` (regression)
+            embedding: `float16[L, H]` (nested lists)
+            attention_mask: `int8[L]`
+            length: `int` (=L)
 - B) SMILES-based ([PeptideCLM](https://github.com/AaronFeller/PeptideCLM) embeddings)
     - Pooled (fixed-length vector per sequence)
         - Generated by mean-pooling token embeddings excluding special tokens (CLS/EOS) and padding.
         - Each item:
+            sequence: `str` (SMILES)
+            label: `int` (classification) or `float` (regression)
+            embedding: `float32[H]`
     - Unpooled (variable-length token matrix)
         - Generated by keeping all valid token embeddings (excluding special tokens + padding) as a per-sequence matrix.
         - Each item:
+            sequence: `str` (SMILES)
+            label: `int` (classification) or `float` (regression)
+            embedding: `float16[L, H]` (nested lists)
+            attention_mask: `int8[L]`
+            length: `int` (=L)
+- C) SMILES-based ([ChemBERTa](https://huggingface.co/DeepChem/ChemBERTa-77M-MLM) embeddings)
+    - Pooled (fixed-length vector per sequence)
+        - Generated by mean-pooling token embeddings excluding special tokens (CLS/EOS) and padding.
+        - Each item:
+            sequence: `str` (SMILES)
+            label: `int` (classification) or `float` (regression)
+            embedding: `float32[H]`
+    - Unpooled (variable-length token matrix)
+        - Generated by keeping all valid token embeddings (excluding special tokens + padding) as a per-sequence matrix.
+        - Each item:
+            sequence: `str` (SMILES)
+            label: `int` (classification) or `float` (regression)
+            embedding: `float16[L, H]` (nested lists)
+            attention_mask: `int8[L]`
+            length: `int` (=L)
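The pooled format described above can be illustrated with a small NumPy sketch of masked mean-pooling. This is an assumption-laden example, not the repository's preprocessing code: `mean_pool` and `valid_mask` are hypothetical names, and the mask is assumed to be 1 for real tokens and 0 for special tokens (CLS/EOS) and padding.

```python
import numpy as np


def mean_pool(token_emb: np.ndarray, valid_mask: np.ndarray) -> np.ndarray:
    """Average token embeddings over valid positions only.

    token_emb:  [L, H] per-token embeddings.
    valid_mask: [L], 1 for real tokens, 0 for special tokens and padding.
    """
    mask = valid_mask.astype(np.float32)[:, None]  # [L, 1] for broadcasting
    denom = max(float(mask.sum()), 1.0)            # avoid division by zero
    return (token_emb.astype(np.float32) * mask).sum(axis=0) / denom


# Toy input: CLS, two real tokens, EOS, one padding position (H = 2).
emb = np.array([[9, 9], [1, 2], [3, 4], [9, 9], [0, 0]], dtype=np.float32)
mask = np.array([0, 1, 1, 0, 0])
print(mean_pool(emb, mask))  # mean of the two real token rows
```

The unpooled format skips this reduction and keeps the `[L, H]` matrix together with its `attention_mask`, which is what the CNN and Transformer heads consume.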
+### Training
+Under the `training_classifiers` folder, we provide the Python scripts used to train the different models:
+1. Each script reads the pre-processed Hugging Face dataset from the `training_data_cleaned` folder.
+2. Each script runs an Optuna hyperparameter sweep when called.
+3. All training was conducted on an HPC cluster using the SLURM scripts under the `training_classifiers/src` folder.
+4. Customize or isolate individual model training scripts as needed.
+##### Example of training
+###### ML models
+```
+HOME_LOC=/home
+SCRIPT_LOC=$HOME_LOC/PeptiVerse/training_classifiers
+EMB_LOC=$HOME_LOC/PeptiVerse/training_data_cleaned
+OBJECTIVE='hemolysis' # nf/solubility/hemolysis/permeability_pampa/permeability_caco2
+WT='smiles' # wt/smiles
+DATA_FILE="hemo_${WT}_with_embeddings"
+LOG_LOC=$SCRIPT_LOC/src/logs
+DATE=$(date +%m_%d)
+MODEL_TYPE='svm_gpu' # xgb/enet_gpu/svm_gpu
+SPECIAL_PREFIX="${MODEL_TYPE}-${OBJECTIVE}-${WT}_new"
+# Create log directory if it doesn't exist
+mkdir -p $LOG_LOC
+cd $SCRIPT_LOC
+python -u train_ml.py \
+  --dataset_path "${EMB_LOC}/${OBJECTIVE}/${DATA_FILE}" \
+  --out_dir "${SCRIPT_LOC}/${OBJECTIVE}/${MODEL_TYPE}_${WT}" \
+  --model "${MODEL_TYPE}" \
+  --n_trials 200  > "${LOG_LOC}/${DATE}_${SPECIAL_PREFIX}.log" 2>&1
+```
+###### DNN models
+```
+HOME_LOC=/home
+SCRIPT_LOC=$HOME_LOC/PeptiVerse/training_classifiers
+EMB_LOC=$HOME_LOC/PeptiVerse/training_data_cleaned
+OBJECTIVE='nf' # nf/solubility/hemolysis
+WT='smiles' #wt/smiles
+DATA_FILE="nf_${WT}_with_embeddings_unpooled"
+LOG_LOC=$SCRIPT_LOC/src/logs
+DATE=$(date +%m_%d)
+MODEL_TYPE='cnn' #mlp/cnn/transformer
+SPECIAL_PREFIX="${MODEL_TYPE}-${OBJECTIVE}-${WT}"
+# Create log directory if it doesn't exist
+mkdir -p $LOG_LOC
+cd $SCRIPT_LOC
+python -u train_nn.py \
+  --dataset_path "${EMB_LOC}/${OBJECTIVE}/${DATA_FILE}" \
+  --out_dir "${SCRIPT_LOC}/${OBJECTIVE}/${MODEL_TYPE}_${WT}" \
+  --model "${MODEL_TYPE}" \
+  --n_trials 200  > "${LOG_LOC}/${DATE}_${SPECIAL_PREFIX}.log" 2>&1
+```
+###### Binding Affinity
+```
+HOME_LOC=/home
+SCRIPT_LOC=$HOME_LOC/PeptiVerse/training_classifiers
+EMB_LOC=$HOME_LOC/PeptiVerse/training_data_cleaned
+OBJECTIVE='binding_affinity'
+BINDER_MODEL='chemberta'   # peptideclm / chemberta
+STATUS='unpooled'             # pooled / unpooled
+TYPE='smiles'
+DATA_FILE="pair_wt_${TYPE}_${STATUS}"
+LOG_LOC=$SCRIPT_LOC/src/logs
+DATE=$(date +%m_%d)
+SPECIAL_PREFIX="${OBJECTIVE}-${BINDER_MODEL}-${STATUS}"
+python -u binding_training.py \
+  --dataset_path "${EMB_LOC}/${OBJECTIVE}/${BINDER_MODEL}/${DATA_FILE}" \
+  --mode "${STATUS}" \
+  --out_dir "${SCRIPT_LOC}/${OBJECTIVE}/${BINDER_MODEL}_${TYPE}_${STATUS}" \
+  --n_trials 200 > "${LOG_LOC}/${DATE}_${SPECIAL_PREFIX}.log" 2>&1
+```
+### Quick inference by property per model
 ```python
 from inference import PeptiVersePredictor
+from pathlib import Path
+root = Path(__file__).resolve().parent  # current script folder
+predictor = PeptiVersePredictor(
+    manifest_path=root / "best_models.txt",
+    classifier_weight_root=root,
     device="cuda",                            # or "cpu"
 )
 ```
+#### Advanced inference with uncertainty prediction
+The uncertainty prediction is added as a parameter in the inference code. The full classifier folder from [zenodo]() is required to enable this functionality. The model uncertainty is reported via all the scripts listed under the `training_classifiers` folder starting with "**refit**". Detailed description can be found in the methodology part of the manuscript.
+At inference time, PeptiVersePredictor returns an `uncertainty` field with every prediction when `uncertainty=True` is passed. The method and interpretation depend on the model class, determined automatically at inference time.
+```python
+seq = "GIGAVLKVLTTGLPALISWIKRKRQQ"
+smiles = "C(C)C[C@@H]1NC(=O)[C@@H]2CCCN2C(=O)[C@@H](CC(C)C)NC(=O)[C@@H](CC(C)C)N(C)C(=O)[C@H](C)NC(=O)[C@H](Cc2ccccc2)NC1=O"
+print(predictor.predict_property("nf",    "wt",     seq, uncertainty=True))
+print(predictor.predict_property("nf",    "smiles",     smiles, uncertainty=True))
+{'property': 'nf', 'col': 'wt', 'score': 0.00014520535252195523, 'emb_tag': 'wt', 'label': 0, 'threshold': 0.57, 'uncertainty': 0.0017192508727321288, 'uncertainty_type': 'ensemble_predictive_entropy'}
+{'property': 'nf', 'col': 'smiles', 'score': 0.025485480204224586, 'emb_tag': 'peptideclm', 'label': 0, 'threshold': 0.6969, 'uncertainty': 0.11868063130587676, 'uncertainty_type': 'binary_predictive_entropy_single_model'}
+```
+---
+##### Method by Model Class
+| Model Class | Task | Uncertainty Method | Output Type | Range |
+|---|---|---|---|---|
+| MLP, CNN, Transformer | Classifier | Deep ensemble predictive entropy (5 seeds) | `float` | [0, ln(2) ≈ 0.693] |
+| MLP, CNN, Transformer | Regression | Adaptive conformal interval; falls back to ensemble std if no MAPIE bundle | `(lo, hi)` or `float` | unbounded |
+| SVM / SVC / XGBoost | Classifier | Binary predictive entropy (sigmoid of decision function) | `float` | [0, ln(2) ≈ 0.693] |
+| SVR / ElasticNet / XGBoost | Regression | Adaptive conformal interval | `(lo, hi)` | unbounded |
+> **Uncertainty is `None`** when: a DNN classifier has no seed ensemble trained, or a regression model has no `mapie_calibration.joblib` in its model directory.
 ---
+## Interpretation 🌟
+You can also find the same description in the paper or in the PeptiVerse app `Documentation` tab.
+---
+### 🩸 Hemolysis Prediction
+HC50 is the concentration (x µg/mL) at which 50% of red blood cells are lysed. Peptides with HC50 < 100 µM are labeled hemolytic and all others non-hemolytic, resulting in a binary 0/1 dataset. The predicted probability should therefore be interpreted as a risk indicator, not an exact concentration estimate.
+**Output interpretation:**
+- Score close to 1.0 = high probability of red blood cell membrane disruption
 - Score close to 0.0 = non-hemolytic
 ---
+### 💧 Solubility Prediction
+Outputs a probability (0–1) that a peptide remains soluble in aqueous conditions.
+**Output interpretation:**
+- Score close to 1.0 = highly soluble
+- Score close to 0.0 = poorly soluble
 ---
+### 👯 Non-Fouling Prediction
+Higher scores indicate stronger non-fouling behavior, desirable for circulation and surface-exposed applications.
+**Output interpretation:**
+- Score close to 1.0 = non-fouling
+- Score close to 0.0 = fouling
 ---
+### 🪣 Permeability Prediction
+Predicts membrane permeability on a log P scale.
+**Output interpretation:**
+- Higher values = more permeable (>-6.0)
+- For penetrance predictions, it is a classification prediction, so within the [0, 1] range, closer to 1 indicates more permeable.
 ---
+### ⏱️ Half-Life Prediction
 **Interpretation:** Predicted values reflect relative peptide stability for the unit in hours. Higher scores indicate longer persistence in serum, while lower scores suggest faster degradation.
 ---
+### ☠️ Toxicity Prediction
 **Interpretation:** Outputs a probability (0–1) that a peptide exhibits toxic effects. Higher scores indicate increased toxicity risk.
 ---
+### 🔗 Binding Affinity Prediction
+Predicts peptide-protein binding affinity. Requires both peptide and target protein sequence.
 **Interpretation:**<br>
     - Scores ≥ 9 correspond to tight binders (K ≤ 10⁻⁹ M, nanomolar to picomolar range)<br>
     - Scores between 7 and 9 correspond to medium binders (10⁻⁷–10⁻⁹ M, nanomolar to micromolar range)<br>
     - Scores < 7 correspond to weak binders (K ≥ 10⁻⁶ M, micromolar and weaker)<br>
     - A difference of 1 unit in score corresponds to an approximately tenfold change in binding affinity.<br>
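The score bands above read most naturally if the score is a negative base-10 logarithm of the binding constant in molar units. The sketch below works under that assumption, which the text implies but does not state outright. The function names are hypothetical, and treating a score of exactly 7 as "medium" is an illustrative choice.

```python
def score_to_molar(score: float) -> float:
    """Convert a score to a binding constant in M, assuming score = -log10(K)."""
    return 10.0 ** (-score)


def binder_class(score: float) -> str:
    """Map a score to the qualitative bands listed above."""
    if score >= 9:
        return "tight"
    if score >= 7:
        return "medium"
    return "weak"


for s in (9.5, 8.0, 6.0):
    print(s, binder_class(s), f"{score_to_molar(s):.1e} M")
```

Under this reading, moving from a score of 8 to 9 lowers the implied constant from about 10 nM to about 1 nM, which is the tenfold change per unit mentioned in the last bullet.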
+---
+### Uncertainty Interpretation
+#### Entropy (classifiers)
+Binary predictive entropy of the output probability $\bar{p}$:
+$$\mathcal{H} = -\bar{p}\log\bar{p} - (1 - \bar{p})\log(1 - \bar{p})$$
+- For **DNN classifiers**: $\bar{p}$ is the mean probability across 5 independently seeded models (deep ensemble). High entropy reflects both epistemic uncertainty (seed disagreement) and aleatoric uncertainty (collectively diffuse predictions).
+- For **XGBoost / SVM / ElasticNet classifiers**: $\bar{p}$ is the single model's output probability (or sigmoid of decision function for ElasticNet). Entropy reflects output confidence of a single model only.
+| Range | Interpretation |
+|---|---|
+| < 0.1 | High confidence |
+| 0.1 – 0.4 | Moderate uncertainty |
+| 0.4 – 0.6 | Low confidence |
+| > 0.6 | Very low confidence — model close to guessing |
+| ≈ 0.693 | Maximum uncertainty — predicted probability ≈ 0.5 |
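A minimal sketch of the entropy formula above, using natural logarithms so that the maximum is ln(2). The function names and the clamping constant `eps` are assumptions for illustration. `ensemble_entropy` follows the formula as written (entropy of the mean probability) and is not confirmed to match the ensemble code path in `inference.py`.

```python
import math


def binary_entropy(p: float, eps: float = 1e-12) -> float:
    """Binary predictive entropy in nats; peaks at ln(2) when p = 0.5."""
    p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
    return -p * math.log(p) - (1.0 - p) * math.log(1.0 - p)


def ensemble_entropy(seed_probs: list[float]) -> float:
    """Entropy of the mean probability across ensemble members."""
    p_bar = sum(seed_probs) / len(seed_probs)
    return binary_entropy(p_bar)


print(binary_entropy(0.5))                   # maximum uncertainty, ln(2)
print(binary_entropy(0.025485480204224586))  # single-model score from the example output
print(ensemble_entropy([0.02, 0.03, 0.01, 0.02, 0.04]))
```

Applied to the single-model score from the non-fouling SMILES example output (0.0255), `binary_entropy` gives approximately 0.1187, in line with the `uncertainty` value reported there.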
+---
+#### Adaptive Conformal Prediction Interval (regressors)
+Returned as a tuple `(lo, hi)` with 90% marginal coverage guarantee.
+We implement the **residual normalised conformity score** following [Lei et al. (2018)](https://doi.org/10.1080/01621459.2017.1307116) and [Cordier et al. (2023) / MAPIE](https://proceedings.mlr.press/v204/cordier23a.html). An auxiliary XGBoost model $\hat{\sigma}(\mathbf{x})$ is trained on held-out embeddings and absolute residuals $|y_i - \hat{y}_i|$. At inference:
+$$[\hat{y}(\mathbf{x}) - q \cdot \hat{\sigma}(\mathbf{x}),\ \hat{y}(\mathbf{x}) + q \cdot \hat{\sigma}(\mathbf{x})]$$
+where $q$ is the $\lceil(n+1)(1-\alpha)\rceil / n$ quantile of the normalized scores $s_i = |y_i - \hat{y}_i| / \hat{\sigma}(\mathbf{x}_i)$.
+- **Interval width varies per input** -- molecules more dissimilar to training data tend to receive wider intervals
+- **Coverage guarantee**: on exchangeable data, $P(y \in [\hat{y} - q\hat{\sigma},\ \hat{y} + q\hat{\sigma}]) \geq 0.90$
+- **The guarantee is marginal**, not conditional, as an unusually narrow interval on an out-of-distribution molecule does not guarantee correctness
+- **Full access**: MAPIE bundles are already computed for all regression models, so users can use them directly with customized model lists.
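The interval construction above can be sketched in plain Python under simplifying assumptions. The per-sample `sigma` values are passed in directly, whereas in the description they come from the auxiliary XGBoost model. The function names are hypothetical, and clipping the rank to `n` for very small calibration sets is an illustrative choice. This is not the code in `fit_mapie_adaptive.py`.

```python
import math


def conformal_quantile(y_true, y_pred, sigma, alpha=0.1):
    """Return the k-th smallest normalized score, k = ceil((n + 1) * (1 - alpha))."""
    scores = sorted(abs(y - p) / s for y, p, s in zip(y_true, y_pred, sigma))
    n = len(scores)
    k = min(n, math.ceil((n + 1) * (1 - alpha)))  # clipped to n (assumption)
    return scores[k - 1]


def conformal_interval(y_hat, sigma_hat, q):
    """Interval [y_hat - q * sigma_hat, y_hat + q * sigma_hat]."""
    half_width = q * sigma_hat
    return y_hat - half_width, y_hat + half_width


# Toy calibration set: absolute residuals 1..10 with sigma = 1 everywhere.
y_true = [float(i) for i in range(1, 11)]
y_pred = [0.0] * 10
sigma = [1.0] * 10

q = conformal_quantile(y_true, y_pred, sigma, alpha=0.1)
print(q, conformal_interval(5.0, 0.5, q))
```

Because `q` is fixed after calibration, all per-input variation in interval width comes from `sigma_hat`, which is why the quality of the auxiliary model determines how adaptive the intervals are.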
+---
+#### Generating a MAPIE Bundle for a New Model
+To enable conformal uncertainty for a newly trained regression model:
+```bash
+# Fit adaptive conformal bundle from val_predictions.csv
+python fit_mapie_adaptive.py --root training_classifiers --prop <property_name>
+```
+The script reads `sequence`/`smiles` and `y_pred`/`y_true` columns from the CSV, recomputes embeddings, fits the XGBoost $\hat{\sigma}$ model, and saves `mapie_calibration.joblib` into the model directory. The bundle is automatically detected and loaded by `PeptiVersePredictor` on next initialisation.
+## Model Architecture 🌟
+- **Sequence Embeddings:** [ESM-2 650M model](https://huggingface.co/facebook/esm2_t33_650M_UR50D) / [PeptideCLM model](https://huggingface.co/aaronfeller/PeptideCLM-23M-all) / [ChemBERTa](https://huggingface.co/DeepChem/ChemBERTa-77M-MLM). Foundational embeddings are frozen.
 - **XGBoost Model:** Gradient boosting on pooled embedding features for efficient, high-performance prediction.
 - **CNN/Transformer Model:** One-dimensional convolutional/self-attention transformer networks operating on unpooled embeddings to capture local sequence patterns.
 - **Binding Model:** Transformer-based architecture with cross-attention between protein and peptide representations.
 - **SVR Model:** Support Vector Regression applied to pooled embeddings, providing a kernel-based, nonparametric regression baseline that is robust on smaller or noisy datasets.
 - **Others:** SVM and Elastic Nets were trained with [RAPIDS cuML](https://github.com/rapidsai/cuml), which requires a CUDA environment and is therefore not supported in the web app. Model checkpoints remain available in the Hugging Face repository.
+## Troubleshooting 🌟
 ### LFS Download Issues
     --local-dir . \
     --local-dir-use-symlinks False
 ```
+## Citation 🌟
 If you find this repository helpful for your publications, please consider citing our paper:
 ```
+@article {Zhang2025.12.31.697180,
 	author = {Zhang, Yinuo and Tang, Sophia and Chen, Tong and Mahood, Elizabeth and Vincoff, Sophia and Chatterjee, Pranam},
 	title = {PeptiVerse: A Unified Platform for Therapeutic Peptide Property Prediction},
+	elocation-id = {2025.12.31.697180},
 	year = {2026},
 	doi = {10.64898/2025.12.31.697180},
+	publisher = {Cold Spring Harbor Laboratory},
+	URL = {https://www.biorxiv.org/content/early/2026/01/03/2025.12.31.697180},
+	eprint = {https://www.biorxiv.org/content/early/2026/01/03/2025.12.31.697180.full.pdf},
 	journal = {bioRxiv}
 }
 ```
+To use this repository, you agree to abide by the MIT License.

download_light.py CHANGED

@@ -15,8 +15,9 @@ from inference import (
 # -----------------------------
 # Config
 # -----------------------------
 MODEL_REPO = "ChatterjeeLab/PeptiVerse"
-DEFAULT_ASSETS_DIR = Path("./")   # where downloaded models live
 DEFAULT_MANIFEST = Path("./basic_models.txt")
 BANNED_MODELS = {"svm", "enet", "svm_gpu", "enet_gpu"}

 # -----------------------------
 # Config
 # -----------------------------
+root = Path(__file__).resolve().parent  # current script folder
 MODEL_REPO = "ChatterjeeLab/PeptiVerse"
+DEFAULT_ASSETS_DIR = Path(root)   # where downloaded models live
 DEFAULT_MANIFEST = Path("./basic_models.txt")
 BANNED_MODELS = {"svm", "enet", "svm_gpu", "enet_gpu"}

fit_mapie_adaptive.py ADDED

	@@ -0,0 +1,333 @@

+"""
+Bundle format:
+    {
+        "quantile":    q,
+        "sigma_model": xgb_booster,
+        "emb_tag":     "wt"|"peptideclm"|"chemberta",
+        "alpha":       0.1,
+        "adaptive":    True,
+    }
+Binding affinity bundles additionally store "target_emb_tag": "wt" since
+both binder and target embeddings are concatenated for the sigma model.
+"""
+import argparse
+import sys
+import numpy as np
+import pandas as pd
+import joblib
+import xgboost as xgb
+import torch
+from pathlib import Path
+from typing import Optional
+sys.path.insert(0, str(Path(__file__).parent))
+DEVICE      = torch.device("cuda" if torch.cuda.is_available() else "cpu")
+WEIGHT_ROOT = Path(__file__).parent
+# Properties to skip
+SKIP_PROPS = {"half_life", "halflife"}
+def should_skip(model_dir: Path) -> bool:
+    return any(part in SKIP_PROPS for part in model_dir.parts)
+# Embedding tag inference
+def infer_emb_tag(folder_name: str) -> Optional[str]:
+    n = folder_name.lower()
+    if "chemberta" in n:  return "chemberta"
+    if "peptideclm" in n: return "peptideclm"
+    if "smiles" in n:     return "peptideclm"
+    if "wt" in n:         return "wt"
+    return None
+def is_binding_affinity(model_dir: Path) -> bool:
+    return "binding_affinity" in model_dir.parts
+def infer_binding_emb_tags(folder_name: str):
+    """
+    Returns (binder_emb_tag, target_emb_tag) for binding affinity folders.
+    Folder convention: {target_emb}_{binder_emb}_{pooled|unpooled}
+    e.g. wt_wt_unpooled, chemberta_smiles_unpooled, peptideclm_smiles_unpooled
+    """
+    n = folder_name.lower()
+    # target is always ESM2 (wt)
+    target_emb = "wt"
+    # binder emb from folder name
+    if "chemberta" in n:    binder_emb = "chemberta"
+    elif "peptideclm" in n: binder_emb = "peptideclm"
+    elif "smiles" in n:     binder_emb = "peptideclm"  # same fallback as infer_emb_tag
+    else:                   binder_emb = "wt"
+    return binder_emb, target_emb
+SEQ_CANDIDATES  = ["sequence", "smiles", "seq", "peptide", "molecule"]
+PRED_CANDIDATES = ["y_prob", "y_pred", "pred_prob", "pred_score", "score", "pred", "prediction"]
+TRUE_CANDIDATES = ["y_true", "label", "true_label", "affinity", "y", "target"]
+def resolve_col(df, candidates, label):
+    cl = {c.lower(): c for c in df.columns}
+    for c in candidates:
+        if c.lower() in cl:
+            return cl[c.lower()]
+    raise ValueError(f"Cannot find {label} column. Available: {list(df.columns)}")
+_embedders = {}
+def get_embedder(emb_tag: str):
+    if emb_tag in _embedders:
+        return _embedders[emb_tag]
+    if emb_tag == "wt":
+        from inference_new import WTEmbedder
+        emb = WTEmbedder(DEVICE)
+    elif emb_tag == "peptideclm":
+        from inference_new import SMILESEmbedder
+        emb = SMILESEmbedder(
+            DEVICE,
+            vocab_path=str(WEIGHT_ROOT / "tokenizer/new_vocab.txt"),
+            splits_path=str(WEIGHT_ROOT / "tokenizer/new_splits.txt"),
+        )
+    elif emb_tag == "chemberta":
+        from inference_new import ChemBERTaEmbedder
+        emb = ChemBERTaEmbedder(DEVICE)
+    else:
+        raise ValueError(f"Unknown emb_tag: {emb_tag}")
+    _embedders[emb_tag] = emb
+    return emb
+@torch.no_grad()
+def embed_sequences(sequences: list, emb_tag: str) -> np.ndarray:
+    embedder = get_embedder(emb_tag)
+    vecs = []
+    for seq in sequences:
+        v = embedder.pooled(seq.strip())
+        vecs.append(v.cpu().float().numpy())
+    return np.vstack(vecs).astype(np.float32)
+# Sigma model, simple XGB
+def fit_sigma_model(X: np.ndarray, residuals: np.ndarray) -> xgb.Booster:
+    dtrain = xgb.DMatrix(X, label=residuals)
+    params = {
+        "objective":        "reg:squarederror",
+        "max_depth":        4,
+        "eta":              0.05,
+        "subsample":        0.8,
+        "colsample_bytree": 0.3,
+        "min_child_weight": 5,
+        "tree_method":      "hist",
+        "device":           "cuda" if torch.cuda.is_available() else "cpu",
+        "seed":             1986,
+    }
+    return xgb.train(params, dtrain, num_boost_round=200, verbose_eval=False)
+# Standard model dir fitting
+def fit_standard(model_dir: Path, alpha: float, dry_run: bool) -> str:
+    val_path = model_dir / "val_predictions.csv"
+    if not val_path.exists():
+        val_path = model_dir / "oof_predictions.csv"
+    if not val_path.exists():
+        return "skip (no val/oof CSV)"
+    emb_tag = infer_emb_tag(model_dir.name)
+    if emb_tag is None:
+        return "skip (cannot infer emb_tag)"
+    try:
+        df = pd.read_csv(val_path)
+        seq_col  = resolve_col(df, SEQ_CANDIDATES,  "sequence")
+        pred_col = resolve_col(df, PRED_CANDIDATES, "pred")
+        true_col = resolve_col(df, TRUE_CANDIDATES, "true")
+    except Exception as e:
+        return f"error: {e}"
+    sequences = df[seq_col].astype(str).tolist()
+    y_pred    = df[pred_col].values.astype(np.float64)
+    y_true    = df[true_col].values.astype(np.float64)
+    mask = np.isfinite(y_pred) & np.isfinite(y_true)
+    sequences = [s for s, m in zip(sequences, mask) if m]
+    y_pred, y_true = y_pred[mask], y_true[mask]
+    n = len(y_pred)
+    if n < 30:
+        return f"skip (only {n} samples)"
+    if dry_run:
+        return f"would fit (n={n}, emb={emb_tag})"
+    try:
+        X = embed_sequences(sequences, emb_tag)
+    except Exception as e:
+        return f"error embedding: {e}"
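+    # Absolute residuals on held-out predictions are the targets for sigma_hat.
+    residuals   = np.abs(y_true - y_pred).astype(np.float32)
+    sigma_model = fit_sigma_model(X, residuals)
+    # Clip sigma_hat away from zero so the normalised scores stay finite.
+    sigma_cal   = np.clip(sigma_model.predict(xgb.DMatrix(X)).astype(np.float64), 1e-6, None)
+    norm_scores = residuals / sigma_cal
+    # Finite-sample corrected conformal quantile of the normalised scores.
+    level       = min(1.0, np.ceil((n + 1) * (1 - alpha)) / n)
+    q           = float(np.quantile(norm_scores, level))
+    # In-sample interval diagnostics, reported in the status string only.
+    lo, hi      = y_pred - q * sigma_cal, y_pred + q * sigma_cal
+    coverage    = float(np.mean((y_true >= lo) & (y_true <= hi)))
+    avg_width   = float(np.mean(hi - lo))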
+    bundle = {"quantile": q, "sigma_model": sigma_model,
+              "emb_tag": emb_tag, "alpha": alpha, "adaptive": True}
+    joblib.dump(bundle, model_dir / "mapie_calibration.joblib")
+    return f"ok (n={n}, emb={emb_tag}, q={q:.4f}, cov={coverage:.3f}, avg_width={avg_width:.3f})"
+# Binding affinity fitting
+def fit_binding_affinity(model_dir: Path, alpha: float, dry_run: bool) -> str:
+    val_path = model_dir / "val_predictions.csv"
+    if not val_path.exists():
+        return "skip (no val_predictions.csv)"
+    binder_emb, target_emb = infer_binding_emb_tags(model_dir.name)
+    try:
+        df = pd.read_csv(val_path)
+    except Exception as e:
+        return f"error reading CSV: {e}"
+    # Binding affinity CSV has both sequence (binder) and target_sequence
+    cl = {c.lower(): c for c in df.columns}
+    if "sequence" not in cl or "target_sequence" not in cl:
+        return f"skip (missing sequence/target_sequence columns, have: {list(df.columns)})"
+    binder_seqs = df[cl["sequence"]].astype(str).tolist()
+    target_seqs = df[cl["target_sequence"]].astype(str).tolist()
+    try:
+        pred_col = resolve_col(df, PRED_CANDIDATES, "pred")
+        true_col = resolve_col(df, TRUE_CANDIDATES, "true")
+    except Exception as e:
+        return f"error: {e}"
+    y_pred = df[pred_col].values.astype(np.float64)
+    y_true = df[true_col].values.astype(np.float64)
+    mask = np.isfinite(y_pred) & np.isfinite(y_true)
+    binder_seqs = [s for s, m in zip(binder_seqs, mask) if m]
+    target_seqs = [s for s, m in zip(target_seqs, mask) if m]
+    y_pred, y_true = y_pred[mask], y_true[mask]
+    n = len(y_pred)
+    if n < 30:
+        return f"skip (only {n} samples)"
+    if dry_run:
+        return f"would fit (n={n}, binder_emb={binder_emb}, target_emb={target_emb})"
+    try:
+        X_binder = embed_sequences(binder_seqs, binder_emb)   # (n, H_b)
+        X_target = embed_sequences(target_seqs, target_emb)   # (n, H_t)
+        X = np.concatenate([X_target, X_binder], axis=1)      # (n, H_t+H_b)
+    except Exception as e:
+        return f"error embedding: {e}"
+    # Absolute residuals on held-out validation predictions.
+    # Equivalent to ResidualNormalisedScore.get_signed_conformity_scores()
+    # in MAPIE (Cordier et al. 2023), which computes |y - y_pred| / sigma_hat.
+    # We compute residuals first, then fit sigma_hat below, matching MAPIE's
+    # two-stage procedure (fit residual estimator → normalize → take quantile).
+    residuals   = np.abs(y_true - y_pred).astype(np.float32)
+    # Fit sigma_hat: auxiliary XGBoost regressor trained on (embeddings, |residuals|).
+    # Corresponds to ResidualNormalisedScore's residual_estimator fitted on
+    # (X_res, |y_res - y_hat_res|) per the MAPIE tutorial. MAPIE fits sigma_hat
+    # on log-residuals and exponentiates predictions to ensure positivity; we
+    # instead clip sigma to 1e-6 for the same effect.
+    sigma_model = fit_sigma_model(X, residuals)
+    sigma_cal   = np.clip(sigma_model.predict(xgb.DMatrix(X)).astype(np.float64), 1e-6, None)
+    # Normalized conformity scores: s_i = |y_i - y_hat_i| / sigma_hat(x_i).
+    # This is the ResidualNormalisedScore formula from MAPIE. Larger scores
+    # encode worse agreement between prediction and observation (Vovk et al. 2005).
+    norm_scores = residuals / sigma_cal
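+    # Finite-sample corrected conformal quantile at level ceil((n+1)(1-alpha))/n.
+    # Gives marginal coverage >= 1-alpha under exchangeability when sigma_hat is
+    # fitted on data separate from the calibration scores (Lei et al. 2018,
+    # Theorem 1). Here sigma_hat is fitted on the same validation rows, so treat
+    # the guarantee as approximate.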
+    level       = min(1.0, np.ceil((n + 1) * (1 - alpha)) / n)
+    q           = float(np.quantile(norm_scores, level))
+    lo, hi      = y_pred - q * sigma_cal, y_pred + q * sigma_cal
+    coverage    = float(np.mean((y_true >= lo) & (y_true <= hi)))
+    avg_width   = float(np.mean(hi - lo))
+    bundle = {
+        "quantile":       q,
+        "sigma_model":    sigma_model,
+        "emb_tag":        binder_emb,
+        "target_emb_tag": target_emb,
+        "alpha":          alpha,
+        "adaptive":       True,
+    }
+    joblib.dump(bundle, model_dir / "mapie_calibration.joblib")
+    return (f"ok (n={n}, binder={binder_emb}, target={target_emb}, "
+            f"q={q:.4f}, cov={coverage:.3f}, avg_width={avg_width:.3f})")
+MODEL_PATTERNS = [
+    "xgb_*", "enet_*", "svm_*", "svr_*", "mlp_*", "cnn_*", "transformer_*",
+    "wt_wt_*", "wt_smiles_*", "peptideclm_smiles_*", "chemberta_smiles_*",
+]
+def main():
+    parser = argparse.ArgumentParser()
+    parser.add_argument("--root",      type=Path, required=True)
+    parser.add_argument("--alpha",     type=float, default=0.1)
+    parser.add_argument("--prop",      type=str,   default=None,
+                        help="Only process a specific property subfolder")
+    parser.add_argument("--dry-run",   action="store_true")
+    parser.add_argument("--overwrite", action="store_true")
+    args = parser.parse_args()
+    search_root = args.root / args.prop if args.prop else args.root
+    model_dirs = []
+    for pat in MODEL_PATTERNS:
+        model_dirs.extend(sorted(search_root.rglob(pat)))
+    model_dirs = [d for d in model_dirs if d.is_dir()]
+    print(f"Found {len(model_dirs)} model dirs under {search_root}")
+    if args.dry_run:
+        print("DRY RUN\n")
+    counts = {"ok": 0, "skip": 0, "error": 0}
+    for model_dir in model_dirs:
+        rel = model_dir.relative_to(args.root)
+        if should_skip(model_dir):
+            print(f"  SKIP    {rel}  (halflife — no sequence in OOF CSV)")
+            counts["skip"] += 1
+            continue
+        out = model_dir / "mapie_calibration.joblib"
+        if out.exists() and not args.overwrite:
+            try:
+                b = joblib.load(out)
+                if b.get("adaptive"):
+                    print(f"  OK      {rel}  (already adaptive)")
+                    counts["ok"] += 1
+                    continue
+            except Exception:
+                pass
+        print(f"  FITTING {rel} ...", end=" ", flush=True)
+        if is_binding_affinity(model_dir):
+            status = fit_binding_affinity(model_dir, args.alpha, args.dry_run)
+        else:
+            status = fit_standard(model_dir, args.alpha, args.dry_run)
+        # A dry-run status ("would fit ...") is a success, not an error.
+        tag = ("ok" if status.startswith(("ok", "would fit"))
+               else "skip" if status.startswith("skip") else "error")
+        counts[tag] += 1
+        print(status)
+    print(f"\nDone. {counts}")
+if __name__ == "__main__":
+    main()
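The calibration step in `fit_standard` and `fit_binding_affinity` can be sketched without NumPy or XGBoost. This is an illustrative simplification, not the script's code: the function names are hypothetical, the per-sample `sigma` values are passed in rather than predicted by a model, and the quantile is taken as an order statistic directly instead of through `np.quantile`'s default interpolation, so the value can differ slightly from what the script stores.

```python
import math


def conformal_quantile(scores, alpha):
    """Return the ceil((n+1)(1-alpha))-th smallest score, capped at the largest."""
    n = len(scores)
    if n == 0:
        raise ValueError("need at least one calibration score")
    k = min(n, math.ceil((n + 1) * (1 - alpha)))
    return sorted(scores)[k - 1]


def calibrate(y_true, y_pred, sigma, alpha=0.1, floor=1e-6):
    """Compute q from the normalised scores |y - y_hat| / max(sigma, floor)."""
    scores = [abs(t - p) / max(s, floor) for t, p, s in zip(y_true, y_pred, sigma)]
    return conformal_quantile(scores, alpha)
```

For example, `calibrate([1.0, 2.0, 4.0, 8.0], [0.0] * 4, [1.0, 2.0, 2.0, 2.0], alpha=0.5)` normalises the residuals to `[1, 1, 2, 4]` and picks the third smallest. With small calibration sets and a small `alpha`, the cap means `q` is simply the largest score.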

inference.py CHANGED

@@ -361,6 +361,11 @@ def _mapie_uncertainty(mapie_bundle: dict, score: float,
     if mapie_bundle.get("adaptive") and "sigma_model" in mapie_bundle:
         q = float(mapie_bundle["quantile"])
         if embedding is not None:
             sigma_model = mapie_bundle["sigma_model"]
             sigma = float(sigma_model.predict(xgb.DMatrix(embedding.reshape(1, -1)))[0])
             sigma = max(sigma, 1e-6)

     if mapie_bundle.get("adaptive") and "sigma_model" in mapie_bundle:
         q = float(mapie_bundle["quantile"])
         if embedding is not None:
+            # Adaptive interval: y_hat ± q * sigma_hat(x).
+            # Equivalent to MAPIE's get_estimation_distribution():
+            #   y_pred + conformity_scores * r_pred
+            # where conformity_scores=q and r_pred=sigma_hat(x).
+            # (ResidualNormalisedScore, Cordier et al. 2023)
             sigma_model = mapie_bundle["sigma_model"]
             sigma = float(sigma_model.predict(xgb.DMatrix(embedding.reshape(1, -1)))[0])
             sigma = max(sigma, 1e-6)
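The interval arithmetic described in the new comment can be written as a small standalone function. This is a sketch under the assumption that the caller already has the point prediction, the stored quantile `q`, and the predicted `sigma`; `adaptive_interval` is a hypothetical name and is not defined in `inference.py`.

```python
def adaptive_interval(y_hat, q, sigma, floor=1e-6):
    """Return (lo, hi) = y_hat -/+ q * sigma, with sigma floored as in the diff."""
    sigma = max(sigma, floor)
    half_width = q * sigma
    return y_hat - half_width, y_hat + half_width
```

Because `q` is fixed per model while `sigma` varies per input, the interval widens for inputs the sigma model considers harder and narrows for easier ones.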