---
license: apache-2.0
---

![Overview of PeptiVerse](peptiverse-cover.png)

# PeptiVerse: A Unified Platform for Therapeutic Peptide Property Prediction 🧬🌌

This is the repository for [PeptiVerse: A Unified Platform for Therapeutic Peptide Property Prediction](https://www.biorxiv.org/content/10.64898/2025.12.31.697180), a collection of machine learning predictors for canonical and non-canonical peptide property prediction using sequence and SMILES representations. 🧬 PeptiVerse 🌌 enables evaluation of key biophysical and therapeutic properties of peptides for property-optimized generation.

## Table of Contents 🌟

- [Quick Start](#quick-start)
- [Installation](#installation)
- [Repository Structure](#repository-structure)
- [Training Data Collection](#training-data-collection)
- [Best Model List](#best-model-list)
  - [Full model set (cuML-enabled)](#full-model-set-cuml-enabled)
  - [Minimal deployable model set (no cuML)](#minimal-deployable-model-set-no-cuml)
- [Usage](#usage)
  - [Local Application Hosting](#local-application-hosting)
  - [Dataset integration](#dataset-integration)
  - [Training](#training)
  - [Quick inference by property per model](#quick-inference-by-property-per-model)
- [Property Interpretations](#property-interpretations)
- [Model Architecture](#model-architecture)
- [Troubleshooting](#troubleshooting)
- [Citation](#citation)

## Quick Start 🌟

- Lightweight start (basic models, no cuML; see below for details)

```bash
# Skip all LFS files; the folder will appear empty at first
git clone --no-checkout https://huggingface.co/ChatterjeeLab/PeptiVerse
cd PeptiVerse

# Enable sparse checkout
git sparse-checkout init --cone

# Download only the selected items
git sparse-checkout set \
    inference.py \
    download_light.py \
    best_models.txt \
    basic_models.txt \
    requirements.txt \
    tokenizer \
    README.md

# Now check out
GIT_LFS_SKIP_SMUDGE=1 git checkout

# Install basic packages
pip install -r requirements.txt

# Download basic model weights listed in basic_models.txt; adjust the config as needed
python download_light.py

# Test inference
python inference.py
```

- Full model clone (downloads all model weights)

```bash
# Clone repository
git clone https://huggingface.co/ChatterjeeLab/PeptiVerse

# Install dependencies
pip install -r requirements.txt

# Run inference
python inference.py
```

## Installation 🌟

### Minimal Setup

- Easy start-up environment (transformers and XGBoost models)

```bash
pip install -r requirements.txt
```

### Full Setup

- Access to the trained SVM and ElasticNet models additionally requires `RAPIDS cuML`; installation instructions are available on the official [GitHub page](https://github.com/rapidsai/cuml) (**CUDA-capable GPU required**).
- Optional: a pre-compiled Singularity/Apptainer environment (7.52 GB) is available on [Google Drive](https://drive.google.com/file/d/1RJQ9HK0_gsPOhRo5H5ZmH_MYcpJqQD7e/view?usp=sharing) with everything you need (a CUDA GPU is still required to load the cuML models).

```bash
# Test the container
apptainer exec peptiverse.sif python -c "import sys; print(sys.executable)"

# Run inference (see below)
apptainer exec peptiverse.sif python inference.py
```

## Repository Structure 🌟

This repo contains the large files backing [PeptiVerse](https://huggingface.co/spaces/ChatterjeeLab/PeptiVerse), an interactive app for peptide property prediction.
[Paper link.](https://www.biorxiv.org/content/10.64898/2025.12.31.697180v1)

```
PeptiVerse/
├── training_data_cleaned/   # Processed datasets with embeddings
│   └── <property>/          # Property-specific data
│       ├── train/val splits
│       └── precomputed embeddings
├── training_classifiers/    # Trained model weights
│   └── <property>/
│       ├── cnn_wt/          # CNN architectures
│       ├── mlp_wt/          # MLP architectures
│       └── xgb_wt/          # XGBoost models
├── tokenizer/               # PeptideCLM tokenizer
├── training_data/           # Raw training data
├── inference.py             # Main prediction interface
├── best_models.txt          # Model selection manifest
└── requirements.txt         # Python dependencies
```

For full data access, please download the corresponding `training_data_cleaned` and `training_classifiers` archives from Zenodo. The current Hugging Face repo hosts only the best model weights and metadata with split labels.

## Training Data Collection 🌟
Data distribution. Classification tasks report counts for class 0/1; regression tasks report total sample size (N).

Classification:

| Property | Amino Acid Sequences (0) | Amino Acid Sequences (1) | SMILES Sequences (0) | SMILES Sequences (1) |
|---|---|---|---|---|
| Hemolysis | 4765 | 1311 | 4765 | 1311 |
| Non-Fouling | 13580 | 3600 | 13580 | 3600 |
| Solubility | 9668 | 8785 | 9668 | 8785 |
| Permeability (Penetrance) | 1162 | 1162 | 1162 | 1162 |
| Toxicity | – | – | 5518 | 5518 |

Regression:

| Property | Amino Acid Sequences (N) | SMILES Sequences (N) |
|---|---|---|
| Permeability (PAMPA) | – | 6869 |
| Permeability (Caco-2) | – | 606 |
| Half-Life | 130 | 245 |
| Binding Affinity | 1436 | 1597 |
## Best Model List 🌟

### Full model set (cuML-enabled)

| Property | Best Model (Sequence) | Best Model (SMILES) | Task Type | Threshold (Sequence) | Threshold (SMILES) |
|---|---|---|---|---|---|
| Hemolysis | SVM | CNN (chemberta) | Classifier | 0.2521 | 0.564 |
| Non-Fouling | Transformer | ENET (peptideclm) | Classifier | 0.57 | 0.6969 |
| Solubility | CNN | – | Classifier | 0.377 | – |
| Permeability (Penetrance) | SVM | SVM (chemberta) | Classifier | 0.5493 | 0.573 |
| Toxicity | – | CNN (chemberta) | Classifier | – | 0.49 |
| Binding Affinity | unpooled | unpooled | Regression | – | – |
| Permeability (PAMPA) | – | CNN (chemberta) | Regression | – | – |
| Permeability (Caco-2) | – | SVR (chemberta) | Regression | – | – |
| Half-life | Transformer | XGB (peptideclm) | Regression | – | – |

> Note: *unpooled* indicates models operating on token-level embeddings with cross-attention, rather than mean-pooled representations.

### Minimal deployable model set (no cuML)

| Property | Best Model (Sequence) | Best Model (SMILES) | Task Type | Threshold (Sequence) | Threshold (SMILES) |
|---|---|---|---|---|---|
| Hemolysis | XGB | CNN (chemberta) | Classifier | 0.2801 | 0.564 |
| Non-Fouling | Transformer | XGB (peptideclm) | Classifier | 0.57 | 0.3892 |
| Solubility | CNN | – | Classifier | 0.377 | – |
| Permeability (Penetrance) | XGB | XGB (chemberta) | Classifier | 0.4301 | 0.5028 |
| Toxicity | – | CNN (chemberta) | Classifier | – | 0.49 |
| Binding Affinity | wt_wt_pooled | chemberta_smiles_pooled | Regression | – | – |
| Permeability (PAMPA) | – | CNN (chemberta) | Regression | – | – |
| Permeability (Caco-2) | – | SVR (chemberta) | Regression | – | – |
| Half-life | Transformer | XGB (peptideclm) | Regression | – | – |

> Note: models listed as SVM or ENET in the full set are replaced with XGB here, since those models are not supported in the deployment environment without a cuML setup.
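Each classifier reports a probability-like score that is compared against its model-specific threshold above. A minimal sketch of that decision rule (the helper below is illustrative only; `predict_property` in `inference.py` applies the threshold for you and returns the resulting `label`):

```python
# Illustrative helper (not part of inference.py): map a classifier score to a
# 0/1 label using the per-model decision thresholds from the tables above.
def apply_threshold(score: float, threshold: float) -> int:
    """Return 1 (positive class) when score >= threshold, else 0."""
    return int(score >= threshold)

# Hemolysis, sequence-based XGB model (minimal set): threshold = 0.2801
print(apply_threshold(0.91, 0.2801))  # 1 -> predicted hemolytic
print(apply_threshold(0.10, 0.2801))  # 0 -> predicted non-hemolytic
```

Note that the thresholds are calibrated per model, so a score of 0.45 can mean different labels for different properties.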
## Usage 🌟

### Local Application Hosting

- Host the [PeptiVerse UI](https://huggingface.co/spaces/ChatterjeeLab/PeptiVerse) locally with your own resources.

```bash
# Configure models in best_models.txt
git clone https://huggingface.co/spaces/ChatterjeeLab/PeptiVerse
python app.py
```

### Data pre-processing

Under `training_data_cleaned`, we provide the generated embeddings in Hugging Face dataset format. The following scripts were used to generate the data.

### Dataset integration

- All properties are provided as raw_data/split_ready_csvs/[huggingface_datasets](https://huggingface.co/docs/datasets/en/index).
- Selectively download only the data you need with `huggingface-cli`:

```bash
# Download only the training_data_cleaned folder, skipping weights/artifacts,
# and make real copies instead of symlinks
huggingface-cli download ChatterjeeLab/PeptiVerse \
    --include "training_data_cleaned/**" \
    --exclude "**/*.pt" "**/*.joblib" \
    --local-dir PeptiVerse_partial \
    --local-dir-use-symlinks False
```

- Or in Python:

```python
from huggingface_hub import snapshot_download

local_dir = snapshot_download(
    repo_id="ChatterjeeLab/PeptiVerse",
    allow_patterns=["training_data_cleaned/**"],  # only this folder
    ignore_patterns=["**/*.pt", "**/*.joblib"],   # skip weights/artifacts
    local_dir="PeptiVerse_partial",
    local_dir_use_symlinks=False,                 # make real copies
)
print("Downloaded to:", local_dir)
```

- Usage of the Hugging Face datasets (with pre-computed embeddings and splits)
  - All embedding datasets are saved via `DatasetDict.save_to_disk` and loadable with:

```python
from datasets import load_from_disk

ds = load_from_disk(PATH)
train_ds = ds["train"]
val_ds = ds["val"]
```

- A) Sequence-based ([ESM-2](https://huggingface.co/facebook/esm2_t33_650M_UR50D) embeddings)
  - Pooled (fixed-length vector per sequence)
    - Generated by mean-pooling token embeddings, excluding special tokens (CLS/EOS) and padding.
    - Each item:
      - sequence: `str`
      - label: `int` (classification) or `float` (regression)
      - embedding: `float32[H]` (H = 1280 for ESM-2 650M)
  - Unpooled (variable-length token matrix)
    - Generated by keeping all valid token embeddings (excluding special tokens and padding) as a per-sequence matrix.
    - Each item:
      - sequence: `str`
      - label: `int` (classification) or `float` (regression)
      - embedding: `float16[L, H]` (nested lists)
      - attention_mask: `int8[L]`
      - length: `int` (= L)
- B) SMILES-based ([PeptideCLM](https://github.com/AaronFeller/PeptideCLM) embeddings)
  - Pooled (fixed-length vector per sequence)
    - Generated by mean-pooling token embeddings, excluding special tokens (CLS/EOS) and padding.
    - Each item:
      - sequence: `str` (SMILES)
      - label: `int` (classification) or `float` (regression)
      - embedding: `float32[H]`
  - Unpooled (variable-length token matrix)
    - Generated by keeping all valid token embeddings (excluding special tokens and padding) as a per-sequence matrix.
    - Each item:
      - sequence: `str` (SMILES)
      - label: `int` (classification) or `float` (regression)
      - embedding: `float16[L, H]` (nested lists)
      - attention_mask: `int8[L]`
      - length: `int` (= L)
- C) SMILES-based ([ChemBERTa](https://huggingface.co/DeepChem/ChemBERTa-77M-MLM) embeddings)
  - Pooled (fixed-length vector per sequence)
    - Generated by mean-pooling token embeddings, excluding special tokens (CLS/EOS) and padding.
    - Each item:
      - sequence: `str` (SMILES)
      - label: `int` (classification) or `float` (regression)
      - embedding: `float32[H]`
  - Unpooled (variable-length token matrix)
    - Generated by keeping all valid token embeddings (excluding special tokens and padding) as a per-sequence matrix.
    - Each item:
      - sequence: `str` (SMILES)
      - label: `int` (classification) or `float` (regression)
      - embedding: `float16[L, H]` (nested lists)
      - attention_mask: `int8[L]`
      - length: `int` (= L)

### Training

Under the `training_classifiers` folder, we provide the Python scripts used to train the different models. The scripts will 1.
read the pre-processed Hugging Face dataset from the `training_data_cleaned` folder, and 2. perform an Optuna hyperparameter sweep once called. All training was conducted on HPC with the SLURM scripts under the `training_classifiers/src` folder; customize or isolate individual model training scripts as needed.

##### Example of training

###### ML models

```bash
HOME_LOC=/home
SCRIPT_LOC=$HOME_LOC/PeptiVerse/training_classifiers
EMB_LOC=$HOME_LOC/PeptiVerse/training_data_cleaned

OBJECTIVE='hemolysis'  # nf/solubility/hemolysis/permeability_pampa/permeability_caco2
WT='smiles'            # wt/smiles
DATA_FILE="hemo_${WT}_with_embeddings"
LOG_LOC=$SCRIPT_LOC/src/logs
DATE=$(date +%m_%d)
MODEL_TYPE='svm_gpu'   # xgb/enet_gpu/svm_gpu
SPECIAL_PREFIX="${MODEL_TYPE}-${OBJECTIVE}-${WT}_new"

# Create log directory if it doesn't exist
mkdir -p $LOG_LOC
cd $SCRIPT_LOC

python -u train_ml.py \
    --dataset_path "${EMB_LOC}/${OBJECTIVE}/${DATA_FILE}" \
    --out_dir "${SCRIPT_LOC}/${OBJECTIVE}/${MODEL_TYPE}_${WT}" \
    --model "${MODEL_TYPE}" \
    --n_trials 200 > "${LOG_LOC}/${DATE}_${SPECIAL_PREFIX}.log" 2>&1
```

###### DNN models

```bash
HOME_LOC=/home
SCRIPT_LOC=$HOME_LOC/PeptiVerse/training_classifiers
EMB_LOC=$HOME_LOC/PeptiVerse/training_data_cleaned

OBJECTIVE='nf'    # nf/solubility/hemolysis
WT='smiles'       # wt/smiles
DATA_FILE="nf_${WT}_with_embeddings_unpooled"
LOG_LOC=$SCRIPT_LOC/src/logs
DATE=$(date +%m_%d)
MODEL_TYPE='cnn'  # mlp/cnn/transformer
SPECIAL_PREFIX="${MODEL_TYPE}-${OBJECTIVE}-${WT}"

# Create log directory if it doesn't exist
mkdir -p $LOG_LOC
cd $SCRIPT_LOC

python -u train_nn.py \
    --dataset_path "${EMB_LOC}/${OBJECTIVE}/${DATA_FILE}" \
    --out_dir "${SCRIPT_LOC}/${OBJECTIVE}/${MODEL_TYPE}_${WT}" \
    --model "${MODEL_TYPE}" \
    --n_trials 200 > "${LOG_LOC}/${DATE}_${SPECIAL_PREFIX}.log" 2>&1
```

###### Binding Affinity

```bash
HOME_LOC=/home
SCRIPT_LOC=$HOME_LOC/PeptiVerse/training_classifiers
EMB_LOC=$HOME_LOC/PeptiVerse/training_data_cleaned

OBJECTIVE='binding_affinity'
BINDER_MODEL='chemberta'  # peptideclm/chemberta
STATUS='unpooled'         # pooled/unpooled
TYPE='smiles'
DATA_FILE="pair_wt_${TYPE}_${STATUS}"
LOG_LOC=$SCRIPT_LOC/src/logs
DATE=$(date +%m_%d)
SPECIAL_PREFIX="${OBJECTIVE}-${BINDER_MODEL}-${STATUS}"

python -u binding_training.py \
    --dataset_path "${EMB_LOC}/${OBJECTIVE}/${BINDER_MODEL}/${DATA_FILE}" \
    --mode "${STATUS}" \
    --out_dir "${SCRIPT_LOC}/${OBJECTIVE}/${BINDER_MODEL}_${TYPE}_${STATUS}" \
    --n_trials 200 > "${LOG_LOC}/${DATE}_${SPECIAL_PREFIX}.log" 2>&1
```

### Quick inference by property per model

```python
from pathlib import Path

from inference import PeptiVersePredictor

root = Path(__file__).resolve().parent  # current script folder

predictor = PeptiVersePredictor(
    manifest_path=root / "best_models.txt",
    classifier_weight_root=root,
    device="cuda",  # or "cpu"
)

# mode: smiles (SMILES-based models) / wt (sequence-based models)
# Property keys (with some level of name normalization):
#   hemolysis
#   nf (Non-Fouling)
#   solubility
#   permeability_penetrance
#   toxicity
#   permeability_pampa
#   permeability_caco2
#   halflife
#   binding_affinity

seq = "GIVEQCCTSICSLYQLENYCN"
smiles = "CC(C)C[C@@H]1NC(=O)[C@@H](CC(C)C)N(C)C(=O)[C@@H](C)N(C)C(=O)[C@H](Cc2ccccc2)NC(=O)[C@H](CC(C)C)N(C)C(=O)[C@H]2CCCN2C1=O"

# Hemolysis
out = predictor.predict_property("hemolysis", mode="wt", input_str=seq)
print(out)  # {"property":"hemolysis","mode":"wt","score":prob,"label":0/1,"threshold":...}
out = predictor.predict_property("hemolysis", mode="smiles", input_str=smiles)
print(out)

# Non-fouling (key is nf)
out = predictor.predict_property("nf", mode="wt", input_str=seq)
print(out)
out = predictor.predict_property("nf", mode="smiles", input_str=smiles)
print(out)

# Solubility (sequence-only)
out = predictor.predict_property("solubility", mode="wt", input_str=seq)
print(out)

# Permeability (Penetrance) (sequence-only)
out = predictor.predict_property("permeability_penetrance", mode="wt", input_str=seq)
print(out)

# Toxicity (SMILES-only)
out = predictor.predict_property("toxicity", mode="smiles", input_str=smiles)
print(out)

# Permeability (PAMPA) (SMILES regression)
out = predictor.predict_property("permeability_pampa", mode="smiles", input_str=smiles)
print(out)  # {"property":"permeability_pampa","mode":"smiles","score":value}

# Permeability (Caco-2) (SMILES regression)
out = predictor.predict_property("permeability_caco2", mode="smiles", input_str=smiles)
print(out)

# Half-life (sequence-based + SMILES regression)
out = predictor.predict_property("halflife", mode="wt", input_str=seq)
print(out)
out = predictor.predict_property("halflife", mode="smiles", input_str=smiles)
print(out)

# Binding Affinity
protein = "MTEYKLVVVGAGGVGKSALTIQLIQNHFVDEYDPTIEDSYRKQV..."  # target protein
peptide_seq = "GIVEQCCTSICSLYQLENYCN"
out = predictor.predict_binding_affinity(
    mode="wt",
    target_seq=protein,
    binder_str=peptide_seq,
)
print(out)
# {
#   "property": "binding_affinity",
#   "mode": "wt",
#   "affinity": float,
#   "class_by_threshold": "High (≥9)" / "Moderate (7-9)" / "Low (<7)",
#   "class_by_logits": same buckets,
#   "binding_model": "pooled" or "unpooled",
# }
```

#### Advanced inference with uncertainty prediction

Uncertainty prediction is exposed as a parameter in the inference code. The full classifier folder from Zenodo is required to enable this functionality. Model uncertainty is produced by the scripts under the `training_classifiers` folder whose names start with "**refit**"; a detailed description can be found in the methodology section of the manuscript.

At inference time, `PeptiVersePredictor` returns an `uncertainty` field with every prediction when `uncertainty=True` is passed. The method and its interpretation depend on the model class, determined automatically at inference time.
```python
seq = "GIGAVLKVLTTGLPALISWIKRKRQQ"
smiles = "C(C)C[C@@H]1NC(=O)[C@@H]2CCCN2C(=O)[C@@H](CC(C)C)NC(=O)[C@@H](CC(C)C)N(C)C(=O)[C@H](C)NC(=O)[C@H](Cc2ccccc2)NC1=O"

print(predictor.predict_property("nf", "wt", seq, uncertainty=True))
# {'property': 'nf', 'col': 'wt', 'score': 0.00014520535252195523, 'emb_tag': 'wt',
#  'label': 0, 'threshold': 0.57, 'uncertainty': 0.0017192508727321288,
#  'uncertainty_type': 'ensemble_predictive_entropy'}

print(predictor.predict_property("nf", "smiles", smiles, uncertainty=True))
# {'property': 'nf', 'col': 'smiles', 'score': 0.025485480204224586, 'emb_tag': 'peptideclm',
#  'label': 0, 'threshold': 0.6969, 'uncertainty': 0.11868063130587676,
#  'uncertainty_type': 'binary_predictive_entropy_single_model'}
```

---

##### Method by Model Class

| Model Class | Task | Uncertainty Method | Output Type | Range |
|---|---|---|---|---|
| MLP, CNN, Transformer | Classifier | Deep ensemble predictive entropy (5 seeds) | `float` | [0, ln(2) ≈ 0.693] |
| MLP, CNN, Transformer | Regression | Adaptive conformal interval; falls back to ensemble std if no MAPIE bundle | `(lo, hi)` or `float` | unbounded |
| SVM / SVC / XGBoost | Classifier | Binary predictive entropy (sigmoid of decision function) | `float` | [0, ln(2) ≈ 0.693] |
| SVR / ElasticNet / XGBoost | Regression | Adaptive conformal interval | `(lo, hi)` | unbounded |

> **Uncertainty is `None`** when a DNN classifier has no seed ensemble trained, or when a regression model has no `mapie_calibration.joblib` in its model directory.

---

## Property Interpretations 🌟

You can also find the same descriptions in the paper or in the PeptiVerse app `Documentation` tab.

---

### 🩸 Hemolysis Prediction
The hemolysis endpoint is defined by HC50, the concentration at which 50% of red blood cells are lysed. Peptides with HC50 < 100 µM are labeled hemolytic, and all others non-hemolytic, yielding a binary 0/1 dataset. The predicted probability should therefore be interpreted as a risk indicator, not an exact concentration estimate.
**Output interpretation:**
- Score close to 1.0 = high probability of red blood cell membrane disruption
- Score close to 0.0 = non-hemolytic

---

### 💧 Solubility Prediction
Outputs a probability (0–1) that a peptide remains soluble in aqueous conditions.
**Output interpretation:**
- Score close to 1.0 = highly soluble
- Score close to 0.0 = poorly soluble

---

### 👯 Non-Fouling Prediction
Higher scores indicate stronger non-fouling behavior, desirable for circulation and surface-exposed applications.
**Output interpretation:**
- Score close to 1.0 = non-fouling
- Score close to 0.0 = fouling

---

### 🪣 Permeability Prediction
Predicts membrane permeability on a logarithmic permeability scale.
**Output interpretation:**
- Higher values = more permeable (> -6.0)
- Penetrance is a classification prediction: scores lie in [0, 1], and values closer to 1 indicate higher permeability.

---

### ⏱️ Half-Life Prediction
**Interpretation:** Predicted values reflect relative peptide stability, reported in hours. Higher scores indicate longer persistence in serum, while lower scores suggest faster degradation.
---

### ☠️ Toxicity Prediction
**Interpretation:** Outputs a probability (0–1) that a peptide exhibits toxic effects. Higher scores indicate increased toxicity risk.
---

### 🔗 Binding Affinity Prediction
Predicts peptide-protein binding affinity. Requires both peptide and target protein sequence.
**Interpretation:**
- Scores ≥ 9 correspond to tight binders (K ≤ 10⁻⁹ M, nanomolar to picomolar range)
- Scores between 7 and 9 correspond to medium binders (K between 10⁻⁷ and 10⁻⁹ M, sub-micromolar to nanomolar range)
- Scores < 7 correspond to weak binders (K ≥ 10⁻⁷ M, approaching micromolar and weaker)
- A difference of 1 unit in score corresponds to an approximately tenfold change in binding affinity.
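The tenfold-per-unit rule above is consistent with reading the affinity score as a pK_d-style value, i.e. K_d ≈ 10^(−score) mol/L. This interpretation and the helper below are our illustration for intuition, not an official API:

```python
# Illustrative helper (not part of inference.py): interpret the predicted
# affinity score as a pKd-like value, so K_d ≈ 10^(-score) mol/L.
def score_to_kd_molar(score: float) -> float:
    return 10.0 ** (-score)

print(score_to_kd_molar(9.0))  # ~1e-9 M (1 nM): tight-binder boundary
print(score_to_kd_molar(7.0))  # ~1e-7 M (100 nM): medium/weak boundary
```

A one-unit drop in score multiplies the implied K_d by ten, matching the bullet above.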
---

### Uncertainty Interpretation
#### Entropy (classifiers)
Binary predictive entropy of the output probability p̄:
$$\mathcal{H} = -\bar{p}\log\bar{p} - (1 - \bar{p})\log(1 - \bar{p})$$
- For **DNN classifiers**: p̄ is the mean probability across 5 independently seeded models (deep ensemble). High entropy reflects both epistemic uncertainty (seed disagreement) and aleatoric uncertainty (collectively diffuse predictions).
- For **XGBoost / SVM / ElasticNet classifiers**: p̄ is the single model's output probability (or the sigmoid of the decision function for ElasticNet). Entropy reflects the output confidence of a single model only.
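The entropy above can be computed directly from a predicted probability; a minimal sketch (the helper name is ours, not part of the repo):

```python
import math

def binary_predictive_entropy(p: float, eps: float = 1e-12) -> float:
    """H = -p*ln(p) - (1-p)*ln(1-p), in nats; maximum ln(2) ~ 0.693 at p = 0.5."""
    p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
    return -p * math.log(p) - (1.0 - p) * math.log(1.0 - p)

print(round(binary_predictive_entropy(0.5), 4))   # 0.6931 -> maximum uncertainty
print(round(binary_predictive_entropy(0.99), 4))  # 0.056  -> high confidence
```

For the ensemble case, the same formula is applied to the mean probability across seeds.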
| Range | Interpretation |
|---|---|
| < 0.1 | High confidence |
| 0.1 – 0.4 | Moderate uncertainty |
| 0.4 – 0.6 | Low confidence |
| > 0.6 | Very low confidence: model close to guessing |
| ≈ 0.693 | Maximum uncertainty: predicted probability ≈ 0.5 |

---

#### Adaptive Conformal Prediction Interval (regressors)
Returned as a tuple `(lo, hi)` with a 90% marginal coverage guarantee.
We implement the **residual normalised conformity score** following [Lei et al. (2018)](https://doi.org/10.1080/01621459.2017.1307116) and [Cordier et al. (2023) / MAPIE](https://proceedings.mlr.press/v204/cordier23a.html). An auxiliary XGBoost model $\hat{\sigma}(\mathbf{x})$ is trained on held-out embeddings and absolute residuals $|y_i - \hat{y}_i|$. At inference:

$$[\hat{y}(\mathbf{x}) - q \cdot \hat{\sigma}(\mathbf{x}),\ \hat{y}(\mathbf{x}) + q \cdot \hat{\sigma}(\mathbf{x})]$$

where $q$ is the $\lceil (n+1)(1-\alpha) \rceil / n$ quantile of the normalized scores $s_i = |y_i - \hat{y}_i| / \hat{\sigma}(\mathbf{x}_i)$.

- **Interval width varies per input**: molecules more dissimilar to the training data tend to receive wider intervals
- **Coverage guarantee**: on exchangeable data, $P(y \in [\hat{y} - q\hat{\sigma},\ \hat{y} + q\hat{\sigma}]) \geq 0.90$
- **The guarantee is marginal**, not conditional: an unusually narrow interval on an out-of-distribution molecule does not guarantee correctness
- **Full access**: MAPIE bundles are precomputed for all regression models; users can apply them directly to customized model lists.
---

#### Generating a MAPIE Bundle for a New Model
To enable conformal uncertainty for a newly trained regression model:
```bash
# Fit adaptive conformal bundle from val_predictions.csv
python fit_mapie_adaptive.py --root training_classifiers --prop <property>
```

The script reads the `sequence`/`smiles` and `y_pred`/`y_true` columns from the CSV, recomputes embeddings, fits the XGBoost $\hat{\sigma}$ model, and saves `mapie_calibration.joblib` into the model directory. The bundle is automatically detected and loaded by `PeptiVersePredictor` on the next initialization.
## Model Architecture 🌟

- **Sequence Embeddings:** [ESM-2 650M model](https://huggingface.co/facebook/esm2_t33_650M_UR50D) / [PeptideCLM model](https://huggingface.co/aaronfeller/PeptideCLM-23M-all) / [ChemBERTa](https://huggingface.co/DeepChem/ChemBERTa-77M-MLM). Foundation model embeddings are frozen.
- **XGBoost Model:** Gradient boosting on pooled embedding features for efficient, high-performance prediction.
- **CNN/Transformer Model:** One-dimensional convolutional and self-attention transformer networks operating on unpooled embeddings to capture local sequence patterns.
- **Binding Model:** Transformer-based architecture with cross-attention between protein and peptide representations.
- **SVR Model:** Support Vector Regression on pooled embeddings, providing a kernel-based, nonparametric regression baseline that is robust on smaller or noisy datasets.
- **Others:** SVM and ElasticNet models were trained with [RAPIDS cuML](https://github.com/rapidsai/cuml), which requires a CUDA environment and is therefore not supported in the web app. Model checkpoints remain available in the Hugging Face repository.

## Troubleshooting 🌟

### LFS Download Issues

If files appear as SHA pointers:

```bash
huggingface-cli download ChatterjeeLab/PeptiVerse \
    training_data_cleaned/hemolysis/hemo_smiles_meta_with_split.csv \
    --local-dir . \
    --local-dir-use-symlinks False
```

## Citation 🌟

If you find this repository helpful for your publications, please consider citing our paper:

```
@article{Zhang2025.12.31.697180,
  author = {Zhang, Yinuo and Tang, Sophia and Chen, Tong and Mahood, Elizabeth and Vincoff, Sophia and Chatterjee, Pranam},
  title = {PeptiVerse: A Unified Platform for Therapeutic Peptide Property Prediction},
  elocation-id = {2025.12.31.697180},
  year = {2026},
  doi = {10.64898/2025.12.31.697180},
  publisher = {Cold Spring Harbor Laboratory},
  URL = {https://www.biorxiv.org/content/early/2026/01/03/2025.12.31.697180},
  eprint = {https://www.biorxiv.org/content/early/2026/01/03/2025.12.31.697180.full.pdf},
  journal = {bioRxiv}
}
```

To use this repository, you agree to abide by the Apache 2.0 License.