Input Requirements and Constraints

Supported Inputs

Amino acid sequences: Linear peptides composed of the 20 standard amino acids
SMILES: Chemically modified peptides, including cyclization, D-amino acids, and noncanonical residues

Validation
Invalid sequences or SMILES will be rejected
Properties not supported are labeled as (Not Supported)
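
As an illustration of the amino acid check described above, here is a minimal sketch. The function name `is_valid_sequence` and the choice to trim whitespace and uppercase the input are assumptions for this example, not the app's actual code. SMILES validation typically relies on a cheminformatics toolkit and is not shown here.

```python
# The 20 standard amino acids in one-letter code.
STANDARD_AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWY")


def is_valid_sequence(sequence: str) -> bool:
    """Return True if the input is a non-empty string of standard residues.

    Hypothetical helper: whitespace trimming and case folding are
    assumptions made for this sketch.
    """
    cleaned = sequence.strip().upper()
    return bool(cleaned) and all(residue in STANDARD_AMINO_ACIDS for residue in cleaned)


if __name__ == "__main__":
    for candidate in ["GLFDIVKKVV", "ACDX", ""]:
        status = "accepted" if is_valid_sequence(candidate) else "rejected"
        print(f"{candidate!r}: {status}")
```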

Training Data Collection

Data distribution. Classification tasks report counts for class 0/1; regression tasks report total sample size (N). AA stands for amino acid-based sequences.

Classification (counts for class 0 / 1)

Property	AA (0)	AA (1)	SMILES (0)	SMILES (1)
Hemolysis	4765	1311	4765	1311
Non-Fouling	13580	3600	13580	3600
Solubility	9668	8785	9668	8785
Permeability (Penetrance)	1162	1162	1162	1162
Toxicity	–	–	5518	5518

Regression (total N)

Property	AA (N)	SMILES (N)
Permeability (PAMPA)	–	6869
Permeability (CACO2)	–	606
Half-Life	130	245
Binding Affinity	1433	1702

Our models are trained on curated datasets from multiple sources. For detailed data-cleaning procedures, please refer to our paper.

🩸 Hemolysis Dataset

Primary Source: the Database of Antimicrobial Activity and Structure of Peptides (DBAASPv3)
Secondary Source: peptide-dashboard
Description: Probability of peptide disrupting red blood cell membranes.
Interpretation: HC50 is the concentration (x µg/mL) at which 50% of red blood cells are lysed. Peptides with HC50 < 100 µM are considered hemolytic and all others non-hemolytic, resulting in a binary 0/1 dataset. Scores close to 1 indicate a high probability of red blood cell membrane disruption, while scores close to 0 indicate low hemolytic risk. The predicted probability should therefore be interpreted as a risk indicator, not an exact concentration estimate.
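
The labeling rule above can be sketched as a one-line threshold. The function name `hemolysis_label` is hypothetical, and the mapping of hemolytic peptides to class 1 is inferred from the score interpretation (scores near 1 mean membrane disruption) rather than stated explicitly.

```python
def hemolysis_label(hc50: float, threshold: float = 100.0) -> int:
    """Binarize an HC50 measurement.

    Returns 1 (hemolytic) when HC50 is below the threshold and
    0 (non-hemolytic) otherwise. The threshold of 100 follows the text;
    the value is assumed to be in the same unit as the HC50 input.
    """
    return 1 if hc50 < threshold else 0


if __name__ == "__main__":
    for value in (12.5, 100.0, 350.0):
        print(f"HC50 = {value}: label {hemolysis_label(value)}")
```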

💧 Solubility Dataset

Primary Source: PROSO-II
Secondary Source: peptideBERT
Description: Probability of peptide remaining dissolved in aqueous conditions.
Interpretation: Outputs a probability (0–1) that a peptide remains soluble in aqueous conditions. Higher scores indicate lower aggregation risk and better formulation stability.

👯 Non-Fouling Dataset

Primary Source: Classifying antimicrobial and multifunctional peptides with Bayesian network models
Secondary Source: peptideBERT
Description: A nonfouling peptide resists nonspecific interactions and protein adsorption.
Interpretation: Outputs the probability (0–1) that a peptide resists nonspecific protein adsorption. Higher scores indicate stronger non-fouling behavior, desirable for circulation and surface-exposed applications.

🪣 Permeability Dataset

Primary Source: CycPeptMPDB, PAMPA
Secondary Source: PepLand
Description: Probability of peptide penetrating the cell membrane.
Interpretation: For PAMPA and CACO-2 regression, outputs are log-scaled permeability values. Following CycPeptMPDB conventions, log Pexp ≥ −6.0 indicates favorable permeability, while values below −6.0 indicate weak permeability. For penetrance prediction, a probability closer to 1 indicates a higher likelihood of cell penetrance, and a probability closer to 0 indicates a lower one.
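
A minimal sketch of how the −6.0 cutoff could be applied to a regression output. The function name `permeability_class` and the returned labels are illustrative choices, not part of the app.

```python
def permeability_class(log_pexp: float, cutoff: float = -6.0) -> str:
    """Interpret a predicted log Pexp value using the -6.0 cutoff.

    Values at or above the cutoff are treated as favorable permeability,
    values below it as weak permeability.
    """
    return "favorable" if log_pexp >= cutoff else "weak"


if __name__ == "__main__":
    for prediction in (-5.2, -6.0, -7.4):
        print(f"log Pexp = {prediction}: {permeability_class(prediction)}")
```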

⏱️ Half-Life Dataset

Primary Source: Thpdb2, PepTherDia, peplife
Interpretation: Predicted values reflect relative peptide stability, expressed in hours. Higher scores indicate longer persistence in serum, while lower scores suggest faster degradation.

☠️ Toxicity Dataset

Primary Source: ToxinPred3.0
Interpretation: Outputs a probability (0–1) that a peptide exhibits toxic effects. Higher scores indicate increased toxicity risk.

🔗 Binding Affinity Dataset

Primary Source: PepLand
Description: Binding labels are already normalized in PepLand and combine Kd, Ki, and IC50 measurements. The model predicts a continuous binding affinity score, where higher values indicate stronger binding. Scores are comparable across peptides binding to the same protein target.
Interpretation:
- Scores ≥ 9 correspond to tight binders (K ≤ 10⁻⁹ M, nanomolar to picomolar range)
- Scores between 7 and 9 correspond to medium binders (10⁻⁷–10⁻⁹ M, nanomolar to micromolar range)
- Scores < 7 correspond to weak binders (K ≥ 10⁻⁶ M, micromolar and weaker)
- A difference of 1 unit in score corresponds to an approximately tenfold change in binding affinity.
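
The score bands and the tenfold-per-unit rule above can be sketched as follows. The function names `binding_category` and `affinity_fold_change` are hypothetical helpers written for this example.

```python
def binding_category(score: float) -> str:
    """Map a predicted binding score to the bands listed above."""
    if score >= 9:
        return "tight"
    if score >= 7:
        return "medium"
    return "weak"


def affinity_fold_change(score_a: float, score_b: float) -> float:
    """Approximate fold difference in affinity between two scores.

    Uses the rule that one score unit is roughly a tenfold change,
    so the ratio is 10 ** (score_a - score_b).
    """
    return 10 ** (score_a - score_b)


if __name__ == "__main__":
    for score in (9.4, 8.1, 6.3):
        print(f"score {score}: {binding_category(score)} binder")
    print("score 9 vs 7:", affinity_fold_change(9.0, 7.0), "fold")
```

Because scores are only comparable for peptides binding the same protein target, the fold change is meaningful only within one target.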

Model Architecture

Sequence Embeddings: ESM-2 650M model / PeptideCLM model. Foundational embeddings are frozen.
XGBoost Model: Gradient boosting on pooled embedding features for efficient, high-performance prediction.
CNN/Transformer Model: One-dimensional convolutional/self-attention transformer networks operating on unpooled embeddings to capture local sequence patterns.
Binding Model: Transformer-based architecture with cross-attention between protein and peptide representations.
SVR Model: Support Vector Regression applied to pooled embeddings, providing a kernel-based, nonparametric regression baseline that is robust on smaller or noisy datasets.
Others: SVM and Elastic Nets were trained with RAPIDS cuML, which requires a CUDA environment and is therefore not supported in the web app. Model checkpoints remain available in the Hugging Face repository.

Model Training and Weight Hosting

More instructions can be found at PeptiVerse
Uncertainty prediction is not supported in the app version, but the code is available at PeptiVerse for local deployment.

🧪 Physicochemical Properties

Net Charge Calculation

Uses Henderson-Hasselbalch equation
pH-dependent calculation
Considers all ionizable groups (K, R, H, D, E, C, Y, termini)
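
A minimal sketch of a Henderson-Hasselbalch net charge calculation over the ionizable groups listed above. The pKa values are illustrative assumptions (a commonly used textbook-style set); the app's own pKa table is not given here and may differ.

```python
# Assumed pKa values for illustration only; the app may use a different set.
PKA_POSITIVE = {"K": 10.8, "R": 12.5, "H": 6.5, "N_TERM": 8.6}
PKA_NEGATIVE = {"D": 3.9, "E": 4.1, "C": 8.5, "Y": 10.1, "C_TERM": 3.6}


def net_charge(sequence: str, ph: float = 7.0) -> float:
    """Estimate the net charge of a linear peptide at a given pH.

    Each basic group contributes +1 / (1 + 10^(pH - pKa)) and each acidic
    group contributes -1 / (1 + 10^(pKa - pH)), per Henderson-Hasselbalch.
    Residues without an ionizable side chain are ignored.
    """
    groups = list(sequence.upper()) + ["N_TERM", "C_TERM"]
    charge = 0.0
    for group in groups:
        if group in PKA_POSITIVE:
            charge += 1.0 / (1.0 + 10 ** (ph - PKA_POSITIVE[group]))
        elif group in PKA_NEGATIVE:
            charge -= 1.0 / (1.0 + 10 ** (PKA_NEGATIVE[group] - ph))
    return charge


if __name__ == "__main__":
    for ph in (2.0, 7.0, 12.0):
        print(f"pH {ph}: net charge of KRDE = {net_charge('KRDE', ph):+.2f}")
```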

Isoelectric Point (pI)

Bisection method to find pH where net charge = 0
Precision: ±0.01 pH units
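
The bisection search can be sketched generically as below. To keep the example self-contained, it uses a toy charge function with only the two termini and round pKa values of 9.0 and 2.0 (hypothetical numbers chosen so the expected pI is their midpoint, 5.5). The function names are also assumptions for this sketch.

```python
from typing import Callable


def isoelectric_point(
    charge_at: Callable[[float], float],
    low: float = 0.0,
    high: float = 14.0,
    tolerance: float = 0.01,
) -> float:
    """Find the pH where net charge crosses zero by bisection.

    Assumes charge_at is decreasing in pH, positive at `low` and
    negative at `high`. Stops once the bracket is narrower than
    `tolerance` pH units and returns its midpoint.
    """
    while high - low > tolerance:
        mid = (low + high) / 2.0
        if charge_at(mid) > 0:
            low = mid  # still positively charged: pI lies at higher pH
        else:
            high = mid  # neutral or negative: pI lies at lower pH
    return (low + high) / 2.0


def termini_only_charge(ph: float) -> float:
    """Toy charge model: one amine (pKa 9.0) and one carboxyl (pKa 2.0)."""
    return 1.0 / (1.0 + 10 ** (ph - 9.0)) - 1.0 / (1.0 + 10 ** (2.0 - ph))


if __name__ == "__main__":
    print(f"Estimated pI: {isoelectric_point(termini_only_charge):.2f}")
```

In practice `charge_at` would be a full net charge function for the peptide of interest, evaluated at each candidate pH.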

Hydrophobicity (GRAVY)

Grand Average of Hydropathy
Uses Kyte-Doolittle scale
Range: -4.5 (hydrophilic) to +4.5 (hydrophobic)
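
GRAVY is the mean Kyte-Doolittle hydropathy value over all residues. A minimal sketch follows; the function name `gravy` and the decision to raise an error on non-standard residues are assumptions for this example.

```python
# Kyte-Doolittle hydropathy values for the 20 standard amino acids.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def gravy(sequence: str) -> float:
    """Return the grand average of hydropathy for a peptide sequence."""
    residues = sequence.upper()
    if not residues:
        raise ValueError("sequence must not be empty")
    unknown = sorted(set(residues) - set(KYTE_DOOLITTLE))
    if unknown:
        raise ValueError(f"non-standard residues: {unknown}")
    return sum(KYTE_DOOLITTLE[residue] for residue in residues) / len(residues)


if __name__ == "__main__":
    for peptide in ("IIII", "RRRR", "GLFDIVKKVV"):
        print(f"{peptide}: GRAVY = {gravy(peptide):+.2f}")
```

The extremes of the range correspond to all-isoleucine (+4.5) and all-arginine (−4.5) sequences.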

Citation

If you use this tool, please cite:

@article {Zhang2025.12.31.697180,
    author = {Zhang, Yinuo and Tang, Sophia and Chen, Tong and Mahood, Elizabeth and Vincoff, Sophia and Chatterjee, Pranam},
    title = {PeptiVerse: A Unified Platform for Therapeutic Peptide Property Prediction},
    elocation-id = {2025.12.31.697180},
    year = {2026},
    doi = {10.64898/2025.12.31.697180},
    publisher = {Cold Spring Harbor Laboratory},
    URL = {https://www.biorxiv.org/content/early/2026/01/03/2025.12.31.697180},
    eprint = {https://www.biorxiv.org/content/early/2026/01/03/2025.12.31.697180.full.pdf},
    journal = {bioRxiv}
}

Contact

For questions or collaborations: yzhang@u.duke.nus.edu or pranam@seas.upenn.edu