
Input Requirements and Constraints

Supported Inputs

  • Amino acid sequences: Linear peptides composed of standard 20 amino acids
  • SMILES: Chemically modified peptides, including cyclization, D-amino acids, and noncanonical residues

Validation

  • Invalid sequences or SMILES will be rejected
  • Properties not supported are labeled as (Not Supported)
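As a rough illustration of the amino acid check described above, the sketch below accepts only the 20 standard one-letter codes. The function name `validate_sequence` and the strict uppercase-only policy are assumptions made for this example; the app's actual validation logic is not described here, and SMILES validation (which would need a cheminformatics toolkit) is left out.

```python
# Hypothetical validator: names and policy are illustrative, not the app's code.
STANDARD_AA = set("ACDEFGHIKLMNPQRSTVWY")  # the 20 standard amino acids


def validate_sequence(sequence):
    """Return (True, cleaned_sequence) or (False, reason)."""
    cleaned = sequence.strip()
    if not cleaned:
        return False, "empty sequence"
    # Assumption: lowercase letters and any other symbols are rejected.
    invalid = sorted(set(cleaned) - STANDARD_AA)
    if invalid:
        return False, "unsupported characters: " + ", ".join(invalid)
    return True, cleaned


if __name__ == "__main__":
    print(validate_sequence("GIGKFLHSAK"))
    print(validate_sequence("GIGXFLHSAK"))
```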

Training Data Collection

Data distribution. Classification tasks report counts for class 0/1; regression tasks report total sample size (N). AA stands for amino acid-based sequences.

Classification (counts for class 0 / 1)

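| Property                  | AA (0) | AA (1) | SMILES (0) | SMILES (1) |
|---------------------------|--------|--------|------------|------------|
| Hemolysis                 | 4765   | 1311   | 4765       | 1311       |
| Non-Fouling               | 13580  | 3600   | 13580      | 3600       |
| Solubility                | 9668   | 8785   | 9668       | 8785       |
| Permeability (Penetrance) | 1162   | 1162   | 1162       | 1162       |
| Toxicity                  | –      | –      | 5518       | 5518       |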

Regression (total N)

| Property             | AA (N) | SMILES (N) |
|----------------------|--------|------------|
| Permeability (PAMPA) | –      | 6869       |
| Permeability (CACO2) | –      | 606        |
| Half-Life            | 130    | 245        |
| Binding Affinity     | 1433   | 1702       |

Our models are trained on curated datasets from multiple sources. For detailed data-cleaning procedures, please refer to our paper.

🩸 Hemolysis Dataset

  • Primary Source: the Database of Antimicrobial Activity and Structure of Peptides (DBAASPv3)
  • Secondary Source: peptide-dashboard
  • Description: Probability of peptide disrupting red blood cell membranes.
  • Interpretation: HC50 is the concentration (x µg/mL) at which 50% of red blood cells are lysed. Peptides with HC50 < 100 µM are considered hemolytic and all others non-hemolytic, resulting in a binary 0/1 dataset. Scores close to 1 indicate a high probability of red blood cell membrane disruption, while scores close to 0 indicate low hemolytic risk. The predicted probability should therefore be interpreted as a risk indicator, not an exact concentration estimate.

💧 Solubility Dataset

  • Primary Source: PROSO-II
  • Secondary Source: peptideBERT
  • Description: Probability of peptide remaining dissolved in aqueous conditions.
  • Interpretation: Outputs a probability (0–1) that a peptide remains soluble in aqueous conditions. Higher scores indicate lower aggregation risk and better formulation stability.

👯 Non-Fouling Dataset

🪣 Permeability Dataset

  • Primary Source: CycPeptMPDB, PAMPA
  • Secondary Source: PepLand
  • Description: Probability of peptide penetrating the cell membrane.
  • Interpretation: For PAMPA and CACO-2 regression, outputs are log-scaled permeability values. Following CycPeptMPDB conventions, log Pexp ≥ −6.0 indicates favorable permeability, while values below −6.0 indicate weak permeability. For penetrance prediction, a probability closer to 1 indicates a higher likelihood of cell penetration, and vice versa.
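The −6.0 cutoff for the regression outputs can be applied as a simple post-processing step. The sketch below is only an illustration of that rule; the constant and function names are hypothetical and not part of the app.

```python
# Hypothetical helper applying the CycPeptMPDB-style cutoff quoted above.
LOG_PEXP_THRESHOLD = -6.0


def permeability_class(log_pexp):
    """Map a predicted log Pexp value to a coarse permeability label."""
    return "favorable" if log_pexp >= LOG_PEXP_THRESHOLD else "weak"


if __name__ == "__main__":
    for value in (-5.2, -6.0, -7.1):
        print(value, permeability_class(value))
```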

⏱️ Half-Life Dataset

  • Primary Source: Thpdb2, PepTherDia, peplife
  • Interpretation: Predicted values are in hours and reflect relative peptide stability. Higher values indicate longer persistence in serum, while lower values suggest faster degradation.

☠️ Toxicity Dataset

  • Primary Source: ToxinPred3.0
  • Interpretation: Outputs a probability (0–1) that a peptide exhibits toxic effects. Higher scores indicate increased toxicity risk.

🔗 Binding Affinity Dataset

  • Primary Source: PepLand
  • Description: The model predicts a continuous binding affinity score, where higher values indicate stronger binding. The training labels are already normalized in PepLand and combine Kd, Ki, and IC50 measurements. Scores are comparable across peptides binding to the same protein target.
  • Interpretation:
    • Scores ≥ 9 correspond to tight binders (K ≤ 10⁻⁹ M, nanomolar to picomolar range)
    • Scores between 7 and 9 correspond to medium binders (10⁻⁷–10⁻⁹ M, nanomolar to micromolar range)
    • Scores < 7 correspond to weak binders (K ≥ 10⁻⁶ M, micromolar and weaker)
    • A difference of 1 unit in score corresponds to an approximately tenfold change in binding affinity.
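The thresholds above are consistent with a score on a negative log10 scale of the binding constant K (in molar), which is also why one score unit corresponds to a tenfold change. The sketch below assumes exactly that relationship; the function names are hypothetical, and PepLand's actual normalization may differ in detail.

```python
import math


def affinity_score(k_molar):
    """Assumed relationship: score = -log10(K), with K in mol/L."""
    return -math.log10(k_molar)


def binding_class(score):
    """Bucket a score using the cutoffs listed above."""
    if score >= 9:
        return "tight"
    if score >= 7:
        return "medium"
    return "weak"


if __name__ == "__main__":
    for k in (1e-10, 1e-8, 1e-5):
        s = affinity_score(k)
        print(f"K = {k:.0e} M, score = {s:.1f}, class = {binding_class(s)}")
```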

Model Architecture

  • Sequence Embeddings: ESM-2 650M model / PeptideCLM model. Foundational embeddings are frozen.
  • XGBoost Model: Gradient boosting on pooled embedding features for efficient, high-performance prediction.
  • CNN/Transformer Model: One-dimensional convolutional/self-attention transformer networks operating on unpooled embeddings to capture local sequence patterns.
  • Binding Model: Transformer-based architecture with cross-attention between protein and peptide representations.
  • SVR Model: Support Vector Regression applied to pooled embeddings, providing a kernel-based, nonparametric regression baseline that is robust on smaller or noisy datasets.
  • Others: SVM and Elastic Nets were trained with RAPIDS cuML, which requires a CUDA environment and is therefore not supported in the web app. Model checkpoints remain available in the Hugging Face repository.
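To make the pooled versus unpooled distinction concrete: unpooled embeddings hold one vector per residue or token, while pooling collapses them into a single fixed-length vector that models such as XGBoost or SVR can consume. The sketch below uses mean pooling on toy numbers in plain Python. Mean pooling is an assumption here, since the pooling strategy is not stated above, and real inputs would come from ESM-2 or PeptideCLM rather than hand-written lists.

```python
def mean_pool(token_embeddings):
    """Average an L x D list of per-token vectors into one D-dimensional vector."""
    if not token_embeddings:
        raise ValueError("need at least one token embedding")
    length = len(token_embeddings)
    dim = len(token_embeddings[0])
    return [sum(vec[i] for vec in token_embeddings) / length for i in range(dim)]


if __name__ == "__main__":
    # Toy "unpooled" embedding: 3 tokens, 2 dimensions each.
    unpooled = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
    print(mean_pool(unpooled))
```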

Model Training and Weight Hosting

  • More instructions can be found at PeptiVerse
  • Model uncertainty prediction is not supported in the app version, but the code is available at PeptiVerse for local deployment.

🧪 Physicochemical Properties

Net Charge Calculation

  • Uses Henderson-Hasselbalch equation
  • pH-dependent calculation
  • Considers all ionizable groups (K, R, H, D, E, C, Y, termini)

Isoelectric Point (pI)

  • Bisection method to find pH where net charge = 0
  • Precision: ±0.01 pH units
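The net charge and pI calculations described above can be sketched together. The pKa values below are a commonly used EMBOSS-style set and are an assumption; the table the app actually uses is not given here, and the function names are illustrative. Each ionizable group contributes its Henderson-Hasselbalch fractional charge, and because net charge decreases monotonically with pH, bisection over pH 0 to 14 finds the zero crossing.

```python
# Assumed pKa values (EMBOSS-style); the app's own table may differ.
PKA_POSITIVE = {"K": 10.8, "R": 12.5, "H": 6.5}
PKA_NEGATIVE = {"D": 3.9, "E": 4.1, "C": 8.5, "Y": 10.1}
PKA_N_TERM = 8.6
PKA_C_TERM = 3.6


def net_charge(sequence, ph=7.0):
    """Henderson-Hasselbalch net charge of a linear peptide at a given pH."""
    # Basic groups carry +1 when protonated; acidic groups carry -1 when deprotonated.
    charge = 1.0 / (1.0 + 10 ** (ph - PKA_N_TERM))
    charge -= 1.0 / (1.0 + 10 ** (PKA_C_TERM - ph))
    for aa in sequence:
        if aa in PKA_POSITIVE:
            charge += 1.0 / (1.0 + 10 ** (ph - PKA_POSITIVE[aa]))
        elif aa in PKA_NEGATIVE:
            charge -= 1.0 / (1.0 + 10 ** (PKA_NEGATIVE[aa] - ph))
    return charge


def isoelectric_point(sequence, tol=0.01):
    """Bisection on pH in [0, 14] until the bracket is narrower than tol."""
    lo, hi = 0.0, 14.0
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if net_charge(sequence, mid) > 0:
            lo = mid  # still positive: the pI lies at higher pH
        else:
            hi = mid
    return (lo + hi) / 2.0


if __name__ == "__main__":
    seq = "GIGKFLHSAK"
    print(round(net_charge(seq, 7.0), 2), round(isoelectric_point(seq), 2))
```

Returning the midpoint of the final bracket keeps the estimate within half the tolerance of the true zero crossing under these assumed pKa values.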

Hydrophobicity (GRAVY)

  • Grand Average of Hydropathy
  • Uses Kyte-Doolittle scale
  • Range: -4.5 (hydrophilic) to +4.5 (hydrophobic)
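GRAVY is the arithmetic mean of the Kyte-Doolittle hydropathy values over all residues, which is why it is bounded by the scale's extremes (arginine at −4.5, isoleucine at +4.5). A minimal sketch follows; the function name is illustrative, and it assumes the input contains only the 20 standard residues.

```python
# Kyte-Doolittle hydropathy scale.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def gravy(sequence):
    """Grand average of hydropathy: mean Kyte-Doolittle value per residue."""
    if not sequence:
        raise ValueError("sequence must not be empty")
    # A KeyError here signals a non-standard residue in the input.
    return sum(KYTE_DOOLITTLE[aa] for aa in sequence) / len(sequence)


if __name__ == "__main__":
    print(round(gravy("GIGKFLHSAK"), 3))
```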

Citation

If you use this tool, please cite:

@article {Zhang2025.12.31.697180,
    author = {Zhang, Yinuo and Tang, Sophia and Chen, Tong and Mahood, Elizabeth and Vincoff, Sophia and Chatterjee, Pranam},
    title = {PeptiVerse: A Unified Platform for Therapeutic Peptide Property Prediction},
    elocation-id = {2025.12.31.697180},
    year = {2026},
    doi = {10.64898/2025.12.31.697180},
    publisher = {Cold Spring Harbor Laboratory},
    URL = {https://www.biorxiv.org/content/early/2026/01/03/2025.12.31.697180},
    eprint = {https://www.biorxiv.org/content/early/2026/01/03/2025.12.31.697180.full.pdf},
    journal = {bioRxiv}
}

Contact

For questions or collaborations: yzhang@u.duke.nus.edu or pranam@seas.upenn.edu