---
license: apache-2.0
---

Overview of PeptiVerse

PeptiVerse: A Unified Platform for Therapeutic Peptide Property Prediction 🧬🌌

This is the repository for PeptiVerse: A Unified Platform for Therapeutic Peptide Property Prediction, a collection of machine learning predictors for canonical and non-canonical peptide property prediction using sequence and SMILES representations. 🧬 PeptiVerse 🌌 enables evaluation of key biophysical and therapeutic properties of peptides for property-optimized generation.

Quick Start 🌟

  • Lightweight start (basic models, no cuML; see details below)
# Skip all LFS files; you will see an empty folder at first
git clone --no-checkout https://huggingface.co/ChatterjeeLab/PeptiVerse
cd PeptiVerse

# Enable sparse checkout
git sparse-checkout init --cone

# Select only the items to download
git sparse-checkout set \
  inference.py \
  download_light.py \
  best_models.txt \
  basic_models.txt \
  requirements.txt \
  tokenizer \
  README.md

# Now checkout
GIT_LFS_SKIP_SMUDGE=1 git checkout

# Install basic packages
pip install -r requirements.txt

# Download the basic model weights listed in basic_models.txt. Adjust the configuration as needed.
python download_light.py

# Test inference
python inference.py
  • Full clone (downloads all model weights)
# Clone repository
git clone https://huggingface.co/ChatterjeeLab/PeptiVerse

# Install dependencies
pip install -r requirements.txt

# Run inference
python inference.py

Installation 🌟

Minimal Setup

  • Easy start-up environment (Transformers and XGBoost models)
pip install -r requirements.txt

Full Setup

  • Access to the trained SVM and ElasticNet models additionally requires RAPIDS cuML; installation instructions are available on their official GitHub page (a CUDA-capable GPU is required).
  • Optional: a pre-compiled Singularity/Apptainer environment (7.52 GB) is available on Google Drive with everything you need (a CUDA-capable GPU is still required to load cuML models).
    # test
    apptainer exec peptiverse.sif python -c "import sys; print(sys.executable)"
    
    # run inference (see below)
    apptainer exec peptiverse.sif python inference.py
    

Repository Structure 🌟

This repo contains important large files for PeptiVerse, an interactive app for peptide property prediction. Paper link.

PeptiVerse/
├── training_data_cleaned/     # Processed datasets with embeddings
│   └── <property>/            # Property-specific data
│       ├── train/val splits
│       └── precomputed embeddings
├── training_classifiers/      # Trained model weights
│   └── <property>/
│       ├── cnn_wt/            # CNN architectures
│       ├── mlp_wt/            # MLP architectures
│       └── xgb_wt/            # XGBoost models
├── tokenizer/                 # PeptideCLM tokenizer
├── training_data/             # Raw training data
├── inference.py               # Main prediction interface
├── best_models.txt            # Model selection manifest
└── requirements.txt           # Python dependencies

For full data access, please download the corresponding training_data_cleaned and training_classifiers from Zenodo. The current Hugging Face repo hosts only the best model weights and metadata with split labels.

Training Data Collection 🌟

Data distribution. Classification tasks report counts for class 0/1; regression tasks report total sample size (N).
**Classification (class 0 / class 1 counts)**

| Property | Amino Acid (0) | Amino Acid (1) | SMILES (0) | SMILES (1) |
|---|---|---|---|---|
| Hemolysis | 4765 | 1311 | 4765 | 1311 |
| Non-Fouling | 13580 | 3600 | 13580 | 3600 |
| Solubility | 9668 | 8785 | 9668 | 8785 |
| Permeability (Penetrance) | 1162 | 1162 | 1162 | 1162 |
| Toxicity | – | – | 5518 | 5518 |

**Regression (total N)**

| Property | Amino Acid (N) | SMILES (N) |
|---|---|---|
| Permeability (PAMPA) | – | 6869 |
| Permeability (Caco-2) | – | 606 |
| Half-Life | 130 | 245 |
| Binding Affinity | 1436 | 1597 |

Best Model List 🌟

Full model set (cuML-enabled)

| Property | Best Model (Sequence) | Best Model (SMILES) | Task Type | Threshold (Sequence) | Threshold (SMILES) |
|---|---|---|---|---|---|
| Hemolysis | SVM | CNN (chemberta) | Classifier | 0.2521 | 0.564 |
| Non-Fouling | Transformer | ENET (peptideclm) | Classifier | 0.57 | 0.6969 |
| Solubility | CNN | – | Classifier | 0.377 | – |
| Permeability (Penetrance) | SVM | SVM (chemberta) | Classifier | 0.5493 | 0.573 |
| Toxicity | – | CNN (chemberta) | Classifier | – | 0.49 |
| Binding Affinity | unpooled | unpooled | Regression | – | – |
| Permeability (PAMPA) | – | CNN (chemberta) | Regression | – | – |
| Permeability (Caco-2) | – | SVR (chemberta) | Regression | – | – |
| Half-life | Transformer | XGB (peptideclm) | Regression | – | – |

Note: unpooled indicates models operating on token-level embeddings with cross-attention, rather than mean-pooled representations.
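For classifiers, the per-model thresholds above are applied to the predicted score to produce the 0/1 label. A minimal sketch of that step (whether the comparison is `>=` or `>` is our assumption, not confirmed by the repo):

```python
def apply_threshold(score: float, threshold: float) -> int:
    """Binarize a classifier probability with a per-model threshold.
    Assumes scores >= threshold map to class 1; the repo's exact
    comparison operator may differ."""
    return int(score >= threshold)

# the same raw score can flip labels under different per-model thresholds
print(apply_threshold(0.30, 0.2521))  # 1 with the sequence-model threshold
print(apply_threshold(0.30, 0.564))   # 0 with the SMILES-model threshold
```

This is why each row of the table carries its own threshold rather than a global 0.5 cutoff.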

Minimal deployable model set (no cuML)

| Property | Best Model (WT) | Best Model (SMILES) | Task Type | Threshold (WT) | Threshold (SMILES) |
|---|---|---|---|---|---|
| Hemolysis | XGB | CNN (chemberta) | Classifier | 0.2801 | 0.564 |
| Non-Fouling | Transformer | XGB (peptideclm) | Classifier | 0.57 | 0.3892 |
| Solubility | CNN | – | Classifier | 0.377 | – |
| Permeability (Penetrance) | XGB | XGB (chemberta) | Classifier | 0.4301 | 0.5028 |
| Toxicity | – | CNN (chemberta) | Classifier | – | 0.49 |
| Binding Affinity | wt_wt_pooled | chemberta_smiles_pooled | Regression | – | – |
| Permeability (PAMPA) | – | CNN (chemberta) | Regression | – | – |
| Permeability (Caco-2) | – | SVR (chemberta) | Regression | – | – |
| Half-life | Transformer | XGB (peptideclm) | Regression | – | – |

Note: Models marked SVM or ENET in the full set are replaced with XGB here, since those models are not supported in the deployment environment without a cuML setup.

Usage 🌟

Local Application Hosting

# Configure models in best_models.txt

git clone https://huggingface.co/spaces/ChatterjeeLab/PeptiVerse
python app.py

Data pre-processing

Under training_data_cleaned, we provide the generated embeddings in Hugging Face `datasets` format. The following scripts were used to generate the data.

Dataset integration

  • All properties are provided as raw_data / split-ready CSVs / Hugging Face datasets.
  • Selectively download the data you need with huggingface-cli:
# --include: only this folder; --exclude: skip weights/artifacts;
# --local-dir-use-symlinks False: make real copies
huggingface-cli download ChatterjeeLab/PeptiVerse \
  --include "training_data_cleaned/**" \
  --exclude "**/*.pt" "**/*.joblib" \
  --local-dir PeptiVerse_partial \
  --local-dir-use-symlinks False
  • Or in Python:
from huggingface_hub import snapshot_download

local_dir = snapshot_download(
    repo_id="ChatterjeeLab/PeptiVerse",
    allow_patterns=["training_data_cleaned/**"],     # only this folder
    ignore_patterns=["**/*.pt", "**/*.joblib"],     # skip weights/artifacts
    local_dir="PeptiVerse_partial",
    local_dir_use_symlinks=False,                   # make real copies
)
print("Downloaded to:", local_dir)
  • Usage of the Hugging Face datasets (with pre-computed embeddings and splits)
    • All embedding datasets are saved via DatasetDict.save_to_disk and loadable with:
    from datasets import load_from_disk
    ds = load_from_disk(PATH)
    train_ds = ds["train"]
    val_ds = ds["val"]
    
  • A) Sequence-based (ESM-2 embeddings)
    • Pooled (fixed-length vector per sequence)
      • Generated by mean-pooling token embeddings, excluding special tokens (CLS/EOS) and padding.
      • Each item: `sequence: str`, `label: int` (classification) or `float` (regression), `embedding: float32[H]` (H = 1280 for ESM-2 650M)
    • Unpooled (variable-length token matrix)
      • Generated by keeping all valid token embeddings (excluding special tokens and padding) as a per-sequence matrix.
      • Each item: `sequence: str`, `label: int` (classification) or `float` (regression), `embedding: float16[L, H]` (nested lists), `attention_mask: int8[L]`, `length: int` (= L)
  • B) SMILES-based (PeptideCLM embeddings)
    • Pooled (fixed-length vector per sequence)
      • Generated by mean-pooling token embeddings, excluding special tokens (CLS/EOS) and padding.
      • Each item: `sequence: str` (SMILES), `label: int` (classification) or `float` (regression), `embedding: float32[H]`
    • Unpooled (variable-length token matrix)
      • Generated by keeping all valid token embeddings (excluding special tokens and padding) as a per-sequence matrix.
      • Each item: `sequence: str` (SMILES), `label: int` (classification) or `float` (regression), `embedding: float16[L, H]` (nested lists), `attention_mask: int8[L]`, `length: int` (= L)
  • C) SMILES-based (ChemBERTa embeddings)
    • Pooled (fixed-length vector per sequence)
      • Generated by mean-pooling token embeddings, excluding special tokens (CLS/EOS) and padding.
      • Each item: `sequence: str` (SMILES), `label: int` (classification) or `float` (regression), `embedding: float32[H]`
    • Unpooled (variable-length token matrix)
      • Generated by keeping all valid token embeddings (excluding special tokens and padding) as a per-sequence matrix.
      • Each item: `sequence: str` (SMILES), `label: int` (classification) or `float` (regression), `embedding: float16[L, H]` (nested lists), `attention_mask: int8[L]`, `length: int` (= L)
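The pooled variants described above can be sketched as mask-aware mean pooling. This is an illustrative reconstruction (the function name and the one-CLS/one-EOS convention are our assumptions), not the repo's actual preprocessing code:

```python
import numpy as np

def mean_pool(token_emb: np.ndarray, attention_mask: np.ndarray,
              n_special_prefix: int = 1, n_special_suffix: int = 1) -> np.ndarray:
    """Mean-pool a [L, H] token-embedding matrix, excluding padding and
    (by the convention assumed here) one leading CLS/BOS and one
    trailing EOS token."""
    valid = attention_mask.astype(bool).copy()
    idx = np.flatnonzero(valid)
    valid[idx[:n_special_prefix]] = False   # drop CLS/BOS
    valid[idx[-n_special_suffix:]] = False  # drop EOS
    return token_emb[valid].mean(axis=0).astype(np.float32)

# toy example: L=6 tokens (incl. CLS/EOS), H=4, last token is padding
emb = np.arange(24, dtype=np.float32).reshape(6, 4)
mask = np.array([1, 1, 1, 1, 1, 0], dtype=np.int8)
pooled = mean_pool(emb, mask)  # averages only the 3 real residue tokens
print(pooled.shape)  # (4,)
```

The unpooled variants simply skip the averaging step and keep the `[L, H]` matrix plus the mask.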

Training

Under the training_classifiers folder, we provide the Python scripts used to train the different models. Each script:

  1. Reads the pre-processed Hugging Face dataset from the training_data_cleaned folder;
  2. Performs an Optuna hyperparameter sweep when called.

All training was conducted on an HPC cluster using the SLURM scripts under the training_classifiers/src folder; customize or isolate individual model training scripts as needed.
Example of training
ML models
HOME_LOC=/home
SCRIPT_LOC=$HOME_LOC/PeptiVerse/training_classifiers
EMB_LOC=$HOME_LOC/PeptiVerse/training_data_cleaned

OBJECTIVE='hemolysis' # nf/solubility/hemolysis/permeability_pampa/permeability_caco2
WT='smiles' # wt/smiles
DATA_FILE="hemo_${WT}_with_embeddings"
LOG_LOC=$SCRIPT_LOC/src/logs
DATE=$(date +%m_%d)
MODEL_TYPE='svm_gpu' # xgb/enet_gpu/svm_gpu
SPECIAL_PREFIX="${MODEL_TYPE}-${OBJECTIVE}-${WT}_new"

# Create log directory if it doesn't exist
mkdir -p $LOG_LOC

cd $SCRIPT_LOC

python -u train_ml.py \
  --dataset_path "${DATA_LOC}/${OBJECTIVE}/${DATA_FILE}" \
  --out_dir "${SCRIPT_LOC}/${OBJECTIVE}/${MODEL_TYPE}_${WT}" \
  --model "${MODEL_TYPE}" \
  --n_trials 200  > "${LOG_LOC}/${DATE}_${SPECIAL_PREFIX}.log" 2>&1
DNN models
HOME_LOC=/home
SCRIPT_LOC=$HOME_LOC/PeptiVerse/training_classifiers
EMB_LOC=$HOME_LOC/PeptiVerse/training_data_cleaned

OBJECTIVE='nf' # nf/solubility/hemolysis
WT='smiles' #wt/smiles
DATA_FILE="nf_${WT}_with_embeddings_unpooled"
LOG_LOC=$SCRIPT_LOC/src/logs
DATE=$(date +%m_%d)
MODEL_TYPE='cnn' #mlp/cnn/transformer
SPECIAL_PREFIX="${MODEL_TYPE}-${OBJECTIVE}-${WT}"

# Create log directory if it doesn't exist
mkdir -p $LOG_LOC

cd $SCRIPT_LOC

python -u train_nn.py \
  --dataset_path "${DATA_LOC}/${OBJECTIVE}/${DATA_FILE}" \
  --out_dir "${SCRIPT_LOC}/${OBJECTIVE}/${MODEL_TYPE}_${WT}" \
  --model "${MODEL_TYPE}" \
  --n_trials 200  > "${LOG_LOC}/${DATE}_${SPECIAL_PREFIX}.log" 2>&1
Binding Affinity
HOME_LOC=/home
SCRIPT_LOC=$HOME_LOC/PeptiVerse/training_classifiers
EMB_LOC=$HOME_LOC/PeptiVerse/training_data_cleaned

OBJECTIVE='binding_affinity'
BINDER_MODEL='chemberta'   # peptideclm / chemberta
STATUS='unpooled'             # pooled / unpooled
TYPE='smiles'
DATA_FILE="pair_wt_${TYPE}_${STATUS}"   # double quotes so the variables expand

LOG_LOC=$SCRIPT_LOC/src/logs
DATE=$(date +%m_%d)
SPECIAL_PREFIX="${OBJECTIVE}-${BINDER_MODEL}-${STATUS}"

python -u binding_training.py \
  --dataset_path "${EMB_LOC}/${OBJECTIVE}/${BINDER_MODEL}/${DATA_FILE}" \
  --mode "${STATUS}" \
  --out_dir "${SCRIPT_LOC}/${OBJECTIVE}/${BINDER_MODEL}_${TYPE}_${STATUS}" \
  --n_trials 200 > "${LOG_LOC}/${DATE}_${SPECIAL_PREFIX}.log" 2>&1

Quick inference by property per model

from inference import PeptiVersePredictor
from pathlib import Path

root = Path(__file__).resolve().parent  # current script folder

predictor = PeptiVersePredictor(
    manifest_path=root / "best_models.txt",
    classifier_weight_root=root,
    device="cuda",                            # or "cpu"
)

# mode: smiles (SMILES-based models) / wt (Sequence-based models) 
# property keys (with some level of name normalization)
# hemolysis
# nf (Non-Fouling)
# solubility
# permeability_penetrance
# toxicity
# permeability_pampa
# permeability_caco2
# halflife
# binding_affinity

seq = "GIVEQCCTSICSLYQLENYCN"
smiles = "CC(C)C[C@@H]1NC(=O)[C@@H](CC(C)C)N(C)C(=O)[C@@H](C)N(C)C(=O)[C@H](Cc2ccccc2)NC(=O)[C@H](CC(C)C)N(C)C(=O)[C@H]2CCCN2C1=O"

# Hemolysis
out = predictor.predict_property("hemolysis", mode="wt", input_str=seq)
print(out)
# {"property":"hemolysis","mode":"wt","score":prob,"label":0/1,"threshold":...}

out = predictor.predict_property("hemolysis", mode="smiles", input_str=smiles)
print(out)

# Non-fouling (key is nf)
out = predictor.predict_property("nf", mode="wt", input_str=seq)
print(out)

out = predictor.predict_property("nf", mode="smiles", input_str=smiles)
print(out)

# Solubility (Sequence-only)
out = predictor.predict_property("solubility", mode="wt", input_str=seq)
print(out)

# Permeability (Penetrance) (Sequence-only)
out = predictor.predict_property("permeability_penetrance", mode="wt", input_str=seq)
print(out)

# Toxicity (SMILES-only)
out = predictor.predict_property("toxicity", mode="smiles", input_str=smiles)
print(out)

# Permeability (PAMPA) (SMILES regression)
out = predictor.predict_property("permeability_pampa", mode="smiles", input_str=smiles)
print(out)
# {"property":"permeability_pampa","mode":"smiles","score":value}

# Permeability (Caco-2) (SMILES regression)
out = predictor.predict_property("permeability_caco2", mode="smiles", input_str=smiles)
print(out)

# Half-life (sequence-based + SMILES regression)
out = predictor.predict_property("halflife", mode="wt", input_str=seq)
print(out)

out = predictor.predict_property("halflife", mode="smiles", input_str=smiles)
print(out)

# Binding Affinity
protein = "MTEYKLVVVGAGGVGKSALTIQLIQNHFVDEYDPTIEDSYRKQV..."  # target protein
peptide_seq = "GIVEQCCTSICSLYQLENYCN"

out = predictor.predict_binding_affinity(
    mode="wt",
    target_seq=protein,
    binder_str=peptide_seq,
)
print(out)
# {
#   "property":"binding_affinity",
#   "mode":"wt",
#   "affinity": float,
#   "class_by_threshold": "High (β‰₯9)" / "Moderate (7-9)" / "Low (<7)",
#   "class_by_logits": same buckets,
#   "binding_model": "pooled" or "unpooled",
# }

Advanced inference with uncertainty prediction

Uncertainty prediction is exposed as a parameter in the inference code; the full classifier folder from Zenodo is required to enable it. Model uncertainty is produced by the scripts under the training_classifiers folder whose names start with "refit"; a detailed description can be found in the methodology section of the manuscript. At inference time, PeptiVersePredictor returns an uncertainty field with every prediction when uncertainty=True is passed. The method and its interpretation depend on the model class, which is determined automatically at inference time.

seq = "GIGAVLKVLTTGLPALISWIKRKRQQ"
smiles = "C(C)C[C@@H]1NC(=O)[C@@H]2CCCN2C(=O)[C@@H](CC(C)C)NC(=O)[C@@H](CC(C)C)N(C)C(=O)[C@H](C)NC(=O)[C@H](Cc2ccccc2)NC1=O"

print(predictor.predict_property("nf",    "wt",     seq, uncertainty=True))
print(predictor.predict_property("nf",    "smiles",     smiles, uncertainty=True))

{'property': 'nf', 'col': 'wt', 'score': 0.00014520535252195523, 'emb_tag': 'wt', 'label': 0, 'threshold': 0.57, 'uncertainty': 0.0017192508727321288, 'uncertainty_type': 'ensemble_predictive_entropy'}
{'property': 'nf', 'col': 'smiles', 'score': 0.025485480204224586, 'emb_tag': 'peptideclm', 'label': 0, 'threshold': 0.6969, 'uncertainty': 0.11868063130587676, 'uncertainty_type': 'binary_predictive_entropy_single_model'}

Method by Model Class
| Model Class | Task | Uncertainty Method | Output Type | Range |
|---|---|---|---|---|
| MLP, CNN, Transformer | Classifier | Deep-ensemble predictive entropy (5 seeds) | float | [0, ln 2 ≈ 0.693] |
| MLP, CNN, Transformer | Regression | Adaptive conformal interval; falls back to ensemble std if no MAPIE bundle | (lo, hi) or float | unbounded |
| SVM / SVC / XGBoost | Classifier | Binary predictive entropy (sigmoid of decision function) | float | [0, ln 2 ≈ 0.693] |
| SVR / ElasticNet / XGBoost | Regression | Adaptive conformal interval | (lo, hi) | unbounded |

Uncertainty is None when: a DNN classifier has no seed ensemble trained, or a regression model has no mapie_calibration.joblib in its model directory.


Interpretation 🌟

You can also find the same description in the paper or in the PeptiVerse app Documentation tab.


🩸 Hemolysis Prediction

HC50 is the concentration at which 50% of red blood cells are lysed. Peptides with HC50 < 100 µM are labeled hemolytic and the rest non-hemolytic, yielding a binary 0/1 dataset. The predicted probability should therefore be interpreted as a risk indicator, not an exact concentration estimate.

Output interpretation:

  • Score close to 1.0 = high probability of red blood cell membrane disruption
  • Score close to 0.0 = non-hemolytic

💧 Solubility Prediction

Outputs a probability (0–1) that a peptide remains soluble in aqueous conditions.

Output interpretation:

  • Score close to 1.0 = highly soluble
  • Score close to 0.0 = poorly soluble

👯 Non-Fouling Prediction

Higher scores indicate stronger non-fouling behavior, desirable for circulation and surface-exposed applications.

Output interpretation:

  • Score close to 1.0 = non-fouling
  • Score close to 0.0 = fouling

🪣 Permeability Prediction

Predicts membrane permeability on a logarithmic scale.

Output interpretation:

  • Higher values = more permeable (values above -6.0 generally indicate permeability)
  • Penetrance is a classification task: scores lie in [0, 1], with values closer to 1 indicating a more permeable peptide.

⏱️ Half-Life Prediction

Interpretation: Predicted values reflect relative peptide stability, in hours. Higher scores indicate longer persistence in serum, while lower scores suggest faster degradation.


☠️ Toxicity Prediction

Interpretation: Outputs a probability (0–1) that a peptide exhibits toxic effects. Higher scores indicate increased toxicity risk.


🔗 Binding Affinity Prediction

Predicts peptide-protein binding affinity. Requires both peptide and target protein sequence.

Interpretation:

  • Scores ≥ 9 correspond to tight binders (K ≤ 10⁻⁹ M, nanomolar to picomolar range)
  • Scores between 7 and 9 correspond to medium binders (K between 10⁻⁹ and 10⁻⁷ M, nanomolar to sub-micromolar range)
  • Scores < 7 correspond to weak binders (K ≥ 10⁻⁶ M, micromolar and weaker)
  • A difference of 1 unit in score corresponds to an approximately tenfold change in binding affinity.
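Since one score unit corresponds to a tenfold change in affinity, the score behaves like -log10(K). A small sketch of the conversion and the bucketing above (the inclusivity of the bucket boundaries is our assumption):

```python
import math

def kd_from_score(score: float) -> float:
    """Treat the score as -log10(K): score 9 -> 1e-9 M (1 nM)."""
    return 10.0 ** (-score)

def bucket(score: float) -> str:
    """Map a score to the affinity buckets used in the interpretation."""
    if score >= 9:
        return "High (>=9)"
    if score >= 7:
        return "Moderate (7-9)"
    return "Low (<7)"

print(kd_from_score(9.0))  # 1e-09 M, i.e. 1 nM
print(bucket(8.0))         # Moderate (7-9)
# one score unit = approximately tenfold change in K:
print(round(kd_from_score(7.0) / kd_from_score(8.0), 6))  # 10.0
```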

Uncertainty Interpretation

Entropy (classifiers)

Binary predictive entropy of the output probability p̄:

$$\mathcal{H} = -\bar{p}\log\bar{p} - (1 - \bar{p})\log(1 - \bar{p})$$

  • For DNN classifiers: p̄ is the mean probability across 5 independently seeded models (deep ensemble). High entropy reflects both epistemic uncertainty (seed disagreement) and aleatoric uncertainty (collectively diffuse predictions).

  • For XGBoost / SVM / ElasticNet classifiers: p̄ is the single model's output probability (or the sigmoid of the decision function for ElasticNet). Entropy reflects the output confidence of a single model only.
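The entropy formula above is straightforward to reproduce; a sketch in natural-log units (the seed probabilities are made-up illustrative values, not real model output):

```python
import math

def binary_entropy(p: float) -> float:
    """H = -p ln p - (1 - p) ln(1 - p); maximum ln 2 ~ 0.693 at p = 0.5."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log(p) - (1.0 - p) * math.log(1.0 - p)

# deep-ensemble style: average the per-seed probabilities, then take entropy
seed_probs = [0.990, 0.985, 0.992, 0.988, 0.995]  # illustrative values
p_bar = sum(seed_probs) / len(seed_probs)         # 0.99
print(round(binary_entropy(p_bar), 4))  # 0.056 -> high confidence (< 0.1)
print(round(binary_entropy(0.5), 4))    # 0.6931 -> maximum uncertainty
```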

| Entropy range | Interpretation |
|---|---|
| < 0.1 | High confidence |
| 0.1 – 0.4 | Moderate uncertainty |
| 0.4 – 0.6 | Low confidence |
| > 0.6 | Very low confidence; model close to guessing |
| ≈ 0.693 | Maximum uncertainty; predicted probability ≈ 0.5 |

Adaptive Conformal Prediction Interval (regressors)

Returned as a tuple (lo, hi) with 90% marginal coverage guarantee.

We implement the residual-normalized conformity score following Lei et al. (2018) and Cordier et al. (2023) / MAPIE. An auxiliary XGBoost model $\hat{\sigma}(\mathbf{x})$ is trained on held-out embeddings and absolute residuals |yᵢ − ŷᵢ|. At inference:

$$[\hat{y}(\mathbf{x}) - q \cdot \hat{\sigma}(\mathbf{x}),\ \hat{y}(\mathbf{x}) + q \cdot \hat{\sigma}(\mathbf{x})]$$

where q is the ⌈(n+1)(1−α)⌉/n empirical quantile of the normalized scores sᵢ = |yᵢ − ŷᵢ| / σ̂(xᵢ).

  • Interval width varies per input: molecules more dissimilar to the training data tend to receive wider intervals.

  • Coverage guarantee: on exchangeable data, P(y ∈ [ŷ − qσ̂, ŷ + qσ̂]) ≥ 0.90.

  • The guarantee is marginal, not conditional: an unusually narrow interval on an out-of-distribution molecule does not guarantee correctness.

  • Full access: we have already computed MAPIE bundles for all regression models; users can use them directly with customized model lists.
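The quantile computation and per-input interval can be sketched end-to-end on synthetic data (σ̂ is replaced by a constant stand-in for the auxiliary XGBoost model, and the function names are ours, not the repo's):

```python
import numpy as np

def conformal_quantile(y_true, y_pred, sigma_hat, alpha=0.10):
    """q = the ceil((n+1)(1-alpha))/n empirical quantile of the normalized
    scores s_i = |y_i - yhat_i| / sigma_hat(x_i) on a calibration set."""
    s = np.abs(np.asarray(y_true) - np.asarray(y_pred)) / np.asarray(sigma_hat)
    n = len(s)
    k = int(np.ceil((n + 1) * (1 - alpha)))
    return float(np.sort(s)[min(k, n) - 1])

rng = np.random.default_rng(0)
y_true = rng.normal(size=200)
y_pred = y_true + rng.normal(scale=0.3, size=200)   # imperfect predictions
sigma = np.full(200, 0.3)  # constant stand-in for the XGBoost sigma model
q = conformal_quantile(y_true, y_pred, sigma)

# per-input interval: [yhat - q*sigma(x), yhat + q*sigma(x)]
lo, hi = y_pred - q * sigma, y_pred + q * sigma
coverage = np.mean((y_true >= lo) & (y_true <= hi))
print(coverage >= 0.90)  # True: >= 0.90 by construction on the calibration set
```

With a learned, input-dependent σ̂ the same q yields intervals that widen for harder inputs, which is the "adaptive" part.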


Generating a MAPIE Bundle for a New Model

To enable conformal uncertainty for a newly trained regression model:

# Fit adaptive conformal bundle from val_predictions.csv
python fit_mapie_adaptive.py --root training_classifiers --prop <property_name>

The script reads sequence/smiles and y_pred/y_true columns from the CSV, recomputes embeddings, fits the XGBoost $\hat{\sigma}$ model, and saves mapie_calibration.joblib into the model directory. The bundle is automatically detected and loaded by PeptiVersePredictor on the next initialization.

Model Architecture 🌟

  • Sequence Embeddings: ESM-2 650M model / PeptideCLM model / ChemBERTa. The foundation-model embeddings are frozen.
  • XGBoost Model: Gradient boosting on pooled embedding features for efficient, high-performance prediction.
  • CNN/Transformer Model: One-dimensional convolutional/self-attention transformer networks operating on unpooled embeddings to capture local sequence patterns.
  • Binding Model: Transformer-based architecture with cross-attention between protein and peptide representations.
  • SVR Model: Support Vector Regression applied to pooled embeddings, providing a kernel-based, nonparametric regression baseline that is robust on smaller or noisy datasets.
  • Others: SVM and Elastic Nets were trained with RAPIDS cuML, which requires a CUDA environment and is therefore not supported in the web app. Model checkpoints remain available in the Hugging Face repository.
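The cross-attention idea behind the binding model can be illustrated with a single-head, projection-free sketch (shapes and names are ours; the trained model adds learned projections, multiple heads, residual layers, and a regression head):

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(peptide_tokens, protein_tokens):
    """Peptide queries attend over protein keys/values.
    Shapes: peptide [Lp, H], protein [Lt, H] -> output [Lp, H].
    A sketch of the mechanism only, not the trained architecture."""
    H = peptide_tokens.shape[-1]
    scores = peptide_tokens @ protein_tokens.T / np.sqrt(H)  # [Lp, Lt]
    attn = softmax(scores, axis=-1)       # each peptide token sums to 1
    return attn @ protein_tokens          # peptide repr conditioned on target

pep = np.random.default_rng(1).normal(size=(5, 8))    # 5 peptide tokens
prot = np.random.default_rng(2).normal(size=(20, 8))  # 20 protein tokens
out = cross_attention(pep, prot)
print(out.shape)  # (5, 8)
```

This token-level conditioning is why the "unpooled" binding models in the tables above consume full token matrices rather than mean-pooled vectors.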

Troubleshooting 🌟

LFS Download Issues

If files appear as SHA pointers:

huggingface-cli download ChatterjeeLab/PeptiVerse \
    training_data_cleaned/hemolysis/hemo_smiles_meta_with_split.csv \
    --local-dir . \
    --local-dir-use-symlinks False

Citation 🌟

If you find this repository helpful for your publications, please consider citing our paper:

@article {Zhang2025.12.31.697180,
    author = {Zhang, Yinuo and Tang, Sophia and Chen, Tong and Mahood, Elizabeth and Vincoff, Sophia and Chatterjee, Pranam},
    title = {PeptiVerse: A Unified Platform for Therapeutic Peptide Property Prediction},
    elocation-id = {2025.12.31.697180},
    year = {2026},
    doi = {10.64898/2025.12.31.697180},
    publisher = {Cold Spring Harbor Laboratory},
    URL = {https://www.biorxiv.org/content/early/2026/01/03/2025.12.31.697180},
    eprint = {https://www.biorxiv.org/content/early/2026/01/03/2025.12.31.697180.full.pdf},
    journal = {bioRxiv}
}

To use this repository, you agree to abide by the Apache 2.0 License.