Protein-Protein Interaction Site Prediction

This model is a fine-tuned version of ESM2-3B [1] for protein-protein interaction site prediction. For each amino acid in a protein sequence, it predicts whether the residue is part of an interaction site (1) or not (0).

For more details on the training and testing of this model, refer to the article [...].

The GitHub repository for using this model is available here: https://github.com/RitAreaSciencePark/PPI-Reps

The training and evaluation data are available in CSV format in this Zenodo repository: https://doi.org/10.5281/zenodo.18802482

How to Get Started with the Model

import torch
from transformers import AutoModel, AutoTokenizer, AutoConfig

model_name = "evillegasgarcia/esm2-ppi-biolip-1"

# Load config 
config = AutoConfig.from_pretrained(model_name, trust_remote_code=True)
# Load model using the custom remote code
model = AutoModel.from_pretrained(model_name, trust_remote_code=True)
# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)


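# Move the model to the available device and switch to inference mode
device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device)
model.eval()

# Run the model on a sample sequence
sequence = "MKTVRQERLKSIVRILEAAKEPVSGAQLAEELSVSRQVIVQDIAYLRSLGYNIVATPRGYVLAGG"

inputs = tokenizer.encode(sequence, return_tensors="pt").to(device)
with torch.no_grad():  # gradients are not needed for inference
    logits = model(inputs)["logits"]
# Convert logits to interaction-site probabilities
probabilities = torch.sigmoid(logits)

print(probabilities)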
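The quick-start above stops at probabilities, while the model is described as predicting a 1/0 label per residue. A minimal sketch of that last step follows. It makes several assumptions: the 0.5 threshold is not taken from the source, `label_residues` is a hypothetical helper rather than part of the repository, and the probabilities are expected as a flat list with exactly one value per residue. ESM tokenizers usually add special tokens at the start and end of the sequence, so those positions may need to be dropped first, depending on how the custom model handles them.

```python
def label_residues(sequence, probs, threshold=0.5):
    """Pair each residue with its probability and a 0/1 interaction label.

    `threshold` is an illustrative default, not a value taken from the model card.
    """
    if len(probs) != len(sequence):
        raise ValueError(
            f"expected {len(sequence)} probabilities, got {len(probs)}"
        )
    return [
        (position, residue, prob, int(prob >= threshold))
        for position, (residue, prob) in enumerate(zip(sequence, probs), start=1)
    ]


# Toy values for illustration only, not real model output.
toy_sequence = "MKTVR"
toy_probs = [0.12, 0.81, 0.47, 0.50, 0.93]

predictions = label_residues(toy_sequence, toy_probs)
# 1-based positions of residues labelled as interaction sites
interface_positions = [pos for pos, _, _, label in predictions if label == 1]
print(interface_positions)
```

With real output, the list could be obtained with something like `probabilities.squeeze().tolist()`, possibly sliced to remove special-token positions. Check the output shape of the custom model before relying on this.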

Training Details

The model was trained on a curated subset of the BioLiP dataset taken from [2]. We used the Adam optimizer with a learning rate of 1e-5 and a weight decay of 0.05, leaving the other hyperparameters at their defaults. The gradient accumulation batch size was 2.
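The optimizer settings above could be expressed in PyTorch roughly as follows. This is a sketch under stated assumptions, not the authors' training script: `toy_model` is a hypothetical stand-in for the fine-tuned network, the data are random, the binary cross-entropy loss is assumed, and the gradient accumulation setting is read here as accumulating gradients over 2 batches before each optimizer step.

```python
import torch
from torch import nn

torch.manual_seed(0)

# Hypothetical stand-in for the fine-tuned network: one logit per residue embedding.
toy_model = nn.Linear(8, 1)

# Adam with default betas/eps, plus the stated learning rate and weight decay.
optimizer = torch.optim.Adam(toy_model.parameters(), lr=1e-5, weight_decay=0.05)
loss_fn = nn.BCEWithLogitsLoss()  # assumption: binary cross-entropy on logits

ACCUMULATION_STEPS = 2  # assumption: gradients are accumulated over 2 batches

# Four toy batches of 5 "residues" each, with random binary labels.
batches = [
    (torch.randn(5, 8), torch.randint(0, 2, (5, 1)).float())
    for _ in range(4)
]

optimizer_steps = 0
optimizer.zero_grad()
for step, (features, labels) in enumerate(batches, start=1):
    loss = loss_fn(toy_model(features), labels)
    # Scale the loss so the accumulated gradient is a mean over the batches.
    (loss / ACCUMULATION_STEPS).backward()
    if step % ACCUMULATION_STEPS == 0:
        optimizer.step()
        optimizer.zero_grad()
        optimizer_steps += 1
```

Accumulating gradients trades extra forward and backward passes for a larger effective batch, which is a common choice when a 3B-parameter model leaves little memory for larger batches.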

Evaluation

The model was evaluated on the ZK448 benchmark, originally curated by [3] and available from the Zenodo repository. It achieves an accuracy of 0.74 and a Matthews correlation coefficient (MCC) of 0.35.
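For readers who want to compute the same two metrics on their own predictions, the sketch below derives accuracy and MCC from the confusion matrix of binary per-residue labels. The labels are toy values, not ZK448 data, and the helper names are illustrative. Interaction-site residues are usually a minority class, so accuracy alone can look good for a predictor that rarely predicts a site. MCC uses all four confusion-matrix cells and is less affected by that imbalance.

```python
import math


def confusion_counts(y_true, y_pred):
    """Return (tp, tn, fp, fn) for binary labels encoded as 0/1."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, tn, fp, fn


def accuracy(tp, tn, fp, fn):
    return (tp + tn) / (tp + tn + fp + fn)


def matthews_corrcoef(tp, tn, fp, fn):
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # By convention, MCC is reported as 0 when any marginal count is zero.
    return 0.0 if denominator == 0 else (tp * tn - fp * fn) / denominator


# Toy labels for illustration only.
y_true = [1, 0, 0, 1, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 0, 1, 0]

counts = confusion_counts(y_true, y_pred)
toy_accuracy = accuracy(*counts)
toy_mcc = matthews_corrcoef(*counts)
print(f"accuracy={toy_accuracy:.2f}, MCC={toy_mcc:.2f}")
```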

References

Lin, Z., Akin, H., Rao, R., Hie, B., Zhu, Z., Lu, W., ... & Rives, A. (2023). Evolutionary-scale prediction of atomic-level protein structure with a language model. Science, 379(6637), 1123-1130.
Zhang, J., & Kurgan, L. (2018). Review and comparative assessment of sequence-based predictors of protein-binding residues. Briefings in Bioinformatics, 19(5), 821-837.

Model size: 3B params (Safetensors)
Tensor type: F32