---
library_name: transformers
pipeline_tag: feature-extraction
model_name: InstaDeepAI/IDP-ESM2-150M
---

# IDP-ESM2-150M

**IDP-ESM2-150M** is an ESM2-style encoder for intrinsically disordered protein sequence representation learning, trained on [IDP-Euka-90](https://huggingface.co/datasets/InstaDeepAI/IDP-Euka-90). This repository provides a Transformer encoder suitable for extracting **per-sequence embeddings** (mean-pooled over residues with padding masked out).

---

## Quick start: generate embeddings

The snippet below loads the tokenizer and model, runs a forward pass on a couple of sequences, and mean-pools the residue representations (with padding masked out) into one embedding per sequence.

```python
from transformers import AutoTokenizer, AutoModel
import torch

# --- Config ---
model_name = "InstaDeepAI/IDP-ESM2-150M"

# --- Load model and tokenizer ---
# All ESM2 checkpoints share the same amino-acid vocabulary, so the
# tokenizer can be loaded from any ESM2 repository.
tokenizer = AutoTokenizer.from_pretrained("facebook/esm2_t6_8M_UR50D")
model = AutoModel.from_pretrained(model_name)
model.eval()

# (optional) use GPU if available
device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)

# --- Input sequences ---
sequences = [
    "MDDNHYPHHHHNHHNHHSTSGGCGESQFTTKLSVNTFARTHPMIQNDLIDLDLISGSAFTMKSKSQQ",
    "PADRDLSSPFGSTVPGVGPNAAAASNAAAAAAAAATAGSNKHQTPPTTFR",
]

# --- Tokenize ---
inputs = tokenizer(
    sequences,
    return_tensors="pt",
    padding=True,
    truncation=True,
)
inputs = {k: v.to(device) for k, v in inputs.items()}

# --- Forward pass ---
with torch.no_grad():
    outputs = model(**inputs)

hidden_states = outputs.last_hidden_state  # shape: (batch, seq_len, hidden_dim)

# --- Per-sequence embeddings: mean-pool over tokens, masking out padding ---
# Note: the attention mask keeps the <cls>/<eos> special tokens as well as
# the residues; exclude them explicitly if you want residue-only pooling.
mask = inputs["attention_mask"].unsqueeze(-1).type_as(hidden_states)
embeddings = (hidden_states * mask).sum(dim=1) / mask.sum(dim=1)
# embeddings shape: (batch, hidden_dim)
```
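
Once pooled, the embeddings can be compared or fed to downstream models directly. As a minimal illustrative sketch (assuming the quick-start snippet above has already run, so `embeddings` holds the two pooled vectors), the two example sequences could be compared with cosine similarity:

```python
import torch.nn.functional as F

# Cosine similarity between the two pooled per-sequence embeddings
# produced by the quick-start snippet (illustrative only).
similarity = F.cosine_similarity(embeddings[0], embeddings[1], dim=0)
print(f"cosine similarity: {similarity.item():.3f}")
```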