---
license: other
license_name: scienta-lab-eva-model-license
license_link: LICENSE
language:
- en
tags:
- biology
- transcriptomics
- rna-seq
- gene-expression
- foundation-model
- single-cell
- bulk-rna
- immunology
library_name: transformers
pipeline_tag: feature-extraction
---

# EVA-RNA: Foundation Model for Transcriptomics

Transformer-based foundation model that produces sample-level and gene-level embeddings from RNA-seq profiles (bulk, microarray, pseudobulked single-cell) in human and mouse.

## Installation

We recommend using the [uv package manager](https://docs.astral.sh/uv/getting-started/installation/):

```bash
uv venv --python 3.10
source .venv/bin/activate
uv pip install transformers torch==2.6.0 scanpy anndata tqdm scipy scikit-misc
```

### Optional: Flash Attention

To handle larger gene contexts, EVA-RNA automatically runs on Flash Attention when it is installed. Flash Attention requires an Ampere-or-newer GPU (A100 and beyond). We recommend the following wheel:

```bash
uv pip install https://github.com/Dao-AILab/flash-attention/releases/download/v2.7.4.post1/flash_attn-2.7.4.post1+cu12torch2.6cxx11abiFALSE-cp310-cp310-linux_x86_64.whl
```

## Quick Start

```python
import scanpy as sc
from transformers import AutoModel, AutoTokenizer

model = AutoModel.from_pretrained("ScientaLab/eva-rna", trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained("ScientaLab/eva-rna", trust_remote_code=True)

# Load example dataset (2,700 PBMCs, raw counts)
adata = sc.datasets.pbmc3k()

# Subset to 2,000 highly variable genes for efficiency
sc.pp.highly_variable_genes(adata, n_top_genes=2000, flavor="seurat_v3")
adata = adata[:, adata.var.highly_variable].copy()

# Encode (gene symbols auto-converted, preprocessing applied, GPU used if available)
embeddings = model.encode_anndata(tokenizer, adata)
adata.obsm["X_eva"] = embeddings
```

### Options

`model.encode_anndata()` accepts the following parameters (a combined example follows the list):

- `gene_column`: column in `adata.var` with gene identifiers (default: uses `adata.var_names`)
- `species`: `"human"` or `"mouse"` for gene ID conversion (default: auto-detected)
- `batch_size`: samples per inference batch (default: 32)
- `device`: `"cpu"`, `"cuda"`, etc. (default: CUDA if available)
- `show_progress`: show a progress bar (default: True)
- `preprocess`: apply library-size normalization + log1p (default: True); set to False if data is already log-transformed
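Continuing from the Quick Start objects above, the options can be combined as in the following sketch. The `"gene_symbols"` column name is a placeholder for whatever column in `adata.var` actually holds your identifiers, and the flag values are illustrative rather than recommendations:

```python
# Sketch only: "gene_symbols" is a placeholder column name; adjust the
# options to your data.
embeddings = model.encode_anndata(
    tokenizer,
    adata,
    gene_column="gene_symbols",  # read identifiers from adata.var["gene_symbols"]
    species="human",             # skip auto-detection
    batch_size=64,               # larger batches if GPU memory allows
    device="cuda",
    show_progress=False,
    preprocess=False,            # data is already normalized + log1p-transformed
)
```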
## Advanced: Raw Tensor API

For users who need direct control over inputs (mixed precision is applied automatically):

```python
import torch
from transformers import AutoModel, AutoTokenizer

model = AutoModel.from_pretrained("ScientaLab/eva-rna", trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained("ScientaLab/eva-rna", trust_remote_code=True)

device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device).eval()

# Gene IDs must be NCBI GeneIDs as strings
gene_ids = ["7157", "675", "672"]    # TP53, BRCA2, BRCA1
expression_values = [5.5, 3.2, 4.1]  # log1p-normalized

inputs = tokenizer(gene_ids, expression_values, padding=True, return_tensors="pt")
inputs = {k: v.to(device) for k, v in inputs.items()}

with torch.inference_mode():
    outputs = model(**inputs)

sample_embedding = outputs.cls_embedding   # (1, 256)
gene_embeddings = outputs.gene_embeddings  # (1, 3, 256)
```

### Batch Processing

```python
batch_gene_ids = [
    ["7157", "675", "672"],
    ["7157", "1956", "5290"],
]
batch_expression = [
    [5.5, 3.2, 4.1],
    [2.1, 6.3, 1.8],
]

inputs = tokenizer(batch_gene_ids, batch_expression, padding=True, return_tensors="pt")
inputs = {k: v.to(device) for k, v in inputs.items()}

with torch.inference_mode():
    outputs = model(**inputs)

sample_embeddings = outputs.cls_embedding  # (2, 256)
```

## Expression Decoder

EVA-RNA includes a pre-trained deterministic expression decoder that maps gene embeddings back to predicted expression values.

```python
with torch.inference_mode():
    # Encode
    output = model.encode(**inputs)
    # output.cls_embedding: sample-level embedding (batch, hidden_size)
    # output.gene_embeddings: per-gene embeddings (batch, n_genes, hidden_size)

    # Decode expression values
    predicted_expression = model.decode(output.gene_embeddings)
    # predicted_expression: (batch, n_genes)
```

## GPU and Precision

EVA-RNA automatically applies mixed precision for optimal performance:

- **Ampere+ GPUs** (A100, H100, RTX 30/40 series): bfloat16
- **Older CUDA GPUs** (V100, RTX 20 series): float16
- **CPU**: full precision (float32)

No manual `torch.autocast()` is needed; a sketch of how these tiers map to hardware follows.
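The selection logic is internal to the model, but as a rough illustration (an assumption about the heuristic, not the model's actual code path), the same tiers can be approximated from CUDA compute capability, since Ampere and newer GPUs report a major version of 8 or higher:

```python
import torch

# Illustration only: approximates the precision tiers above via compute
# capability. This mirrors the table, not the model's internal logic.
if not torch.cuda.is_available():
    dtype = torch.float32                              # CPU: full precision
elif torch.cuda.get_device_capability()[0] >= 8:
    dtype = torch.bfloat16                             # Ampere+ (A100, H100, RTX 30/40)
else:
    dtype = torch.float16                              # older CUDA GPUs (V100, RTX 20)

print(f"Expected autocast dtype: {dtype}")
```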
> **Note (Flash Attention constraints):** When flash attention is installed and an
> Ampere+ GPU is detected, the model uses flash attention layers. These layers
> **require CUDA and half-precision inputs**. If you move the model to CPU you will
> get a clear error asking you to move it back to GPU. If you pass `autocast=False`,
> autocast is re-enabled automatically with a warning, since flash attention cannot
> run in full precision.

### Disabling Automatic Mixed Precision

For advanced use cases requiring manual precision control, pass `autocast=False`. This only takes effect when flash attention is **not** active (i.e., on older GPUs or when flash attention is not installed):

```python
model = model.to("cuda").eval()

with torch.inference_mode():
    # Disable automatic mixed precision (ignored when flash attention is active)
    outputs = model(**inputs, autocast=False)

    # Or via sample_embedding
    embedding = model.sample_embedding(
        gene_ids=gene_ids,
        expression_values=expression_values,
        autocast=False,
    )
```

## Converting Gene Symbols to NCBI Gene IDs

The tokenizer vocabulary uses NCBI GeneIDs. A built-in gene mapper converts gene symbols or Ensembl IDs:

```python
tokenizer = AutoTokenizer.from_pretrained("ScientaLab/eva-rna", trust_remote_code=True)

# Available mappings:
#   "symbol_to_ncbi"        – human gene symbols → NCBI GeneIDs
#   "ensembl_to_ncbi"       – human Ensembl IDs → NCBI GeneIDs
#   "symbol_to_ncbi_mouse"  – mouse gene symbols → NCBI GeneIDs
mapper = tokenizer.gene_mapper["symbol_to_ncbi"]

gene_symbols = ["TP53", "BRCA2", "BRCA1"]
gene_ids = [mapper[s] for s in gene_symbols]
# gene_ids = ["7157", "675", "672"]

expression_values = [5.5, 3.2, 4.1]
inputs = tokenizer(gene_ids, expression_values, padding=True, return_tensors="pt")
```

## Citation

```bibtex
@article{eva-rna,
  title={EVA: Towards a universal model of the immune system},
  author={Scienta Team},
  journal={arXiv},
  year={2026},
}
```

## License

[Scienta Lab EVA Model License](LICENSE)