ibm-research/biomed.rna.llama.32m.mlm.multitask.v1
Biomedical foundational models for omics data. This package supports the development of foundation models for scRNA or for DNA data.
biomed-multi-omic enables development and testing of foundation models for DNA sequences and for RNA expression,
with modular model and training methods for pretraining and fine-tuning, controllable via a declarative no-code interface.
biomed-multi-omic leverages anndata, HuggingFace Transformers, PyTorch Lightning and Hydra.
- 🧬 A single package for DNA and RNA foundation models: scRNA pretraining on h5ad files or TileDB (e.g., CellXGene); DNA pretraining on the human reference genome (GRCh38/hg38) and on variant-imputed genomes based on common SNPs from the GWAS Catalog and ClinVar datasets.
- Leverages the latest open-source tools: anndata, HuggingFace Transformers, and PyTorch Lightning.
- Zero-shot and fine-tuning support for diverse downstream tasks: cell type annotation and perturbation prediction for scRNA; promoter prediction and regulatory region prediction using massively parallel reporter assays (MPRAs) for DNA sequences.
- Novel pretraining strategies for scRNA and DNA implemented alongside existing methods to enable experimentation and comparison.
For details on how the models were trained, please refer to the BMFM-RNA preprint.
- Developers: IBM Research
- GitHub Repository: https://github.com/BiomedSciAI/biomed-multi-omic
- Paper: BMFM-RNA: whole-cell expression decoding enables efficient transcriptomic foundation models
- Release Date: Mar 25th, 2026
- License: Apache 2.0
Checkpoint
Masked Language Modeling (MLM): Masked expression prediction and masked gene prediction.
Multitask objectives: multi-label classification (cell type, tissue, tissue general), and an adversarial loss to unlearn donor ID.
MLM + Multitask: Trained with both masked gene and masked expression objectives, along with multi-label classification (cell type, tissue, tissue general), and an adversarial loss to unlearn donor ID.
See section 2.3.2 of the BMFM-RNA manuscript for more details.
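The masked expression objective can be illustrated with a minimal sketch. The `MASK` sentinel, the masking rate, and the function name `mask_expression` are assumptions made for illustration and are not taken from the biomed-multi-omic codebase; the masking strategy actually used for this checkpoint is described in the manuscript.

```python
import random

MASK = -1  # hypothetical sentinel standing in for a [MASK] token


def mask_expression(values, mask_rate=0.15, seed=0):
    """Hide a random subset of expression values.

    Returns (masked_values, labels). labels holds the original value at
    masked positions and None elsewhere, so a loss would only be computed
    on the positions the model could not see.
    """
    if not values:
        return [], []
    rng = random.Random(seed)
    # Mask at least one position, never more than the sequence length.
    n_mask = min(len(values), max(1, round(len(values) * mask_rate)))
    masked_idx = set(rng.sample(range(len(values)), n_mask))
    masked = [MASK if i in masked_idx else v for i, v in enumerate(values)]
    labels = [v if i in masked_idx else None for i, v in enumerate(values)]
    return masked, labels


# Toy binned expression values for the genes of one cell (made-up numbers).
expression = [0, 3, 1, 0, 7, 2, 0, 5, 1, 4]
masked, labels = mask_expression(expression, mask_rate=0.3)
```

Masked gene prediction would follow the same pattern, hiding gene identifiers instead of expression values.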
Usage
Using biomed.rna.llama.32m.mlm.multitask.v1 requires the biomed-multi-omic codebase: https://github.com/BiomedSciAI/biomed-multi-omic
For installation, follow the instructions on GitHub.
RNA Inference
To get embeddings and predictions for scRNA data, run:
export MY_DATA_FILE=... # path to h5ad file with raw counts and gene symbols
bmfm-targets-run -cn predict input_file=$MY_DATA_FILE working_dir=/tmp checkpoint=ibm-research/biomed.rna.llama.32m.mlm.multitask.v1
For more examples, see the RNA tutorials on GitHub.
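For scripted use, the same invocation can be assembled from Python. This is a sketch that only wraps the command shown above: the helper names `build_predict_command` and `run_predict` are hypothetical and not part of biomed-multi-omic, the file name `my_cells.h5ad` is a placeholder, and the sketch assumes `bmfm-targets-run` is on the `PATH` after installation.

```python
import shutil
import subprocess

CHECKPOINT = "ibm-research/biomed.rna.llama.32m.mlm.multitask.v1"


def build_predict_command(input_file, working_dir="/tmp", checkpoint=CHECKPOINT):
    """Return the argument list for the predict command shown above."""
    return [
        "bmfm-targets-run",
        "-cn",
        "predict",
        f"input_file={input_file}",
        f"working_dir={working_dir}",
        f"checkpoint={checkpoint}",
    ]


def run_predict(input_file, **kwargs):
    """Run the command, failing early if the CLI is not installed."""
    cmd = build_predict_command(input_file, **kwargs)
    if shutil.which(cmd[0]) is None:
        raise FileNotFoundError(f"{cmd[0]} not found on PATH")
    return subprocess.run(cmd, check=True)


# Build (but do not run) the command for a placeholder input file.
command = build_predict_command("my_cells.h5ad")
```

Calling `run_predict("my_cells.h5ad")` would then execute the command. Passing the arguments as a list bypasses the shell, so paths containing spaces need no shell quoting.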
Citation
@misc{danziger2026bmfmrnawholecellexpressiondecoding,
title={BMFM-RNA: whole-cell expression decoding improves transcriptomic foundation models},
author={Michael M. Danziger and Bharath Dandala and Viatcheslav Gurev and Matthew Madgwick and Sivan Ravid and Tim Rumbell and Akira Koseki and Tal Kozlovski and Ching-Huei Tsou and Ella Barkan and Tanwi Biswas and Jielin Xu and Yishai Shimoni and Jianying Hu and Michal Rosen-Zvi},
year={2026},
eprint={2506.14861},
archivePrefix={arXiv},
primaryClass={q-bio.GN},
url={https://arxiv.org/abs/2506.14861},
}