---
license: apache-2.0
language:
- en
tags:
- biology
- chemistry
- molecule
- protein
- multimodal
- foundation-model
- drug-discovery
- protein-design
pipeline_tag: text-generation
base_model: QizhiPei/BioMatrix-4B-Base
library_name: transformers
---

# BioMatrix-4B-SFT

**BioMatrix** is a multimodal biological foundation model that natively integrates **1D sequences**, **3D structures**, and **natural language** for both **molecules** and **proteins** within a single decoder-only architecture.

This is the **4B-parameter SFT (Supervised Fine-Tuned)** variant, instruction-tuned across 80 downstream biological tasks spanning 6 categories.

- 📄 **Paper**: [BioMatrix: Towards a Comprehensive Biological Foundation Model Spanning the Modality Matrix of Sequences, Structures, and Language](https://github.com/QizhiPei/BioMatrix/blob/main/biomatrix_tech_report.pdf)
- 💻 **Code**: [https://github.com/QizhiPei/BioMatrix](https://github.com/QizhiPei/BioMatrix)
- 🤗 **Model & Data Collection**: [https://huggingface.co/collections/QizhiPei/biomatrix](https://huggingface.co/collections/QizhiPei/biomatrix)

## Model Overview

BioMatrix closes the gap between native multimodality and broad entity coverage in biological foundation models. Unlike adapter-based approaches that bolt external encoders onto a language model, or prior native-tokenization models confined to a single entity type, BioMatrix maps **all modalities into a shared discrete token space** via a unified tokenization scheme:

- **Molecular 1D sequences** (both SMILES and SELFIES notations)
- **Molecular 3D structures** (via MolStrucTok with branch-decoupled decoder)
- **Protein 1D sequences** (residue-level tokens)
- **Protein 3D structures** (via GCP-VQVAE backbone tokenizer)
- **Natural language** (inherited from Qwen3 tokenizer)

All modalities are consumed and produced uniformly under a **single next-token prediction objective**—without external encoders, projection adapters, or modality-specific output heads.

| Model | Molecule 1D | Molecule 3D | Protein 1D | Protein 3D | Natural Language |
|-------|:-----------:|:-----------:|:----------:|:----------:|:----------------:|
| ESM3 | ✗ | ✗ | ✓ | ✓ | ✓ |
| 3D-MoLM | ✓ | ✓ | ✗ | ✗ | ✓ |
| AlphaFold3 | ✓ | ✓ | ✓ | ✓ | ✗ |
| BioT5/BioT5+ | ✓ | ✗ | ✓ | ✗ | ✓ |
| BioMedGPT | ✓ | ✗ | ✓ | ✗ | ✓ |
| NatureLM | ✓ | ✗ | ✓ | ✗ | ✓ |
| SciReasoner | ✓ | ✗ | ✓ | ✗ | ✓ |
| **BioMatrix** | **✓** | **✓** | **✓** | **✓** | **✓** |

## Model Details

- **Base Architecture**: Qwen3-4B-Base
- **Parameters**: 4B
- **Training Stages**:
  - **Continual Pretraining** on 304.4B tokens (general/scientific text, molecular & protein 1D/3D data, cross-modal interleaved corpora)
  - **Instruction Tuning** on a comprehensive suite of 80 downstream tasks across 6 categories
- **Context Length**: 8,192 tokens
- **Tokenizer**: Extended Qwen3 vocabulary with:
  - 11,294 joint molecular 3D tokens (composed from SELFIES atom × MolStrucTok codes)
  - 4,096 protein 3D tokens (GCP-VQVAE codebook)
  - 26 protein 1D tokens (amino acids + non-standard/unknown)
  - SELFIES atom tokens and modality-specific control tokens
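
The extended vocabulary can be checked directly from the released tokenizer. A minimal inspection sketch, assuming the control-token strings listed in the Modality Wrapping section below resolve to single vocabulary entries:

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("QizhiPei/BioMatrix-4B-SFT", trust_remote_code=True)
print(len(tok))  # Qwen3 base vocabulary plus the biological extensions

# Each modality control token should map to a single vocabulary id
for t in ["<|mol_smi_start|>", "<|mol_sfi_start|>", "<|mol_3d_start|>",
          "<|prot_aa_start|>", "<|prot_3d_start|>"]:
    print(t, tok.convert_tokens_to_ids(t))
```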

## Pretraining Corpus (304.4B tokens)

| Category | Tokens | Sources |
|----------|--------|---------|
| **Text** | 105.3B | FineWeb-Edu, FineFineWeb (biology/chemistry/medical/health), PubMed Full Articles |
| **Molecule** | 73.7B | PubChem, PCQM4Mv2, PubChemQC, MolTextNet |
| **Protein** | 77.4B | UniRef50, RCSB PDB, Swiss-Prot, TrEMBL, AlphaFold DB |
| **Cross-entity** | 48.0B | Interleaved text (PubMed, bioRxiv, S2ORC, USPTO), Molecule–protein (BindingDB, STITCH, jglaser, CrossDocked), Protein–protein (AlphaSeq, PPIRef) |

## Performance Highlights

BioMatrix achieves **state-of-the-art or competitive performance on 77 out of 80 tasks**. Selected highlights for the 4B-SFT variant:

### Molecular Tasks
- **Unconditional 1D Generation** (GuacaMol): 0.998 validity, 1.000 uniqueness, 0.986 novelty
- **Name Conversion (I2S EM)**: 92.83% (vs. SciReasoner-8B: 84.40%)
- **Text-Based Molecule Generation (EM)**: 65.07% (vs. SciReasoner-8B: 48.00%)
- **MoleculeQA Total Accuracy**: 73.78% (vs. prior best MolCA-1.3B: 64.79%)
- **Property-Conditioned 3D Generation**: ~3-4× error reduction on QM9 electronic-structure targets

### Protein Tasks
- **Fold Type Prediction (Family level)**: 87.25% accuracy
- **Annotation Prediction (UniProtSeq Keywords F1)**: 91.26%
- **Inverse Folding AAR**: 75.50% (vs. DPLM-2-3B: 61.67%)
- **Sequence–Structure Co-generation**: scTM = 0.965, scRMSD = 2.80 Å
- **Unconditional Backbone Generation**: scTM = 0.963 (joint frontier with RFDiffusion)

### Interaction Tasks
- **BindingDB Affinity (RMSE)**: 1.030 (new state-of-the-art; prior literature best: 1.340)
- **PDBBindv2020 3D Affinity**: RMSE = 1.260, Pearson = 0.737, MAE = 0.972

## Quick Start

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "QizhiPei/BioMatrix-4B-SFT"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto",
    trust_remote_code=True
)

# Example: Molecule captioning with SELFIES input
instruction = "I need a brief explanation of the molecule denoted in this SELFIES notation. <|mol_sfi_start|>[Te]<|mol_sfi_end|>"

messages = [
    {"role": "user", "content": instruction}
]
prompt = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=2048, do_sample=False)
# Keep skip_special_tokens=False so modality control tokens remain visible in the output
response = tokenizer.decode(outputs[0][inputs.input_ids.shape[1]:], skip_special_tokens=False)
print(response)
```

## Modality Wrapping

When constructing prompts, biomolecular content must be wrapped with the corresponding control tokens:

| Modality | Wrapping Example |
|----------|------------------|
| Molecule SMILES | `<\|mol_smi_start\|>CC#CC#N<\|mol_smi_end\|>` |
| Molecule SELFIES | `<\|mol_sfi_start\|>[C][#C][C][#N]<\|mol_sfi_end\|>` |
| Molecule 3D | `<\|mol_3d_start\|>[H 3][C 0][#C 6]...<\|mol_3d_end\|>` |
| Protein 1D | `<\|prot_aa_start\|><A M><A R><A A>...<\|prot_aa_end\|>` |
| Protein 3D | `<\|prot_3d_start\|><S 4012><S 153><S 2091>...<\|prot_3d_end\|>` |

Natural language text is left unwrapped and serves as the default carrier modality.
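
A minimal helper sketch for assembling such prompts; the control-token strings come from the table above, while the helper itself (`MODALITY_TAGS`, `wrap`) is illustrative and not part of the released package:

```python
# Control tokens from the table above; the helper itself is hypothetical.
MODALITY_TAGS = {
    "smiles":  ("<|mol_smi_start|>", "<|mol_smi_end|>"),
    "selfies": ("<|mol_sfi_start|>", "<|mol_sfi_end|>"),
    "mol_3d":  ("<|mol_3d_start|>",  "<|mol_3d_end|>"),
    "prot_1d": ("<|prot_aa_start|>", "<|prot_aa_end|>"),
    "prot_3d": ("<|prot_3d_start|>", "<|prot_3d_end|>"),
}

def wrap(content: str, modality: str) -> str:
    """Wrap a biomolecular string in the control tokens for its modality."""
    start, end = MODALITY_TAGS[modality]
    return f"{start}{content}{end}"

# Example: a molecule-captioning instruction with a SMILES payload
instruction = f"Describe this molecule. {wrap('CC#CC#N', 'smiles')}"
```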

## Supported Tasks

BioMatrix-4B-SFT was instruction-tuned across the following task categories:

**Molecule (1D)**: unconditional generation, name conversion, property prediction, captioning, text-based generation, forward/retrosynthesis, editing, optimization, customized generation, question answering

**Molecule (3D)**: unconditional generation, property-conditioned generation

**Protein (1D)**: sequence understanding, annotation prediction, knowledge mining, text-based design, unconditional generation

**Protein (3D)**: structure understanding, folding, inverse folding, sequence-structure co-generation, unconditional backbone generation

**Interaction**: molecule-protein binding affinity (1D & 3D), protein-protein interaction

> **Note on task-group variants**: As detailed in the paper, the released SFT model is trained on the union of all sub-task corpora with mild oversampling for small-data tasks. For best performance on specific benchmarks, please refer to the paper's task-group-specific variants.

## SMILES vs. SELFIES

BioMatrix supports both notations as parallel 1D molecular representations. Empirically:

- **SELFIES** excels on tasks requiring validity-by-construction (unconditional generation, property optimization)
- **SMILES** excels on tasks requiring surface-level structural anchoring (customized generation with atom/bond/functional-group constraints, forward synthesis, retrosynthesis)

See Section 9.2 of the paper for detailed analysis.
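
For reference, the two notations are interconvertible offline before prompting. A round-trip sketch, assuming the third-party `selfies` package (`pip install selfies`):

```python
import selfies as sf

smiles = "CC#CC#N"
selfies_str = sf.encoder(smiles)     # SMILES -> SELFIES, e.g. '[C][C][#C][C][#N]'
roundtrip = sf.decoder(selfies_str)  # SELFIES -> SMILES; decoding is valid by construction
print(selfies_str, roundtrip)
```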

## Limitations

- Molecular and protein 3D structures are tokenized in **disjoint geometric reference frames**, so the model cannot natively represent biomolecular complexes (e.g., docking poses).
- Heavy domain specialization may erode some general-purpose language capabilities of the underlying Qwen3 backbone.
- Coverage is limited to **small molecules and proteins**; nucleic acids, carbohydrates, and lipids are not currently supported.
- Fine-grained 3D geometry (e.g., bond lengths) shows residual quantization error from finite codebooks; a lightweight post-hoc force-field refinement (e.g., MMFF) closes most of this gap.
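
A sketch of the post-hoc refinement mentioned in the last point, assuming RDKit; the embedding step below merely stands in for coordinates decoded from the model's 3D tokens:

```python
from rdkit import Chem
from rdkit.Chem import AllChem

mol = Chem.AddHs(Chem.MolFromSmiles("CC#CC#N"))
AllChem.EmbedMolecule(mol, randomSeed=0)   # placeholder for model-decoded 3D geometry
AllChem.MMFFOptimizeMolecule(mol)          # MMFF94 relaxation of bond lengths/angles
```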

## Citation

If you find BioMatrix useful, please cite:

```bibtex
@article{pei2026biomatrix,
  title={BioMatrix: Towards a Comprehensive Biological Foundation Model Spanning the Modality Matrix of Sequences, Structures, and Language},
  author={Pei, Qizhi and Zhou, Zhimeng and Duan, Yi and Zhao, Yiyang and He, Liang and Hsieh, Chang-Yu and He, Conghui and Yan, Rui and Wu, Lijun},
  year={2026}
}
```

## License

This model is released under the Apache 2.0 license. The base model (Qwen3-4B-Base) is subject to its own license terms.