---
license: apache-2.0
language:
- en
tags:
- biology
- chemistry
- molecule
- protein
- multimodal
- foundation-model
- pretrained
pipeline_tag: text-generation
base_model: Qwen/Qwen3-4B-Base
library_name: transformers
---

# BioMatrix-4B-Base

**BioMatrix** is a multimodal biological foundation model that natively integrates **1D sequences**, **3D structures**, and **natural language** for both **molecules** and **proteins** within a single decoder-only architecture.

This is the **4B-parameter Base model**, obtained via **multimodal continual pretraining** of Qwen3-4B-Base on 304.4 billion tokens spanning text, molecular and protein 1D/3D data, and cross-modal corpora. This base checkpoint is intended for further fine-tuning on downstream tasks. For an instruction-tuned model ready for inference, see [BioMatrix-4B-SFT](https://huggingface.co/QizhiPei/BioMatrix-4B-SFT).

- πŸ“„ **Paper**: [BioMatrix: Towards a Comprehensive Biological Foundation Model Spanning the Modality Matrix of Sequences, Structures, and Language](https://github.com/QizhiPei/BioMatrix/blob/main/biomatrix_tech_report.pdf)
- πŸ’» **Code**: [https://github.com/QizhiPei/BioMatrix](https://github.com/QizhiPei/BioMatrix)
- πŸ€— **Model & Data Collection**: [https://huggingface.co/collections/QizhiPei/biomatrix](https://huggingface.co/collections/QizhiPei/biomatrix)

## Model Overview

BioMatrix maps **all biological modalities into a shared discrete token space** via a unified tokenization scheme:

- **Molecular 1D sequences** (both SMILES and SELFIES notations)
- **Molecular 3D structures** (via MolStrucTok with branch-decoupled decoder)
- **Protein 1D sequences** (residue-level tokens)
- **Protein 3D structures** (via GCP-VQVAE backbone tokenizer)
- **Natural language** (inherited from Qwen3 tokenizer)

All modalities are consumed and produced uniformly under a **single next-token prediction objective**β€”without external encoders, projection adapters, or modality-specific output heads.
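
As a minimal sketch of what this means in practice, a mixed text-and-molecule prompt tokenizes into one flat id sequence, with no separate encoder path for the wrapped span (this assumes the control tokens listed under Modality Wrapping below are registered as single special tokens in the extended vocabulary):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    "QizhiPei/BioMatrix-4B-Base", trust_remote_code=True
)

# Text and a wrapped SMILES span share one flat token stream.
prompt = "The molecule <|mol_smi_start|>CC(=O)O<|mol_smi_end|> is acetic acid."
ids = tokenizer(prompt)["input_ids"]
print(tokenizer.convert_ids_to_tokens(ids))  # control tokens appear inline
```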

| Model | Molecule 1D | Molecule 3D | Protein 1D | Protein 3D | Natural Language |
|-------|:-----------:|:-----------:|:----------:|:----------:|:----------------:|
| ESM3 | βœ— | βœ— | βœ“ | βœ“ | βœ“ |
| 3D-MoLM | βœ“ | βœ“ | βœ— | βœ— | βœ“ |
| AlphaFold3 | βœ“ | βœ“ | βœ“ | βœ“ | βœ— |
| BioT5/BioT5+ | βœ“ | βœ— | βœ“ | βœ— | βœ“ |
| BioMedGPT | βœ“ | βœ— | βœ“ | βœ— | βœ“ |
| **BioMatrix** | **βœ“** | **βœ“** | **βœ“** | **βœ“** | **βœ“** |

## Model Details

- **Base Architecture**: Qwen3-4B-Base
- **Parameters**: 4B
- **Training Stage**: Multimodal Continual Pretraining only (not instruction-tuned)
- **Training Tokens**: 304.4B
- **Context Length**: 8,192 tokens
- **Tokenizer**: Extended Qwen3 vocabulary with:
  - 11,294 joint molecular 3D tokens (composed from SELFIES atom Γ— MolStrucTok codes)
  - 4,096 protein 3D tokens (GCP-VQVAE codebook)
  - 26 protein 1D tokens (amino acids + non-standard/unknown)
  - SELFIES atom tokens and modality-specific control tokens

### Embedding Initialization

New vocabulary entries are initialized via a **description-based scheme**: each new token is grounded in the pretrained Qwen3 embedding space by averaging the embeddings of the subword tokens of a short natural-language description (e.g., `<A_W>` β†’ "Tryptophan"), plus a small isotropic Gaussian perturbation to break symmetry. This provides a more stable starting point than random initialization.
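
A minimal sketch of this scheme, assuming illustrative values for the noise scale and description strings (the paper specifies the exact recipe):

```python
import torch

def init_new_token_embedding(model, tokenizer, new_token, description, noise_std=0.02):
    """Ground a new token at the mean embedding of its description's subwords."""
    emb = model.get_input_embeddings().weight  # (vocab_size, hidden_size)
    sub_ids = tokenizer(description, add_special_tokens=False)["input_ids"]
    with torch.no_grad():
        mean_vec = emb[sub_ids].mean(dim=0)             # average of subword embeddings
        noise = noise_std * torch.randn_like(mean_vec)  # small Gaussian to break symmetry
        emb[tokenizer.convert_tokens_to_ids(new_token)] = mean_vec + noise

# e.g., init_new_token_embedding(model, tokenizer, "<A_W>", "Tryptophan")
```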

## Pretraining Corpus (304.4B tokens)

| Category | Subset | Tokens | Sources |
|----------|--------|-------:|---------|
| **Text** (105.3B) | General | 25.6B | FineWeb-Edu |
| | Scientific | 79.7B | FineFineWeb (biology/chemistry/medical/health), PubMed full articles |
| **Molecule** (73.7B) | 1D | 36.0B | PubChem, MolTextNet |
| | 3D | 17.6B | PubChem, PCQM4Mv2, PubChemQC |
| | Other | 24.0B | Text descriptions, properties, IUPAC names |
| **Protein** (77.4B) | 1D | 17.1B | UniRef50 |
| | 3D | 38.5B | RCSB PDB, AlphaFold DB |
| | Other | 19.5B | Swiss-Prot, TrEMBL annotations |
| | Other (additional) | 2.9B | |
| **Cross-entity** (48.0B) | Interleaved text | 17.1B | PubMed, bioRxiv, S2ORC, USPTO |
| | 3D | 11.4B | CrossDocked, PPIRef |
| | Other | 19.5B | BindingDB, STITCH, jglaser, AlphaSeq |

### Training Configuration

- **Framework**: LLaMA-Factory
- **Hardware**: 64 NVIDIA H100 GPUs
- **Global Batch Size**: 1,024
- **Maximum Sequence Length**: 8,192 tokens
- **Optimizer**: AdamW
- **Peak Learning Rate**: 2.0 Γ— 10⁻⁴ (cosine schedule; sketched below)
- **Warmup Steps**: 2,000
- **Total Steps**: ~36.4K (1 epoch over the full 304.4B-token corpus)
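
The optimizer and schedule can be approximated as below. This is a sketch under stated assumptions only: betas and weight decay are not reported, and a stand-in module replaces the actual 4B model.

```python
import torch
from transformers import get_cosine_schedule_with_warmup

model = torch.nn.Linear(8, 8)  # stand-in; in practice, the BioMatrix model
optimizer = torch.optim.AdamW(model.parameters(), lr=2.0e-4)
scheduler = get_cosine_schedule_with_warmup(
    optimizer,
    num_warmup_steps=2_000,
    num_training_steps=36_400,  # ~1 epoch: 304.4B tokens at 1,024 x 8,192 tokens per step
)
```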

## Intended Use

This **Base model is not instruction-tuned**. It is suitable for:

- **Further fine-tuning** on custom biological tasks
- **Continued pretraining** on domain-specific corpora
- **Research on representation learning** across biomolecular modalities
- **Embedding extraction** for downstream classification/regression tasks (see the sketch below)
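
For the embedding-extraction use case, a minimal sketch; mean pooling over the final hidden layer is an illustrative choice, not a prescription from the report:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "QizhiPei/BioMatrix-4B-Base"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype="auto", trust_remote_code=True
)

seq = "<|mol_smi_start|>CC(=O)Oc1ccccc1C(=O)O<|mol_smi_end|>"  # aspirin
inputs = tokenizer(seq, return_tensors="pt")
with torch.no_grad():
    hidden = model(**inputs, output_hidden_states=True).hidden_states[-1]
embedding = hidden.mean(dim=1)  # (1, hidden_size) pooled molecule representation
```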

For ready-to-use instruction-following capabilities (e.g., molecule captioning, protein design, property prediction), please use the [SFT variant](https://huggingface.co/QizhiPei/BioMatrix-4B-SFT).

## Quick Start

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "QizhiPei/BioMatrix-4B-Base"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto",
    trust_remote_code=True
)

# Example: Continue a SMILES sequence
prompt = "<|mol_smi_start|>CC(=O)"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=512)
print(tokenizer.decode(outputs[0], skip_special_tokens=False))
```

## Modality Wrapping

When constructing inputs, biomolecular content must be wrapped with the corresponding control tokens:

| Modality | Wrapping Example |
|----------|------------------|
| Molecule SMILES | `<\|mol_smi_start\|>CC#CC#N<\|mol_smi_end\|>` |
| Molecule SELFIES | `<\|mol_sfi_start\|>[C][#C][C][#N]<\|mol_sfi_end\|>` |
| Molecule 3D | `<\|mol_3d_start\|>[H 3][C 0][#C 6]...<\|mol_3d_end\|>` |
| Protein 1D | `<\|prot_aa_start\|><A M><A R><A A>...<\|prot_aa_end\|>` |
| Protein 3D | `<\|prot_3d_start\|><S 4012><S 153><S 2091>...<\|prot_3d_end\|>` |

Natural language text is left unwrapped and serves as the default carrier modality.
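
A small helper for composing wrapped prompts; the helper itself is illustrative and not part of the released tooling, but the control-token names follow the table above:

```python
# Control-token names taken from the Modality Wrapping table above.
MODALITY_TAGS = {
    "smiles":     ("<|mol_smi_start|>", "<|mol_smi_end|>"),
    "selfies":    ("<|mol_sfi_start|>", "<|mol_sfi_end|>"),
    "protein_1d": ("<|prot_aa_start|>", "<|prot_aa_end|>"),
}

def wrap(modality: str, content: str) -> str:
    start, end = MODALITY_TAGS[modality]
    return f"{start}{content}{end}"

# e.g., a cross-modal continuation prompt for the base model:
prompt = "The molecule " + wrap("smiles", "CC#CC#N") + " has the IUPAC name"
```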

## Limitations

- This model is **not instruction-tuned** and will not reliably follow natural-language instructions out of the box. Use the SFT variant for instruction following.
- Molecular and protein 3D structures are tokenized in **disjoint geometric reference frames**, so the model cannot natively represent biomolecular complexes (e.g., docking poses).
- Heavy domain specialization may erode some general-purpose language capabilities of the underlying Qwen3 backbone.
- Coverage is limited to **small molecules and proteins**; nucleic acids, carbohydrates, and lipids are not currently supported.

## Citation

If you find BioMatrix useful, please cite:

```bibtex
@article{pei2026biomatrix,
  title={BioMatrix: Towards a Comprehensive Biological Foundation Model Spanning the Modality Matrix of Sequences, Structures, and Language},
  author={Pei, Qizhi and Zhou, Zhimeng and Duan, Yi and Zhao, Yiyang and He, Liang and Hsieh, Chang-Yu and He, Conghui and Yan, Rui and Wu, Lijun},
  year={2026}
}
```

## License

This model is released under the Apache 2.0 license. The base model (Qwen3-4B-Base) is subject to its own license terms.