---
license: apache-2.0
language:
- en
tags:
- biology
- chemistry
- molecule
- protein
- multimodal
- foundation-model
- pretrained
pipeline_tag: text-generation
base_model: Qwen/Qwen3-1.7B-Base
library_name: transformers
---

# BioMatrix-1.7B-Base

**BioMatrix** is a multimodal biological foundation model that natively integrates **1D sequences**, **3D structures**, and **natural language** for both **molecules** and **proteins** within a single decoder-only architecture.

This is the **1.7B-parameter Base model**, obtained via **multimodal continual pretraining** of Qwen3-1.7B-Base on 304.4 billion tokens spanning text, molecular and protein 1D/3D data, and cross-modal corpora. This base checkpoint is intended for further fine-tuning on downstream tasks. For an instruction-tuned model ready for inference, see [BioMatrix-1.7B-SFT](https://huggingface.co/QizhiPei/BioMatrix-1.7B-SFT). For a larger model, see [BioMatrix-4B-Base](https://huggingface.co/QizhiPei/BioMatrix-4B-Base).

- πŸ“„ **Paper**: [BioMatrix: Towards a Comprehensive Biological Foundation Model Spanning the Modality Matrix of Sequences, Structures, and Language](https://github.com/QizhiPei/BioMatrix/blob/main/biomatrix_tech_report.pdf)
- πŸ’» **Code**: [https://github.com/QizhiPei/BioMatrix](https://github.com/QizhiPei/BioMatrix)
- πŸ€— **Model & Data Collection**: [https://huggingface.co/collections/QizhiPei/biomatrix](https://huggingface.co/collections/QizhiPei/biomatrix)

## Model Overview

BioMatrix maps **all biological modalities into a shared discrete token space** via a unified tokenization scheme:

- **Molecular 1D sequences** (both SMILES and SELFIES notations)
- **Molecular 3D structures** (via MolStrucTok with a branch-decoupled decoder)
- **Protein 1D sequences** (residue-level tokens)
- **Protein 3D structures** (via a GCP-VQVAE backbone tokenizer)
- **Natural language** (inherited from the Qwen3 tokenizer)

All modalities are consumed and produced uniformly under a **single next-token prediction objective**, without external encoders, projection adapters, or modality-specific output heads.

| Model | Molecule 1D | Molecule 3D | Protein 1D | Protein 3D | Natural Language |
|-------|:-----------:|:-----------:|:----------:|:----------:|:----------------:|
| ESM3 | βœ— | βœ— | βœ“ | βœ“ | βœ“ |
| 3D-MoLM | βœ“ | βœ“ | βœ— | βœ— | βœ“ |
| AlphaFold3 | βœ“ | βœ“ | βœ“ | βœ“ | βœ— |
| BioT5/BioT5+ | βœ“ | βœ— | βœ“ | βœ— | βœ“ |
| BioMedGPT | βœ“ | βœ— | βœ“ | βœ— | βœ“ |
| **BioMatrix** | **βœ“** | **βœ“** | **βœ“** | **βœ“** | **βœ“** |

## Model Details

- **Base Architecture**: Qwen3-1.7B-Base
- **Parameters**: 1.7B
- **Training Stage**: Multimodal continual pretraining only (not instruction-tuned)
- **Training Tokens**: 304.4B
- **Context Length**: 8,192 tokens
- **Tokenizer**: Extended Qwen3 vocabulary with:
  - 11,294 joint molecular 3D tokens (composed from SELFIES atom Γ— MolStrucTok codes)
  - 4,096 protein 3D tokens (GCP-VQVAE codebook)
  - 26 protein 1D tokens (amino acids + non-standard/unknown)
  - SELFIES atom tokens and modality-specific control tokens

### Embedding Initialization

New vocabulary entries are initialized via a **description-based scheme**: each new token is grounded in the pretrained Qwen3 embedding space by averaging the embeddings of the subword tokens of a short natural-language description (e.g., the new tryptophan residue token β†’ "Tryptophan"), plus a small isotropic Gaussian perturbation to break symmetry. This provides a more stable starting point than random initialization.
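A minimal sketch of this initialization scheme is shown below. The helper name and the `noise_std` value are illustrative assumptions, not the released training code.

```python
import torch

def init_token_embedding_from_description(tokenizer, embedding_matrix, description, noise_std=0.02):
    """Hypothetical helper: description-based initialization for one new token.

    Averages the pretrained embeddings of the description's subword tokens and
    adds a small isotropic Gaussian perturbation to break symmetry.
    `noise_std` is an assumed value, not the one used for BioMatrix.
    """
    subword_ids = tokenizer(description, add_special_tokens=False)["input_ids"]
    mean_embedding = embedding_matrix[subword_ids].mean(dim=0)
    return mean_embedding + noise_std * torch.randn_like(mean_embedding)

# e.g., ground a new tryptophan residue token in the embedding of "Tryptophan":
# new_row = init_token_embedding_from_description(qwen_tokenizer, qwen_embedding_matrix, "Tryptophan")
```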
## Pretraining Corpus (304.4B tokens)

| Category | Subset | Tokens | Sources |
|----------|--------|--------|---------|
| **Text** (105.3B) | General | 25.6B | FineWeb-Edu |
| | Scientific | 79.7B | FineFineWeb (biology/chemistry/medical/health), PubMed full articles |
| **Molecule** (73.7B) | 1D | 36.0B | PubChem, MolTextNet |
| | 3D | 17.6B | PubChem, PCQM4Mv2, PubChemQC |
| | Other | 24.0B | Text descriptions, properties, IUPAC names |
| **Protein** (77.4B) | 1D | 17.1B | UniRef50 |
| | 3D | 38.5B | RCSB PDB, AlphaFold DB |
| | Other | 19.5B | Swiss-Prot, TrEMBL annotations |
| | Other (additional) | 2.9B | |
| **Cross-entity** (48.0B) | Interleaved text | 17.1B | PubMed, bioRxiv, S2ORC, USPTO |
| | 3D | 11.4B | CrossDocked, PPIRef |
| | Other | 19.5B | BindingDB, STITCH, jglaser, AlphaSeq |

### Training Configuration

- **Framework**: LLaMA-Factory
- **Hardware**: 64 NVIDIA H100 GPUs
- **Global Batch Size**: 1,024
- **Maximum Sequence Length**: 8,192 tokens
- **Optimizer**: AdamW
- **Peak Learning Rate**: 2.0 Γ— 10⁻⁴ (cosine schedule)
- **Warmup Steps**: 2,000
- **Total Steps**: ~36.4K (1 epoch over the full 304.4B-token corpus)

## Intended Use

This **Base model is not instruction-tuned**. It is suitable for:

- **Further fine-tuning** on custom biological tasks
- **Continued pretraining** on domain-specific corpora
- **Research on representation learning** across biomolecular modalities
- **Embedding extraction** for downstream classification/regression tasks (a minimal sketch appears after the Limitations section)

For ready-to-use instruction-following capabilities (e.g., molecule captioning, protein design, property prediction), please use the [SFT variant](https://huggingface.co/QizhiPei/BioMatrix-1.7B-SFT).

## Quick Start

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "QizhiPei/BioMatrix-1.7B-Base"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto",
    trust_remote_code=True
)

# Example: Continue a SMILES sequence
prompt = "<|mol_smi_start|>CC(=O)"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=512)
print(tokenizer.decode(outputs[0], skip_special_tokens=False))
```

## Modality Wrapping

When constructing inputs, biomolecular content must be wrapped with the corresponding control tokens:

| Modality | Wrapping Example |
|----------|------------------|
| Molecule SMILES | `<\|mol_smi_start\|>CC#CC#N<\|mol_smi_end\|>` |
| Molecule SELFIES | `<\|mol_sfi_start\|>[C][#C][C][#N]<\|mol_sfi_end\|>` |
| Molecule 3D | `<\|mol_3d_start\|>[H 3][C 0][#C 6]...<\|mol_3d_end\|>` |
| Protein 1D | `<\|prot_aa_start\|>...<\|prot_aa_end\|>` |
| Protein 3D | `<\|prot_3d_start\|>...<\|prot_3d_end\|>` |

Natural language text is left unwrapped and serves as the default carrier modality. A wrapping helper is sketched after the Limitations section below.

## Limitations

- This model is **not instruction-tuned** and is unlikely to follow natural-language instructions out of the box. Use the SFT variant for instruction-following.
- Molecular and protein 3D structures are tokenized in **disjoint geometric reference frames**, so the model cannot natively represent biomolecular complexes (e.g., docking poses).
- Heavy domain specialization may erode some general-purpose language capabilities of the underlying Qwen3 backbone.
- Coverage is limited to **small molecules and proteins**; nucleic acids, carbohydrates, and lipids are not currently supported.
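## Usage Sketch: Modality Wrapping and Embedding Extraction

As referenced in the Intended Use and Modality Wrapping sections, the sketch below is a minimal, illustrative example rather than part of the released codebase: the `wrap` helper, the example protein sequence, and the mean-pooling choice are assumptions; only the control tokens come from the table above.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "QizhiPei/BioMatrix-1.7B-Base"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype="auto", device_map="auto", trust_remote_code=True
)

def wrap(seq: str, modality: str) -> str:
    """Wrap raw biomolecular content with the control tokens from the Modality Wrapping table."""
    tags = {
        "smiles": ("<|mol_smi_start|>", "<|mol_smi_end|>"),
        "selfies": ("<|mol_sfi_start|>", "<|mol_sfi_end|>"),
        "protein_1d": ("<|prot_aa_start|>", "<|prot_aa_end|>"),
    }
    start, end = tags[modality]
    return f"{start}{seq}{end}"

# Mean-pool the last-layer hidden states into a sequence-level embedding.
# (Mean pooling is one reasonable choice, not a prescribed recipe; the
# protein sequence below is an arbitrary example.)
prompt = wrap("MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ", "protein_1d")
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
with torch.no_grad():
    last_hidden = model(**inputs, output_hidden_states=True).hidden_states[-1]
embedding = last_hidden.mean(dim=1)  # shape: (1, hidden_size)
```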
## Citation

If you find BioMatrix useful, please cite:

```bibtex
@article{pei2026biomatrix,
  title={BioMatrix: Towards a Comprehensive Biological Foundation Model Spanning the Modality Matrix of Sequences, Structures, and Language},
  author={Pei, Qizhi and Zhou, Zhimeng and Duan, Yi and Zhao, Yiyang and He, Liang and Hsieh, Chang-Yu and He, Conghui and Yan, Rui and Wu, Lijun},
  year={2026}
}
```

## License

This model is released under the Apache 2.0 license. The base model (Qwen3-1.7B-Base) is subject to its own license terms.