NVIDIA-Nemotron-3-Bio-tokenizer
Tokenizer based on NVIDIA Nemotron-3 with 5 biological modalities injected: single-cell transcriptomics (scRNA-seq), BEL pathways, protein sequences, DNA methylation, and biomedical text.
Overview
This tokenizer was created by:
- Removing tokens for distant writing systems (Arabic, Cyrillic, CJK, Hangul, Devanagari, and 14 other scripts) from the original Nemotron-3
- Removing long non-bio Latin tokens (diacritics-heavy and final-merge tokens) that are not needed for biomedical NLP
- Injecting 44,943 HGNC gene symbols as single never-split tokens
- Injecting ~300 special tokens for multi-modal bio input: modality markers, BEL DSL, protein amino acids, bin tokens, namespace prefixes
The vocab size is preserved at 131,072 (2^17), matching the original Nemotron-3 architecture.
Key Features
| Feature | Detail |
|---|---|
| Vocab size | 131,072 (unchanged from original) |
| Merges | 199,108 |
| BPE type | Byte-level (GPT-2 style) |
| HGNC genes | 44,943 single tokens (100% single-token encoding) |
| Modalities | Text, scRNA-seq, BEL pathways, Proteins (FASTA), DNA methylation |
| Scripts kept | Latin + Greek |
| Base model | nvidia/Nemotron-3-Nano-3B-Instruct |
Supported Modalities
1. Biomedical Text
Standard English biomedical text. Bio/medical terms (immunohistochemistry, phosphorylation, chromatography, etc.) are preserved, and Greek letters are kept for formulas.
2. Single-cell Transcriptomics (scRNA-seq)
<SC> CELL 17
HGNC:MS4A1 EXPR_BIN_73
HGNC:CD79A EXPR_BIN_68
HGNC:CD74 EXPR_BIN_61
HGNC:LYZ EXPR_BIN_5
</SC>
Each gene is a single token. Expression values are single bin tokens: EXPR_BIN_0 through EXPR_BIN_100 (101 bins).
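The layout above can be assembled programmatically. The following is a minimal sketch, assuming expression values have already been binned to integers in 0-100; the helper name `format_sc_record` and its arguments are hypothetical and not part of the tokenizer.

```python
def format_sc_record(cell_id, gene_bins):
    """Render one cell as an <SC> block (hypothetical helper).

    gene_bins: iterable of (gene_symbol, bin) pairs, bin an int in 0..100.
    """
    lines = [f"<SC> CELL {cell_id}"]
    for symbol, bin_value in gene_bins:
        if not 0 <= bin_value <= 100:
            raise ValueError(f"bin out of range for {symbol}: {bin_value}")
        # One gene token followed by one bin token per line.
        lines.append(f"HGNC:{symbol} EXPR_BIN_{bin_value}")
    lines.append("</SC>")
    return "\n".join(lines)


record = format_sc_record(17, [("MS4A1", 73), ("CD79A", 68), ("CD74", 61), ("LYZ", 5)])
print(record)
```

The resulting string can be passed to the tokenizer as-is; whether one gene per line or a single space-separated line is used is a formatting choice this sketch does not settle.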
3. BEL Pathways (Biological Expression Language)
<BEL>
p(HGNC:AKT1) directlyIncreases act(p(HGNC:MTOR))
p(HGNC:TP53) decreases bp(GO:"cell cycle")
</BEL>
BEL functions (p(, a(, bp(, etc.) and relations (increases, directlyIncreases, etc.) are single tokens.
4. Protein Sequences (FASTA-style)
<PROT>
<p>M <p>V <p>L <p>S <p>P <p>A <p>D <p>K
</PROT>
Amino acids use <p> prefix to avoid conflict with English letters. 20 standard + 7 extended amino acid tokens.
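A raw amino-acid string has to be rewritten into the prefixed form before tokenization. Here is a small sketch of that conversion, assuming the 20 standard one-letter codes plus the 7 extended symbols listed in the token inventory; `format_protein` is a hypothetical helper name.

```python
STANDARD = set("ACDEFGHIKLMNPQRSTVWY")  # 20 standard amino acids
EXTENDED = set("XBZUO*-")               # 7 extended symbols from the inventory


def format_protein(sequence):
    """Rewrite a one-letter amino-acid string as <p>-prefixed tokens in a <PROT> block."""
    tokens = []
    for residue in sequence.upper():
        if residue not in STANDARD and residue not in EXTENDED:
            raise ValueError(f"unsupported residue: {residue!r}")
        tokens.append(f"<p>{residue}")
    return "<PROT>\n" + " ".join(tokens) + "\n</PROT>"


print(format_protein("MVLSPADK"))
```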
5. DNA Methylation
<METH>
HGNC:TP53 METH_BIN_82
HGNC:MLH1 METH_BIN_95
</METH>
Gene-centric mode: same HGNC tokens + single bin tokens: METH_BIN_0 through METH_BIN_100 (101 bins, beta-values 0-100).
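As an illustration of the bin mapping, the sketch below converts a beta-value to a bin token. It assumes beta-values arrive as fractions in [0, 1] and are scaled to 0-100 by rounding; the source does not state the exact rounding rule, and `meth_bin_token` is a hypothetical helper.

```python
def meth_bin_token(beta):
    """Map a methylation beta-value in [0, 1] to a METH_BIN_<0-100> token.

    Assumption: bins are the beta-value scaled to percent and rounded
    (Python's round() sends exact .5 cases to the nearest even integer).
    """
    if not 0.0 <= beta <= 1.0:
        raise ValueError(f"beta-value out of range: {beta}")
    return f"METH_BIN_{round(beta * 100)}"


pairs = [("TP53", 0.82), ("MLH1", 0.95)]
body = "\n".join(f"HGNC:{gene} {meth_bin_token(beta)}" for gene, beta in pairs)
print(f"<METH>\n{body}\n</METH>")
```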
Token Inventory
Modality Markers (12 tokens, special=True)
<TEXT> </TEXT> <SC> </SC> <BEL> </BEL> <PROT> </PROT> <METH> </METH> <SEP> <MASK>
Structural Tokens (8)
CELL SAMPLE META BATCH DONOR TISSUE MASK_GENE MASK_VALUE
Bin Tokens (202)
Expression bins: EXPR_BIN_0 through EXPR_BIN_100 (101 tokens for scRNA-seq quantile bins)
Methylation bins: METH_BIN_0 through METH_BIN_100 (101 tokens for DNA methylation beta-values)
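The expression bins are described as quantile bins, but the binning procedure itself is not specified here. One plausible reading is a rank-based mapping within a cell, sketched below purely as an assumption; `quantile_bins` is a hypothetical helper and may differ from the pipeline actually used.

```python
from bisect import bisect_left


def quantile_bins(values, max_bin=100):
    """Map each value to an integer bin in 0..max_bin by its rank among `values`."""
    ordered = sorted(values)
    n = len(ordered)
    if n < 2:
        return [0] * n  # a single value has no rank spread
    bins = []
    for v in values:
        rank = bisect_left(ordered, v)  # number of values strictly below v
        bins.append(round(rank / (n - 1) * max_bin))
    return bins


bins = quantile_bins([0.0, 2.5, 9.0])
tokens = [f"EXPR_BIN_{b}" for b in bins]
print(tokens)
```

Tied values share a bin under this scheme, which keeps the mapping deterministic for sparse count data.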
Namespace Prefixes (6)
HGNC: CHEBI: GO: GOBP: MESH: DO:
BEL DSL (43 tokens)
Functions: a( act( bp( sec( surf( complex( composite( deg( frag( fus( g( loc( m( ma( path( pop( p( pmod( rxn( r( tloc( var(
Relations: increases directlyIncreases decreases directlyDecreases association regulates positiveCorrelation negativeCorrelation causesNoChange transcribedTo translatedTo hasComponent hasComponents hasMember hasMembers hasActivity isA subProcessOf rateLimitingStepOf orthologous noCorrelation
Protein Amino Acids (27)
Standard: <p>A through <p>Y (20 tokens)
Extended: <p>X <p>B <p>Z <p>U <p>O <p>* <p>-
HGNC Genes (44,943)
Full HGNC "complete set" as HGNC:<SYMBOL> tokens. Examples: HGNC:TP53, HGNC:BRCA1, HGNC:EGFR, HGNC:MS4A1.
Source: https://www.genenames.org/download/
What Was Removed
| Category | Tokens removed |
|---|---|
| Non-Latin/non-Greek scripts (Arabic, Cyrillic, CJK, Hangul, Devanagari, Hiragana, Armenian, Hebrew, Telugu, Bengali, Katakana, Thai, Kannada, Tamil, Georgian, Malayalam, Gujarati, Myanmar) | 34,952 |
| <SPECIAL_N> placeholder tokens | 982 |
| Final Latin tokens with diacritics (German, French, Spanish, Portuguese, Turkish, etc.) | 5,705 |
| Long non-bio ASCII Latin final tokens (len >= 11) | 3,602 |
| Mixed-script tokens | 2 |
| Total removed | 45,243 |
Preserved: All bio/medical English terms (170 tokens like immunohistochemistry, phosphorylation, chromatography, mitochondrial, etc.), all Greek letters, all merge-component tokens (BPE chain integrity).
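The table's arithmetic can be checked directly. This short sketch sums the per-category counts from the table and compares them with the stated total and the fixed vocabulary size; the dictionary keys are illustrative labels, not identifiers from the tokenizer files.

```python
# Per-category removal counts copied from the table above.
removed = {
    "non_latin_non_greek_scripts": 34952,
    "special_n_placeholders": 982,
    "latin_with_diacritics": 5705,
    "long_non_bio_ascii_latin": 3602,
    "mixed_script": 2,
}

total_removed = sum(removed.values())
vocab_size = 2 ** 17  # 131,072

print(total_removed)               # 45243, matching the "Total removed" row
print(vocab_size - total_removed)  # vocabulary slots not freed by the removal
```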
Usage
Python (tokenizers)
from tokenizers import Tokenizer
tokenizer = Tokenizer.from_file("tokenizer.json")
# Single-cell example
text = "<SC> CELL 1 HGNC:TP53 EXPR_BIN_92 HGNC:BRCA1 EXPR_BIN_45 </SC>"
encoded = tokenizer.encode(text)
print(f"Tokens: {encoded.tokens}")
print(f"IDs: {encoded.ids}")
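# decode() skips special tokens by default; keep them so markers like <SC> survive the roundtrip
decoded = tokenizer.decode(encoded.ids, skip_special_tokens=False)
print(f"Decoded: {decoded}")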
Python (transformers)
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("transhumanist-already-exists/NVIDIA-Nemotron-3-Bio-tokenizer")
text = "<SC> CELL 1 HGNC:TP53 EXPR_BIN_92 </SC>"
tokens = tokenizer(text, add_special_tokens=False)
print(tokens["input_ids"])
Validation
| Test | Result |
|---|---|
| Vocab size = 131,072 | PASS |
| IDs contiguous 0-131,071 | PASS |
| Merge integrity (0 errors) | PASS |
| HGNC single-token encoding (100/100 sampled) | PASS |
| English text roundtrip | PASS |
| Greek letters roundtrip | PASS |
| Bio terms preserved | PASS |
| Non-Latin/non-Greek leak check (0 leaked) | PASS |
| scRNA format encode/decode | PASS |
| BEL format encode/decode | PASS |
| Protein format encode/decode | PASS |
| Methylation format encode/decode | PASS |
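Two of these checks are easy to reproduce. The sketch below shows one way the contiguous-ID and single-token checks could be written; it is not the author's validation script, the function names are hypothetical, and toy stand-ins replace the real tokenizer so the example runs on its own.

```python
def ids_are_contiguous(vocab):
    """True if the token IDs are exactly 0..len(vocab)-1 with no gaps or duplicates."""
    return sorted(vocab.values()) == list(range(len(vocab)))


def not_single_token(encode, strings):
    """Return the strings that `encode` does not map to exactly one ID."""
    return [s for s in strings if len(encode(s)) != 1]


# Toy stand-ins so the sketch runs without tokenizer.json.
toy_vocab = {"HGNC:TP53": 0, "HGNC:BRCA1": 1, "x": 2}


def toy_encode(text):
    # Known strings map to one ID; anything else falls back to one ID per character.
    return [toy_vocab[text]] if text in toy_vocab else [2] * len(text)


print(ids_are_contiguous(toy_vocab))
print(not_single_token(toy_encode, ["HGNC:TP53", "HGNC:BRCA1", "HGNC:FAKE"]))
```

With the real tokenizer loaded through the `tokenizers` library, `vocab` could come from `tokenizer.get_vocab()` and `encode` could be `lambda s: tokenizer.encode(s, add_special_tokens=False).ids`.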
Files
| File | Size | Description |
|---|---|---|
| tokenizer.json | 13 MB | Main tokenizer (HuggingFace JSON format) |
| tokenizer_config.json | 440 B | Tokenizer config for transformers |
| special_tokens_map.json | 251 B | Special tokens mapping |
| merge_info.json | 2.6 MB | Full list of removed and added tokens |
merge_info.json
Contains parallel lists of removed/added token IDs and migration metadata:
{
"replaced_nemotron_total_tokens_count": 45243,
"added_hgnc_gene_tokens_count": 44944,
"added_bio_special_tokens_count": 299,
"added_bin_tokens_count": 202,
"replaced_nemotron_tokens_ids": [1692, 1702, ...],
"bio_tokens_ids_in_bio_vocab": [86009, 86010, ...],
"bin_token_migration": {
"old_tokens_removed": ["EXPR", "BIN", "METH"],
"new_tokens_added_count": 202,
"expr_bin_range": "EXPR_BIN_0 to EXPR_BIN_100",
"meth_bin_range": "METH_BIN_0 to METH_BIN_100"
}
}
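The count fields in this file can be cross-checked against each other. The sketch below assumes that the replaced-token total should equal the sum of the two added-token counts and that the two bin counts should agree; both relations hold for the values in the excerpt, but they are inferred rather than documented, and `check_merge_info` is a hypothetical helper.

```python
def check_merge_info(info):
    """Return a list of human-readable problems found in the count fields."""
    problems = []
    added = info["added_hgnc_gene_tokens_count"] + info["added_bio_special_tokens_count"]
    if added != info["replaced_nemotron_total_tokens_count"]:
        problems.append(f"added ({added}) != replaced total")
    migration = info["bin_token_migration"]
    if migration["new_tokens_added_count"] != info["added_bin_tokens_count"]:
        problems.append("bin token counts disagree")
    return problems


# Count fields copied from the excerpt above (ID lists omitted).
info = {
    "replaced_nemotron_total_tokens_count": 45243,
    "added_hgnc_gene_tokens_count": 44944,
    "added_bio_special_tokens_count": 299,
    "added_bin_tokens_count": 202,
    "bin_token_migration": {"new_tokens_added_count": 202},
}
print(check_merge_info(info))
```

For the real file, `info` would be loaded with `json.load` from merge_info.json instead of being written inline.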
Technical Details
Unified Bin Tokens
Each bin value (0-100) is a single dedicated token, enabling 2-token gene-value encoding:
- scRNA: HGNC:<GENE> EXPR_BIN_<0-100> (e.g., HGNC:TP53 EXPR_BIN_92 = 2 tokens)
- Methylation: HGNC:<GENE> METH_BIN_<0-100> (e.g., HGNC:MLH1 METH_BIN_95 = 2 tokens)
This maximizes compression: each gene-value pair is exactly 2 tokens (gene + bin), compared to 4 tokens with the old multi-token format.
BPE Merge Chain Integrity
When tokens were removed, every merge that referenced a removed token was removed as well (70,335 orphaned merges in total). Within the Latin set, only tokens that are not used as a component of any merge (final/leaf tokens) were candidates for removal, which preserves BPE chain integrity.
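The two ideas in this paragraph, leaf tokens and orphaned merges, can be illustrated on a toy vocabulary. This is a sketch of the general technique under the assumption that merges are available as (left, right) string pairs; the function names are hypothetical and this is not the script used to build the tokenizer.

```python
def leaf_tokens(vocab, merges):
    """Tokens that never appear as the left or right part of any merge."""
    components = {part for pair in merges for part in pair}
    return {tok for tok in vocab if tok not in components}


def prune_merges(merges, removed):
    """Drop merges whose inputs or merged output were removed from the vocab."""
    return [
        (a, b)
        for a, b in merges
        if a not in removed and b not in removed and a + b not in removed
    ]


vocab = {"a", "b", "c", "ab", "bc", "abc"}
merges = [("a", "b"), ("ab", "c"), ("b", "c")]

leaves = leaf_tokens(vocab, merges)   # safe removal candidates
kept = prune_merges(merges, {"abc"})  # removing a leaf orphans only its own merge
print(sorted(leaves))
print(kept)
```

Removing a non-leaf token such as "ab" would also invalidate the merge that builds "abc" from it, which is why only leaves are safe candidates.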
Embedding Layer Update
When using this tokenizer with a Nemotron model, you need to:
- Replace the embedding and output projection layers
- Initialize new bio token embeddings (45,243 tokens at IDs 86,009+)
- Fine-tune / continually pre-train on bio corpora
Recommended initialization strategies:
- Random init + targeted pre-training
- NACHOS-style embedding initialization (see Kiulian et al., 2025)
- Mean of subword embeddings from original tokenizer
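The third strategy can be sketched in a few lines: a new token's vector is the average of the vectors of the pieces the original tokenizer would split it into. The example uses plain Python lists and toy stand-ins so it runs without a model; `mean_subword_init`, `old_encode`, and `old_embeddings` are hypothetical names, and a real setup would apply the same averaging to tensors from the model's embedding matrix.

```python
def mean_subword_init(new_token, old_encode, old_embeddings):
    """Average the original embeddings of the subwords that spell `new_token`.

    old_encode: callable mapping text to a list of original token IDs.
    old_embeddings: mapping from original token ID to its vector (list of floats).
    """
    ids = old_encode(new_token)
    if not ids:
        raise ValueError(f"original tokenizer produced no tokens for {new_token!r}")
    dim = len(old_embeddings[ids[0]])
    return [sum(old_embeddings[i][d] for i in ids) / len(ids) for d in range(dim)]


# Toy stand-ins: the "original tokenizer" splits the gene token into two pieces.
toy_embeddings = {0: [1.0, 0.0], 1: [0.0, 1.0]}


def toy_old_encode(text):
    return [0, 1]


vector = mean_subword_init("HGNC:TP53", toy_old_encode, toy_embeddings)
print(vector)
```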
References
- Vocabulary design: Based on token_groups_blueprint.md multi-modal token design
- HGNC gene nomenclature: genenames.org
- BEL language specification: language.bel.bio
- Protein tokenization approach: BioT5+ (Pei et al., 2024), ProtT5 (Elnaggar et al., 2022)
- Base model: nvidia/Nemotron-3-Nano-3B-Instruct
- Embedding initialization: Kiulian et al. (2025) "From English-Centric to Effective Bilingual: LLMs with Custom Tokenizers for Underrepresented Languages"
License
Follows the base model license:
- NVIDIA Nemotron: NVIDIA Open Model License
Created
2026-03-03
Author
Bogdan Didenko