NVIDIA-Nemotron-3-Bio-tokenizer

Tokenizer based on NVIDIA Nemotron-3 with 5 biological modalities injected: single-cell transcriptomics (scRNA-seq), BEL pathways, protein sequences, DNA methylation, and biomedical text.

Overview

This tokenizer was created by:

Removing tokens for distant writing systems (Arabic, Cyrillic, CJK, Hangul, Devanagari, and 14 other scripts) from the original Nemotron-3
Removing long non-bio Latin tokens (diacritics-heavy and final-merge tokens) that are not needed for biomedical NLP
Injecting 44,943 HGNC gene symbols as single never-split tokens
Injecting ~300 special tokens for multi-modal bio input: modality markers, BEL DSL, protein amino acids, bin tokens, namespace prefixes

The vocab size is preserved at 131,072 (2^17), matching the original Nemotron-3 architecture.

Key Features

Feature	Detail
Vocab size	131,072 (unchanged from original)
Merges	199,108
BPE type	Byte-level (GPT-2 style)
HGNC genes	44,943 single tokens (100% single-token encoding)
Modalities	Text, scRNA-seq, BEL pathways, Proteins (FASTA), DNA methylation
Scripts kept	Latin + Greek
Base model	nvidia/Nemotron-3-Nano-3B-Instruct

Supported Modalities

1. Biomedical Text

Standard English biomedical text. Bio/medical terms are preserved (for example immunohistochemistry, phosphorylation, chromatography), and Greek letters are preserved for use in formulas.

2. Single-cell Transcriptomics (scRNA-seq)

<SC> CELL 17
HGNC:MS4A1 EXPR_BIN_73
HGNC:CD79A EXPR_BIN_68
HGNC:CD74  EXPR_BIN_61
HGNC:LYZ   EXPR_BIN_5
</SC>

Each gene is a single token. Expression values are single bin tokens: EXPR_BIN_0 through EXPR_BIN_100 (101 bins).
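A minimal sketch of serializing one cell into this layout. The helper name `format_sc_cell` is hypothetical, and it assumes expression values have already been mapped to integer bins 0-100 (the binning procedure itself is not specified here). It emits single spaces between gene and bin rather than the column alignment shown above.

```python
def format_sc_cell(cell_id, gene_bins):
    """Serialize (symbol, bin) pairs into the <SC> block layout shown above.

    gene_bins: iterable of (HGNC symbol, integer bin in 0..100) pairs.
    """
    lines = [f"<SC> CELL {cell_id}"]
    for symbol, bin_index in gene_bins:
        # Only EXPR_BIN_0 .. EXPR_BIN_100 exist as dedicated tokens.
        if not 0 <= bin_index <= 100:
            raise ValueError(f"bin {bin_index} for {symbol} is outside 0-100")
        lines.append(f"HGNC:{symbol} EXPR_BIN_{bin_index}")
    lines.append("</SC>")
    return "\n".join(lines)


print(format_sc_cell(17, [("MS4A1", 73), ("CD79A", 68), ("LYZ", 5)]))
```

Rejecting out-of-range bins early avoids producing strings such as EXPR_BIN_101, which would not correspond to a dedicated token.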

3. BEL Pathways (Biological Expression Language)

<BEL>
p(HGNC:AKT1) directlyIncreases act(p(HGNC:MTOR))
p(HGNC:TP53) decreases bp(GO:"cell cycle")
</BEL>

BEL functions (p(, a(, bp(, etc.) and relations (increases, directlyIncreases, etc.) are single tokens.
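As an illustration of how the relation tokens structure a statement, the sketch below splits a BEL line into subject, relation, and object. `split_bel_statement` is a hypothetical helper, not part of any BEL library. It assumes terms are separated from the relation by spaces and that parentheses inside quoted names are balanced.

```python
# The 21 relation tokens listed in the token inventory of this card.
RELATIONS = {
    "increases", "directlyIncreases", "decreases", "directlyDecreases",
    "association", "regulates", "positiveCorrelation", "negativeCorrelation",
    "causesNoChange", "transcribedTo", "translatedTo", "hasComponent",
    "hasComponents", "hasMember", "hasMembers", "hasActivity", "isA",
    "subProcessOf", "rateLimitingStepOf", "orthologous", "noCorrelation",
}


def split_bel_statement(statement):
    """Return (subject, relation, object) for a single BEL statement."""
    depth = 0  # current parenthesis nesting level
    pos = 0    # character offset of the current word in the statement
    for word in statement.split(" "):
        # A relation only counts when it sits outside every function call.
        if depth == 0 and word in RELATIONS:
            subject = statement[:pos].strip()
            obj = statement[pos + len(word):].strip()
            return subject, word, obj
        depth += word.count("(") - word.count(")")
        pos += len(word) + 1
    raise ValueError(f"no known relation found in: {statement!r}")


print(split_bel_statement('p(HGNC:TP53) decreases bp(GO:"cell cycle")'))
```

Tracking parenthesis depth keeps a relation-like word nested inside a term from being mistaken for the top-level relation.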

4. Protein Sequences (FASTA-style)

<PROT>
<p>M <p>V <p>L <p>S <p>P <p>A <p>D <p>K
</PROT>

Amino acids use the <p> prefix to avoid conflicts with English letters. There are 20 standard and 7 extended amino acid tokens.
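A short sketch of turning a plain residue string into the prefixed layout from the example above. `format_protein` and `PROTEIN_ALPHABET` are hypothetical names; the alphabet is taken from the 20 standard and 7 extended symbols listed in this card.

```python
# 20 standard residues plus the 7 extended symbols (X B Z U O * -).
PROTEIN_ALPHABET = set("ACDEFGHIKLMNPQRSTVWY") | set("XBZUO*-")


def format_protein(sequence):
    """Wrap a residue string in <PROT> markers with one <p>-prefixed token per residue."""
    residues = []
    for residue in sequence.upper():
        if residue not in PROTEIN_ALPHABET:
            raise ValueError(f"unsupported residue symbol: {residue!r}")
        residues.append(f"<p>{residue}")
    return "<PROT>\n" + " ".join(residues) + "\n</PROT>"


print(format_protein("MVLSPADK"))
```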

5. DNA Methylation

<METH>
HGNC:TP53 METH_BIN_82
HGNC:MLH1 METH_BIN_95
</METH>

Gene-centric mode: the same HGNC gene tokens, each followed by a single bin token from METH_BIN_0 through METH_BIN_100 (101 bins, beta-values expressed on a 0-100 scale).
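One plausible way to map a beta-value to a bin token is sketched below. It assumes beta-values arrive in the usual 0.0-1.0 range and are rounded to the nearest integer percent; the exact binning rule used for training data is not stated in this card, so treat the rounding as an assumption. `beta_to_meth_token` and `format_meth_line` are hypothetical helpers.

```python
def beta_to_meth_token(beta):
    """Map a methylation beta-value in [0, 1] to one of METH_BIN_0 .. METH_BIN_100."""
    if not 0.0 <= beta <= 1.0:
        raise ValueError(f"beta-value {beta} is outside [0, 1]")
    # Assumption: nearest-integer percent; the source only gives the 0-100 range.
    return f"METH_BIN_{round(beta * 100)}"


def format_meth_line(symbol, beta):
    """Build one gene-centric line: an HGNC gene token followed by its bin token."""
    return f"HGNC:{symbol} {beta_to_meth_token(beta)}"


print(format_meth_line("TP53", 0.82))
```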

Token Inventory

Modality Markers (12 tokens, special=True)

<TEXT> </TEXT> <SC> </SC> <BEL> </BEL> <PROT> </PROT> <METH> </METH> <SEP> <MASK>

Structural Tokens (8)

CELL SAMPLE META BATCH DONOR TISSUE MASK_GENE MASK_VALUE

Bin Tokens (202)

Expression bins: EXPR_BIN_0 through EXPR_BIN_100 (101 tokens for scRNA-seq quantile bins)
Methylation bins: METH_BIN_0 through METH_BIN_100 (101 tokens for DNA methylation beta-values)

Namespace Prefixes (6)

HGNC: CHEBI: GO: GOBP: MESH: DO:

BEL DSL (43 tokens)

Functions: a( act( bp( sec( surf( complex( composite( deg( frag( fus( g( loc( m( ma( path( pop( p( pmod( rxn( r( tloc( var(

Relations: increases directlyIncreases decreases directlyDecreases association regulates positiveCorrelation negativeCorrelation causesNoChange transcribedTo translatedTo hasComponent hasComponents hasMember hasMembers hasActivity isA subProcessOf rateLimitingStepOf orthologous noCorrelation

Protein Amino Acids (27)

Standard: A through Y (20 tokens)
Extended: X B Z U O * -
Each symbol carries the <p> prefix (for example <p>M).

HGNC Genes (44,943)

Full HGNC "complete set" as HGNC:<SYMBOL> tokens. Examples: HGNC:TP53, HGNC:BRCA1, HGNC:EGFR, HGNC:MS4A1.

Source: https://www.genenames.org/download/

What Was Removed

Category	Tokens removed
Non-Latin/non-Greek scripts (Arabic, Cyrillic, CJK, Hangul, Devanagari, Hiragana, Armenian, Hebrew, Telugu, Bengali, Katakana, Thai, Kannada, Tamil, Georgian, Malayalam, Gujarati, Myanmar)	34,952
`<SPECIAL_N>` placeholder tokens	982
Final Latin tokens with diacritics (German, French, Spanish, Portuguese, Turkish, etc.)	5,705
Long non-bio ASCII Latin final tokens (len >= 11)	3,602
Mixed-script tokens	2
Total removed	45,243

Preserved: All bio/medical English terms (170 tokens like immunohistochemistry, phosphorylation, chromatography, mitochondrial, etc.), all Greek letters, all merge-component tokens (BPE chain integrity).
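The row counts in the table can be checked against the stated total with a few lines of arithmetic. The dictionary keys below are shorthand labels chosen for this sketch; the numbers are copied from the table.

```python
# Tokens removed per category, as listed in the table above.
removed = {
    "non_latin_non_greek_scripts": 34_952,
    "special_n_placeholders": 982,
    "latin_with_diacritics": 5_705,
    "long_non_bio_ascii_latin": 3_602,
    "mixed_script": 2,
}

total_removed = sum(removed.values())
VOCAB_SIZE = 2 ** 17  # 131,072, unchanged from the original tokenizer

# Original tokens that remain once the removed slots are set aside.
kept_original = VOCAB_SIZE - total_removed

print(total_removed)   # 45243
print(kept_original)   # 85829
```

Because the vocabulary size is preserved, the 45,243 freed slots are the same slots that are refilled with bio tokens.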

Usage

Python (tokenizers)

from tokenizers import Tokenizer

tokenizer = Tokenizer.from_file("tokenizer.json")

# Single-cell example
text = "<SC> CELL 1 HGNC:TP53 EXPR_BIN_92 HGNC:BRCA1 EXPR_BIN_45 </SC>"
encoded = tokenizer.encode(text)
print(f"Tokens: {encoded.tokens}")
print(f"IDs: {encoded.ids}")
decoded = tokenizer.decode(encoded.ids)
print(f"Decoded: {decoded}")

Python (transformers)

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("transhumanist-already-exists/NVIDIA-Nemotron-3-Bio-tokenizer")

text = "<SC> CELL 1 HGNC:TP53 EXPR_BIN_92 </SC>"
tokens = tokenizer(text, add_special_tokens=False)
print(tokens["input_ids"])
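To spot-check the single-token claim for gene symbols, a small helper along the following lines may be useful. `multi_token_symbols` is a hypothetical name; it takes any callable that maps text to a list of token IDs, so it works with either library.

```python
def multi_token_symbols(encode, symbols, prefix="HGNC:"):
    """Return {token_text: ids} for every symbol that does not encode to exactly one ID.

    encode: callable mapping a string to a list of token IDs.
    """
    failures = {}
    for symbol in symbols:
        text = f"{prefix}{symbol}"
        ids = encode(text)
        if len(ids) != 1:
            failures[text] = ids
    return failures
```

With the `tokenizers` object loaded as above, the callable could be `lambda t: tokenizer.encode(t, add_special_tokens=False).ids`. An empty result means every sampled symbol encoded to a single token.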

Validation

Test	Result
Vocab size = 131,072	PASS
IDs contiguous 0-131,071	PASS
Merge integrity (0 errors)	PASS
HGNC single-token encoding (100/100 sampled)	PASS
English text roundtrip	PASS
Greek letters roundtrip	PASS
Bio terms preserved	PASS
Non-Latin/non-Greek leak check (0 leaked)	PASS
scRNA format encode/decode	PASS
BEL format encode/decode	PASS
Protein format encode/decode	PASS
Methylation format encode/decode	PASS
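Two of these checks are simple enough to sketch. The functions below are illustrative re-implementations under stated assumptions, not the test suite that produced the table: `check_contiguous_ids` expects a token-to-ID mapping, and `roundtrips` expects encode/decode callables supplied by the caller.

```python
def check_contiguous_ids(vocab, expected_size):
    """True if the IDs in a token -> ID mapping are exactly 0 .. expected_size - 1."""
    return sorted(vocab.values()) == list(range(expected_size))


def roundtrips(encode, decode, text):
    """True if decoding the encoded IDs reproduces the input text exactly."""
    return decode(encode(text)) == text
```

With the `tokenizers` library, `tokenizer.get_vocab()` returns a suitable mapping for the first check, with 131072 as the expected size.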

Files

File	Size	Description
`tokenizer.json`	13 MB	Main tokenizer (HuggingFace JSON format)
`tokenizer_config.json`	440 B	Tokenizer config for transformers
`special_tokens_map.json`	251 B	Special tokens mapping
`merge_info.json`	2.6 MB	Full list of removed and added tokens

merge_info.json

Contains parallel lists of removed/added token IDs and migration metadata:

{
  "replaced_nemotron_total_tokens_count": 45243,
  "added_hgnc_gene_tokens_count": 44944,
  "added_bio_special_tokens_count": 299,
  "added_bin_tokens_count": 202,
  "replaced_nemotron_tokens_ids": [1692, 1702, ...],
  "bio_tokens_ids_in_bio_vocab": [86009, 86010, ...],
  "bin_token_migration": {
    "old_tokens_removed": ["EXPR", "BIN", "METH"],
    "new_tokens_added_count": 202,
    "expr_bin_range": "EXPR_BIN_0 to EXPR_BIN_100",
    "meth_bin_range": "METH_BIN_0 to METH_BIN_100"
  }
}
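A minimal sketch of loading this file and comparing the ID lists with the declared count. The field names are taken from the excerpt above; that the two ID lists have equal length is an assumption based on the word "parallel", which is why the sketch reports the comparison instead of asserting it.

```python
import json


def load_merge_info(path):
    """Read merge_info.json from disk and return it as a dict."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


def summarize_merge_info(info):
    """Compare the lengths of the removed/added ID lists with the declared total."""
    removed_ids = info["replaced_nemotron_tokens_ids"]
    added_ids = info["bio_tokens_ids_in_bio_vocab"]
    return {
        "removed_listed": len(removed_ids),
        "added_listed": len(added_ids),
        "removed_declared": info["replaced_nemotron_total_tokens_count"],
        "lists_parallel": len(removed_ids) == len(added_ids),
    }
```

Typical use would be `summarize_merge_info(load_merge_info("merge_info.json"))`.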

Technical Details

Unified Bin Tokens

Each bin value (0-100) is a single dedicated token, enabling 2-token gene-value encoding:

scRNA: HGNC:<GENE> EXPR_BIN_<0-100> (e.g., HGNC:TP53 EXPR_BIN_92 = 2 tokens)
Methylation: HGNC:<GENE> METH_BIN_<0-100> (e.g., HGNC:MLH1 METH_BIN_95 = 2 tokens)

This maximizes compression: each gene-value pair is exactly 2 tokens (gene + bin), compared to 4 tokens with the old multi-token format.

BPE Merge Chain Integrity

When tokens were removed, every merge that referenced a removed token was removed as well (70,335 orphaned merges in total). Within the Latin set, only tokens that are not used as a component of any merge (final/leaf tokens) were candidates for removal, which preserves BPE chain integrity.
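The idea can be illustrated on a toy vocabulary. This is a sketch of the general technique, not the script used to build this tokenizer: merges are modeled as (left, right) string pairs whose result is the concatenation, and the function names are hypothetical.

```python
def leaf_tokens(vocab, merges):
    """Tokens that never appear as the left or right component of any merge."""
    components = {part for pair in merges for part in pair}
    return {token for token in vocab if token not in components}


def drop_orphaned_merges(merges, removed):
    """Keep only merges whose two components and merged result all survive removal."""
    return [
        (left, right)
        for left, right in merges
        if left not in removed
        and right not in removed
        and left + right not in removed
    ]


vocab = {"a", "b", "c", "ab", "abc"}
merges = [("a", "b"), ("ab", "c")]

leaves = leaf_tokens(vocab, merges)              # only "abc" is a leaf
remaining = drop_orphaned_merges(merges, leaves)
print(leaves, remaining)
```

Removing the leaf "abc" drops only the merge that produced it. Removing a non-leaf such as "ab" would also invalidate the merge that consumes it, which is why leaf-only removal keeps the remaining chain intact.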

Embedding Layer Update

When using this tokenizer with a Nemotron model, you need to:

Replace the embedding and output projection layers
Initialize new bio token embeddings (45,243 tokens at IDs 86,009+)
Fine-tune / continually pre-train on bio corpora

Recommended initialization strategies:

Random init + targeted pre-training
NACHOS-style embedding initialization (see Kiulian et al., 2025)
Mean of subword embeddings from original tokenizer
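The third strategy in the list above can be sketched without any framework. Everything here is illustrative: `mean_subword_init` is a hypothetical helper, `old_encode` stands for the original tokenizer's text-to-IDs function, and `old_embeddings` is any sequence of vectors indexed by original token ID. In practice the same averaging would be done with the model framework's tensor operations.

```python
def mean_subword_init(new_tokens, old_encode, old_embeddings):
    """Initialize each new token as the mean of the old embeddings of its subword pieces.

    new_tokens: iterable of new token strings (e.g. "HGNC:TP53").
    old_encode: callable mapping text to a list of ORIGINAL tokenizer IDs.
    old_embeddings: sequence of equal-length vectors indexed by original ID.
    """
    initialized = {}
    for token in new_tokens:
        piece_ids = old_encode(token)
        if not piece_ids:
            raise ValueError(f"original tokenizer produced no pieces for {token!r}")
        vectors = [old_embeddings[i] for i in piece_ids]
        # Average dimension by dimension across the subword vectors.
        initialized[token] = [sum(dim) / len(vectors) for dim in zip(*vectors)]
    return initialized
```

The resulting vectors would then be written into the rows of the new embedding matrix that correspond to the bio token IDs.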

References

Vocabulary design: Based on token_groups_blueprint.md multi-modal token design
HGNC gene nomenclature: genenames.org
BEL language specification: language.bel.bio
Protein tokenization approach: BioT5+ (Pei et al., 2024), ProtT5 (Elnaggar et al., 2022)
Base model: nvidia/Nemotron-3-Nano-3B-Instruct
Embedding initialization: Kiulian et al. (2025) "From English-Centric to Effective Bilingual: LLMs with Custom Tokenizers for Underrepresented Languages"

License

Follows the base model license:

NVIDIA Nemotron: NVIDIA Open Model License

Created

2026-03-03

Author

Bogdan Didenko
