DeepTaxa: Hierarchical 16S rRNA Taxonomy Classification
DeepTaxa is a deep learning model for hierarchical taxonomy classification of 16S rRNA gene sequences. The architecture couples a convolutional branch, which captures local k-mer motifs, with a BERT-style transformer, which captures long-range context. Both branches operate over tokens produced by the DNABERT-2 byte-pair encoder. Predictions are generated jointly for all seven standard taxonomic ranks: domain, phylum, class, order, family, genus, and species.
Two checkpoints are released here: one trained on full-length 16S sequences, and one trained on V3-V4 amplicons. They share the same architecture and differ only in classifier head size.
Checkpoint selection
| Sequencing protocol | Recommended checkpoint | File |
|---|---|---|
| Sanger 27F/1492R, PacBio HiFi 16S, Oxford Nanopore long-read 16S, full-length reference lookup | Full-length v1 | deeptaxa-full-length-v1.pt |
| Illumina paired-end V3-V4 with 341F/805R primers | V3-V4 v1 | deeptaxa-v3v4-v1.pt |
Released checkpoints
| Checkpoint | Training data | Species Acc | Species F1 | Species ECE | Params | File size |
|---|---|---|---|---|---|---|
| Full-length v1 | 277,336 full-length 16S sequences (approximately 1,500 bp) from Greengenes2 | 92.97% | 92.00% | 0.0268 | 112.3 M | 450 MB |
| V3-V4 v1 | 273,003 in-silico V3-V4 extractions (approximately 420 bp) from Greengenes2 | 87.52% | 85.79% | 0.0347 | 99.5 M | 447 MB |
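The Species ECE column reports expected calibration error. This card does not state the binning scheme used, so the following is only a minimal sketch of the common equal-width formulation over top-1 confidences; the function name and the choice of 10 bins are assumptions, not taken from the DeepTaxa code.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE with equal-width confidence bins (n_bins=10 is an assumption)."""
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    bins = [[] for _ in range(n_bins)]
    for conf, hit in zip(confidences, correct):
        # Map a confidence in [0, 1] to a bin index; 1.0 goes in the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, bool(hit)))
    total = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(h for _, h in members) / len(members)
        # Weight each bin's |accuracy - confidence| gap by its share of samples.
        ece += (len(members) / total) * abs(accuracy - mean_conf)
    return ece
```

Under this definition, a value such as 0.0268 would mean that predicted confidence and observed accuracy differ by roughly 2.7 percentage points on average, weighted by bin occupancy.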
Both checkpoints are inference-only. Optimizer and scheduler state have been removed to reduce file size; resuming training from these checkpoints is not supported.
Both checkpoints use the same Hybrid CNN-BERT architecture. The V3-V4 checkpoint has approximately 11% fewer parameters (99.5 M versus 112.3 M) because roughly half of the Greengenes2 species cannot be recovered as V3-V4 amplicons and are therefore absent from its classifier heads.
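The size difference can be checked directly from the totals and species label counts reported in the training configuration table of this card. The snippet below is plain arithmetic on those published figures; the variable names are illustrative only.

```python
# Figures copied from the training configuration table in this card.
full_params = 112_253_717
v3v4_params = 99_520_142
full_species = 16_909
v3v4_species = 8_347

param_gap = full_params - v3v4_params
param_reduction = 1 - v3v4_params / full_params
species_missing = 1 - v3v4_species / full_species

print(f"Parameter gap: {param_gap:,}")
print(f"Parameter reduction: {param_reduction:.1%}")
print(f"Species absent from the V3-V4 label space: {species_missing:.1%}")
```

The gap of about 12.7 M parameters matches the difference between the two classifier head sizes (28.6 M and 15.9 M), which is consistent with the shared backbone being identical.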
Architecture
The architecture hyperparameters were selected via Optuna search on full-length data and applied unchanged to the V3-V4 checkpoint.
| Component | Value |
|---|---|
| tokenizer_name | zhihan1996/DNABERT-2-117M |
| max_length | 512 (tokens) |
| embed_dim | 896 |
| num_filters | 512 |
| kernel_sizes | [5, 7, 9] |
| num_conv_layers | 1 |
| hidden_size | 1024 |
| num_hidden_layers | 5 |
| num_attention_heads | 8 |
| intermediate_size | 4096 |
| hidden_dropout_prob | 0.1744 |
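For scripting, the hyperparameters above can be collected in a small configuration object. This is a sketch only: the class `DeepTaxaConfig` and its `head_dim` property are hypothetical and are not the configuration class shipped with the DeepTaxa package; only the field names and values come from the table.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DeepTaxaConfig:
    """Hypothetical container for the hyperparameters listed in this card."""
    tokenizer_name: str = "zhihan1996/DNABERT-2-117M"
    max_length: int = 512
    embed_dim: int = 896
    num_filters: int = 512
    kernel_sizes: tuple = (5, 7, 9)
    num_conv_layers: int = 1
    hidden_size: int = 1024
    num_hidden_layers: int = 5
    num_attention_heads: int = 8
    intermediate_size: int = 4096
    hidden_dropout_prob: float = 0.1744

    def __post_init__(self):
        # Multi-head attention needs the hidden size to split evenly across heads.
        if self.hidden_size % self.num_attention_heads != 0:
            raise ValueError("hidden_size must be divisible by num_attention_heads")

    @property
    def head_dim(self) -> int:
        return self.hidden_size // self.num_attention_heads


config = DeepTaxaConfig()
print(config.head_dim)
```

With the released values, each of the 8 attention heads works on a 128-dimensional slice, and the feed-forward width is four times the hidden size.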
Test-set performance
Both checkpoints were evaluated on their respective held-out Greengenes2 2024.09 test splits.
| Rank | Full-length Acc | Full-length F1 | V3-V4 Acc | V3-V4 F1 |
|---|---|---|---|---|
| Domain | 99.98% | 99.98% | 99.98% | 99.98% |
| Phylum | 99.70% | 99.69% | 99.70% | 99.68% |
| Class | 99.63% | 99.59% | 99.65% | 99.61% |
| Order | 99.05% | 98.95% | 98.95% | 98.84% |
| Family | 98.63% | 98.43% | 98.43% | 98.21% |
| Genus | 96.88% | 96.43% | 95.27% | 94.73% |
| Species | 92.97% | 92.00% | 87.52% | 85.79% |
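To see where the shorter region costs accuracy, the per-rank gap between the two checkpoints can be computed from the table above. The snippet only restates the published accuracy values; no new measurements are involved, and the two test sets differ in size and label space, so the comparison is indicative rather than exact.

```python
# (full-length accuracy %, V3-V4 accuracy %) copied from the table above.
accuracy = {
    "Domain": (99.98, 99.98),
    "Phylum": (99.70, 99.70),
    "Class": (99.63, 99.65),
    "Order": (99.05, 98.95),
    "Family": (98.63, 98.43),
    "Genus": (96.88, 95.27),
    "Species": (92.97, 87.52),
}

# Positive values mean the full-length checkpoint is more accurate.
gaps = {rank: round(full - v3v4, 2) for rank, (full, v3v4) in accuracy.items()}
widest = max(gaps, key=gaps.get)

for rank, gap in gaps.items():
    print(f"{rank:<8} {gap:+.2f} percentage points")
```

The two checkpoints are within 0.2 percentage points of each other down to family level; the gap opens at genus and is widest at species.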
Training configuration
| Parameter | Full-length v1 | V3-V4 v1 |
|---|---|---|
| Training data | Greengenes2 2024.09 training set (277,336 full-length sequences, approximately 1,500 bp) | In-silico V3-V4 extractions from the same training set (273,003 amplicons) |
| Test data | Greengenes2 2024.09 test split (69,335 full-length sequences) | V3-V4 extractions from the test split (68,282 amplicons) |
| Extraction primers | N/A | 341F CCTACGGGNGGCWGCAG and 805R GACTACHVGGGTATCTAATCC |
| Label space (species) | 16,909 | 8,347 |
| Label space (domain / phylum / class / order / family / genus) | 2 / 129 / 349 / 997 / 2,250 / 7,287 | 2 / 115 / 270 / 709 / 1,528 / 4,529 |
| Classifier head parameters | 28.6 M | 15.9 M |
| Total parameters | 112,253,717 | 99,520,142 |
| Learning rate | 3.72e-4 | 3.72e-4 |
| Batch size | 32 | 32 |
| Weight decay | 4.18e-2 | 4.18e-2 |
| Epochs | 10 | 10 |
| Loss | Cross-entropy with uniform per-level weights | Cross-entropy with uniform per-level weights |
| Optimizer | AdamW (beta1 = 0.9, beta2 = 0.999) | AdamW (beta1 = 0.9, beta2 = 0.999) |
| Learning rate schedule | Linear warm-up over 10% of steps, followed by linear decay | Linear warm-up over 10% of steps, followed by linear decay |
| Seed | 42 | 42 |
| Hardware | NVIDIA GeForce RTX 4090 | NVIDIA GeForce RTX 4090 |
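The learning-rate schedule in the table can be sketched as a simple function of the optimizer step. Several details here are assumptions that the card does not state: that the schedule is stepped once per batch, that the decay ends at zero, that there is no gradient accumulation, and that the last partial batch of each epoch is kept. The function name is hypothetical.

```python
import math


def linear_warmup_decay(step, total_steps, peak_lr=3.72e-4, warmup_frac=0.10):
    """Linear warm-up to peak_lr, then linear decay (to zero, by assumption)."""
    warmup_steps = max(1, int(total_steps * warmup_frac))
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    remaining = max(0, total_steps - step)
    return peak_lr * remaining / max(1, total_steps - warmup_steps)


# Full-length run: 277,336 sequences, batch size 32, 10 epochs.
steps_per_epoch = math.ceil(277_336 / 32)
total_steps = steps_per_epoch * 10
warmup_end = int(total_steps * 0.10)

print(total_steps, warmup_end)
print(linear_warmup_decay(warmup_end, total_steps))
```

Under these assumptions the full-length run has 86,670 optimizer steps, so the peak rate of 3.72e-4 is reached after 8,667 steps, which is about one epoch.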
Usage
Download
```bash
# Full-length checkpoint
wget https://huggingface.co/systems-genomics-lab/deeptaxa/resolve/main/deeptaxa-full-length-v1.pt

# V3-V4 checkpoint
wget https://huggingface.co/systems-genomics-lab/deeptaxa/resolve/main/deeptaxa-v3v4-v1.pt

# Or clone the full repository
git clone https://huggingface.co/systems-genomics-lab/deeptaxa
```
Python API with huggingface_hub:
```python
from huggingface_hub import hf_hub_download

# Full-length checkpoint: returns the local path of the cached file
full_length_ckpt = hf_hub_download(
    repo_id="systems-genomics-lab/deeptaxa",
    filename="deeptaxa-full-length-v1.pt",
)

# V3-V4 checkpoint
v3v4_ckpt = hf_hub_download(
    repo_id="systems-genomics-lab/deeptaxa",
    filename="deeptaxa-v3v4-v1.pt",
)
```
Install DeepTaxa and run predictions
```bash
pip install git+https://github.com/systems-genomics-lab/deeptaxa.git

# Full-length sequences (Sanger, PacBio HiFi, Oxford Nanopore)
deeptaxa predict \
  --fasta-file your_full_length_16s.fna.gz \
  --checkpoint deeptaxa-full-length-v1.pt \
  --output-dir predictions/

# V3-V4 amplicons (Illumina, already demultiplexed and primer-trimmed)
deeptaxa predict \
  --fasta-file your_v3v4_amplicons.fna.gz \
  --checkpoint deeptaxa-v3v4-v1.pt \
  --output-dir predictions/
```
Input preparation for V3-V4 amplicons: the input FASTA file should contain V3-V4 sequences that have already been demultiplexed and primer-trimmed by an upstream tool such as DADA2, cutadapt, or QIIME2. The V3-V4 checkpoint was trained on in-silico primer extractions, which approximate merged paired-end amplicons. Paired-end reads should therefore be merged into consensus amplicons prior to prediction, or the forward read alone may be provided.
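Before running predictions, it can be useful to confirm that primers really have been trimmed. The following sketch checks whether a sequence still starts with 341F or still ends with the reverse complement of 805R, expanding IUPAC degenerate bases into regular-expression character classes. It assumes forward-oriented sequences and exact matches with no mismatches or indels; the helper names are hypothetical and this is not part of the DeepTaxa CLI.

```python
import re

# IUPAC nucleotide codes expanded to regular-expression character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}
COMPLEMENT = str.maketrans("ACGTRYSWKMBDHVN", "TGCAYRSWMKVHDBN")

FWD_341F = "CCTACGGGNGGCWGCAG"
REV_805R = "GACTACHVGGGTATCTAATCC"


def reverse_complement(seq):
    """Reverse complement that also handles IUPAC degenerate codes."""
    return seq.translate(COMPLEMENT)[::-1]


def primer_regex(primer):
    return "".join(IUPAC[base] for base in primer)


def untrimmed_primers(seq):
    """Return (forward_present, reverse_present) for one forward-oriented read."""
    seq = seq.upper()
    has_fwd = re.match(primer_regex(FWD_341F), seq) is not None
    rev_pattern = primer_regex(reverse_complement(REV_805R)) + "$"
    has_rev = re.search(rev_pattern, seq) is not None
    return has_fwd, has_rev


example = "CCTACGGGAGGCAGCAG" + "ACGT" * 5 + "GGATTAGATACCCTAGTAGTC"
print(untrimmed_primers(example))
```

If either flag is true for a noticeable share of reads, the input has probably not been primer-trimmed and should be passed through the upstream trimming tool first.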
Full usage documentation and analysis notebooks are available in the GitHub repository (https://github.com/systems-genomics-lab/deeptaxa).
Limitations
Limitations that apply to both checkpoints:
- Approximately 44.8% of Greengenes2 species have only a single training example, which limits reliable prediction for those classes.
- The label space corresponds to Greengenes2 2024.09. Predictions are produced against the exact Greengenes2 hierarchy, and species absent from the training data cannot be predicted. Adapting the model to a different reference database, such as SILVA or GTDB, would require retraining.
- A GPU is strongly recommended; CPU inference is impractical for large datasets.
Limitations specific to the full-length checkpoint:
- Best performance is obtained on sequences of at least 1,200 bp; shorter amplicons should be classified using the V3-V4 checkpoint.
- Species-level accuracy plateaus near 93%.
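The 1,200 bp guideline above can be turned into a simple routing rule for mixed projects. This is a sketch under assumptions: the function name is hypothetical, and using the median length of a sample to decide is a choice made here for illustration, not a documented DeepTaxa behaviour. Note that short amplicons from regions other than V3-V4 are outside what either checkpoint was trained on.

```python
import statistics

FULL_LENGTH_CKPT = "deeptaxa-full-length-v1.pt"
V3V4_CKPT = "deeptaxa-v3v4-v1.pt"


def choose_checkpoint(lengths_bp, min_full_length_bp=1200):
    """Pick a checkpoint file name from the median sequence length of a sample."""
    if not lengths_bp:
        raise ValueError("lengths_bp must contain at least one sequence length")
    median_length = statistics.median(lengths_bp)
    if median_length >= min_full_length_bp:
        return FULL_LENGTH_CKPT
    return V3V4_CKPT


print(choose_checkpoint([1450, 1500, 1490]))
print(choose_checkpoint([410, 425, 430]))
```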
Limitations specific to the V3-V4 checkpoint:
- Species-level accuracy plateaus near 87.5%. The approximately 420 bp V3-V4 region carries less taxonomic information than the full 16S gene.
- The label space contains 8,347 species. Those for which no V3-V4 amplicon could be extracted during training are absent and cannot be predicted.
- Primer specificity: the model was trained on 341F/805R extractions. Sequences amplified with other V3-V4 primers, such as 357F or 338F, or with substantially different region boundaries may yield degraded predictions.
Citation
```bibtex
@software{DeepTaxa,
  author    = {{Systems Genomics Lab}},
  title     = {DeepTaxa: Hierarchical Taxonomy Classification of 16S rRNA Sequences with Deep Learning},
  year      = {2026},
  publisher = {GitHub},
  url       = {https://github.com/systems-genomics-lab/deeptaxa}
}
```
References
- Akiba, T., Sano, S., Yanase, T., et al. (2019). Optuna: A next-generation hyperparameter optimization framework. Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, 2623-2631. DOI: 10.1145/3292500.3330701
- Bolyen, E., Rideout, J.R., Dillon, M.R., et al. (2019). Reproducible, interactive, scalable and extensible microbiome data science using QIIME 2. Nature Biotechnology, 37(8), 852-857. DOI: 10.1038/s41587-019-0209-9
- Callahan, B.J., McMurdie, P.J., Rosen, M.J., et al. (2016). DADA2: High-resolution sample inference from Illumina amplicon data. Nature Methods, 13(7), 581-583. DOI: 10.1038/nmeth.3869
- Martin, M. (2011). Cutadapt removes adapter sequences from high-throughput sequencing reads. EMBnet.journal, 17(1), 10-12. DOI: 10.14806/ej.17.1.200
- McDonald, D., Jiang, Y., Balaban, M., et al. (2024). Greengenes2 unifies microbial data in a single reference tree. Nature Biotechnology, 42(5), 715-718. DOI: 10.1038/s41587-023-01845-1
- Parks, D.H., Chuvochina, M., Chaumeil, P.A., et al. (2020). A complete domain-to-species taxonomy for Bacteria and Archaea. Nature Biotechnology, 38(9), 1079-1086. DOI: 10.1038/s41587-020-0501-8
- Quast, C., Pruesse, E., Yilmaz, P., et al. (2013). The SILVA ribosomal RNA gene database project: improved data processing and web-based tools. Nucleic Acids Research, 41(D1), D590-D596. DOI: 10.1093/nar/gks1219
- Zhou, Z., Ji, Y., Li, W., et al. (2024). DNABERT-2: Efficient foundation model and benchmark for multi-species genomes. International Conference on Learning Representations. arXiv:2306.15006
Contact
For support, please open an issue on the GitHub repository.
Acknowledgments
- Hugging Face, for hosting datasets and models.
- The High-Performance Computing Team of the School of Sciences and Engineering at the American University in Cairo, for granting access to the GPU resources used in training.
Version history
v1 (April 2026). Initial release of the full-length and V3-V4 checkpoints.