Sherpa ONNX STT Models - INT8 Quantized Collection
A comprehensive collection of INT8 quantized speech-to-text models optimized for edge devices and production environments. All models are quantized using dynamic quantization to reduce size by ~50% while maintaining accuracy.
🎯 Model Overview
This collection includes 17 INT8 quantized models covering 7 languages:
| Language | Models | Architecture | Use Case |
|---|---|---|---|
| 🇬🇧 English | 5 models | Kroko + NeMo | Gaming, Reading, General |
| 🇩🇪 German | 2 models | Kroko | General Purpose |
| 🇪🇸 Spanish | 2 models | Kroko | General Purpose |
| 🇫🇷 French | 2 models | Kroko | General Purpose |
| 🇹🇷 Turkish | 2 models | Kroko | General Purpose |
| 🇮🇹 Italian | 2 models | Kroko | General Purpose |
| 🇵🇹 Portuguese | 2 models | Kroko | General Purpose |
Total Size: 2.38 GB (all INT8 quantized)
📦 Model Details
Kroko Models (Community)
Kroko models are high-quality streaming ASR models based on the Zipformer2 architecture with a transducer (RNN-T) decoder.
German (DE)
- kroko_64l: 147 MB (64-layer encoder)
- kroko_128l: 147 MB (128-layer encoder)
English (EN)
- kroko_64l: 147 MB (64-layer encoder)
- kroko_128l: 147 MB (128-layer encoder)
Spanish (ES)
- kroko_64l: 147 MB (64-layer encoder)
- kroko_128l: 147 MB (128-layer encoder)
French (FR)
- kroko_64l: 147 MB (64-layer encoder)
- kroko_128l: 147 MB (128-layer encoder)
Turkish (TR)
- kroko_64l: 147 MB (64-layer encoder)
- kroko_128l: 147 MB (128-layer encoder)
Italian (IT)
- kroko_64l: 147 MB (64-layer encoder)
- kroko_128l: 147 MB (128-layer encoder)
Portuguese (PT)
- kroko_64l: 147 MB (64-layer encoder)
- kroko_128l: 147 MB (128-layer encoder)
NeMo CTC Models (English)
Ultra-fast CTC-based models optimized for real-time applications:
- nemo_ctc_80ms: 126 MB - Ultra-fast (80ms latency) for gaming
- nemo_ctc_480ms: 126 MB - Balanced (480ms latency) for reading
- nemo_ctc_1040ms: 126 MB - High accuracy (1040ms latency)
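The three latency tiers can also be selected programmatically. A minimal sketch; the helper name and the "pick the slowest variant that fits" policy are illustrative, not part of the release:

```python
# Hypothetical helper: map a latency budget (in ms) to one of the three
# NeMo CTC model directories listed above.
NEMO_VARIANTS = [
    (80, "models/en/nemo_ctc_80ms"),      # ultra-fast, gaming
    (480, "models/en/nemo_ctc_480ms"),    # balanced, reading
    (1040, "models/en/nemo_ctc_1040ms"),  # high accuracy
]

def pick_nemo_model(latency_budget_ms: int) -> str:
    """Return the most accurate variant whose latency fits the budget."""
    best = None
    for latency, path in NEMO_VARIANTS:
        if latency <= latency_budget_ms:
            best = path  # later entries trade latency for accuracy
    if best is None:
        raise ValueError(f"No variant fits a {latency_budget_ms} ms budget")
    return best
```

For example, a 500 ms budget selects `nemo_ctc_480ms`, since `nemo_ctc_1040ms` would exceed it.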
🚀 Quick Start
Installation
```bash
pip install sherpa-onnx
```
Usage Example (Python)
```python
import sherpa_onnx

# Initialize a streaming recognizer with the English Kroko model
recognizer = sherpa_onnx.OnlineRecognizer.from_transducer(
    tokens="models/en/kroko_64l/tokens.txt",
    encoder="models/en/kroko_64l/encoder.int8.onnx",
    decoder="models/en/kroko_64l/decoder.int8.onnx",
    joiner="models/en/kroko_64l/joiner.int8.onnx",
    num_threads=4,
)

# Create a stream and process audio
stream = recognizer.create_stream()
# ... add audio samples via stream.accept_waveform(...) ...
# result = recognizer.get_result(stream)
```
Usage Example (NeMo CTC)
```python
import sherpa_onnx

# Initialize a streaming recognizer with the 80 ms NeMo CTC model
recognizer = sherpa_onnx.OnlineRecognizer.from_nemo_ctc(
    model="models/en/nemo_ctc_80ms/model.int8.onnx",
    tokens="models/en/nemo_ctc_80ms/tokens.txt",
    num_threads=4,
)
```
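Either recognizer is driven the same way: feed audio in chunks and decode as frames become ready. A sketch of that loop; `chunk_audio()` is plain Python, while `transcribe()` assumes the sherpa-onnx streaming API (`accept_waveform`, `is_ready`, `decode_stream`, `get_result`):

```python
def chunk_audio(samples, chunk_size=1600):
    """Yield fixed-size chunks (1600 samples = 100 ms at 16 kHz)."""
    for start in range(0, len(samples), chunk_size):
        yield samples[start:start + chunk_size]

def transcribe(recognizer, samples, sample_rate=16000):
    """Feed audio chunk by chunk, decoding whenever frames are ready."""
    stream = recognizer.create_stream()
    for chunk in chunk_audio(samples):
        stream.accept_waveform(sample_rate, chunk)
        while recognizer.is_ready(stream):
            recognizer.decode_stream(stream)
    stream.input_finished()  # flush buffered frames at end of input
    while recognizer.is_ready(stream):
        recognizer.decode_stream(stream)
    return recognizer.get_result(stream)
```

In a live application the same loop runs on microphone callbacks, with `get_result` polled between chunks for partial transcripts.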
📊 Model Architecture
Kroko (Transducer)
- Encoder: Zipformer2 with 64 or 128 layers
- Decoder: RNN-T decoder (stateful)
- Joiner: Simple feedforward network
- Format: ONNX INT8 quantized
- Components: 3 files (encoder.int8.onnx, decoder.int8.onnx, joiner.int8.onnx)
NeMo (CTC)
- Architecture: Fast Conformer with CTC
- Format: ONNX INT8 quantized
- Components: 1 file (model.int8.onnx)
🎮 Recommended Use Cases
Gaming Applications (Word Sniper, Word Wave)
- Best choice: `nemo_ctc_80ms` - ultra-low latency (80 ms)
- Alternative: `kroko_64l` - better accuracy with acceptable latency
Reading Exercises (Echo Challenge)
- Best choice: `nemo_ctc_480ms` - balanced latency and accuracy
- Alternative: `kroko_64l` - higher accuracy for complex sentences
General Purpose STT
- Best choice: `kroko_128l` - highest accuracy
- Alternative: `kroko_64l` - faster inference, good accuracy
Low-end Devices (512 MB-1 GB RAM)
- Best choice: `kroko_64l` - smaller encoder, lower memory usage
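In application code these recommendations can be captured as a small lookup table; the dictionary keys and helper below are illustrative, not an API of this collection:

```python
# Hypothetical mapping from use case to (recommended, fallback) model,
# mirroring the recommendations above.
RECOMMENDED = {
    "gaming": ("nemo_ctc_80ms", "kroko_64l"),
    "reading": ("nemo_ctc_480ms", "kroko_64l"),
    "general": ("kroko_128l", "kroko_64l"),
    "low_memory": ("kroko_64l", None),
}

def recommend(use_case: str, prefer_fallback: bool = False) -> str:
    """Return the recommended model name, or the fallback if requested."""
    best, fallback = RECOMMENDED[use_case]
    if prefer_fallback and fallback is not None:
        return fallback
    return best
```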
🔧 Quantization Details
All models are quantized using ONNX Runtime dynamic quantization:
```python
from onnxruntime.quantization import quantize_dynamic, QuantType

quantize_dynamic(
    model_input="encoder.onnx",
    model_output="encoder.int8.onnx",
    weight_type=QuantType.QUInt8,
)
```
Benefits:
- ✅ ~50% size reduction relative to the FP32 originals
- ✅ Faster inference on CPU
- ✅ Lower memory usage
- ✅ Minimal accuracy loss (<2% WER increase)
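Applied across the whole collection, the same call can be scripted. A sketch: `int8_name()` is plain path handling, while `quantize_dir()` assumes `onnxruntime` is installed and is not run at import time:

```python
from pathlib import Path

def int8_name(path: Path) -> Path:
    """encoder.onnx -> encoder.int8.onnx (leave quantized files alone)."""
    if path.name.endswith(".int8.onnx"):
        return path
    return path.with_name(path.stem + ".int8.onnx")

def quantize_dir(model_dir: str) -> None:
    """Dynamically quantize every FP32 .onnx file under model_dir."""
    from onnxruntime.quantization import quantize_dynamic, QuantType
    for onnx_file in Path(model_dir).rglob("*.onnx"):
        if onnx_file.name.endswith(".int8.onnx"):
            continue  # already quantized
        quantize_dynamic(
            model_input=str(onnx_file),
            model_output=str(int8_name(onnx_file)),
            weight_type=QuantType.QUInt8,
        )
```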
📁 Directory Structure
```
models/
├── de/
│   ├── kroko_64l/
│   │   ├── encoder.int8.onnx
│   │   ├── decoder.int8.onnx
│   │   ├── joiner.int8.onnx
│   │   └── tokens.txt
│   └── kroko_128l/
│       └── ...
├── en/
│   ├── kroko_64l/
│   ├── kroko_128l/
│   ├── nemo_ctc_80ms/
│   │   ├── model.int8.onnx
│   │   └── tokens.txt
│   ├── nemo_ctc_480ms/
│   └── nemo_ctc_1040ms/
├── es/
│   ├── kroko_64l/
│   └── kroko_128l/
├── fr/
│   ├── kroko_64l/
│   └── kroko_128l/
├── tr/
│   ├── kroko_64l/
│   └── kroko_128l/
├── it/
│   ├── kroko_64l/
│   └── kroko_128l/
└── pt/
    ├── kroko_64l/
    └── kroko_128l/
```
🌟 Credits & Acknowledgments
Kroko Models
These models are derived from the Banafo Kroko ASR project, an open-source multilingual speech recognition initiative.
- Original Source: Banafo/Kroko-ASR
- Community Models: All Kroko models (DE, EN, ES, FR, TR, IT, PT) are Community versions
- Architecture: Zipformer2 + Transducer
- Training: Based on Next-gen Kaldi framework
- License: Apache 2.0
Special thanks to the Banafo team for providing high-quality multilingual ASR models with streaming capabilities.
Kroko Model Variants
- 64L: 64-layer encoder - Optimized for speed
- 128L: 128-layer encoder - Optimized for accuracy
NeMo Models
- Source: NVIDIA NeMo Toolkit
- Architecture: Fast Conformer CTC
- Training Framework: NeMo ASR
Quantization
- Tool: ONNX Runtime
- Method: Dynamic quantization (QUInt8)
- Performed by: This repository maintainer
📄 License
All models in this collection are released under Apache 2.0 License.
Original Model Licenses
- Kroko Models: Apache 2.0 (from Banafo/Kroko-ASR)
- NeMo Models: Apache 2.0 (from NVIDIA NeMo)
📊 Performance Benchmarks
| Model | Size | Latency | WER (en) | Memory | Best For |
|---|---|---|---|---|---|
| nemo_ctc_80ms | 126 MB | 80ms | ~8% | 512 MB | Gaming |
| nemo_ctc_480ms | 126 MB | 480ms | ~6% | 512 MB | Reading |
| kroko_64l | 147 MB | ~200ms | ~5% | 1 GB | General |
| kroko_128l | 147 MB | ~300ms | ~4% | 1.5 GB | High Accuracy |
Benchmarks are approximate and may vary based on hardware and audio conditions.
🛠️ System Requirements
- Minimum RAM: 512 MB (for NeMo models)
- Recommended RAM: 1-2 GB (for Kroko models)
- CPU: Any modern CPU with AVX2 support
- OS: Windows, Linux, macOS, Android (7.0+), iOS
- Runtime: ONNX Runtime (CPU)
🚧 Known Limitations
- INT8 quantization may cause slight accuracy degradation (~1-2% WER increase)
- Kroko 128L models require more memory than 64L variants
- NeMo models work best with English language only
- Real-time performance depends on CPU capabilities
📝 Citation
If you use these models in your research or application, please cite:
```bibtex
@misc{sherpa-onnx-int8-models,
  title={Sherpa ONNX STT Models - INT8 Quantized Collection},
  author={Your Name/Organization},
  year={2025},
  publisher={HuggingFace},
  howpublished={\url{https://huggingface.co/your-username/sherpa-onnx-int8-models}},
  note={Quantized from Banafo/Kroko-ASR and NVIDIA NeMo models}
}
```
Original Kroko Citation:
```bibtex
@misc{banafo-kroko-asr,
  title={Kroko ASR: Multilingual Streaming Speech Recognition},
  author={Banafo Team},
  year={2025},
  publisher={HuggingFace},
  howpublished={\url{https://huggingface.co/Banafo/Kroko-ASR}}
}
```
💬 Support
For issues and questions:
- Sherpa-ONNX: GitHub Issues
- Kroko Models: Banafo Kroko-ASR
📅 Version History
- v1.0.0 (2025-11-07): Initial release
  - 17 INT8 quantized models
  - 7 languages supported (DE, EN, ES, FR, TR, IT, PT)
  - Total size: 2.38 GB
Made with ❤️ using Sherpa-ONNX and ONNX Runtime