Sherpa ONNX STT Models - INT8 Quantized Collection
A comprehensive collection of INT8 quantized speech-to-text models optimized for edge devices and production environments. All models are quantized using dynamic quantization to reduce size by ~50% while maintaining accuracy.
🎯 Model Overview
This collection includes 17 INT8 quantized models covering 7 languages:
| Language | Models | Architecture | Use Case |
|---|---|---|---|
| 🇬🇧 English | 5 models | Kroko + NeMo | Gaming, Reading, General |
| 🇩🇪 German | 2 models | Kroko | General Purpose |
| 🇪🇸 Spanish | 2 models | Kroko | General Purpose |
| 🇫🇷 French | 2 models | Kroko | General Purpose |
| 🇹🇷 Turkish | 2 models | Kroko | General Purpose |
| 🇮🇹 Italian | 2 models | Kroko | General Purpose |
| 🇵🇹 Portuguese | 2 models | Kroko | General Purpose |
Total Size: 2.38 GB (all INT8 quantized)
📦 Model Details
Kroko Models (Community)
Kroko models are high-quality streaming ASR models based on the Zipformer2 architecture with a transducer (RNN-T) decoder.
German (DE)
- kroko_64l: 147 MB (64-layer encoder)
- kroko_128l: 147 MB (128-layer encoder)
English (EN)
- kroko_64l: 147 MB (64-layer encoder)
- kroko_128l: 147 MB (128-layer encoder)
Spanish (ES)
- kroko_64l: 147 MB (64-layer encoder)
- kroko_128l: 147 MB (128-layer encoder)
French (FR)
- kroko_64l: 147 MB (64-layer encoder)
- kroko_128l: 147 MB (128-layer encoder)
Turkish (TR)
- kroko_64l: 147 MB (64-layer encoder)
- kroko_128l: 147 MB (128-layer encoder)
Italian (IT)
- kroko_64l: 147 MB (64-layer encoder)
- kroko_128l: 147 MB (128-layer encoder)
Portuguese (PT)
- kroko_64l: 147 MB (64-layer encoder)
- kroko_128l: 147 MB (128-layer encoder)
NeMo CTC Models (English)
Ultra-fast CTC-based models optimized for real-time applications:
- nemo_ctc_80ms: 126 MB - Ultra-fast (80ms latency) for gaming
- nemo_ctc_480ms: 126 MB - Balanced (480ms latency) for reading
- nemo_ctc_1040ms: 126 MB - High accuracy (1040ms latency)
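The three latency tiers can also be selected programmatically. A minimal sketch; the helper name and the "pick the slowest variant that fits" policy are illustrative, not part of the release:

```python
# Hypothetical helper: map a latency budget (in ms) to one of the three
# NeMo CTC model directories listed above.
NEMO_VARIANTS = [
    (80, "models/en/nemo_ctc_80ms"),      # ultra-fast, gaming
    (480, "models/en/nemo_ctc_480ms"),    # balanced, reading
    (1040, "models/en/nemo_ctc_1040ms"),  # high accuracy
]

def pick_nemo_model(latency_budget_ms: int) -> str:
    """Return the most accurate variant whose latency fits the budget."""
    best = None
    for latency, path in NEMO_VARIANTS:
        if latency <= latency_budget_ms:
            best = path  # later entries trade latency for accuracy
    if best is None:
        raise ValueError(f"No variant fits a {latency_budget_ms} ms budget")
    return best
```

For example, a 500 ms budget selects `nemo_ctc_480ms`, since `nemo_ctc_1040ms` would exceed it.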
🚀 Quick Start
Installation
```bash
pip install sherpa-onnx
```
Usage Example (Python)
```python
import sherpa_onnx

# Initialize a streaming recognizer with the English Kroko model
recognizer = sherpa_onnx.OnlineRecognizer.from_transducer(
    tokens="models/en/kroko_64l/tokens.txt",
    encoder="models/en/kroko_64l/encoder.int8.onnx",
    decoder="models/en/kroko_64l/decoder.int8.onnx",
    joiner="models/en/kroko_64l/joiner.int8.onnx",
    num_threads=4,
)

# Create a stream and process audio
stream = recognizer.create_stream()
# ... add audio samples via stream.accept_waveform(...) ...
# result = recognizer.get_result(stream)
```
Usage Example (NeMo CTC)
```python
import sherpa_onnx

# Initialize a streaming recognizer with the 80 ms NeMo CTC model
recognizer = sherpa_onnx.OnlineRecognizer.from_nemo_ctc(
    model="models/en/nemo_ctc_80ms/model.int8.onnx",
    tokens="models/en/nemo_ctc_80ms/tokens.txt",
    num_threads=4,
)
```
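Either recognizer is driven the same way: feed audio in chunks and decode as frames become ready. A sketch of that loop; `chunk_audio()` is plain Python, while `transcribe()` assumes the sherpa-onnx streaming API (`accept_waveform`, `is_ready`, `decode_stream`, `get_result`):

```python
def chunk_audio(samples, chunk_size=1600):
    """Yield fixed-size chunks (1600 samples = 100 ms at 16 kHz)."""
    for start in range(0, len(samples), chunk_size):
        yield samples[start:start + chunk_size]

def transcribe(recognizer, samples, sample_rate=16000):
    """Feed audio chunk by chunk, decoding whenever frames are ready."""
    stream = recognizer.create_stream()
    for chunk in chunk_audio(samples):
        stream.accept_waveform(sample_rate, chunk)
        while recognizer.is_ready(stream):
            recognizer.decode_stream(stream)
    stream.input_finished()  # flush buffered frames at end of input
    while recognizer.is_ready(stream):
        recognizer.decode_stream(stream)
    return recognizer.get_result(stream)
```

In a live application the same loop runs on microphone callbacks, with `get_result` polled between chunks for partial transcripts.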
📊 Model Architecture
Kroko (Transducer)
- Encoder: Zipformer2 with 64 or 128 layers
- Decoder: RNN-T decoder (stateful)
- Joiner: Simple feedforward network
- Format: ONNX INT8 quantized
- Components: 3 files (encoder.int8.onnx, decoder.int8.onnx, joiner.int8.onnx)
NeMo (CTC)
- Architecture: Fast Conformer with CTC
- Format: ONNX INT8 quantized
- Components: 1 file (model.int8.onnx)
🎮 Recommended Use Cases
Gaming Applications (Word Sniper, Word Wave)
- Best choice: `nemo_ctc_80ms` - ultra-low latency (80 ms)
- Alternative: `kroko_64l` - better accuracy with acceptable latency
Reading Exercises (Echo Challenge)
- Best choice: `nemo_ctc_480ms` - balanced latency and accuracy
- Alternative: `kroko_64l` - higher accuracy for complex sentences
General Purpose STT
- Best choice: `kroko_128l` - highest accuracy
- Alternative: `kroko_64l` - faster inference, good accuracy
Low-end Devices (512 MB-1 GB RAM)
- Best choice: `kroko_64l` - smaller encoder, lower memory usage
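In application code these recommendations can be captured as a small lookup table; the dictionary keys and helper below are illustrative, not an API of this collection:

```python
# Hypothetical mapping from use case to (recommended, fallback) model,
# mirroring the recommendations above.
RECOMMENDED = {
    "gaming": ("nemo_ctc_80ms", "kroko_64l"),
    "reading": ("nemo_ctc_480ms", "kroko_64l"),
    "general": ("kroko_128l", "kroko_64l"),
    "low_memory": ("kroko_64l", None),
}

def recommend(use_case: str, prefer_fallback: bool = False) -> str:
    """Return the recommended model name, or the fallback if requested."""
    best, fallback = RECOMMENDED[use_case]
    if prefer_fallback and fallback is not None:
        return fallback
    return best
```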
🔧 Quantization Details
All models are quantized using ONNX Runtime dynamic quantization:
```python
from onnxruntime.quantization import quantize_dynamic, QuantType

quantize_dynamic(
    model_input="encoder.onnx",
    model_output="encoder.int8.onnx",
    weight_type=QuantType.QUInt8,
)
```
Benefits:
- ✅ ~50% size reduction relative to the FP32 originals
- ✅ Faster inference on CPU
- ✅ Lower memory usage
- ✅ Minimal accuracy loss (<2% WER increase)
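Applied across the whole collection, the same call can be scripted. A sketch: `int8_name()` is plain path handling, while `quantize_dir()` assumes `onnxruntime` is installed and is not run at import time:

```python
from pathlib import Path

def int8_name(path: Path) -> Path:
    """encoder.onnx -> encoder.int8.onnx (leave quantized files alone)."""
    if path.name.endswith(".int8.onnx"):
        return path
    return path.with_name(path.stem + ".int8.onnx")

def quantize_dir(model_dir: str) -> None:
    """Dynamically quantize every FP32 .onnx file under model_dir."""
    from onnxruntime.quantization import quantize_dynamic, QuantType
    for onnx_file in Path(model_dir).rglob("*.onnx"):
        if onnx_file.name.endswith(".int8.onnx"):
            continue  # already quantized
        quantize_dynamic(
            model_input=str(onnx_file),
            model_output=str(int8_name(onnx_file)),
            weight_type=QuantType.QUInt8,
        )
```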
📁 Directory Structure
```
models/
├── de/
│   ├── kroko_64l/
│   │   ├── encoder.int8.onnx
│   │   ├── decoder.int8.onnx
│   │   ├── joiner.int8.onnx
│   │   └── tokens.txt
│   └── kroko_128l/
│       └── ...
├── en/
│   ├── kroko_64l/
│   ├── kroko_128l/
│   ├── nemo_ctc_80ms/
│   │   ├── model.int8.onnx
│   │   └── tokens.txt
│   ├── nemo_ctc_480ms/
│   └── nemo_ctc_1040ms/
├── es/
│   ├── kroko_64l/
│   └── kroko_128l/
├── fr/
│   ├── kroko_64l/
│   └── kroko_128l/
├── tr/
│   ├── kroko_64l/
│   └── kroko_128l/
├── it/
│   ├── kroko_64l/
│   └── kroko_128l/
└── pt/
    ├── kroko_64l/
    └── kroko_128l/
```
🌟 Credits & Acknowledgments
Kroko Models
These models are derived from the Banafo Kroko ASR project, an open-source multilingual speech recognition initiative.
- Original Source: Banafo/Kroko-ASR
- Community Models: All Kroko models (DE, EN, ES, FR, TR, IT, PT) are Community versions
- Architecture: Zipformer2 + Transducer
- Training: Based on Next-gen Kaldi framework
- License: Apache 2.0
Special thanks to the Banafo team for providing high-quality multilingual ASR models with streaming capabilities.
Kroko Model Variants
- 64L: 64-layer encoder - Optimized for speed
- 128L: 128-layer encoder - Optimized for accuracy
NeMo Models
- Source: NVIDIA NeMo Toolkit
- Architecture: Fast Conformer CTC
- Training Framework: NeMo ASR
Quantization
- Tool: ONNX Runtime
- Method: Dynamic quantization (QUInt8)
- Performed by: This repository maintainer
📄 License
All models in this collection are released under Apache 2.0 License.
Original Model Licenses
- Kroko Models: Apache 2.0 (from Banafo/Kroko-ASR)
- NeMo Models: Apache 2.0 (from NVIDIA NeMo)
📊 Performance Benchmarks
| Model | Size | Latency | WER (en) | Memory | Best For |
|---|---|---|---|---|---|
| nemo_ctc_80ms | 126 MB | 80ms | ~8% | 512 MB | Gaming |
| nemo_ctc_480ms | 126 MB | 480ms | ~6% | 512 MB | Reading |
| kroko_64l | 147 MB | ~200ms | ~5% | 1 GB | General |
| kroko_128l | 147 MB | ~300ms | ~4% | 1.5 GB | High Accuracy |
Benchmarks are approximate and may vary based on hardware and audio conditions.
🛠️ System Requirements
- Minimum RAM: 512 MB (for NeMo models)
- Recommended RAM: 1-2 GB (for Kroko models)
- CPU: Any modern CPU with AVX2 support
- OS: Windows, Linux, macOS, Android (7.0+), iOS
- Runtime: ONNX Runtime (CPU)
🚧 Known Limitations
- INT8 quantization may cause slight accuracy degradation (~1-2% WER increase)
- Kroko 128L models require more memory than 64L variants
- NeMo models work best with English language only
- Real-time performance depends on CPU capabilities
📝 Citation
If you use these models in your research or application, please cite:
```bibtex
@misc{sherpa-onnx-int8-models,
  title={Sherpa ONNX STT Models - INT8 Quantized Collection},
  author={Your Name/Organization},
  year={2025},
  publisher={HuggingFace},
  howpublished={\url{https://huggingface.co/your-username/sherpa-onnx-int8-models}},
  note={Quantized from Banafo/Kroko-ASR and NVIDIA NeMo models}
}
```
Original Kroko Citation:
```bibtex
@misc{banafo-kroko-asr,
  title={Kroko ASR: Multilingual Streaming Speech Recognition},
  author={Banafo Team},
  year={2025},
  publisher={HuggingFace},
  howpublished={\url{https://huggingface.co/Banafo/Kroko-ASR}}
}
```
💬 Support
For issues and questions:
- Sherpa-ONNX: GitHub Issues
- Kroko Models: Banafo Kroko-ASR
📅 Version History
- v1.0.0 (2025-11-07): Initial release
  - 17 INT8 quantized models
  - 7 languages supported (DE, EN, ES, FR, TR, IT, PT)
  - Total size: 2.38 GB
Made with ❤️ using Sherpa-ONNX and ONNX Runtime