# Streaming Zipformer (Small) – AISHELL-1 ASR

A streaming (causal) Zipformer2 model for Mandarin Chinese ASR, trained on AISHELL-1 with icefall using the pruned RNN-T (transducer) loss.
## Model Description
- Architecture: Zipformer2 (small, causal/streaming)
- Task: Automatic Speech Recognition (Mandarin Chinese)
- Dataset: AISHELL-1 (~170 hours)
- Parameters: ~23M
- Streaming: Yes – chunk size 16 frames (~320 ms latency)
- Training framework: icefall + k2
## Model Architecture
| Parameter | Value |
|---|---|
| num-encoder-layers | 2,2,2,2,2,2 |
| encoder-dim | 192,256,256,384,256,256 |
| feedforward-dim | 512,768,768,1024,768,768 |
| num-heads | 4,4,4,8,4,4 |
| downsampling-factor | 1,2,4,8,4,2 |
| decoder-dim | 512 |
| joiner-dim | 512 |
| chunk-size (inference) | 16 frames (~320ms) |
| left-context-frames | 64 |
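The ~320 ms figure follows from the frame geometry. A quick sanity check, assuming the 10 ms fbank frame shift and 2× convolutional subsampling used in the icefall Zipformer recipe (those two constants are recipe defaults, not stated in the tables above):

```python
# Illustrative latency estimate: one encoder frame covers
# FEATURE_FRAME_MS * SUBSAMPLING milliseconds of audio, and the model
# waits for a full chunk of encoder frames before emitting output.
FEATURE_FRAME_MS = 10   # fbank frame shift (recipe default)
SUBSAMPLING = 2         # frontend convolutional subsampling (recipe default)
CHUNK_FRAMES = 16       # --chunk-size at inference

chunk_latency_ms = CHUNK_FRAMES * FEATURE_FRAME_MS * SUBSAMPLING
print(chunk_latency_ms)  # 320
```

This is the algorithmic latency of one chunk; real end-to-end latency adds feature extraction and compute time.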
## Training Details
| Item | Value |
|---|---|
| Epochs | 30 |
| Batch max-duration | 700s per step (dynamic batching) |
| GPUs | 8 × NVIDIA A10 (22GB) |
| Optimizer | ScaledAdam |
| Learning rate | 0.045 (with warmup) |
| FP16 | Yes |
| Data augmentation | SpecAugment + MUSAN noise |
| Average checkpoints | epochs 22–30 (avg=9) |
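The "avg=9" entry means the released weights are the element-wise mean of the parameters saved at epochs 22–30. A minimal sketch of the idea (plain dicts stand in for torch state_dicts; icefall's own averaging utilities are more involved):

```python
# Average several checkpoints parameter-by-parameter.
# Each "state_dict" here is a mapping from parameter name to value.
def average_checkpoints(state_dicts):
    n = len(state_dicts)
    keys = state_dicts[0].keys()
    return {k: sum(sd[k] for sd in state_dicts) / n for k in keys}

# Toy example with scalar "parameters":
ckpts = [{"w": 1.0}, {"w": 2.0}, {"w": 3.0}]
print(average_checkpoints(ckpts))  # {'w': 2.0}
```

Averaging the last several epochs smooths out per-epoch noise and usually gives a small CER improvement over any single checkpoint.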
## Evaluation Results (CER)
Evaluated with greedy search, chunk-size 16, left-context-frames 64.
| Test Set | CER |
|---|---|
| AISHELL-1 dev | 9.67% |
| AISHELL-1 test | 10.47% |
Note: This is a streaming model. Non-streaming models typically achieve lower CER (~4-5%) due to full context access.
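CER is the character-level edit distance between hypothesis and reference, divided by the reference length. A self-contained sketch (the example strings are illustrative, not from the test sets):

```python
# Levenshtein distance via a single-row dynamic-programming table.
def edit_distance(ref, hyp):
    d = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev, d[0] = d[0], i
        for j, h in enumerate(hyp, 1):
            # prev holds d[i-1][j-1]; d[j] still holds d[i-1][j]
            prev, d[j] = d[j], min(d[j] + 1, d[j - 1] + 1, prev + (r != h))
    return d[len(hyp)]

def cer(ref, hyp):
    return edit_distance(ref, hyp) / len(ref)

print(cer("今天天气", "今天气"))  # 0.25: one deletion over 4 reference chars
```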
## Files
| File | Description | Size |
|---|---|---|
| pretrained.pt | Averaged PyTorch weights (epochs 22–30); load with `model.load_state_dict()` | 135MB |
| encoder-epoch-30-avg-9-chunk-16-left-64.onnx | Encoder (fp32 ONNX) | 103MB |
| decoder-epoch-30-avg-9-chunk-16-left-64.onnx | Decoder (fp32 ONNX) | 9.5MB |
| joiner-epoch-30-avg-9-chunk-16-left-64.onnx | Joiner (fp32 ONNX) | 8.5MB |
| encoder-epoch-30-avg-9-chunk-16-left-64.int8.onnx | Encoder (int8 quantized ONNX) | 30MB |
| decoder-epoch-30-avg-9-chunk-16-left-64.int8.onnx | Decoder (int8 quantized ONNX) | 2.4MB |
| joiner-epoch-30-avg-9-chunk-16-left-64.int8.onnx | Joiner (int8 quantized ONNX) | 2.2MB |
| tokens.txt | Character vocabulary (4336 tokens) | 38KB |
## Usage
### Load PyTorch weights
```python
import torch
from zipformer import Zipformer2  # from icefall's egs/aishell/ASR/zipformer

# The checkpoint contains only the weights: first construct the model
# with the hyperparameters listed in "Model Architecture" above
# (see the icefall zipformer recipe for the model builder).
model = ...  # build the Zipformer2 transducer with a matching config

checkpoint = torch.load("pretrained.pt", map_location="cpu")
model.load_state_dict(checkpoint["model"])
model.eval()
```
### Run inference with icefall
```bash
# Clone icefall
git clone https://github.com/k2-fsa/icefall
cd icefall/egs/aishell/ASR

# Greedy search inference on a wav file
python zipformer/pretrained.py \
  --checkpoint /path/to/pretrained.pt \
  --tokens /path/to/tokens.txt \
  --causal 1 \
  --chunk-size 16 \
  --left-context-frames 64 \
  --method greedy_search \
  /path/to/audio.wav
```
### Run inference with sherpa-onnx (recommended for deployment)
sherpa-onnx runs streaming inference directly from the ONNX files above, with bindings for Python, C++, Android, iOS, and more.
```bash
pip install sherpa-onnx

python -c "
import sherpa_onnx
recognizer = sherpa_onnx.OnlineRecognizer.from_transducer(
    encoder='encoder-epoch-30-avg-9-chunk-16-left-64.onnx',
    decoder='decoder-epoch-30-avg-9-chunk-16-left-64.onnx',
    joiner='joiner-epoch-30-avg-9-chunk-16-left-64.onnx',
    tokens='tokens.txt',
    num_threads=4,
    decoding_method='greedy_search',
)
"
```
## Citation
If you use this model, please cite:
```bibtex
@inproceedings{yao2023zipformer,
  title={Zipformer: A faster and better encoder for automatic speech recognition},
  author={Yao, Zengwei and Guo, Liyong and Yang, Xiaoyu and Kang, Wei and Kuang, Fangjun and Yang, Yifan and Jin, Zengrui and Lin, Long and Povey, Daniel},
  booktitle={ICLR},
  year={2024}
}

@inproceedings{bu2017aishell,
  title={AISHELL-1: An open-source Mandarin speech corpus and a speech recognition baseline},
  author={Bu, Hui and Du, Jiayu and Na, Xingyu and Wu, Bengu and Zheng, Hao},
  booktitle={Oriental COCOSDA 2017},
  year={2017}
}
```
## License
Apache 2.0