# Streaming Zipformer (Small) – AISHELL-1 ASR

A streaming (causal) Zipformer2 model for Mandarin Chinese ASR, trained on AISHELL-1 with icefall using the pruned RNN-T (transducer) loss.
## Model Description
- Architecture: Zipformer2 (small, causal/streaming)
- Task: Automatic Speech Recognition (Mandarin Chinese)
- Dataset: AISHELL-1 (~170 hours)
- Parameters: ~23M
- Streaming: Yes – chunk size 16 frames (~320 ms latency)
- Training framework: icefall + k2
## Model Architecture
| Parameter | Value |
|---|---|
| num-encoder-layers | 2,2,2,2,2,2 |
| encoder-dim | 192,256,256,384,256,256 |
| feedforward-dim | 512,768,768,1024,768,768 |
| num-heads | 4,4,4,8,4,4 |
| downsampling-factor | 1,2,4,8,4,2 |
| decoder-dim | 512 |
| joiner-dim | 512 |
| chunk-size (inference) | 16 frames (~320ms) |
| left-context-frames | 64 |
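The ~320 ms figure follows from the frame geometry. A quick sanity check, assuming the 10 ms fbank frame shift and 2× convolutional subsampling used in the icefall Zipformer recipe (those two constants are recipe defaults, not stated in the tables above):

```python
# Illustrative latency estimate: one encoder frame covers
# FEATURE_FRAME_MS * SUBSAMPLING milliseconds of audio, and the model
# waits for a full chunk of encoder frames before emitting output.
FEATURE_FRAME_MS = 10   # fbank frame shift (recipe default)
SUBSAMPLING = 2         # frontend convolutional subsampling (recipe default)
CHUNK_FRAMES = 16       # --chunk-size at inference

chunk_latency_ms = CHUNK_FRAMES * FEATURE_FRAME_MS * SUBSAMPLING
print(chunk_latency_ms)  # 320
```

This is the algorithmic latency of one chunk; real end-to-end latency adds feature extraction and compute time.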
## Training Details
| Item | Value |
|---|---|
| Epochs | 30 |
| Batch max-duration | 700s per step (dynamic batching) |
| GPUs | 8 × NVIDIA A10 (22GB) |
| Optimizer | ScaledAdam |
| Learning rate | 0.045 (with warmup) |
| FP16 | Yes |
| Data augmentation | SpecAugment + MUSAN noise |
| Average checkpoints | epochs 22–30 (avg=9) |
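The "avg=9" entry means the released weights are the element-wise mean of the parameters saved at epochs 22–30. A minimal sketch of the idea (plain dicts stand in for torch state_dicts; icefall's own averaging utilities are more involved):

```python
# Average several checkpoints parameter-by-parameter.
# Each "state_dict" here is a mapping from parameter name to value.
def average_checkpoints(state_dicts):
    n = len(state_dicts)
    keys = state_dicts[0].keys()
    return {k: sum(sd[k] for sd in state_dicts) / n for k in keys}

# Toy example with scalar "parameters":
ckpts = [{"w": 1.0}, {"w": 2.0}, {"w": 3.0}]
print(average_checkpoints(ckpts))  # {'w': 2.0}
```

Averaging the last several epochs smooths out per-epoch noise and usually gives a small CER improvement over any single checkpoint.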
## Evaluation Results (CER)
Evaluated with greedy search, chunk-size 16, left-context-frames 64.
| Test Set | CER |
|---|---|
| AISHELL-1 dev | 9.67% |
| AISHELL-1 test | 10.47% |
Note: This is a streaming model. Non-streaming models typically achieve lower CER (~4-5%) due to full context access.
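CER is the character-level edit distance between hypothesis and reference, divided by the reference length. A self-contained sketch (the example strings are illustrative, not from the test sets):

```python
# Levenshtein distance via a single-row dynamic-programming table.
def edit_distance(ref, hyp):
    d = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev, d[0] = d[0], i
        for j, h in enumerate(hyp, 1):
            # prev holds d[i-1][j-1]; d[j] still holds d[i-1][j]
            prev, d[j] = d[j], min(d[j] + 1, d[j - 1] + 1, prev + (r != h))
    return d[len(hyp)]

def cer(ref, hyp):
    return edit_distance(ref, hyp) / len(ref)

print(cer("今天天气", "今天气"))  # 0.25: one deletion over 4 reference chars
```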
## Files
| File | Description | Size |
|---|---|---|
| pretrained.pt | Averaged PyTorch weights (epochs 22–30); load with `model.load_state_dict()` | 135MB |
| encoder-epoch-30-avg-9-chunk-16-left-64.onnx | Encoder (fp32 ONNX) | 103MB |
| decoder-epoch-30-avg-9-chunk-16-left-64.onnx | Decoder (fp32 ONNX) | 9.5MB |
| joiner-epoch-30-avg-9-chunk-16-left-64.onnx | Joiner (fp32 ONNX) | 8.5MB |
| encoder-epoch-30-avg-9-chunk-16-left-64.int8.onnx | Encoder (int8 quantized ONNX) | 30MB |
| decoder-epoch-30-avg-9-chunk-16-left-64.int8.onnx | Decoder (int8 quantized ONNX) | 2.4MB |
| joiner-epoch-30-avg-9-chunk-16-left-64.int8.onnx | Joiner (int8 quantized ONNX) | 2.2MB |
| tokens.txt | Character vocabulary (4336 tokens) | 38KB |
## Usage
### Load PyTorch weights
```python
import torch
from zipformer import Zipformer2  # from icefall's egs/aishell/ASR/zipformer

# The checkpoint contains only the weights: first construct the model
# with the hyperparameters listed in "Model Architecture" above
# (see the icefall zipformer recipe for the model builder).
model = ...  # build the Zipformer2 transducer with a matching config

checkpoint = torch.load("pretrained.pt", map_location="cpu")
model.load_state_dict(checkpoint["model"])
model.eval()
```
### Run inference with icefall
```bash
# Clone icefall
git clone https://github.com/k2-fsa/icefall
cd icefall/egs/aishell/ASR

# Greedy search inference on a wav file
python zipformer/pretrained.py \
  --checkpoint /path/to/pretrained.pt \
  --tokens /path/to/tokens.txt \
  --causal 1 \
  --chunk-size 16 \
  --left-context-frames 64 \
  --method greedy_search \
  /path/to/audio.wav
```
### Run inference with sherpa-onnx (recommended for deployment)
sherpa-onnx runs streaming inference directly from the ONNX files above, with bindings for Python, C++, Android, iOS, and more.
```bash
pip install sherpa-onnx

python -c "
import sherpa_onnx
recognizer = sherpa_onnx.OnlineRecognizer.from_transducer(
    encoder='encoder-epoch-30-avg-9-chunk-16-left-64.onnx',
    decoder='decoder-epoch-30-avg-9-chunk-16-left-64.onnx',
    joiner='joiner-epoch-30-avg-9-chunk-16-left-64.onnx',
    tokens='tokens.txt',
    num_threads=4,
    decoding_method='greedy_search',
)
"
```
## Citation
If you use this model, please cite:
```bibtex
@inproceedings{yao2023zipformer,
  title={Zipformer: A faster and better encoder for automatic speech recognition},
  author={Yao, Zengwei and Guo, Liyong and Yang, Xiaoyu and Kang, Wei and Kuang, Fangjun and Yang, Yifan and Jin, Zengrui and Lin, Long and Povey, Daniel},
  booktitle={ICLR},
  year={2024}
}

@inproceedings{bu2017aishell,
  title={AISHELL-1: An open-source Mandarin speech corpus and a speech recognition baseline},
  author={Bu, Hui and Du, Jiayu and Na, Xingyu and Wu, Bengu and Zheng, Hao},
  booktitle={Oriental COCOSDA 2017},
  year={2017}
}
```
## License
Apache 2.0