Streaming Zipformer (Small) β€” AISHELL-1 ASR

Streaming (causal) Zipformer2 model for Mandarin Chinese ASR, trained on AISHELL-1 with icefall using the pruned RNN-T (transducer) loss.

Model Description

  • Architecture: Zipformer2 (small, causal/streaming)
  • Task: Automatic Speech Recognition (Mandarin Chinese)
  • Dataset: AISHELL-1 (~170 hours)
  • Parameters: ~23M
  • Streaming: Yes β€” chunk-size 16 frames (~320ms latency)
  • Training framework: icefall + k2
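
The ~320 ms figure follows from the frame rate: with the usual icefall front end (10 ms fbank frame shift, 2× subsampling before the encoder), one encoder frame covers 20 ms, so a 16-frame chunk spans 320 ms. A quick sanity check; the frame-shift and subsampling values are assumptions from the standard recipe, not stated in this card:

```python
# Back-of-the-envelope latency for the streaming encoder.
# Assumes the standard icefall front end: 10 ms fbank frame shift
# with 2x subsampling before the encoder, i.e. 20 ms per encoder frame.
FRAME_SHIFT_MS = 10
SUBSAMPLING = 2
CHUNK_FRAMES = 16
LEFT_CONTEXT_FRAMES = 64

chunk_latency_ms = CHUNK_FRAMES * FRAME_SHIFT_MS * SUBSAMPLING
left_context_ms = LEFT_CONTEXT_FRAMES * FRAME_SHIFT_MS * SUBSAMPLING

print(chunk_latency_ms)  # 320
print(left_context_ms)   # 1280
```

By the same arithmetic, the 64-frame left context gives the encoder about 1.28 s of history at each step.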

Model Architecture

  • num-encoder-layers: 2,2,2,2,2,2
  • encoder-dim: 192,256,256,384,256,256
  • feedforward-dim: 512,768,768,1024,768,768
  • num-heads: 4,4,4,8,4,4
  • downsampling-factor: 1,2,4,8,4,2
  • decoder-dim: 512
  • joiner-dim: 512
  • chunk-size (inference): 16 frames (~320 ms)
  • left-context-frames: 64
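
The comma-separated entries are per-stack settings for the six Zipformer2 encoder stacks. Lining them up side by side (values copied from the list above) makes the hourglass shape of the network easier to see:

```python
# Per-stack hyperparameters of the six Zipformer2 encoder stacks,
# copied from the architecture list above.
layers     = [2, 2, 2, 2, 2, 2]
dims       = [192, 256, 256, 384, 256, 256]
ff_dims    = [512, 768, 768, 1024, 768, 768]
heads      = [4, 4, 4, 8, 4, 4]
downsample = [1, 2, 4, 8, 4, 2]

for i, (l, d, ff, h, ds) in enumerate(zip(layers, dims, ff_dims, heads, downsample)):
    print(f"stack {i}: layers={l} dim={d} ff={ff} heads={h} downsample={ds}x")
```

Note how the widest stack (dim 384, 8 heads) sits at the most downsampled (8×) position, which is what keeps the model at ~23M parameters.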

Training Details

  • Epochs: 30
  • Batch max-duration: 700 s of audio per batch (dynamic batching)
  • GPUs: 8 × NVIDIA A10 (22 GB)
  • Optimizer: ScaledAdam
  • Learning rate: 0.045 (with warmup)
  • FP16: Yes
  • Data augmentation: SpecAugment + MUSAN noise
  • Averaged checkpoints: epochs 22–30 (avg=9)
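
"avg=9" means the exported weights are an average of the nine checkpoints from epochs 22–30, averaged parameter by parameter. A simplified sketch of the idea (plain floats stand in for tensors; icefall's actual averaging operates on torch state dicts):

```python
# Simplified checkpoint averaging: element-wise mean of the model
# parameters across several saved checkpoints. Floats stand in for
# tensors here to keep the sketch dependency-free.
def average_state_dicts(state_dicts):
    n = len(state_dicts)
    keys = state_dicts[0].keys()
    return {k: sum(sd[k] for sd in state_dicts) / n for k in keys}

# Toy example: three "checkpoints" with a single parameter "w".
ckpts = [{"w": 1.0}, {"w": 2.0}, {"w": 3.0}]
print(average_state_dicts(ckpts))  # {'w': 2.0}
```

Averaging late-epoch checkpoints typically smooths out step-to-step noise in the weights and yields a slightly lower CER than any single epoch.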

Evaluation Results (CER)

Evaluated using greedy search, chunk-size=16, left-context=64.

  • AISHELL-1 dev: 9.67% CER
  • AISHELL-1 test: 10.47% CER

Note: This is a streaming model. Non-streaming models typically achieve lower CER (~4-5%) due to full context access.
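
CER (character error rate) is the character-level Levenshtein (edit) distance between reference and hypothesis, divided by the reference length. A minimal implementation, for illustration only:

```python
# Character error rate: edit distance over characters, normalized by
# the reference length. Standard dynamic-programming Levenshtein.
def cer(ref: str, hyp: str) -> float:
    r, h = list(ref), list(hyp)
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i
    for j in range(len(h) + 1):
        d[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            cost = 0 if r[i - 1] == h[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # substitution
    return d[len(r)][len(h)] / len(r)

print(cer("今天天气好", "今天气好"))  # 1 deletion / 5 chars = 0.2
```

For Mandarin, CER over characters plays the role WER plays for English, since there are no word boundaries in the raw transcript.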

Files

  • pretrained.pt (135 MB): averaged PyTorch weights (epochs 22–30); load with model.load_state_dict()
  • encoder-epoch-30-avg-9-chunk-16-left-64.onnx (103 MB): encoder, fp32 ONNX
  • decoder-epoch-30-avg-9-chunk-16-left-64.onnx (9.5 MB): decoder, fp32 ONNX
  • joiner-epoch-30-avg-9-chunk-16-left-64.onnx (8.5 MB): joiner, fp32 ONNX
  • encoder-epoch-30-avg-9-chunk-16-left-64.int8.onnx (30 MB): encoder, int8-quantized ONNX
  • decoder-epoch-30-avg-9-chunk-16-left-64.int8.onnx (2.4 MB): decoder, int8-quantized ONNX
  • joiner-epoch-30-avg-9-chunk-16-left-64.int8.onnx (2.2 MB): joiner, int8-quantized ONNX
  • tokens.txt (38 KB): character vocabulary (4336 tokens)
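
tokens.txt is a k2/icefall-style symbol table: one "<token> <id>" pair per line, mapping each character (plus special symbols such as the blank) to an integer id. A small loader, written to take any iterable of lines so it works on an open file:

```python
# Parse a k2/icefall symbol table ("<token> <id>" per line) into an
# id -> token dict. Accepts any iterable of lines, e.g. an open file.
def load_tokens(lines):
    id2token = {}
    for line in lines:
        token, idx = line.rstrip("\n").rsplit(maxsplit=1)
        id2token[int(idx)] = token
    return id2token

# Toy sample; the real file has 4336 entries.
sample = ["<blk> 0", "你 1", "好 2"]
print(load_tokens(sample))  # {0: '<blk>', 1: '你', 2: '好'}
```

In practice you rarely parse this file yourself: both icefall and sherpa-onnx consume tokens.txt directly via the --tokens / tokens arguments shown below.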

Usage

Load PyTorch weights

import torch
from zipformer import Zipformer2  # from icefall's zipformer recipe

# Build the model first with the hyperparameters listed above
# (e.g. via the recipe's helper functions), then load the weights:
model = ...  # Zipformer2-based transducer, constructed as in the recipe

checkpoint = torch.load("pretrained.pt", map_location="cpu")
model.load_state_dict(checkpoint["model"])
model.eval()

Run inference with icefall

# Clone icefall
git clone https://github.com/k2-fsa/icefall
cd icefall/egs/aishell/ASR

# Greedy search inference on a wav file
python zipformer/pretrained.py \
  --checkpoint /path/to/pretrained.pt \
  --tokens /path/to/tokens.txt \
  --causal 1 \
  --chunk-size 16 \
  --left-context-frames 64 \
  --method greedy_search \
  /path/to/audio.wav

Run inference with sherpa-onnx (recommended for deployment)

sherpa-onnx runs streaming inference directly from the ONNX files above, with bindings for Python, C++, Android, iOS, and more.

pip install sherpa-onnx

python -c "
import sherpa_onnx
recognizer = sherpa_onnx.OnlineRecognizer.from_transducer(
    encoder='encoder-epoch-30-avg-9-chunk-16-left-64.onnx',
    decoder='decoder-epoch-30-avg-9-chunk-16-left-64.onnx',
    joiner='joiner-epoch-30-avg-9-chunk-16-left-64.onnx',
    tokens='tokens.txt',
    num_threads=4,
    decoding_method='greedy_search',
)
"

Citation

If you use this model, please cite:

@inproceedings{yao2024zipformer,
  title={Zipformer: A faster and better encoder for automatic speech recognition},
  author={Yao, Zengwei and Guo, Liyong and Yang, Xiaoyu and Kang, Wei and Kuang, Fangjun and Yang, Yifan and Jin, Zengrui and Lin, Long and Povey, Daniel},
  booktitle={ICLR},
  year={2024}
}

@inproceedings{bu2017aishell,
  title={AISHELL-1: An open-source Mandarin speech corpus and a speech recognition baseline},
  author={Bu, Hui and Du, Jiayu and Na, Xingyu and Wu, Bengu and Zheng, Hao},
  booktitle={Oriental COCOSDA (O-COCOSDA)},
  year={2017}
}

License

Apache 2.0
