---
frameworks:
- ''
language:
- zh
license: apache-2.0
tags:
- speech
- asr
tasks: []
---
Dolphin-CN-Dialect
Paper | GitHub | Hugging Face | ModelScope
Repository Notice
This model is officially maintained by Dataocean AI.
To ensure compatibility with existing user code and download links, we keep two official repositories for the same model:
- Original / legacy repository: DataoceanAI
- Organization / enterprise repository: DataoceanAI1
Both repositories are maintained by the same team and contain the same model files.
DataoceanAI1 is the newly created enterprise organization account, while DataoceanAI is kept to avoid breaking existing user download scripts and links.
Please do not regard either repository as an unofficial copy or unauthorized redistribution.
Dolphin-CN-Dialect is a multi-dialect ASR model developed by Dataocean AI and Tsinghua University, with a strong focus on Chinese dialect recognition and real-world deployment scenarios. Compared with the previous Dolphin series, Dolphin-CN-Dialect introduces significant improvements in tokenizer design, dialect-balanced training, streaming capability, hotword biasing, and deployment efficiency.
The model supports Mandarin Chinese and 22 Chinese dialects, while also maintaining multilingual ASR capability inherited from Dolphin. Dolphin-CN-Dialect supports both streaming and non-streaming inference, enabling practical deployment in latency-sensitive applications such as real-time transcription and industrial speech recognition systems.
Approach
Dolphin-CN-Dialect is built upon the Dolphin architecture and follows a joint CTC-Attention framework with:
- Encoder: E-Branchformer
- Decoder: Transformer Decoder
- Training Objective: Joint CTC + Attention loss
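The joint objective interpolates the two losses with a fixed weight. As an illustrative sketch (the weight name and value below are assumptions, not the paper's exact hyperparameters):

```python
def joint_ctc_attention_loss(ctc_loss: float, attention_loss: float,
                             ctc_weight: float = 0.3) -> float:
    """L = lambda * L_ctc + (1 - lambda) * L_att, with lambda = ctc_weight."""
    return ctc_weight * ctc_loss + (1.0 - ctc_weight) * attention_loss

# Example: CTC loss 2.0, attention loss 1.0, lambda = 0.3 -> 1.3
print(joint_ctc_attention_loss(2.0, 1.0, 0.3))
```

The CTC branch keeps the encoder alignments monotonic, while the attention decoder models richer output dependencies; the interpolation weight trades the two off.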
Compared to Dolphin, Dolphin-CN-Dialect introduces several important improvements:
- Temperature-based data sampling for balancing standard Mandarin and low-resource dialects
- Redesigned tokenizer with:
- character-level modeling for Chinese
- BPE-based subword modeling for English
- extensible dialect tokens
- Streaming ASR support
- Hotword-biased decoding, including:
- encoder-level contextual biasing
- prompt-based decoder biasing
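The temperature-based sampling in the first bullet above can be sketched as follows. This is the common formulation p_i ∝ (n_i / N)^(1/T), shown here for illustration only; the temperature value is an assumption, not the paper's setting:

```python
def temperature_sampling_probs(counts, temperature=3.0):
    """Flatten raw data proportions: p_i proportional to (n_i / N) ** (1 / T).

    With T > 1, low-resource dialects are sampled more often than their
    raw share of the training data would suggest.
    """
    total = sum(counts)
    weights = [(n / total) ** (1.0 / temperature) for n in counts]
    z = sum(weights)
    return [w / z for w in weights]

# Example: Mandarin dominates the raw counts, but T = 3 narrows the gap.
probs = temperature_sampling_probs([900_000, 50_000, 50_000], temperature=3.0)
```

At T = 1 this recovers the raw proportions; as T grows, the distribution approaches uniform, which is what balances standard Mandarin against low-resource dialects.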
Experimental results show that Dolphin-CN-Dialect achieves:
- 38% improvement in dialect recognition accuracy
- 16.3% relative CER reduction over Dolphin
- Competitive performance with recent large-scale ASR systems while maintaining a smaller model size
See details in the Paper.
Setup
Dolphin-CN-Dialect requires FFmpeg to convert audio files into WAV format. Please install FFmpeg first if it is not already installed on your system.
```shell
# Ubuntu / Debian
sudo apt update && sudo apt install ffmpeg

# macOS
brew install ffmpeg

# Windows
choco install ffmpeg
```
Install Dolphin with pip:
```shell
pip install -U dolphin
```
Alternatively, install from source:
```shell
pip install git+https://github.com/DataoceanAI/Dolphin.git
```
Available Models
Currently, Dolphin-CN-Dialect provides multiple model sizes optimized for different deployment scenarios.
| Model | Parameters | Hotwords |
|---|---|---|
| base.cn | 0.1 B | ❌ |
| base.cn.streaming | 0.1 B | ❌ |
| small.cn | 0.4 B | Encoder-biased Hotwords |
| small.cn.streaming | 0.4 B | Encoder-biased Hotwords |
| small.cn.prompt | 0.4 B | Prompt-based Hotwords |
Hotword Biasing
Dolphin-CN-Dialect supports two hotword biasing approaches.
Encoder-Level Contextual Biasing
- Supports both streaming and non-streaming models
- Integrates contextual embeddings into encoder representations
- Efficient adaptation without retraining the full model
Prompt-Based Hotword Biasing
- Designed for non-streaming models
- Injects hotwords directly into decoder prompts
- Particularly effective for long-tail and rare phrases
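Whichever biasing mode is used, the hotword list itself benefits from light cleanup before decoding. A minimal, hypothetical helper (not part of the Dolphin API) that strips whitespace, drops empty entries, and deduplicates while preserving order:

```python
def clean_hotwords(hotwords):
    """Strip surrounding whitespace, drop empties, dedupe preserving order."""
    seen = set()
    cleaned = []
    for word in hotwords:
        word = word.strip()
        if word and word not in seen:
            seen.add(word)
            cleaned.append(word)
    return cleaned

# Duplicate and blank entries are removed before biasing.
print(clean_hotwords(["诺香丹青牌科研胶囊", " 诺香丹青牌科研胶囊", "", "Dataocean AI"]))
```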
Experimental results show significant reductions in hotword error rates while maintaining strong overall ASR performance.
Supported Languages and Dialects
Dolphin-CN-Dialect primarily focuses on:
- Mandarin Chinese
- 22 Chinese dialects
- Regional accented Mandarin
Supported dialects include:
- Sichuan
- Wu
- Minnan
- Shanghai
- Gansu
- Guangdong
- Wenzhou
- Hunan
- Anhui
- Henan
- Fujian
- Hebei
- Liaoning
- Shaanxi
- Tianjin
- and more
For the complete language and dialect list, see languages.md.
Supported Devices
| Device Type | Support Status |
|---|---|
| CUDA | ✅ Supported |
| MPS (Apple) | ✅ Supported |
| CPU | ✅ Supported |
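The table above can be turned into an automatic device choice at load time. A hedged sketch, assuming a PyTorch backend (the `torch` availability checks below are standard PyTorch calls; `pick_device` itself is a hypothetical helper, not part of the Dolphin API):

```python
def pick_device() -> str:
    """Return the best available device string for dolphin.load_model."""
    try:
        import torch  # assumed backend; fall back to CPU if unavailable
    except ImportError:
        return "cpu"
    if torch.cuda.is_available():
        return "cuda"
    if getattr(torch.backends, "mps", None) and torch.backends.mps.is_available():
        return "mps"
    return "cpu"

device = pick_device()  # one of "cuda", "mps", "cpu"
```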
Usage
Command-line usage
```shell
dolphin audio.wav

# Download model and specify the model path
dolphin audio.wav --model small.cn --model_dir /data/models/dolphin/

# Specify language and region
dolphin audio.wav --model small.cn --model_dir /data/models/dolphin/ --lang_sym "zh" --region_sym "CN"

# Specify the hotwords file with the encoder-biased method
dolphin audio.wav --model small.cn --model_dir /data/models/dolphin/ --hotword_list_path hotwords.txt --use_deep_biasing true

# Use the prompt-based model
dolphin audio.wav --model small.cn.prompt --model_dir /data/models/dolphin/ --hotword_list_path hotwords.txt --use_prompt_hotword true --use_two_stage_filter true
```
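The commands above pass a `hotwords.txt` file via `--hotword_list_path`. The exact file format is not specified here; a plain-text list with one UTF-8 phrase per line (an assumption, not documented behavior) would look like:

```shell
# Create a hotword list: one phrase per line (assumed format).
cat > hotwords.txt <<'EOF'
诺香丹青牌科研胶囊
Dataocean AI
EOF
wc -l < hotwords.txt   # 2 entries
```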
Python usage
```python
import dolphin
from dolphin import transcribe

model_name = 'small.cn'
model = dolphin.load_model(model_name, device="cuda")
result = transcribe(model, 'audio.wav')
print(result.text)

# Specify language
result = transcribe(model, 'audio.wav', lang_sym="zh")
print(result.text)

# Specify language, region, and encoder-biased hotwords
result = transcribe(model, 'audio.wav', lang_sym="zh", region_sym="CN", hotwords=['诺香丹青牌科研胶囊'], use_deep_biasing=True, use_two_stage_filter=True)
print(result.text)

# Prompt-based hotwords
model_name = 'small.cn.prompt'
model = dolphin.load_model(model_name, device="cuda")
result = transcribe(model, 'audio.wav', hotwords=['诺香丹青牌科研胶囊'], use_prompt_hotword=True, use_two_stage_filter=True, decoding_method='attention')
print(result.text)
```
License
Dolphin-CN-Dialect is released under the Apache 2.0 License.
