---
frameworks:
- ""
language:
- zh
license: apache-2.0
tags:
- speech
- asr
tasks: []
---
# Dolphin-CN-Dialect
[Paper](https://arxiv.org/abs/2605.08961)
[Github](https://github.com/DataoceanAI/Dolphin)
[Huggingface](https://huggingface.co/DataoceanAI)
[Modelscope](https://www.modelscope.cn/organization/DataoceanAI)
## Repository Notice
This model is officially maintained by **Dataocean AI**.
To ensure compatibility with existing user code and download links, we maintain two official repositories for the same model:
- Original / legacy repository: `DataoceanAI`
- Organization / enterprise repository: `DataoceanAI1`
Both repositories are maintained by the same team and contain identical model files.
`DataoceanAI1` is the newly created enterprise organization account, while `DataoceanAI` is kept so that existing user download scripts and links continue to work.
Neither repository should be regarded as an unofficial copy or an unauthorized redistribution.
**Dolphin-CN-Dialect** is a multi-dialect ASR model developed by Dataocean AI and Tsinghua University, with a strong focus on Chinese dialect recognition and real-world deployment scenarios. Compared with the previous Dolphin series, Dolphin-CN-Dialect introduces significant improvements in tokenizer design, dialect-balanced training, streaming capability, hotword biasing, and deployment efficiency.
The model supports Mandarin Chinese and 22 Chinese dialects, while also maintaining multilingual ASR capability inherited from Dolphin. Dolphin-CN-Dialect supports both streaming and non-streaming inference, enabling practical deployment in latency-sensitive applications such as real-time transcription and industrial speech recognition systems.
## Approach
Dolphin-CN-Dialect is built upon the Dolphin architecture and follows a joint CTC-Attention framework with:
* Encoder: E-Branchformer
* Decoder: Transformer Decoder
* Training Objective: Joint CTC + Attention loss
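The joint objective above is the standard hybrid CTC/attention formulation, `L = w * L_ctc + (1 - w) * L_att`. As a minimal sketch (the weight value below is an assumption; hybrid systems commonly use something around 0.3, and the actual value used by Dolphin-CN-Dialect is not stated here):

```python
def joint_loss(ctc_loss: float, att_loss: float, weight: float = 0.3) -> float:
    """Hybrid CTC/attention objective: L = w * L_ctc + (1 - w) * L_att.

    The CTC branch encourages monotonic alignment, while the attention
    branch models output dependencies; the weight trades off the two.
    """
    return weight * ctc_loss + (1.0 - weight) * att_loss
```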
Compared to Dolphin, Dolphin-CN-Dialect introduces several important improvements:
* Temperature-based data sampling for balancing standard Mandarin and low-resource dialects
* Redesigned tokenizer with:
* character-level modeling for Chinese
* BPE-based subword modeling for English
* extensible dialect tokens
* Streaming ASR support
* Hotword-biased decoding, including:
* encoder-level contextual biasing
* prompt-based decoder biasing
Experimental results show that Dolphin-CN-Dialect achieves:
* 38% improvement in dialect recognition accuracy
* 16.3% relative CER reduction over Dolphin
* Competitive performance with recent large-scale ASR systems while maintaining a smaller model size
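The temperature-based data sampling mentioned above can be sketched with the common formulation `p_i ∝ n_i^(1/T)`, where `n_i` is the amount of data for dialect `i` and a temperature `T > 1` flattens the distribution toward low-resource dialects. This is a minimal illustration of the general technique, not the exact procedure used in the paper:

```python
def sampling_probs(sizes: list[float], temperature: float = 3.0) -> list[float]:
    """Compute per-dialect sampling probabilities p_i ∝ n_i^(1/T).

    With T = 1 this reduces to sampling proportional to data size;
    larger T upweights low-resource dialects during training.
    """
    weights = [n ** (1.0 / temperature) for n in sizes]
    total = sum(weights)
    return [w / total for w in weights]

# Hypothetical sizes: 10,000 hours of Mandarin vs. 100 hours of a dialect.
# Proportional sampling would pick the dialect ~1% of the time; with T = 3
# its share rises substantially.
probs = sampling_probs([10000, 100], temperature=3.0)
```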
![Dolphin-CN-Dialect feature poster](Dolphin-CN-Dialect.png)
See details in the [Paper](https://arxiv.org/abs/2605.08961).
## Setup
Dolphin-CN-Dialect requires FFmpeg to convert audio files into WAV format. Please install FFmpeg first if it is not already installed on your system.
```shell
# Ubuntu / Debian
sudo apt update && sudo apt install ffmpeg
# macOS
brew install ffmpeg
# Windows
choco install ffmpeg
```
Install Dolphin with pip:
```shell
pip install -U dolphin
```
Alternatively, install from source:
```shell
pip install git+https://github.com/DataoceanAI/Dolphin.git
```
## Available Models
Currently, Dolphin-CN-Dialect provides multiple model sizes optimized for different deployment scenarios.
| Model | Parameters | Hotwords |
|:------:|:----------:|:----------:|
| base.cn | 0.1 B | ❌ |
| base.cn.streaming | 0.1 B | ❌ |
| small.cn | 0.4 B | Encoder-biased Hotwords |
| small.cn.streaming | 0.4 B | Encoder-biased Hotwords |
| small.cn.prompt | 0.4 B | Prompt-based Hotwords |
## Hotword Biasing
Dolphin-CN-Dialect supports two hotword biasing approaches.
**Encoder-Level Contextual Biasing**
* Supports both streaming and non-streaming models
* Integrates contextual embeddings into encoder representations
* Efficient adaptation without retraining the full model
**Prompt-Based Hotword Biasing**
* Designed for non-streaming models
* Injects hotwords directly into decoder prompts
* Particularly effective for long-tail and rare phrases
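As a rough illustration of the prompt-based approach, hotword phrases can be serialized into a context prompt that the decoder conditions on before emitting the transcript. The delimiter tokens below are hypothetical; the actual prompt format is internal to the model:

```python
def build_hotword_prompt(hotwords: list[str],
                         start: str = "<hw>", end: str = "</hw>") -> str:
    """Wrap each hotword phrase in (hypothetical) delimiter tokens,
    producing a single prompt string for decoder conditioning."""
    return "".join(f"{start}{w}{end}" for w in hotwords)

prompt = build_hotword_prompt(["诺香丹青牌科研胶囊"])
```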
Experimental results show significant reductions in hotword error rates while maintaining strong overall ASR performance.
## Supported Languages and Dialects
Dolphin-CN-Dialect primarily focuses on:
* Mandarin Chinese
* 22 Chinese dialects
* Regionally accented Mandarin
Supported dialects include:
* Sichuan
* Wu
* Minnan
* Shanghai
* Gansu
* Guangdong
* Wenzhou
* Hunan
* Anhui
* Henan
* Fujian
* Hebei
* Liaoning
* Shaanxi
* Tianjin
* and more
For the complete language and dialect list, see [languages.md](./languages.md).
## Supported Devices
| Device Type | Support Status |
|:-------------:|:----------------:|
| **CUDA** | ✅ Supported |
| **MPS (Apple)** | ✅ Supported |
| **CPU** | ✅ Supported |
## Usage
### Command-line usage
```shell
# Transcribe with the default model
dolphin audio.wav
# Download model and specify the model path
dolphin audio.wav --model small.cn --model_dir /data/models/dolphin/
# Specify language and region
dolphin audio.wav --model small.cn --model_dir /data/models/dolphin/ --lang_sym "zh" --region_sym "CN"
# Specify the hotwords file with Encoder-biased method
dolphin audio.wav --model small.cn --model_dir /data/models/dolphin/ --hotword_list_path hotwords.txt --use_deep_biasing true
# Using prompt-based model
dolphin audio.wav --model small.cn.prompt --model_dir /data/models/dolphin/ --hotword_list_path hotwords.txt --use_prompt_hotword true --use_two_stage_filter true
```
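The file passed via `--hotword_list_path` is assumed here to be a plain UTF-8 text file with one hotword phrase per line (this format is an assumption based on common ASR hotword-list conventions; the phrases are illustrative):

```text
诺香丹青牌科研胶囊
实时转写
```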
### Python usage
```python
import dolphin
from dolphin import transcribe
model_name = 'small.cn'
model = dolphin.load_model(model_name, device="cuda")
result = transcribe(model, 'audio.wav')
print(result.text)
# Specify language
result = transcribe(model, 'audio.wav', lang_sym="zh")
print(result.text)
# Specify language and region and encoder-biased hotwords
result = transcribe(model, 'audio.wav', lang_sym="zh", region_sym="CN", hotwords=['诺香丹青牌科研胶囊'], use_deep_biasing=True, use_two_stage_filter=True)
print(result.text)
# Prompt-based hotwords
model_name = 'small.cn.prompt'
model = dolphin.load_model(model_name, device="cuda")
result = transcribe(model, 'audio.wav', hotwords=['诺香丹青牌科研胶囊'], use_prompt_hotword=True, use_two_stage_filter=True, decoding_method='attention')
print(result.text)
```
## License
Dolphin-CN-Dialect is released under the Apache 2.0 License.