---
frameworks:
- ""
language:
- zh
license: apache-2.0
tags:
- speech
- asr
tasks: []
---

# Dolphin-CN-Dialect

[Paper](https://arxiv.org/abs/2605.08961)
[Github](https://github.com/DataoceanAI/Dolphin)
[Huggingface](https://huggingface.co/DataoceanAI)
[Modelscope](https://www.modelscope.cn/organization/DataoceanAI)

# Repository Notice

This model is officially maintained by **Dataocean AI**.

To ensure compatibility with existing user code and download links, we keep two official repositories for the same model:

- Original / legacy repository: `DataoceanAI`
- Organization / enterprise repository: `DataoceanAI1`

Both repositories are maintained by the same team and contain the same model files.  
`DataoceanAI1` is the newly created enterprise organization account, while `DataoceanAI` is kept to avoid breaking existing user download scripts and links.

Please do not regard either repository as an unofficial copy or unauthorized redistribution.

**Dolphin-CN-Dialect** is a multi-dialect ASR model developed by Dataocean AI and Tsinghua University, with a strong focus on Chinese dialect recognition and real-world deployment scenarios. Compared with the previous Dolphin series, Dolphin-CN-Dialect introduces significant improvements in tokenizer design, dialect-balanced training, streaming capability, hotword biasing, and deployment efficiency.

The model supports Mandarin Chinese and 22 Chinese dialects, while also maintaining multilingual ASR capability inherited from Dolphin. Dolphin-CN-Dialect supports both streaming and non-streaming inference, enabling practical deployment in latency-sensitive applications such as real-time transcription and industrial speech recognition systems.


## Approach

Dolphin-CN-Dialect is built upon the Dolphin architecture and follows a joint CTC-Attention framework with:

* Encoder: E-Branchformer
* Decoder: Transformer Decoder
* Training Objective: Joint CTC + Attention loss

Compared to Dolphin, Dolphin-CN-Dialect introduces several important improvements:

* Temperature-based data sampling for balancing standard Mandarin and low-resource dialects
* Redesigned tokenizer with:
    * character-level modeling for Chinese
    * BPE-based subword modeling for English
    * extensible dialect tokens
* Streaming ASR support
* Hotword-biased decoding, including:
    * encoder-level contextual biasing
    * prompt-based decoder biasing
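Temperature-based data sampling can be illustrated with a short sketch. The exact formulation used by Dolphin-CN-Dialect is not given here; the form below, p_i ∝ (n_i / N)^(1/T), is a common choice, and the dialect hour counts are made up for illustration. A higher temperature T flattens the distribution, up-weighting low-resource dialects relative to standard Mandarin:

```python
# Hedged sketch of temperature-based sampling (assumed form, hypothetical counts).
def sampling_probs(counts, temperature=3.0):
    """Map raw per-dialect data counts to sampling probabilities.

    With temperature=1.0 this reproduces the raw data distribution;
    larger temperatures flatten it toward uniform.
    """
    total = sum(counts.values())
    weights = {d: (n / total) ** (1.0 / temperature) for d, n in counts.items()}
    z = sum(weights.values())
    return {d: w / z for d, w in weights.items()}

# Hypothetical utterance counts: Mandarin dominates the raw data.
counts = {"mandarin": 9_000_000, "sichuan": 500_000, "wenzhou": 50_000}
probs = sampling_probs(counts, temperature=3.0)
```

With these made-up counts, the low-resource dialect's sampling probability rises well above its raw share of the data, while Mandarin's falls below its raw share.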

Experimental results show that Dolphin-CN-Dialect achieves:

* 38% improvement in dialect recognition accuracy
* 16.3% relative CER reduction over Dolphin
* Competitive performance with recent large-scale ASR systems while maintaining a smaller model size
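For clarity, "relative CER reduction" means the drop in character error rate expressed as a fraction of the baseline CER. The CER values below are hypothetical, chosen only to illustrate the arithmetic, not taken from the paper:

```python
# "Relative CER reduction" = (baseline - new) / baseline.
# The two CER values here are hypothetical illustrations.
old_cer, new_cer = 0.200, 0.1674
relative_reduction = (old_cer - new_cer) / old_cer
print(f"{relative_reduction:.1%}")  # prints "16.3%"
```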

![Dolphin-CN-Dialect feature poster](Dolphin-CN-Dialect.png)


See details in the [Paper](https://arxiv.org/abs/2605.08961).


## Setup

Dolphin-CN-Dialect requires FFmpeg to convert audio files into WAV format. Please install FFmpeg first if it is not already installed on your system.

```shell
# Ubuntu / Debian
sudo apt update && sudo apt install ffmpeg
# macOS
brew install ffmpeg
# Windows
choco install ffmpeg
```

Install Dolphin with pip:

```shell
pip install -U dolphin
```

Alternatively, install from source:

```shell
pip install git+https://github.com/DataoceanAI/Dolphin.git
```

## Available Models

Currently, Dolphin-CN-Dialect provides multiple model sizes optimized for different deployment scenarios.

| Model | Parameters | Hotwords |
|:-----:|:----------:|:--------:|
| base.cn | 0.1 B | ❌ |
| base.cn.streaming | 0.1 B | ❌ |
| small.cn | 0.4 B | Encoder-biased Hotwords |
| small.cn.streaming | 0.4 B | Encoder-biased Hotwords |
| small.cn.prompt | 0.4 B | Prompt-based Hotwords |


## Hotword Biasing

Dolphin-CN-Dialect supports two hotword biasing approaches.

**Encoder-Level Contextual Biasing**

* Supports both streaming and non-streaming models
* Integrates contextual embeddings into encoder representations
* Efficient adaptation without retraining the full model

**Prompt-Based Hotword Biasing**

* Designed for non-streaming models
* Injects hotwords directly into decoder prompts
* Particularly effective for long-tail and rare phrases

Experimental results show significant reductions in hotword error rates while maintaining strong overall ASR performance.
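The usage examples later in this document pass hotwords either as a file (`--hotword_list_path hotwords.txt`) or as a list of strings in Python. The file format is not documented here; a plausible convention, assumed below, is one UTF-8 phrase per line:

```python
# Assumption: hotwords.txt holds one hotword phrase per line, UTF-8 encoded.
hotwords = ["诺香丹青牌科研胶囊", "海天瑞声"]  # example phrases only

with open("hotwords.txt", "w", encoding="utf-8") as f:
    f.write("\n".join(hotwords) + "\n")

# Reading the list back, skipping blank lines.
with open("hotwords.txt", encoding="utf-8") as f:
    loaded = [line.strip() for line in f if line.strip()]
```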



## Supported Languages and Dialects

Dolphin-CN-Dialect primarily focuses on:

* Mandarin Chinese
* 22 Chinese dialects
* Regionally accented Mandarin

Supported dialects include:

* Sichuan
* Wu
* Minnan
* Shanghai
* Gansu
* Guangdong
* Wenzhou
* Hunan
* Anhui
* Henan
* Fujian
* Hebei
* Liaoning
* Shaanxi
* Tianjin
* and more

For the complete language and dialect list, see [languages.md](./languages.md).

## Supported Devices

| Device Type | Support Status |
|:-----------:|:--------------:|
| **CUDA** | ✅ Supported |
| **MPS (Apple)** | ✅ Supported |
| **CPU** | ✅ Supported |



## Usage

### Command-line usage

```shell
# Basic transcription
dolphin audio.wav

# Download the model and specify the model path
dolphin audio.wav --model small.cn --model_dir /data/models/dolphin/

# Specify language and region
dolphin audio.wav --model small.cn --model_dir /data/models/dolphin/ --lang_sym "zh" --region_sym "CN"

# Specify a hotwords file with the encoder-biased method
dolphin audio.wav --model small.cn --model_dir /data/models/dolphin/ --hotword_list_path hotwords.txt --use_deep_biasing true

# Use the prompt-based model
dolphin audio.wav --model small.cn.prompt --model_dir /data/models/dolphin/ --hotword_list_path hotwords.txt --use_prompt_hotword true --use_two_stage_filter true
```

### Python usage

```python
import dolphin
from dolphin import transcribe

# Load a non-streaming model on GPU
model = dolphin.load_model("small.cn", device="cuda")

# Basic transcription
result = transcribe(model, "audio.wav")
print(result.text)

# Specify language
result = transcribe(model, "audio.wav", lang_sym="zh")
print(result.text)

# Specify language and region, with encoder-biased hotwords
result = transcribe(
    model,
    "audio.wav",
    lang_sym="zh",
    region_sym="CN",
    hotwords=["诺香丹青牌科研胶囊"],
    use_deep_biasing=True,
    use_two_stage_filter=True,
)
print(result.text)

# Prompt-based hotwords
model = dolphin.load_model("small.cn.prompt", device="cuda")
result = transcribe(
    model,
    "audio.wav",
    hotwords=["诺香丹青牌科研胶囊"],
    use_prompt_hotword=True,
    use_two_stage_filter=True,
    decoding_method="attention",
)
print(result.text)
```


## License

Dolphin-CN-Dialect is released under the Apache 2.0 License.