Chinese
speech
asr
Commit 6cd7616 by DataoceanAI (verified) · 1 parent: ca1a42e

Update README.md

  1. README.md +200 -3
README.md CHANGED
@@ -1,3 +1,200 @@
- ---
- license: apache-2.0
- ---
+ # Dolphin-Fangyan
+
+ [Paper](https://arxiv.org/abs/2503.20212)
+ [GitHub](https://github.com/DataoceanAI/Dolphin)
+ [Hugging Face](https://huggingface.co/DataoceanAI)
+ [ModelScope](https://www.modelscope.cn/organization/DataoceanAI)
+ [OpenI](https://openi.pcl.ac.cn/DataoceanAI/Dolphin)
+ [Wisemodel](https://wisemodel.cn/models/lijp22/dolphin-base)
+
+ **Dolphin-Fangyan** is a multi-dialect ASR model developed by Dataocean AI and Tsinghua University, with a strong focus on Chinese dialect recognition and real-world deployment scenarios. Compared with the previous Dolphin series, it improves tokenizer design, dialect-balanced training, streaming capability, hotword biasing, and deployment efficiency.
+
+ The model supports Mandarin Chinese and 22 Chinese dialects while retaining the multilingual ASR capability inherited from Dolphin. It supports both streaming and non-streaming inference, which makes it practical for latency-sensitive applications such as real-time transcription and industrial speech recognition systems.
+
+
+ ## Approach
+
+ Dolphin-Fangyan is built on the Dolphin architecture and follows a joint CTC-Attention framework with:
+
+ * Encoder: E-Branchformer
+ * Decoder: Transformer Decoder
+ * Training Objective: Joint CTC + Attention loss
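
The joint objective listed above is commonly written as a weighted sum of the two losses. The form below is the standard hybrid CTC/attention formulation, given here as a sketch; the weight $\lambda$ is a generic symbol, since this README does not state the value used for Dolphin-Fangyan.

```latex
\mathcal{L}_{\text{joint}} = \lambda\,\mathcal{L}_{\text{CTC}} + (1-\lambda)\,\mathcal{L}_{\text{Att}}, \qquad 0 \le \lambda \le 1
```

With $\lambda = 1$ the model is trained with CTC only, and with $\lambda = 0$ with the attention decoder only.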
+
+ Compared to Dolphin, Dolphin-Fangyan introduces several important improvements:
+
+ * Temperature-based data sampling for balancing standard Mandarin and low-resource dialects
+ * Redesigned tokenizer with:
+   * character-level modeling for Chinese
+   * BPE-based subword modeling for English
+   * extensible dialect tokens
+ * Streaming ASR support
+ * Hotword-biased decoding, including:
+   * encoder-level contextual biasing
+   * prompt-based decoder biasing
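
To make the temperature-based sampling item above concrete, here is a minimal sketch of the form commonly used in multilingual training, where the sampling probability of a dialect is proportional to its data size raised to the power 1/T. The README does not give the exact formula or temperature used for Dolphin-Fangyan, so the function name, the default temperature, and the data sizes below are all illustrative assumptions.

```python
def temperature_sampling_probs(hours_per_dialect, temperature=2.0):
    """Return sampling probabilities p_i proportional to n_i ** (1 / T).

    T = 1 reproduces the natural data distribution; larger T flattens it,
    so low-resource dialects are sampled more often.
    """
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    weights = {
        name: hours ** (1.0 / temperature)
        for name, hours in hours_per_dialect.items()
    }
    total = sum(weights.values())
    return {name: weight / total for name, weight in weights.items()}


# Hypothetical data sizes in hours (illustrative only, not from the paper).
example_hours = {"mandarin": 10000.0, "sichuan": 500.0, "wenzhou": 50.0}

natural = temperature_sampling_probs(example_hours, temperature=1.0)
flattened = temperature_sampling_probs(example_hours, temperature=5.0)

for name in example_hours:
    print(f"{name}: T=1 -> {natural[name]:.3f}, T=5 -> {flattened[name]:.3f}")
```

Raising the temperature moves probability mass from the largest set toward the smallest ones, which is the balancing effect the list item describes.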
+
+ Experimental results show that Dolphin-Fangyan achieves:
+
+ * 38% improvement in dialect recognition accuracy
+ * 16.3% relative CER reduction over Dolphin
+ * Competitive performance with recent large-scale ASR systems while maintaining a smaller model size
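
For readers unfamiliar with the term, a relative CER reduction is measured against the baseline error rate rather than in absolute percentage points. The sketch below shows the arithmetic; the two CER values are hypothetical numbers chosen only to illustrate the calculation and are not results from the paper.

```python
def relative_cer_reduction(baseline_cer, new_cer):
    """Relative reduction in percent: 100 * (baseline - new) / baseline."""
    if baseline_cer <= 0:
        raise ValueError("baseline_cer must be positive")
    return 100.0 * (baseline_cer - new_cer) / baseline_cer


# Hypothetical CERs (in percent), not taken from the paper.
example_reduction = relative_cer_reduction(baseline_cer=10.0, new_cer=8.37)
print(f"relative CER reduction: {example_reduction:.1f}%")
```

In this illustration the absolute drop is 1.63 percentage points, while the relative reduction is about 16.3%.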
+
+ See the [Paper](https://arxiv.org/abs/2503.20212) for details.
+
+
+ ## Setup
+
+ Dolphin-Fangyan requires FFmpeg to convert audio files to WAV format. Install FFmpeg first if it is not already on your system.
+
+ ```shell
+ # Ubuntu / Debian
+ sudo apt update && sudo apt install ffmpeg
+ # macOS (Homebrew)
+ brew install ffmpeg
+ # Windows (Chocolatey)
+ choco install ffmpeg
+ ```
+
+ Install Dolphin with pip:
+
+ ```shell
+ pip install -U dataoceanai-dolphin
+ ```
+
+ Alternatively, install from source:
+
+ ```shell
+ pip install git+https://github.com/DataoceanAI/Dolphin.git
+ ```
+
+ ## Available Models
+
+ Dolphin-Fangyan currently provides several models of different sizes for different deployment scenarios.
+
+ | Model | Parameters | Hotwords |
+ |:------:|:----------:|:----------:|
+ | base.fangyan | 74 M | ❌ |
+ | base.fangyan.streaming | 74 M | ❌ |
+ | small.fangyan | 0.4 B | Encoder-biased hotwords |
+ | small.fangyan.streaming | 0.4 B | Encoder-biased hotwords |
+ | small.fangyan.prompt | 0.4 B | Prompt-based hotwords |
+
+
+ ## Hotword Biasing
+
+ Dolphin-Fangyan supports two hotword biasing approaches.
+
+ **Encoder-Level Contextual Biasing**
+
+ * Supports both streaming and non-streaming models
+ * Integrates contextual embeddings into encoder representations
+ * Adapts efficiently without retraining the full model
+
+ **Prompt-Based Hotword Biasing**
+
+ * Designed for non-streaming models
+ * Injects hotwords directly into decoder prompts
+ * Particularly effective for long-tail and rare phrases
+
+ Experimental results show significant reductions in hotword error rates while maintaining strong overall ASR performance.
+
+
+
+ ## Supported Languages and Dialects
+
+ Dolphin-Fangyan primarily focuses on:
+
+ * Mandarin Chinese
+ * 22 Chinese dialects
+ * Regionally accented Mandarin
+
+ Supported dialects include:
+
+ * Sichuan
+ * Wu
+ * Minnan
+ * Shanghai
+ * Gansu
+ * Guangdong
+ * Wenzhou
+ * Hunan
+ * Anhui
+ * Henan
+ * Fujian
+ * Hebei
+ * Liaoning
+ * Shaanxi
+ * Tianjin
+ * and more
+
+ For the complete language and dialect list, see [languages.md](./languages.md).
+
+ ## Supported Devices
+
+ | Device Type | Support Status |
+ |:-------------:|:----------------:|
+ | **CUDA** | ✅ Supported |
+ | **MPS (Apple)** | ✅ Supported |
+ | **Ascend NPU (Huawei)** | ✅ Supported |
+ | **CPU** | ✅ Supported |
+
+ To run Dolphin on an Ascend NPU, install the matching `torch_npu` package and set the `ASCEND_RT_VISIBLE_DEVICES` environment variable. The tested configuration is `CANN==8.0.1`, `torch==2.2.0`, `torch_npu==2.2.0`; with this setup, inference has been verified to run correctly on the Ascend NPU.
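
The device table and the Ascend note above can be combined into a small selection helper. This is only a sketch: `pick_device` is a hypothetical function, not part of Dolphin, the value `"0"` for `ASCEND_RT_VISIBLE_DEVICES` is an example, and it is an assumption that the returned string (`"cuda"`, `"npu"`, `"mps"`, or `"cpu"`) can be passed as the device argument of `dolphin.load_model` in the same way as `"cpu"`.

```python
import importlib.util
import os


def pick_device(preferred=("cuda", "npu", "mps")):
    """Return the first available backend name, falling back to "cpu"."""
    if importlib.util.find_spec("torch") is None:
        return "cpu"
    import torch

    for name in preferred:
        if name == "cuda" and torch.cuda.is_available():
            return "cuda"
        if name == "npu" and importlib.util.find_spec("torch_npu") is not None:
            try:
                import torch_npu  # noqa: F401  (registers the torch.npu backend)
            except Exception:
                continue  # torch_npu is present but its runtime is not usable
            if torch.npu.is_available():
                return "npu"
        if name == "mps":
            mps = getattr(torch.backends, "mps", None)
            if mps is not None and mps.is_available():
                return "mps"
    return "cpu"


# Example value: expose only the first Ascend card, unless already configured.
# This must happen before torch_npu is imported.
os.environ.setdefault("ASCEND_RT_VISIBLE_DEVICES", "0")

device = pick_device()
print(f"selected device: {device}")
```

Checking backends in a fixed order keeps the choice predictable on machines with more than one accelerator; reorder `preferred` to change the priority.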
+
+
+
+ ## Usage
+
+ ### Command-line usage
+
+ ```shell
+ dolphin audio.wav
+
+ # Download the model and specify the model directory
+ dolphin audio.wav --model small.fangyan --model_dir /data/models/dolphin/
+
+ # Specify language and region
+ dolphin audio.wav --model small.fangyan --model_dir /data/models/dolphin/ --lang_sym "zh" --region_sym "CN"
+
+ # Pass a hotword file and use encoder-level biasing
+ dolphin audio.wav --model small.fangyan --model_dir /data/models/dolphin/ --hotword_list_path hotwords.txt --use_deep_biasing true
+
+ # Use the prompt-based model with a hotword file
+ dolphin audio.wav --model small.fangyan.prompt --model_dir /data/models/dolphin/ --hotword_list_path hotwords.txt --use_prompt_hotword true --use_two_stage_filter true
+
+ ```
+
+ ### Python usage
+
+ ```python
+ import dolphin
+ from dolphin import transcribe
+
+ model_name = 'small.fangyan'
+ # Load the model from a local model directory and run it on CPU
+ model = dolphin.load_model(model_name, f"/data/models/dolphin/{model_name}", "cpu")
+
+ result = transcribe(model, 'audio.wav')
+ print(result.text)
+
+ # Specify language
+ result = transcribe(model, 'audio.wav', lang_sym="zh")
+ print(result.text)
+
+ # Specify language, region, and encoder-biased hotwords
+ result = transcribe(model, 'audio.wav', lang_sym="zh", region_sym="CN", hotwords=['诺香丹青牌科研胶囊'], use_deep_biasing=True, use_two_stage_filter=True)
+ print(result.text)
+
+ # Prompt-based hotwords (requires the prompt model)
+
+ model_name = 'small.fangyan.prompt'
+ model = dolphin.load_model(model_name, f"/data/models/dolphin/{model_name}", "cpu")
+
+ result = transcribe(model, 'audio.wav', hotwords=['诺香丹青牌科研胶囊'], use_prompt_hotword=True, use_two_stage_filter=True, decoding_method='attention')
+
+ print(result.text)
+
+ ```
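
The Python example above passes hotwords as a list, while the command line takes a file. A small helper can bridge the two. This is a sketch under stated assumptions: `load_hotwords` is a hypothetical function, not part of Dolphin, and the one-hotword-per-line UTF-8 layout is assumed, since this README does not document the format of `hotwords.txt`.

```python
import tempfile
from pathlib import Path


def load_hotwords(path, encoding="utf-8"):
    """Read one hotword per line, dropping blank lines and duplicates.

    The original order of first occurrences is preserved.
    """
    seen = set()
    hotwords = []
    for line in Path(path).read_text(encoding=encoding).splitlines():
        word = line.strip()
        if word and word not in seen:
            seen.add(word)
            hotwords.append(word)
    return hotwords


# Demo with a throwaway file (assumed layout: one hotword per line).
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "hotwords.txt"
    demo_path.write_text(
        "诺香丹青牌科研胶囊\n\n诺香丹青牌科研胶囊\nDolphin\n", encoding="utf-8"
    )
    demo_hotwords = load_hotwords(demo_path)

print(demo_hotwords)
```

The returned list can then be supplied through the `hotwords=` argument of `transcribe`, as in the example above.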
+
+
+ ## License
+
+ Dolphin-Fangyan is released under the Apache 2.0 License.