---
license: apache-2.0
language:
- zh
tags:
- speech
- asr
frameworks:
- pytorch
---

# Dolphin-Fangyan

[Paper](https://arxiv.org/abs/2503.20212) | [Github](https://github.com/DataoceanAI/Dolphin) | [Huggingface](https://huggingface.co/DataoceanAI) | [Modelscope](https://www.modelscope.cn/organization/DataoceanAI) | [Openi](https://openi.pcl.ac.cn/DataoceanAI/Dolphin) | [Wisemodel](https://wisemodel.cn/models/lijp22/dolphin-base)

**Dolphin-Fangyan** is a multi-dialect ASR model developed by Dataocean AI and Tsinghua University, with a strong focus on Chinese dialect recognition and real-world deployment. Compared with the previous Dolphin series, Dolphin-Fangyan introduces significant improvements in tokenizer design, dialect-balanced training, streaming capability, hotword biasing, and deployment efficiency.

The model supports Mandarin Chinese and 22 Chinese dialects, while retaining the multilingual ASR capability inherited from Dolphin. Dolphin-Fangyan supports both streaming and non-streaming inference, enabling practical deployment in latency-sensitive applications such as real-time transcription and industrial speech recognition systems.

## Approach

Dolphin-Fangyan is built on the Dolphin architecture and follows a joint CTC-Attention framework with:

* Encoder: E-Branchformer
* Decoder: Transformer decoder
* Training objective: joint CTC + attention loss

Compared to Dolphin, Dolphin-Fangyan introduces several important improvements:

* Temperature-based data sampling to balance standard Mandarin and low-resource dialects
* A redesigned tokenizer with:
  * character-level modeling for Chinese
  * BPE-based subword modeling for English
  * extensible dialect tokens
* Streaming ASR support
* Hotword-biased decoding, including:
  * encoder-level contextual biasing
  * prompt-based decoder biasing

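To illustrate the temperature-based sampling idea above (a sketch of the general technique, not the actual Dolphin training code): each corpus can be sampled with probability proportional to its size raised to 1/T, so a higher temperature flattens the distribution toward low-resource dialects. The corpus sizes below are hypothetical.

```python
def sampling_probs(sizes, temperature):
    """Temperature-based sampling: p_i proportional to n_i ** (1/T)."""
    weights = [n ** (1.0 / temperature) for n in sizes]
    total = sum(weights)
    return [w / total for w in weights]

# Hypothetical corpus sizes in hours: Mandarin vs. two dialects.
sizes = [10000, 500, 100]
print(sampling_probs(sizes, 1.0))  # T=1: proportional sampling, Mandarin dominates
print(sampling_probs(sizes, 5.0))  # T=5: flatter, dialects are sampled far more often
```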
Experimental results show that Dolphin-Fangyan achieves:

* a 38% improvement in dialect recognition accuracy
* a 16.3% relative CER reduction over Dolphin
* performance competitive with recent large-scale ASR systems at a smaller model size

![Dolphin-Fangyan feature poster](dolphin_fangyan_feature_poster_v3.png)

See details in the [Paper](https://arxiv.org/abs/2503.20212).

## Setup

Dolphin-Fangyan requires FFmpeg to convert audio files to WAV format. Please install FFmpeg first if it is not already available on your system.

```shell
# Ubuntu / Debian
sudo apt update && sudo apt install ffmpeg
# macOS
brew install ffmpeg
# Windows
choco install ffmpeg
```

Install Dolphin with pip:

```shell
pip install -U dolphin
```

Alternatively, install from source:

```shell
pip install git+https://github.com/DataoceanAI/Dolphin.git
```

## Available Models

Dolphin-Fangyan currently provides multiple model sizes optimized for different deployment scenarios.

| Model | Parameters | Hotwords |
|:------:|:----------:|:----------:|
| base.fangyan | 74 M | ❌ |
| base.fangyan.streaming | 74 M | ❌ |
| small.fangyan | 0.4 B | Encoder-biased hotwords |
| small.fangyan.streaming | 0.4 B | Encoder-biased hotwords |
| small.fangyan.prompt | 0.4 B | Prompt-based hotwords |

## Hotword Biasing

Dolphin-Fangyan supports two hotword biasing approaches.

**Encoder-Level Contextual Biasing**

* Supports both streaming and non-streaming models
* Integrates contextual embeddings into encoder representations
* Enables efficient adaptation without retraining the full model

**Prompt-Based Hotword Biasing**

* Designed for non-streaming models
* Injects hotwords directly into decoder prompts
* Particularly effective for long-tail and rare phrases

Experimental results show significant reductions in hotword error rates while maintaining strong overall ASR performance.

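Both biasing modes take a hotword list file (`hotwords.txt` in the usage examples). The file format is not specified in this README; a common convention, assumed here, is one hotword or phrase per line. The first entry below is taken from the usage examples, the second is a made-up placeholder:

```text
诺香丹青牌科研胶囊
人工智能实验室
```
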
## Supported Languages and Dialects

Dolphin-Fangyan primarily focuses on:

* Mandarin Chinese
* 22 Chinese dialects
* Regionally accented Mandarin

Supported dialects include:

* Sichuan
* Wu
* Minnan
* Shanghai
* Gansu
* Guangdong
* Wenzhou
* Hunan
* Anhui
* Henan
* Fujian
* Hebei
* Liaoning
* Shaanxi
* Tianjin
* and more

For the complete language and dialect list, see [languages.md](./languages.md).

## Supported Devices

| Device Type | Support Status |
|:-------------:|:----------------:|
| **CUDA** | ✅ Supported |
| **MPS (Apple)** | ✅ Supported |
| **Ascend NPU (Huawei)** | ✅ Supported |
| **CPU** | ✅ Supported |

To run Dolphin on an Ascend NPU, install the matching `torch_npu` package and set the environment variable `ASCEND_RT_VISIBLE_DEVICES`. The tested configuration is `CANN==8.0.1`, `torch==2.2.0`, `torch_npu==2.2.0`; with this setup, the model has been verified to run inference correctly on the Ascend NPU.

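A minimal environment sketch for the Ascend setup described above, using the tested versions. The device index is an arbitrary choice, and whether additional CANN environment scripts must be sourced depends on your installation; both are assumptions here, not part of this README.

```shell
# Install the tested torch / torch_npu versions (CANN 8.0.1 must already be installed)
pip install torch==2.2.0 torch_npu==2.2.0

# Make NPU card 0 visible to the process (index 0 is an arbitrary choice)
export ASCEND_RT_VISIBLE_DEVICES=0

# Then run inference as usual, e.g.:
dolphin audio.wav --model small.fangyan --model_dir /data/models/dolphin/
```
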
## Usage

### Command-line usage

```shell
dolphin audio.wav

# Download the model and specify the model path
dolphin audio.wav --model small.fangyan --model_dir /data/models/dolphin/

# Specify language and region
dolphin audio.wav --model small.fangyan --model_dir /data/models/dolphin/ --lang_sym "zh" --region_sym "CN"

# Specify the hotword file, using the encoder-biased method
dolphin audio.wav --model small.fangyan --model_dir /data/models/dolphin/ --hotword_list_path hotwords.txt --use_deep_biasing true

# Use the prompt-based model
dolphin audio.wav --model small.fangyan.prompt --model_dir /data/models/dolphin/ --hotword_list_path hotwords.txt --use_prompt_hotword true --use_two_stage_filter true
```

### Python usage

```python
import dolphin
from dolphin import transcribe

model_name = 'small.fangyan'
model = dolphin.load_model(model_name, f"/home/duhu/.cache/dolphin/{model_name}", "cpu")
# model = dolphin.load_model(model_name, f"/data/models/dolphin/{model_name}", "cpu")

result = transcribe(model, 'audio.wav')
print(result.text)

# Specify language
result = transcribe(model, 'audio.wav', lang_sym="zh")
print(result.text)

# Specify language and region, with encoder-biased hotwords
result = transcribe(model, 'audio.wav', lang_sym="zh", region_sym="CN", hotwords=['诺香丹青牌科研胶囊'], use_deep_biasing=True, use_two_stage_filter=True)
print(result.text)

# Prompt-based hotwords
model_name = 'small.fangyan.prompt'
model = dolphin.load_model(model_name, f"/home/duhu/.cache/dolphin/{model_name}", "cpu")
result = transcribe(model, 'audio.wav', hotwords=['诺香丹青牌科研胶囊'], use_prompt_hotword=True, use_two_stage_filter=True, decoding_method='attention')
print(result.text)
```

## License

Dolphin-Fangyan is released under the Apache 2.0 License.