upload model

Files changed (7)

.gitattributes +1 -0
README.md +222 -3
base.fangyan.pt +3 -0
dolphin_fangyan_feature_poster_v3.png +3 -0
global_cmvn +1 -0
train.yaml +108 -0
units.txt +0 -0

.gitattributes CHANGED

@@ -33,3 +33,4 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
 *.zip filter=lfs diff=lfs merge=lfs -text
 *.zst filter=lfs diff=lfs merge=lfs -text
 *tfevents* filter=lfs diff=lfs merge=lfs -text
+dolphin_fangyan_feature_poster_v3.png filter=lfs diff=lfs merge=lfs -text

README.md CHANGED

@@ -1,3 +1,222 @@
----
-license: apache-2.0
----

+---
+license: apache-2.0
+language:
+- zh
+tags:
+- speech
+- asr
+frameworks:
+- pytorch
+---
+# Dolphin-Fangyan
+[Paper](https://arxiv.org/abs/2503.20212)
+[Github](https://github.com/DataoceanAI/Dolphin)
+[Huggingface](https://huggingface.co/DataoceanAI)
+[Modelscope](https://www.modelscope.cn/organization/DataoceanAI)
+[Openi](https://openi.pcl.ac.cn/DataoceanAI/Dolphin)
+[Wisemodel](https://wisemodel.cn/models/lijp22/dolphin-base)
+**Dolphin-Fangyan** is a multi-dialect ASR model developed by Dataocean AI and Tsinghua University, with a strong focus on Chinese dialect recognition and real-world deployment scenarios. Compared with the previous Dolphin series, Dolphin-Fangyan introduces significant improvements in tokenizer design, dialect-balanced training, streaming capability, hotword biasing, and deployment efficiency.
+The model supports Mandarin Chinese and 22 Chinese dialects, while also maintaining multilingual ASR capability inherited from Dolphin. Dolphin-Fangyan supports both streaming and non-streaming inference, enabling practical deployment in latency-sensitive applications such as real-time transcription and industrial speech recognition systems.
+## Approach
+Dolphin-Fangyan is built upon the Dolphin architecture and follows a joint CTC-Attention framework with:
+* Encoder: E-Branchformer
+* Decoder: Transformer Decoder
+* Training Objective: Joint CTC + Attention loss
+Compared to Dolphin, Dolphin-Fangyan introduces several important improvements:
+* Temperature-based data sampling for balancing standard Mandarin and low-resource dialects
+* Redesigned tokenizer with:
+    * character-level modeling for Chinese
+    * BPE-based subword modeling for English
+    * extensible dialect tokens
+* Streaming ASR support
+* Hotword-biased decoding, including:
+    * encoder-level contextual biasing
+    * prompt-based decoder biasing
+Experimental results show that Dolphin-Fangyan achieves:
+* 38% improvement in dialect recognition accuracy
+* 16.3% relative CER reduction over Dolphin
+* Competitive performance with recent large-scale ASR systems while maintaining a smaller model size
+![Dolphin-Fangyan feature poster](dolphin_fangyan_feature_poster_v3.png)
+See details in the [Paper](https://arxiv.org/abs/2503.20212).
+## Setup
+Dolphin-Fangyan requires FFmpeg to convert audio files into WAV format. Please install FFmpeg first if it is not already installed on your system.
+```shell
+# Ubuntu / Debian
+sudo apt update && sudo apt install ffmpeg
+# macOS
+brew install ffmpeg
+# Windows
+choco install ffmpeg
+```
+Install Dolphin with pip:
+```shell
+pip install -U dolphin
+```
+Alternatively, install from source:
+```shell
+pip install git+https://github.com/DataoceanAI/Dolphin.git
+```
+## Available Models
+Currently, Dolphin-Fangyan provides multiple model sizes optimized for different deployment scenarios.
+|  Model  | Parameters  | Hotwords |
+|:------:|:----------:|:----------:|
+|  base.fangyan  |    0.1 B   | ❌ |
+|  base.fangyan.streaming  |    0.1 B   |❌  |
+| small.fangyan  |   0.4 B      | Encoder-biased Hotwords |
+| small.fangyan.streaming  |   0.4 B      | Encoder-biased Hotwords |
+| small.fangyan.prompt |   0.4 B      | Prompt-based Hotwords |
+## Hotword Biasing
+Dolphin-Fangyan supports two hotword biasing approaches.
+**Encoder-Level Contextual Biasing**
+* Supports both streaming and non-streaming models
+* Integrates contextual embeddings into encoder representations
+* Efficient adaptation without retraining the full model
+**Prompt-Based Hotword Biasing**
+* Designed for non-streaming models
+* Injects hotwords directly into decoder prompts
+* Particularly effective for long-tail and rare phrases
+Experimental results show significant reductions in hotword error rates while maintaining strong overall ASR performance.
+## Supported Languages and Dialects
+Dolphin-Fangyan primarily focuses on:
+* Mandarin Chinese
+* 22 Chinese dialects
+* Regional accented Mandarin
+Supported dialects include:
+* Sichuan
+* Wu
+* Minnan
+* Shanghai
+* Gansu
+* Guangdong
+* Wenzhou
+* Hunan
+* Anhui
+* Henan
+* Fujian
+* Hebei
+* Liaoning
+* Shaanxi
+* Tianjin
+* and more
+For the complete language and dialect list, see [languages.md](./languages.md).
+## Supported Devices
+| Device Type | Support Status |
+|:-------------:|:----------------:|
+|**CUDA**|✅Supported|
+|**MPS (Apple)**|✅Supported|
+|**Ascend NPU (Huawei)**|✅Supported|
+|**CPU**|✅Supported|
+To run Dolphin on Ascend NPU, install the matching `torch_npu` package and set the `ASCEND_RT_VISIBLE_DEVICES` environment variable. The tested configuration is `CANN==8.0.1`, `torch==2.2.0`, `torch_npu==2.2.0`; with this setup, inference has been verified to run correctly on the Ascend NPU.
+## Usage
+### Command-line usage
+```shell
+dolphin audio.wav
+# Download model and specify the model path
+dolphin audio.wav --model small.fangyan --model_dir /data/models/dolphin/
+# Specify language and region
+dolphin audio.wav --model small.fangyan --model_dir /data/models/dolphin/ --lang_sym "zh" --region_sym "CN"
+# Specify a hotword file and use encoder-level biasing
+dolphin audio.wav --model small.fangyan --model_dir /data/models/dolphin/ --hotword_list_path hotwords.txt --use_deep_biasing true
+# Use the prompt-based hotword model
+dolphin audio.wav --model small.fangyan.prompt --model_dir /data/models/dolphin/ --hotword_list_path hotwords.txt --use_prompt_hotword true --use_two_stage_filter true
+```
+### Python usage
+```python
+import dolphin
+from dolphin import transcribe
+model_name = 'small.fangyan'
+model = dolphin.load_model(model_name, f"/data/models/dolphin/{model_name}", "cuda")
+result = transcribe(model, 'audio.wav')
+print(result.text)
+# Specify language
+result = transcribe(model, 'audio.wav', lang_sym="zh")
+print(result.text)
+# Specify language, region, and encoder-biased hotwords
+result = transcribe(model, 'audio.wav', lang_sym="zh", region_sym="CN", hotwords=['诺香丹青牌科研胶囊'], use_deep_biasing=True, use_two_stage_filter=True)
+print(result.text)
+# Prompt-based hotwords (requires the prompt model)
+model_name = 'small.fangyan.prompt'
+model = dolphin.load_model(model_name, f"/data/models/dolphin/{model_name}", "cuda")
+result = transcribe(model, 'audio.wav', hotwords=['诺香丹青牌科研胶囊'], use_prompt_hotword=True, use_two_stage_filter=True, decoding_method='attention')
+print(result.text)
+```
+## License
+Dolphin-Fangyan is released under the Apache 2.0 License.
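
Note on the README's "Approach" section: it mentions temperature-based data sampling but does not give the formula or the temperature used. A common formulation, assumed here and not confirmed by this commit, draws each language or dialect with probability proportional to its data share raised to `1 / T`. The function name, the hour counts, and the temperature values below are all hypothetical; this is only a sketch of the general idea.

```python
def temperature_sampling_probs(counts, temperature):
    """Return sampling probabilities p_i proportional to share_i ** (1 / temperature).

    temperature = 1 reproduces the natural data distribution; larger values
    flatten it, so low-resource dialects are sampled more often.
    """
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    total = sum(counts.values())
    scaled = {name: (n / total) ** (1.0 / temperature) for name, n in counts.items()}
    norm = sum(scaled.values())
    return {name: s / norm for name, s in scaled.items()}


# Hypothetical training hours per variety (not taken from the paper or this commit).
hours = {"mandarin": 9000.0, "sichuan": 900.0, "wenzhou": 100.0}

probs_t1 = temperature_sampling_probs(hours, 1.0)  # natural distribution
probs_t5 = temperature_sampling_probs(hours, 5.0)  # flattened distribution

for name in hours:
    print(f"{name}: T=1 -> {probs_t1[name]:.3f}, T=5 -> {probs_t5[name]:.3f}")
```

Raising the temperature shifts probability mass from the dominant variety toward the rare ones without changing their relative order.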

base.fangyan.pt ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:6c7a746c225f0f406053c9ebbdced7b79cfb91051d8060da3f1a26aa7913648b
+size 447175723
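
Note: the three lines above are a Git LFS pointer, not the checkpoint itself. After downloading the real file, its size and SHA-256 digest can be compared against the pointer. The sketch below uses only the Python standard library; the helper name `matches_lfs_pointer` and the local path `base.fangyan.pt` are assumptions made for illustration.

```python
import hashlib
import os


def matches_lfs_pointer(path, expected_oid, expected_size, chunk_size=1 << 20):
    """Return True if the file at `path` has the size and SHA-256 given in an LFS pointer."""
    if os.path.getsize(path) != expected_size:
        return False
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        # Read in chunks so a ~450 MB checkpoint is never fully loaded into memory.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest() == expected_oid


# Values copied from the pointer in this diff.
POINTER_OID = "6c7a746c225f0f406053c9ebbdced7b79cfb91051d8060da3f1a26aa7913648b"
POINTER_SIZE = 447175723

# Assumed local download location; adjust to wherever the file was saved.
local_path = "base.fangyan.pt"
if os.path.exists(local_path):
    print("checksum ok:", matches_lfs_pointer(local_path, POINTER_OID, POINTER_SIZE))
```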

dolphin_fangyan_feature_poster_v3.png ADDED

Git LFS Details

SHA256: fd337571c3573b3aeff3891045074b2efe3a98017452e9fdd8fcfd986bb2c884
Pointer size: 131 Bytes
Size of remote file: 587 kB

global_cmvn ADDED

	@@ -0,0 +1 @@

+ {"mean_stat": [533749120.0, 537379776.0, 553561472.0, 587164544.0, 631869696.0, 662598848.0, 684377024.0, 695393728.0, 692471168.0, 679433984.0, 666123200.0, 656323712.0, 665752576.0, 678693440.0, 681920896.0, 679622080.0, 669891840.0, 656595136.0, 653838528.0, 637679232.0, 628412096.0, 644836864.0, 638840960.0, 646180608.0, 639724352.0, 642756992.0, 637471744.0, 642369856.0, 643414976.0, 647382848.0, 649348672.0, 649294336.0, 650233920.0, 654485056.0, 660473792.0, 667416512.0, 673158464.0, 675675200.0, 675123648.0, 668017536.0, 670060160.0, 662626240.0, 663143808.0, 662504064.0, 666413696.0, 672262080.0, 678483904.0, 685386048.0, 692572416.0, 699064000.0, 700785280.0, 701202688.0, 702666560.0, 705441664.0, 706070720.0, 705989248.0, 702842816.0, 699316416.0, 696090176.0, 687561152.0, 675279808.0, 663676352.0, 662962880.0, 664298944.0, 666095808.0, 671681664.0, 676652224.0, 680097152.0, 683811072.0, 688700992.0, 692082880.0, 695787904.0, 701085376.0, 706388736.0, 711491584.0, 717637248.0, 719691456.0, 715812736.0, 696362624.0, 604648448.0], "var_stat": [5413307392.0, 5559845888.0, 6150984704.0, 6921248256.0, 7999779840.0, 8789867520.0, 9405782016.0, 9768041472.0, 9759789056.0, 9430661120.0, 9090545664.0, 8873148416.0, 9155918848.0, 9542536192.0, 9653540864.0, 9593434112.0, 9316643840.0, 8959277056.0, 8863545344.0, 8450634752.0, 8211585536.0, 8587086336.0, 8432618496.0, 8583947264.0, 8401719808.0, 8439344640.0, 8293782528.0, 8401505280.0, 8427503104.0, 8525163520.0, 8577082880.0, 8575110656.0, 8594999296.0, 8701685760.0, 8854966272.0, 9029483520.0, 9168757760.0, 9221463040.0, 9194539008.0, 8997074944.0, 9024589824.0, 8819394560.0, 8807888896.0, 8777241600.0, 8869670912.0, 9017397248.0, 9173403648.0, 9345572864.0, 9530641408.0, 9701232640.0, 9748996096.0, 9762760704.0, 9801994240.0, 9874428928.0, 9883272192.0, 9873506304.0, 9780680704.0, 9672627200.0, 9569440768.0, 9321866240.0, 8968148992.0, 8646342656.0, 8616977408.0, 8648623104.0, 8702088192.0, 8859208704.0, 
8999405568.0, 9105936384.0, 9220425728.0, 9358615552.0, 9451428864.0, 9552728064.0, 9695461376.0, 9836660736.0, 9970957312.0, 10135880704.0, 10189387776.0, 10070480896.0, 9532967936.0, 7261238272.0], "frame_num": 54068199}
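
Note: `global_cmvn` stores accumulated statistics, not the normalization constants themselves. Assuming the WeNet-style JSON convention (which `is_json_cmvn: true` in `train.yaml` suggests, though this commit does not confirm it), `mean_stat` holds per-bin feature sums, `var_stat` holds per-bin sums of squares, and `frame_num` is the frame count. A minimal sketch of turning them into a mean and inverse standard deviation, shown for the first of the 80 mel bins only:

```python
import math


def cmvn_from_stats(stats, var_floor=1e-20):
    """Convert accumulated sums into per-bin mean and inverse standard deviation.

    Assumes mean_stat = sum(x) and var_stat = sum(x ** 2) over frame_num frames.
    """
    n = stats["frame_num"]
    means, inv_stds = [], []
    for total, total_sq in zip(stats["mean_stat"], stats["var_stat"]):
        mean = total / n
        var = max(total_sq / n - mean * mean, var_floor)  # floor avoids division by zero
        means.append(mean)
        inv_stds.append(1.0 / math.sqrt(var))
    return means, inv_stds


# First mel bin copied from the file above; the remaining 79 bins work the same way.
stats = {
    "mean_stat": [533749120.0],
    "var_stat": [5413307392.0],
    "frame_num": 54068199,
}
means, inv_stds = cmvn_from_stats(stats)

# Normalizing one (hypothetical) log-mel value for that bin:
feature = 10.0
normalized = (feature - means[0]) * inv_stds[0]
print(means[0], inv_stds[0], normalized)
```

To process the whole file, load it with `json.load` and pass the resulting dictionary to the same function.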

train.yaml ADDED

	@@ -0,0 +1,108 @@

+accum_grad: 4
+cmvn: global_cmvn
+cmvn_conf:
+  cmvn_file: data/train/global_cmvn
+  is_json_cmvn: true
+ctc: ctc
+ctc_conf:
+  ctc_blank_id: 0
+dataset: asr
+dataset_conf:
+  batch_conf:
+    batch_size: 32
+    batch_type: static
+  ctc_label: true
+  cycle: 100
+  fbank_conf:
+    dither: 0.1
+    frame_length: 25
+    frame_shift: 10
+    num_mel_bins: 80
+  filter_conf:
+    max_length: 3000
+    min_length: 0
+    token_max_length: 200
+    token_min_length: 1
+  no_time_idx: 3
+  remove_punctuation: true
+  remove_timestamp: true
+  resample_conf:
+    resample_rate: 16000
+  shuffle: true
+  shuffle_conf:
+    shuffle_size: 5120
+  sort: true
+  sort_conf:
+    sort_size: 2048
+  spec_aug: true
+  spec_aug_conf:
+    max_f: 10
+    max_t: 50
+    num_f_mask: 2
+    num_t_mask: 2
+  speed_perturb: true
+  time_apply_prob: 0.0
+decoder: transformer
+decoder_conf:
+  attention_heads: 8
+  dropout_rate: 0.1
+  linear_units: 2048
+  num_blocks: 6
+  positional_dropout_rate: 0.1
+  self_attention_dropout_rate: 0.1
+  src_attention_dropout_rate: 0.1
+  use_sdpa: true
+dtype: fp32
+encoder: e_branchformer
+encoder_conf:
+  activation_type: swish
+  attention_dropout_rate: 0.1
+  attention_heads: 8
+  causal: false
+  cgmlp_conv_kernel: 31
+  cgmlp_linear_units: 2048
+  dropout_rate: 0.1
+  gate_activation: identity
+  input_layer: conv2d
+  linear_units: 2048
+  merge_conv_kernel: 31
+  num_blocks: 6
+  output_size: 512
+  pos_enc_layer_type: rel_pos
+  positional_dropout_rate: 0.1
+  selfattention_layer_type: rel_selfattn
+  use_linear_after_conv: false
+  use_sdpa: true
+grad_clip: 5
+input_dim: 80
+log_interval: 200
+max_epoch: 100
+model: asr_model
+model_conf:
+  ctc_weight: 0.3
+  length_normalized_loss: false
+  lsm_weight: 0.1
+model_dir: exp/dolphin_ebf_base_nonstreaming_v4.3
+optim: adam
+optim_conf:
+  lr: 0.0005
+output_dim: 18173
+save_interval: 2000
+save_states: model_only
+scheduler: warmuplr
+scheduler_conf:
+  warmup_steps: 2048
+stats_dialect: true
+tokenizer: char
+tokenizer_conf:
+  special_tokens:
+    <asr>: 4
+    <blank>: 0
+    <eos>: 3
+    <sos>: 2
+    <unk>: 1
+  split_with_space: false
+  symbol_table_path: data/dict/units.txt
+train_engine: torch_ddp
+use_amp: false
+vocab_size: 18173
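
Note: the config pairs `scheduler: warmuplr` with `lr: 0.0005` and `warmup_steps: 2048`. Assuming this follows the usual WeNet/ESPnet warmup schedule (an assumption, since the training code is not part of this commit), the learning rate rises linearly to the base value over the warmup steps and then decays with the inverse square root of the step. A small sketch with a hypothetical function name:

```python
def warmup_lr(step, base_lr=0.0005, warmup_steps=2048):
    """Learning rate at a given optimizer step under the assumed warmup schedule.

    lr = base_lr * warmup_steps**0.5 * min(step**-0.5, step * warmup_steps**-1.5)
    """
    step = max(step, 1)  # avoid 0 ** -0.5
    return base_lr * warmup_steps ** 0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)


# Linear ramp, peak at warmup_steps, then inverse-square-root decay.
for step in (1, 512, 1024, 2048, 8192, 32768):
    print(step, f"{warmup_lr(step):.6f}")
```

How a step is counted relative to `accum_grad: 4` depends on the training loop, which this commit does not show.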

units.txt ADDED

The diff for this file is too large to render. See raw diff