Batuka0901 committed · Commit c8eb5bd · verified · Parent(s): 0b73b4a

Upload WeNet Mongolian Conformer (model + TensorBoard + card)
README.md ADDED
---
language:
- mn
license: apache-2.0
tags:
- automatic-speech-recognition
- speech
- wenet
- conformer
- mongolian
- mn
datasets:
- bilguun/fleurs-mn
metrics:
- cer
- wer
model-index:
- name: wenet-mn-conformer
  results:
  - task:
      type: automatic-speech-recognition
      name: Automatic Speech Recognition
    dataset:
      type: bilguun/fleurs-mn
      name: FLEURS Mongolian
    metrics:
    - type: loss
      value: 374.93737238103694
      name: cv_loss (best epoch)
    - type: accuracy
      value: 0.25305086622635525
      name: attention accuracy (best epoch)
    - type: cer
      value: 0.8696
      name: CER on 3-example dev set
    - type: wer
      value: 1.0000
      name: WER on 3-example dev set
---

# WeNet Conformer — Mongolian (Монгол хэл)

WeNet U2++ Conformer model trained on [`bilguun/fleurs-mn`](https://huggingface.co/datasets/bilguun/fleurs-mn) for Mongolian (Cyrillic) automatic speech recognition.

## Model architecture

- **Encoder**: Conformer, 12 blocks × 256 dim, 4 attention heads
- **Decoder**: Bi-transformer (U2++), 3 left-to-right + 3 right-to-left blocks
- **Tokenizer**: char-level (38 Cyrillic tokens)
- **Loss**: hybrid CTC + attention (ctc_weight=0.3, reverse_weight=0.3)

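The weighting scheme above can be sketched in a few lines. This is a simplified illustration only; `hybrid_loss` and the toy numbers are ours, not part of WeNet, whose actual implementation lives in its model code and operates on tensors:

```python
def hybrid_loss(loss_ctc: float, loss_att_fwd: float, loss_att_rev: float,
                ctc_weight: float = 0.3, reverse_weight: float = 0.3) -> float:
    """Loss weighting used by U2++-style hybrid training (sketch).

    The right-to-left decoder loss is mixed into the attention term with
    reverse_weight, then CTC and attention losses are mixed with ctc_weight.
    """
    loss_att = (1.0 - reverse_weight) * loss_att_fwd + reverse_weight * loss_att_rev
    return ctc_weight * loss_ctc + (1.0 - ctc_weight) * loss_att

# Toy values, just to show the arithmetic:
# loss_att = 0.7 * 4.0 + 0.3 * 6.0 = 4.6; total = 0.3 * 10.0 + 0.7 * 4.6 = 6.22
print(round(hybrid_loss(10.0, 4.0, 6.0), 4))  # 6.22
```

With `ctc_weight=0.3` the attention branch dominates training, while the CTC branch keeps alignments monotonic and enables streaming-style decoding.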
## Training data

- **Dataset**: `bilguun/fleurs-mn`
- **Train**: 3,074 utterances · ~11.5 h
- **Test**: 949 utterances · ~2.85 h
- **Audio**: 16 kHz mono

## Training results

- Epochs run: **100**
- Final train loss: **N/A**
- Final epoch: **99** — cv_loss **459.21**, acc **0.4066**
- Best epoch: **21** — cv_loss **374.94**, acc **0.2531**
- TensorBoard: this repo has a **TensorBoard** tab (see `runs/`).

### Test-set metrics (attention rescoring, 3 held-out utterances)

- **Average CER: 86.96%**
- **Average WER: 100.00%**

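Both metrics are edit-distance based: CER counts character edits against reference length, WER does the same over whitespace tokens. A minimal self-contained sketch (the repo's actual scores come from WeNet's scoring tools):

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (dynamic programming)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        cur = [i]
        for j, h in enumerate(hyp, start=1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # substitution
        prev = cur
    return prev[-1]

def cer(ref: str, hyp: str) -> float:
    """Character error rate: character edits / reference length."""
    return edit_distance(list(ref), list(hyp)) / max(len(ref), 1)

def wer(ref: str, hyp: str) -> float:
    """Word error rate: same distance, computed over whitespace tokens."""
    r, h = ref.split(), hyp.split()
    return edit_distance(r, h) / max(len(r), 1)

print(cer("байна", "байса"))  # 0.2 (1 substitution over 5 chars)
```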
<details><summary><b>Example 1 — Education_035_block_002.wav</b> &nbsp;CER 86.00% &nbsp;·&nbsp; WER 100.00%</summary>

**REF:** Байна уу? Чи мимоса уумаар байна уу? Би Холландех рүү явах гэж байна. Чамайг явах байхаа гэж бодоод. Намайг чамтай хамт мимоса ууна гэж бодоо юу? Яагаад болохгүй гэж? Бид хоёр нэг нэгэндээ

**HYP:** АААНААНАААААААНБАЙДЭЭ

</details>

<details><summary><b>Example 2 — part13_003_block_011.wav</b> &nbsp;CER 87.72% &nbsp;·&nbsp; WER 100.00%</summary>

**REF:** Боловсролынх нь стандарт аягүй олон юмнаас нөлөөлдөг байх аа. Тэгээд нөгөө бакалавраа аваад гараад ирчихсэн залуучуудыг ажлын байран дээр нь гаргаж ирэнгүүтээ дахиад бид нар өөрсдөө дахиж сургах ёстой.

**HYP:** АНАААААНАНААНАНААБАЙЖЭЭ

</details>

<details><summary><b>Example 3 — part16_006_block_055.wav</b> &nbsp;CER 87.15% &nbsp;·&nbsp; WER 100.00%</summary>

**REF:** Тийм ер нь бол бүгдийг нь одоо мэдэхгүй зарим нэг муу багш байгаа л байх л даа. Тэгэхдээ миний хувьд бол, харахад бол манайхан бол ер нь бол аягүй сайн сайн тус гаднаас одоо бүгдээрээ л жигд болчихсон за юу. Одоо энэ тэр гадн

**HYP:** АНААНАААНАААНАНАНАНБАЙДЭГ

</details>

### Train + CV loss per 10 epochs

| Epoch | Step | train_loss | cv_loss | cv_loss_ctc | cv_loss_att | acc |
|-------|------|------------|---------|-------------|-------------|-----|
| 0 | 132 | — | 493.04 | 582.98 | 454.49 | 0.0854 |
| 10 | 1496 | — | 379.40 | 487.34 | 333.14 | 0.2473 |
| 20 | 2861 | — | 396.92 | 551.41 | 330.71 | 0.2412 |
| 30 | 4223 | — | 516.59 | 960.25 | 326.45 | 0.2599 |
| 40 | 5590 | — | 490.08 | 891.41 | 318.09 | 0.2764 |
| 50 | 6954 | — | 471.18 | 866.97 | 301.55 | 0.3215 |
| 60 | 8321 | — | 473.65 | 923.02 | 281.05 | 0.3756 |
| 70 | 9679 | — | 465.65 | 912.63 | 274.08 | 0.3659 |
| 80 | 11044 | — | 463.98 | 924.37 | 266.67 | 0.4058 |
| 90 | 12410 | — | 431.85 | 819.03 | 265.92 | 0.3977 |
| 99 | 13640 | — | 459.21 | 901.15 | 269.80 | 0.4066 |

## Files

| File | Description |
|------|-------------|
| `avg_10.pt` | Final model (average of the 10 best checkpoints) |
| `train.yaml` | Training config |
| `lang_char.txt` | Character vocabulary (38 tokens) |
| `global_cmvn` | Feature normalization stats |
| `train.log` | Full training log |
| `runs/` | TensorBoard events |

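Checkpoint averaging, used to produce `avg_10.pt`, simply averages corresponding parameters across the N best checkpoints. WeNet ships a script for this (its averaging tool operates on PyTorch state dicts); the sketch below shows the idea on plain dicts, with `average_states` being an illustrative name of ours:

```python
def average_states(states):
    """Element-wise average of corresponding entries across state dicts.

    Averaging the N best checkpoints (here N = 10) smooths per-epoch noise
    and usually decodes better than any single checkpoint.
    """
    n = len(states)
    return {key: sum(s[key] for s in states) / n for key in states[0]}

# Toy "checkpoints" with a single scalar parameter each:
print(average_states([{"w": 1.0}, {"w": 2.0}, {"w": 3.0}]))  # {'w': 2.0}
```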
## Usage (WeNet)

```bash
git clone https://github.com/wenet-e2e/wenet.git
cd wenet && pip install -e .

# Download the model files from this repo, then:
python wenet/bin/recognize.py \
  --config train.yaml \
  --checkpoint avg_10.pt \
  --dict lang_char.txt \
  --test_data your_data.list \
  --mode attention_rescoring \
  --beam_size 10 \
  --result_file result.txt
```

## Limitations

- Trained on only ~11.5 h of FLEURS Mongolian; error rates are high even in-domain (86.96% average CER on the dev examples above) and will be worse on out-of-domain speech.
- Only Cyrillic script is supported; Latin characters and digits are stripped.
- No language model rescoring is applied.
avg_10.pt ADDED

version https://git-lfs.github.com/spec/v1
oid sha256:103f9b09c32162b6108704f527c13d702381183c488e1987ecc573bbae8edc79
size 187539247
global_cmvn ADDED
+ {"mean_stat": [30923196.0, 32310116.0, 33415622.0, 36200568.0, 39474700.0, 41749368.0, 43107484.0, 43317052.0, 42227068.0, 40894508.0, 40888224.0, 41130580.0, 41935724.0, 42354480.0, 42387236.0, 42464544.0, 42346160.0, 42065056.0, 42460512.0, 41660492.0, 41050496.0, 41790484.0, 41091932.0, 41789372.0, 41694396.0, 42316184.0, 41963620.0, 42226152.0, 42113788.0, 42193772.0, 42357292.0, 42474108.0, 42636296.0, 42883404.0, 43228372.0, 43734832.0, 44355620.0, 44827140.0, 45197184.0, 45085852.0, 45590160.0, 45427712.0, 45821668.0, 45975316.0, 46224472.0, 46576584.0, 46857332.0, 47089112.0, 47371628.0, 47773048.0, 48137372.0, 48524612.0, 48760872.0, 48963484.0, 48938604.0, 49007124.0, 49286644.0, 49728124.0, 50140392.0, 50443820.0, 50951804.0, 51405272.0, 51815436.0, 52077764.0, 52385828.0, 52704224.0, 52932388.0, 53166064.0, 53354288.0, 53492524.0, 53507516.0, 53761996.0, 54232848.0, 54764608.0, 55353676.0, 55914684.0, 56588484.0, 57247388.0, 56693372.0, 50433548.0], "var_stat": [313056032.0, 340205280.0, 364214336.0, 406206496.0, 468360768.0, 516976000.0, 552321984.0, 564596608.0, 541072896.0, 505031264.0, 499250688.0, 503617376.0, 521025728.0, 531959200.0, 533065024.0, 532101952.0, 527271520.0, 519979328.0, 526128256.0, 508630976.0, 495443968.0, 509043648.0, 493818752.0, 505652672.0, 501634208.0, 512354144.0, 503553984.0, 508477504.0, 505764000.0, 506537952.0, 509665600.0, 512829504.0, 517067936.0, 521382656.0, 527926752.0, 538669568.0, 552602432.0, 562935104.0, 570335488.0, 567686208.0, 579041984.0, 575642880.0, 584228160.0, 587810240.0, 593443200.0, 600999168.0, 607060800.0, 612182400.0, 618473408.0, 627449280.0, 635429184.0, 644423232.0, 650121472.0, 655343872.0, 655346240.0, 656626304.0, 662780416.0, 672942976.0, 682552576.0, 690286528.0, 703312192.0, 714963584.0, 725732288.0, 732673792.0, 740466176.0, 748028736.0, 752881024.0, 758034624.0, 762378560.0, 766122496.0, 766731200.0, 773276608.0, 786042240.0, 799467520.0, 813984384.0, 828134976.0, 845798080.0, 
863950976.0, 848231104.0, 678710080.0], "frame_num": 4146368}
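The `global_cmvn` file stores per-dimension running sums (`mean_stat`), sums of squares (`var_stat`), and a total frame count; at load time these are converted into a mean and inverse standard deviation per fbank dimension. A hedged sketch of that conversion (function names are ours; WeNet's JSON CMVN loader differs in detail):

```python
import math

def cmvn_from_stats(mean_stat, var_stat, frame_num, floor=1e-20):
    """Convert accumulated sums into per-dim mean and inverse stddev.

    mean = sum(x) / N;  var = sum(x^2) / N - mean^2;  istd = 1 / sqrt(var)
    """
    mean = [m / frame_num for m in mean_stat]
    istd = [1.0 / math.sqrt(max(v / frame_num - mu * mu, floor))
            for mu, v in zip(mean, var_stat)]
    return mean, istd

def apply_cmvn(frame, mean, istd):
    """Normalize one fbank frame: (x - mean) * istd, per dimension."""
    return [(x - mu) * s for x, mu, s in zip(frame, mean, istd)]

# Toy stats: two frames with values 1.0 and 3.0 in a single dimension.
mean, istd = cmvn_from_stats([4.0], [10.0], 2)   # mean 2.0, var 1.0
print(apply_cmvn([3.0], mean, istd))  # [1.0]
```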
lang_char.txt ADDED

<blank> 0
<unk> 1
А 2
Н 3
Э 4
Г 5
Л 6
Р 7
О 8
Д 9
И 10
Й 11
Х 12
У 13
Т 14
С 15
Б 16
Ү 17
Ө 18
М 19
Ж 20
В 21
Ы 22
З 23
Ч 24
Ь 25
Е 26
Ц 27
Ш 28
К 29
Я 30
П 31
Ю 32
Ф 33
Ё 34
Ъ 35
Щ 36
<sos/eos> 37
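A minimal sketch of how a symbol table like this drives char-level tokenization. These helpers are illustrative, not WeNet's actual `CharTokenizer`; we uppercase and drop spaces here to match the uppercase, space-free vocabulary above:

```python
def load_symbol_table(lines):
    """Parse lang_char.txt-style 'SYMBOL ID' lines into a dict."""
    table = {}
    for line in lines:
        sym, idx = line.split()
        table[sym] = int(idx)
    return table

def tokenize(text, table, unk="<unk>"):
    """Map each non-space character (uppercased) to its id; OOV -> <unk>."""
    return [table.get(ch, table[unk]) for ch in text.upper() if ch != " "]

# A few entries from the table above:
table = load_symbol_table(["<blank> 0", "<unk> 1", "А 2", "Н 3", "Э 4"])
print(tokenize("анэ а", table))  # [2, 3, 4, 2]
```

Because the vocabulary has no Latin letters or digits, any such character maps to `<unk>`, consistent with the Limitations section of the card.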
runs/events.out.tfevents.1776340362.rookie-B650M-H-M-2 ADDED

version https://git-lfs.github.com/spec/v1
oid sha256:7a84c54edf810ab421b310b2138c0c0c5548d94667a6619141e5b3b164dd62e7
size 4388750
train.log ADDED
The diff for this file is too large to render. See raw diff
 
train.yaml ADDED

accum_grad: 1
cmvn: global_cmvn
cmvn_conf:
  cmvn_file: data/train/global_cmvn
  is_json_cmvn: true
ctc: ctc
ctc_conf:
  ctc_blank_id: 0
dataset: asr
dataset_conf:
  batch_conf:
    batch_size: 16
    batch_type: dynamic
    max_frames_in_batch: 12000
  fbank_conf:
    dither: 0.1
    frame_length: 25
    frame_shift: 10
    num_mel_bins: 80
  filter_conf:
    max_length: 40960
    min_length: 1600
    token_max_length: 200
    token_min_length: 1
  resample_conf:
    resample_rate: 16000
  shuffle: true
  shuffle_conf:
    shuffle_size: 1500
  sort: true
  sort_conf:
    sort_size: 500
  spec_aug: true
  spec_aug_conf:
    max_f: 10
    max_t: 50
    num_f_mask: 2
    num_t_mask: 2
  speed_perturb: true
decoder: bitransformer
decoder_conf:
  attention_heads: 4
  dropout_rate: 0.1
  linear_units: 2048
  num_blocks: 3
  positional_dropout_rate: 0.1
  r_num_blocks: 3
  self_attention_dropout_rate: 0.0
  src_attention_dropout_rate: 0.0
dtype: fp32
encoder: conformer
encoder_conf:
  activation_type: swish
  attention_dropout_rate: 0.0
  attention_heads: 4
  causal: false
  cnn_module_kernel: 15
  dropout_rate: 0.1
  input_layer: conv2d
  linear_units: 2048
  normalize_before: true
  num_blocks: 12
  output_size: 256
  pos_enc_layer_type: rel_pos
  positional_dropout_rate: 0.1
  selfattention_layer_type: rel_selfattn
  use_cnn_module: true
  use_dynamic_chunk: true
  use_dynamic_left_chunk: false
grad_clip: 5.0
input_dim: 80
log_interval: 100
max_epoch: 100
model: asr_model
model_conf:
  ctc_weight: 0.3
  length_normalized_loss: false
  lsm_weight: 0.1
  reverse_weight: 0.3
model_dir: exp/conformer_mongolian
optim: adam
optim_conf:
  lr: 0.002
output_dim: 38
save_states: model_only
scheduler: warmuplr
scheduler_conf:
  warmup_steps: 25000
tokenizer: char
tokenizer_conf:
  bpe_path: null
  is_multilingual: false
  non_lang_syms_path: null
  num_languages: 1
  special_tokens:
    <blank>: 0
    <eos>: 37
    <sos>: 37
    <unk>: 1
  split_with_space: false
  symbol_table_path: data/dict/lang_char.txt
train_engine: torch_ddp
use_amp: false
vocab_size: 38
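The `warmuplr` scheduler scales `lr` up roughly linearly for `warmup_steps`, then decays it with the inverse square root of the step. A sketch using the formula as commonly implemented in ESPnet-style WarmupLR schedulers, which WeNet follows; treat the exact expression as an assumption:

```python
def warmup_lr(base_lr: float, step: int, warmup_steps: int = 25000) -> float:
    """WarmupLR-style schedule: peaks at base_lr when step == warmup_steps,
    then decays as 1/sqrt(step)."""
    step = max(step, 1)
    return base_lr * warmup_steps ** 0.5 * min(step ** -0.5,
                                               step * warmup_steps ** -1.5)

print(round(warmup_lr(0.002, 25000), 6))  # 0.002 (peak at end of warmup)
```

Note that this run logged only ~13,640 optimizer steps by epoch 99 (see the loss table in the card), so it ended before the 25,000-step warmup completed and the learning rate was still ramping up, which may partly explain the weak convergence.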