| --- |
| language: |
| - mn |
| license: apache-2.0 |
| tags: |
| - automatic-speech-recognition |
| - speech |
| - wenet |
| - conformer |
| - mongolian |
| - mn |
| datasets: |
| - google/fleurs-mn |
| metrics: |
| - cer |
| - wer |
| model-index: |
| - name: wenet-mn-conformer |
|   results: |
|   - task: |
|       type: automatic-speech-recognition |
|       name: Automatic Speech Recognition |
|     dataset: |
|       type: google/fleurs-mn |
|       name: FLEURS Mongolian |
|     metrics: |
|     - type: loss |
|       value: 374.93737238103694 |
|       name: cv_loss (best epoch) |
|     - type: accuracy |
|       value: 0.25305086622635525 |
|       name: attention accuracy (best epoch) |
|     - type: cer |
|       value: 0.8696 |
|       name: CER on 3-example dev set |
|     - type: wer |
|       value: 1.0000 |
|       name: WER on 3-example dev set |
| --- |
| |
| # WeNet Conformer — Mongolian (Монгол хэл) |
|
|
| WeNet U2++ Conformer model trained on [`google/fleurs-mn`](https://huggingface.co/datasets/google/fleurs) |
| for Mongolian (Cyrillic) automatic speech recognition. |
|
|
| ## Model architecture |
|
|
| - **Encoder**: Conformer, 12 blocks × 256 dim, 4 heads |
| - **Decoder**: Bi-transformer (U2++), 3 L→R + 3 R→L blocks |
| - **Tokenizer**: char-level (38 Cyrillic tokens) |
| - **Loss**: CTC + Attention hybrid (ctc_weight=0.3, reverse_weight=0.3) |
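
The two weights in the loss bullet control how the CTC loss and the two attention-decoder losses are mixed. The sketch below shows the interpolation conventionally used for U2++-style training; the function and argument names are illustrative assumptions, not identifiers taken from this repo's `train.yaml` or from the training code.

```python
def u2pp_hybrid_loss(loss_ctc, loss_att_l2r, loss_att_r2l,
                     ctc_weight=0.3, reverse_weight=0.3):
    """Combine CTC and bidirectional attention losses (illustrative sketch).

    All names here are hypothetical; only the default weights (0.3 and 0.3)
    come from the model card.
    """
    # Blend the left-to-right and right-to-left decoder losses.
    loss_att = (1.0 - reverse_weight) * loss_att_l2r + reverse_weight * loss_att_r2l
    # Interpolate between the CTC loss and the blended attention loss.
    return ctc_weight * loss_ctc + (1.0 - ctc_weight) * loss_att
```

With the defaults above, the CTC branch contributes 30% of the total and the attention branch 70%, of which 30% comes from the right-to-left decoder.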
|
|
| ## Training data |
|
|
| - **Dataset**: `google/fleurs-mn` |
| - **Train**: 3,074 utterances · ~11.5 h |
| - **Test**: 949 utterances · ~2.85 h |
| - **Audio**: 16 kHz mono |
|
|
| ## Training results |
|
|
| - Epochs run: **100** (zero-indexed, 0 to 99) |
| - Final train loss: **N/A** |
| - Final epoch: **99** (cv_loss **N/A**, acc **N/A**) |
| - Best epoch: **21** (cv_loss **374.94**, attention acc **0.2531**) |
| - Dev-set decoding (3 examples only): CER **0.8696**, WER **1.0000** |
| - TensorBoard: this repo has a **TensorBoard** tab (see `runs/`). |
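
The CER and WER values in this card's metadata are edit-distance ratios. As a rough illustration of how such numbers are typically computed, here is a minimal sketch; it assumes spaces are ignored for CER and that the reference is non-empty, and it is not necessarily the scoring tool used to produce the reported figures.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (lists or strings)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(ref, hyp):
    """Character error rate; spaces are dropped (an assumption of this sketch)."""
    ref_chars = list(ref.replace(" ", ""))
    hyp_chars = list(hyp.replace(" ", ""))
    return edit_distance(ref_chars, hyp_chars) / len(ref_chars)


def wer(ref, hyp):
    """Word error rate over whitespace-separated tokens."""
    ref_words = ref.split()
    return edit_distance(ref_words, hyp.split()) / len(ref_words)
```

Under this definition, a WER of 1.0000 means the number of word-level errors equals the number of reference words.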
|
|
| ## Files |
|
|
| | File | Description | |
| |------|-------------| |
| | `avg_10.pt` | Best model: average of the top-10 checkpoints | |
| | `train.yaml` | Training config | |
| | `lang_char.txt` | Character vocabulary (38 tokens) | |
| | `global_cmvn` | Feature normalization stats | |
| | `train.log` | Full training log | |
| | `runs/` | TensorBoard events | |
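
To inspect the character vocabulary, a small loader like the one below may help. It assumes `lang_char.txt` follows the common WeNet layout of one `token id` pair per line and that an `<unk>` entry exists; both are assumptions about the file, and the helper names are hypothetical.

```python
def load_char_vocab(path):
    """Read a `token id` per-line vocabulary file into a dict."""
    token_to_id = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.split()
            if len(parts) != 2:
                continue  # skip blank or malformed lines
            token, idx = parts
            token_to_id[token] = int(idx)
    return token_to_id


def encode(text, token_to_id, unk="<unk>"):
    """Map each non-space character to its id, falling back to the unk id.

    Skipping whitespace is a simplification of this sketch; the actual
    tokenizer may represent spaces with a dedicated token.
    """
    unk_id = token_to_id.get(unk)
    return [token_to_id.get(ch, unk_id) for ch in text if not ch.isspace()]
```

For example, `encode("сайн", load_char_vocab("lang_char.txt"))` would return one id per character, with any character missing from the file mapped to the `<unk>` id.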
|
|
| ## Usage |
|
|
| ```bash |
| # Download model files from this repo, then: |
| python wenet/bin/recognize.py \ |
| --config train.yaml \ |
| --checkpoint avg_10.pt \ |
| --dict lang_char.txt \ |
| --test_data your_data.list \ |
| --mode attention_rescoring \ |
| --beam_size 10 \ |
| --result_file result.txt |
| ``` |
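
The command above expects a data list file. In recent WeNet versions the "raw" list format is one JSON object per line with `key`, `wav`, and `txt` fields; assuming that format applies to the WeNet version used here, a minimal sketch for producing `your_data.list` could look like this. The helper name and the example paths are hypothetical.

```python
import json


def write_data_list(items, out_path):
    """Write (key, wav_path, transcript) tuples as JSON lines.

    Assumes the WeNet "raw" data list layout: {"key", "wav", "txt"} per line.
    """
    with open(out_path, "w", encoding="utf-8") as f:
        for key, wav_path, text in items:
            entry = {"key": key, "wav": wav_path, "txt": text}
            # ensure_ascii=False keeps Cyrillic text readable in the file.
            f.write(json.dumps(entry, ensure_ascii=False) + "\n")
```

A call such as `write_data_list([("utt1", "/path/to/utt1.wav", "сайн байна уу")], "your_data.list")` writes a single-entry list. Audio should be 16 kHz mono to match the training data.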
| |
| ## Limitations |
|
|
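| - Trained on only ~11.5 h of FLEURS Mongolian, which is small-scale. The metrics reported in this card (CER 0.8696 and WER 1.0000 on a 3-example dev set, attention accuracy of about 0.25) suggest transcripts are largely incorrect even in-domain, and out-of-domain speech will likely fare worse. |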
| - Only Cyrillic script supported; Latin characters and digits are stripped. |
| - No language model rescoring applied. |
|
|