---
language:
- mn
license: apache-2.0
tags:
- automatic-speech-recognition
- speech
- wenet
- conformer
- mongolian
- mn
datasets:
- google/fleurs-mn
metrics:
- cer
- wer
model-index:
- name: wenet-mn-conformer
results:
- task:
type: automatic-speech-recognition
name: Automatic Speech Recognition
dataset:
type: google/fleurs-mn
name: FLEURS Mongolian
metrics:
- type: loss
value: 374.93737238103694
name: cv_loss (best epoch)
- type: accuracy
value: 0.25305086622635525
name: attention accuracy (best epoch)
- type: cer
value: 0.8696
name: CER on 3-example dev set
- type: wer
value: 1
name: WER on 3-example dev set

---

# WeNet Conformer — Mongolian (Монгол хэл)

WeNet U2++ Conformer model trained on google/fleurs-mn for Mongolian (Cyrillic) automatic speech recognition.
## Model architecture
- Encoder: Conformer, 12 blocks × 256 dim, 4 heads
- Decoder: Bi-transformer (U2++), 3 L→R + 3 R→L blocks
- Tokenizer: char-level (38 Cyrillic tokens)
- Loss: CTC + Attention hybrid (ctc_weight=0.3, reverse_weight=0.3)
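
The hyperparameters above would correspond to a `train.yaml` along these lines — an illustrative excerpt in the style of standard WeNet conformer configs, not a copy of the shipped file:

```yaml
# Illustrative sketch; see the shipped train.yaml for authoritative values.
encoder: conformer
encoder_conf:
  output_size: 256        # 256-dim model
  attention_heads: 4
  num_blocks: 12          # 12 conformer blocks
decoder: bitransformer    # U2++ bidirectional decoder
decoder_conf:
  num_blocks: 3           # 3 left-to-right blocks
  r_num_blocks: 3         # 3 right-to-left blocks
model_conf:
  ctc_weight: 0.3
  reverse_weight: 0.3
```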
## Training data
- Dataset: google/fleurs-mn
- Train: 3,074 utterances · ~11.5 h
- Test: 949 utterances · ~2.85 h
- Audio: 16 kHz mono
## Training results
- Epochs run: 100
- Final train loss: N/A
- Final epoch: 99 — cv_loss N/A, acc N/A
- Best epoch: 21 — cv_loss N/A, acc N/A
- TensorBoard: this repo has a TensorBoard tab (see runs/).
## Files
| File | Description |
|---|---|
| `avg_10.pt` | Best model (averaged top-10 checkpoints by default) |
| `train.yaml` | Training config |
| `lang_char.txt` | Character vocabulary (38 tokens) |
| `global_cmvn` | Feature normalization stats |
| `train.log` | Full training log |
| `runs/` | TensorBoard events |
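
`avg_10.pt` comes from checkpoint averaging (WeNet ships this as `wenet/bin/average_model.py`); the core idea is an element-wise mean over the top-N checkpoints' parameters. A minimal pure-Python sketch of that idea, with plain dicts standing in for tensor state dicts:

```python
def average_checkpoints(state_dicts):
    """Element-wise mean of parameter values across checkpoints."""
    n = len(state_dicts)
    return {key: sum(sd[key] for sd in state_dicts) / n
            for key in state_dicts[0]}

# Toy example: three "checkpoints" with one scalar parameter each.
ckpts = [{"w": 1.0}, {"w": 2.0}, {"w": 3.0}]
print(average_checkpoints(ckpts))  # {'w': 2.0}
```

Averaging the best checkpoints smooths out per-epoch noise and typically decodes slightly better than any single checkpoint.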
## Usage

Download model files from this repo, then:

```bash
python wenet/bin/recognize.py \
  --config train.yaml \
  --checkpoint avg_10.pt \
  --dict lang_char.txt \
  --test_data your_data.list \
  --mode attention_rescoring \
  --beam_size 10 \
  --result_file result.txt
```
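
The CER/WER figures in the metadata are edit-distance based. To score decoded hypotheses against references yourself, a self-contained Levenshtein-based CER looks like this (a minimal sketch, not WeNet's own scorer; it assumes a non-empty reference):

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (single-row DP)."""
    dp = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev, dp[0] = dp[0], i
        for j, h in enumerate(hyp, 1):
            cur = dp[j]
            dp[j] = min(dp[j] + 1,         # deletion
                        dp[j - 1] + 1,     # insertion
                        prev + (r != h))   # substitution (0 if match)
            prev = cur
    return dp[-1]

def cer(ref, hyp):
    """Character error rate: edit operations / reference length."""
    return edit_distance(ref, hyp) / len(ref)

# Hypothesis missing the trailing " уу": 3 edits over 13 reference chars.
print(f"{cer('сайн байна уу', 'сайн байна'):.3f}")  # 0.231
```

For WER, apply the same distance to whitespace-split word lists instead of character strings.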
## Limitations
- Trained on ~11.5 h of FLEURS Mongolian — small-scale; WER/CER will be relatively high on out-of-domain speech.
- Only Cyrillic script is supported; Latin characters and digits are stripped.
- No language model rescoring applied.