---
language:
- ro
library_name: pytorch
pipeline_tag: video-text-to-text
tags:
- visual-speech-recognition
- lip-reading
- vsr
- romanian
- speech-recognition
- audio-visual
datasets:
- vsro200/vsro200
metrics:
- wer
---
# VSRo-200: Romanian Visual Speech Recognition Models
This repository hosts the encoder-decoder VSR model checkpoints introduced in the paper *VSRo-200: A Romanian Visual Speech Recognition Dataset for Studying Supervision and Multimodal Robustness*.
The models are MultiVSR backbones fine-tuned on the **VSRo-200** corpus, a 200-hour collection of Romanian podcast recordings. For training code, data preparation scripts, and inference instructions, please refer to the [GitHub repository](https://github.com/vsro200/vsro200).
## Checkpoints
All checkpoints follow the naming pattern `model_[hours]_[type].pt` (a loading sketch follows this list):
- `_annot` — trained on human-annotated transcriptions
- `_auto` — trained on automatically generated pseudo-labels
- `_shuffle` — alternative data splits used for variance analysis (100h models)
- `_males` / `_females` / `_mix` — gender-controlled 40h subsets used for bias analysis
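The snippet below is a minimal sketch for fetching and inspecting one of these checkpoints; the `repo_id` and filename are assumptions, and instantiating the MultiVSR encoder-decoder itself requires the training code from the GitHub repository linked above.

```python
import torch
from huggingface_hub import hf_hub_download

# Fetch one checkpoint from the Hub. NOTE: repo_id and filename are
# illustrative assumptions -- substitute the actual model repository
# and the checkpoint you need (naming pattern above).
ckpt_path = hf_hub_download(repo_id="vsro200/vsro200", filename="model_100_annot.pt")

# Load on CPU; weights_only=False because the checkpoint layout is not
# documented here and may contain more than raw tensors.
state = torch.load(ckpt_path, map_location="cpu", weights_only=False)

# Inspect what was saved (e.g., a state_dict or a wrapper dict).
if isinstance(state, dict):
    print(list(state.keys())[:10])
```

For full inference (video preprocessing and decoding), follow the instructions in the GitHub repository.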
## Results
All results are reported as Word Error Rate (WER, %) and Character Error Rate (CER, %) on the **Test Unseen** and **Test Seen** splits. Lower is better.
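Both metrics can be reproduced with, for example, the `jiwer` package; the sketch below is illustrative, and the paper's exact text normalization may differ.

```python
import jiwer

reference = "bună ziua tuturor"
hypothesis = "bună zi tuturor"

# Word Error Rate: word-level edit distance over the number of reference words.
print(f"WER: {jiwer.wer(reference, hypothesis) * 100:.2f}%")
# Character Error Rate: the same computation at the character level.
print(f"CER: {jiwer.cer(reference, hypothesis) * 100:.2f}%")
```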
### Human Annotated Data
| Training Hours | Test Unseen WER (%) | Test Unseen CER (%) | Test Seen WER (%) | Test Seen CER (%) |
|:---:|:---:|:---:|:---:|:---:|
| 10h | 72.50 | 41.49 | 67.01 | 37.53 |
| 25h | 64.86 | 36.62 | 59.23 | 32.96 |
| 50h | 58.87 | 33.38 | 54.03 | 29.88 |
| 75h | 54.86 | 30.97 | 51.44 | 28.61 |
| 100h | **53.29** | **29.94** | **48.16** | **26.53** |
### Whisper Pseudo Labels
| Training Hours | Test Unseen WER (%) | Test Unseen CER (%) | Test Seen WER (%) | Test Seen CER (%) |
|:---:|:---:|:---:|:---:|:---:|
| 10h | 74.61 | 42.09 | 68.41 | 38.22 |
| 25h | 66.27 | 37.05 | 60.40 | 33.36 |
| 50h | 59.28 | 33.15 | 55.39 | 30.65 |
| 75h | 56.25 | 31.18 | 51.56 | 28.33 |
| 100h | 53.63 | 30.12 | 49.61 | 27.22 |
| 125h | 51.71 | 29.04 | 48.68 | 26.58 |
| 150h | 51.25 | 28.40 | 47.05 | 25.64 |
| 175h | 49.84 | 27.66 | 46.44 | 25.30 |
| 200h | **48.75** | **27.05** | **44.54** | **24.51** |
A variance analysis across three random shuffles of the 100h subsets yields a mean Word Error Rate (WER) of 53.21% (± 0.37) for the human-annotated data and 53.82% (± 0.17) for the auto-generated data.
### Out-of-distribution robustness
* **Test Seen / Unseen (In-Domain):** Baseline performance on podcast data, evaluated with the 200h model.
* **Vlogs:** Unconstrained videos shot from varying camera angles, with dynamic lighting and frequent movement.
* **Specific domains:** Content featuring highly specialized or technical vocabulary (e.g., medical, scientific).
* **Noisy:** Videos with poor resolution, bad lighting, or heavy motion blur.
* **Archival (Black & White):** Historical footage with distinct visual artifacts, atypical framerates, and lack of color information.
* **Global OOD:** The aggregated metrics across all out-of-distribution subsets.
| Dataset / Category | # Clips | WER (%) | CER (%) | OOV Token (%) | OOV Type (%) |
|:---|:---:|:---:|:---:|:---:|:---:|
| **Test Seen** | 386 | 44.54 | 24.51 | 1.67 | 6.93 |
| **Test Unseen** | 389 | 48.75 | 27.05 | 2.30 | 8.50 |
| **OOD: Vlogs** | 99 | 58.61 | 32.85 | 1.49 | 4.26 |
| **OOD: Specific domains** | 84 | 63.01 | 28.73 | 9.78 | 17.93 |
| **OOD: Noisy** | 100 | 68.96 | 33.68 | 6.19 | 12.88 |
| **OOD: Archival** | 92 | 87.97 | 50.44 | 5.24 | 10.96 |
| **Global OOD** | 375 | 68.46 | 35.99 | 5.08 | 14.75 |
#### Metrics Note
* **Duration:** Each OOD category consists of 15 minutes of video content.
* **OOV Token (%):** The percentage of *total words* in the evaluation set that do not appear in the training data. Measures how often unknown words occur.
* **OOV Type (%):** The percentage of *unique words* in the evaluation set that do not appear in the training data. Measures the diversity of unknown words (see the sketch after this list).
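A minimal sketch of how these two rates could be computed, assuming whitespace tokenization on already-normalized transcripts (the paper's exact preprocessing is not specified here):

```python
def oov_rates(train_words: list[str], eval_words: list[str]) -> tuple[float, float]:
    """Return (OOV token %, OOV type %) of eval_words w.r.t. train_words."""
    train_vocab = set(train_words)
    # Token rate: fraction of all evaluation words unseen in training.
    oov_token = 100.0 * sum(w not in train_vocab for w in eval_words) / len(eval_words)
    # Type rate: fraction of unique evaluation words unseen in training.
    eval_types = set(eval_words)
    oov_type = 100.0 * len(eval_types - train_vocab) / len(eval_types)
    return oov_token, oov_type

# Toy usage with whitespace tokenization (an assumption):
train = "toate cuvintele din antrenare".split()
test = "cuvintele noi din evaluare".split()
print(oov_rates(train, test))  # -> (50.0, 50.0)
```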
### Gender bias analysis (40h models)
To evaluate gender bias and cross-speaker generalization, we trained 40-hour baseline models on male-only, female-only, and mixed datasets.
#### Test Unseen
| Training Set (40h) | Global WER (%) | Global CER (%) | Male WER (%) | Male CER (%) | Female WER (%) | Female CER (%) |
|:---|:---:|:---:|:---:|:---:|:---:|:---:|
| Males Only | 62.15 | 35.23 | 61.32 | 34.51 | 62.97 | 35.95 |
| Females Only | **59.33** | **33.44** | **59.17** | **32.87** | **59.49** | **34.02** |
| Mixed Data | 59.52 | 33.74 | 59.19 | 33.26 | 59.85 | 34.22 |
#### Test Seen
| Training Set (40h) | Global WER (%) | Global CER (%) | Male WER (%) | Male CER (%) | Female WER (%) | Female CER (%) |
|:---|:---:|:---:|:---:|:---:|:---:|:---:|
| Males Only | 58.82 | 33.11 | **58.58** | **32.59** | 59.06 | 33.63 |
| Females Only | 59.10 | 33.30 | 67.26 | 38.67 | **51.20** | **27.99** |
| Mixed Data | **56.29** | **31.22** | 60.56 | 33.54 | 52.15 | 28.93 |
## Citation
If you use these models, please cite:
```bibtex
@inproceedings{vsro200,
  title  = {VSRo-200: A Romanian Visual Speech Recognition Dataset for Studying Supervision and Multimodal Robustness},
  author = {...},
  year   = {...}
}
``` |