---
language:
- ro
library_name: pytorch
pipeline_tag: video-text-to-text
tags:
- visual-speech-recognition
- lip-reading
- vsr
- romanian
- speech-recognition
- audio-visual
datasets:
- vsro200/vsro200
metrics:
- wer
---

# VSRo-200: Romanian Visual Speech Recognition Models

This repository hosts the encoder-decoder VSR model checkpoints introduced in the paper *VSRo-200: A Romanian Visual Speech Recognition Dataset for Studying Supervision and Multimodal Robustness*.

The models are MultiVSR backbones fine-tuned on the **VSRo-200** corpus, a 200-hour collection of Romanian podcast recordings. For training code, data preparation scripts, and inference instructions, please refer to the [GitHub repository](https://github.com/vsro200/vsro200).

## Checkpoints

All checkpoints follow the naming pattern `model_[hours]_[type].pt`:

- `_annot` — trained on human-annotated transcriptions
- `_auto` — trained on automatically generated pseudo-labels
- `_shuffle` — alternative data splits used for variance analysis (100h models)
- `_males` / `_females` / `_mix` — gender-controlled 40h subsets used for bias analysis
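For scripting over many checkpoints, the naming pattern above can be parsed programmatically. The helper below is an illustrative sketch only; the function name and the assumption that every released file matches `model_[hours]_[type].pt` are ours, not part of the release:

```python
import re

# Checkpoint names such as "model_100_annot.pt" or "model_40_females.pt":
# training hours first, then the training-data type suffix.
CKPT_RE = re.compile(r"^model_(?P<hours>\d+)_(?P<type>[a-z_]+)\.pt$")

def parse_checkpoint_name(name: str) -> dict:
    """Return {'hours': int, 'type': str} for a checkpoint filename."""
    m = CKPT_RE.match(name)
    if m is None:
        raise ValueError(f"not a recognized checkpoint name: {name!r}")
    return {"hours": int(m.group("hours")), "type": m.group("type")}
```

For example, `parse_checkpoint_name("model_100_annot.pt")` yields `{'hours': 100, 'type': 'annot'}`.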

## Results

All results are reported as Word Error Rate (WER, %) and Character Error Rate (CER, %) on the **Test Unseen** and **Test Seen** splits. Lower is better.
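Both metrics are standard edit-distance rates. The minimal from-scratch sketch below shows how they are typically computed; toolkits such as `jiwer` provide equivalent implementations, and we make no claim that the paper's evaluation used this exact code:

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (subs, inserts, deletes)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        cur = [i]
        for j, h in enumerate(hyp, start=1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # substitution (free if equal)
        prev = cur
    return prev[-1]

def wer(ref: str, hyp: str) -> float:
    """Word Error Rate, %: word-level edit distance / number of reference words."""
    ref_words, hyp_words = ref.split(), hyp.split()
    return 100.0 * edit_distance(ref_words, hyp_words) / len(ref_words)

def cer(ref: str, hyp: str) -> float:
    """Character Error Rate, %: character-level edit distance / reference length."""
    return 100.0 * edit_distance(list(ref), list(hyp)) / len(ref)
```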

#### Human Annotated Data

| Training Hours | Test Unseen WER (%) | Test Unseen CER (%) | Test Seen WER (%) | Test Seen CER (%) |
|:---:|:---:|:---:|:---:|:---:|
| 10h | 72.50 | 41.49 | 67.01 | 37.53 |
| 25h | 64.86 | 36.62 | 59.23 | 32.96 |
| 50h | 58.87 | 33.38 | 54.03 | 29.88 |
| 75h | 54.86 | 30.97 | 51.44 | 28.61 |
| 100h | **53.29** | **29.94** | **48.16** | **26.53** |

#### Whisper Pseudo Labels

| Training Hours | Test Unseen WER (%) | Test Unseen CER (%) | Test Seen WER (%) | Test Seen CER (%) |
|:---:|:---:|:---:|:---:|:---:|
| 10h | 74.61 | 42.09 | 68.41 | 38.22 |
| 25h | 66.27 | 37.05 | 60.40 | 33.36 |
| 50h | 59.28 | 33.15 | 55.39 | 30.65 |
| 75h | 56.25 | 31.18 | 51.56 | 28.33 |
| 100h | 53.63 | 30.12 | 49.61 | 27.22 |
| 125h | 51.71 | 29.04 | 48.68 | 26.58 |
| 150h | 51.25 | 28.40 | 47.05 | 25.64 |
| 175h | 49.84 | 27.66 | 46.44 | 25.30 |
| 200h | **48.75** | **27.05** | **44.54** | **24.51** |

A variance analysis across three random shuffles of the 100h subsets yields a mean WER of 53.21% (± 0.37) for the human-annotated data and 53.82% (± 0.17) for the auto-generated data.
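For readers reproducing such summaries, the mean and spread across shuffles can be computed with the standard library. The per-shuffle scores below are placeholders (the individual shuffle results are not listed here), and we assume the ± denotes a population standard deviation:

```python
from statistics import mean, pstdev

# Hypothetical per-shuffle WERs, for illustration only; the actual
# per-shuffle scores are not reported in this card.
shuffle_wers = [52.9, 53.2, 53.5]

# "mean ± x" summary, assuming x is the population standard deviation.
summary = f"{mean(shuffle_wers):.2f}% (± {pstdev(shuffle_wers):.2f})"
```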

### Out-of-distribution robustness

* **Test Seen / Unseen (In-Domain):** Baseline performance on podcast data, evaluated with the 200h model.
* **Vlogs:** Unconstrained videos shot from varying camera angles, with dynamic lighting and movement.
* **Specific domains:** Content featuring highly specialized or technical vocabulary (e.g., medical, scientific).
* **Noisy:** Videos with poor resolution, bad lighting, or heavy motion blur.
* **Archival (Black & White):** Historical footage with distinct visual artifacts, atypical framerates, and no color information.
* **Global OOD:** Aggregated metrics across all out-of-distribution subsets.

| Dataset / Category | # Clips | WER (%) | CER (%) | OOV Token (%) | OOV Type (%) |
|:---|:---:|:---:|:---:|:---:|:---:|
| **Test Seen** | 386 | 44.54 | 24.51 | 1.67 | 6.93 |
| **Test Unseen** | 389 | 48.75 | 27.05 | 2.30 | 8.50 |
| **OOD: Vlogs** | 99 | 58.61 | 32.85 | 1.49 | 4.26 |
| **OOD: Specific domains** | 84 | 63.01 | 28.73 | 9.78 | 17.93 |
| **OOD: Noisy** | 100 | 68.96 | 33.68 | 6.19 | 12.88 |
| **OOD: Archival** | 92 | 87.97 | 50.44 | 5.24 | 10.96 |
| **Global OOD** | 375 | 68.46 | 35.99 | 5.08 | 14.75 |
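One plausible way to obtain an aggregate row such as **Global OOD** is micro-averaging: pooling edit operations and reference word counts over all subsets rather than averaging the per-subset WERs. This is our assumption about the aggregation, sketched with hypothetical totals:

```python
def pooled_wer(subsets):
    """Micro-averaged WER (%): total word edits / total reference words.

    `subsets` maps a subset name to (total_edit_operations, total_reference_words).
    """
    total_edits = sum(e for e, _ in subsets.values())
    total_words = sum(w for _, w in subsets.values())
    return 100.0 * total_edits / total_words

# Hypothetical per-subset totals, for illustration only.
ood = {"vlogs": (1200, 2100), "noisy": (1500, 2200)}
```

With weighting by reference word count, subsets with longer transcripts pull the pooled score toward their own error rate, which is why the aggregate need not equal the simple mean of the per-subset WERs.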

#### Metrics Note
* **Duration:** Each OOD category consists of 15 minutes of video content.
* **OOV Token (%):** The percentage of *total words* in the evaluation set that do not appear in the training data. Measures how often unknown words occur.
* **OOV Type (%):** The percentage of *unique words* in the evaluation set that do not appear in the training data. Measures the diversity of unknown words.
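Both OOV rates follow directly from the training vocabulary and the evaluation text; a minimal sketch:

```python
from collections import Counter

def oov_rates(train_words, eval_words):
    """Return (oov_token_pct, oov_type_pct).

    Token %: share of all evaluation words absent from the training vocabulary.
    Type %:  share of unique evaluation words absent from the training vocabulary.
    """
    train_vocab = set(train_words)
    counts = Counter(eval_words)
    oov_tokens = sum(c for w, c in counts.items() if w not in train_vocab)
    oov_types = sum(1 for w in counts if w not in train_vocab)
    token_pct = 100.0 * oov_tokens / sum(counts.values())
    type_pct = 100.0 * oov_types / len(counts)
    return token_pct, type_pct
```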

### Gender bias analysis (40h models)

To evaluate gender bias and cross-speaker generalization, we trained 40-hour baseline models on male-only, female-only, and mixed datasets.

#### Test Unseen
| Training Set (40h) | Global WER (%) | Global CER (%) | Male WER (%) | Male CER (%) | Female WER (%) | Female CER (%) |
|:---|:---:|:---:|:---:|:---:|:---:|:---:|
| Males Only | 62.15 | 35.23 | 61.32 | 34.51 | 62.97 | 35.95 |
| Females Only | **59.33** | **33.44** | **59.17** | **32.87** | **59.49** | **34.02** |
| Mixed Data | 59.52 | 33.74 | 59.19 | 33.26 | 59.85 | 34.22 |

#### Test Seen
| Training Set (40h) | Global WER (%) | Global CER (%) | Male WER (%) | Male CER (%) | Female WER (%) | Female CER (%) |
|:---|:---:|:---:|:---:|:---:|:---:|:---:|
| Males Only | 58.82 | 33.11 | **58.58** | **32.59** | 59.06 | 33.63 |
| Females Only | 59.10 | 33.30 | 67.26 | 38.67 | **51.20** | **27.99** |
| Mixed Data | **56.29** | **31.22** | 60.56 | 33.54 | 52.15 | 28.93 |

## Citation

If you use these models, please cite:

```bibtex
@inproceedings{vsro200,
  title  = {VSRo-200: A Romanian Visual Speech Recognition Dataset for Studying Supervision and Multimodal Robustness},
  author = {...},
  year   = {...}
}
```