---
language:
- ro
library_name: pytorch
pipeline_tag: video-text-to-text
tags:
- visual-speech-recognition
- lip-reading
- vsr
- romanian
- speech-recognition
- audio-visual
datasets:
- vsro200/vsro200
metrics:
- wer
---

# VSRo-200: Romanian Visual Speech Recognition Models

This repository hosts the encoder-decoder VSR model checkpoints introduced in the paper *VSRo-200: A Romanian Visual Speech Recognition Dataset for Studying Supervision and Multimodal Robustness*.

The models are MultiVSR backbones fine-tuned on the **VSRo-200** corpus, a 200-hour collection of Romanian podcast recordings. For training code, data preparation scripts, and inference instructions, please refer to the [GitHub repository](https://github.com/vsro200/vsro200).

## Checkpoints

All checkpoints follow the naming pattern `model_[hours]_[type].pt`:

- `_annot` — trained on human-annotated transcriptions
- `_auto` — trained on automatically generated pseudo-labels
- `_shuffle` — alternative data splits used for variance analysis (100h models)
- `_males` / `_females` / `_mix` — gender-controlled 40h subsets used for bias analysis
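
The naming pattern above can be parsed mechanically when iterating over checkpoints. A minimal sketch; the helper `parse_checkpoint_name` and the example filenames are illustrative, not part of the release:

```python
import re

# Matches the stated pattern model_[hours]_[type].pt,
# e.g. "model_100_annot.pt" -> (100, "annot").
_CKPT_RE = re.compile(r"^model_(\d+)_([a-z_]+)\.pt$")

def parse_checkpoint_name(filename: str) -> tuple[int, str]:
    """Return (training hours, checkpoint type) for a checkpoint filename."""
    m = _CKPT_RE.match(filename)
    if m is None:
        raise ValueError(f"unrecognized checkpoint name: {filename}")
    return int(m.group(1)), m.group(2)
```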

## Results

All results are reported in Word Error Rate (WER, %) and Character Error Rate (CER, %) on the **Test Unseen** and **Test Seen** splits. Lower is better.
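Both metrics are edit-distance rates against the reference transcription: WER at the word level, CER at the character level. A minimal pure-Python sketch of how such numbers are computed (not the paper's evaluation code, which may differ in tokenization and normalization):

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences, via dynamic programming."""
    dp = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev = dp[0]
        dp[0] = i
        for j, h in enumerate(hyp, 1):
            cur = dp[j]
            dp[j] = min(dp[j] + 1,        # deletion
                        dp[j - 1] + 1,    # insertion
                        prev + (r != h))  # substitution or match
            prev = cur
    return dp[-1]

def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: word-level edit distance / reference word count."""
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)

def cer(reference: str, hypothesis: str) -> float:
    """Character Error Rate: character-level edit distance / reference length."""
    return edit_distance(reference, hypothesis) / len(reference)
```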

### Human-Annotated Data

| Training Hours | Test Unseen WER (%) | Test Unseen CER (%) | Test Seen WER (%) | Test Seen CER (%) |
|:---:|:---:|:---:|:---:|:---:|
| 10h | 72.50 | 41.49 | 67.01 | 37.53 |
| 25h | 64.86 | 36.62 | 59.23 | 32.96 |
| 50h | 58.87 | 33.38 | 54.03 | 29.88 |
| 75h | 54.86 | 30.97 | 51.44 | 28.61 |
| 100h | **53.29** | **29.94** | **48.16** | **26.53** |

### Whisper Pseudo-Labels

| Training Hours | Test Unseen WER (%) | Test Unseen CER (%) | Test Seen WER (%) | Test Seen CER (%) |
|:---:|:---:|:---:|:---:|:---:|
| 10h | 74.61 | 42.09 | 68.41 | 38.22 |
| 25h | 66.27 | 37.05 | 60.40 | 33.36 |
| 50h | 59.28 | 33.15 | 55.39 | 30.65 |
| 75h | 56.25 | 31.18 | 51.56 | 28.33 |
| 100h | 53.63 | 30.12 | 49.61 | 27.22 |
| 125h | 51.71 | 29.04 | 48.68 | 26.58 |
| 150h | 51.25 | 28.40 | 47.05 | 25.64 |
| 175h | 49.84 | 27.66 | 46.44 | 25.30 |
| 200h | **48.75** | **27.05** | **44.54** | **24.51** |

A variance analysis across three random shuffles of the 100h subsets yields a mean WER of 53.21% (± 0.37) for the human-annotated data and 53.82% (± 0.17) for the auto-generated data.



### Out-of-distribution robustness

*   **Test Seen / Unseen (In-Domain):** In-domain baseline performance of the 200h model on podcast data.
*   **Vlogs:** Unconstrained videos with varying camera angles, dynamic lighting, and camera or speaker movement.
*   **Specific domains:** Content featuring highly specialized or technical vocabulary (e.g., medical, scientific). 
*   **Noisy:** Videos with poor resolution, bad lighting, or heavy motion blur.
*   **Archival (Black & White):** Historical footage with distinct visual artifacts, atypical framerates, and lack of color information.
*   **Global OOD:** The aggregated metrics across all out-of-distribution subsets.

| Dataset / Category | # Clips | WER (%) | CER (%) | OOV Token (%) | OOV Type (%) |
|:---|:---:|:---:|:---:|:---:|:---:|
| **Test Seen** | 386 | 44.54 | 24.51 | 1.67 | 6.93 |
| **Test Unseen** | 389 | 48.75 | 27.05 | 2.30 | 8.50 |
| **OOD: Vlogs** | 99 | 58.61 | 32.85 | 1.49 | 4.26 |
| **OOD: Specific domains** | 84 | 63.01 | 28.73 | 9.78 | 17.93 |
| **OOD: Noisy** | 100 | 68.96 | 33.68 | 6.19 | 12.88 |
| **OOD: Archival** | 92 | 87.97 | 50.44 | 5.24 | 10.96 |
| **Global OOD** | 375 | 68.46 | 35.99 | 5.08 | 14.75 |

#### Metrics Note
*   **Duration:** Each OOD category consists of 15 minutes of video content.
*   **OOV Token (%):** The percentage of *total words* in the evaluation set that do not appear in the training data. Measures how often unknown words occur.
*   **OOV Type (%):** The percentage of *unique words* in the evaluation set that do not appear in the training data. Measures the diversity of unknown words.
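
The two OOV rates above can be computed directly from the training and evaluation transcriptions. A minimal sketch assuming whitespace-tokenized, already-normalized text (the helper `oov_rates` is illustrative):

```python
def oov_rates(train_text: str, eval_text: str) -> tuple[float, float]:
    """Return (OOV token %, OOV type %) of eval_text w.r.t. train_text.

    OOV token %: share of eval word occurrences unseen in training.
    OOV type %:  share of unique eval words unseen in training.
    """
    train_vocab = set(train_text.split())
    eval_words = eval_text.split()
    oov_token = 100 * sum(w not in train_vocab for w in eval_words) / len(eval_words)
    eval_types = set(eval_words)
    oov_type = 100 * len(eval_types - train_vocab) / len(eval_types)
    return oov_token, oov_type
```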

### Gender bias analysis (40h models)


To evaluate gender bias and cross-speaker generalization, we trained 40-hour baseline models on male-only, female-only, and mixed datasets. 

#### Test Unseen
| Training Set (40h) | Global WER (%) | Global CER (%) | Male WER (%) | Male CER (%) | Female WER (%) | Female CER (%) |
|:---|:---:|:---:|:---:|:---:|:---:|:---:|
| Males Only | 62.15 | 35.23 | 61.32 | 34.51 | 62.97 | 35.95 |
| Females Only | **59.33** | **33.44** | **59.17** | **32.87** | **59.49** | **34.02** |
| Mixed Data | 59.52 | 33.74 | 59.19 | 33.26 | 59.85 | 34.22 |

#### Test Seen
| Training Set (40h) | Global WER (%) | Global CER (%) | Male WER (%) | Male CER (%) | Female WER (%) | Female CER (%) |
|:---|:---:|:---:|:---:|:---:|:---:|:---:|
| Males Only | 58.82 | 33.11 | **58.58** | **32.59** | 59.06 | 33.63 |
| Females Only | 59.10 | 33.30 | 67.26 | 38.67 | **51.20** | **27.99** |
| Mixed Data | **56.29** | **31.22** | 60.56 | 33.54 | 52.15 | 28.93 |



## Citation

If you use these models, please cite:

```bibtex
@inproceedings{vsro200,
  title  = {VSRo-200: A Romanian Visual Speech Recognition Dataset for Studying Supervision and Multimodal Robustness},
  author = {...},
  year   = {...}
}
```