---
language:
- ro
library_name: pytorch
pipeline_tag: video-text-to-text
tags:
- visual-speech-recognition
- lip-reading
- vsr
- romanian
- speech-recognition
- audio-visual
datasets:
- vsro200/vsro200
metrics:
- wer
---
# VSRo-200: Romanian Visual Speech Recognition Models

This repository hosts the encoder-decoder VSR model checkpoints introduced in the paper *VSRo-200: A Romanian Visual Speech Recognition Dataset for Studying Supervision and Multimodal Robustness*.

The models are MultiVSR backbones fine-tuned on the VSRo-200 corpus, a 200-hour collection of Romanian podcast recordings. For training code, data-preparation scripts, and inference instructions, please refer to the GitHub repository.
## Checkpoints

All checkpoints follow the naming pattern `model_[hours]_[type].pt`:

- `_annot`: trained on human-annotated transcriptions
- `_auto`: trained on automatically generated pseudo-labels
- `_shuffle`: alternative data splits used for variance analysis (100h models)
- `_males` / `_females` / `_mix`: gender-controlled 40h subsets used for bias analysis
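The naming scheme can be handled programmatically. The helper below is an illustrative sketch, not part of the released code; the checkpoints themselves are ordinary PyTorch `.pt` files, so they should be loadable with `torch.load` (see the GitHub repository for the exact loading procedure).

```python
import re

# Hypothetical helper (not part of the released code) that splits a
# checkpoint filename into the fields of the model_[hours]_[type].pt scheme.
_NAME = re.compile(r"^model_(\d+)h?_(annot|auto|shuffle|males|females|mix)\.pt$")

def parse_checkpoint_name(filename: str) -> dict:
    m = _NAME.match(filename)
    if m is None:
        raise ValueError(f"unrecognized checkpoint name: {filename!r}")
    return {"hours": int(m.group(1)), "type": m.group(2)}

print(parse_checkpoint_name("model_100_annot.pt"))  # {'hours': 100, 'type': 'annot'}
```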
## Results

All results are reported in Word Error Rate (WER, %) and Character Error Rate (CER, %) on the Test Unseen and Test Seen splits. Lower is better.
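For reference, both metrics are ratios of edit distance to reference length: WER at the word level, CER at the character level. A minimal self-contained sketch is shown below; this is not the evaluation code used in the paper (in practice, libraries such as `jiwer` are commonly used).

```python
def edit_distance(ref, hyp):
    # Levenshtein distance between two token sequences (one-row DP).
    dp = list(range(len(hyp) + 1))
    for i in range(1, len(ref) + 1):
        prev, dp[0] = dp[0], i
        for j in range(1, len(hyp) + 1):
            cur = dp[j]
            dp[j] = min(dp[j] + 1,                          # deletion
                        dp[j - 1] + 1,                      # insertion
                        prev + (ref[i - 1] != hyp[j - 1]))  # substitution
            prev = cur
    return dp[-1]

def wer(reference: str, hypothesis: str) -> float:
    # Word Error Rate: word-level edit distance over reference length, in %.
    ref = reference.split()
    return 100.0 * edit_distance(ref, hypothesis.split()) / len(ref)

def cer(reference: str, hypothesis: str) -> float:
    # Character Error Rate: the same computation at the character level.
    return 100.0 * edit_distance(list(reference), list(hypothesis)) / len(reference)
```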
### Human-Annotated Data
| Training Hours | Test Unseen WER (%) | Test Unseen CER (%) | Test Seen WER (%) | Test Seen CER (%) |
|---|---|---|---|---|
| 10h | 72.50 | 41.49 | 67.01 | 37.53 |
| 25h | 64.86 | 36.62 | 59.23 | 32.96 |
| 50h | 58.87 | 33.38 | 54.03 | 29.88 |
| 75h | 54.86 | 30.97 | 51.44 | 28.61 |
| 100h | 53.29 | 29.94 | 48.16 | 26.53 |
### Whisper Pseudo-Labels
| Training Hours | Test Unseen WER (%) | Test Unseen CER (%) | Test Seen WER (%) | Test Seen CER (%) |
|---|---|---|---|---|
| 10h | 74.61 | 42.09 | 68.41 | 38.22 |
| 25h | 66.27 | 37.05 | 60.40 | 33.36 |
| 50h | 59.28 | 33.15 | 55.39 | 30.65 |
| 75h | 56.25 | 31.18 | 51.56 | 28.33 |
| 100h | 53.63 | 30.12 | 49.61 | 27.22 |
| 125h | 51.71 | 29.04 | 48.68 | 26.58 |
| 150h | 51.25 | 28.40 | 47.05 | 25.64 |
| 175h | 49.84 | 27.66 | 46.44 | 25.30 |
| 200h | 48.75 | 27.05 | 44.54 | 24.51 |
A variance analysis across three random shuffles of the 100h subsets yields a mean Word Error Rate (WER) of 53.21% (± 0.37) for the human-annotated data and 53.82% (± 0.17) for the auto-generated data.
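Assuming the ± denotes the sample standard deviation across the three shuffles, the aggregate can be reproduced as follows. The per-shuffle WERs below are placeholders chosen to match the reported aggregate, not the paper's actual values:

```python
from statistics import mean, stdev

# Placeholder per-shuffle WERs (hypothetical); the model card reports only
# the aggregate 53.21 (+/- 0.37) for the human-annotated 100h models.
shuffle_wers = [52.84, 53.21, 53.58]

print(f"{mean(shuffle_wers):.2f} (+/- {stdev(shuffle_wers):.2f})")  # 53.21 (+/- 0.37)
```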
## Out-of-Distribution Robustness

- Test Seen / Unseen (In-Domain): baseline performance on podcast data, evaluated with our 200h model.
- Vlogs: unconstrained videos with varied camera angles, dynamic lighting, and movement.
- Specific domains: Content featuring highly specialized or technical vocabulary (e.g., medical, scientific).
- Noisy: Videos with poor resolution, bad lighting, or heavy motion blur.
- Archival (Black & White): historical footage with distinct visual artifacts, atypical frame rates, and no color information.
- Global OOD: The aggregated metrics across all out-of-distribution subsets.
| Dataset / Category | # Clips | WER (%) | CER (%) | OOV Token (%) | OOV Type (%) |
|---|---|---|---|---|---|
| Test Seen | 386 | 44.54 | 24.51 | 1.67 | 6.93 |
| Test Unseen | 389 | 48.75 | 27.05 | 2.30 | 8.50 |
| OOD: Vlogs | 99 | 58.61 | 32.85 | 1.49 | 4.26 |
| OOD: Specific domains | 84 | 63.01 | 28.73 | 9.78 | 17.93 |
| OOD: Noisy | 100 | 68.96 | 33.68 | 6.19 | 12.88 |
| OOD: Archival | 92 | 87.97 | 50.44 | 5.24 | 10.96 |
| Global OOD | 375 | 68.46 | 35.99 | 5.08 | 14.75 |
## Metrics Note

- Duration: each OOD category consists of 15 minutes of video content.
- OOV Token (%): the percentage of total word occurrences in the evaluation set that do not appear in the training data; measures how often unknown words occur.
- OOV Type (%): the percentage of unique words in the evaluation set that do not appear in the training data; measures the diversity of unknown words.
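Both OOV rates reduce to a set lookup against the training vocabulary. The function below is an illustrative sketch of the definitions above, not the paper's evaluation code:

```python
def oov_rates(train_words, eval_words):
    """Return (OOV token %, OOV type %) of eval_words w.r.t. train_words."""
    vocab = set(train_words)
    # Token rate: count every occurrence of an unseen word.
    token_pct = 100.0 * sum(w not in vocab for w in eval_words) / len(eval_words)
    # Type rate: count each unseen word only once.
    types = set(eval_words)
    type_pct = 100.0 * sum(w not in vocab for w in types) / len(types)
    return token_pct, type_pct

# "pere" is unseen: 2 of 4 tokens (50%), 1 of 3 types (~33.3%).
print(oov_rates("ana are mere".split(), "ana are pere pere".split()))
```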
## Gender Bias Analysis (40h Models)

To evaluate gender bias and cross-speaker generalization, we trained 40-hour baseline models on male-only, female-only, and mixed datasets.

### Test Unseen
| Training Set (40h) | Global WER (%) | Global CER (%) | Male WER (%) | Male CER (%) | Female WER (%) | Female CER (%) |
|---|---|---|---|---|---|---|
| Males Only | 62.15 | 35.23 | 61.32 | 34.51 | 62.97 | 35.95 |
| Females Only | 59.33 | 33.44 | 59.17 | 32.87 | 59.49 | 34.02 |
| Mixed Data | 59.52 | 33.74 | 59.19 | 33.26 | 59.85 | 34.22 |
### Test Seen
| Training Set (40h) | Global WER (%) | Global CER (%) | Male WER (%) | Male CER (%) | Female WER (%) | Female CER (%) |
|---|---|---|---|---|---|---|
| Males Only | 58.82 | 33.11 | 58.58 | 32.59 | 59.06 | 33.63 |
| Females Only | 59.10 | 33.30 | 67.26 | 38.67 | 51.20 | 27.99 |
| Mixed Data | 56.29 | 31.22 | 60.56 | 33.54 | 52.15 | 28.93 |
## Citation

If you use these models, please cite:

```bibtex
@inproceedings{vsro200,
  title  = {VSRo-200: A Romanian Visual Speech Recognition Dataset for Studying Supervision and Multimodal Robustness},
  author = {...},
  year   = {...}
}
```