---
language:
  - ro
library_name: pytorch
pipeline_tag: video-text-to-text
tags:
  - visual-speech-recognition
  - lip-reading
  - vsr
  - romanian
  - speech-recognition
  - audio-visual
datasets:
  - vsro200/vsro200
metrics:
  - wer
---

# VSRo-200: Romanian Visual Speech Recognition Models

This repository hosts the encoder-decoder VSR model checkpoints introduced in the paper *VSRo-200: A Romanian Visual Speech Recognition Dataset for Studying Supervision and Multimodal Robustness*.

The models are MultiVSR backbones fine-tuned on the VSRo-200 corpus, a 200-hour collection of Romanian podcast recordings. For training code, data preparation scripts, and inference instructions, please refer to the GitHub repository.

## Checkpoints

All checkpoints follow the naming pattern `model_[hours]_[type].pt`:

- `_annot` — trained on human-annotated transcriptions
- `_auto` — trained on automatically generated pseudo-labels
- `_shuffle` — alternative data splits used for variance analysis (100h models)
- `_males` / `_females` / `_mix` — gender-controlled 40h subsets used for bias analysis
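
Checkpoint file names following this pattern can be parsed programmatically. Below is a minimal sketch; the regular expression (including the optional shuffle index) is inferred from the list above rather than from the actual file listing, so exact names on disk may differ:

```python
import re

# Hypothetical parser for the model_[hours]_[type].pt naming pattern.
# The set of type suffixes mirrors the list above; "shuffle" may or may not
# carry a numeric index in the released files (assumption).
PATTERN = re.compile(
    r"model_(?P<hours>\d+)_(?P<type>annot|auto|shuffle\d*|males|females|mix)\.pt"
)

def parse_checkpoint_name(name):
    """Return (training_hours, checkpoint_type) for a checkpoint file name."""
    m = PATTERN.fullmatch(name)
    if m is None:
        raise ValueError(f"unexpected checkpoint name: {name}")
    return int(m.group("hours")), m.group("type")

hours, kind = parse_checkpoint_name("model_100_annot.pt")
```

Loading the weights themselves (e.g. with `torch.load`) additionally requires the model definition from the GitHub repository.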

## Results

All results are reported in Word Error Rate (WER, %) and Character Error Rate (CER, %) on the Test Unseen and Test Seen splits. Lower is better.
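
Both metrics are normalized edit distances: WER over word sequences, CER over character sequences. The following is a minimal reference implementation for illustration, not the evaluation script used to produce the numbers below:

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (single-row DP)."""
    d = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev, d[0] = d[0], i
        for j, h in enumerate(hyp, 1):
            # prev holds the old d[j-1]; d[j] is still the old value here.
            prev, d[j] = d[j], min(d[j] + 1, d[j - 1] + 1, prev + (r != h))
    return d[-1]

def wer(ref, hyp):
    """Word Error Rate in percent: word-level edits / reference word count."""
    r, h = ref.split(), hyp.split()
    return 100.0 * edit_distance(r, h) / len(r)

def cer(ref, hyp):
    """Character Error Rate in percent: char-level edits / reference length."""
    return 100.0 * edit_distance(list(ref), list(hyp)) / len(ref)
```

Production evaluations typically also normalize punctuation and casing before scoring; that step is omitted here.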

### Human-Annotated Data

| Training Hours | Test Unseen WER (%) | Test Unseen CER (%) | Test Seen WER (%) | Test Seen CER (%) |
|---|---|---|---|---|
| 10h | 72.50 | 41.49 | 67.01 | 37.53 |
| 25h | 64.86 | 36.62 | 59.23 | 32.96 |
| 50h | 58.87 | 33.38 | 54.03 | 29.88 |
| 75h | 54.86 | 30.97 | 51.44 | 28.61 |
| 100h | 53.29 | 29.94 | 48.16 | 26.53 |

### Whisper Pseudo-Labels

| Training Hours | Test Unseen WER (%) | Test Unseen CER (%) | Test Seen WER (%) | Test Seen CER (%) |
|---|---|---|---|---|
| 10h | 74.61 | 42.09 | 68.41 | 38.22 |
| 25h | 66.27 | 37.05 | 60.40 | 33.36 |
| 50h | 59.28 | 33.15 | 55.39 | 30.65 |
| 75h | 56.25 | 31.18 | 51.56 | 28.33 |
| 100h | 53.63 | 30.12 | 49.61 | 27.22 |
| 125h | 51.71 | 29.04 | 48.68 | 26.58 |
| 150h | 51.25 | 28.40 | 47.05 | 25.64 |
| 175h | 49.84 | 27.66 | 46.44 | 25.30 |
| 200h | 48.75 | 27.05 | 44.54 | 24.51 |

A variance analysis across three random shuffles of the 100h subsets yields a mean Word Error Rate (WER) of 53.21% (± 0.37) for the human-annotated data and 53.82% (± 0.17) for the auto-generated data.

## Out-of-Distribution Robustness

- Test Seen / Unseen (In-Domain): Baseline performance on podcast data, evaluated with our 200h model.
- Vlogs: Unconstrained videos shot from varied camera angles, with dynamic lighting and movement.
- Specific domains: Content featuring highly specialized or technical vocabulary (e.g., medical, scientific).
- Noisy: Videos with poor resolution, bad lighting, or heavy motion blur.
- Archival (Black & White): Historical footage with distinct visual artifacts, atypical framerates, and no color information.
- Global OOD: Aggregated metrics across all out-of-distribution subsets.

| Dataset / Category | # Clips | WER (%) | CER (%) | OOV Token (%) | OOV Type (%) |
|---|---|---|---|---|---|
| Test Seen | 386 | 44.54 | 24.51 | 1.67 | 6.93 |
| Test Unseen | 389 | 48.75 | 27.05 | 2.30 | 8.50 |
| OOD: Vlogs | 99 | 58.61 | 32.85 | 1.49 | 4.26 |
| OOD: Specific domains | 84 | 63.01 | 28.73 | 9.78 | 17.93 |
| OOD: Noisy | 100 | 68.96 | 33.68 | 6.19 | 12.88 |
| OOD: Archival | 92 | 87.97 | 50.44 | 5.24 | 10.96 |
| Global OOD | 375 | 68.46 | 35.99 | 5.08 | 14.75 |

### Metrics Note

- Duration: Each OOD category consists of 15 minutes of video content.
- OOV Token (%): The percentage of total words in the evaluation set that do not appear in the training data. Measures how often unknown words occur.
- OOV Type (%): The percentage of unique words in the evaluation set that do not appear in the training data. Measures the diversity of unknown words.
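
Both OOV rates follow directly from the training and evaluation vocabularies. A minimal sketch, assuming pre-tokenized word lists (the function name and interface are illustrative, not from the released code):

```python
from collections import Counter

def oov_rates(train_words, eval_words):
    """Return (OOV token %, OOV type %) of eval_words w.r.t. train_words."""
    train_vocab = set(train_words)
    counts = Counter(eval_words)
    oov_types = [w for w in counts if w not in train_vocab]
    # Token rate: how often unknown words occur among all evaluation tokens.
    token_pct = 100.0 * sum(counts[w] for w in oov_types) / sum(counts.values())
    # Type rate: how many distinct evaluation words are unknown.
    type_pct = 100.0 * len(oov_types) / len(counts)
    return token_pct, type_pct
```

In practice the same text normalization (casing, punctuation) must be applied to both sets before counting, or the rates are inflated.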

## Gender Bias Analysis (40h Models)

To evaluate gender bias and cross-speaker generalization, we trained 40-hour baseline models on male-only, female-only, and mixed datasets.

### Test Unseen

| Training Set (40h) | Global WER (%) | Global CER (%) | Male WER (%) | Male CER (%) | Female WER (%) | Female CER (%) |
|---|---|---|---|---|---|---|
| Males Only | 62.15 | 35.23 | 61.32 | 34.51 | 62.97 | 35.95 |
| Females Only | 59.33 | 33.44 | 59.17 | 32.87 | 59.49 | 34.02 |
| Mixed Data | 59.52 | 33.74 | 59.19 | 33.26 | 59.85 | 34.22 |

### Test Seen

| Training Set (40h) | Global WER (%) | Global CER (%) | Male WER (%) | Male CER (%) | Female WER (%) | Female CER (%) |
|---|---|---|---|---|---|---|
| Males Only | 58.82 | 33.11 | 58.58 | 32.59 | 59.06 | 33.63 |
| Females Only | 59.10 | 33.30 | 67.26 | 38.67 | 51.20 | 27.99 |
| Mixed Data | 56.29 | 31.22 | 60.56 | 33.54 | 52.15 | 28.93 |

## Citation

If you use these models, please cite:

```bibtex
@inproceedings{vsro200,
  title  = {VSRo-200: A Romanian Visual Speech Recognition Dataset for Studying Supervision and Multimodal Robustness},
  author = {...},
  year   = {...}
}
```