---
language:
- ro
library_name: pytorch
pipeline_tag: video-text-to-text
tags:
- visual-speech-recognition
- lip-reading
- vsr
- romanian
- speech-recognition
- audio-visual
datasets:
- vsro200/vsro200
metrics:
- wer
---
# VSRo-200: Romanian Visual Speech Recognition Models

This repository hosts the encoder-decoder VSR model checkpoints introduced in the paper *VSRo-200: A Romanian Visual Speech Recognition Dataset for Studying Supervision and Multimodal Robustness*.

The models are MultiVSR backbones fine-tuned on the VSRo-200 corpus, a 200-hour collection of Romanian podcast recordings. For training code, data-preparation scripts, and inference instructions, please refer to the GitHub repository.
## Checkpoints

All checkpoints follow the naming pattern `model_[hours]_[type].pt`:

- `_annot`: trained on human-annotated transcriptions
- `_auto`: trained on automatically generated pseudo-labels
- `_shuffle`: alternative data splits used for variance analysis (100h models)
- `_males` / `_females` / `_mix`: gender-controlled 40h subsets used for bias analysis
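The naming scheme can be handled programmatically. The helper below is an illustrative sketch, not part of the released code; the checkpoints themselves are ordinary PyTorch `.pt` files, so they should be loadable with `torch.load` (see the GitHub repository for the exact loading procedure).

```python
import re

# Hypothetical helper (not part of the released code) that splits a
# checkpoint filename into the fields of the model_[hours]_[type].pt scheme.
_NAME = re.compile(r"^model_(\d+)h?_(annot|auto|shuffle|males|females|mix)\.pt$")

def parse_checkpoint_name(filename: str) -> dict:
    m = _NAME.match(filename)
    if m is None:
        raise ValueError(f"unrecognized checkpoint name: {filename!r}")
    return {"hours": int(m.group(1)), "type": m.group(2)}

print(parse_checkpoint_name("model_100_annot.pt"))  # {'hours': 100, 'type': 'annot'}
```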
## Results

All results are reported in Word Error Rate (WER, %) and Character Error Rate (CER, %) on the Test Unseen and Test Seen splits. Lower is better.
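For reference, both metrics are ratios of edit distance to reference length: WER at the word level, CER at the character level. A minimal self-contained sketch is shown below; this is not the evaluation code used in the paper (in practice, libraries such as `jiwer` are commonly used).

```python
def edit_distance(ref, hyp):
    # Levenshtein distance between two token sequences (one-row DP).
    dp = list(range(len(hyp) + 1))
    for i in range(1, len(ref) + 1):
        prev, dp[0] = dp[0], i
        for j in range(1, len(hyp) + 1):
            cur = dp[j]
            dp[j] = min(dp[j] + 1,                          # deletion
                        dp[j - 1] + 1,                      # insertion
                        prev + (ref[i - 1] != hyp[j - 1]))  # substitution
            prev = cur
    return dp[-1]

def wer(reference: str, hypothesis: str) -> float:
    # Word Error Rate: word-level edit distance over reference length, in %.
    ref = reference.split()
    return 100.0 * edit_distance(ref, hypothesis.split()) / len(ref)

def cer(reference: str, hypothesis: str) -> float:
    # Character Error Rate: the same computation at the character level.
    return 100.0 * edit_distance(list(reference), list(hypothesis)) / len(reference)
```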
### Human-Annotated Data
| Training Hours | Test Unseen WER (%) | Test Unseen CER (%) | Test Seen WER (%) | Test Seen CER (%) |
|---|---|---|---|---|
| 10h | 72.50 | 41.49 | 67.01 | 37.53 |
| 25h | 64.86 | 36.62 | 59.23 | 32.96 |
| 50h | 58.87 | 33.38 | 54.03 | 29.88 |
| 75h | 54.86 | 30.97 | 51.44 | 28.61 |
| 100h | 53.29 | 29.94 | 48.16 | 26.53 |
### Whisper Pseudo-Labels
| Training Hours | Test Unseen WER (%) | Test Unseen CER (%) | Test Seen WER (%) | Test Seen CER (%) |
|---|---|---|---|---|
| 10h | 74.61 | 42.09 | 68.41 | 38.22 |
| 25h | 66.27 | 37.05 | 60.40 | 33.36 |
| 50h | 59.28 | 33.15 | 55.39 | 30.65 |
| 75h | 56.25 | 31.18 | 51.56 | 28.33 |
| 100h | 53.63 | 30.12 | 49.61 | 27.22 |
| 125h | 51.71 | 29.04 | 48.68 | 26.58 |
| 150h | 51.25 | 28.40 | 47.05 | 25.64 |
| 175h | 49.84 | 27.66 | 46.44 | 25.30 |
| 200h | 48.75 | 27.05 | 44.54 | 24.51 |
A variance analysis across three random shuffles of the 100h subsets yields a mean Word Error Rate (WER) of 53.21% (± 0.37) for the human-annotated data and 53.82% (± 0.17) for the auto-generated data.
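Assuming the ± denotes the sample standard deviation across the three shuffles, the aggregate can be reproduced as follows. The per-shuffle WERs below are placeholders chosen to match the reported aggregate, not the paper's actual values:

```python
from statistics import mean, stdev

# Placeholder per-shuffle WERs (hypothetical); the model card reports only
# the aggregate 53.21 (+/- 0.37) for the human-annotated 100h models.
shuffle_wers = [52.84, 53.21, 53.58]

print(f"{mean(shuffle_wers):.2f} (+/- {stdev(shuffle_wers):.2f})")  # 53.21 (+/- 0.37)
```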
## Out-of-Distribution Robustness

- Test Seen / Unseen (In-Domain): baseline performance on podcast data, evaluated with our 200h model.
- Vlogs: unconstrained videos with varied camera angles, dynamic lighting, and movement.
- Specific domains: Content featuring highly specialized or technical vocabulary (e.g., medical, scientific).
- Noisy: Videos with poor resolution, bad lighting, or heavy motion blur.
- Archival (Black & White): historical footage with distinct visual artifacts, atypical frame rates, and no color information.
- Global OOD: The aggregated metrics across all out-of-distribution subsets.
| Dataset / Category | # Clips | WER (%) | CER (%) | OOV Token (%) | OOV Type (%) |
|---|---|---|---|---|---|
| Test Seen | 386 | 44.54 | 24.51 | 1.67 | 6.93 |
| Test Unseen | 389 | 48.75 | 27.05 | 2.30 | 8.50 |
| OOD: Vlogs | 99 | 58.61 | 32.85 | 1.49 | 4.26 |
| OOD: Specific domains | 84 | 63.01 | 28.73 | 9.78 | 17.93 |
| OOD: Noisy | 100 | 68.96 | 33.68 | 6.19 | 12.88 |
| OOD: Archival | 92 | 87.97 | 50.44 | 5.24 | 10.96 |
| Global OOD | 375 | 68.46 | 35.99 | 5.08 | 14.75 |
## Metrics Note

- Duration: each OOD category consists of 15 minutes of video content.
- OOV Token (%): the percentage of total word occurrences in the evaluation set that do not appear in the training data; measures how often unknown words occur.
- OOV Type (%): the percentage of unique words in the evaluation set that do not appear in the training data; measures the diversity of unknown words.
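Both OOV rates reduce to a set lookup against the training vocabulary. The function below is an illustrative sketch of the definitions above, not the paper's evaluation code:

```python
def oov_rates(train_words, eval_words):
    """Return (OOV token %, OOV type %) of eval_words w.r.t. train_words."""
    vocab = set(train_words)
    # Token rate: count every occurrence of an unseen word.
    token_pct = 100.0 * sum(w not in vocab for w in eval_words) / len(eval_words)
    # Type rate: count each unseen word only once.
    types = set(eval_words)
    type_pct = 100.0 * sum(w not in vocab for w in types) / len(types)
    return token_pct, type_pct

# "pere" is unseen: 2 of 4 tokens (50%), 1 of 3 types (~33.3%).
print(oov_rates("ana are mere".split(), "ana are pere pere".split()))
```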
## Gender Bias Analysis (40h Models)

To evaluate gender bias and cross-speaker generalization, we trained 40-hour baseline models on male-only, female-only, and mixed datasets.

### Test Unseen
| Training Set (40h) | Global WER (%) | Global CER (%) | Male WER (%) | Male CER (%) | Female WER (%) | Female CER (%) |
|---|---|---|---|---|---|---|
| Males Only | 62.15 | 35.23 | 61.32 | 34.51 | 62.97 | 35.95 |
| Females Only | 59.33 | 33.44 | 59.17 | 32.87 | 59.49 | 34.02 |
| Mixed Data | 59.52 | 33.74 | 59.19 | 33.26 | 59.85 | 34.22 |
### Test Seen
| Training Set (40h) | Global WER (%) | Global CER (%) | Male WER (%) | Male CER (%) | Female WER (%) | Female CER (%) |
|---|---|---|---|---|---|---|
| Males Only | 58.82 | 33.11 | 58.58 | 32.59 | 59.06 | 33.63 |
| Females Only | 59.10 | 33.30 | 67.26 | 38.67 | 51.20 | 27.99 |
| Mixed Data | 56.29 | 31.22 | 60.56 | 33.54 | 52.15 | 28.93 |
## Citation

If you use these models, please cite:

```bibtex
@inproceedings{vsro200,
  title  = {VSRo-200: A Romanian Visual Speech Recognition Dataset for Studying Supervision and Multimodal Robustness},
  author = {...},
  year   = {...}
}
```