---
language:
- ro
library_name: pytorch
pipeline_tag: video-text-to-text
tags:
- visual-speech-recognition
- lip-reading
- vsr
- romanian
- speech-recognition
- audio-visual
datasets:
- vsro200/vsro200
metrics:
- wer
---
# VSRo-200: Romanian Visual Speech Recognition Models
This repository hosts the encoder-decoder VSR model checkpoints introduced in the paper *VSRo-200: A Romanian Visual Speech Recognition Dataset for Studying Supervision and Multimodal Robustness*.
The models are MultiVSR backbones fine-tuned on the **VSRo-200** corpus, a 200-hour collection of Romanian podcast recordings. For training code, data preparation scripts, and inference instructions, please refer to the [GitHub repository](https://github.com/vsro200/vsro200).
## Checkpoints
All checkpoints follow the naming pattern `model_[hours]_[type].pt` (see the loading sketch after this list):
- `_annot` — trained on human-annotated transcriptions
- `_auto` — trained on automatically generated pseudo-labels
- `_shuffle` — alternative data splits used for variance analysis (100h models)
- `_males` / `_females` / `_mix` — gender-controlled 40h subsets used for bias analysis
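A minimal sketch for downloading and inspecting one of these checkpoints, assuming a standard PyTorch state dict; the repo id and filename below are only illustrative, and the actual model definition and inference pipeline live in the [GitHub repository](https://github.com/vsro200/vsro200):
```python
# Illustrative only: download a checkpoint and inspect its state dict.
import torch
from huggingface_hub import hf_hub_download

ckpt_path = hf_hub_download(
    repo_id="vsro200/models-vsro200",   # placeholder repo id
    filename="model_100_annot.pt",      # placeholder name, pattern: model_[hours]_[type].pt
)

# Load on CPU; whether the file is a raw state dict or a wrapper dict
# depends on how the training code saved it.
checkpoint = torch.load(ckpt_path, map_location="cpu")
state_dict = checkpoint.get("state_dict", checkpoint) if isinstance(checkpoint, dict) else checkpoint

# Print a few parameter names and shapes to verify the download.
for name, tensor in list(state_dict.items())[:5]:
    print(name, tuple(tensor.shape))
```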
## Results
All results are reported as Word Error Rate (WER, %) and Character Error Rate (CER, %) on the **Test Unseen** and **Test Seen** splits. Lower is better.
#### Human Annotated Data
| Training Hours | Test Unseen WER (%) | Test Unseen CER (%) | Test Seen WER (%) | Test Seen CER (%) |
|:---:|:---:|:---:|:---:|:---:|
| 10h | 72.50 | 41.49 | 67.01 | 37.53 |
| 25h | 64.86 | 36.62 | 59.23 | 32.96 |
| 50h | 58.87 | 33.38 | 54.03 | 29.88 |
| 75h | 54.86 | 30.97 | 51.44 | 28.61 |
| 100h | **53.29** | **29.94** | **48.16** | **26.53** |
#### Whisper Pseudo Labels
| Training Hours | Test Unseen WER (%) | Test Unseen CER (%) | Test Seen WER (%) | Test Seen CER (%) |
|:---:|:---:|:---:|:---:|:---:|
| 10h | 74.61 | 42.09 | 68.41 | 38.22 |
| 25h | 66.27 | 37.05 | 60.40 | 33.36 |
| 50h | 59.28 | 33.15 | 55.39 | 30.65 |
| 75h | 56.25 | 31.18 | 51.56 | 28.33 |
| 100h | 53.63 | 30.12 | 49.61 | 27.22 |
| 125h | 51.71 | 29.04 | 48.68 | 26.58 |
| 150h | 51.25 | 28.40 | 47.05 | 25.64 |
| 175h | 49.84 | 27.66 | 46.44 | 25.30 |
| 200h | **48.75** | **27.05** | **44.54** | **24.51** |
A variance analysis across three random shuffles of the 100h subsets yields a mean Word Error Rate (WER) of 53.21% (± 0.37) for the human-annotated data and 53.82% (± 0.17) for the auto-generated data.
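For reference, WER and CER can be computed with a standard scoring package such as `jiwer`; this is a sketch on toy sentences only, and the paper's exact text normalization and scoring script may differ:
```python
# Illustrative only: word/character error rates with the jiwer package.
import jiwer

refs = ["salut și bine ați venit la podcast", "vorbim despre recunoașterea vorbirii"]
hyps = ["salut bine ați venit la podcast", "vorbim despre recunoașterea vorbiri"]

wer = jiwer.wer(refs, hyps) * 100   # word error rate, %
cer = jiwer.cer(refs, hyps) * 100   # character error rate, %
print(f"WER: {wer:.2f}%  CER: {cer:.2f}%")
```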
### Out-of-distribution robustness
* **Test Seen / Unseen (In-Domain):** Baseline performance on podcast data, evaluated with our 200h model.
* **Vlogs:** Unconstrained videos shot from varying camera angles, with dynamic lighting and movement.
* **Specific domains:** Content featuring highly specialized or technical vocabulary (e.g., medical, scientific).
* **Noisy:** Videos with poor resolution, bad lighting, or heavy motion blur.
* **Archival (Black & White):** Historical footage with distinct visual artifacts, atypical framerates, and lack of color information.
* **Global OOD:** The aggregated metrics across all out-of-distribution subsets.
| Dataset / Category | # Clips | WER (%) | CER (%) | OOV Token (%) | OOV Type (%) |
|:---|:---:|:---:|:---:|:---:|:---:|
| **Test Seen** | 386 | 44.54 | 24.51 | 1.67 | 6.93 |
| **Test Unseen** | 389 | 48.75 | 27.05 | 2.30 | 8.50 |
| **OOD: Vlogs** | 99 | 58.61 | 32.85 | 1.49 | 4.26 |
| **OOD: Specific domains** | 84 | 63.01 | 28.73 | 9.78 | 17.93 |
| **OOD: Noisy** | 100 | 68.96 | 33.68 | 6.19 | 12.88 |
| **OOD: Archival** | 92 | 87.97 | 50.44 | 5.24 | 10.96 |
| **Global OOD** | 375 | 68.46 | 35.99 | 5.08 | 14.75 |
#### Metrics Note
* **Duration:** Each OOD category consists of 15 minutes of video content.
* **OOV Token (%):** The percentage of *total words* in the evaluation set that do not appear in the training data. Measures how often unknown words occur.
* **OOV Type (%):** The percentage of *unique words* in the evaluation set that do not appear in the training data. Measures the diversity of unknown words (see the sketch below).
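A minimal sketch of these two OOV metrics, assuming lower-cased whitespace tokenization (the paper's exact normalization may differ):
```python
# Illustrative only: OOV token and type rates relative to a training vocabulary.
def oov_rates(train_texts, eval_texts):
    train_vocab = {w for t in train_texts for w in t.lower().split()}
    eval_tokens = [w for t in eval_texts for w in t.lower().split()]
    eval_types = set(eval_tokens)

    oov_token = sum(w not in train_vocab for w in eval_tokens) / len(eval_tokens) * 100
    oov_type = sum(w not in train_vocab for w in eval_types) / len(eval_types) * 100
    return oov_token, oov_type

tok, typ = oov_rates(
    train_texts=["salut și bine ați venit", "vorbim despre modele"],
    eval_texts=["salut vorbim despre transformatoare și modele"],
)
print(f"OOV Token: {tok:.2f}%  OOV Type: {typ:.2f}%")
```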
### Gender bias analysis (40h models)
To evaluate gender bias and cross-speaker generalization, we trained 40-hour baseline models on male-only, female-only, and mixed datasets.
#### Test Unseen
| Training Set (40h) | Global WER (%) | Global CER (%) | Male WER (%) | Male CER (%) | Female WER (%) | Female CER (%) |
|:---|:---:|:---:|:---:|:---:|:---:|:---:|
| Males Only | 62.15 | 35.23 | 61.32 | 34.51 | 62.97 | 35.95 |
| Females Only | **59.33** | **33.44** | **59.17** | **32.87** | **59.49** | **34.02** |
| Mixed Data | 59.52 | 33.74 | 59.19 | 33.26 | 59.85 | 34.22 |
#### Test Seen
| Training Set (40h) | Global WER (%) | Global CER (%) | Male WER (%) | Male CER (%) | Female WER (%) | Female CER (%) |
|:---|:---:|:---:|:---:|:---:|:---:|:---:|
| Males Only | 58.82 | 33.11 | **58.58** | **32.59** | 59.06 | 33.63 |
| Females Only | 59.10 | 33.30 | 67.26 | 38.67 | **51.20** | **27.99** |
| Mixed Data | **56.29** | **31.22** | 60.56 | 33.54 | 52.15 | 28.93 |
## Citation
If you use these models, please cite:
```bibtex
@inproceedings{vsro200,
  title  = {VSRo-200: A Romanian Visual Speech Recognition Dataset for Studying Supervision and Multimodal Robustness},
  author = {...},
  year   = {...}
}
```