---
language:
- ro
library_name: pytorch
pipeline_tag: video-text-to-text
tags:
- visual-speech-recognition
- lip-reading
- vsr
- romanian
- speech-recognition
- audio-visual
datasets:
- vsro200/vsro200
metrics:
- wer
---

# VSRo-200: Romanian Visual Speech Recognition Models

This repository hosts the encoder-decoder VSR model checkpoints introduced in the paper *VSRo-200: A Romanian Visual Speech Recognition Dataset for Studying Supervision and Multimodal Robustness*. The models are MultiVSR backbones fine-tuned on the **VSRo-200** corpus, a 200-hour collection of Romanian podcast recordings.

For training code, data preparation scripts, and inference instructions, please refer to the [GitHub repository](https://github.com/vsro200/vsro200).

## Checkpoints

All checkpoints follow the naming pattern `model_[hours]_[type].pt`:

- `_annot` — trained on human-annotated transcriptions
- `_auto` — trained on automatically generated pseudo-labels
- `_shuffle` — alternative data splits used for variance analysis (100h models)
- `_males` / `_females` / `_mix` — gender-controlled 40h subsets used for bias analysis
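After downloading, a checkpoint file can be sanity-checked with plain PyTorch. The snippet below is a minimal sketch that assumes the `.pt` files are standard `torch.save` artifacts; the actual model class, loading logic, and inference pipeline live in the GitHub repository linked above, and the `"model"` key used here is hypothetical.

```python
import torch

# Minimal sanity-check sketch. Assumption: the checkpoint is a standard
# torch.save artifact (a bare state dict, or a dict wrapping one); the
# real model definition and loading code are in the GitHub repository,
# and the "model" wrapper key below is hypothetical.
ckpt = torch.load("model_200_auto.pt", map_location="cpu")
state = ckpt.get("model", ckpt) if isinstance(ckpt, dict) else ckpt.state_dict()

print(f"{len(state)} entries")
for name, value in list(state.items())[:5]:
    # Peek at the first few parameter names and shapes.
    print(name, getattr(value, "shape", type(value)))
```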
## Results

All results are reported in Word Error Rate (WER, %) and Character Error Rate (CER, %) on the **Test Unseen** and **Test Seen** splits. Lower is better.

#### Human Annotated Data

| Training Hours | Test Unseen WER (%) | Test Unseen CER (%) | Test Seen WER (%) | Test Seen CER (%) |
|:---:|:---:|:---:|:---:|:---:|
| 10h | 72.50 | 41.49 | 67.01 | 37.53 |
| 25h | 64.86 | 36.62 | 59.23 | 32.96 |
| 50h | 58.87 | 33.38 | 54.03 | 29.88 |
| 75h | 54.86 | 30.97 | 51.44 | 28.61 |
| 100h | **53.29** | **29.94** | **48.16** | **26.53** |

#### Whisper Pseudo Labels

| Training Hours | Test Unseen WER (%) | Test Unseen CER (%) | Test Seen WER (%) | Test Seen CER (%) |
|:---:|:---:|:---:|:---:|:---:|
| 10h | 74.61 | 42.09 | 68.41 | 38.22 |
| 25h | 66.27 | 37.05 | 60.40 | 33.36 |
| 50h | 59.28 | 33.15 | 55.39 | 30.65 |
| 75h | 56.25 | 31.18 | 51.56 | 28.33 |
| 100h | 53.63 | 30.12 | 49.61 | 27.22 |
| 125h | 51.71 | 29.04 | 48.68 | 26.58 |
| 150h | 51.25 | 28.40 | 47.05 | 25.64 |
| 175h | 49.84 | 27.66 | 46.44 | 25.30 |
| 200h | **48.75** | **27.05** | **44.54** | **24.51** |

A variance analysis across three random shuffles of the 100h subsets yields a mean WER of 53.21% (± 0.37) for the human-annotated data and 53.82% (± 0.17) for the auto-generated data.

### Out-of-distribution robustness

All categories below are evaluated with the 200h model:

* **Test Seen / Unseen (In-Domain):** Baseline performance on podcast data.
* **Vlogs:** Unconstrained videos shot from varying camera angles, with dynamic lighting and movement.
* **Specific domains:** Content featuring highly specialized or technical vocabulary (e.g., medical, scientific).
* **Noisy:** Videos with poor resolution, bad lighting, or heavy motion blur.
* **Archival (Black & White):** Historical footage with distinct visual artifacts, atypical frame rates, and no color information.
* **Global OOD:** The aggregated metrics across all out-of-distribution subsets.

| Dataset / Category | # Clips | WER (%) | CER (%) | OOV Token (%) | OOV Type (%) |
|:---|:---:|:---:|:---:|:---:|:---:|
| **Test Seen** | 386 | 44.54 | 24.51 | 1.67 | 6.93 |
| **Test Unseen** | 389 | 48.75 | 27.05 | 2.30 | 8.50 |
| **OOD: Vlogs** | 99 | 58.61 | 32.85 | 1.49 | 4.26 |
| **OOD: Specific domains** | 84 | 63.01 | 28.73 | 9.78 | 17.93 |
| **OOD: Noisy** | 100 | 68.96 | 33.68 | 6.19 | 12.88 |
| **OOD: Archival** | 92 | 87.97 | 50.44 | 5.24 | 10.96 |
| **Global OOD** | 375 | 68.46 | 35.99 | 5.08 | 14.75 |

#### Metrics Note

* **Duration:** Each OOD category consists of 15 minutes of video content.
* **OOV Token (%):** The percentage of *total words* in the evaluation set that do not appear in the training data; measures how often unknown words occur.
* **OOV Type (%):** The percentage of *unique words* in the evaluation set that do not appear in the training data; measures the diversity of unknown words.

### Gender bias analysis (40h models)

To evaluate gender bias and cross-speaker generalization, we trained 40-hour baseline models on male-only, female-only, and mixed datasets.

#### Test Unseen

| Training Set (40h) | Global WER (%) | Global CER (%) | Male WER (%) | Male CER (%) | Female WER (%) | Female CER (%) |
|:---|:---:|:---:|:---:|:---:|:---:|:---:|
| Males Only | 62.15 | 35.23 | 61.32 | 34.51 | 62.97 | 35.95 |
| Females Only | **59.33** | **33.44** | **59.17** | **32.87** | **59.49** | **34.02** |
| Mixed Data | 59.52 | 33.74 | 59.19 | 33.26 | 59.85 | 34.22 |

#### Test Seen

| Training Set (40h) | Global WER (%) | Global CER (%) | Male WER (%) | Male CER (%) | Female WER (%) | Female CER (%) |
|:---|:---:|:---:|:---:|:---:|:---:|:---:|
| Males Only | 58.82 | 33.11 | **58.58** | **32.59** | 59.06 | 33.63 |
| Females Only | 59.10 | 33.30 | 67.26 | 38.67 | **51.20** | **27.99** |
| Mixed Data | **56.29** | **31.22** | 60.56 | 33.54 | 52.15 | 28.93 |

## Citation

If you use these models, please cite:

```bibtex
@inproceedings{vsro200,
  title  = {VSRo-200: A Romanian Visual Speech Recognition Dataset for Studying Supervision and Multimodal Robustness},
  author = {...},
  year   = {...}
}
```