---
language:
- ro
---
# Romanian Visual Speech Recognition (VSR) Models

This repository contains the model checkpoints for the paper **"VSRo-200: A Romanian Visual Speech Recognition Dataset for Studying Supervision and Multimodal Robustness"**.

These models are fine-tuned versions of MultiVSR, specifically trained for the Romanian language. We provide various checkpoints to demonstrate the impact of dataset size, annotation quality (human-annotated vs. automatically generated pseudo-labels), and gender distribution on VSR performance.

## 📊 Accompanying Dataset
The models were trained and evaluated on the **RoVSR Dataset** (Romanian Visual Speech Recognition Dataset), a 200-hour corpus of Romanian podcasts.
* **Dataset Link:** [vsro200/vsro200_dataset](https://huggingface.co/datasets/vsro200/vsro200_dataset)
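
If the dataset repository follows the standard Hugging Face Hub layout, it can be pulled with the `datasets` library. A minimal sketch, assuming the default configuration; the available splits and features are not documented on this card:

```python
from datasets import load_dataset

# Minimal sketch: pull the accompanying Romanian VSR corpus from the Hub.
# The available configurations/splits are assumptions, not confirmed here.
rovsr = load_dataset("vsro200/vsro200_dataset")
print(rovsr)  # shows the splits and features actually shipped with the dataset
```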

## 📂 Repository Structure
All model checkpoints are stored in the `checkpoints/` directory. The naming convention follows the pattern `model_[hours]_[type].pt` (see the listing sketch after this list):
* `_annot`: Models trained on human-annotated data.
* `_auto`: Models trained on automatically transcribed data (pseudo-labels).
* `_shuffle`: Alternative data splits for the 100h models to test variance.
* `_males` / `_females` / `_mix`: Models trained specifically on gender-segregated or mixed 40-hour annotated subsets to evaluate gender bias.
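
To see the naming convention in practice, the checkpoint files can be listed directly from the Hub with `huggingface_hub`. A minimal sketch; the exact filenames returned are simply whatever is uploaded under `checkpoints/`:

```python
from huggingface_hub import list_repo_files

# List every file under checkpoints/ in this model repo.
files = [f for f in list_repo_files("vsro200/VSR-Models") if f.startswith("checkpoints/")]
for path in sorted(files):
    print(path)
```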

---

## 🏆 Performance (Word Error Rate - WER)

Below are the primary results evaluated on the **Test Unseen** set. Lower WER indicates better performance.
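
For reference, WER is the word-level edit distance (substitutions + deletions + insertions) divided by the number of reference words. A minimal sketch using the `jiwer` package, which is not necessarily the scoring script used for the paper:

```python
import jiwer

# Toy example: one deleted word out of four reference words -> WER = 25%.
reference = "salut ce mai faci"
hypothesis = "salut ce faci"
print(f"WER: {jiwer.wer(reference, hypothesis):.2%}")
```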

### 1. Annotated vs. Auto Data Scaling
Comparison of models trained on human-annotated data versus those trained on automatically generated labels (pseudo-labels) across different dataset sizes.

| Training Hours | Human Annotated (`_annot`) WER (%) | Auto Generated (`_auto`) WER (%) |
|:---:|:---:|:---:|
| 10h | 72.50 | 74.61 |
| 25h | 64.86 | 66.27 |
| 50h | 58.87 | 59.28 |
| 75h | 54.86 | 56.25 |
| 100h | 53.29 | 53.63 |
| 125h | -- | 51.71 |
| 150h | -- | 51.25 |
| 175h | -- | 49.84 |
| 200h | -- | 48.75 |

### 2. Gender Bias Analysis (40h Models)
Evaluation demonstrating the impact of gender representation in the training set.

| Training Subset | Global WER (%) | WER Males (%) | WER Females (%) |
|:---|:---:|:---:|:---:|
| 40h Males | 62.15 | 61.32 | 62.97 |
| 40h Females | 59.33 | 59.17 | 59.49 |
| 40h Mix | 59.52 | 59.19 | 59.85 |

### 3. Out of Distribution (OOD) Robustness
Evaluated using the `model_200h_auto.pt` checkpoint on different video degradation and domain shift scenarios.

| OOD Category | WER (%) |
|:---|:---:|
| Vlogs | 58.61 |
| Specific domains | 63.01 |
| Noisy | 68.96 |
| Archival | 87.97 |
| **Global OOD** | **68.46** |

### 4. Stability and Variance Analysis
Because of the high computational cost, multiple-run variance testing was limited to the 100-hour models. These models were trained on 3 different random data shuffles to assess stability and the impact of human annotations versus auto-generated labels.

| Data Type (100h) | Mean WER (%) | Standard Deviation ($\sigma$) (%) |
|:---|:---:|:---:|
| Human Annotated | 53.21 | ± 0.37 |
| Auto Generated | 53.82 | ± 0.17 |
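
The mean ± σ values above summarize the three shuffle runs per data type. A small sketch of how such a summary can be computed; the per-run WER values below are hypothetical placeholders, not the actual results:

```python
import statistics

def summarize_runs(wers_percent: list[float]) -> tuple[float, float]:
    """Return (mean, sample standard deviation) of per-run WER values in percent."""
    return statistics.mean(wers_percent), statistics.stdev(wers_percent)

# Hypothetical placeholder values for three shuffles (not the actual run results).
mean_wer, sigma = summarize_runs([53.0, 53.3, 53.6])
print(f"{mean_wer:.2f} ± {sigma:.2f}")
```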

---

## 💻 Usage

To use these models, you can download them directly using the `huggingface_hub` library in Python:
```python
from huggingface_hub import hf_hub_download

# Download the 200h auto model
model_path = hf_hub_download(
    repo_id="vsro200/VSR-Models",
    filename="checkpoints/model_200h_auto.pt",
    repo_type="model"
)

print(f"Model downloaded to: {model_path}")
```
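
The checkpoint itself is a PyTorch `.pt` file. A minimal follow-up sketch for inspecting what was downloaded; whether it stores a raw state dict or a wrapper dict with extra training metadata is an assumption, not documented here, and running inference still requires the MultiVSR architecture code from the paper:

```python
import torch

# `model_path` comes from the hf_hub_download call above.
# The checkpoint layout (raw state_dict vs. wrapper dict) is an assumption.
checkpoint = torch.load(model_path, map_location="cpu")
if isinstance(checkpoint, dict):
    print(list(checkpoint.keys())[:10])  # peek at the top-level keys
else:
    print(type(checkpoint))
```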