vsro200 committed on
Commit 7b13d18 · verified · 1 Parent(s): 3b1b958

Update README.md

Files changed (1)
  1. README.md +66 -34
README.md CHANGED
@@ -1,7 +1,6 @@
 ---
 language:
 - ro
-license: cc-by-nc-4.0
 library_name: pytorch
 pipeline_tag: video-text-to-text
 tags:
@@ -21,7 +20,7 @@ metrics:
 
 This repository hosts the encoder-decoder VSR model checkpoints introduced in the paper *VSRo-200: A Romanian Visual Speech Recognition Dataset for Studying Supervision and Multimodal Robustness*.
 
-The models are MultiVSR backbones fine-tuned on the **VSRo-200** corpus, a 200-hour collection of Romanian podcast recordings. For training code, data preparation scripts, and inference instructions, please refer to the [GitHub repository](https://https://github.com/vsro200/vsro200).
+The models are MultiVSR backbones fine-tuned on the **VSRo-200** corpus, a 200-hour collection of Romanian podcast recordings. For training code, data preparation scripts, and inference instructions, please refer to the [GitHub repository](https://github.com/vsro200/vsro200).
 
 ## Checkpoints
 
@@ -34,46 +33,79 @@ All checkpoints follow the naming pattern `model_[hours]_[type].pt`:
 
 ## Results
 
-All results are reported in Word Error Rate (WER, %) on the **Test Unseen** split. Lower is better.
+All results are reported in Word Error Rate (WER, %) and Character Error Rate (CER, %) on the **Test Unseen** and **Test Seen** splits. Lower is better.
 
-### Annotated vs. auto-labeled data scaling
-
-| Hours | Annotated | Auto |
-|:---:|:---:|:---:|
-| 10 | 72.50 | 74.61 |
-| 25 | 64.86 | 66.27 |
-| 50 | 58.87 | 59.28 |
-| 75 | 54.86 | 56.25 |
-| 100 | 53.29 | 53.63 |
-| 125 | — | 51.71 |
-| 150 | — | 51.25 |
-| 175 | — | 49.84 |
-| 200 | — | 48.75 |
-
-### Variance analysis (100h models, 3 random shuffles)
-
-| Data type | Mean WER | Std. dev. |
-|:---|:---:|:---:|
-| Human annotated | 53.21 | ± 0.37 |
-| Auto generated | 53.82 | ± 0.17 |
+#### Human Annotated Data
+
+| Training Hours | Test Unseen WER (%) | Test Unseen CER (%) | Test Seen WER (%) | Test Seen CER (%) |
+|:---:|:---:|:---:|:---:|:---:|
+| 10h | 72.50 | 41.49 | 67.01 | 37.53 |
+| 25h | 64.86 | 36.62 | 59.23 | 32.96 |
+| 50h | 58.87 | 33.38 | 54.03 | 29.88 |
+| 75h | 54.86 | 30.97 | 51.44 | 28.61 |
+| 100h | **53.29** | **29.94** | **48.16** | **26.53** |
+
+#### Whisper Pseudo Labels
+
+| Training Hours | Test Unseen WER (%) | Test Unseen CER (%) | Test Seen WER (%) | Test Seen CER (%) |
+|:---:|:---:|:---:|:---:|:---:|
+| 10h | 74.61 | 42.09 | 68.41 | 38.22 |
+| 25h | 66.27 | 37.05 | 60.40 | 33.36 |
+| 50h | 59.28 | 33.15 | 55.39 | 30.65 |
+| 75h | 56.25 | 31.18 | 51.56 | 28.33 |
+| 100h | 53.63 | 30.12 | 49.61 | 27.22 |
+| 125h | 51.71 | 29.04 | 48.68 | 26.58 |
+| 150h | 51.25 | 28.40 | 47.05 | 25.64 |
+| 175h | 49.84 | 27.66 | 46.44 | 25.30 |
+| 200h | **48.75** | **27.05** | **44.54** | **24.51** |
+
+A variance analysis across three random shuffles of the 100h subsets yields a mean Word Error Rate (WER) of 53.21% (± 0.37) for the human-annotated data and 53.82% (± 0.17) for the auto-generated data.
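The sketch below shows how this kind of WER/CER scoring can be reproduced. It uses the `jiwer` package, which is an assumption (the repository's own evaluation script is not shown here), and includes the mean ± std aggregation used in the variance note above.

```python
# Minimal WER/CER scoring sketch with jiwer (an assumed scorer; the
# paper's text normalization for casing/punctuation may differ).
import statistics

import jiwer

references = ["salut și bine ați venit la podcast", "astăzi discutăm despre vreme"]
hypotheses = ["salut bine ați venit la podcast", "astăzi discutam despre vreme"]

wer = 100 * jiwer.wer(references, hypotheses)  # word error rate, %
cer = 100 * jiwer.cer(references, hypotheses)  # character error rate, %
print(f"WER: {wer:.2f}%  CER: {cer:.2f}%")

# Variance analysis as in the note above: mean ± sample std of WER
# across three shuffles (illustrative values only, chosen to match
# the reported human-annotated 53.21 ± 0.37).
shuffle_wers = [52.84, 53.21, 53.58]
print(f"{statistics.mean(shuffle_wers):.2f} ± {statistics.stdev(shuffle_wers):.2f}")
```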
+### Out-of-distribution robustness
+
+* **Test Seen / Unseen (In-Domain):** Baseline performance on podcast data. "Seen" includes speakers present in the training set, while "Unseen" evaluates zero-shot speaker generalization.
+* **Vlogs:** Unconstrained videos shot in varied, less controlled environments (different camera angles, dynamic lighting, movement).
+* **Specific domains:** Content featuring highly specialized or technical vocabulary (e.g., medical, scientific). This category heavily tests the model's robustness to Out-Of-Vocabulary (OOV) words, exhibiting the highest OOV Type rate (17.93%).
+* **Noisy:** Videos with poor resolution, bad lighting, or heavy motion blur.
+* **Archival (Black & White):** Historical footage with distinct visual artifacts, atypical framerates, and lack of color information. This represents the hardest challenge for the visual front-end.
+* **Global OOD:** The aggregated metrics across all out-of-distribution subsets, providing a single macro-score for the model's robustness in the wild.
+
+| Dataset / Category | # Clips | WER (%) | CER (%) | OOV Token (%) | OOV Type (%) |
+|:---|:---:|:---:|:---:|:---:|:---:|
+| **Test Seen** | 386 | 44.54 | 24.51 | 1.67 | 6.93 |
+| **Test Unseen** | 389 | 48.75 | 27.05 | 2.30 | 8.50 |
+| **OOD: Vlogs** | 99 | 58.61 | 32.85 | 1.49 | 4.26 |
+| **OOD: Specific domains** | 84 | 63.01 | 28.73 | 9.78 | 17.93 |
+| **OOD: Noisy** | 100 | 68.96 | 33.68 | 6.19 | 12.88 |
+| **OOD: Archival** | 92 | 87.97 | 50.44 | 5.24 | 10.96 |
+| **Global OOD** | 375 | 68.46 | 35.99 | 5.08 | 14.75 |
+
+#### Metrics Note
+* **Duration:** Each OOD category consists of 15 minutes of video content.
+* **OOV Token (%):** The percentage of *total words* in the evaluation set that do not appear in the training data. Measures how often unknown words occur.
+* **OOV Type (%):** The percentage of *unique words* in the evaluation set that do not appear in the training data. Measures the diversity of unknown words; see the sketch below.
  ### Gender bias analysis (40h models)
61
 
62
- | Training subset | Global | Males | Females |
63
- |:---|:---:|:---:|:---:|
64
- | 40h Males | 62.15 | 61.32 | 62.97 |
65
- | 40h Females | 59.33 | 59.17 | 59.49 |
66
- | 40h Mix | 59.52 | 59.19 | 59.85 |
67
 
68
- ### Out-of-distribution robustness (`model_200h_auto.pt`)
 
 
 
 
 
 
 
 
 
 
 
 
 
 
69
 
70
- | OOD category | WER |
71
- |:---|:---:|
72
- | Vlogs | 58.61 |
73
- | Specific domains | 63.01 |
74
- | Noisy | 68.96 |
75
- | Archival | 87.97 |
76
- | Global OOD | 68.46 |
77
 
 
 ## Citation
 
@@ -85,4 +117,4 @@ If you use these models, please cite:
 author = {...},
 year = {...}
 }
-```
+```
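The checkpoints listed above are plain PyTorch `.pt` files, so fetching and inspecting one takes only a few lines. A minimal sketch follows; the Hub repo id is hypothetical, and the actual model definition and inference code live in the GitHub repository linked above.

```python
# Minimal sketch: download one checkpoint and inspect its contents.
# `model_200h_auto.pt` follows the `model_[hours]_[type].pt` pattern.
import torch
from huggingface_hub import hf_hub_download

ckpt_path = hf_hub_download(
    repo_id="vsro200/vsro200",  # hypothetical Hub id; use this repo's actual id
    filename="model_200h_auto.pt",
)
state = torch.load(ckpt_path, map_location="cpu")  # keep weights on CPU
print(sorted(state)[:5] if isinstance(state, dict) else type(state))
```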