vsro200/models-vsro200

@@ -65,12 +65,12 @@ A variance analysis across three random shuffles of the 100h subsets yields a me
 ### Out-of-distribution robustness
-*   **Test Seen / Unseen (In-Domain):** Baseline performance on podcast data. "Seen" includes speakers present in the training set, while "Unseen" evaluates zero-shot speaker generalization.
-*   **Vlogs:** Unconstrained videos shot in varied, less controlled environments (different camera angles, dynamic lighting, movement).
-*   **Specific domains:** Content featuring highly specialized or technical vocabulary (e.g., medical, scientific). This category heavily tests the model's robustness to Out-Of-Vocabulary (OOV) words, exhibiting the highest OOV Type rate (17.93%).
 *   **Noisy:** Videos with poor resolution, bad lighting, or heavy motion blur.
-*   **Archival (Black & White):** Historical footage with distinct visual artifacts, atypical framerates, and lack of color information. This represents the hardest challenge for the visual front-end.
-*   **Global OOD:** The aggregated metrics across all out-of-distribution subsets, providing a single macro-score for the model's robustness in the wild.
 | Dataset / Category | # Clips | WER (%) | CER (%) | OOV Token (%) | OOV Type (%) |
 |:---|:---:|:---:|:---:|:---:|:---:|
@@ -90,7 +90,7 @@ A variance analysis across three random shuffles of the 100h subsets yields a me
 ### Gender bias analysis (40h models)
-To evaluate gender bias and cross-speaker generalization, we trained 40-hour baseline models on male-only, female-only, and mixed datasets. The results reveal that while mixed data optimizes in-domain performance, training exclusively on female speakers provides more robust visual representations, leading to the best zero-shot generalization across both genders.
 #### Test Unseen
 | Training Set (40h) | Global WER (%) | Global CER (%) | Male WER (%) | Male CER (%) | Female WER (%) | Female CER (%) |
@@ -107,6 +107,7 @@ To evaluate gender bias and cross-speaker generalization, we trained 40-hour bas
 | Mixed Data | **56.29** | **31.22** | 60.56 | 33.54 | 52.15 | 28.93 |
 ## Citation
 If you use these models, please cite:

 ### Out-of-distribution robustness
+*   **Test Seen / Unseen (In-Domain):** Baseline performance on podcast data, evaluated with our 200h model.
+*   **Vlogs:** Unconstrained videos shot with varied camera angles, dynamic lighting, and movement.
+*   **Specific domains:** Content featuring highly specialized or technical vocabulary (e.g., medical, scientific).
 *   **Noisy:** Videos with poor resolution, bad lighting, or heavy motion blur.
+*   **Archival (Black & White):** Historical footage with distinct visual artifacts, atypical framerates, and lack of color information.
+*   **Global OOD:** The aggregated metrics across all out-of-distribution subsets.
 | Dataset / Category | # Clips | WER (%) | CER (%) | OOV Token (%) | OOV Type (%) |
 |:---|:---:|:---:|:---:|:---:|:---:|
 ### Gender bias analysis (40h models)
+To evaluate gender bias and cross-speaker generalization, we trained 40-hour baseline models on male-only, female-only, and mixed datasets.
 #### Test Unseen
 | Training Set (40h) | Global WER (%) | Global CER (%) | Male WER (%) | Male CER (%) | Female WER (%) | Female CER (%) |
 | Mixed Data | **56.29** | **31.22** | 60.56 | 33.54 | 52.15 | 28.93 |
 ## Citation
 If you use these models, please cite: