vsro200 committed
Commit 1caa944 · verified · 1 Parent(s): 7b13d18

Update README.md

Files changed (1)
  1. README.md +7 -6
README.md CHANGED
@@ -65,12 +65,12 @@ A variance analysis across three random shuffles of the 100h subsets yields a me
 
  ### Out-of-distribution robustness
 
- * **Test Seen / Unseen (In-Domain):** Baseline performance on podcast data. "Seen" includes speakers present in the training set, while "Unseen" evaluates zero-shot speaker generalization.
- * **Vlogs:** Unconstrained videos shot in varied, less controlled environments (different camera angles, dynamic lighting, movement).
- * **Specific domains:** Content featuring highly specialized or technical vocabulary (e.g., medical, scientific). This category heavily tests the model's robustness to Out-Of-Vocabulary (OOV) words, exhibiting the highest OOV Type rate (17.93%).
+ * **Test Seen / Unseen (In-Domain):** Baseline performance on podcast data, tested on our 200h-model.
+ * **Vlogs:** Unconstrained videos shot in different camera angles, dynamic lighting, movement.
+ * **Specific domains:** Content featuring highly specialized or technical vocabulary (e.g., medical, scientific).
  * **Noisy:** Videos with poor resolution, bad lighting, or heavy motion blur.
- * **Archival (Black & White):** Historical footage with distinct visual artifacts, atypical framerates, and lack of color information. This represents the hardest challenge for the visual front-end.
- * **Global OOD:** The aggregated metrics across all out-of-distribution subsets, providing a single macro-score for the model's robustness in the wild.
+ * **Archival (Black & White):** Historical footage with distinct visual artifacts, atypical framerates, and lack of color information.
+ * **Global OOD:** The aggregated metrics across all out-of-distribution subsets.
 
  | Dataset / Category | # Clips | WER (%) | CER (%) | OOV Token (%) | OOV Type (%) |
  |:---|:---:|:---:|:---:|:---:|:---:|
@@ -90,7 +90,7 @@ A variance analysis across three random shuffles of the 100h subsets yields a me
  ### Gender bias analysis (40h models)
 
 
- To evaluate gender bias and cross-speaker generalization, we trained 40-hour baseline models on male-only, female-only, and mixed datasets. The results reveal that while mixed data optimizes in-domain performance, training exclusively on female speakers provides more robust visual representations, leading to the best zero-shot generalization across both genders.
+ To evaluate gender bias and cross-speaker generalization, we trained 40-hour baseline models on male-only, female-only, and mixed datasets.
 
  #### Test Unseen
  | Training Set (40h) | Global WER (%) | Global CER (%) | Male WER (%) | Male CER (%) | Female WER (%) | Female CER (%) |
@@ -107,6 +107,7 @@ To evaluate gender bias and cross-speaker generalization, we trained 40-hour bas
  | Mixed Data | **56.29** | **31.22** | 60.56 | 33.54 | 52.15 | 28.93 |
 
 
+
  ## Citation
 
  If you use these models, please cite: