Update README.md
---
base_model: alexandradiaconu/whisper-small-echo-34
tags:
- generated_from_trainer
model-index:
- name: whisper-small-vsro200
  results: []
---

# Romanian Noisy Whisper Small (AVSR Audio Component)

This repository contains the audio backbone model used for the Audio-Visual Speech Recognition (AVSR) pipeline via **Shallow Fusion**, introduced in our paper **"VSRo-200: A Romanian Visual Speech Recognition Dataset for Studying Supervision and Multimodal Robustness"**.

## Model Overview
This model is a fine-tuned version of [`alexandradiaconu/whisper-small-echo-34`](https://huggingface.co/alexandradiaconu/whisper-small-echo-34), adapted specifically to handle challenging acoustic environments in the Romanian language.

In real-world scenarios (like podcasts or vlogs), the audio track is rarely clean.
## The Shallow Fusion (AVSR) Approach
Visual Speech Recognition (VSR) alone is limited by visual ambiguities, while Audio Speech Recognition (ASR) fails in noisy environments.

We implemented a **Shallow Fusion** mechanism during the decoding phase (beam search), combining the output probabilities of this fine-tuned noisy Whisper model (Acoustic) with our [Romanian VSR models](https://huggingface.co/vsro200/models-vsro200) (Visual).
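The mechanism is easy to sketch. The toy example below is *not* the paper's implementation: the fusion weight `lam`, the 3-token vocabulary, and the probability distributions are invented for illustration. At each beam-search step, a candidate token is scored by its acoustic log-probability plus a weighted visual log-probability.

```python
import numpy as np

def shallow_fusion_step(asr_logprobs: np.ndarray, vsr_logprobs: np.ndarray,
                        lam: float = 0.5) -> np.ndarray:
    """Fused per-token score: acoustic log-prob plus weighted visual log-prob."""
    return asr_logprobs + lam * vsr_logprobs

def beam_search(next_logprobs, vocab_size: int, steps: int, beam: int = 2):
    """Minimal beam search; `next_logprobs(prefix)` returns the fused
    log-probabilities for the next token given a token prefix."""
    beams = [((), 0.0)]  # (token prefix, cumulative fused score)
    for _ in range(steps):
        candidates = []
        for prefix, score in beams:
            fused = next_logprobs(prefix)
            for tok in range(vocab_size):
                candidates.append((prefix + (tok,), score + fused[tok]))
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam]
    return beams[0][0]

# Toy distributions: the acoustic model cannot decide between tokens 0 and 1
# (e.g. two words that sound alike in noise), but the visual model strongly
# prefers token 1, so fusion breaks the tie.
asr = np.log(np.array([0.45, 0.45, 0.10]))
vsr = np.log(np.array([0.10, 0.80, 0.10]))
best = beam_search(lambda prefix: shallow_fusion_step(asr, vsr),
                   vocab_size=3, steps=2)
# the visually supported token 1 wins at every step
```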

## AVSR Performance (Word Error Rate)
The following table demonstrates the effectiveness of our Shallow Fusion pipeline across different noise types (Gaussian and Babble) and Signal-to-Noise Ratio (SNR) levels.

*Note: Results were evaluated on a random subset of 100 clips from the `test_unseen` set of the VSRo-200 dataset.*
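For readers unfamiliar with SNR-controlled corruption: "adding noise at X dB SNR" means scaling the noise signal before mixing so the signal-to-noise power ratio hits the target. The snippet below is a generic sketch, not this project's evaluation pipeline; the 440 Hz tone and the Gaussian noise are placeholders (babble noise would be a recording of overlapping speakers instead).

```python
import numpy as np

def mix_at_snr(clean: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    """Scale `noise` so that the mixture sits at `snr_db` dB SNR."""
    clean_power = np.mean(clean ** 2)
    noise_power = np.mean(noise ** 2)
    # SNR(dB) = 10*log10(P_signal / P_noise)  =>  P_noise = P_signal / 10^(SNR/10)
    target_noise_power = clean_power / (10 ** (snr_db / 10))
    return clean + np.sqrt(target_noise_power / noise_power) * noise

rng = np.random.default_rng(0)
t = np.arange(16000) / 16000                   # one second at 16 kHz
clean = np.sin(2 * np.pi * 440 * t)            # placeholder "speech": a 440 Hz tone
gaussian = rng.standard_normal(16000)          # white (Gaussian) noise
noisy = mix_at_snr(clean, gaussian, snr_db=0)  # 0 dB: equal signal and noise power
```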

### Performance under Noise Degradation

| Babble | -5 | 94.95 | 141.78 | 50.52 | 79.65 | **45.61** |
| Babble | 0 | 73.38 | 49.77 | 50.52 | 56.11 | **29.77** |
| Babble | 5 | 46.40 | 21.50 | 50.52 | 31.96 | **18.63** |
| Babble | 10 | 29.24 | **14.22** | 50.52 | 21.86 | 15.16 |
| Babble | 15 | 22.36 | **11.68** | 50.52 | 18.77 | 12.65 |

**Observation:** At extreme noise levels (e.g., Babble noise at -5 dB SNR), the standalone fine-tuned audio model degrades completely (141.78% WER). However, through Shallow Fusion with the MultiVSR component, the system forces the decoder to trust visual cues, recovering performance to a remarkable **45.61% WER**.
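Since the table reports WER values above 100%, it may help to recall how the metric is computed: word-level edit distance (substitutions + deletions + insertions) divided by the number of *reference* words. This standalone sketch (not the project's evaluation code) shows why a hypothesis full of spurious insertions, like the 141.78% entry, can exceed 100%:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level Levenshtein distance / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between the first i ref words and first j hyp words
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution or match
    return dp[-1][-1] / len(ref)

print(wer("un doi trei", "un doi trei"))   # 0.0: perfect transcript
print(wer("un doi trei", "un zece trei"))  # one substitution out of three words
print(wer("da", "da nu nu"))               # two insertions vs. one ref word: WER > 1
```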

## Training procedure

### Training hyperparameters