vsro200 committed
Commit 09a5fb7 · verified · 1 Parent(s): 0ec7156

Update README.md

Files changed (1):
  1. README.md +7 -9

README.md CHANGED
@@ -4,15 +4,13 @@ base_model: alexandradiaconu/whisper-small-echo-34
 tags:
 - generated_from_trainer
 model-index:
-- name: whisper-small-ro-noisy
+- name: whisper-small-vsro200
   results: []
-datasets:
-- iulik-pisik/ro_vsr
 ---

 # Romanian Noisy Whisper Small (AVSR Audio Component)

-This repository contains the audio backbone model used for the Audio-Visual Speech Recognition (AVSR) pipeline via **Shallow Fusion**, introduced in our paper **"[Insert Paper Title Here]"** (submitted to NeurIPS 2026).
+This repository contains the audio backbone model used for the Audio-Visual Speech Recognition (AVSR) pipeline via **Shallow Fusion**, introduced in our paper **"VSRo-200: A Romanian Visual Speech Recognition Dataset for Studying Supervision and Multimodal Robustness"**.

 ## 🎙️ Model Overview
 This model is a fine-tuned version of [`alexandradiaconu/whisper-small-echo-34`](https://huggingface.co/alexandradiaconu/whisper-small-echo-34), adapted specifically to handle challenging acoustic environments in the Romanian language.
@@ -22,12 +20,12 @@ In real-world scenarios (like podcasts or vlogs), the audio track is rarely clea
 ## 🔗 The Shallow Fusion (AVSR) Approach
 Visual Speech Recognition (VSR) alone is limited by visual ambiguities, while Audio Speech Recognition (ASR) fails in noisy environments.

-We implemented a **Shallow Fusion** mechanism during the decoding phase (beam search). By combining the output probabilities of this fine-tuned noisy Whisper model (Acoustic) with our MultiVSR model (Visual), the system dynamically relies on visual cues when the audio signal is degraded, and vice-versa.
+We implemented a **Shallow Fusion** mechanism during the decoding phase (beam search), by combining the output probabilities of this fine-tuned noisy Whisper model (Acoustic) with our [Romanian VSR models](https://huggingface.co/vsro200/models-vsro200) (Visual).

 ## 🏆 AVSR Performance (Word Error Rate)
 The following table demonstrates the effectiveness of our Shallow Fusion pipeline across different noise types (Gaussian and Babble) and Signal-to-Noise Ratio (SNR) levels.

-*Note: Results were evaluated on a random subset of 100 clips from the `test_unseen` set of the RoVSR dataset.*
+*Note: Results were evaluated on a random subset of 100 clips from the `test_unseen` set of the VSRo-200 dataset.*

 ### 🔊 Performance under Noise Degradation

@@ -41,12 +39,12 @@ The following table demonstrates the effectiveness of our Shallow Fusion pipelin
 | Babble | -5 | 94.95 | 141.78 | 50.52 | 79.65 | **45.61** |
 | Babble | 0 | 73.38 | 49.77 | 50.52 | 56.11 | **29.77** |
 | Babble | 5 | 46.40 | 21.50 | 50.52 | 31.96 | **18.63** |
-| Babble | 10 | 29.24 | 14.22 | 50.52 | 21.86 | **15.16** |
-| Babble | 15 | 22.36 | 11.68 | 50.52 | 18.77 | **12.65** |
+| Babble | 10 | 29.24 | **14.22** | 50.52 | 21.86 | 15.16 |
+| Babble | 15 | 22.36 | **11.68** | 50.52 | 18.77 | 12.65 |

 **Observation:** At extreme noise levels (e.g., Babble noise at -5 dB SNR), the standalone fine-tuned audio model degrades completely (141.78% WER). However, through Shallow Fusion with the MultiVSR component, the system forces the decoder to trust visual cues, recovering the performance down to a remarkable **45.61% WER**.

-## Training procedure
+## 📊 Training procedure

 ### Training hyperparameters
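
The README describes Shallow Fusion as combining the output probabilities of the acoustic and visual decoders at each beam-search step. A minimal sketch of that per-step score combination, assuming a simple weighted sum in log-probability space (the function name, the weight `lam`, and the toy distributions are illustrative, not taken from the repository):

```python
import numpy as np

def shallow_fusion_step(audio_log_probs, visual_log_probs, lam=0.5):
    """Fuse one beam-search step's per-token log-probabilities.

    `lam` weights the visual model; 0.0 means pure audio, 1.0 pure visual.
    """
    return (1.0 - lam) * audio_log_probs + lam * visual_log_probs

# Toy example: a 4-token vocabulary at one decoding step.
audio = np.log(np.array([0.6, 0.2, 0.1, 0.1]))   # acoustic model prefers token 0
visual = np.log(np.array([0.1, 0.7, 0.1, 0.1]))  # visual model prefers token 1
fused = shallow_fusion_step(audio, visual, lam=0.7)  # weight vision more (noisy audio)
best = int(np.argmax(fused))
```

With a high visual weight the fused scores follow the lip-reading model, which mirrors the README's observation that the decoder leans on visual cues when the audio is heavily degraded.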
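
The evaluation table mixes noise into clean audio at fixed SNR levels (from -5 to 15 dB). A common way to do this, sketched here under the assumption of a simple power-matching scheme (the helper name is illustrative, not from the repository):

```python
import numpy as np

def add_noise_at_snr(clean, noise, snr_db):
    """Scale `noise` so the mixture clean + noise has the requested SNR in dB."""
    p_clean = np.mean(clean ** 2)
    p_noise = np.mean(noise ** 2)
    # SNR(dB) = 10 * log10(p_clean / p_noise_target)  =>  solve for the target noise power.
    target_p_noise = p_clean / (10 ** (snr_db / 10))
    scale = np.sqrt(target_p_noise / p_noise)
    return clean + scale * noise

# Example: one second of synthetic 16 kHz audio mixed at 5 dB SNR.
rng = np.random.default_rng(0)
clean = rng.standard_normal(16000)
noise = rng.standard_normal(16000)
mixed = add_noise_at_snr(clean, noise, snr_db=5.0)
```

The same routine covers both conditions in the table: Gaussian noise is drawn directly, while babble noise would be a recording of overlapping speakers scaled the same way.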