Update README.md
---
base_model: alexandradiaconu/whisper-small-echo-34
tags:
- generated_from_trainer
model-index:
- name: whisper-small-vsro200
  results: []
---

# Romanian Noisy Whisper Small (AVSR Audio Component)

This repository contains the audio backbone model used for the Audio-Visual Speech Recognition (AVSR) pipeline via **Shallow Fusion**, introduced in our paper **"VSRo-200: A Romanian Visual Speech Recognition Dataset for Studying Supervision and Multimodal Robustness"**.

## Model Overview
This model is a fine-tuned version of [`alexandradiaconu/whisper-small-echo-34`](https://huggingface.co/alexandradiaconu/whisper-small-echo-34), adapted specifically to handle challenging acoustic environments in the Romanian language.

In real-world scenarios (like podcasts or vlogs), the audio track is rarely clean.
## The Shallow Fusion (AVSR) Approach
Visual Speech Recognition (VSR) alone is limited by visual ambiguities, while Audio Speech Recognition (ASR) fails in noisy environments.

We implemented a **Shallow Fusion** mechanism during the decoding phase (beam search), combining the output probabilities of this fine-tuned noisy Whisper model (Acoustic) with our [Romanian VSR models](https://huggingface.co/vsro200/models-vsro200) (Visual).
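The mechanism is easy to sketch. The toy example below is *not* the paper's implementation: the fusion weight `lam`, the 3-token vocabulary, and the probability distributions are invented for illustration. At each beam-search step, a candidate token is scored by its acoustic log-probability plus a weighted visual log-probability.

```python
import numpy as np

def shallow_fusion_step(asr_logprobs: np.ndarray, vsr_logprobs: np.ndarray,
                        lam: float = 0.5) -> np.ndarray:
    """Fused per-token score: acoustic log-prob plus weighted visual log-prob."""
    return asr_logprobs + lam * vsr_logprobs

def beam_search(next_logprobs, vocab_size: int, steps: int, beam: int = 2):
    """Minimal beam search; `next_logprobs(prefix)` returns the fused
    log-probabilities for the next token given a token prefix."""
    beams = [((), 0.0)]  # (token prefix, cumulative fused score)
    for _ in range(steps):
        candidates = []
        for prefix, score in beams:
            fused = next_logprobs(prefix)
            for tok in range(vocab_size):
                candidates.append((prefix + (tok,), score + fused[tok]))
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam]
    return beams[0][0]

# Toy distributions: the acoustic model cannot decide between tokens 0 and 1
# (e.g. two words that sound alike in noise), but the visual model strongly
# prefers token 1, so fusion breaks the tie.
asr = np.log(np.array([0.45, 0.45, 0.10]))
vsr = np.log(np.array([0.10, 0.80, 0.10]))
best = beam_search(lambda prefix: shallow_fusion_step(asr, vsr),
                   vocab_size=3, steps=2)
# the visually supported token 1 wins at every step
```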

## AVSR Performance (Word Error Rate)
The following table demonstrates the effectiveness of our Shallow Fusion pipeline across different noise types (Gaussian and Babble) and Signal-to-Noise Ratio (SNR) levels.

*Note: Results were evaluated on a random subset of 100 clips from the `test_unseen` set of the VSRo-200 dataset.*
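For readers unfamiliar with SNR-controlled corruption: "adding noise at X dB SNR" means scaling the noise signal before mixing so the signal-to-noise power ratio hits the target. The snippet below is a generic sketch, not this project's evaluation pipeline; the 440 Hz tone and the Gaussian noise are placeholders (babble noise would be a recording of overlapping speakers instead).

```python
import numpy as np

def mix_at_snr(clean: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    """Scale `noise` so that the mixture sits at `snr_db` dB SNR."""
    clean_power = np.mean(clean ** 2)
    noise_power = np.mean(noise ** 2)
    # SNR(dB) = 10*log10(P_signal / P_noise)  =>  P_noise = P_signal / 10^(SNR/10)
    target_noise_power = clean_power / (10 ** (snr_db / 10))
    return clean + np.sqrt(target_noise_power / noise_power) * noise

rng = np.random.default_rng(0)
t = np.arange(16000) / 16000                   # one second at 16 kHz
clean = np.sin(2 * np.pi * 440 * t)            # placeholder "speech": a 440 Hz tone
gaussian = rng.standard_normal(16000)          # white (Gaussian) noise
noisy = mix_at_snr(clean, gaussian, snr_db=0)  # 0 dB: equal signal and noise power
```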

### Performance under Noise Degradation

| Babble | -5 | 94.95 | 141.78 | 50.52 | 79.65 | **45.61** |
| Babble | 0 | 73.38 | 49.77 | 50.52 | 56.11 | **29.77** |
| Babble | 5 | 46.40 | 21.50 | 50.52 | 31.96 | **18.63** |
| Babble | 10 | 29.24 | **14.22** | 50.52 | 21.86 | 15.16 |
| Babble | 15 | 22.36 | **11.68** | 50.52 | 18.77 | 12.65 |

**Observation:** At extreme noise levels (e.g., Babble noise at -5 dB SNR), the standalone fine-tuned audio model degrades completely (141.78% WER). However, through Shallow Fusion with the MultiVSR component, the system forces the decoder to trust visual cues, recovering performance to a remarkable **45.61% WER**.
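Since the table reports WER values above 100%, it may help to recall how the metric is computed: word-level edit distance (substitutions + deletions + insertions) divided by the number of *reference* words. This standalone sketch (not the project's evaluation code) shows why a hypothesis full of spurious insertions, like the 141.78% entry, can exceed 100%:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level Levenshtein distance / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between the first i ref words and first j hyp words
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution or match
    return dp[-1][-1] / len(ref)

print(wer("un doi trei", "un doi trei"))   # 0.0: perfect transcript
print(wer("un doi trei", "un zece trei"))  # one substitution out of three words
print(wer("da", "da nu nu"))               # two insertions vs. one ref word: WER > 1
```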

## Training procedure

### Training hyperparameters