Commit b8b25e5 by vsro200 · verified · 1 parent: 09a5fb7

Upload README.md

Files changed (1)
  1. README.md +83 -43
README.md CHANGED
@@ -1,69 +1,109 @@
1
  ---
 
 
 
2
  library_name: transformers
 
3
  base_model: alexandradiaconu/whisper-small-echo-34
4
  tags:
5
  - generated_from_trainer
6
  model-index:
7
  - name: whisper-small-vsro200
8
  results: []
9
  ---
10
 
11
- # Romanian Noisy Whisper Small (AVSR Audio Component)
12
 
13
- This repository contains the audio backbone model used for the Audio-Visual Speech Recognition (AVSR) pipeline via **Shallow Fusion**, introduced in our paper **"VSRo-200: A Romanian Visual Speech Recognition Dataset for Studying Supervision and Multimodal Robustness"**.
14
 
15
- ## 🎙️ Model Overview
16
- This model is a fine-tuned version of [`alexandradiaconu/whisper-small-echo-34`](https://huggingface.co/alexandradiaconu/whisper-small-echo-34), adapted specifically to handle challenging acoustic environments in the Romanian language.
17
 
18
- In real-world scenarios (like podcasts or vlogs), the audio track is rarely clean. To make our audio processing robust and suitable for AVSR, we fine-tuned this Whisper model on our custom dataset with **artificially added noise** from the MUSAN library. The core purpose of this model is to serve as the acoustic counterpart to our Romanian VSR model.
19
 
20
- ## 🔗 The Shallow Fusion (AVSR) Approach
21
- Visual Speech Recognition (VSR) alone is limited by visual ambiguities, while Audio Speech Recognition (ASR) fails in noisy environments.
22
 
23
- We implemented a **Shallow Fusion** mechanism during the decoding phase (beam search), by combining the output probabilities of this fine-tuned noisy Whisper model (Acoustic) with our [Romanian VSR models](https://huggingface.co/vsro200/models-vsro200) (Visual).
24
 
25
- ## 🏆 AVSR Performance (Word Error Rate)
26
- The following table demonstrates the effectiveness of our Shallow Fusion pipeline across different noise types (Gaussian and Babble) and Signal-to-Noise Ratio (SNR) levels.
27
 
28
- *Note: Results were evaluated on a random subset of 100 clips from the `test_unseen` set of the VSRo-200 dataset.*
29
 
30
- ### 🔊 Performance under Noise Degradation
31
 
32
- | Noise Type | SNR (dB) | Whisper Zero-Shot (%) | Whisper Fine-Tuned *(This Model)* (%) | MultiVSR *(Visual Only)* (%) | Fusion (Zero-Shot + VSR) (%) | **Fusion (Fine-Tuned + VSR)** (%) |
33
- |:---|:---:|:---:|:---:|:---:|:---:|:---:|
34
- | Gaussian | -5 | 89.43 | 83.16 | 50.52 | 78.68 | **40.92** |
35
- | Gaussian | 0 | 67.32 | 42.78 | 50.52 | 46.04 | **25.98** |
36
- | Gaussian | 5 | 48.37 | 24.79 | 50.52 | 37.48 | **19.63** |
37
- | Gaussian | 10 | 32.43 | 17.34 | 50.52 | 22.93 | **15.66** |
38
- | Gaussian | 15 | 23.97 | 13.87 | 50.52 | 18.92 | **12.50** |
39
- | Babble | -5 | 94.95 | 141.78 | 50.52 | 79.65 | **45.61** |
40
- | Babble | 0 | 73.38 | 49.77 | 50.52 | 56.11 | **29.77** |
41
- | Babble | 5 | 46.40 | 21.50 | 50.52 | 31.96 | **18.63** |
42
- | Babble | 10 | 29.24 | **14.22** | 50.52 | 21.86 | 15.16 |
43
- | Babble | 15 | 22.36 | **11.68** | 50.52 | 18.77 | 12.65 |
44
 
45
- **Observation:** At extreme noise levels (e.g., Babble noise at -5 dB SNR), the standalone fine-tuned audio model degrades completely (141.78% WER). However, through Shallow Fusion with the MultiVSR component, the system forces the decoder to trust visual cues, recovering the performance down to a remarkable **45.61% WER**.
46
 
47
- ## 📊 Training procedure
48
 
49
- ### Training hyperparameters
50
 
51
- The following hyperparameters were used during training:
52
- - learning_rate: 1e-05
53
- - train_batch_size: 8
54
- - eval_batch_size: 8
55
- - seed: 42
56
- - gradient_accumulation_steps: 4
57
- - total_train_batch_size: 32
58
- - optimizer: Use OptimizerNames.ADAMW_TORCH_FUSED with betas=(0.9,0.999) and epsilon=1e-08 and optimizer_args=No additional optimizer arguments
59
- - lr_scheduler_type: linear
60
- - lr_scheduler_warmup_steps: 100
61
- - num_epochs: 3
62
- - mixed_precision_training: Native AMP
63
 
64
  ### Framework versions
65
 
66
- - Transformers 5.0.0
67
- - Pytorch 2.10.0+cu128
68
- - Datasets 4.0.0
69
- - Tokenizers 0.22.2
1
  ---
2
+ language:
3
+ - ro
4
+ license: apache-2.0
5
  library_name: transformers
6
+ pipeline_tag: automatic-speech-recognition
7
  base_model: alexandradiaconu/whisper-small-echo-34
8
  tags:
9
  - generated_from_trainer
10
+ - whisper
11
+ - speech-recognition
12
+ - romanian
13
+ - noisy-speech
14
+ - avsr
15
+ - audio-visual-speech-recognition
16
+ datasets:
17
+ - vsro200/vsro200
18
+ metrics:
19
+ - wer
20
  model-index:
21
  - name: whisper-small-vsro200
22
  results: []
23
  ---
24
 
25
+ # Noisy Whisper Small for Romanian AVSR
26
 
27
+ This is the audio backbone used in the Audio-Visual Speech Recognition (AVSR) pipeline introduced in the paper *VSRo-200: A Romanian Visual Speech Recognition Dataset for Studying Supervision and Multimodal Robustness*.
28
 
29
+ It is a fine-tuned version of [`alexandradiaconu/whisper-small-echo-34`](https://huggingface.co/alexandradiaconu/whisper-small-echo-34), adapted to handle noisy acoustic conditions in Romanian. Training data was augmented with noise samples from the MUSAN corpus.
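The card does not spell out how the noise was mixed in, so the following is only a minimal sketch of one common approach: scale a noise clip so that the mixture reaches a requested signal-to-noise ratio. The function names and the synthetic sine and Gaussian signals are illustrative assumptions, not the authors' augmentation code; in practice the noise would come from a MUSAN recording.

```python
import math
import random


def mix_at_snr(speech, noise, snr_db):
    """Return speech plus scaled noise so the mixture has the requested SNR (dB)."""
    if len(speech) != len(noise):
        raise ValueError("speech and noise must have the same length")
    speech_power = sum(s * s for s in speech) / len(speech)
    noise_power = sum(n * n for n in noise) / len(noise)
    if noise_power == 0.0:
        raise ValueError("noise is silent; cannot scale it to a target SNR")
    # SNR_dB = 10 * log10(P_speech / P_noise_scaled), solved for the noise gain.
    target_noise_power = speech_power / (10.0 ** (snr_db / 10.0))
    gain = math.sqrt(target_noise_power / noise_power)
    return [s + gain * n for s, n in zip(speech, noise)]


def measured_snr_db(speech, mixture):
    """Recover the SNR of a mixture given the clean speech it was built from."""
    residual = [m - s for s, m in zip(speech, mixture)]
    speech_power = sum(s * s for s in speech) / len(speech)
    residual_power = sum(r * r for r in residual) / len(residual)
    return 10.0 * math.log10(speech_power / residual_power)


# Synthetic stand-ins (assumptions): a 220 Hz tone as "speech", white noise as "noise".
rng = random.Random(42)
speech = [math.sin(2 * math.pi * 220 * t / 16000) for t in range(16000)]
noise = [rng.gauss(0.0, 1.0) for _ in range(16000)]
noisy = mix_at_snr(speech, noise, snr_db=-5)
print(round(measured_snr_db(speech, noisy), 2))
```

At -5 dB the noise carries roughly three times the power of the speech, which is why that setting is the hardest row in the results tables.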
 
30
 
31
+ The model is paired with our [Romanian VSR models](https://huggingface.co/vsro200/models-vsro200) through **shallow fusion** at decoding time, combining acoustic and visual probabilities during beam search.
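As a rough illustration of shallow fusion, the sketch below combines per-token log-probabilities from two models at a single decoding step with a weighted sum and then picks the candidates a beam search would expand. It assumes both models score the same vocabulary in the same order, and the `visual_weight` parameter and toy logits are hypothetical. The actual fusion implementation lives in the linked GitHub repository and may differ.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    peak = max(logits)
    log_norm = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return [x - log_norm for x in logits]


def fuse_step(audio_logits, visual_logits, visual_weight=0.5):
    """Combine per-token scores from both models for one decoding step.

    Assumes a shared vocabulary; `visual_weight` is a hypothetical knob
    that would be tuned on held-out data.
    """
    audio_logp = log_softmax(audio_logits)
    visual_logp = log_softmax(visual_logits)
    return [a + visual_weight * v for a, v in zip(audio_logp, visual_logp)]


def top_k(scores, k):
    """Indices of the k best fused scores, i.e. the tokens a beam would keep."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]


# Toy 4-token vocabulary: the audio model is unsure between tokens 0 and 1,
# while the visual model clearly prefers token 1.
audio_logits = [2.0, 1.9, 0.0, -1.0]
visual_logits = [0.0, 3.0, 0.0, 0.0]
fused = fuse_step(audio_logits, visual_logits, visual_weight=0.5)
print(top_k(fused, k=2))  # [1, 0]
```

With `visual_weight=0.0` the ranking falls back to the audio model alone, which makes the weight a convenient way to study how much each modality contributes at a given noise level.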
32
 
33
+ For the full AVSR pipeline, fusion implementation, and inference scripts, please refer to the [GitHub repository](https://github.com/vsro200/vsro200).
 
34
 
35
+ ## Results
36
 
37
+ Evaluated on a 100-clip subset of the VSRo-200 `test_unseen` split, under two noise types (Gaussian and Babble) and varying signal-to-noise ratios. All values are WER (%); lower is better.
 
38
 
39
+ ### Gaussian noise
40
 
41
+ | SNR (dB) | Whisper zero-shot | Whisper fine-tuned | MultiVSR (visual) | Fusion (zero-shot + VSR) | Fusion (fine-tuned + VSR) |
42
+ |:---:|:---:|:---:|:---:|:---:|:---:|
43
+ | -5 | 89.43 | 83.16 | 50.52 | 78.68 | 40.92 |
44
+ | 0 | 67.32 | 42.78 | 50.52 | 46.04 | 25.98 |
45
+ | 5 | 48.37 | 24.79 | 50.52 | 37.48 | 19.63 |
46
+ | 10 | 32.43 | 17.34 | 50.52 | 22.93 | 15.66 |
47
+ | 15 | 23.97 | 13.87 | 50.52 | 18.92 | 12.50 |
48
 
49
+ ### Babble noise
50
 
51
+ | SNR (dB) | Whisper zero-shot | Whisper fine-tuned | MultiVSR (visual) | Fusion (zero-shot + VSR) | Fusion (fine-tuned + VSR) |
52
+ |:---:|:---:|:---:|:---:|:---:|:---:|
53
+ | -5 | 94.95 | 141.78 | 50.52 | 79.65 | 45.61 |
54
+ | 0 | 73.38 | 49.77 | 50.52 | 56.11 | 29.77 |
55
+ | 5 | 46.40 | 21.50 | 50.52 | 31.96 | 18.63 |
56
+ | 10 | 29.24 | 14.22 | 50.52 | 21.86 | 15.16 |
57
+ | 15 | 22.36 | 11.68 | 50.52 | 18.77 | 12.65 |
58
 
59
+ At extreme noise levels (e.g., babble at -5 dB), the standalone fine-tuned audio model collapses (141.78% WER; values above 100% are possible because inserted words count as errors). Shallow fusion with the visual stream lets the decoder fall back on lip-reading cues and lowers WER to 45.61%, which shows the value of multimodal integration in adverse acoustic conditions.
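A WER above 100% can look like a reporting error, so here is a small self-contained sketch of the standard definition, WER = (substitutions + deletions + insertions) / reference words, computed with a word-level edit distance. The example sentences are made up for illustration and the authors' evaluation script may normalize text differently.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the processed reference prefix
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


# A model that keeps repeating a word under heavy noise produces insertions,
# which push WER past 100%.
print(word_error_rate("buna ziua", "buna ziua ziua ziua ziua"))  # 1.5
```

Three inserted words against a two-word reference give a WER of 150%, the same mechanism that can produce a figure such as 141.78% when a decoder emits far more words than were spoken.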
60
 
61
+ ## Training hyperparameters
62
 
63
+ | Parameter | Value |
64
+ |:---|:---|
65
+ | Learning rate | 1e-05 |
66
+ | Train batch size | 8 |
67
+ | Eval batch size | 8 |
68
+ | Gradient accumulation steps | 4 |
69
+ | Effective batch size | 32 |
70
+ | Optimizer | AdamW (torch fused), β=(0.9, 0.999), ε=1e-08 |
71
+ | LR scheduler | Linear, 100 warmup steps |
72
+ | Epochs | 3 |
73
+ | Mixed precision | Native AMP |
74
+ | Seed | 42 |
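The table can be read as a set of keyword arguments for the Transformers trainer. The sketch below is an assumption about how the values map onto `Seq2SeqTrainingArguments` field names, not the authors' training script. In particular, "Native AMP" is mapped to `fp16` here although it could equally mean `bf16`, and the single-device setup is inferred from 8 × 4 = 32. Argument names should be checked against the installed Transformers version.

```python
# Hypothetical mapping of the table onto Seq2SeqTrainingArguments keyword names.
TRAINING_KWARGS = {
    "learning_rate": 1e-5,
    "per_device_train_batch_size": 8,
    "per_device_eval_batch_size": 8,
    "gradient_accumulation_steps": 4,
    "optim": "adamw_torch_fused",
    "adam_beta1": 0.9,
    "adam_beta2": 0.999,
    "adam_epsilon": 1e-8,
    "lr_scheduler_type": "linear",
    "warmup_steps": 100,
    "num_train_epochs": 3,
    "fp16": True,  # assumption: "Native AMP" might instead correspond to bf16
    "seed": 42,
}


def effective_batch_size(kwargs, num_devices=1):
    """Samples contributing to each optimizer step across all devices."""
    return (
        kwargs["per_device_train_batch_size"]
        * kwargs["gradient_accumulation_steps"]
        * num_devices
    )


print(effective_batch_size(TRAINING_KWARGS))  # 32

# With transformers installed, the dictionary could be passed along as:
# from transformers import Seq2SeqTrainingArguments
# args = Seq2SeqTrainingArguments(output_dir="out", **TRAINING_KWARGS)
```

The reported effective batch size of 32 matches a single device; running the same settings on two devices would double it to 64 unless gradient accumulation is halved.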
75
 
76
  ### Framework versions
77
 
78
+ Transformers 5.0.0 · PyTorch 2.10.0+cu128 · Datasets 4.0.0 · Tokenizers 0.22.2
79
+
80
+ ## Citation
81
+
82
+ If you use this model, please cite:
83
+
84
+ ```bibtex
85
+ @inproceedings{vsro200,
86
+ title = {VSRo-200: A Romanian Visual Speech Recognition Dataset for Studying Supervision and Multimodal Robustness},
87
+ author = {...},
88
+ year = {2026}
89
+ }
90
+
91
+ ```
92
+
93
+ ```bibtex
94
+ @article{diaconu2026ron3ws,
95
+ title={RO-N3WS: Enhancing Generalization in Low-Resource ASR with Diverse Romanian Speech Benchmarks},
96
+ author={Diaconu, Alexandra and Vînaga, Mădălina and Alexe, Bogdan},
97
+ journal={arXiv preprint arXiv:2603.02368},
98
+ year={2026}
99
+ }
100
+ ```
101
+
102
+ ```bibtex
103
+ @article{radford2022whisper,
104
+ title = {Robust Speech Recognition via Large-Scale Weak Supervision},
105
+ author = {Radford, Alec and Kim, Jong Wook and Xu, Tao and Brockman, Greg and McLeavey, Christine and Sutskever, Ilya},
106
+ journal = {arXiv preprint arXiv:2212.04356},
107
+ year = {2022}
108
+ }
109
+ ```