---
language:
- ro
library_name: transformers
pipeline_tag: automatic-speech-recognition
base_model: alexandradiaconu/whisper-small-echo-34
tags:
- generated_from_trainer
- whisper
- speech-recognition
- romanian
- noisy-speech
- avsr
- audio-visual-speech-recognition
datasets:
- vsro200/vsro200
metrics:
- wer
model-index:
- name: whisper-small-vsro200
  results: []
---

# Noisy Whisper Small for Romanian AVSR

This is the audio backbone used in the Audio-Visual Speech Recognition (AVSR) pipeline introduced in the paper *VSRo-200: A Romanian Visual Speech Recognition Dataset for Studying Supervision and Multimodal Robustness*.

It is a fine-tuned version of [`alexandradiaconu/whisper-small-echo-34`](https://huggingface.co/alexandradiaconu/whisper-small-echo-34), adapted to handle noisy acoustic conditions in Romanian. Training data was augmented with noise samples from the MUSAN corpus.

The model is paired with our [Romanian VSR models](https://huggingface.co/vsro200/models-vsro200) through **shallow fusion** at decoding time, combining acoustic and visual probabilities during beam search.
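
As a rough illustration of shallow fusion, the sketch below combines per-token log-probabilities from the two streams for a single decoding step. It assumes both models score the same vocabulary and uses a hypothetical interpolation weight `visual_weight`; the function names are illustrative, and the actual fusion code in the GitHub repository may differ.

```python
import math


def log_softmax(logits):
    """Convert raw scores to log-probabilities in a numerically stable way."""
    peak = max(logits)
    log_norm = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return [x - log_norm for x in logits]


def fuse_step(audio_logits, visual_logits, visual_weight=0.5):
    """Score each candidate token as log p_audio + visual_weight * log p_visual.

    Assumption: both models share one vocabulary, so index i refers to the
    same token in both lists. visual_weight is a hypothetical tuning knob.
    """
    if len(audio_logits) != len(visual_logits):
        raise ValueError("both streams must score the same vocabulary")
    audio_lp = log_softmax(audio_logits)
    visual_lp = log_softmax(visual_logits)
    return [a + visual_weight * v for a, v in zip(audio_lp, visual_lp)]


def best_token(scores):
    """Index of the highest-scoring token."""
    return max(range(len(scores)), key=scores.__getitem__)


# Toy step with a 3-token vocabulary: noisy audio slightly prefers token 0,
# while the visual stream clearly prefers token 1.
audio_logits = [1.2, 1.0, 0.1]
visual_logits = [0.0, 2.0, 0.0]

audio_only = best_token(log_softmax(audio_logits))
fused = best_token(fuse_step(audio_logits, visual_logits))
print(audio_only, fused)
```

In this toy case the audio-only choice is token 0 and the fused choice is token 1. In a beam search, the fused scores would be accumulated along each hypothesis instead of being used for a single greedy pick.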

For the full AVSR pipeline, fusion implementation, and inference scripts, please refer to the [GitHub repository](https://github.com/vsro200/vsro200).

## Results

Evaluated on a 100-clip subset of the VSRo-200 `test_unseen` split, under two noise types (Gaussian and babble) at signal-to-noise ratios (SNRs) from -5 dB to 15 dB. All values are WER (%); lower is better.
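
For readers unfamiliar with SNR-controlled corruption, here is a minimal sketch of mixing noise into a clean signal at a target SNR, using the standard definition SNR (dB) = 10 log10(P_speech / P_noise). The helper names and the synthetic signals are assumptions for illustration only; the exact mixing procedure used for the evaluation is not described in this card.

```python
import math
import random


def mean_power(signal):
    """Average squared amplitude of a sequence of samples."""
    return sum(x * x for x in signal) / len(signal)


def mix_at_snr(speech, noise, snr_db):
    """Scale `noise` so the speech-to-noise power ratio equals snr_db, then add it."""
    if len(speech) != len(noise):
        raise ValueError("speech and noise must have the same length")
    noise_power = mean_power(noise)
    if noise_power == 0:
        raise ValueError("noise must not be silent")
    target_noise_power = mean_power(speech) / (10 ** (snr_db / 10))
    scale = math.sqrt(target_noise_power / noise_power)
    return [s + scale * n for s, n in zip(speech, noise)]


def measured_snr_db(speech, mixed):
    """Recover the SNR from the clean signal and the mixture."""
    residual = [m - s for m, s in zip(mixed, speech)]
    return 10 * math.log10(mean_power(speech) / mean_power(residual))


# Synthetic stand-ins: a 220 Hz tone for speech and Gaussian samples for noise.
rng = random.Random(0)
speech = [math.sin(2 * math.pi * 220 * t / 16000) for t in range(16000)]
noise = [rng.gauss(0.0, 1.0) for _ in range(16000)]

for snr_db in (-5, 0, 5, 10, 15):
    mixed = mix_at_snr(speech, noise, snr_db)
    print(snr_db, round(measured_snr_db(speech, mixed), 3))
```

Babble noise would be mixed the same way, with overlapping speech recordings in place of the Gaussian samples.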

### Gaussian noise

| SNR (dB) | Whisper zero-shot (%) | Whisper fine-tuned (%) | MultiVSR (visual) (%) | Fusion (zero-shot + VSR) (%) | Fusion (fine-tuned + VSR) (%) |
|:---:|:---:|:---:|:---:|:---:|:---:|
| -5  | 90.11 | 73.49 | 47.80 | 80.26 | **38.87** |
|  0  | 69.54 | 40.99 | 47.80 | 51.67 | **26.76** |
|  5  | 47.40 | 24.40 | 47.80 | 34.72 | **17.63** |
| 10  | 33.68 | 15.69 | 47.80 | 23.43 | **14.55** |
| 15  | 25.73 | 13.08 | 47.80 | 18.77 | **12.11** |

### Babble noise

| SNR (dB) | Whisper zero-shot (%) | Whisper fine-tuned (%) | MultiVSR (visual) (%) | Fusion (zero-shot + VSR) (%) | Fusion (fine-tuned + VSR) (%) |
|:---:|:---:|:---:|:---:|:---:|:---:|
| -5  | 93.91 | 137.05 | 47.80 | 83.48 | **38.59** |
|  0  | 73.09 |  50.23 | 47.80 | 56.36 | **34.90** |
|  5  | 45.86 |  19.92 | 47.80 | 30.85 | **18.85** |
| 10  | 28.63 |  14.19 | 47.80 | 23.04 | **13.36** |
| 15  | 22.36 | **12.11** | 47.80 | 17.31 | **12.11** |

At extreme noise levels (e.g., babble at -5 dB), the standalone fine-tuned audio model collapses to 137.05% WER; a value above 100% means the hypothesis contains more word errors (insertions included) than the reference has words. Shallow fusion with the visual stream lets the decoder rely on lip-reading cues and recovers performance to 38.59% WER, which shows the value of multimodal integration in adverse acoustic conditions.
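
A WER above 100%, such as the 137.05% figure above, is possible because insertions count as errors while the denominator is the reference length. The sketch below computes word-level WER with a plain edit distance. It is an illustrative implementation, not the evaluation script from the repository, which may normalize text differently, and the example sentences are made up.

```python
def wer(reference, hypothesis):
    """Word error rate in percent: (substitutions + deletions + insertions) / reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the processed reference prefix
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return 100.0 * prev[-1] / len(ref)


# A two-word reference against a hypothesis with three spurious repeated words.
print(wer("bună ziua", "bună bună ziua ziua ziua"))  # 150.0
```

Here three insertions against a two-word reference give 150% WER, even though every reference word was recognized.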

## Training hyperparameters

| Parameter | Value |
|:---|:---|
| Learning rate | 1e-05 |
| Train batch size | 8 |
| Eval batch size | 8 |
| Gradient accumulation steps | 4 |
| Effective batch size | 32 |
| Optimizer | AdamW (torch fused), β=(0.9, 0.999), ε=1e-08 |
| LR scheduler | Linear, 100 warmup steps |
| Epochs | 3 |
| Mixed precision | Native AMP |
| Seed | 42 |
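
The effective batch size in the table follows from 8 × 4 = 32, assuming training on a single device. The sketch below reproduces that arithmetic and the shape of a linear schedule with 100 warmup steps. The total step count of 1,000 is a hypothetical placeholder, since the card does not state the number of optimizer steps, and the function names are illustrative.

```python
def effective_batch_size(per_device_batch, grad_accum_steps, num_devices=1):
    """Number of samples contributing to one optimizer update."""
    return per_device_batch * grad_accum_steps * num_devices


def linear_warmup_decay_lr(step, total_steps, base_lr=1e-5, warmup_steps=100):
    """Learning rate that rises linearly to base_lr, then decays linearly to zero."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


TOTAL_STEPS = 1000  # hypothetical; not stated in this card

print(effective_batch_size(8, 4))
for step in (0, 50, 100, 550, 1000):
    print(step, linear_warmup_decay_lr(step, TOTAL_STEPS))
```

The learning rate peaks at 1e-05 when warmup ends at step 100 and reaches zero at the final step.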

### Framework versions

Transformers 5.0.0 · PyTorch 2.10.0+cu128 · Datasets 4.0.0 · Tokenizers 0.22.2

## Citation

If you use this model, please cite:

```bibtex
@inproceedings{vsro200,
  title  = {VSRo-200: A Romanian Visual Speech Recognition Dataset for Studying Supervision and Multimodal Robustness},
  author = {...},
  year   = {2026}
}

```

```bibtex
@article{diaconu2026ron3ws,
  title={RO-N3WS: Enhancing Generalization in Low-Resource ASR with Diverse Romanian Speech Benchmarks},
  author={Diaconu, Alexandra and Vînaga, Mădălina and Alexe, Bogdan},
  journal={arXiv preprint arXiv:2603.02368},
  year={2026}
}
```