---
language:
- ro
library_name: transformers
pipeline_tag: automatic-speech-recognition
base_model: alexandradiaconu/whisper-small-echo-34
tags:
- generated_from_trainer
- whisper
- speech-recognition
- romanian
- noisy-speech
- avsr
- audio-visual-speech-recognition
datasets:
- vsro200/vsro200
metrics:
- wer
model-index:
- name: whisper-small-vsro200
  results: []
---

# Noisy Whisper Small for Romanian AVSR

This is the audio backbone used in the Audio-Visual Speech Recognition (AVSR) pipeline introduced in the paper *VSRo-200: A Romanian Visual Speech Recognition Dataset for Studying Supervision and Multimodal Robustness*.

It is a fine-tuned version of [`alexandradiaconu/whisper-small-echo-34`](https://huggingface.co/alexandradiaconu/whisper-small-echo-34), adapted to handle noisy acoustic conditions in Romanian: the training data was augmented with noise samples from the MUSAN corpus.

The model is paired with our [Romanian VSR models](https://huggingface.co/vsro200/models-vsro200) through **shallow fusion** at decoding time, combining acoustic and visual probabilities during beam search. For the full AVSR pipeline, fusion implementation, and inference scripts, see the [GitHub repository](https://github.com/vsro200/vsro200).

## Results

Evaluated on a 100-clip subset of the VSRo-200 `test_unseen` split, under two noise types (Gaussian and babble) and varying signal-to-noise ratios. All values are WER (%); lower is better.
### Gaussian noise

| SNR (dB) | Whisper zero-shot (%) | Whisper fine-tuned (%) | MultiVSR (visual) (%) | Fusion (zero-shot + VSR) (%) | Fusion (fine-tuned + VSR) (%) |
|:---:|:---:|:---:|:---:|:---:|:---:|
| -5 | 90.11 | 73.49 | 47.80 | 80.26 | **38.87** |
| 0 | 69.54 | 40.99 | 47.80 | 51.67 | **26.76** |
| 5 | 47.40 | 24.40 | 47.80 | 34.72 | **17.63** |
| 10 | 33.68 | 15.69 | 47.80 | 23.43 | **14.55** |
| 15 | 25.73 | 13.08 | 47.80 | 18.77 | **12.11** |

### Babble noise

| SNR (dB) | Whisper zero-shot (%) | Whisper fine-tuned (%) | MultiVSR (visual) (%) | Fusion (zero-shot + VSR) (%) | Fusion (fine-tuned + VSR) (%) |
|:---:|:---:|:---:|:---:|:---:|:---:|
| -5 | 93.91 | 137.05 | 47.80 | 83.48 | **38.59** |
| 0 | 73.09 | 50.23 | 47.80 | 56.36 | **34.90** |
| 5 | 45.86 | 19.92 | 47.80 | 30.85 | **18.85** |
| 10 | 28.63 | 14.19 | 47.80 | 23.04 | **13.36** |
| 15 | 22.36 | **12.11** | 47.80 | 17.31 | **12.11** |

At extreme noise levels (e.g., babble at -5 dB), the standalone fine-tuned audio model collapses (137.05% WER). Shallow fusion with the visual stream lets the decoder fall back on lip-reading cues and recovers performance to 38.59% WER, demonstrating the value of multimodal integration in adverse acoustic conditions.
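Conceptually, shallow fusion scores each candidate token at every beam-search step with a weighted sum of the two decoders' log-probabilities. Below is a minimal NumPy sketch of that scoring step; the function name and the fusion weight `lam` are illustrative assumptions, not the repository's API (see the GitHub repository for the actual implementation and weights):

```python
import numpy as np

def shallow_fusion_step(audio_logits, visual_logits, lam=0.5):
    """Combine one decoding step's token scores from the audio (Whisper)
    and visual (VSR) decoders via shallow fusion: a weighted sum of
    log-probabilities. `lam` is a hypothetical fusion weight in [0, 1].
    """
    # Convert raw logits to log-probabilities (log-softmax).
    audio_logp = audio_logits - np.logaddexp.reduce(audio_logits)
    visual_logp = visual_logits - np.logaddexp.reduce(visual_logits)
    return (1.0 - lam) * audio_logp + lam * visual_logp

# Toy 3-token vocabulary: the audio stream prefers token 0, the visual
# stream prefers token 1; the fused scores rank the candidates.
fused = shallow_fusion_step(np.array([2.0, 1.0, 0.1]),
                            np.array([0.5, 2.5, 0.2]), lam=0.5)
next_token = int(np.argmax(fused))  # → 1 (the visual evidence wins here)
```

In a real beam search these fused scores would be added to each hypothesis's running log-probability before pruning to the beam width.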
## Training hyperparameters

| Parameter | Value |
|:---|:---|
| Learning rate | 1e-05 |
| Train batch size | 8 |
| Eval batch size | 8 |
| Gradient accumulation steps | 4 |
| Effective batch size | 32 |
| Optimizer | AdamW (torch fused), β = (0.9, 0.999), ε = 1e-08 |
| LR scheduler | Linear, 100 warmup steps |
| Epochs | 3 |
| Mixed precision | Native AMP |
| Seed | 42 |

### Framework versions

Transformers 5.0.0 · PyTorch 2.10.0+cu128 · Datasets 4.0.0 · Tokenizers 0.22.2

## Citation

If you use this model, please cite:

```bibtex
@inproceedings{vsro200,
  title  = {VSRo-200: A Romanian Visual Speech Recognition Dataset for Studying Supervision and Multimodal Robustness},
  author = {...},
  year   = {2026}
}
```

```bibtex
@article{diaconu2026ron3ws,
  title   = {RO-N3WS: Enhancing Generalization in Low-Resource ASR with Diverse Romanian Speech Benchmarks},
  author  = {Diaconu, Alexandra and Vînaga, Mădălina and Alexe, Bogdan},
  journal = {arXiv preprint arXiv:2603.02368},
  year    = {2026}
}
```
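For reference, the MUSAN-style noise augmentation used during fine-tuning amounts to scaling a noise clip so that the speech-to-noise power ratio matches a target SNR before mixing. The sketch below is a generic illustration of that idea, not the project's training code (which lives in the GitHub repository):

```python
import numpy as np

def mix_at_snr(speech, noise, snr_db):
    """Add `noise` to `speech` at the requested signal-to-noise ratio.

    A generic sketch: the noise is tiled/trimmed to the speech length,
    then scaled so that P_speech / P_noise = 10^(snr_db / 10).
    """
    reps = int(np.ceil(len(speech) / len(noise)))
    noise = np.tile(noise, reps)[: len(speech)]
    speech_power = np.mean(speech ** 2)
    noise_power = np.mean(noise ** 2)
    scale = np.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10)))
    return speech + scale * noise

rng = np.random.default_rng(0)
clean = rng.standard_normal(16000)           # 1 s of synthetic "speech" at 16 kHz
babble = rng.standard_normal(8000)           # synthetic noise clip
noisy = mix_at_snr(clean, babble, snr_db=0)  # 0 dB: equal speech and noise power
```

Sweeping `snr_db` over {-5, 0, 5, 10, 15} reproduces the evaluation conditions in the tables above.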