How to use vsro200/whisper-small-vsro200 with Transformers:

    # Use a pipeline as a high-level helper
    from transformers import pipeline

    pipe = pipeline("automatic-speech-recognition", model="vsro200/whisper-small-vsro200")

    # Load model directly
    from transformers import AutoProcessor, AutoModelForSpeechSeq2Seq

    processor = AutoProcessor.from_pretrained("vsro200/whisper-small-vsro200")
    model = AutoModelForSpeechSeq2Seq.from_pretrained("vsro200/whisper-small-vsro200")

Upload README.md
README.md
CHANGED
@@ -1,69 +1,109 @@
---
library_name: transformers
base_model: alexandradiaconu/whisper-small-echo-34
tags:
- generated_from_trainer
model-index:
- name: whisper-small-vsro200
  results: []
---

-#

-This

-This model is a fine-tuned version of [`alexandradiaconu/whisper-small-echo-34`](https://huggingface.co/alexandradiaconu/whisper-small-echo-34), adapted specifically to handle challenging acoustic environments in the Romanian language.

-Visual Speech Recognition (VSR) alone is limited by visual ambiguities, while Audio Speech Recognition (ASR) fails in noisy environments.

-The following table demonstrates the effectiveness of our Shallow Fusion pipeline across different noise types (Gaussian and Babble) and Signal-to-Noise Ratio (SNR) levels.

-|:---|:---:|:---:|:---:|:---:|:---:|:---:|
-| Gaussian | -5 | 89.43 | 83.16 | 50.52 | 78.68 | **40.92** |
-| Gaussian | 0 | 67.32 | 42.78 | 50.52 | 46.04 | **25.98** |
-| Gaussian | 5 | 48.37 | 24.79 | 50.52 | 37.48 | **19.63** |
-| Gaussian | 10 | 32.43 | 17.34 | 50.52 | 22.93 | **15.66** |
-| Gaussian | 15 | 23.97 | 13.87 | 50.52 | 18.92 | **12.50** |
-| Babble | -5 | 94.95 | 141.78 | 50.52 | 79.65 | **45.61** |
-| Babble | 0 | 73.38 | 49.77 | 50.52 | 56.11 | **29.77** |
-| Babble | 5 | 46.40 | 21.50 | 50.52 | 31.96 | **18.63** |
-| Babble | 10 | 29.24 | **14.22** | 50.52 | 21.86 | 15.16 |
-| Babble | 15 | 22.36 | **11.68** | 50.52 | 18.77 | 12.65 |

-##

### Framework versions

---
language:
- ro
license: apache-2.0
library_name: transformers
pipeline_tag: automatic-speech-recognition
base_model: alexandradiaconu/whisper-small-echo-34
tags:
- generated_from_trainer
- whisper
- speech-recognition
- romanian
- noisy-speech
- avsr
- audio-visual-speech-recognition
datasets:
- vsro200/vsro200
metrics:
- wer
model-index:
- name: whisper-small-vsro200
  results: []
---

# Noisy Whisper Small for Romanian AVSR

This is the audio backbone used in the Audio-Visual Speech Recognition (AVSR) pipeline introduced in the paper *VSRo-200: A Romanian Visual Speech Recognition Dataset for Studying Supervision and Multimodal Robustness*.

It is a fine-tuned version of [`alexandradiaconu/whisper-small-echo-34`](https://huggingface.co/alexandradiaconu/whisper-small-echo-34), adapted to handle noisy acoustic conditions in Romanian. Training data was augmented with noise samples from the MUSAN corpus.

The model is paired with our [Romanian VSR models](https://huggingface.co/vsro200/models-vsro200) through **shallow fusion** at decoding time, combining acoustic and visual probabilities during beam search.

For the full AVSR pipeline, fusion implementation, and inference scripts, please refer to the [GitHub repository](https://github.com/vsro200/vsro200).
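
As a rough illustration of what shallow fusion at decoding time can look like, the sketch below scores one decoding step by adding weighted visual log-probabilities to the audio log-probabilities over a shared vocabulary. The function name `fuse_log_probs`, the weight `visual_weight`, and the toy vocabulary and scores are assumptions made for this example; the actual fusion rule and beam search are in the GitHub repository and may differ.

```python
import math

def log_softmax(logits):
    """Convert raw scores to log-probabilities (numerically stable)."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]

def fuse_log_probs(audio_log_probs, visual_log_probs, visual_weight=0.5):
    """Weighted sum of log-probabilities over a shared vocabulary.

    visual_weight is a hypothetical hyperparameter, not a value from the paper.
    """
    if len(audio_log_probs) != len(visual_log_probs):
        raise ValueError("both models must score the same vocabulary")
    return [a + visual_weight * v for a, v in zip(audio_log_probs, visual_log_probs)]

# Toy 3-token vocabulary; the scores are made up for the example.
vocab = ["da", "nu", "la"]
audio = log_softmax([0.2, 0.1, 0.0])   # noisy audio: nearly flat distribution
visual = log_softmax([0.0, 3.0, 0.0])  # lip reading clearly prefers "nu"

fused = fuse_log_probs(audio, visual, visual_weight=0.5)
best_audio_only = vocab[max(range(len(vocab)), key=lambda i: audio[i])]
best_fused = vocab[max(range(len(vocab)), key=lambda i: fused[i])]
print(best_audio_only, best_fused)
```

In a beam search, the same fused score would be accumulated per hypothesis at every step instead of taking a single greedy argmax.
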

## Results

Results are reported on a 100-clip subset of the VSRo-200 `test_unseen` split, under two noise types (Gaussian and babble) and signal-to-noise ratios (SNR) from -5 dB to 15 dB. All values are WER (%); lower is better.
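
The SNR levels follow the usual definition, SNR (dB) = 10 · log10(P_speech / P_noise). A minimal sketch of how a noise clip can be scaled to reach a target SNR before it is added to clean speech is given below. The function name `mix_at_snr` and the toy sine-wave signals are assumptions for illustration; the evaluation scripts in the repository may build the noisy clips differently.

```python
import math

def mean_power(samples):
    """Average power (mean of squared samples)."""
    return sum(s * s for s in samples) / len(samples)

def mix_at_snr(speech, noise, snr_db):
    """Scale `noise` so that speech power / noise power matches snr_db, then add it."""
    if len(speech) != len(noise):
        raise ValueError("speech and noise must have the same length")
    target_noise_power = mean_power(speech) / (10 ** (snr_db / 10))
    scale = math.sqrt(target_noise_power / mean_power(noise))
    return [s + scale * n for s, n in zip(speech, noise)]

# Toy signals standing in for real audio (hypothetical data).
speech = [math.sin(2 * math.pi * 5 * t / 1000) for t in range(1000)]
noise = [math.sin(2 * math.pi * 37 * t / 1000 + 1.0) for t in range(1000)]

noisy = mix_at_snr(speech, noise, snr_db=-5)

# Recover the added noise and check the SNR that was actually achieved.
added = [y - s for y, s in zip(noisy, speech)]
measured = 10 * math.log10(mean_power(speech) / mean_power(added))
print(round(measured, 2))
```

At -5 dB the noise carries roughly three times the power of the speech, which is why the audio-only rows degrade so sharply at that level.
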

### Gaussian noise

| SNR (dB) | Whisper zero-shot | Whisper fine-tuned | MultiVSR (visual) | Fusion (zero-shot + VSR) | Fusion (fine-tuned + VSR) |
|:---:|:---:|:---:|:---:|:---:|:---:|
| -5 | 89.43 | 83.16 | 50.52 | 78.68 | 40.92 |
| 0 | 67.32 | 42.78 | 50.52 | 46.04 | 25.98 |
| 5 | 48.37 | 24.79 | 50.52 | 37.48 | 19.63 |
| 10 | 32.43 | 17.34 | 50.52 | 22.93 | 15.66 |
| 15 | 23.97 | 13.87 | 50.52 | 18.92 | 12.50 |

### Babble noise

| SNR (dB) | Whisper zero-shot | Whisper fine-tuned | MultiVSR (visual) | Fusion (zero-shot + VSR) | Fusion (fine-tuned + VSR) |
|:---:|:---:|:---:|:---:|:---:|:---:|
| -5 | 94.95 | 141.78 | 50.52 | 79.65 | 45.61 |
| 0 | 73.38 | 49.77 | 50.52 | 56.11 | 29.77 |
| 5 | 46.40 | 21.50 | 50.52 | 31.96 | 18.63 |
| 10 | 29.24 | 14.22 | 50.52 | 21.86 | 15.16 |
| 15 | 22.36 | 11.68 | 50.52 | 18.77 | 12.65 |
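
At extreme noise levels (e.g., babble at -5 dB), the standalone fine-tuned audio model collapses to 141.78% WER. A WER above 100% means the output contains more word errors than the reference has words, which requires the model to insert spurious words. Shallow fusion with the visual stream lets the decoder lean on lip-reading cues and brings WER down to 45.61%, below the 50.52% of the visual model alone, which shows the value of multimodal integration in adverse acoustic conditions. The gain is largest at low SNR: with babble noise at 10 and 15 dB, the fine-tuned audio model on its own (14.22% and 11.68%) is slightly better than fusion (15.16% and 12.65%).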
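
WER can exceed 100% because insertions count as errors while the denominator is the reference length. The sketch below is a plain word-level edit-distance implementation of WER, written for illustration only (it is not the evaluation code behind the tables), with a made-up hypothesis that repeats a word.

```python
def wer(reference, hypothesis):
    """Word error rate in percent: word-level edit distance / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] = edit distance between the processed reference prefix and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, h in enumerate(hyp, start=1):
            curr[j] = min(
                prev[j] + 1,             # deletion
                curr[j - 1] + 1,         # insertion
                prev[j - 1] + (r != h),  # substitution or match
            )
        prev = curr
    return 100.0 * prev[-1] / len(ref)

reference = "salut lume"
clean = wer(reference, "salut lume")
# A looping hypothesis adds four extra words to a two-word reference.
looping = wer(reference, "salut lume lume lume lume lume")
print(clean, looping)
```

Four insertions against a two-word reference give a WER of 200%, the same mechanism that pushes the babble -5 dB result past 100%.
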

## Training hyperparameters

| Parameter | Value |
|:---|:---|
| Learning rate | 1e-05 |
| Train batch size | 8 |
| Eval batch size | 8 |
| Gradient accumulation steps | 4 |
| Effective batch size | 32 |
| Optimizer | AdamW (torch fused), β=(0.9, 0.999), ε=1e-08 |
| LR scheduler | Linear, 100 warmup steps |
| Epochs | 3 |
| Mixed precision | Native AMP |
| Seed | 42 |
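
The effective batch size follows from 8 × 4 = 32, assuming training on a single device. The linear schedule with 100 warmup steps can be sketched as below; `total_steps=1000` is a placeholder, since the real number of optimizer steps depends on the dataset size, which is not stated here, and the helper names are hypothetical.

```python
def effective_batch_size(per_device_batch, grad_accum_steps, num_devices=1):
    """Samples contributing to each optimizer step (num_devices=1 is an assumption)."""
    return per_device_batch * grad_accum_steps * num_devices

def linear_lr(step, peak_lr=1e-5, warmup_steps=100, total_steps=1000):
    """Linear warmup to peak_lr, then linear decay to zero at total_steps.

    total_steps=1000 is a placeholder, not the actual length of the training run.
    """
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    remaining = max(0, total_steps - step)
    return peak_lr * remaining / (total_steps - warmup_steps)

print(effective_batch_size(8, 4))
for step in (0, 50, 100, 550, 1000):
    print(step, linear_lr(step))
```

The learning rate rises from zero to 1e-05 over the first 100 steps and then falls linearly back to zero by the final step.
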

### Framework versions

Transformers 5.0.0 · PyTorch 2.10.0+cu128 · Datasets 4.0.0 · Tokenizers 0.22.2

## Citation

If you use this model, please cite:

```bibtex
@inproceedings{vsro200,
  title = {VSRo-200: A Romanian Visual Speech Recognition Dataset for Studying Supervision and Multimodal Robustness},
  author = {...},
  year = {2026}
}
```

```bibtex
@article{diaconu2026ron3ws,
  title = {RO-N3WS: Enhancing Generalization in Low-Resource ASR with Diverse Romanian Speech Benchmarks},
  author = {Diaconu, Alexandra and Vînaga, Mădălina and Alexe, Bogdan},
  journal = {arXiv preprint arXiv:2603.02368},
  year = {2026}
}
```

```bibtex
@article{radford2022whisper,
  title = {Robust Speech Recognition via Large-Scale Weak Supervision},
  author = {Radford, Alec and Kim, Jong Wook and Xu, Tao and Brockman, Greg and McLeavey, Christine and Sutskever, Ilya},
  journal = {arXiv preprint arXiv:2212.04356},
  year = {2022}
}
```