| --- |
| license: apache-2.0 |
| pipeline_tag: automatic-speech-recognition |
| tags: |
| - pytorch |
| - audio |
| - speech |
| - automatic-speech-recognition |
| - whisper |
| - wav2vec2 |
|
|
| model-index: |
| - name: whisper_medium_fp16_transformers |
| results: |
| - task: |
| type: automatic-speech-recognition |
| name: Automatic Speech Recognition |
| dataset: |
| type: librispeech_asr |
| name: LibriSpeech (clean) |
| config: clean |
| split: test |
| args: |
| language: en |
| metrics: |
| - type: wer |
| value: 0 |
| name: Test WER |
| description: Word Error Rate |
| - type: mer |
| value: 0 |
| name: Test MER |
| description: Match Error Rate |
| - type: wil |
| value: 0 |
| name: Test WIL |
| description: Word Information Lost |
| - type: wip |
| value: 0 |
| name: Test WIP |
| description: Word Information Preserved |
| - type: cer |
| value: 0 |
| name: Test CER |
| description: Character Error Rate |
|
|
| - task: |
| type: automatic-speech-recognition |
| name: Automatic Speech Recognition |
| dataset: |
| type: librispeech_asr |
| name: LibriSpeech (other) |
| config: other |
| split: test |
| args: |
| language: en |
| metrics: |
| - type: wer |
| value: 0 |
| name: Test WER |
| description: Word Error Rate |
| - type: mer |
| value: 0 |
| name: Test MER |
| description: Match Error Rate |
| - type: wil |
| value: 0 |
| name: Test WIL |
| description: Word Information Lost |
| - type: wip |
| value: 0 |
| name: Test WIP |
| description: Word Information Preserved |
| - type: cer |
| value: 0 |
| name: Test CER |
| description: Character Error Rate |
| |
| - task: |
| type: automatic-speech-recognition |
| name: Automatic Speech Recognition |
| dataset: |
| type: mozilla-foundation/common_voice_14_0 |
| name: Common Voice (14.0) (Hindi) |
| config: hi |
| split: test |
| args: |
| language: hi |
| metrics: |
| - type: wer |
| value: 54.97 |
| name: Test WER |
| description: Word Error Rate |
| - type: mer |
| value: 47.86 |
| name: Test MER |
| description: Match Error Rate |
| - type: wil |
| value: 66.83 |
| name: Test WIL |
| description: Word Information Lost |
| - type: wip |
| value: 33.16 |
| name: Test WIP |
| description: Word Information Preserved |
| - type: cer |
| value: 30.23 |
| name: Test CER |
| description: Character Error Rate |
|
|
| widget: |
| - example_title: Hinglish Sample |
| src: https://huggingface.co/devasheeshG/whisper_medium_fp16_transformers/resolve/main/test.wav |
| - example_title: Librispeech sample 1 |
| src: https://cdn-media.huggingface.co/speech_samples/sample1.flac |
| - example_title: Librispeech sample 2 |
| src: https://cdn-media.huggingface.co/speech_samples/sample2.flac |
|
|
| language: |
| - en |
| - zh |
| - de |
| - es |
| - ru |
| - ko |
| - fr |
| - ja |
| - pt |
| - tr |
| - pl |
| - ca |
| - nl |
| - ar |
| - sv |
| - it |
| - id |
| - hi |
| - fi |
| - vi |
| - he |
| - uk |
| - el |
| - ms |
| - cs |
| - ro |
| - da |
| - hu |
| - ta |
| - 'no' |
| - th |
| - ur |
| - hr |
| - bg |
| - lt |
| - la |
| - mi |
| - ml |
| - cy |
| - sk |
| - te |
| - fa |
| - lv |
| - bn |
| - sr |
| - az |
| - sl |
| - kn |
| - et |
| - mk |
| - br |
| - eu |
| - is |
| - hy |
| - ne |
| - mn |
| - bs |
| - kk |
| - sq |
| - sw |
| - gl |
| - mr |
| - pa |
| - si |
| - km |
| - sn |
| - yo |
| - so |
| - af |
| - oc |
| - ka |
| - be |
| - tg |
| - sd |
| - gu |
| - am |
| - yi |
| - lo |
| - uz |
| - fo |
| - ht |
| - ps |
| - tk |
| - nn |
| - mt |
| - sa |
| - lb |
| - my |
| - bo |
| - tl |
| - mg |
| - as |
| - tt |
| - haw |
| - ln |
| - ha |
| - ba |
| - jw |
| - su |
| --- |
| ## Versions: |
|
|
| - CUDA: 12.1 |
| - cuDNN Version: 8.9.2.26_1.0-1_amd64 |
|
|
| * tensorflow Version: 2.12.0 |
| * torch Version: 2.1.0.dev20230606+cu12135 |
| * transformers Version: 4.30.2 |
| * accelerate Version: 0.20.3 |
|
|
| ## Model Benchmarks: |
|
|
| - RAM: 2.8 GB (Original_Model: 5.5 GB) |
| - VRAM: 1812 MB (Original_Model: 6 GB) |
| - test.wav: 23 s (Multilingual Speech i.e. English+Hindi) |
|
|
| - **Processing time in seconds on each device** |
|
|
| Device Name | float32 (Original) | float16 | CUDA Cores | Tensor Cores | |
| ----------------- | ------------------ | ------- | --------- | ----------- | |
| 3060 | 1.7 | 1.1 | 3,584 | 112 | |
| 1660 Super | OOM | 3.3 | 1,408 | N/A | |
| Colab (Tesla T4) | 2.8 | 2.2 | 2,560 | 320 | |
| Colab (CPU) | 35 | N/A | N/A | N/A | |
| | M1 (CPU) | - | - | - | - | |
| | M1 (GPU -> 'mps') | - | - | - | - | |
|
|
|
|
| - **NOTE: Tensor Cores speed up mixed-precision (float16) calculations** |
| - **CPU -> torch.float16 is not supported on CPU (AMD Ryzen 5 3600 or Colab CPU)** |
| - Punctuation: True |
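
Since float16 is only usable on the GPU here, the precision has to follow the device. A minimal sketch of that choice, assuming device strings such as `"cuda"`, `"cuda:0"` and `"cpu"`; `pick_dtype_name` is a hypothetical helper and not part of this repo:

```python
def pick_dtype_name(device: str) -> str:
    """Return the name of the torch dtype to load the weights in.

    Assumption: any device string starting with "cuda" can run float16;
    everything else (e.g. "cpu") falls back to float32.
    """
    return "float16" if device.startswith("cuda") else "float32"


# With PyTorch installed, the name maps to a dtype object via
# getattr(torch, pick_dtype_name(device)), i.e. torch.float16 or torch.float32.
for device in ("cuda", "cuda:0", "cpu"):
    print(device, "->", pick_dtype_name(device))
```

Returning the dtype name instead of the dtype object keeps the sketch runnable without PyTorch installed.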
|
|
| ## Model Error Benchmarks: |
|
|
| - **WER: Word Error Rate** |
| - **MER: Match Error Rate** |
| - **WIL: Word Information Lost** |
| - **WIP: Word Information Preserved** |
| - **CER: Character Error Rate** |
|
|
| ### Hindi to Hindi (test.tsv) [Common Voice 14.0](https://commonvoice.mozilla.org/en/datasets) |
|
|
| **Test done on RTX 3060 on 2557 Samples** |
|
|
| | | WER | MER | WIL | WIP | CER | |
| | ----------------------- | ----- | ----- | ----- | ----- | ----- | |
| | Original_Model (54 min) | 52.02 | 47.86 | 66.82 | 33.17 | 23.76 | |
| | This_Model (38 min) | 54.97 | 47.86 | 66.83 | 33.16 | 30.23 | |
|
|
| ### Hindi to English (test.tsv) [Common Voice 14.0](https://commonvoice.mozilla.org/en/datasets) |
|
|
| **Test done on RTX 3060 on 1000 Samples** |
|
|
| | | WER | MER | WIL | WIP | CER | |
| | ----------------------- | --- | --- | --- | --- | --- | |
| | Original_Model (30 min) | - | - | - | - | - | |
| | This_Model (20 min) | - | - | - | - | - | |
|
|
| ### English ([LibriSpeech](https://huggingface.co/datasets/librispeech_asr) -> test-clean) |
|
|
| **Test done on RTX 3060 on __ Samples** |
|
|
| | | WER | MER | WIL | WIP | CER | |
| | -------------- | --- | --- | --- | --- | --- | |
| | Original_Model | - | - | - | - | - | |
| | This_Model | - | - | - | - | - | |
|
|
| ### English ([LibriSpeech](https://huggingface.co/datasets/librispeech_asr) -> test-other) |
|
|
| **Test done on RTX 3060 on __ Samples** |
|
|
| | | WER | MER | WIL | WIP | CER | |
| | -------------- | --- | --- | --- | --- | --- | |
| | Original_Model | - | - | - | - | - | |
| | This_Model | - | - | - | - | - | |
|
|
| - **'jiwer' library is used for calculations** |
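
To make the five metrics concrete, here is a dependency-free sketch of how they follow from the hit, substitution, deletion and insertion counts of a word alignment. It is an illustration of the standard definitions, not the evaluation script used for the tables above. `align_counts` and `error_metrics` are hypothetical names, and computing CER over all characters (spaces included) is an assumption. The functions return fractions; the tables report the same quantities multiplied by 100.

```python
def align_counts(ref, hyp):
    """Return (hits, substitutions, deletions, insertions) for one
    minimum-edit-distance alignment of two token sequences."""
    n, m = len(ref), len(hyp)
    # cost[i][j] = edit distance between ref[:i] and hyp[:j]
    cost = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cost[i][0] = i
    for j in range(1, m + 1):
        cost[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = cost[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            cost[i][j] = min(diag, cost[i - 1][j] + 1, cost[i][j - 1] + 1)
    # Walk back from the bottom-right corner, counting each operation.
    hits = subs = dels = ins = 0
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and cost[i][j] == cost[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1]):
            if ref[i - 1] == hyp[j - 1]:
                hits += 1
            else:
                subs += 1
            i, j = i - 1, j - 1
        elif i > 0 and cost[i][j] == cost[i - 1][j] + 1:
            dels += 1
            i -= 1
        else:
            ins += 1
            j -= 1
    return hits, subs, dels, ins


def error_metrics(reference, hypothesis):
    """Compute WER, MER, WIL, WIP and CER as fractions in [0, ...]."""
    ref_words, hyp_words = reference.split(), hypothesis.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    h, s, d, i = align_counts(ref_words, hyp_words)
    wer = (s + d + i) / (h + s + d)        # errors per reference word
    mer = (s + d + i) / (h + s + d + i)    # errors per aligned position
    wip = (h / len(ref_words)) * (h / len(hyp_words)) if hyp_words else 0.0
    wil = 1.0 - wip                        # WIL and WIP always sum to 1
    ch, cs, cd, ci = align_counts(list(reference), list(hypothesis))
    cer = (cs + cd + ci) / (ch + cs + cd)  # same as WER, on characters
    return {"wer": wer, "mer": mer, "wil": wil, "wip": wip, "cer": cer}


metrics = error_metrics("the cat sat on the mat", "the cat sit on mat")
print({name: round(value, 4) for name, value in metrics.items()})
```

In this example there is one substitution and one deletion against six reference words, so WER is 2/6. Because WIL is defined as 1 − WIP, the WIL and WIP columns in the tables above add up to roughly 100.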
|
|
| ## Code for conversion: |
|
|
| - ### [Will be uploaded to GitHub soon](https://github.com/devasheeshG) |
|
|
| ## Usage |
|
|
| This repo contains an ``__init__.py`` file with all the code needed to use this model. |
|
|
| First, clone this repo and place all the files inside a folder. |
|
|
| ### Make sure you have git-lfs installed (https://git-lfs.com) |
|
|
| ```bash |
| git lfs install |
| git clone https://huggingface.co/devasheeshG/whisper_medium_fp16_transformers |
| ``` |
|
|
| **Please try this in a Jupyter notebook** |
|
|
| ```python |
| # Import the Model |
| from whisper_medium_fp16_transformers import Model, load_audio, pad_or_trim |
| ``` |
|
|
| ```python |
| # Initialize the model |
| model = Model( |
| model_name_or_path='whisper_medium_fp16_transformers', |
| cuda_visible_device="0", |
| device='cuda', |
| ) |
| ``` |
|
|
| ```python |
| # Load Audio |
| audio = load_audio('whisper_medium_fp16_transformers/test.wav') |
| audio = pad_or_trim(audio) |
| ``` |
|
|
| ```python |
| # Transcribe (the first transcription takes longer) |
| model.transcribe(audio) |
| ``` |
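
Because the first transcription takes longer, timings are more representative when taken after a warm-up call. A minimal sketch using only the standard library; `time_call` is a hypothetical helper that is not part of this repo, and the stand-in workload is there only so the block runs on its own:

```python
import time


def time_call(fn, *args, warmup=1, repeats=3):
    """Return the mean wall-clock seconds of fn(*args) after warm-up calls."""
    for _ in range(warmup):
        fn(*args)  # discarded: the first call includes one-time setup costs
    start = time.perf_counter()
    for _ in range(repeats):
        fn(*args)
    return (time.perf_counter() - start) / repeats


# Stand-in workload so the sketch runs without the model; with the model
# loaded as above, this would be time_call(model.transcribe, audio).
elapsed = time_call(sum, range(100_000))
print(f"{elapsed:.4f} s per call")
```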
|
|
| ## Credits |
|
|
| This is an fp16 version of ``openai/whisper-medium``. |
|
|