Add library_name, paper links, and citation
#1
by nielsr HF Staff - opened
README.md
CHANGED
---
datasets:
- k2-fsa/TTS_eval_datasets
language:
- en
- zh
license: apache-2.0
pipeline_tag: text-to-speech
library_name: transformers
---

# TTS Evaluation Models

This repository contains models for the objective evaluation of text-to-speech (TTS) models, as presented in the papers [ZipVoice: Fast and High-Quality Zero-Shot Text-to-Speech with Flow Matching](https://huggingface.co/papers/2506.13053) and [ZipVoice-Dialog: Non-Autoregressive Spoken Dialogue Generation with Flow Matching](https://huggingface.co/papers/2507.09318).

- **Code:** [k2-fsa/ZipVoice](https://github.com/k2-fsa/ZipVoice)
- **Project Page:** [ZipVoice-Dialog Demos](https://zipvoice-dialog.github.io)

## Evaluation Metrics

This repository specifically supports the following evaluation metrics:

- **WER**: Word error rate, computed with a [Hubert-based ASR model](https://huggingface.co/facebook/hubert-large-ls960-ft) for the LibriSpeech-PC test set, a [Paraformer-based ASR model](https://huggingface.co/funasr/paraformer-zh) for Chinese test sets, [Whisper-large-v3](https://huggingface.co/openai/whisper-large-v3) for general English test sets, and [WhisperD](https://huggingface.co/jordand/whisper-d-v1a) for English dialogue speech.

- **cpWER**: The [WhisperD](https://huggingface.co/jordand/whisper-d-v1a) model is used to compute the concatenated minimum-permutation word error rate ([cpWER](https://arxiv.org/abs/2507.09318)) for English dialogue speech.

- **SIM-o**: A [WavLM-based speaker verification model](https://github.com/microsoft/UniSpeech/tree/main/downstreams/speaker_verification) is used to compute the speaker similarity between the prompt and the generated speech.

- **UTMOS**: The MOS prediction model [UTMOS](https://github.com/sarulab-speech/UTMOS22) is used.

For more details, please refer to the [official repository](https://github.com/k2-fsa/ZipVoice).
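
To illustrate how the WER and cpWER metrics are defined, here is a minimal pure-Python sketch. This is not the repository's actual evaluation code (which runs the ASR and speaker-verification models linked above); the function names and the word-level tokenization by whitespace are illustrative assumptions.

```python
from itertools import permutations
from math import sqrt


def edit_distance(ref, hyp):
    """Word-level Levenshtein distance between two token lists."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,               # deletion
                           cur[j - 1] + 1,            # insertion
                           prev[j - 1] + (r != h)))   # substitution
        prev = cur
    return prev[-1]


def wer(ref, hyp):
    """Word error rate: edit distance normalized by reference length."""
    ref_words, hyp_words = ref.split(), hyp.split()
    return edit_distance(ref_words, hyp_words) / max(len(ref_words), 1)


def cpwer(refs_by_speaker, hyps_by_speaker):
    """cpWER: each list entry is one speaker's concatenated transcript;
    take the assignment of hypothesis speakers to reference speakers
    (a permutation) that minimizes the total error rate."""
    total_ref = sum(len(r.split()) for r in refs_by_speaker)
    return min(
        sum(edit_distance(r.split(), h.split())
            for r, h in zip(refs_by_speaker, perm)) / max(total_ref, 1)
        for perm in permutations(hyps_by_speaker)
    )


def cosine_similarity(u, v):
    """SIM-o is a cosine similarity of this form, computed between
    speaker embeddings of the prompt and the generated speech."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v)))
```

For example, `cpwer(["hi there", "good morning"], ["good morning", "hi there"])` is `0.0`: the permutation that swaps the two hypothesis speakers matches the references exactly, which is why cpWER is robust to speaker-label ambiguity in dialogue speech.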

## Citation

```bibtex
@article{zhu2025zipvoice,
  title={ZipVoice: Fast and High-Quality Zero-Shot Text-to-Speech with Flow Matching},
  author={Zhu, Han and Kang, Wei and Yao, Zengwei and Guo, Liyong and Kuang, Fangjun and Li, Zhaoqing and Zhuang, Weiji and Lin, Long and Povey, Daniel},
  journal={arXiv preprint arXiv:2506.13053},
  year={2025}
}

@article{zhu2025zipvoicedialog,
  title={ZipVoice-Dialog: Non-Autoregressive Spoken Dialogue Generation with Flow Matching},
  author={Zhu, Han and Kang, Wei and Guo, Liyong and Yao, Zengwei and Kuang, Fangjun and Zhuang, Weiji and Li, Zhaoqing and Han, Zhifeng and Zhang, Dong and Zhang, Xin and Song, Xingchen and Lin, Long and Povey, Daniel},
  journal={arXiv preprint arXiv:2507.09318},
  year={2025}
}
```
|