Add Model Card
README.md
CHANGED
@@ -1,3 +1,115 @@
---
license: apache-2.0
language:
- zh
- en
tags:
- tts
- role-play
- speech-synthesis
- expressive-speech
- grpo
pipeline_tag: text-to-speech
---

<h1 align="center">
MCLP-RPTTS: Expressive Role-Play TTS Model
</h1>

<p align="center">
Yong Ren<sup>*,1,2</sup>, Jingbei Li<sup>*,1</sup>, Haiyang Sun<sup>1</sup>, Yujie Chen<sup>3</sup>, Cheng Yi<sup>1</sup>, Yechang Huang<sup>1</sup>, Hao Gu<sup>2</sup>, Ye Bai<sup>2</sup>, Xuerui Yang<sup>1</sup>
</p>

<p align="center">
<sup>1</sup>StepFun <sup>2</sup>University of Chinese Academy of Sciences <sup>3</sup>Beihang University
</p>

<p align="center">
<sup>*</sup>Equal contribution
</p>

<p align="center">
<a href="https://arxiv.org/abs/2601.22661">Paper</a> |
<a href="https://github.com/y-ren16/MCLP">Code</a> |
<a href="https://huggingface.co/datasets/y-ren16/WenetSpeech-RP">Dataset</a> |
<a href="https://huggingface.co/y-ren16/MCLP-Score">MCLP-Score Model</a>
</p>

## Model Description

**MCLP-RPTTS** is a Role-Play Text-to-Speech model fine-tuned from Step-Audio-2-mini using SFT + GRPO with the MCLP (Mean Continuation Log-Probability) reward. It generates expressive speech that is stylistically consistent with role-play instructions, including scene descriptions, character profiles, and dialogue history.
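
As a rough illustration of the reward named above, the sketch below computes a mean continuation log-probability from per-step logits and then standardizes a group of such rewards in the way GRPO commonly does. It is a minimal sketch under assumptions: the scoring model, the choice of context and continuation, and any scaling used in the paper are not specified in this card, and the function names (`mean_continuation_log_prob`, `group_relative_advantages`) are hypothetical rather than taken from the MCLP code.

```python
import math
from statistics import mean, pstdev


def log_softmax(logits):
    """Numerically stable log-softmax over one vocabulary-sized list of logits."""
    peak = max(logits)
    log_norm = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return [x - log_norm for x in logits]


def mean_continuation_log_prob(step_logits, continuation_ids):
    """Average log-probability of the continuation tokens.

    step_logits[t] is assumed to hold the scorer's logits at continuation
    position t, already conditioned on the context and on earlier tokens.
    """
    if len(step_logits) != len(continuation_ids) or not continuation_ids:
        raise ValueError("need exactly one logit vector per continuation token")
    token_log_probs = [
        log_softmax(logits)[token_id]
        for logits, token_id in zip(step_logits, continuation_ids)
    ]
    return mean(token_log_probs)


def group_relative_advantages(rewards, eps=1e-6):
    """GRPO-style advantages: standardize rewards within one sampled group."""
    center = mean(rewards)
    spread = pstdev(rewards)
    # eps (an assumed constant) avoids division by zero when all rewards match.
    return [(r - center) / (spread + eps) for r in rewards]


if __name__ == "__main__":
    # Toy example: 3-token vocabulary, 2-token continuation.
    toy_logits = [[2.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
    print("toy MCLP:", mean_continuation_log_prob(toy_logits, [0, 1]))
    # Four sampled candidates for one prompt, each scored with a reward.
    print(group_relative_advantages([-4.9, -4.7, -4.6, -4.8]))
```

Averaging over the continuation length keeps scores comparable across continuations of different lengths, and a mean of log-probabilities is never positive, which is consistent with the negative MCLP values reported in this card.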

This model is presented in:

> **Evaluating and Rewarding LALMs for Expressive Role-Play TTS via Mean Continuation Log-Probability**
> *Yong Ren\*, Jingbei Li\*, Haiyang Sun, Yujie Chen, Cheng Yi, Yechang Huang, Hao Gu, Ye Bai, Xuerui Yang*
> ICML 2026

## Key Results

| Model | CER (%) ↓ | MCLP (w/ history) ↑ | MCLP (w/o history) ↑ | MOS ↑ |
|-------|-----------|---------------------|----------------------|-------|
| GPT-Audio | 11.974 | -4.849 | -4.836 | 1.752 |
| MiMo-Audio-7B | 10.605 | -4.753 | -4.745 | 2.471 |
| Step-Audio-2-mini | 3.276 | -4.829 | -4.823 | 1.707 |
| **MCLP-RPTTS (Ours)** | **1.130** | **-4.636** | **-4.687** | **3.646** |

## Usage

```bash
# Clone the inference code
git clone https://github.com/y-ren16/MCLP.git
cd MCLP

# Run role-play TTS inference
python generate_roleplay_stepaudio2_multigpu.py \
    --model-path /path/to/MCLP-RPTTS \
    --input-jsonl /path/to/WenetSpeech-RP/eval/eval_w_history.jsonl \
    --output-dir ./outputs/roleplay_tts \
    --audio-base /path/to/extracted_test_audio \
    --prompt-base /path/to/WenetSpeech-RP/eval/audio \
    --gpus 1
```

For detailed usage instructions, please refer to the [code repository](https://github.com/y-ren16/MCLP).
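
If the inference script is driven from Python, for example to run it over several input files, the command above can be assembled programmatically. The sketch below is a thin, hypothetical wrapper: it only reuses the script name and flags shown in the shell example, the helper names `build_inference_command` and `run_inference` are illustrative, and it assumes the script is launched from inside the cloned `MCLP` checkout.

```python
import subprocess
import sys

# Script name taken from the shell example in this card.
SCRIPT_NAME = "generate_roleplay_stepaudio2_multigpu.py"


def build_inference_command(model_path, input_jsonl, output_dir,
                            audio_base, prompt_base, gpus=1):
    """Return the argv list for one role-play TTS inference run."""
    return [
        sys.executable, SCRIPT_NAME,
        "--model-path", str(model_path),
        "--input-jsonl", str(input_jsonl),
        "--output-dir", str(output_dir),
        "--audio-base", str(audio_base),
        "--prompt-base", str(prompt_base),
        "--gpus", str(gpus),
    ]


def run_inference(repo_dir, **paths):
    """Run the script with repo_dir (the cloned checkout) as working directory."""
    command = build_inference_command(**paths)
    # check=True raises CalledProcessError if the script exits with an error.
    return subprocess.run(command, cwd=repo_dir, check=True)
```

A call would look like `run_inference("MCLP", model_path="/path/to/MCLP-RPTTS", input_jsonl="/path/to/WenetSpeech-RP/eval/eval_w_history.jsonl", output_dir="./outputs/roleplay_tts", audio_base="/path/to/extracted_test_audio", prompt_base="/path/to/WenetSpeech-RP/eval/audio")`. Relative paths are resolved against `repo_dir`, because that is the working directory the script runs in.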

## Requirements

- Python >= 3.10
- PyTorch >= 2.3 with CUDA
- GPU: at least 1x A100/H100 (80GB) for inference

```bash
pip install transformers==4.49.0 torchaudio librosa onnxruntime s3tokenizer diffusers hyperpyyaml numpy
```
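
Before running inference, it can help to confirm that the environment matches the requirements listed above. The following is a minimal sketch, not part of the MCLP code: the helper names are hypothetical, the version parsing is deliberately simple, and it checks only the Python version, the installed PyTorch and `transformers` versions, and whether PyTorch can see a CUDA device. It does not verify GPU model or memory.

```python
import sys
from importlib import metadata


def parse_version(text):
    """Turn '2.3.1+cu121' into (2, 3, 1); stop at the first non-numeric part."""
    numbers = []
    for part in text.split("+")[0].split("."):
        digits = ""
        for ch in part:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        numbers.append(int(digits))
    return tuple(numbers)


def at_least(version_text, minimum):
    """True if the parsed version is >= the minimum tuple, e.g. (2, 3)."""
    return parse_version(version_text) >= minimum


def check_environment():
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []
    if sys.version_info < (3, 10):
        problems.append("Python >= 3.10 is required")
    try:
        torch_version = metadata.version("torch")
    except metadata.PackageNotFoundError:
        problems.append("PyTorch is not installed")
    else:
        if not at_least(torch_version, (2, 3)):
            problems.append(f"PyTorch >= 2.3 is required, found {torch_version}")
        import torch  # imported lazily so the pure helpers work without it
        if not torch.cuda.is_available():
            problems.append("no CUDA device is visible to PyTorch")
    try:
        tf_version = metadata.version("transformers")
    except metadata.PackageNotFoundError:
        problems.append("transformers is not installed")
    else:
        if parse_version(tf_version) != (4, 49, 0):
            problems.append(f"transformers==4.49.0 is pinned, found {tf_version}")
    return problems


if __name__ == "__main__":
    for problem in check_environment():
        print("WARNING:", problem)
```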

## Related Resources

| Resource | Link |
|----------|------|
| Paper | [arXiv:2601.22661](https://arxiv.org/abs/2601.22661) |
| Inference Code | [github.com/y-ren16/MCLP](https://github.com/y-ren16/MCLP) |
| WenetSpeech-RP Dataset | [huggingface.co/datasets/y-ren16/WenetSpeech-RP](https://huggingface.co/datasets/y-ren16/WenetSpeech-RP) |
| MCLP-Score Model | [huggingface.co/y-ren16/MCLP-Score](https://huggingface.co/y-ren16/MCLP-Score) |

## Citation

```bibtex
@inproceedings{ren2026mclp,
  title={Evaluating and Rewarding LALMs for Expressive Role-Play TTS via Mean Continuation Log-Probability},
  author={Ren, Yong and Li, Jingbei and Sun, Haiyang and Chen, Yujie and Yi, Cheng and Huang, Yechang and Gu, Hao and Bai, Ye and Yang, Xuerui},
  booktitle={Proceedings of the 43rd International Conference on Machine Learning (ICML)},
  year={2026}
}
```

## License

This model is released under the [Apache 2.0 License](LICENSE).

## Acknowledgements

This project builds upon:
- [Step-Audio 2](https://github.com/stepfun-ai/Step-Audio2)
- [CosyVoice](https://github.com/FunAudioLLM/CosyVoice)
- [FlashCosyVoice](https://github.com/xingchensong/FlashCosyVoice)