y-ren16 committed
Commit 003e8b1 · verified · 1 Parent(s): ce7ecc1

Add Model Card

Files changed (1)
  1. README.md +112 -0
README.md CHANGED
@@ -1,3 +1,115 @@
  ---
  license: apache-2.0
+ language:
+ - zh
+ - en
+ tags:
+ - tts
+ - role-play
+ - speech-synthesis
+ - expressive-speech
+ - grpo
+ pipeline_tag: text-to-speech
  ---
+
+ <h1 align="center">
+ MCLP-RPTTS: Expressive Role-Play TTS Model
+ </h1>
+
+ <p align="center">
+ Yong Ren<sup>*,1,2</sup>, Jingbei Li<sup>*,1</sup>, Haiyang Sun<sup>1</sup>, Yujie Chen<sup>3</sup>, Cheng Yi<sup>1</sup>, Yechang Huang<sup>1</sup>, Hao Gu<sup>2</sup>, Ye Bai<sup>2</sup>, Xuerui Yang<sup>1</sup>
+ </p>
+
+ <p align="center">
+ <sup>1</sup>StepFun &nbsp; <sup>2</sup>University of Chinese Academy of Sciences &nbsp; <sup>3</sup>Beihang University
+ </p>
+
+ <p align="center">
+ <sup>*</sup>Equal contribution
+ </p>
+
+ <p align="center">
+ 📑 <a href="https://arxiv.org/abs/2601.22661">Paper</a> &nbsp;|&nbsp;
+ 💻 <a href="https://github.com/y-ren16/MCLP">Code</a> &nbsp;|&nbsp;
+ 📊 <a href="https://huggingface.co/datasets/y-ren16/WenetSpeech-RP">Dataset</a> &nbsp;|&nbsp;
+ 🔒 <a href="https://huggingface.co/y-ren16/MCLP-Score">MCLP-Score Model</a>
+ </p>
+
+ ## Model Description
+
+ **MCLP-RPTTS** is a role-play text-to-speech (TTS) model fine-tuned from Step-Audio-2-mini using supervised fine-tuning (SFT) and GRPO with the MCLP (Mean Continuation Log-Probability) reward. It generates expressive speech that is stylistically consistent with role-play instructions, including scene descriptions, character profiles, and dialogue history.
+
+ This model is presented in:
+
+ > **Evaluating and Rewarding LALMs for Expressive Role-Play TTS via Mean Continuation Log-Probability**
+ > *Yong Ren\*, Jingbei Li\*, Haiyang Sun, Yujie Chen, Cheng Yi, Yechang Huang, Hao Gu, Ye Bai, Xuerui Yang*
+ > ICML 2026
+
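The card does not spell out how MCLP is computed. Reading the name literally, one plausible form is the average log-probability that a scoring model assigns to each token of a continuation. The sketch below illustrates only that averaging step, under the assumption that per-position logits and the continuation token ids are already available. The function names and input layout are hypothetical and are not taken from the MCLP repository; the paper remains the reference for the actual definition.

```python
import math
from typing import Sequence


def log_softmax(logits: Sequence[float]) -> list[float]:
    """Numerically stable log-softmax over one vocabulary-sized vector."""
    peak = max(logits)
    log_z = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return [x - log_z for x in logits]


def mean_continuation_logprob(
    step_logits: Sequence[Sequence[float]],
    continuation_ids: Sequence[int],
) -> float:
    """Average log-probability of the continuation tokens (illustrative only).

    step_logits[i] holds the (hypothetical) scorer logits at position i and
    continuation_ids[i] is the token that actually appears at that position.
    """
    if not continuation_ids or len(step_logits) != len(continuation_ids):
        raise ValueError("need one logit vector per continuation token")
    total = sum(
        log_softmax(logits)[token]
        for logits, token in zip(step_logits, continuation_ids)
    )
    return total / len(continuation_ids)
```

Averaging rather than summing keeps the score comparable across continuations of different lengths, which may be why the values in the results table sit in a narrow negative range.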
+ ## Key Results
+
+ | Model | CER (%) ↓ | MCLP (w/ history) ↑ | MCLP (w/o history) ↑ | MOS ↑ |
+ |-------|-----------|---------------------|----------------------|-------|
+ | GPT-Audio | 11.974 | -4.849 | -4.836 | 1.752 |
+ | MiMo-Audio-7B | 10.605 | -4.753 | -4.745 | 2.471 |
+ | Step-Audio-2-mini | 3.276 | -4.829 | -4.823 | 1.707 |
+ | **MCLP-RPTTS (Ours)** | **1.130** | **-4.636** | **-4.687** | **3.646** |
+
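To read the table against the Step-Audio-2-mini base model, the relative CER reduction and the absolute MOS gain can be worked out directly from the reported numbers. The snippet below is only that arithmetic; the helper name is made up for illustration.

```python
# Values copied from the Key Results table.
cer = {"Step-Audio-2-mini": 3.276, "MCLP-RPTTS": 1.130}
mos = {"Step-Audio-2-mini": 1.707, "MCLP-RPTTS": 3.646}


def relative_reduction(baseline: float, new: float) -> float:
    """Percentage by which `new` is lower than `baseline`."""
    return (baseline - new) / baseline * 100.0


cer_drop = relative_reduction(cer["Step-Audio-2-mini"], cer["MCLP-RPTTS"])
mos_gain = mos["MCLP-RPTTS"] - mos["Step-Audio-2-mini"]

print(f"Relative CER reduction vs. base model: {cer_drop:.1f}%")
print(f"Absolute MOS gain vs. base model: {mos_gain:.3f}")
```

On these figures the CER falls by roughly 65.5% relative to the base model, and MOS rises by about 1.94 points.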
+ ## Usage
+
+ ```bash
+ # Clone the inference code
+ git clone https://github.com/y-ren16/MCLP.git
+ cd MCLP
+
+ # Run role-play TTS inference
+ python generate_roleplay_stepaudio2_multigpu.py \
+ --model-path /path/to/MCLP-RPTTS \
+ --input-jsonl /path/to/WenetSpeech-RP/eval/eval_w_history.jsonl \
+ --output-dir ./outputs/roleplay_tts \
+ --audio-base /path/to/extracted_test_audio \
+ --prompt-base /path/to/WenetSpeech-RP/eval/audio \
+ --gpus 1
+ ```
+
+ For detailed usage instructions, please refer to the [code repository](https://github.com/y-ren16/MCLP).
+
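When the inference script is driven from Python (for example, to loop over several input files), the same command can be assembled as an argument list. This is a minimal sketch: the helper names are hypothetical, the flags are exactly those shown in the command above, and the value given to `--gpus` is passed through unchanged because the card does not say how the script interprets it.

```python
import subprocess
import sys

# Script name as it appears in the Usage command; run from inside the MCLP checkout.
SCRIPT = "generate_roleplay_stepaudio2_multigpu.py"


def build_inference_command(
    model_path: str,
    input_jsonl: str,
    output_dir: str,
    audio_base: str,
    prompt_base: str,
    gpus: str = "1",
) -> list[str]:
    """Return the argv list for the role-play TTS inference script."""
    return [
        sys.executable, SCRIPT,
        "--model-path", model_path,
        "--input-jsonl", input_jsonl,
        "--output-dir", output_dir,
        "--audio-base", audio_base,
        "--prompt-base", prompt_base,
        "--gpus", gpus,
    ]


def run_inference(command: list[str]) -> None:
    """Run the command and raise CalledProcessError on a non-zero exit."""
    subprocess.run(command, check=True)
```

Passing a list instead of a shell string avoids quoting problems when paths contain spaces.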
+ ## Requirements
+
+ - Python >= 3.10
+ - PyTorch >= 2.3 with CUDA
+ - GPU: at least 1x A100/H100 (80 GB) for inference
+
+ ```bash
+ pip install transformers==4.49.0 torchaudio librosa onnxruntime s3tokenizer diffusers hyperpyyaml numpy
+ ```
+
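Before launching inference, the Python and PyTorch requirements listed above can be checked with a small preflight script. This is a sketch, not part of the MCLP repository: the function name is hypothetical, and GPU memory is not checked here.

```python
import importlib.util
import sys


def check_environment(version_info=sys.version_info) -> list[str]:
    """Return a list of unmet requirements (an empty list means all checks passed)."""
    problems = []
    if tuple(version_info[:2]) < (3, 10):
        problems.append("Python >= 3.10 is required")

    if importlib.util.find_spec("torch") is None:
        problems.append("PyTorch is not installed")
    else:
        import torch

        # Versions look like "2.3.0+cu121"; only major and minor are compared.
        major, minor = (int(part) for part in torch.__version__.split(".")[:2])
        if (major, minor) < (2, 3):
            problems.append("PyTorch >= 2.3 is required")
        if not torch.cuda.is_available():
            problems.append("no CUDA device is visible to PyTorch")
    return problems


if __name__ == "__main__":
    for problem in check_environment():
        print("Requirement not met:", problem)
```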
+ ## Related Resources
+
+ | Resource | Link |
+ |----------|------|
+ | 📑 Paper | [arXiv:2601.22661](https://arxiv.org/abs/2601.22661) |
+ | 💻 Inference Code | [github.com/y-ren16/MCLP](https://github.com/y-ren16/MCLP) |
+ | 📊 WenetSpeech-RP Dataset | [huggingface.co/datasets/y-ren16/WenetSpeech-RP](https://huggingface.co/datasets/y-ren16/WenetSpeech-RP) |
+ | 🔒 MCLP-Score Model | [huggingface.co/y-ren16/MCLP-Score](https://huggingface.co/y-ren16/MCLP-Score) |
+
+ ## Citation
+
+ ```bibtex
+ @inproceedings{ren2026mclp,
+ title={Evaluating and Rewarding LALMs for Expressive Role-Play TTS via Mean Continuation Log-Probability},
+ author={Ren, Yong and Li, Jingbei and Sun, Haiyang and Chen, Yujie and Yi, Cheng and Huang, Yechang and Gu, Hao and Bai, Ye and Yang, Xuerui},
+ booktitle={Proceedings of the 43rd International Conference on Machine Learning (ICML)},
+ year={2026}
+ }
+ ```
+
+ ## License
+
+ This model is released under the [Apache 2.0 License](LICENSE).
+
+ ## Acknowledgements
+
+ This project builds upon:
+ - [Step-Audio 2](https://github.com/stepfun-ai/Step-Audio2)
+ - [CosyVoice](https://github.com/FunAudioLLM/CosyVoice)
+ - [FlashCosyVoice](https://github.com/xingchensong/FlashCosyVoice)