---
license: apache-2.0
language:
  - zh
  - en
tags:
  - tts
  - role-play
  - speech-synthesis
  - expressive-speech
  - grpo
pipeline_tag: text-to-speech
---

<h1 align="center">
  MCLP-RPTTS: Expressive Role-Play TTS Model
</h1>

<p align="center">
  Yong Ren<sup>*,1,2</sup>, Jingbei Li<sup>*,1</sup>, Haiyang Sun<sup>1</sup>, Yujie Chen<sup>3</sup>, Cheng Yi<sup>1</sup>, Yechang Huang<sup>1</sup>, Hao Gu<sup>2</sup>, Ye Bai<sup>2</sup>, Xuerui Yang<sup>1</sup>
</p>

<p align="center">
  <sup>1</sup>StepFun &nbsp; <sup>2</sup>University of Chinese Academy of Sciences &nbsp; <sup>3</sup>Beihang University
</p>

<p align="center">
  <sup>*</sup>Equal contribution
</p>

<p align="center">
📑 <a href="https://arxiv.org/abs/2601.22661">Paper</a> &nbsp;|&nbsp;
💻 <a href="https://github.com/y-ren16/MCLP">Code</a> &nbsp;|&nbsp;
📊 <a href="https://huggingface.co/datasets/y-ren16/WenetSpeech-RP">Dataset</a> &nbsp;|&nbsp;
🔢 <a href="https://huggingface.co/y-ren16/MCLP-Score">MCLP-Score Model</a>
</p>

## Model Description

**MCLP-RPTTS** is a role-play text-to-speech (TTS) model fine-tuned from Step-Audio-2-mini using SFT + GRPO, with MCLP (Mean Continuation Log-Probability) as the reward. It generates expressive speech whose style is consistent with the role-play instructions it is given: scene descriptions, character profiles, and dialogue history.

This model is presented in:

> **Evaluating and Rewarding LALMs for Expressive Role-Play TTS via Mean Continuation Log-Probability**
> *Yong Ren\*, Jingbei Li\*, Haiyang Sun, Yujie Chen, Cheng Yi, Yechang Huang, Hao Gu, Ye Bai, Xuerui Yang*
> ICML 2026

## Key Results

| Model | CER (%) ↓ | MCLP (w/ history) ↑ | MCLP (w/o history) ↑ | MOS ↑ |
|-------|-----------|---------------------|----------------------|-------|
| GPT-Audio | 11.974 | -4.849 | -4.836 | 1.752 |
| MiMo-Audio-7B | 10.605 | -4.753 | -4.745 | 2.471 |
| Step-Audio-2-mini | 3.276 | -4.829 | -4.823 | 1.707 |
| **MCLP-RPTTS (Ours)** | **1.130** | **-4.636** | **-4.687** | **3.646** |
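
The MCLP columns are negative because they are averages of log-probabilities, so values closer to zero are better. The paper gives the exact definition; the snippet below is only a minimal sketch of what the name suggests, assuming the score is the mean per-token log-probability that a scoring model assigns to a given continuation. The function names, the input layout (one logit vector per continuation step), and the use of natural logarithms are all assumptions made for illustration, and what the scoring model conditions on is not modelled here.

```python
import math
from typing import Sequence


def log_softmax(logits: Sequence[float]) -> list[float]:
    """Numerically stable log-softmax over one vocabulary-sized logit vector."""
    peak = max(logits)
    log_z = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return [x - log_z for x in logits]


def mean_continuation_logprob(
    step_logits: Sequence[Sequence[float]],
    continuation_ids: Sequence[int],
) -> float:
    """Average log-probability of the continuation tokens (hypothetical helper).

    step_logits[t] holds the scoring model's logits at continuation step t,
    and continuation_ids[t] is the token actually observed at that step.
    """
    if len(step_logits) != len(continuation_ids):
        raise ValueError("need exactly one logit vector per continuation token")
    if not continuation_ids:
        raise ValueError("continuation must contain at least one token")
    total = 0.0
    for logits, token_id in zip(step_logits, continuation_ids):
        total += log_softmax(logits)[token_id]
    return total / len(continuation_ids)


if __name__ == "__main__":
    # Toy case: uniform logits over 4 tokens, so every token has probability 1/4.
    score = mean_continuation_logprob([[0.0] * 4, [0.0] * 4], [1, 2])
    print(f"{score:.4f}")  # log(1/4)
```

Averaging over tokens rather than summing keeps scores comparable across continuations of different lengths, which is presumably why the metric is a mean.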

## Usage

```bash
# Clone the inference code
git clone https://github.com/y-ren16/MCLP.git
cd MCLP

# Run role-play TTS inference
python generate_roleplay_stepaudio2_multigpu.py \
    --model-path /path/to/MCLP-RPTTS \
    --input-jsonl /path/to/WenetSpeech-RP/eval/eval_w_history.jsonl \
    --output-dir ./outputs/roleplay_tts \
    --audio-base /path/to/extracted_test_audio \
    --prompt-base /path/to/WenetSpeech-RP/eval/audio \
    --gpus 1
```

For detailed usage instructions, please refer to the [code repository](https://github.com/y-ren16/MCLP).

## Requirements

- Python >= 3.10
- PyTorch >= 2.3 with CUDA
- GPU: at least 1x A100/H100 (80GB) for inference

```bash
pip install transformers==4.49.0 torchaudio librosa onnxruntime s3tokenizer diffusers hyperpyyaml numpy
```
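
Before launching inference, it can help to confirm that the environment meets the requirements listed above. The following is a minimal sketch, not part of the MCLP repository: `parse_version` and `check_environment` are hypothetical helper names, and the script only checks the Python version, the PyTorch version, and CUDA availability. It reports GPU memory but does not enforce a threshold.

```python
import re
import sys


def parse_version(text: str) -> tuple[int, ...]:
    """Turn a version string such as '2.3.1+cu121' into (2, 3, 1)."""
    numbers = []
    for piece in text.split("+")[0].split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break  # stop at the first component without leading digits
        numbers.append(int(match.group()))
    return tuple(numbers)


def check_environment() -> list[str]:
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []
    if sys.version_info < (3, 10):
        problems.append("Python >= 3.10 is required")
    try:
        import torch  # imported lazily so the check works without PyTorch
    except ImportError:
        problems.append("PyTorch is not installed")
        return problems
    if parse_version(torch.__version__) < (2, 3):
        problems.append(f"PyTorch >= 2.3 is required, found {torch.__version__}")
    if not torch.cuda.is_available():
        problems.append("no CUDA device is visible to PyTorch")
    else:
        props = torch.cuda.get_device_properties(0)
        print(f"GPU 0: {props.name}, {props.total_memory / 1024**3:.1f} GiB")
    return problems


if __name__ == "__main__":
    issues = check_environment()
    for issue in issues:
        print("problem:", issue)
    sys.exit(1 if issues else 0)
```

The script exits with a non-zero status when a requirement is not met, so it can gate a longer job in a shell pipeline.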

## Related Resources

| Resource | Link |
|----------|------|
| 📑 Paper | [arXiv:2601.22661](https://arxiv.org/abs/2601.22661) |
| 💻 Inference Code | [github.com/y-ren16/MCLP](https://github.com/y-ren16/MCLP) |
| 📊 WenetSpeech-RP Dataset | [huggingface.co/datasets/y-ren16/WenetSpeech-RP](https://huggingface.co/datasets/y-ren16/WenetSpeech-RP) |
| 🔢 MCLP-Score Model | [huggingface.co/y-ren16/MCLP-Score](https://huggingface.co/y-ren16/MCLP-Score) |

## Citation

```bibtex
@inproceedings{ren2026mclp,
  title={Evaluating and Rewarding LALMs for Expressive Role-Play TTS via Mean Continuation Log-Probability},
  author={Ren, Yong and Li, Jingbei and Sun, Haiyang and Chen, Yujie and Yi, Cheng and Huang, Yechang and Gu, Hao and Bai, Ye and Yang, Xuerui},
  booktitle={Proceedings of the 43rd International Conference on Machine Learning (ICML)},
  year={2026}
}
```

## License

This model is released under the [Apache 2.0 License](LICENSE).

## Acknowledgements

This project builds upon:
- [Step-Audio 2](https://github.com/stepfun-ai/Step-Audio2)
- [CosyVoice](https://github.com/FunAudioLLM/CosyVoice)
- [FlashCosyVoice](https://github.com/xingchensong/FlashCosyVoice)