---
license: apache-2.0
language:
- zh
- en
tags:
- tts
- role-play
- speech-synthesis
- expressive-speech
- grpo
pipeline_tag: text-to-speech
---
MCLP-RPTTS: Expressive Role-Play TTS Model
Yong Ren*,1,2, Jingbei Li*,1, Haiyang Sun1, Yujie Chen3, Cheng Yi1, Yechang Huang1, Hao Gu2, Ye Bai2, Xuerui Yang1
1StepFun 2University of Chinese Academy of Sciences 3Beihang University
*Equal contribution
Paper | Code | Dataset | MCLP-Score Model
Model Description
MCLP-RPTTS is a role-play text-to-speech (TTS) model fine-tuned from Step-Audio-2-mini with SFT and GRPO, using Mean Continuation Log-Probability (MCLP) as the reward. It generates expressive speech that is stylistically consistent with role-play instructions, which include scene descriptions, character profiles, and dialogue history.
This model is presented in:
Evaluating and Rewarding LALMs for Expressive Role-Play TTS via Mean Continuation Log-Probability. Yong Ren*, Jingbei Li*, Haiyang Sun, Yujie Chen, Cheng Yi, Yechang Huang, Hao Gu, Ye Bai, Xuerui Yang. ICML 2026.
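As a rough illustration of what a mean continuation log-probability computes, going only by the metric's name: the sketch below averages the log-probabilities that a scoring model assigns to each token of a continuation. The function name `mean_continuation_logprob` and the toy distributions are hypothetical. The actual MCLP definition, the scoring model, and the conditioning context are specified in the paper and code repository, not here.

```python
import math
from typing import Sequence


def mean_continuation_logprob(
    step_probs: Sequence[Sequence[float]],
    continuation: Sequence[int],
) -> float:
    """Average log-probability of the continuation tokens.

    step_probs[t] is a (hypothetical) scorer's probability distribution over
    the vocabulary at step t, already conditioned on the context and on
    continuation[:t]. continuation[t] is the token id observed at step t.
    """
    if len(step_probs) != len(continuation):
        raise ValueError("need one distribution per continuation token")
    if not continuation:
        raise ValueError("continuation must be non-empty")
    total = sum(
        math.log(probs[tok]) for probs, tok in zip(step_probs, continuation)
    )
    return total / len(continuation)


# Toy example (assumed data): 3-token vocabulary, 2-token continuation.
toy_probs = [[0.5, 0.25, 0.25], [0.1, 0.8, 0.1]]
toy_continuation = [0, 1]
score = mean_continuation_logprob(toy_probs, toy_continuation)
print(round(score, 3))
```

A score of this form is never positive, and values closer to zero mean the scorer finds the continuation more likely. Averaging over tokens rather than summing keeps scores comparable across continuations of different lengths.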
Key Results
| Model | CER (%) ↓ | MCLP (w/ History) ↑ | MCLP (w/o History) ↑ | MOS ↑ |
|---|---|---|---|---|
| GPT-Audio | 11.974 | -4.849 | -4.836 | 1.752 |
| MiMo-Audio-7B | 10.605 | -4.753 | -4.745 | 2.471 |
| Step-Audio-2-mini | 3.276 | -4.829 | -4.823 | 1.707 |
| MCLP-RPTTS (Ours) | 1.130 | -4.636 | -4.687 | 3.646 |
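CER (character error rate) is conventionally the character-level edit distance between a reference transcript and a recognized hypothesis, divided by the reference length. A minimal sketch of that standard definition follows. The function names are illustrative, and the ASR system and text normalization behind the numbers in the table are not stated here.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance between two strings, counted in characters."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(
                min(
                    prev[j] + 1,         # delete a reference character
                    curr[j - 1] + 1,     # insert a hypothesis character
                    prev[j - 1] + cost,  # substitute, or match at no cost
                )
            )
        prev = curr
    return prev[-1]


def cer_percent(ref: str, hyp: str) -> float:
    """Character error rate as a percentage of the reference length."""
    if not ref:
        raise ValueError("reference must be non-empty")
    return 100.0 * edit_distance(ref, hyp) / len(ref)


# One substituted character in a six-character reference.
print(cer_percent("今天天气很好", "今天天气真好"))
```

Because insertions count as errors, CER can exceed 100% when the hypothesis is much longer than the reference.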
Usage
# Clone the inference code
git clone https://github.com/y-ren16/MCLP.git
cd MCLP
# Run role-play TTS inference
python generate_roleplay_stepaudio2_multigpu.py \
--model-path /path/to/MCLP-RPTTS \
--input-jsonl /path/to/WenetSpeech-RP/eval/eval_w_history.jsonl \
--output-dir ./outputs/roleplay_tts \
--audio-base /path/to/extracted_test_audio \
--prompt-base /path/to/WenetSpeech-RP/eval/audio \
--gpus 1
For detailed usage instructions, please refer to the code repository.
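When inference is driven from another Python program, the shell command above can be assembled as an argument list. The sketch below only mirrors the flags shown in that command. The helper name `build_inference_command` is hypothetical, the paths are the same placeholders, and `--gpus` is passed through as an opaque string because its exact meaning is documented in the repository rather than here.

```python
import shlex
import sys
from typing import List


def build_inference_command(
    model_path: str,
    input_jsonl: str,
    output_dir: str,
    audio_base: str,
    prompt_base: str,
    gpus: str = "1",
) -> List[str]:
    """Assemble the argv list for the inference script shown above."""
    return [
        sys.executable,  # the current interpreter, in place of plain `python`
        "generate_roleplay_stepaudio2_multigpu.py",
        "--model-path", model_path,
        "--input-jsonl", input_jsonl,
        "--output-dir", output_dir,
        "--audio-base", audio_base,
        "--prompt-base", prompt_base,
        "--gpus", gpus,
    ]


cmd = build_inference_command(
    model_path="/path/to/MCLP-RPTTS",
    input_jsonl="/path/to/WenetSpeech-RP/eval/eval_w_history.jsonl",
    output_dir="./outputs/roleplay_tts",
    audio_base="/path/to/extracted_test_audio",
    prompt_base="/path/to/WenetSpeech-RP/eval/audio",
)
# Print the command for inspection; nothing is executed here.
print(shlex.join(cmd))
```

To launch the run, pass `cmd` to `subprocess.run(cmd, check=True)` from inside the cloned `MCLP` directory. Passing a list rather than a single string avoids shell quoting problems when paths contain spaces.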
Requirements
- Python >= 3.10
- PyTorch >= 2.3 with CUDA
- GPU: at least 1x A100/H100 (80GB) for inference
pip install transformers==4.49.0 torchaudio librosa onnxruntime s3tokenizer diffusers hyperpyyaml numpy
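Before a long inference run it can help to confirm that the environment matches the versions listed above. The following is a minimal sketch using only the standard library. The helper names are hypothetical, it checks only the Python, PyTorch, and `transformers` versions, and it does not verify CUDA availability or GPU memory.

```python
import sys
from importlib import metadata
from typing import List, Optional, Tuple


def parse_version(text: str) -> Tuple[int, ...]:
    """Turn '2.3.1+cu121' into (2, 3, 1), dropping local and pre-release tags."""
    parts = []
    for piece in text.split("+")[0].split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
    return tuple(parts)


def installed_version(dist: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if absent."""
    try:
        return metadata.version(dist)
    except metadata.PackageNotFoundError:
        return None


def check_environment() -> List[str]:
    """Collect human-readable mismatches against the listed requirements."""
    problems = []
    if sys.version_info < (3, 10):
        problems.append("Python >= 3.10 is required")
    torch_v = installed_version("torch")
    if torch_v is None:
        problems.append("torch is not installed")
    elif parse_version(torch_v) < (2, 3):
        problems.append(f"PyTorch >= 2.3 is required, found {torch_v}")
    tf_v = installed_version("transformers")
    if tf_v != "4.49.0":
        problems.append(f"transformers==4.49.0 is pinned, found {tf_v}")
    return problems


for line in check_environment() or ["versions match the listed requirements"]:
    print(line)
```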
Related Resources
| Resource | Link |
|---|---|
| Paper | arXiv:2601.22661 |
| Inference Code | github.com/y-ren16/MCLP |
| WenetSpeech-RP Dataset | huggingface.co/datasets/y-ren16/WenetSpeech-RP |
| MCLP-Score Model | huggingface.co/y-ren16/MCLP-Score |
Citation
@inproceedings{ren2026mclp,
title={Evaluating and Rewarding LALMs for Expressive Role-Play TTS via Mean Continuation Log-Probability},
author={Ren, Yong and Li, Jingbei and Sun, Haiyang and Chen, Yujie and Yi, Cheng and Huang, Yechang and Gu, Hao and Bai, Ye and Yang, Xuerui},
booktitle={Proceedings of the 43rd International Conference on Machine Learning (ICML)},
year={2026}
}
License
This model is released under the Apache 2.0 License.
Acknowledgements
This project builds upon: