---
license: apache-2.0
language:
- zh
- en
tags:
- tts
- role-play
- speech-synthesis
- expressive-speech
- grpo
pipeline_tag: text-to-speech
---

# MCLP-RPTTS: Expressive Role-Play TTS Model

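Yong Ren<sup>*,1,2</sup>, Jingbei Li<sup>*,1</sup>, Haiyang Sun<sup>1</sup>, Yujie Chen<sup>3</sup>, Cheng Yi<sup>1</sup>, Yechang Huang<sup>1</sup>, Hao Gu<sup>2</sup>, Ye Bai<sup>2</sup>, Xuerui Yang<sup>1</sup>

<sup>1</sup>StepFun   <sup>2</sup>University of Chinese Academy of Sciences   <sup>3</sup>Beihang University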

*Equal contribution

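[📑 Paper](https://arxiv.org/abs/2601.22661)  |  [💻 Code](https://github.com/y-ren16/MCLP)  |  [📊 Dataset](https://huggingface.co/datasets/y-ren16/WenetSpeech-RP)  |  [🔢 MCLP-Score Model](https://huggingface.co/y-ren16/MCLP-Score)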

## Model Description

**MCLP-RPTTS** is a Role-Play Text-to-Speech model fine-tuned from Step-Audio-2-mini using SFT + GRPO with the MCLP (Mean Continuation Log-Probability) reward. It generates expressive speech that is stylistically consistent with role-play instructions including scene descriptions, character profiles, and dialogue history.

This model is presented in:

> **Evaluating and Rewarding LALMs for Expressive Role-Play TTS via Mean Continuation Log-Probability**
> *Yong Ren\*, Jingbei Li\*, Haiyang Sun, Yujie Chen, Cheng Yi, Yechang Huang, Hao Gu, Ye Bai, Xuerui Yang*
> ICML 2026

## Key Results

| Model | CER (%) ↓ | MCLP (w/ history) ↑ | MCLP (w/o history) ↑ | MOS ↑ |
|-------|-----------|---------------------|----------------------|-------|
| GPT-Audio | 11.974 | -4.849 | -4.836 | 1.752 |
| MiMo-Audio-7B | 10.605 | -4.753 | -4.745 | 2.471 |
| Step-Audio-2-mini | 3.276 | -4.829 | -4.823 | 1.707 |
| **MCLP-RPTTS (Ours)** | **1.130** | **-4.636** | **-4.687** | **3.646** |

## Usage

```bash
# Clone the inference code
git clone https://github.com/y-ren16/MCLP.git
cd MCLP

# Run role-play TTS inference
python generate_roleplay_stepaudio2_multigpu.py \
    --model-path /path/to/MCLP-RPTTS \
    --input-jsonl /path/to/WenetSpeech-RP/eval/eval_w_history.jsonl \
    --output-dir ./outputs/roleplay_tts \
    --audio-base /path/to/extracted_test_audio \
    --prompt-base /path/to/WenetSpeech-RP/eval/audio \
    --gpus 1
```

For detailed usage instructions, please refer to the [code repository](https://github.com/y-ren16/MCLP).

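For scripted or repeated runs, the inference command above can be assembled from Python instead of typed by hand. The following is a minimal sketch, not part of the repository: the helper names `build_inference_command` and `run_inference` are hypothetical, and the only flags used are the ones shown in the command above. The `--gpus` value is passed through unchanged, since its exact semantics are defined by the script itself.

```python
import shlex
import subprocess


def build_inference_command(model_path, input_jsonl, output_dir,
                            audio_base, prompt_base, gpus="1"):
    """Return the argv list for the role-play TTS inference script.

    Hypothetical helper: it only mirrors the flags shown in the Usage
    section and does not validate paths or inspect the script.
    """
    return [
        "python", "generate_roleplay_stepaudio2_multigpu.py",
        "--model-path", str(model_path),
        "--input-jsonl", str(input_jsonl),
        "--output-dir", str(output_dir),
        "--audio-base", str(audio_base),
        "--prompt-base", str(prompt_base),
        "--gpus", str(gpus),
    ]


def run_inference(cmd, repo_dir="MCLP"):
    """Launch the command from the cloned repository directory.

    Assumes the repository was cloned into `repo_dir`; raises
    CalledProcessError if the script exits with a non-zero status.
    """
    subprocess.run(cmd, cwd=repo_dir, check=True)


if __name__ == "__main__":
    cmd = build_inference_command(
        model_path="/path/to/MCLP-RPTTS",
        input_jsonl="/path/to/WenetSpeech-RP/eval/eval_w_history.jsonl",
        output_dir="./outputs/roleplay_tts",
        audio_base="/path/to/extracted_test_audio",
        prompt_base="/path/to/WenetSpeech-RP/eval/audio",
    )
    # Print the shell-quoted command for inspection before launching it.
    print(shlex.join(cmd))
```

Passing the arguments as a list (rather than one shell string) avoids quoting problems when paths contain spaces. Call `run_inference(cmd)` once the printed command looks right.
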
## Requirements

- Python >= 3.10
- PyTorch >= 2.3 with CUDA
- GPU: at least 1x A100/H100 (80 GB) for inference

```bash
pip install transformers==4.49.0 torchaudio librosa onnxruntime s3tokenizer diffusers hyperpyyaml numpy
```

## Related Resources

| Resource | Link |
|----------|------|
| 📑 Paper | [arXiv:2601.22661](https://arxiv.org/abs/2601.22661) |
| 💻 Inference Code | [github.com/y-ren16/MCLP](https://github.com/y-ren16/MCLP) |
| 📊 WenetSpeech-RP Dataset | [huggingface.co/datasets/y-ren16/WenetSpeech-RP](https://huggingface.co/datasets/y-ren16/WenetSpeech-RP) |
| 🔢 MCLP-Score Model | [huggingface.co/y-ren16/MCLP-Score](https://huggingface.co/y-ren16/MCLP-Score) |

## Citation

```bibtex
@inproceedings{ren2026mclp,
  title={Evaluating and Rewarding LALMs for Expressive Role-Play TTS via Mean Continuation Log-Probability},
  author={Ren, Yong and Li, Jingbei and Sun, Haiyang and Chen, Yujie and Yi, Cheng and Huang, Yechang and Gu, Hao and Bai, Ye and Yang, Xuerui},
  booktitle={Proceedings of the 43rd International Conference on Machine Learning (ICML)},
  year={2026}
}
```

## License

This model is released under the [Apache 2.0 License](LICENSE).

## Acknowledgements

This project builds upon:

- [Step-Audio 2](https://github.com/stepfun-ai/Step-Audio2)
- [CosyVoice](https://github.com/FunAudioLLM/CosyVoice)
- [FlashCosyVoice](https://github.com/xingchensong/FlashCosyVoice)