MCLP-RPTTS: Expressive Role-Play TTS Model

Yong Ren*¹,², Jingbei Li*¹, Haiyang Sun¹, Yujie Chen³, Cheng Yi¹, Yechang Huang¹, Hao Gu², Ye Bai², Xuerui Yang¹

¹StepFun ²University of Chinese Academy of Sciences ³Beihang University

*Equal contribution

📑 Paper | 💻 Code | 📊 Dataset | 🔢 MCLP-Score Model

Model Description

MCLP-RPTTS is a role-play text-to-speech (TTS) model fine-tuned from Step-Audio-2-mini using supervised fine-tuning (SFT) and Group Relative Policy Optimization (GRPO), with Mean Continuation Log-Probability (MCLP) as the reward. It generates expressive speech that is stylistically consistent with role-play instructions, including scene descriptions, character profiles, and dialogue history.
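To make the reward concrete, here is a minimal Python sketch of what the names suggest: a mean of per-token log-probabilities over a continuation, followed by a GRPO-style standardization of rewards within one group of sampled candidates. It is an illustration under assumptions, not the authors' implementation. The function names, the candidate values, and the choice of population standard deviation are all hypothetical; see the paper for the exact definition of MCLP and of the conditioning context.

```python
from statistics import mean, pstdev


def mean_continuation_log_prob(token_log_probs):
    """Average the per-token log-probabilities of one continuation."""
    if not token_log_probs:
        raise ValueError("continuation must contain at least one token")
    return sum(token_log_probs) / len(token_log_probs)


def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one sampled group (GRPO-style).

    Assumption: population standard deviation; implementations differ
    on this detail.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical per-token log-probs for three candidates sampled for the
# same prompt. In practice these would come from a scoring model.
candidates = [
    [-4.9, -4.7, -4.8],
    [-4.5, -4.6, -4.7],
    [-5.0, -5.1, -4.9],
]

rewards = [mean_continuation_log_prob(c) for c in candidates]
advantages = group_relative_advantages(rewards)

for reward, advantage in zip(rewards, advantages):
    print(f"reward={reward:.3f}  advantage={advantage:+.3f}")
```

Candidates whose continuation is more probable than the group average receive a positive advantage, and the rest a negative one, so the policy update depends only on relative quality within the group.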

This model is presented in:

Evaluating and Rewarding LALMs for Expressive Role-Play TTS via Mean Continuation Log-Probability
Yong Ren*, Jingbei Li*, Haiyang Sun, Yujie Chen, Cheng Yi, Yechang Huang, Hao Gu, Ye Bai, Xuerui Yang
ICML 2026

Key Results

Model	CER (%) ↓	MCLP (w/ history) ↑	MCLP (w/o history) ↑	MOS ↑
GPT-Audio	11.974	-4.849	-4.836	1.752
MiMo-Audio-7B	10.605	-4.753	-4.745	2.471
Step-Audio-2-mini	3.276	-4.829	-4.823	1.707
MCLP-RPTTS (Ours)	1.130	-4.636	-4.687	3.646
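As a reading aid for the table, the short Python sketch below derives two quantities from the Step-Audio-2-mini and MCLP-RPTTS rows: the relative CER reduction (about 65.5%) and the gain in the with-history MCLP column. Treating the MCLP gain as a ratio of geometric-mean per-token probabilities assumes the scores are natural-log values, which this card does not state. The variable names are illustrative.

```python
import math

# Values copied from the Key Results table.
base = {"cer": 3.276, "mclp_hist": -4.829}  # Step-Audio-2-mini
ours = {"cer": 1.130, "mclp_hist": -4.636}  # MCLP-RPTTS

# Fraction of the base model's character errors that are eliminated.
relative_cer_reduction = (base["cer"] - ours["cer"]) / base["cer"]

# Difference in mean log-probability between the two models.
mclp_gain = ours["mclp_hist"] - base["mclp_hist"]

# Assumption: natural logarithms. Under that assumption, exp(gain) is the
# ratio of geometric-mean per-token probabilities.
likelihood_ratio = math.exp(mclp_gain)

print(f"Relative CER reduction: {relative_cer_reduction:.1%}")
print(f"MCLP gain (with history): {mclp_gain:.3f}")
print(f"Per-token likelihood ratio: {likelihood_ratio:.3f}")
```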

Usage

# Clone the inference code
git clone https://github.com/y-ren16/MCLP.git
cd MCLP

# Run role-play TTS inference
python generate_roleplay_stepaudio2_multigpu.py \
    --model-path /path/to/MCLP-RPTTS \
    --input-jsonl /path/to/WenetSpeech-RP/eval/eval_w_history.jsonl \
    --output-dir ./outputs/roleplay_tts \
    --audio-base /path/to/extracted_test_audio \
    --prompt-base /path/to/WenetSpeech-RP/eval/audio \
    --gpus 1

For detailed usage instructions, please refer to the code repository.
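Before launching inference, it can help to look at what the file passed to --input-jsonl contains. The Python sketch below reads the first records of a JSON Lines file and lists their keys. It makes no assumption about the WenetSpeech-RP schema: `peek_jsonl` is a hypothetical helper, and the demo runs on a throwaway file with placeholder keys rather than on the real evaluation data.

```python
import json
import tempfile
from itertools import islice
from pathlib import Path


def peek_jsonl(path, n=1):
    """Return the first n non-empty records of a JSON Lines file."""
    with open(path, encoding="utf-8") as f:
        lines = (line for line in f if line.strip())
        return [json.loads(line) for line in islice(lines, n)]


# Demo on a temporary file. The keys below are placeholders, not the
# actual fields of eval_w_history.jsonl.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "demo.jsonl"
    demo_path.write_text(
        '{"placeholder_key": "placeholder_value"}\n'
        '{"placeholder_key": "second"}\n',
        encoding="utf-8",
    )
    first = peek_jsonl(demo_path)
    print(sorted(first[0].keys()))
```

Pointing `peek_jsonl` at the real evaluation file shows which fields each record carries before a full run is started.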

Requirements

Python >= 3.10
PyTorch >= 2.3 with CUDA
GPU: at least 1x A100/H100 (80GB) for inference

pip install transformers==4.49.0 torchaudio librosa onnxruntime s3tokenizer diffusers hyperpyyaml numpy
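The requirements above can be checked programmatically before downloading the weights. The following Python sketch compares the interpreter and installed package versions against the stated minimums. `version_tuple` and `check_environment` are hypothetical helpers written for this illustration, and the check does not verify GPU model or memory.

```python
import sys
from importlib import metadata
from itertools import takewhile


def version_tuple(version):
    """Parse the leading 'major.minor' of a version such as '2.3.0+cu121'."""
    parts = []
    for piece in version.split("+")[0].split(".")[:2]:
        digits = "".join(takewhile(str.isdigit, piece))
        parts.append(int(digits) if digits else 0)
    return tuple(parts)


def check_environment():
    """Report whether the stated requirements appear to be satisfied."""
    report = {"python_ok": sys.version_info >= (3, 10)}

    try:
        report["torch_ok"] = version_tuple(metadata.version("torch")) >= (2, 3)
    except metadata.PackageNotFoundError:
        report["torch_ok"] = False

    try:
        report["transformers_pinned"] = metadata.version("transformers") == "4.49.0"
    except metadata.PackageNotFoundError:
        report["transformers_pinned"] = False

    # Importing torch is optional here; a missing or broken install simply
    # reports CUDA as unavailable.
    try:
        import torch
        report["cuda_ok"] = bool(torch.cuda.is_available())
    except (ImportError, OSError):
        report["cuda_ok"] = False

    return report


print(check_environment())
```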

Related Resources

Resource	Link
📑 Paper	arXiv:2601.22661
💻 Inference Code	github.com/y-ren16/MCLP
📊 WenetSpeech-RP Dataset	huggingface.co/datasets/y-ren16/WenetSpeech-RP
🔢 MCLP-Score Model	huggingface.co/y-ren16/MCLP-Score

Citation

@inproceedings{ren2026mclp,
  title={Evaluating and Rewarding LALMs for Expressive Role-Play TTS via Mean Continuation Log-Probability},
  author={Ren, Yong and Li, Jingbei and Sun, Haiyang and Chen, Yujie and Yi, Cheng and Huang, Yechang and Gu, Hao and Bai, Ye and Yang, Xuerui},
  booktitle={Proceedings of the 43rd International Conference on Machine Learning (ICML)},
  year={2026}
}

License

This model is released under the Apache 2.0 License.

Acknowledgements

This project builds upon Step-Audio-2-mini, the base model from which MCLP-RPTTS is fine-tuned.

Model size: 8B params (Safetensors)
Tensor type: BF16

Paper: Evaluating and Rewarding LALMs for Expressive Role-Play TTS via Mean Continuation Log-Probability (arXiv:2601.22661, published Jan 30)