	---
	license: apache-2.0
	language:
	- zh
	- en
	tags:
	- tts
	- role-play
	- speech-synthesis
	- expressive-speech
	- grpo
	pipeline_tag: text-to-speech
	---

	<h1 align="center">
	MCLP-RPTTS: Expressive Role-Play TTS Model
	</h1>

	<p align="center">
	Yong Ren<sup>*,1,2</sup>, Jingbei Li<sup>*,1</sup>, Haiyang Sun<sup>1</sup>, Yujie Chen<sup>3</sup>, Cheng Yi<sup>1</sup>, Yechang Huang<sup>1</sup>, Hao Gu<sup>2</sup>, Ye Bai<sup>2</sup>, Xuerui Yang<sup>1</sup>
	</p>

	<p align="center">
	<sup>1</sup>StepFun   <sup>2</sup>University of Chinese Academy of Sciences   <sup>3</sup>Beihang University
	</p>

	<p align="center">
	<sup>*</sup>Equal contribution
	</p>

	<p align="center">
	📑 <a href="https://arxiv.org/abs/2601.22661">Paper</a>  |
	💻 <a href="https://github.com/y-ren16/MCLP">Code</a>  |
	📊 <a href="https://huggingface.co/datasets/y-ren16/WenetSpeech-RP">Dataset</a>  |
	🔢 <a href="https://huggingface.co/y-ren16/MCLP-Score">MCLP-Score Model</a>
	</p>

	## Model Description

	MCLP-RPTTS is a role-play text-to-speech (TTS) model fine-tuned from Step-Audio-2-mini with supervised fine-tuning (SFT) and Group Relative Policy Optimization (GRPO), using Mean Continuation Log-Probability (MCLP) as the reward. It generates expressive speech whose style is consistent with the role-play instructions, which include scene descriptions, character profiles, and dialogue history.

	This model is presented in:

	> Evaluating and Rewarding LALMs for Expressive Role-Play TTS via Mean Continuation Log-Probability
	> Yong Ren\*, Jingbei Li\*, Haiyang Sun, Yujie Chen, Cheng Yi, Yechang Huang, Hao Gu, Ye Bai, Xuerui Yang
	> ICML 2026
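
To make the reward in the description above more concrete, here is a minimal sketch that goes only by the names involved. It assumes that MCLP averages the log-probabilities a scoring model assigns to the tokens of a continuation, and that GRPO standardizes rewards within a group of sampled outputs, as commonly described. The function names, the toy logits, and the choice of what the scorer conditions on are illustrative assumptions. The paper and the code repository give the actual definition.

```python
import math
from statistics import mean, pstdev


def log_softmax(logits):
    """Numerically stable log-softmax over one list of logits."""
    peak = max(logits)
    log_norm = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return [x - log_norm for x in logits]


def mean_continuation_log_prob(step_logits, continuation_ids):
    """Average log-probability assigned to each continuation token.

    step_logits[t] holds the (hypothetical) scorer's logits at step t and
    continuation_ids[t] is the token observed at that step.
    """
    if len(step_logits) != len(continuation_ids) or not continuation_ids:
        raise ValueError("need one logit vector per continuation token")
    token_log_probs = [
        log_softmax(logits)[token_id]
        for logits, token_id in zip(step_logits, continuation_ids)
    ]
    return mean(token_log_probs)


def group_relative_advantages(rewards, eps=1e-6):
    """GRPO-style advantages: standardize rewards within one sampled group."""
    center = mean(rewards)
    spread = pstdev(rewards)
    return [(r - center) / (spread + eps) for r in rewards]


# Toy example: a 2-token vocabulary with uniform logits at both steps,
# so every continuation token has probability 0.5.
uniform_score = mean_continuation_log_prob([[0.0, 0.0], [0.0, 0.0]], [0, 1])
print(round(uniform_score, 4))  # -0.6931, i.e. log(0.5)

# Four made-up rewards for one group of sampled outputs.
advantages = group_relative_advantages([-4.9, -4.7, -4.6, -4.8])
print([round(a, 3) for a in advantages])
```

Because the score is a mean of log-probabilities, it is always at most zero, and values closer to zero indicate a continuation the scorer finds more likely. This matches the "higher is better" direction in the results table.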

	## Key Results

	| Model | CER (%) ↓ | MCLP (W. History) ↑ | MCLP (W/O. History) ↑ | MOS ↑ |
	|-------|-----------|---------------------|----------------------|-------|
	| GPT-Audio | 11.974 | -4.849 | -4.836 | 1.752 |
	| MiMo-Audio-7B | 10.605 | -4.753 | -4.745 | 2.471 |
	| Step-Audio-2-mini | 3.276 | -4.829 | -4.823 | 1.707 |
	| MCLP-RPTTS (Ours) | 1.130 | -4.636 | -4.687 | 3.646 |
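
For a quick reading of the table, the short calculation below compares MCLP-RPTTS with its base model, Step-Audio-2-mini. It uses only the CER and MOS values listed above. The variable names are illustrative.

```python
# Values copied from the results table: Step-Audio-2-mini vs. MCLP-RPTTS.
base = {"cer": 3.276, "mos": 1.707}
ours = {"cer": 1.130, "mos": 3.646}

# Relative reduction in character error rate (lower CER is better).
relative_cer_reduction = (base["cer"] - ours["cer"]) / base["cer"]

# Absolute gain in mean opinion score (higher MOS is better).
mos_gain = ours["mos"] - base["mos"]

print(f"Relative CER reduction: {relative_cer_reduction:.1%}")  # 65.5%
print(f"Absolute MOS gain: {mos_gain:.3f}")  # 1.939
```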

	## Usage

	```bash
	# Clone the inference code
	git clone https://github.com/y-ren16/MCLP.git
	cd MCLP

	# Run role-play TTS inference
	python generate_roleplay_stepaudio2_multigpu.py \
	--model-path /path/to/MCLP-RPTTS \
	--input-jsonl /path/to/WenetSpeech-RP/eval/eval_w_history.jsonl \
	--output-dir ./outputs/roleplay_tts \
	--audio-base /path/to/extracted_test_audio \
	--prompt-base /path/to/WenetSpeech-RP/eval/audio \
	--gpus 1
	```

	For detailed usage instructions, please refer to the [code repository](https://github.com/y-ren16/MCLP).
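
Before starting a long inference run, it can help to confirm that the file passed to `--input-jsonl` parses cleanly. The sketch below assumes only that the file holds one JSON object per line. It reports whichever top-level keys it finds and does not assume any particular schema. `summarize_jsonl` is a hypothetical helper, not part of the MCLP repository.

```python
import json
from collections import Counter
from pathlib import Path


def summarize_jsonl(path, max_records=100):
    """Count how often each top-level key appears in the first records."""
    key_counts = Counter()
    n_records = 0
    with Path(path).open(encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue  # tolerate blank lines between records
            record = json.loads(line)  # raises ValueError on malformed JSON
            if isinstance(record, dict):
                key_counts.update(record.keys())
            n_records += 1
            if n_records >= max_records:
                break
    return n_records, key_counts
```

Calling `summarize_jsonl("/path/to/WenetSpeech-RP/eval/eval_w_history.jsonl")` returns the number of records read and a `Counter` of key names. A key that appears fewer times than there are records points to entries with missing fields.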

	## Requirements

	- Python >= 3.10
	- PyTorch >= 2.3 with CUDA
	- GPU: at least one A100 or H100 (80 GB) for inference

	```bash
	pip install transformers==4.49.0 torchaudio librosa onnxruntime s3tokenizer diffusers hyperpyyaml numpy
	```
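
The requirements above can be checked with a short script before loading the model. This is a sketch that assumes the distribution names match the `pip install` line. Note that `onnxruntime` may be installed under a different distribution name, such as `onnxruntime-gpu`, in which case it will be reported as missing. The script checks that packages are present, not that CUDA is available.

```python
import sys
from importlib import metadata

# Distribution names taken from the pip command above, plus torch.
REQUIRED_DISTRIBUTIONS = [
    "torch", "transformers", "torchaudio", "librosa", "onnxruntime",
    "s3tokenizer", "diffusers", "hyperpyyaml", "numpy",
]


def python_is_supported(minimum=(3, 10)):
    """Return True if the running interpreter meets the minimum version."""
    return sys.version_info >= minimum


def installed_versions(distributions):
    """Map each distribution name to its installed version, or None."""
    found = {}
    for name in distributions:
        try:
            found[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            found[name] = None
    return found


print("Python >= 3.10:", python_is_supported())
for name, version in installed_versions(REQUIRED_DISTRIBUTIONS).items():
    print(f"{name}: {version or 'MISSING'}")
```

The card pins `transformers==4.49.0`, so a different version in the output is worth correcting before running inference.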

	## Related Resources

	| Resource | Link |
	|----------|------|
	| 📑 Paper | [arXiv:2601.22661](https://arxiv.org/abs/2601.22661) |
	| 💻 Inference Code | [github.com/y-ren16/MCLP](https://github.com/y-ren16/MCLP) |
	| 📊 WenetSpeech-RP Dataset | [huggingface.co/datasets/y-ren16/WenetSpeech-RP](https://huggingface.co/datasets/y-ren16/WenetSpeech-RP) |
	| 🔢 MCLP-Score Model | [huggingface.co/y-ren16/MCLP-Score](https://huggingface.co/y-ren16/MCLP-Score) |

	## Citation

	```bibtex
	@inproceedings{ren2026mclp,
	title={Evaluating and Rewarding LALMs for Expressive Role-Play TTS via Mean Continuation Log-Probability},
	author={Ren, Yong and Li, Jingbei and Sun, Haiyang and Chen, Yujie and Yi, Cheng and Huang, Yechang and Gu, Hao and Bai, Ye and Yang, Xuerui},
	booktitle={Proceedings of the 43rd International Conference on Machine Learning (ICML)},
	year={2026}
	}
	```

	## License

	This model is released under the [Apache 2.0 License](LICENSE).

	## Acknowledgements

	This project builds upon:
	- [Step-Audio 2](https://github.com/stepfun-ai/Step-Audio2)
	- [CosyVoice](https://github.com/FunAudioLLM/CosyVoice)
	- [FlashCosyVoice](https://github.com/xingchensong/FlashCosyVoice)