---
license: apache-2.0
language:
- zh
- en
tags:
- tts
- speech-evaluation
- continuation-score
- role-play
- reward-model
pipeline_tag: text-to-speech
---
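Yong Ren<sup>\*,1,2</sup>, Jingbei Li<sup>\*,1</sup>, Haiyang Sun<sup>1</sup>, Yujie Chen<sup>3</sup>, Cheng Yi<sup>1</sup>, Yechang Huang<sup>1</sup>, Hao Gu<sup>2</sup>, Ye Bai<sup>2</sup>, Xuerui Yang<sup>1</sup>

<sup>1</sup>StepFun, <sup>2</sup>University of Chinese Academy of Sciences, <sup>3</sup>Beihang University

<sup>\*</sup>Equal contribution

📑 [Paper](https://arxiv.org/abs/2601.22661) | 💻 [Code](https://github.com/y-ren16/MCLP) | 📊 [Dataset](https://huggingface.co/datasets/y-ren16/WenetSpeech-RP) | 🗣️ [MCLP-RPTTS Model](https://huggingface.co/y-ren16/MCLP-RPTTS)
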
## Model Description

**MCLP-Score** is the Continuation Score model used to compute the **MCLP (Mean Continuation Log-Probability)** metric. Given a ground-truth audio prefix, the model evaluates how well a generated audio segment continues the stylistic pattern of that prefix, producing a log-probability score that measures expressive consistency.

The MCLP metric serves as both:

1. **An evaluation metric** for role-play TTS quality (correlation with human MOS: Spearman ρ = 0.94)
2. **A reward signal** for GRPO-based reinforcement learning to improve TTS expressiveness

This model is presented in:

> **Evaluating and Rewarding LALMs for Expressive Role-Play TTS via Mean Continuation Log-Probability**
>
> *Yong Ren\*, Jingbei Li\*, Haiyang Sun, Yujie Chen, Cheng Yi, Yechang Huang, Hao Gu, Ye Bai, Xuerui Yang*
>
> ICML 2026

## How MCLP Works

The MCLP metric computes the mean log-probability of the audio tokens in the **generated segment**, conditioned on a **ground-truth audio prefix**:

```
MCLP = (1/N) * Σ log P(token_i | gt_prefix, token_1, ..., token_{i-1})
```

Here `N` is the number of audio tokens in the generated segment and the sum runs over `i = 1, ..., N`. Because each term is a log-probability, MCLP is never positive. Higher scores (closer to zero) indicate better stylistic consistency with the ground-truth speaking style.

## Usage

```bash
# Clone the inference code
git clone https://github.com/y-ren16/MCLP.git
cd MCLP

# Compute MCLP scores
python compute_contination_score.py \
    --model-path /path/to/MCLP-Score \
    --audio-dir ./outputs/roleplay_tts \
    --gt-jsonl /path/to/WenetSpeech-RP/eval/eval_w_history.jsonl \
    --gt-dir /path/to/WenetSpeech-RP/eval/audio \
    --save-json mclp_results.json
```

**Output:**

```
MCLP (Mean avg_log_prob): -4.636xxx
Mean avg_prob: 0.xxxxx
Mean avg_rank: xx.xx
```

For detailed usage instructions, please refer to the [code repository](https://github.com/y-ren16/MCLP).

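To make the formula in "How MCLP Works" concrete, the following is a minimal sketch of how the three reported quantities could be derived from per-token logits. It is not the repository's implementation. It assumes the scoring model has already produced one logit vector per generated audio token, with each vector conditioned on the ground-truth prefix and the preceding generated tokens. The function names `log_softmax` and `continuation_stats` are hypothetical, and the definitions of `avg_prob` (mean of per-token probabilities) and `avg_rank` (mean 1-based rank of the target token within the vocabulary) are assumptions inferred from the output labels.

```python
import math
from typing import Sequence


def log_softmax(logits: Sequence[float]) -> list[float]:
    """Numerically stable log-softmax over one vocabulary-sized logit vector."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def continuation_stats(
    step_logits: Sequence[Sequence[float]],
    target_ids: Sequence[int],
) -> tuple[float, float, float]:
    """Return (avg_log_prob, avg_prob, avg_rank) for a generated segment.

    step_logits[i] is assumed to be the logit vector for generated token i,
    conditioned on the ground-truth prefix and generated tokens before i.
    target_ids[i] is the audio token that actually occurs at position i.
    """
    if not target_ids or len(step_logits) != len(target_ids):
        raise ValueError("need one logit vector per target token, and at least one token")

    log_probs = []
    ranks = []
    for logits, tok in zip(step_logits, target_ids):
        log_probs.append(log_softmax(logits)[tok])
        # Assumed rank definition: 1 + number of tokens scored strictly higher.
        ranks.append(1 + sum(1 for x in logits if x > logits[tok]))

    n = len(target_ids)
    avg_log_prob = sum(log_probs) / n  # the MCLP formula for this segment
    avg_prob = sum(math.exp(lp) for lp in log_probs) / n
    avg_rank = sum(ranks) / n
    return avg_log_prob, avg_prob, avg_rank


if __name__ == "__main__":
    # Toy example: 3 generated tokens over a 4-token vocabulary.
    toy_logits = [[2.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0], [0.0, 0.0, 0.0, 0.0]]
    toy_targets = [0, 1, 3]
    mclp, prob, rank = continuation_stats(toy_logits, toy_targets)
    print(f"avg_log_prob={mclp:.4f} avg_prob={prob:.4f} avg_rank={rank:.2f}")
```

The dataset-level numbers printed by the script are labelled as means of these per-utterance values. How utterances are weighted there is not stated in this card, so check the code repository before comparing results.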
## Requirements

- Python >= 3.10
- PyTorch >= 2.3 with CUDA
- GPU: at least 1x A100/H100 (80GB) for inference

```bash
pip install transformers==4.49.0 torchaudio librosa onnxruntime s3tokenizer diffusers hyperpyyaml numpy
```

## Related Resources

| Resource | Link |
|----------|------|
| 📑 Paper | [arXiv:2601.22661](https://arxiv.org/abs/2601.22661) |
| 💻 Inference Code | [github.com/y-ren16/MCLP](https://github.com/y-ren16/MCLP) |
| 📊 WenetSpeech-RP Dataset | [huggingface.co/datasets/y-ren16/WenetSpeech-RP](https://huggingface.co/datasets/y-ren16/WenetSpeech-RP) |
| 🗣️ MCLP-RPTTS Model | [huggingface.co/y-ren16/MCLP-RPTTS](https://huggingface.co/y-ren16/MCLP-RPTTS) |

## Citation

```bibtex
@inproceedings{ren2026mclp,
  title={Evaluating and Rewarding LALMs for Expressive Role-Play TTS via Mean Continuation Log-Probability},
  author={Ren, Yong and Li, Jingbei and Sun, Haiyang and Chen, Yujie and Yi, Cheng and Huang, Yechang and Gu, Hao and Bai, Ye and Yang, Xuerui},
  booktitle={Proceedings of the 43rd International Conference on Machine Learning (ICML)},
  year={2026}
}
```

## License

This model is released under the [Apache 2.0 License](LICENSE).

## Acknowledgements

This project builds upon:

- [Step-Audio 2](https://github.com/stepfun-ai/Step-Audio2)
- [CosyVoice](https://github.com/FunAudioLLM/CosyVoice)
- [FlashCosyVoice](https://github.com/xingchensong/FlashCosyVoice)